Friday, 4 August 2017

Graph Databases and Neural Nets IV

Writing a query engine for a database seems pretty simple on the surface.  However, when you really dig into it, it sucks and you start to question why you are writing a query engine.  I mean, don't you have better things to do?  Don't you have a family that is wondering why you are locked away in the attic in the dark, mumbling to yourself, occasionally screaming, and more occasionally throwing and breaking things?  Whatever.

The basic problem here is writing a query engine that allows for a wide array of conditions.  Now, for my engine, I wanted to provide, at a bare minimum, the ability to query exact values, >, <, >=, <=, IN, NOT IN and KEY EXISTS.  Sounds pretty simple, right?  How hard could that be?  Well, doing them individually is actually very, very simple.  But when you have complex queries against multiple keys and also add in the ability to use AND and OR for combinations of conditions, it starts to get pretty ugly.
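
To give a flavor of what "combinable" means, here is a minimal sketch of one way to represent and evaluate conditions.  The query structure and names here are illustrative, not my engine's actual format:

    # Sketch only: the real engine's query format differs.
    # '==' does a string compare; the numeric ops assume numeric values.
    my %ops = (
        '=='     => sub { $_[0] eq $_[1] },
        '>'      => sub { $_[0] >  $_[1] },
        '<'      => sub { $_[0] <  $_[1] },
        '>='     => sub { $_[0] >= $_[1] },
        '<='     => sub { $_[0] <= $_[1] },
        'IN'     => sub { my ($v, $set) = @_;  grep { $_ eq $v } @$set },
        'NOT IN' => sub { my ($v, $set) = @_; !grep { $_ eq $v } @$set },
    );

    sub match {
        my ($obj, $query) = @_;    # { logic => 'AND'|'OR', conds => [...] }
        my @results = map {
            my ($key, $op, $val) = @$_;
            $op eq 'KEY EXISTS'
                ? exists $obj->{$key}
                : (exists $obj->{$key} && $ops{$op}->($obj->{$key}, $val));
        } @{ $query->{conds} };
        return $query->{logic} eq 'AND' ? !grep { !$_ } @results
                                        : !!grep {  $_ } @results;
    }

    # e.g. username == 'mark' AND age >= 21
    my $q = { logic => 'AND',
              conds => [ [ username => '==', 'mark' ], [ age => '>=', 21 ] ] };

Note that the logic flag in this sketch applies to the whole condition list at once, which matches the current limitation of AND and OR only working globally for a query.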

I have done it, of course, and it works, but it took two tries to get it working, and now it is a horrible mess that needs to be completely rewritten and optimized.  That is going to lead to errors and screaming and throwing and breaking, but I will get it working and working well.

Once that is working, I will move on to things like DATES, REGULAR EXPRESSIONS, and then the ability to embed function calls within the queries (WHY?  BECAUSE).

This is a slow process.  It is slow because I am only allowing myself to take small steps between tests.  Tests take time and patience (and an understanding of what the hell needs to be tested, which I often do not entirely have.)  So, baby steps.  This approach has led to fewer bugs overall.  However, it goes against my natural inclination to just barrel through things to "get it done."  Frustrating.  Very frustrating, but it is paying off.

So, here is what is working now:
  1. Creation of networks, nodes, connections and neurons
  2. Modification of networks, nodes, connections and neurons
  3. Ability to find connections for any object (node, connection, neuron)
    1. ==, >, <, >=, <=, IN, NOT IN, KEY EXISTS queries working and combinable
    2. AND and OR allowed but only globally for a query
  4. Indexes for object names, types and keys
  5. Neuron execution (Recursive, traverses network, can be slow.)
  6. Path analysis. (This is pretty slow for complex networks if depth > 2)
  7. Balance analysis (relationship between incoming and outgoing connections, a value between -1 and 1; see the sketch just after this list)
  8. Common connection queries (given a group of objects, what connections are common to them?)  This allows for strict or fuzzy matching and is really just a way of clustering data.
  9. Deep connections.  A depth limited (configurable) search for connections to a given object
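
About item 7: the balance number is simpler than it sounds.  A minimal sketch, assuming plain unweighted counts of incoming and outgoing connections (the real calculation may weight them):

    # balance = (in - out) / (in + out)
    # 1 = all incoming, -1 = all outgoing, 0 = even
    sub balance {
        my ($in, $out) = @_;
        my $total = $in + $out;
        return 0 unless $total;    # isolated object: call it even
        return ($in - $out) / $total;
    }
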
It isn't really a ton of code.  There are four files (soon to be more when I break out the query stuff into its own library.)  The code is clean except for the query stuff.  Variable naming is pretty solid.  I follow a set of patterns to keep it from becoming an unreadable nightmare.  So, in pretty good shape so far, but there is so much more to do.  I'm probably less than 30% done with it.



Saturday, 22 July 2017

Graph Databases and Neural Nets III

So, after a lot of pain and suffering, I have worked out most of the issues with this system.  The tricky part, of course, was neuron execution.  Here is the problem.  If I execute a neuron's code in this system, that code takes the object the neuron is attached to and an options object as arguments.  The code then returns a potentially modified options object.  I could go into a long explanation for this, but I won't.  Suffice it to say, I wanted to maintain some sort of state for the results so that as more and more neurons fired, they would get this options object and be able to make decisions based on its values.  Now, when a neuron fires, it basically runs its own code, then it looks through all of its connections for other neurons to send the data to so they can fire, and so on.  Now, I know you are saying, "Couldn't that go on forever if there are loops in the network?"  Yes, it could, but the system allows a max depth to be set.  Anyway, the neuron essentially finds other neurons and those run, but they run in series, and because options potentially changes each time a neuron fires, as things stand, I do not pass the same options each time I send to another neuron.  What does that mean?
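
To make that concrete, here is a minimal sketch of the firing loop as just described.  The names (fire, {code}, {connections}) are illustrative, not the module's actual API:

    # Serial, depth-limited neuron firing.  Each neuron's code gets the
    # attached object and the options hashref, and returns (a possibly
    # modified) options.
    sub fire {
        my ($neuron, $options, $depth, $max_depth) = @_;
        return $options if $depth > $max_depth;    # stops runaway loops

        $options = $neuron->{code}->($neuron->{object}, $options);

        # Pass the now-modified options to each connected neuron in series.
        for my $next (@{ $neuron->{connections} }) {
            $options = fire($next, $options, $depth + 1, $max_depth);
        }
        return $options;
    }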

Let's say I execute neuron 1.  Neuron 1 is connected to neuron 2 and neuron 3.  So, now, I pass the modified options object from the neuron 1 fire to neuron 2, and it modifies the object; for now, let's say it isn't connected to anything else, so we drop down to neuron 3, which gets the options object as modified by neuron 2.  Now, what if I wanted to pass the same options object that came from the neuron 1 fire to both neuron 2 and neuron 3?  I might want to do that, right?  In fact, I might want to do that more often than not, depending on the type of data I am dealing with.  So, I think I just have to make this an option (the default option.)  I will get around to it eventually.
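
If broadcasting the parent's result to every child became the default, the loop in the sketch above would change to something like this (dclone is from the core Storable module; the rest is still hypothetical):

    use Storable qw(dclone);

    # Variant: every child gets a copy of options as they were when the
    # parent finished, instead of inheriting each sibling's changes.
    for my $next (@{ $neuron->{connections} }) {
        my $branch_options = dclone($options);    # deep copy per branch
        fire($next, $branch_options, $depth + 1, $max_depth);
        # ...the per-branch results would then need merging at the end
    }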

Another issue I have now is that I don't have a formal output process for neurons.  So, right now, they just follow connections until they run out of connections or hit their max depth.  What happens after that?  Well, technically, the options object is returned at the end of the run, so you can do what you want with that.  However, it seems that there is a case here for exit procedures of some sort.  I fire a neuron and this cascades to some point and then I do SOMETHING with the options object at the end of the run.  Now, that can just be baked into whatever code you write that uses this module, or it could be something that is a part of the module itself.  I don't know.  Honestly, I am beginning to believe that I have gone completely off the rails.  I am questioning everything.  Where have I gone wrong?  I was happy once.  I had friends (not really.)  I believed I could do anything (mania.)  But now, I am lost.

Whatever...

The graph database part of this has reached a point where the only thing left to do is allow advanced key queries.  For example, queries like "node->{key}->{some key} > 10" or "node->{key}->{some key} in ['a','b','c']".  Well, sub-queries would also be nice, but then you get into nasty recursion and I really don't have the patience for that right now.  I'll leave that to the other poor suckers I've dragged into this nightmare.  Yeah, I said it.  Nightmare.  This is the worst thing that has happened to me...EVER.  Not really.  It is the second worst thing.  The worst thing was when I got a cluster of ticks on a rather sensitive part of my body while on a camping trip.
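
The nested part is mostly just walking a key path before applying the comparison.  A minimal sketch (the path format and names here are assumptions for illustration):

    # Resolve a key path like node->{meta}->{score}.
    sub resolve_path {
        my ($node, @path) = @_;
        my $value = $node;
        for my $key (@path) {
            return undef unless ref $value eq 'HASH' && exists $value->{$key};
            $value = $value->{$key};
        }
        return $value;
    }

    # e.g. match nodes where node->{meta}->{score} > 10
    my @hits = grep {
        my $v = resolve_path($_, 'meta', 'score');
        defined $v && $v > 10;
    } @nodes;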

So, lots of progress, and even better, I have this working in strict mode in perl.  If you don't program perl, don't worry about that.

So, I have brought one other person into this project and am trying to get another in on it.  The first is a C programmer who thinks this is absolute lunacy.  The second is a perl programmer who taught me how to program in perl.  There is another that I am tempted to bring in, but early probing didn't really generate a lot of interest, so whatever.

What are the goals at this point?
1) Create a perl module that can be easily used to create in-memory graph databases with neural net hooks.
2) Create a C version of this with more advanced features, including threads, multi-homing, and persistence.
3) Create a Lisp version of this system....just because.  I mean, who wouldn't want a lisp version of this?
4) Find a new psychiatrist and get on the right meds.


Sunday, 25 June 2017

Graph Databases and Neural Nets II

It works.  Kind of in shock right now because what I have written is pretty abstract and I have had a difficult time really understanding it.  It works.

It works as a graph database.  I tested it using an unstructured data labeling problem with great results.  Basically, I loaded the IMDB database into the system, breaking titles into ngrams, storing both the titles and the ngrams as nodes, and then connecting the ngrams to the titles.  Each ngram connection to a title had a score based on the size and position of the ngram within the title.  Then I took a file containing 3000 filenames, all of which had something to do with movies or tv, and I tried to assign an IMDB title to each filename.  Now, the filenames were terrible, absolutely terrible.  There were misspellings, abbreviations, and a lot of garbage in the names, so my accuracy suffered, but out of 3000 filenames, I was able to accurately label 1800 of them or so.  That was just using connection queries to find aggregate scores for ngram matches to titles.  Not that bad, but with some work, like adding in spelling correction and dealing with abbreviations, this could be much better.  Unfortunately, this is not terribly fast.  I have written another solution for this problem that is much faster.  Still, a proof of concept.
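
The scoring idea, roughly: an ngram's connection to a title scores higher when the ngram is longer and appears earlier in the title.  A sketch with made-up weighting (the actual weighting in the system may differ):

    # Illustrative only: longer and earlier ngrams score higher.
    sub ngram_score {
        my ($ngram, $title) = @_;
        my $pos = index(lc $title, lc $ngram);
        return 0 if $pos < 0;
        my $size_w = length($ngram) / length($title);    # size component
        my $pos_w  = 1 - ($pos / length($title));        # position component
        return $size_w * $pos_w;
    }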

It works as a neural net/graph database.  I tested it by loading a book into the system, dividing it into nodes representing words and parts of speech, with connections between words and between parts of speech.  The neurons for this handled different parts of speech and recursively hunted for "next words" to form sentences.  When one neuron finished, it would send its output to a series of other neurons that were connected by virtue of the fact that the nodes the neurons were tied to were connected to other nodes that had neurons....wonky, yes, but it is roughly correct.  When a neuron run hit the iteration limit, the results object that was passed from neuron to neuron was processed and a sentence was formed.  That sentence was "I am not a moron, Kev."  Punctuation was part of the system, so it did put in the comma.  That was the first sentence.  The sentences that followed were not so great, which was what I expected, but getting that first one really made my day.  I should note that that sentence did not exist in the training text, so it was completely created by the neural net.

This system is actually just a perl module that you can use and customize as you see fit.  It is not a lot of code, but it will grow a bit as I add in better support for connection and node queries.  The two examples each used this library.  The labeler was about 100 lines of code.  The book reader was about 400 lines.  Perl is a powerful language.  It is horribly messy if you aren't on top of things, but it is really, really good.  The real downside is that it isn't terribly fast.  I want to translate this to another language.  I have thought about C, but the reality there is, as a friend of mine likes to say, "It is like building a house with tweezers and toothpicks."  Lisp is another possibility, but I am not terribly proficient in Lisp and I'm not really sure it would perform all that well anyway....Not sure.  Of course, there is Java, but I haven't given up on life quite yet.

Ultimately, this should be a system that can be distributed.  Also, right now, it doesn't save anything to disk, so when the program exits, you lose everything.  I'll get around to dealing with that eventually.

Anyway, the thing works and while it isn't terribly fast, it is pretty powerful.  All the guys at my job make fun of me for being so in love with perl.  They tell me I should be living in a hippy bus.  I haven't told them yet that I actually do live in a hippy bus.


Tuesday, 13 June 2017

Graph Databases and Neural Nets

So, for me, a very lazy person, getting a computer program to write books for me is the holy grail.  So, I spend a good amount of time writing programs that will write books.  So far, my work in this area has not produced anything of substance.

Recently, I started playing with graph databases.  In particular, I chose neo4j for my experiments.  Now, for those of you who don't know anything about databases, here is some background so the rest of this makes sense.  If we look at two different types of databases, relational and graph, we see big differences.  Relational databases organize data in tables, which have rows and columns.  Graph databases store data in nodes that can be connected.  These nodes contain information, data.  The connections can also store information about the type of connection.  So, an example.

In a relational database, I might have a "users" table that contains data on the users for some system.  That might have the columns: username, first name, last name, password.  So, for each user in the table, there is a row.  Now I might also have a table in there called "books" which stores books associated with users (fair warning, this is not going to be what I would call good database design.)  So, the columns in books might be: username, title, pages, genre.  So, in this case, username in "books" refers to a username in "users."  Thus columns in books are "related" to columns in "users."

In a graph database, I might have a type of node I call a user node and I might put all of the characteristics of my users in nodes of this type.  Then I might have nodes of type book that have the characteristics of books in them, but NOT a reference to users within the book nodes.  Then, I can connect user nodes to book nodes and assign values to the connection.  For instance, I might connect user "mark" to book "Kev" and set one of the properties of the connection to be "author."  I might have another user "sheila" connected to "Kev" with property "critic."
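
In Perl terms, and this is just a sketch of the shape of the thing, not the module's real internals, that example could look like:

    # Illustrative data layout only.
    my %nodes = (
        mark   => { type => 'user', first_name => 'Mark' },
        sheila => { type => 'user', first_name => 'Sheila' },
        kev    => { type => 'book', title => 'Kev' },
    );

    my @connections = (
        { from => 'mark',   to => 'kev', property => 'author' },
        { from => 'sheila', to => 'kev', property => 'critic' },
    );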

Very exciting stuff, and if you are a geek like me, you will probably immediately see how graph databases could be useful for a variety of things (but not all things...trust me.  I've tried a bunch of stuff and some of it is just way too painful to deal with.)

Anyway, one of my issues with neo4j was the speed at which it allowed me to insert nodes and create connections.  Given that I didn't really need something that fancy and also given that I had a mad idea of merging neural nets and graph databases, I got rid of neo4j and wrote my own in-memory graph database using Perl...  Yeah, I know.  Perl.  Look, if you are really comfortable with a language and can write code quickly with it, then you will likely use that language for proofs of concept.  Further, perl has some nice features that allow rapid development of this sort of thing.  Should I ultimately move this to C or C++?  Yes.  But for now, I just want to get it working.

So, graph databases and neural nets.  Why the hell would I want to merge those two things?  Well, to understand that, you probably need a basic understanding of neural nets.  I am going to very briefly describe them and you can do more research if you are so inclined.

Neural Nets are computer science's attempt to model the brain with code, or are one attempt at that.  There are three main components of a neural net: Inputs, Neurons, Output.  Now, inputs can be things like data from files or databases or whatnot, or can also be outputs from neurons.  So, basically, you can have networks of neurons taking input from a variety of sources, including each other.  Yay, that's great!  So, uh, why is that interesting?  Well, the neurons use algorithms to basically react to the data they are fed.  These algorithms create the output that goes wherever it goes.  Now, if you aren't seeing the beginning of a connection between graph databases and neural nets, then start seeing.  Basically, if a graph database node were the equivalent of a neuron in a neural network, with the added benefit of being able to store data, data that could change, data that could impact the functioning of the algorithm and possibly even alter the topology of the network as needed, then you might have a powerful tool for analyzing data and perhaps even creating an "intelligent" system.  I see this configuration as a neuron with a memory.

On the one hand, you have the database aspect, so you can query data and see relationships, etc., just like in a normal graph database.  On the other, you have the neural net aspect that gives the database the ability to react to the data and make decisions about the structure of the entire network.  So, your database is kind of self-aware.

Ok.  So, all that said, I first created a basic graph database (in-memory as opposed to storing data on disk) that allowed creation of nodes and connections and let the user set properties for these nodes and connections at the time of creation or any time thereafter.  It also has a basic query mechanism for finding individual nodes or finding connected nodes.  At present, I can insert 9 million nodes and create 90 million connections in about a minute, which is okay, but not exactly stellar.  If I had the added burden of disk IO, it would slow down dramatically.  But my computer has 64GB of RAM, so I'm not going to deal with disk crap at this point.
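
For flavor, here is roughly what using it looks like.  The method names below are placeholders, not the module's actual API:

    # Hypothetical usage sketch.
    my $db = GraphDB->new;

    my $mark = $db->add_node(name => 'mark', type => 'user');
    my $kev  = $db->add_node(name => 'Kev',  type => 'book');

    $db->connect($mark, $kev, { property => 'author' });
    $mark->set(last_login => time);    # properties can change at any time

    my @books = $db->connected($mark, { property => 'author' });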

Now, neural nets come into the picture, but how?  Well, remember that a neuron has some sort of algorithm associated with it (possibly more than one, but we will get to that in a bit.)  So, if a node is a neuron, I needed a way of associating the algorithm (basically a piece of code) with the node, in fact embedding it in the node.  Further, because I like systems that are as dynamic as possible, I want to be able to change algorithms within nodes on the fly.  So, I had a problem figuring out how I was going to do this with perl the way I wanted to, but I found a way that I really don't like.  Honestly, what I really wanted was the ability to have the perl code modify itself while running.  In fact, I wanted the perl code to be able to generate code, but that is out of reach I think.  I guess lisp could handle this, but I just don't have the patience for lisp (beautiful language, but a tough one for me.)
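
The Perl-ish move here, and a rough picture of the kind of thing I mean (a sketch, not necessarily what the module actually does), is to store a code reference in the node and swap it at runtime:

    # A node carrying a swappable algorithm as a coderef.
    my $node = {
        name => 'n1',
        data => { count => 0 },
        algo => sub { my ($self, $in) = @_; $self->{data}{count} += $in },
    };

    $node->{algo}->($node, 5);    # fire it

    # ...then change the algorithm on the fly.
    $node->{algo} = sub { my ($self, $in) = @_; $self->{data}{count} *= $in };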

So, I have nodes/neurons that have data and algorithms now.  I need something to make the algorithms react to the data.  Now, I am running single-threaded perl (I refuse to use the pthreads perl because I can't wrap my puny brain around it), so I technically have to have one neuron/node fire at a time and have its output then go to all of the other neurons/nodes it is connected to, or to output if it is that type of neuron/node.  Now, if you have 9 million nodes with 90 million connections, you can see that this is not going to be all that fast, which is why this needs to be written in C and run on a supercomputer, but whatever, I'm not doing that because I don't have a supercomputer.  So, here is the plan.  There are a variety of node types, and let's say I have nodes that are essentially trigger nodes or input nodes whose values trigger execution across the network.  So, I activate those nodes in some order and that propagates through the system until the "run" finishes, and then some other event starts the process up again, or maybe the system, once started, just keeps running...  Not done yet, and I don't really have this part sorted out, but I think I am on the right track.
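
Sketched out, the plan looks something like this (trigger selection and a fire() routine like the ones sketched in the entries above; all names hypothetical):

    # Activate trigger nodes in some order; each activation propagates
    # through the network until the run finishes or hits max depth.
    my @triggers = grep { $_->{type} eq 'trigger' } values %nodes;

    for my $t (@triggers) {
        my $options = {};
        $options = fire($t, $options, 0, $MAX_DEPTH);
        # ...some event would then kick off the next run
    }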

My primary test case is an NLP test case wherein I analyze books that are brought into the database as connected ngrams and so forth.  Somehow, I want to get this sucker to generate language.  So, quite a way to go, although much of the coding for the backend is done.  I'll write more as I get further along.

Sunday, 1 January 2017

NLP and Language Generation

From time to time I try to write programs that will generate language.  I do this because I often get interesting results.  Those results, however, are not even remotely close to real language.  Still, I try.  Here is my latest pass.

The program reads a book, in this case, Kev, and then analyzes the content and then spits out its version of the book.  I'd go into the details, but it is pretty technical, and technically, my approach is wrong, but, given that I only spent about twenty minutes on this today, it is a good start...

Here is a snippet:

Barrow I found Clives mind What Weve got back in the Are After You Figured Out Experience Sphere on Another possible types of yours Kev said the remains .
This was in back at one person who I will be too .
In that make continue writing books interesting characters of the kitchen counter my time to save the break the blue .
At one point in the park laughing and I guess so he disappeared and even if I felt the bills after dinner B24ME again didnt Ill have known her but found something for some fun .
Wait a rather Changing subjects  anything to the end .
He and kids B24ME I Millions of you If you hotel lobby Kev I said Cube Im The next The Show she A distortion in I gave What I Clive Wonderful I Those wings After The If I Aputi This was The sphere I placed Those in Figured Id Forget The universe Of course Yeah I had over to explain that  About thirty-seven days later if I want to recreate the fort until a matter the .
Now wasnt human matter Its an ants I had all went back to Uthio Minor in this girl about something far away from anywhere I had put you going to .
Anyway you know that one to end go to .
Note that with toys All I utter surety you was the cubes a way little yellow .
I am back impaired Great and lumbered away over and were mismatched and me I said Clive Why cant kill me to go get .
If you have to me red cube and stop would try in the same descriptions people Nothing happened realizing I .
I left over and started a strange dream a vague in going the universe would tell me a small park laughing and .
I dont that led me and Jesus when he finds me voice told me I never should press it through challenges for hours said you I said material assistance rule I like Youll figure things my .
I raced trying to some at the sound of the planet almost completely unnecessary question truthfully when will figure out of nowhere mowed her hand I know each had been on Kev large number of the the picnic table a gun out I together Lovely said annoyed as the I appeared in .
Despite birth So what else is a philosophical discussion of television something didnt remember me .
I set to make any greater satisfaction .
At that I now girl and girls and nightmares is ?
No I a visit and saw way I had made any further probably lose your mind I had a planet is  Now I felt something now Aputi had not I will allow me I wish I I forgotten sitting .
Now  About Me too small workshop constructed by wishing to wondering what it will create a most powerful containment field generator .
I know I felt became mine that hinted that ever existed Well how .
Much of the table telling know of my desk a strange horned .
At that is a message read it surprised to go back probably hell was hell said Clive said Brok What was going to the hands disintegrating .
Of course fingers a quite helpful I am sick of course it Like where I had forgotten havent been more damage you grabbed .
Everyone Ah is Aputi know how can connect Yeah well I approached finally one with only one .
I could see and B24ME into the way out all of trillions of the Canadians involvement in the girl on this answer for that day I know Lets go My friend if you know Kev a .
After a wishing cube to me from me there and understanding of us there but I said Ruby and knew this time making a trick to get both of your own definition for some do this boy named Bri from a hundred thirty-seven that I turned .
I knew I returned to hunt made a different lives an hour and said I just got in time I did in love mixed with a previous book than I had disappeared  About thirty-seven quadrillion quadrillionanyway you know anyone to take it twice my dad I scooped a complete lack of hell is rule is I just arrived including one I wished cant remember from Uncle Joe flew into song for a part of all you could get my temple and I appeared in a is the who are not going to end of surprise you for appearances on Earth or at the others might be sucked into days so or at me I knew reading it out for .
Kev Then I ?
Needless to kill me in danger and punched and you and Clive had to kill me I knew I didnt say Hey Max and through all the evidence seems to check on Earth that everyone other home to this I pulled out the girl and that nightmare come on and a chair .
Are you tell down on Galthinon I am going to have named after him in awake to say  About a trap me my studies my Look if you might  the unlikely that for Aputi go to find it or the infinite to leave soon  About thirty-seven days later a perfect love to check your large portion of locations in I was being responsible for the yellow cube so messages about .
I knew up things that you havent you will if you have a room and stopped you to attend  Now you that often given that I said Max wondering if he back Ill come back in Connecticut my days later rules .
Now I am normal circumstances being there One of all killed me to do to get help but you The response Hey Kev laughed .
Moments later my B24ME and jump out  About three The Show You look saying the boys Clive coming to be thirty-seven billion light-years Doug Aputi wiped everything has really got Sorry Turd Fondler Forget I suspected this Clive the Proth B24ME hoping Clive Which one side with a football stadium with the universe Where are you think it said the Lost Hope Hotel Three days Max said the The voice said Clive and then I Kev said ignoring Clive has believed that God as you take any sense I thought you will prior universe I said Aputi could wish Nope but I did Why did not need Singularity  Now I will take That does roof and saw a of anything else had Were never read all of an is a goofy grin on Uncle Joes Clive panting Ruby I woke what constituted material world around The sphere exists again and Clive as you can Aputi I survived his head a couple hundred The alien races this a kiss and put it Well Clive and I know that I die many things out B24ME was I So all home into Call me at the rules after the number given me to manipulate Doug Great We In that has inside of not remember his face changed into space but were already knew it Return Contestant The Do you a it just that .
I had written this just delivered the building the red dragon it real name she said .
I had no game said Does he wouldnt budge  About Me No The voice telling me this way a little yellow .
You might want to know that Clive did possibly be with my green tea The cities on a deal to the rules .
Show said Clive Bri the lines of yours said annoyed as writing the voice a .
Well I swear  Now ?
Now why the woman finished and of a knock .

===

So, some interesting stuff is generated, but for the most part it is gibberish.  Most of this is based on statistics, but there are some other rules governing the behavior, like giving weight to words at the beginnings of sentences based on how often they actually do begin sentences in the source text.  Also, I look at parts of speech and whether things are bigrams or are just plain collocated in sentences, although that is not so heavily weighted.  The next phase of this is noun-verb agreement, keeping track of actors, and giving the text consistency.  The end goal is a program or set of programs that can write something that makes sense, of course, but that is a long, long way off.  Translated: 20 minutes will easily turn into weeks or months.
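
As a rough illustration of the statistical part, here is a weighted next-word pick over bigram counts.  The counts and structure are made up for the example; the actual program layers on the other rules mentioned above:

    use List::Util qw(sum);

    # $bigrams->{$word}{$next} = count of "$word $next" in the source text.
    sub pick_next {
        my ($word, $bigrams) = @_;
        my $candidates = $bigrams->{$word} or return;
        my $total = sum values %$candidates;
        my $roll  = rand($total);
        for my $next (keys %$candidates) {
            $roll -= $candidates->{$next};
            return $next if $roll <= 0;
        }
    }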