Friday, 4 August 2017

Graph Databases and Neural Nets IV

Writing a query engine for a database seems pretty simple on the surface.  However, when you really dig into it, it sucks and you start to question why you are writing a query engine.  I mean, don't you have better things to do?  Don't you have a family that is wondering why you are locked away in the attic in the dark, mumbling to yourself, occasionally screaming, and more occasionally throwing and breaking things?  Whatever.

The basic problem here is writing a query engine that allows for a wide array of conditions.  Now, for my engine, I wanted to provide at a bare minimum the ability to query exact values, >, <, >=, <=, IN, NOT IN and KEY EXISTS.  Sounds pretty simple, right?  How hard could that be?  Well, doing them individually is actually very, very simple.  But when you have complex queries against multiple keys and also add in the ability to use AND and OR for combinations of conditions, it starts to get pretty ugly.
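
To make the shape of the problem concrete, here is a minimal sketch of how conditions like these might be combined.  It is written in Python rather than perl, it is not the engine's actual code, and the nested-tuple query format, the `OPS` table and the `matches` function are all made up for illustration.  It also allows AND and OR to nest, which is an assumption on my part about where this is headed.

```python
import operator

# Hypothetical operator table; the names mirror the conditions listed above.
OPS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "IN": lambda value, options: value in options,
    "NOT IN": lambda value, options: value not in options,
}

def matches(obj, query):
    """Return True if the dict `obj` satisfies the nested `query` tuple."""
    kind = query[0]
    if kind == "AND":
        return all(matches(obj, sub) for sub in query[1:])
    if kind == "OR":
        return any(matches(obj, sub) for sub in query[1:])
    if kind == "KEY EXISTS":
        return query[1] in obj
    # Leaf condition: (operator name, key, operand).
    op, key, operand = query
    if key not in obj:
        return False  # assumed behaviour: a missing key never matches
    return OPS[op](obj[key], operand)

# Made-up objects standing in for nodes.
nodes = [
    {"name": "a", "weight": 5, "type": "word"},
    {"name": "b", "weight": 12, "type": "title"},
    {"name": "c", "type": "word"},
]

query = ("AND",
         ("KEY EXISTS", "weight"),
         ("OR", (">", "weight", 10), ("IN", "name", ["a", "z"])))

found = [n["name"] for n in nodes if matches(n, query)]
print(found)
```

Each condition on its own is one line in the table.  The ugliness shows up in the recursion and in deciding what a missing key should mean, which is exactly where a real engine needs far more care than this.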

I have done it, of course, and it works, but it took two tries to get it working and now, it is a horrible mess and needs to be completely rewritten and optimized and that is going to lead to errors and screaming and throwing and breaking, but I will get it working and working well.

Once that is working, I will move on to things like DATES, REGULAR EXPRESSIONS and then the ability to embed function calls within the queries.  (WHY?  BECAUSE.)

This is a slow process.  It is slow because I am only allowing myself to take small steps between tests.  Tests take time and patience (and an understanding of what the hell needs to be tested, which I often do not entirely have.)  So, baby steps.  This approach has led to fewer bugs overall.  However, it goes against my natural inclination to just barrel through things to "get it done."  Frustrating.  Very frustrating, but it is paying off.

So, here is what is working now:
  1. Creation of networks, nodes, connections and neurons
  2. Modification of networks, nodes, connections and neurons
  3. Ability to find connections for any object (node, connection, neuron)
    1. ==, >, <, >=, <=, IN, NOT IN, KEY EXISTS queries working and combinable
    2. AND and OR allowed but only globally for a query
  4. Indexes for object names, types and keys
  5. Neuron execution (Recursive, traverses network, can be slow.)
  6. Path analysis. (This is pretty slow for complex networks if depth > 2)
  7. Balance analysis (relationship between incoming and outgoing connections, a value between -1 and 1)
  8. Common connection queries (given a group of objects what connections are common to them?)  This allows for strict or fuzzy matching and is really just a way of clustering data.
  9. Deep connections.  A depth limited (configurable) search for connections to a given object
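
Two of the items above (balance analysis and deep connections) are small enough to sketch.  This is Python rather than perl, and none of it is the module's actual code: the edge list, the function names and especially the balance formula, (incoming - outgoing) / (incoming + outgoing), are assumptions chosen only because they give a value between -1 and 1.

```python
from collections import deque

# Hypothetical directed edges (source, target); not the module's data model.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e"), ("e", "a")]

def balance(node, edges):
    """Assumed formula: (incoming - outgoing) / (incoming + outgoing)."""
    incoming = sum(1 for _, dst in edges if dst == node)
    outgoing = sum(1 for src, _ in edges if src == node)
    total = incoming + outgoing
    return 0.0 if total == 0 else (incoming - outgoing) / total

def deep_connections(start, edges, max_depth):
    """Breadth-first search along outgoing edges, at most max_depth hops."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, []).append(dst)
    depth_of = {start: 0}  # doubles as the visited set, so loops terminate
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # configured limit reached; do not expand further
        for nxt in neighbours.get(node, []):
            if nxt not in depth_of:
                depth_of[nxt] = depth_of[node] + 1
                queue.append(nxt)
    del depth_of[start]
    return depth_of  # reachable object -> number of hops from start

print(balance("a", edges))
print(deep_connections("a", edges, 2))
```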

It isn't really a ton of code.  There are four files (soon to be more when I break out the query stuff into its own library.)  The code is clean except for the query stuff.  Variable naming is pretty solid.  I follow a set of patterns to keep it from becoming an unreadable nightmare.  So, in pretty good shape so far, but there is so much more to do.  I'm probably less than 30% done with it.

Saturday, 22 July 2017

Graph Databases and Neural Nets III

So, after a lot of pain and suffering, I have worked out most of the issues with this system.  The tricky part, of course, was neuron execution.  Here is the problem.  If I execute a neuron's code in this system, that code takes the object the neuron is attached to and an options object as arguments.  The code then returns a potentially modified options object.  I could go into a long explanation for this, but I won't.  Suffice it to say, I wanted to maintain some sort of state for the results so that as more and more neurons fired, they would get this options object and be able to make decisions based on its values.

Now, when a neuron fires, it basically runs its own code, then it looks through all of its connections for other neurons to send the data to so they can fire and so on.  Now, I know you are saying, "Couldn't that go on forever if there are loops in the network?"  Yes, it could, but the system allows a max depth to be set.  Anyway, the neuron essentially finds other neurons and those run, but they run in series, and because the options object potentially changes each time a neuron fires, as things stand I do not pass the same options each time I send to another neuron.  What does that mean?

Let's say I execute neuron 1.  Neuron 1 is connected to neuron 2 and neuron 3.  So, now, I pass the modified options object from the neuron 1 fire to neuron 2 and it modifies the object and for now let's say it isn't connected to anything else, so we drop down to neuron 3 which has the options object that was modified by neuron 2.  Now, what if I wanted to pass the same options object that came from the neuron 1 fire to both neuron 2 and neuron 3?  I might want to do that, right?  In fact, I might want to do that more often than not depending on the type of data etc. I am dealing with.  So, I think I just have to make this an option (the default option.)  I will get around to it eventually.
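
Here is a rough sketch of the two behaviours using the neuron 1, 2, 3 example.  It is Python, not the perl module, and everything in it (the `network` dict, `make_code`, `fire`, the `serial` flag and the `saw` log) is hypothetical.  What happens to the options returned by each branch in the second mode is an open question, so the sketch simply drops them.

```python
import copy

saw = {}  # what each neuron found in options["fired"] when it ran

def make_code(name):
    """Build neuron code: takes (object, options), returns the options."""
    def code(obj, options):
        saw[name] = list(options["fired"])
        options["fired"].append(name)
        return options
    return code

# Hypothetical network: neuron name -> (code, connected neuron names).
network = {
    "n1": (make_code("n1"), ["n2", "n3"]),
    "n2": (make_code("n2"), []),
    "n3": (make_code("n3"), []),
}

def fire(name, options, depth=0, max_depth=5, serial=True):
    """Run a neuron, then the neurons it connects to, down to max_depth."""
    code, targets = network[name]
    options = code(name, options)  # the name stands in for the attached object
    if depth >= max_depth:
        return options  # the depth limit is what stops loops
    for target in targets:
        if serial:
            # Current behaviour: each neuron gets what the previous one returned.
            options = fire(target, options, depth + 1, max_depth, serial)
        else:
            # Alternative: every target starts from a copy of this neuron's
            # output. The branch results are discarded here (an assumption).
            fire(target, copy.deepcopy(options), depth + 1, max_depth, serial)
    return options

fire("n1", {"fired": []}, serial=True)
print(saw["n3"])   # neuron 3 sees neuron 2's change

fire("n1", {"fired": []}, serial=False)
print(saw["n3"])   # neuron 3 sees only neuron 1's output
```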

Another issue I have now is that I don't have a formal output process for neurons.  So, right now, they just follow connections until they run out of connections or hit their max depth.  What happens after that?  Well, technically, the options object is returned at the end of the run, so you can do what you want with that.  However, it seems that there is a case here for exit procedures of some sort.  I fire a neuron and this cascades to some point and then I do SOMETHING with the options object at the end of the run.  Now, that can just be baked into whatever code you write that uses this module, or it could be something that is a part of the module itself.  I don't know.  Honestly, I am beginning to believe that I have gone completely off the rails.  I am questioning everything.  Where have I gone wrong?  I was happy once.  I had friends (not really.)  I believed I could do anything (mania.)  But now, I am lost.


The graph database part of this has reached a point where the only thing left to do is allow advanced key queries.  For example, queries like "node->{key}->{some key} > 10" or "node->{key}->{some key} in ['a','b','c']".  Well, sub-queries would also be nice, but then you get into nasty recursion and I really don't have the patience for that right now.  I'll leave that to the other poor suckers I've dragged into this nightmare.  Yeah, I said it.  Nightmare.  This is the worst thing that has happened to me...EVER.  Not really.  It is the second worst thing.  The worst thing was when I got a cluster of ticks on a rather sensitive part of my body while on a camping trip.

So, lots of progress and even better is that I have this working in strict mode in perl.  If you don't program perl, don't worry about that.

So, I have brought one other person into this project and am trying to get another in on it.  The first is a C programmer who thinks this is absolute lunacy.  The second is a perl programmer who taught me how to program in perl.  There is another that I am tempted to bring in, but early probing didn't really generate a lot of interest, so whatever.

What are the goals at this point?
1) Create a perl module that can be easily used to create in memory graph databases with neural net hooks.
2) Create a C version of this with more advanced features including threads, multi-homing, and persistence.
3) Create a Lisp version of this system....just because.  I mean, who wouldn't want a lisp version of this?
4) Find a new psychiatrist and get on the right meds.

Sunday, 25 June 2017

Graph Databases and Neural Nets II

It works.  Kind of in shock right now because what I have written is pretty abstract and I have had a difficult time really understanding it.  It works.

It works as a graph database.  I tested it using an unstructured data labeling problem with great results.  Basically, I loaded the IMDB database into the system, breaking titles into ngrams and storing the titles in nodes and the ngrams in nodes and then connecting the ngrams to the titles.  Each ngram connection to a title had a score based on the size and position of the ngram within the title.  Then I took a file containing 3000 filenames, all of which had something to do with movies or tv, and I tried to assign an IMDB title to each filename.  Now, the filenames were terrible, absolutely terrible.  There were misspellings, abbreviations, and a lot of garbage in the names, so my accuracy suffered, but out of 3000 filenames, I was able to accurately label 1800 of them or so.  That was just using connection queries to find aggregate scores for ngram matches to titles.  Not that bad, but with some work, like adding in spelling correction and dealing with abbreviations, this could be much better.  Unfortunately, this is not terribly fast.  I have written another solution for this problem that is much faster.  Still, a proof of concept.
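
For the curious, the labeling idea boils down to something like the sketch below.  This is Python, not the perl I actually used, and the details are assumptions made for illustration: word-level ngrams of up to three words, a connection score of size / (1 + position), and the function names `word_ngrams`, `build_index` and `label`.  The titles and the filename are made up, not taken from the real test.

```python
import re
from collections import defaultdict

def word_ngrams(text, max_n=3):
    """Yield (ngram, size, position) for every run of 1..max_n words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_n + 1):
        for pos in range(len(words) - n + 1):
            yield " ".join(words[pos:pos + n]), n, pos

def build_index(titles):
    """Map each ngram to {title: connection score}."""
    index = defaultdict(dict)
    for title in titles:
        for gram, size, pos in word_ngrams(title):
            # Assumed scoring: longer ngrams and earlier positions count more.
            score = size / (1 + pos)
            index[gram][title] = max(score, index[gram].get(title, 0.0))
    return index

def label(filename, index):
    """Return the title with the highest aggregate score, or None."""
    totals = defaultdict(float)
    for gram, _, _ in word_ngrams(filename):
        for title, score in index.get(gram, {}).items():
            totals[title] += score
    return max(totals, key=totals.get) if totals else None

titles = ["The Big Sleep", "The Big Lebowski", "Sleepless in Seattle"]
index = build_index(titles)
print(label("the.big.lebowski.1998.dvdrip.avi", index))
```

Exact word matching like this falls over on misspellings and abbreviations, which is where the accuracy went.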

It works as a neural net/graph database.  I tested it by loading a book into the system, dividing it into nodes representing words, parts of speech and then connections between words and between parts of speech.  The neurons for this were neurons that handled different parts of speech and recursively hunted for "next words" to form sentences.  When one neuron finished, it would send its output to a series of other neurons that were connected by virtue of the fact that the nodes the neurons were tied to were connected to other nodes that had neurons....wonky, yes, but it is roughly correct.  When a neuron run hit the iteration limit, the results object that was passed from neuron to neuron was processed and a sentence was formed.  That sentence was "I am not a moron, Kev."  Punctuation was part of the system, so it did put in the comma.  That was the first sentence.  The sentences that followed were not so great, which was what I expected, but getting that first one really made my day.  I should note that that sentence did not exist in the training text, so it was completely created by the neural net.

This system is actually just a perl module that you can use and customize as you see fit.  It is not a lot of code, but will grow a bit as I add in better support for connection and node queries.  The two examples each used this library.  The labeler was about 100 lines of code.  The book reader was about 400 lines of code.  Perl is a powerful language.  It is horribly messy if you aren't on top of things, but is really, really good.  The real downside is that it isn't terribly fast.  I want to translate this to another language.  I have thought about C, but the reality there is, as a friend of mine likes to say, "It is like building a house with tweezers and toothpicks."  Lisp is another possibility, but I am not terribly proficient in Lisp and I'm not really sure it would perform all that well anyway....Not sure.  Of course, there is Java, but I haven't given up on life quite yet.

Ultimately, this should be a system that can be distributed.  Also, right now, it doesn't save anything to disk, so when the program exits, you lose everything.  I'll get around to dealing with that eventually.

Anyway, the thing works and while it isn't terribly fast, it is pretty powerful.  All the guys at my job make fun of me for being so in love with perl.  They tell me I should be living in a hippy bus.  I haven't told them yet that I actually do live in a hippy bus.