Sunday, 25 June 2017

Graph Databases and Neural Nets II

It works. I'm kind of in shock right now, because what I have written is pretty abstract and I have had a difficult time really understanding it myself. It works.

It works as a graph database. I tested it using an unstructured-data labeling problem, with great results. Basically, I loaded the IMDB database into the system, breaking each title into ngrams, storing both the titles and the ngrams as nodes, and then connecting the ngrams to the titles they appear in. Each ngram-to-title connection carried a score based on the size and position of the ngram within the title. Then I took a file containing 3000 filenames, all of which had something to do with movies or TV, and tried to assign an IMDB title to each one. Now, the filenames were terrible, absolutely terrible. There were misspellings, abbreviations, and a lot of garbage in the names, so my accuracy suffered, but out of 3000 filenames I was able to accurately label about 1800 of them, roughly 60%. That was just using connection queries to find aggregate scores for ngram matches against titles. Not bad, and with some work, like adding spelling correction and handling abbreviations, it could be much better. Unfortunately, it is not terribly fast; I have written another solution to this problem that is much faster. Still, it's a proof of concept.
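To give a flavor of the scoring, here is a rough sketch of the idea in Perl. This is not the actual module code; the ngram sizes (3 to 5), the exact score formula, and the normalization by title length are just one plausible way to do the "size and position" scoring:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: index titles by character ngrams, then label a messy
    # filename by summing the scores of every ngram it shares with a title.
    my %index;    # ngram => { title => accumulated score }

    sub ngrams {
        my ($text, $n) = @_;
        return map { [ substr($text, $_, $n), $_ ] } 0 .. length($text) - $n;
    }

    sub add_title {
        my ($title) = @_;
        my $clean = lc $title;
        my $len   = length $clean;
        for my $n (3 .. 5) {                       # assumed ngram sizes
            for my $g (ngrams($clean, $n)) {
                my ($gram, $pos) = @$g;
                # longer ngrams and earlier positions score higher; dividing
                # by title length is my assumption here, so short exact
                # titles beat long titles that merely share a prefix
                $index{$gram}{$title} += $n / (1 + $pos) / $len;
            }
        }
    }

    sub label {
        my ($filename) = @_;
        my $clean = lc $filename;
        $clean =~ s/[^a-z0-9]+/ /g;                # strip the garbage
        my %total;
        for my $n (3 .. 5) {
            for my $g (ngrams($clean, $n)) {
                my ($gram) = @$g;
                next unless $index{$gram};
                $total{$_} += $index{$gram}{$_} for keys %{ $index{$gram} };
            }
        }
        my ($best) = sort { $total{$b} <=> $total{$a} } keys %total;
        return $best;
    }

    add_title($_) for ('The Matrix', 'The Matrix Reloaded', 'Blade Runner');
    print label('the.matrix.1999.dvdrip.xvid'), "\n";   # prints "The Matrix"

The real version does this through node and connection queries in the graph rather than a flat hash, but the aggregate-score idea is the same.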

It works as a neural net/graph database. I tested it by loading a book into the system, dividing it into nodes representing words and parts of speech, with connections between words and between parts of speech. The neurons handled different parts of speech and recursively hunted for "next words" to form sentences. When one neuron finished, it would send its output to a series of other neurons, connected by virtue of the fact that the nodes those neurons were tied to were connected to other nodes that had neurons of their own. Wonky, yes, but that is roughly correct. When a neuron run hit the iteration limit, the results object that had been passed from neuron to neuron was processed and a sentence was formed. That sentence was "I am not a moron, Kev." Punctuation was part of the system, so it did put in the comma. That was the first sentence. The sentences that followed were not so great, which was what I expected, but getting that first one really made my day. I should note that that sentence did not exist in the training text, so it was completely created by the neural net.
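Here is a stripped-down sketch of the neuron chaining, leaving out the part-of-speech layer entirely: each word node keeps weighted "next word" connections, and each neuron appends a word to the results object before handing it off to the next neuron, until the iteration limit stops the run. The names, the limit of 10, and the weighted random pick are placeholders, not the module's actual behavior:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %next;             # word => { following word => connection count }
    my $ITER_LIMIT = 10;  # assumed iteration limit

    sub train {
        my ($text) = @_;
        # punctuation is part of the system, so it gets its own tokens
        my @words = $text =~ /([\w']+|[,.!?])/g;
        $next{ $words[$_] }{ $words[$_ + 1] }++ for 0 .. $#words - 1;
    }

    # One "neuron": add this node's word to the results object, then
    # recurse into a connected node's neuron, weighted by connection count.
    sub fire {
        my ($word, $results, $depth) = @_;
        push @$results, $word;
        return $results if $depth >= $ITER_LIMIT || !$next{$word};
        my $links = $next{$word};
        my $total = 0;
        $total += $_ for values %$links;
        my $roll = rand($total);
        for my $cand (keys %$links) {
            $roll -= $links->{$cand};
            return fire($cand, $results, $depth + 1) if $roll <= 0;
        }
        return $results;
    }

    train("I am not a moron, Kev. You are not a moron.");
    my $sentence = join ' ', @{ fire('I', [], 0) };
    $sentence =~ s/ ([,.!?])/$1/g;    # reattach punctuation
    print "$sentence\n";              # output varies with the random picks,
                                      # e.g. "I am not a moron, Kev. You are not"

The real thing routes through part-of-speech neurons instead of a flat word chain, but the shape is the same: pass the results object from neuron to neuron until the limit, then process it into a sentence.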

This system is actually just a Perl module that you can use and customize as you see fit. It is not a lot of code, but it will grow a bit as I add better support for connection and node queries. The two examples above each used this library: the labeler was about 100 lines of code, and the book reader was about 400. Perl is a powerful language. It is horribly messy if you aren't on top of things, but it is really, really good. The real downside is that it isn't terribly fast, so I want to translate this to another language. I have thought about C, but the reality there is, as a friend of mine likes to say, "It is like building a house with tweezers and toothpicks." Lisp is another possibility, but I am not terribly proficient in Lisp, and I'm not really sure it would perform all that well anyway... Not sure. Of course, there is Java, but I haven't given up on life quite yet.
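I'm not going to paste the real module here, but a toy skeleton of the shape looks something like this. The name GraphNet and all of the method names are placeholders, not the actual API:

    package GraphNet;    # placeholder name
    use strict;
    use warnings;

    sub new        { bless { nodes => {}, edges => {}, neurons => {} }, shift }
    sub add_node   { my ($self, $id, %attr) = @_; $self->{nodes}{$id} = \%attr }
    sub connect    { my ($self, $from, $to, $score) = @_;
                     $self->{edges}{$from}{$to} = $score }
    sub add_neuron { my ($self, $id, $code) = @_; $self->{neurons}{$id} = $code }
    sub fire       { my ($self, $id, $results) = @_;
                     $self->{neurons}{$id}->($self->{nodes}{$id}, $results) }

    package main;

    my $g = GraphNet->new;
    $g->add_node('title:The Matrix', type => 'title');
    $g->add_node('ngram:mat',        type => 'ngram');
    $g->connect('ngram:mat', 'title:The Matrix', 1.5);
    $g->add_neuron('title:The Matrix', sub {
        my ($node, $results) = @_;     # a neuron gets its node plus the
        push @$results, $node->{type}; # shared results object, and returns it
        return $results;
    });
    my $out = $g->fire('title:The Matrix', []);

Nodes, scored connections, and neurons attached to nodes: that is really all there is to it, which is why the two examples stayed so small.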

Ultimately, this should be a system that can be distributed.  Also, right now, it doesn't save anything to disk, so when the program exits, you lose everything.  I'll get around to dealing with that eventually.

Anyway, the thing works, and while it isn't terribly fast, it is pretty powerful. All the guys at my job make fun of me for being so in love with Perl. They tell me I should be living in a hippy bus. I haven't told them yet that I actually do live in a hippy bus.