Sunday, 25 June 2017

Graph Databases and Neural Nets II

It works.  Kind of in shock right now because what I have written is pretty abstract and I have had a difficult time really understanding it.  It works.

It works as a graph database.  I tested it on an unstructured-data labeling problem with great results.  Basically, I loaded the IMDB database into the system, storing each title as a node, breaking the titles into ngrams, storing those as nodes as well, and connecting each ngram to the titles it appears in.  Each ngram-to-title connection carried a score based on the size and position of the ngram within the title.  Then I took a file containing 3000 filenames, all of which had something to do with movies or TV, and tried to assign an IMDB title to each one.  Now, the filenames were terrible, absolutely terrible.  There were misspellings, abbreviations, and a lot of garbage in the names, so my accuracy suffered, but out of 3000 filenames I was able to accurately label around 1800, roughly 60%.  That was just using connection queries to find aggregate scores for ngram matches against titles.  Not bad, and with some work, like adding spelling correction and handling abbreviations, it could be much better.  Unfortunately, this is not terribly fast, and I have written another solution for this problem that is much faster.  Still, a proof of concept.
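Here is a toy sketch of the scoring idea, just to make it concrete.  Plain hashes stand in for the graph, and the scoring formula is illustrative rather than the exact one I used:

    use strict;
    use warnings;

    # Break a title into ngrams of length 3..6; longer ngrams near the
    # front of the title score higher.  (The real system stores these
    # as nodes and scored connections in the graph.)
    sub title_ngrams {
        my ($title) = @_;
        my %scores;
        my $len = length $title;
        for my $n (3 .. 6) {
            for my $pos (0 .. $len - $n) {
                my $gram = lc( substr($title, $pos, $n) );
                $scores{$gram} += $n * (1 - $pos / $len);
            }
        }
        return \%scores;
    }

    # Aggregate the scores of every ngram that appears in the filename
    # and pick the best-scoring title.
    sub best_title {
        my ($filename, $title_grams) = @_;   # { title => ngram-score hashref }
        my ($best, $best_score) = (undef, 0);
        my $fn = lc $filename;
        for my $title (keys %$title_grams) {
            my $grams = $title_grams->{$title};
            my $total = 0;
            $total += $grams->{$_} for grep { index($fn, $_) >= 0 } keys %$grams;
            ($best, $best_score) = ($title, $total) if $total > $best_score;
        }
        return $best;
    }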

It works as a neural net/graph database.  I tested it by loading a book into the system, dividing it into nodes representing words and parts of speech, with connections between words and between parts of speech.  The neurons handled different parts of speech and recursively hunted for "next words" to form sentences.  When one neuron finished, it would send its output to a series of other neurons, connected by virtue of the fact that the nodes those neurons were tied to were connected to other nodes that had neurons... wonky, yes, but that is roughly correct.  When a run hit the iteration limit, the results object that had been passed from neuron to neuron was processed and a sentence was formed.  That sentence was "I am not a moron, Kev."  Punctuation was part of the system, so it put in the comma itself.  That was the first sentence.  The sentences that followed were not so great, which was what I expected, but getting that first one really made my day.  I should note that that sentence did not exist in the training text; it was created entirely by the neural net.
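To make the "next word" hunt concrete, here is a heavily simplified sketch.  The real system routes everything through part-of-speech neurons; this collapses all of that into a single weighted next-word table, so treat it as the flavor of the thing rather than the thing itself:

    use strict;
    use warnings;

    # Build a next-word table from the training text on stdin, then walk it.
    my %next;    # word => { following word => count }
    my @words = split ' ', do { local $/; <> };
    for my $i (0 .. $#words - 1) {
        $next{ $words[$i] }{ $words[$i + 1] }++;
    }

    # Start somewhere random and keep picking weighted "next words"
    # until we hit the iteration limit or end-of-sentence punctuation.
    my $word = $words[ int rand @words ];
    my @sentence = ($word);
    for (1 .. 20) {
        my $followers = $next{$word} or last;
        my $total = 0;
        $total += $_ for values %$followers;
        my $pick = rand $total;
        for my $cand (keys %$followers) {
            $pick -= $followers->{$cand};
            if ($pick <= 0) { $word = $cand; last; }
        }
        push @sentence, $word;
        last if $word =~ /[.!?]$/;
    }
    print "@sentence\n";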

This system is actually just a Perl module that you can use and customize as you see fit.  It is not a lot of code, but it will grow a bit as I add better support for connection and node queries.  The two examples above each used this library: the labeler was about 100 lines of code, the book reader about 400.  Perl is a powerful language.  It gets horribly messy if you aren't on top of things, but it is really, really good.  The real downside is that it isn't terribly fast, so I want to translate this to another language.  I have thought about C, but the reality there is, as a friend of mine likes to say, "It is like building a house with tweezers and toothpicks."  Lisp is another possibility, but I am not terribly proficient in Lisp and I'm not sure it would perform all that well anyway.  Of course, there is Java, but I haven't given up on life quite yet.

Ultimately, this should be a system that can be distributed.  Also, right now, it doesn't save anything to disk, so when the program exits, you lose everything.  I'll get around to dealing with that eventually.

Anyway, the thing works and while it isn't terribly fast, it is pretty powerful.  All the guys at my job make fun of me for being so in love with perl.  They tell me I should be living in a hippy bus.  I haven't told them yet that I actually do live in a hippy bus.


Tuesday, 13 June 2017

Graph Databases and Neural Nets

So, for me, a very lazy person, getting a computer program to write books for me is the holy grail.  So, I spend a good amount of time writing programs that will write books.  So far, my work in this area has not produced anything of substance.

Recently, I started playing with graph databases; in particular, I chose neo4j for my experiments.  Now, for those of you who don't know anything about databases, here is some background so the rest of this makes sense.  If we look at two different types of databases, relational and graph, we see big differences.  Relational databases organize data in tables, which have rows of columns.  Graph databases store data in nodes that can be connected to one another.  The nodes contain information, data, and the connections can also store information about the type of connection.  So, an example.

In a relational database, I might have a "users" table that contains data on the users of some system.  It might have the columns username, first name, last name, and password, with one row per user.  I might also have a table called "books" which stores books associated with users (fair warning: this is not going to be what I would call good database design).  The columns in "books" might be username, title, pages, and genre.  In this case, username in "books" refers to a username in "users"; thus, columns in "books" are "related" to columns in "users."
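To make that concrete, here are those two tables sketched as plain Perl data (the values are made up):

    use strict;
    use warnings;

    # One hash per row; the column names become hash keys.
    my @users = (
        { username => 'mark',   first_name => 'Mark',   last_name => 'S.', password => 'hunter2'   },
        { username => 'sheila', first_name => 'Sheila', last_name => 'R.', password => 'swordfish' },
    );
    my @books = (
        { username => 'mark', title => 'Kev', pages => 212, genre => 'fiction' },
    );

    # The "relation" is nothing more than matching on the shared column.
    print "$_->{title}\n" for grep { $_->{username} eq 'mark' } @books;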

In a graph database, I might have a type of node I call a user node and I might put all of the characteristics of my users in nodes of this type.  Then I might have nodes of type book that have the characteristics of books in them, but NOT a reference to users within the book nodes.  Then, I can connect user nodes to book nodes and assign values to the connection.  For instance, I might connect user "mark" to book "Kev" and set one of the properties of the connection to be "author."  I might have another user "sheila" connected to "Kev" with property "critic."
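And here is the same data in graph form, with plain hashes standing in for whatever a graph database does internally.  Notice that the relationship now lives on the connection, not in the book node:

    use strict;
    use warnings;

    my %nodes = (
        mark   => { type => 'user', username => 'mark'   },
        sheila => { type => 'user', username => 'sheila' },
        kev    => { type => 'book', title    => 'Kev'    },
    );
    my @connections = (
        { from => 'mark',   to => 'kev', property => 'author' },
        { from => 'sheila', to => 'kev', property => 'critic' },
    );

    # Walk the connections instead of joining tables.
    for my $c (grep { $_->{to} eq 'kev' } @connections) {
        print "$c->{from} is $c->{property} of Kev\n";
    }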

Very exciting stuff, and if you are a geek like me, you will probably immediately see how graph databases could be useful for a variety of things (but not all things... trust me, I've tried a bunch of stuff and some of it is just way too painful to deal with).

Anyway, one of my issues with neo4j was the speed at which it allowed me to insert nodes and create connections.  Given that I didn't really need something that fancy, and given that I had a mad idea of merging neural nets and graph databases, I got rid of neo4j and wrote my own in-memory graph database in Perl...  Yeah, I know.  Perl.  Look, if you are really comfortable with a language and can write code quickly in it, you will likely use that language for proofs of concept.  Further, Perl has some nice features that allow rapid development of this sort of thing.  Should I ultimately move this to C or C++?  Yes.  But for now, I just want to get it working.

So, graph databases and neural nets.  Why the hell would I want to merge those two things?  Well, to understand that, you probably need a basic understanding of neural nets.  I am going to very briefly describe them and you can do more research if you are so inclined.

Neural nets are one of computer science's attempts to model the brain with code.  There are three main components of a neural net: inputs, neurons, and outputs.  Inputs can be things like data from files or databases, or they can be the outputs of other neurons.  So, basically, you can have networks of neurons taking input from a variety of sources, including each other.  Yay, that's great!  So, uh, why is that interesting?  Well, each neuron uses an algorithm to react to the data it is fed, and that algorithm produces the output that goes wherever it goes.  Now, if you aren't seeing the beginning of a connection between graph databases and neural nets, then start seeing.  If a graph database node were the equivalent of a neuron in a neural network, with the added benefit of being able to store data, data that could change, data that could impact the functioning of the algorithm and possibly even alter the topology of the network as needed, then you might have a powerful tool for analyzing data and perhaps even for creating an "intelligent" system.  I see this configuration as a neuron with a memory.
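Here is that idea in miniature: a toy node that carries both state and an algorithm.  This is a sketch of the concept, not my actual module:

    use strict;
    use warnings;

    # A "neuron with a memory": the node stores data that persists across
    # firings, and an algorithm that reads and updates that data.
    my $neuron = {
        data => { threshold => 2, fired_count => 0 },
        algo => sub {
            my ($self, @inputs) = @_;
            my $sum = 0;
            $sum += $_ for @inputs;
            $self->{data}{fired_count}++;          # the memory part
            return $sum >= $self->{data}{threshold} ? $sum : 0;
        },
    };

    my $out = $neuron->{algo}->($neuron, 1, 1, 1); # 3 >= 2, so it fires
    print "output: $out after $neuron->{data}{fired_count} firing(s)\n";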

On the one hand, you have the database aspect, so you can query data and see relationships, just like in a normal graph database.  On the other, you have the neural net aspect, which gives the database the ability to react to the data and make decisions about the structure of the entire network.  So, your database is kind of self-aware.

Ok.  So, all that said, I first created a basic in-memory graph database (as opposed to one that stores data on disk) that allows the creation of nodes and connections and lets the user set properties on those nodes and connections at creation time or any time thereafter.  It also has a basic query mechanism for finding individual nodes or finding connected nodes.  At present, I can insert 9 million nodes and create 90 million connections in about a minute, which is okay, but not exactly stellar.  With the added burden of disk IO it would slow down dramatically, but my computer has 64GB of RAM, so I'm not going to deal with disk crap at this point.
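To show the shape of it, here is a toy version with the same moves: create nodes and connections, set properties at creation time or any time thereafter, and run a basic connected-nodes query.  The function names are placeholders, not my module's real interface:

    use strict;
    use warnings;

    my (%nodes, %edges);   # %edges: from id => { to id => connection properties }

    sub create_node       { my ($id, %props) = @_; $nodes{$id} = \%props }
    sub create_connection { my ($from, $to, %props) = @_; $edges{$from}{$to} = \%props }
    sub set_node_property { my ($id, $key, $val) = @_; $nodes{$id}{$key} = $val }

    # Return the ids of nodes connected to $from whose properties match %match.
    sub connected_nodes {
        my ($from, %match) = @_;
        return grep {
            my $n = $nodes{$_};
            !grep { ($n->{$_} // '') ne $match{$_} } keys %match;
        } keys %{ $edges{$from} || {} };
    }

    create_node( 'mark', type => 'user' );
    create_node( 'kev',  type => 'book', title => 'Kev' );
    create_connection( 'mark', 'kev', role => 'author' );
    set_node_property( 'kev', genre => 'fiction' );        # any time thereafter

    print "$_\n" for connected_nodes( 'mark', type => 'book' );   # kev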

Now, neural nets come into the picture, but how?  Well, remember that a neuron has some sort of algorithm associated with it (possibly more than one, but we will get to that in a bit).  So, if a node is a neuron, I needed a way of associating the algorithm (basically a piece of code) with the node, in fact embedding it in the node.  Further, because I like systems that are as dynamic as possible, I want to be able to change the algorithms within nodes on the fly.  I had a hard time figuring out how to do this in Perl the way I wanted to, and the way I found is one I really don't like.  Honestly, what I really wanted was the ability to have the Perl code modify itself while running.  In fact, I wanted the Perl code to be able to generate code, but that is out of reach, I think.  I guess Lisp could handle this, but I just don't have the patience for Lisp (beautiful language, but a tough one for me).
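For the curious, the obvious Perl route is to store a code reference in the node and overwrite it at runtime.  I am not saying this is exactly what I settled on, but it shows the mechanics:

    use strict;
    use warnings;

    my $node = { data => { count => 0 }, algo => undef };

    # Embed an algorithm in the node.
    $node->{algo} = sub { my ($n, $x) = @_; $n->{data}{count}++; return $x * 2 };
    print $node->{algo}->($node, 5), "\n";   # 10

    # Swap the algorithm on the fly; the node's data survives the swap.
    $node->{algo} = sub { my ($n, $x) = @_; $n->{data}{count}++; return $x ** 2 };
    print $node->{algo}->($node, 5), "\n";   # 25

    # String eval gets part of the way toward code that writes code,
    # though it is easy to abuse.
    my $built = eval 'sub { my ($n, $x) = @_; return $x + 1 }';
    print $built->($node, 5), "\n";          # 6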

So, I now have nodes/neurons that have data and algorithms, and I need something to make the algorithms react to the data.  I am running single-threaded Perl (I refuse to use the pthreads Perl because I can't wrap my puny brain around it), so I technically have to have one neuron/node fire at a time and send its output to all of the other neurons/nodes it is connected to, or to output if it is that type of neuron/node.  Now, if you have 9 million nodes with 90 million connections, you can see that this is not going to be all that fast, which is why this needs to be written in C and run on a supercomputer.  But whatever; I'm not doing that, because I don't have a supercomputer.  So, here is the plan.  There are a variety of node types, and let's say I have nodes that are essentially trigger nodes or input nodes whose values trigger execution across the network.  I activate those nodes in some order, and execution propagates through the system until the "run" finishes; then some other event starts the process up again, or maybe the system, once started, just keeps running...  I'm not done yet and I don't really have this part sorted out, but I think I am on the right track.
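Here is the skeleton of that plan: a single-threaded run where trigger nodes seed a queue and output propagates along the connections.  The names and topology are illustrative:

    use strict;
    use warnings;

    my %node = (
        in1  => { algo => sub { 1 }, out => ['mid'] },   # trigger/input nodes
        in2  => { algo => sub { 2 }, out => ['mid'] },
        mid  => { algo => sub { my $s = 0; $s += $_ for @_; $s }, out => ['sink'] },
        sink => { algo => sub { print "result: @_\n"; () }, out => [] },
    );

    my %inbox;                        # node name => pending inputs
    my @queue = ('in1', 'in2');       # activating the trigger nodes starts the run
    my %fired;
    while ( my $name = shift @queue ) {
        next if $fired{$name}++;      # one firing per node per run
        my @out = $node{$name}{algo}->( @{ $inbox{$name} || [] } );
        for my $target ( @{ $node{$name}{out} } ) {
            push @{ $inbox{$target} }, @out;
            push @queue, $target;
        }
    }
    # Caveat: in a general graph a node can fire before all of its inputs
    # arrive -- exactly the ordering question I haven't sorted out yet.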

My primary test case is an NLP one, wherein I analyze books that are brought into the database as connected ngrams and so forth.  Somehow, I want to get this sucker to generate language.  So, there is quite a way to go, although much of the coding for the backend is done.  I'll write more as I get further along.