Tuesday, 13 June 2017

Graph Databases and Neural Nets

So, for me, a very lazy person, getting a computer program to write books for me is the holy grail.  So, I spend a good amount of time writing programs that will write books.  So far, my work in this area has not produced anything of substance.

Recently, I started playing with graph databases.  In particular, I chose Neo4j for my experiments.  Now, for those of you who don't know anything about databases, this will make little sense.  If we look at two different types of databases, relational and graph, we see big differences.  Relational databases organize data in tables made up of rows and columns.  Graph databases store data in nodes that can be connected.  The nodes hold data, and the connections can also store information, such as the type of connection.  So, an example.

In a relational database, I might have a "users" table that contains data on the users for some system.  That might have the columns: username, first name, last name, password.  So, for each user in the table, there is a row.  Now I might also have a table in there called "books" which stores books associated with users (fair warning: this is not what I would call good database design).  So, the columns in "books" might be: username, title, pages, genre.  In this case, username in "books" refers to a username in "users."  Thus rows in "books" are "related" to rows in "users" through that shared column.
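As a rough sketch of the two tables described above, here is how they might look using Python's built-in sqlite3 module.  The table and column names come from the example; the sample row values (the names, page count, and genre) are made up purely for illustration.

```python
import sqlite3

# In-memory database; table and column names follow the example in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (username TEXT PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, password TEXT)"
)
conn.execute(
    "CREATE TABLE books (username TEXT REFERENCES users(username), "
    "title TEXT, pages INTEGER, genre TEXT)"
)

# Sample rows are invented for illustration only.
conn.execute("INSERT INTO users VALUES ('mark', 'Mark', 'Example', 'not-a-real-password')")
conn.execute("INSERT INTO books VALUES ('mark', 'Kev', 300, 'fiction')")

# Join the two tables on the shared username column.
rows = conn.execute(
    "SELECT u.first_name, b.title FROM users u "
    "JOIN books b ON b.username = u.username"
).fetchall()
print(rows)
```

The join is where the link between the two tables actually gets used: the book row only knows a username, and the query follows that back to the user row.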

In a graph database, I might have a type of node I call a user node and I might put all of the characteristics of my users in nodes of this type.  Then I might have nodes of type book that have the characteristics of books in them, but NOT a reference to users within the book nodes.  Then, I can connect user nodes to book nodes and assign values to the connection.  For instance, I might connect user "mark" to book "Kev" and set one of the properties of the connection to be "author."  I might have another user "sheila" connected to "Kev" with property "critic."
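The same example can be sketched as a tiny graph using plain Python dictionaries and tuples.  This is only an illustration of the idea, not how Neo4j stores things; the property name "role" and the helper people_for are assumptions made for the sketch.

```python
# Nodes keyed by an id; each node carries a type and its own properties.
# Note that the book node holds no reference to any user.
nodes = {
    "mark": {"type": "user"},
    "sheila": {"type": "user"},
    "Kev": {"type": "book"},
}

# Connections are (from, to, properties); the role lives on the connection.
edges = [
    ("mark", "Kev", {"role": "author"}),
    ("sheila", "Kev", {"role": "critic"}),
]

def people_for(book, role):
    """Return the users connected to `book` with the given role."""
    return [src for src, dst, props in edges
            if dst == book and props["role"] == role]

print(people_for("Kev", "author"))
print(people_for("Kev", "critic"))
```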

Very exciting stuff, and if you are a geek like me, you will probably immediately see how graph databases could be useful for a variety of things (but not all things...trust me.  I've tried a bunch of stuff and some of it is just way too painful to deal with.)

Anyway, one of my issues with Neo4j was the speed at which it allowed me to insert nodes and create connections.  Given that I didn't really need something that fancy and also given that I had a mad idea of merging neural nets and graph databases, I got rid of Neo4j and wrote my own in-memory graph database using Perl...  Yeah, I know.  Perl.  Look, if you are really comfortable with a language and can write code quickly with it, then you will likely use that language for proofs of concept.  Further, Perl has some nice features that allow rapid development of this sort of thing.  Should I ultimately move this to C or C++?  Yes.  But for now, I just want to get it working.

So, graph databases and neural nets.  Why the hell would I want to merge those two things?  Well, to understand that, you probably need a basic understanding of neural nets.  I am going to very briefly describe them and you can do more research if you are so inclined.

Neural nets are computer science's attempt to model the brain with code, or are one attempt at that.  There are three main components of a neural net: inputs, neurons, and outputs.  Now, inputs can be things like data from files or databases or whatnot, or can also be outputs from neurons.  So, basically, you can have networks of neurons taking input from a variety of sources, including each other.  Yay, that's great!  So, uh, why is that interesting?  Well, the neurons use algorithms to basically react to the data they are fed.  These algorithms create the output that goes wherever it goes.  Now, if you aren't seeing the beginning of a connection between graph databases and neural nets, then start seeing.  Basically, if a graph database node were the equivalent of a neuron in a neural network, with the added benefit of being able to store data, data that could change, data that could impact the functioning of the algorithm and possibly even alter the topology of the network as needed, then you might have a powerful tool for analyzing data and perhaps even creating an "intelligent" system.  I see this configuration as a neuron with a memory.

On the one hand, you have the database aspect, so you can query data and see relationships etc. just like in a normal graph database.  On the other, you have the neural net aspect that gives the database the ability to react to the data and make decisions on the structure of the entire network.  So, your database is kind of self-aware.

Ok.  So, all that said, I first created a basic graph database (in-memory as opposed to storing data on disk) that allows creation of nodes and connections and lets the user set properties for these nodes and connections at the time of creation or any time thereafter.  It also has a basic query mechanism for finding individual nodes or finding connected nodes.  At present, I can insert 9 million nodes and create 90 million connections in about a minute, which is okay, but not exactly stellar.  If I had the added burden of disk IO it would slow down dramatically.  But, my computer has 64GB of RAM, so I'm not going to deal with disk crap at this point.
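To give a feel for what such an in-memory graph database involves, here is a minimal sketch in Python.  It is not the Perl code described here, and the class and method names (Graph, add_node, connect, find, neighbors) are assumptions made up for the illustration.

```python
from collections import defaultdict

class Graph:
    """A toy in-memory graph store: nodes and connections both carry properties."""

    def __init__(self):
        self.nodes = {}                # node id -> property dict
        self.out = defaultdict(dict)   # node id -> {neighbor id -> property dict}

    def add_node(self, node_id, **props):
        self.nodes[node_id] = dict(props)

    def connect(self, src, dst, **props):
        self.out[src][dst] = dict(props)

    def find(self, **criteria):
        """Ids of nodes whose properties match every key/value in criteria."""
        return [n for n, p in self.nodes.items()
                if all(p.get(k) == v for k, v in criteria.items())]

    def neighbors(self, node_id):
        """Ids of nodes that node_id connects to."""
        return list(self.out.get(node_id, {}))

g = Graph()
g.add_node("mark", type="user")
g.add_node("Kev", type="book")
g.connect("mark", "Kev", role="author")

# Properties can be set at creation time or changed afterwards.
g.nodes["Kev"]["genre"] = "fiction"
g.out["mark"]["Kev"]["role"] = "co-author"

print(g.find(type="book"))
print(g.neighbors("mark"))
```

Keeping everything in dictionaries is what makes inserts cheap: adding a node or a connection is a single hash insert with no disk involved.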

Now, neural nets come into the picture, but how?  Well, remember that a neuron has some sort of algorithm associated with it (possibly more than one, but we will get to that in a bit).  So, if a node is a neuron, I needed a way of associating the algorithm (basically a piece of code) with the node, in fact embedding it in the node.  Further, because I like systems that are as dynamic as possible, I want to be able to change algorithms within nodes on the fly.  So, I had a problem figuring out how I was going to do this with Perl the way I wanted to, but I found a way that I really don't like.  Honestly, what I really wanted was the ability to have the Perl code modify itself while running.  In fact, I wanted the Perl code to be able to generate code, but that is out of reach I think.  I guess Lisp could handle this, but I just don't have the patience for Lisp (beautiful language, but a tough one for me).
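One possible way to picture a node that carries its own algorithm, sketched in Python: store a function in the node and allow it to be replaced at any time.  This is an assumption about how it could be done, not the approach used in the Perl code (which isn't shown here); the Neuron class and the two example algorithms are hypothetical.

```python
class Neuron:
    """A node that holds data (its "memory") plus a replaceable algorithm."""

    def __init__(self, algorithm, **data):
        self.data = dict(data)       # the node's stored properties
        self.algorithm = algorithm   # any callable taking (neuron, inputs)

    def fire(self, inputs):
        return self.algorithm(self, inputs)

def total(neuron, inputs):
    # Simplest possible algorithm: add up the inputs.
    return sum(inputs)

def threshold(neuron, inputs):
    # Reads a value from the node's own data, so changing the data
    # changes the behavior without touching the code.
    return 1 if sum(inputs) >= neuron.data["limit"] else 0

n = Neuron(total, limit=5)
before = n.fire([2, 4])      # uses total

n.algorithm = threshold      # swap the algorithm on the fly
after = n.fire([2, 4])       # uses threshold with limit=5

n.data["limit"] = 10         # change the node's memory
later = n.fire([2, 4])       # same algorithm, different result

print(before, after, later)
```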

So, I have nodes/neurons that have data and have algorithms now.  I need something to make the algorithms react to the data.  Now I am running single-threaded Perl (I refuse to use threaded Perl because I can't wrap my puny brain around it), so I technically have to have one neuron/node fire at a time and have its output then go to all of the other neurons/nodes it is connected to, or to output if it is that type of neuron/node.  Now if you have 9 million nodes with 90 million connections, you can see that this is not going to be all that fast, which is why this needs to be written in C and run on a supercomputer, but whatever, I'm not doing that because I don't have a supercomputer.  So, here is the plan.  There are a variety of node types, and let's say I have nodes that are essentially trigger nodes or input nodes whose values trigger execution across the network.  So, I activate those nodes in some order and that propagates through the system until the "run" finishes and then some other event starts the process up again, or maybe the system, once started, just keeps running...  Not done yet and I don't really have this part sorted out, but I think I am on the right track.
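Since this part isn't sorted out yet, the following is only one guess at what a single-threaded "run" could look like, sketched in Python.  It assumes each node fires at most once per run, in the order it was first reached, and it uses a plain sum as a stand-in for whatever algorithm a node really holds.  The function name run and the node names are made up.

```python
from collections import deque

def run(connections, triggers):
    """Fire nodes one at a time, starting from the trigger nodes.

    connections: node -> list of downstream nodes
    triggers:    node -> starting value
    """
    inbox = {node: [value] for node, value in triggers.items()}
    queue = deque(triggers)      # nodes waiting to fire, in activation order
    queued = set(triggers)       # assumption: each node fires once per run
    outputs = {}
    while queue:
        node = queue.popleft()
        value = sum(inbox.get(node, []))   # placeholder algorithm: sum inputs
        outputs[node] = value
        for target in connections.get(node, []):
            inbox.setdefault(target, []).append(value)
            if target not in queued:
                queued.add(target)
                queue.append(target)
    return outputs

connections = {"a": ["c"], "b": ["c"], "c": ["out"]}
result = run(connections, {"a": 1, "b": 2})
print(result["out"])
```

The run ends on its own when the queue empties.  A node that has already been queued is never queued again, so values sent back to it around a loop are ignored in this sketch; a system that "just keeps running" would need a different rule there.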

My primary test case is an NLP test case wherein I analyze books that are brought into the database as connected ngrams and so forth.  Somehow, I want to get this sucker to generate language.  So, quite a way to go, although much of the coding for the backend is done.  I'll write more as I get further along.
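For the curious, here is a minimal Python sketch of what "connected ngrams" could mean in the simplest case: each word is a node, and a connection from one word to the next carries a count of how often that pair occurs.  The details (word-level bigrams, lowercasing, whitespace splitting) are assumptions for the illustration, not a description of the actual test case.

```python
from collections import defaultdict

def bigram_graph(text):
    """Map each word to the words that follow it, with occurrence counts."""
    words = text.lower().split()
    graph = defaultdict(lambda: defaultdict(int))
    for first, second in zip(words, words[1:]):
        graph[first][second] += 1
    return graph

g = bigram_graph("the cat sat on the mat and the cat slept")
print(dict(g["the"]))
```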