With my basic Sparse Distributed Memory (SDM) implementation finally done, in which the Up/Down Counters won’t overflow, I’m ready to embark on something bigger. At the moment, my implementation stores everything in random-access memory (RAM), which is very limiting. I have used SDM with my reinforcement learning (RL) work, solving entailments in a set of 11,000 metrics, which occupies all 16 GB of my RAM. This is not ideal since, even with SDM, there are many collisions, granted they are fewer than the random hashing collisions produced by Tile Coding. Also, I can only do machine learning on a smaller, sparse set of data, due to space. To amend all of this, the next step is clear: use a database system.
SDM Write/Read Operations
Since the “bottleneck” stems from the space requirements of the Up/Down Counters, it makes sense to store all of that in the database. For a quick review of what the Up/Down Counters look like, see the figure below.
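As a refresher in code form as well (the names here are my own placeholders, not the actual implementation), a single counter row boils down to one small signed counter per bit of the stored word, with writes saturating so the counters never overflow:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of one Up/Down Counter row: one signed 8-bit
// counter per bit of the word stored at this hard location.
struct CounterRow {
    std::vector<int8_t> counters;  // DataSize counters, all starting at 0

    explicit CounterRow(std::size_t data_size) : counters(data_size, 0) {}

    // Writing a word: +1 for a 1 bit, -1 for a 0 bit, saturating so the
    // counters never overflow (the property mentioned in the intro).
    void write(const std::vector<bool>& word) {
        for (std::size_t i = 0; i < counters.size(); ++i) {
            int next = counters[i] + (word[i] ? 1 : -1);
            if (next > INT8_MAX) next = INT8_MAX;
            if (next < INT8_MIN) next = INT8_MIN;
            counters[i] = static_cast<int8_t>(next);
        }
    }
};
```

Reading goes the other way: sum the counters of every activated row and threshold each sum at zero to recover a bit.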
What I want is to store each row in the database.
Considering that an Up/Down Counter row represents a feature vector, which is probably not that big, this might be sufficient. The problem occurs if the feature vector is actually huge; then we need a smarter way of segmenting each row into cells.
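As a rough sketch of what I mean by segmenting (the function name and the chunk-size parameter are placeholders of mine), splitting a serialized row into fixed-size cells could look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split one serialized counter row into fixed-size
// cells so no single DB value grows too large. Assumes chunk_size > 0.
std::vector<std::vector<std::uint8_t>> split_into_cells(
        const std::vector<std::uint8_t>& row_bytes, std::size_t chunk_size) {
    std::vector<std::vector<std::uint8_t>> cells;
    for (std::size_t offset = 0; offset < row_bytes.size(); offset += chunk_size) {
        const std::size_t len = std::min(chunk_size, row_bytes.size() - offset);
        cells.emplace_back(row_bytes.begin() + offset,
                           row_bytes.begin() + offset + len);
    }
    return cells;
}
```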
Now we are more robust, allowing us to store bigger data. One thing to note is that storing each row whole will be faster, due to less “context switching” when moving to another DB entry; the cell solution can match that by setting the chunk size equal to the size of the row’s data. Our final schema is represented by the image below, spelled out in the list that follows, with a code sketch after the list.
- Each UpDownCounter instance has a DB entry.
  - It contains the size of each row, DataSize.
- Each UpDownCounter instance has many associated Cell entries.
  - Each Cell contains its ChunkSize. Although this could be saved in UpDownCounter, imagine having some cells bigger than others; keeping it per cell allows for that possibility, although I don’t intend to use it (I don’t mind the redundancy that much).
  - It also contains the data itself, BinData.
  - The row and col fields store where this cell is located.
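A rough C++ mirror of that schema (the types are my own guesses; the actual column types will depend on the database chosen below) might look like:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical in-memory mirror of the two DB entities described above;
// field names follow the schema (DataSize, ChunkSize, BinData, row, col).
struct CellEntry {
    std::uint32_t row;                   // which counter row this cell belongs to
    std::uint32_t col;                   // position of the chunk within that row
    std::uint32_t chunk_size;            // ChunkSize: size of this cell's data
    std::vector<std::uint8_t> bin_data;  // BinData: the serialized counter bytes
};

struct UpDownCounterEntry {
    std::uint32_t data_size;             // DataSize: length of each full row
    std::vector<CellEntry> cells;        // the many associated Cell entries
};
```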
Choosing A Database
Since I don’t need strong consistency in CAP-theorem terms, that is, it should be possible for SDM to increment the same counter at the same time from different writers, and I want horizontal scalability, I must choose one of the NoSQL databases. I’ve looked at MongoDB and its C++ client (note: my SDM implementation is in C++) and found it difficult to build and install. I’ve also looked at Cassandra, and it seems like a better fit for my problem: it was easy to set up, way faster, and way more scalable.
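To give a taste of what that looks like from C++, here is a minimal, hypothetical sketch of writing one cell with the DataStax C++ driver for Cassandra (the keyspace, table, and column names are just my schema guesses from above, and error handling is mostly omitted):

```cpp
#include <cstddef>
#include <cassandra.h>  // DataStax C++ driver for Apache Cassandra

// Hypothetical sketch: write one cell's chunk of counter data.
// Assumes a table such as:
//   CREATE TABLE sdm.cells (counter_id int, row int, col int,
//                           chunk_size int, bin_data blob,
//                           PRIMARY KEY (counter_id, row, col));
void write_cell(CassSession* session, cass_int32_t counter_id,
                cass_int32_t row, cass_int32_t col,
                const cass_byte_t* data, std::size_t size) {
    CassStatement* statement = cass_statement_new(
        "INSERT INTO sdm.cells (counter_id, row, col, chunk_size, bin_data) "
        "VALUES (?, ?, ?, ?, ?)", 5);
    cass_statement_bind_int32(statement, 0, counter_id);
    cass_statement_bind_int32(statement, 1, row);
    cass_statement_bind_int32(statement, 2, col);
    cass_statement_bind_int32(statement, 3, static_cast<cass_int32_t>(size));
    cass_statement_bind_bytes(statement, 4, data, size);

    CassFuture* future = cass_session_execute(session, statement);
    cass_future_wait(future);
    // A real implementation would check cass_future_error_code(future) here.
    cass_future_free(future);
    cass_statement_free(statement);
}
```

Keying the cells on the counter instance plus row and col would let them spread across nodes, which is exactly the horizontal scaling I am after.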
Conclusion
That’s it! We are done. See you in part 2 when I have a basic working implementation of this.