If there is one thing that has bothered me about my choice of CouchDB as the main storage system for Sincerial, it’s the lack of an automatic system for shard management. In the early days of a startup, a single server is probably capable of handling all the necessary data. However, a successful service that is built around the harvesting and analysis of data will sooner or later have to shard the dataset across multiple servers. And for a new service, the sooner you have to start sharding, the better. Several good distributed storage systems exist. Google’s original bigtable, HBase (Hadoop’s bigtable equivalent), Cassandra and more solve this particular problem, but there is more to choosing and running a storage system than just data volume scaling. CouchDB has other strengths that made it a good choice for our application, but that is not the topic of this post.
CouchDB-lounge originally written by Kevin Ferguson (macfergus), Vijay Ragunathan (lukatmyshu) and Shaun Lindsay (srlindsay) at Meebo.com will handle distribution of requests to the right servers in a cluster of servers, but you still have to handle resharding or repartitioning manually. CouchDB-lounge consists of two components, dumbproxy handling reading and writing of documents and smartproxy handling views. These require Nginx and Twisted (a Python framework) respectively. If you overshard appropriately, you can scale your data volume a long way before you have to start resharding.
While manually repartitioning a CouchDB database is doable, I’d rather have an automatic way of doing it since I don’t want to make mistakes. In addition, the Sincerial system uses Ruby running with Phusion Passenger in Apache and I didn’t want to add two more frameworks on top of that. This might sound like a not-invented here excuse, but it isn’t or at least I don’t think it is.
When I started developing Pillow, I chose to do so in erlang to match couchdb. The reason was two-fold. First of all I was curious about erlang and I like functional programming. Secondly, CouchDB was written in erlang and there had to be a reason for that. Now I’ve released version 0.3 of Pillow. This version supports automatic resharding, routing of requests to the right shard and views. Reducers need to be written in erlang, but a summing reducer is in place and mappers without reducers are supported out of the box. As such, this version of Pillow has all the functionality I set out to develop, but it does not support the full CouchDB API.
The bulk document API is not supported. And I haven’t tried running standalone CouchApps. The reason being that I am focusing on our needs. I intend to have full support of the CouchDB API eventually, but it might be that I integrate more tightly with CouchDB and use CouchDB as a library to make this happen. This will make supporting javascript reducers easier as well. It would of course be interesting to have Pillow become an integral part of CouchDB as well providing one can still access each CouchDB server directly for maintenance purposes. The latter being one of the reasons I’m in general a bit sceptical to distributed systems that hide the inner workings since often hard to fix problems that may occur.