The CouchDB indexer – lightweight search engine in hours

Have you ever been in a situation where you needed to create a reverse lookup index of some documents you had lying around?

A reverse lookup index is the kind of index used by the search engines (or Googles if you like) of this world. Creating a reverse lookup index isn’t hard, but you would normally expect to spend a couple of weeks writing code to create one when needed unless you decide to use an existing indexer like lucene. Using lucene is a bit heavy if the indexing is not core to your application. If you store your documents in CouchDB, you can create an indexer writing just 4 lines of JavaScript. You should have at least one more line for safeguarding your input values, but a search engine indexer in 5 lines of JavaScript is IMHO still pretty good.

The prerequisite for this indexer is that the documents you want to index all have a vector field containing a document vector in the form of a hash mapping term to term weight {<term0> => <tweight0>, <tterm1> => <tweight1>, …,<ttermn> => <tweightn>}. The weight could be just the number of times a term occurs in the document or a some metric indicating how important a word is to a document relative to all possible documents. The traditional weight used in search engines is tf-idf (term frequency multiplied by inverse document frequency). Head over to Wikipedia if you want to learn more about tf-idf. Of course, if you just want to find all documents matching a query, you can ignore the weights completely.

If you have document vectors, you have all the input data you need to create a reverse lookup index. This is the mapper if you just want to get all documents matching a query. Note that it completely ignores the term weights:


function(doc) {
    if (!('vector' in doc)) return;
    var vector = doc.vector;
    for (var term in vector) {
        emit(term, doc._id);
    }
}

The function operates on each document in the database. The first statement is a simple safeguard ensuring that we don’t try to access vector if the document doesn’t have that property. Since CouchDB is schema free, different types of documents with different fields may be stored in the same database. If the vector property is there, we store it in a variable called vector. For each element in vector, we emit the key which is the term and the id of the document we are operating on. If you run just this mapper, you will get a list of term to single document id mappings. This is a major step forward since we now have a reverse mapping of the database.

To get the reverse database map into something that is quick and easy to lookup, we need a reducer. It’s purpose is to convert the list of term, document id pairs into a single term to document id list.


function(keys, values) {
    var docs = [];
    for (var i = 0; i < values.length; ++i) {
        docs.push(values[i]);
    }
    return docs;
}

In this situation, we don’t care about the keys. CouchDB handles that for us. We need an array to store all the document ids. Then we iterate over all the values and push them into our array. Finally we return the array. CouchDB ensures that this is only called once for a single term ensuring we end up with a single document id list for each term.

This index can be used to find all documents matching a given set of terms. Note that there is not much sophistication in this method so the only rank score you can get is the number of matching terms. Adding term weights to the index will give you something to use for ranking. Change the emit line in the mapper to

emit(term, [doc._id, vector[term]]);

This will give you a list of document id, term weight pairs for each term instead of just the document ids.

That mapper is 7 lines of code, 4 lines if you don’t count the function declaration line and lines only containing curly braces. Ignoring the safe guard as well, the mapper body can be reduced to this single line by also skipping the temporary variable assignment:

for (var term in doc.vector) emit(term, [doc._id, doc.vector[term]]);

In the same manner, the reduce function may be reduced to a body of just 3 lines of code

var docs = [];
for (var i = 0; i < values.length; ++i) docs.push(values[i]);
return docs;

That’s the power and beauty of CouchDB map reduce. You can write a search engine indexer in 4 lines of JavaScript. Granted, you need to create vectors of your documents in advance, but that’s just a matter of parsing a text string and splitting on whitespace and/or punctuation into an array and reducing the term array into a hash of <term> => <weight> pairs. Sure, you can do this with a map reduce as well, but that might be overkill since you will only be operating on a single document at a time.

One last important point is that while Futon, the CouchDB browser client will show the expected results, you have to explicitly tell CouchDB to group the result if you want to use this in your application. My database is called pages, the design is called demos and the view index making the output of the map reduce available as json at
http://localhost:5984/pages/_design/demos/_view/index?group=true
Thanks to J. Chris Anderson for clarifying and pointing out the grouping query usage.

About these ads

9 thoughts on “The CouchDB indexer – lightweight search engine in hours

  1. Did you test this approach on a non-trivial database? I tried following your example on a database with ~3000 documents and get reduce_over_flow errors.

  2. knuthellan says:

    I only ran this with about 100 documents for a demo application showing off something else, but I will definitely run a test as well. If couchdb map reduce in general breaks down at 3000 documents that means it needs fixing. Could of course be a combined factor of document length and number of documents, but 3000 documents is still surprisingly low and should probably be much higher.

  3. knuthellan says:

    I’ve tried to torture CouchDB with this particular map reduce the last couple of days and it seems like it’s too much for it at least in the standard configuration.

    I used pages from the Norwegian wikipedia and while I’ve never seen the reduce_over_flow error, it quickly becomes dead slow. I started with 10000 articles and the map reduce seemed to get almost stuck at a little above 300 tasks (out of 10000). I stopped it and tested with fewer documents, but I didn’t find the pain point, but it’s somewhere between 100 and 10000, based on the point it got stuck earlier I guess it’s around 300.

    It might be that the reduce_over_flow you see is a hint that you need to handle rereduce.

    Change the reduce function to

    function(keys, values, rereduce) {
    var docs = [];
    if (rereduce) {
    for (var i = 0; i < values.length; ++i) {
    for (var j = 0; j < values[i].length; ++j) docs.push(values[i][j]);
    }
    return docs;
    }
    for (var i = 0; i < values.length; ++i) docs.push(values[i]);
    return docs;
    }

    and handle rereduce.

  4. knuthellan says:

    Just have to say that as of CouchDB 0.10, this approach will probably not work. That being said, it’s still pretty easy to do an indexer in CouchDB, but you either have to do more work in the mapper or summarize outside of CouchDB. While this takes some elegance out of the solution, I believe it will be more viable for larger data sets.

  5. Pingback: Quora
  6. knuthellan says:

    In writing that Quora post, I revisited this blog post and realized that it is indeed outdated. My presentation at JavaZone 2010, http://bit.ly/cO1AE5, shows how to do this with just a summing reducer and I admit I should probably write a new blog post about that.

  7. knuthellan says:

    In writing that Quora post, I revisited this blog post and realized that it is indeed outdated. My presentation at JavaZone 2010, http://bit.ly/cO1AE5, shows how to do this with just a summing reducer and I admit I should probably write a new blog post about that.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s