Case statement pitfall when migrating to Ruby 1.9

I have been using Rubinius 2.0 to run machine learning experiments with libsvm lately. When running in Ruby 1.9.2, I noticed that my classifier classified all samples as negative. I thought this was caused by issues with libsvm-ruby-swig, so I recompiled libsvm-ruby-swig from scratch, including rerunning swig, but nothing changed. Next, I switched to libsvmffi, but the result was the same. Realizing that I actually had some tests running a very simple classifier, and that these tests passed on 1.9.2, made me look closer at the code. What I found was that the behavior of the Ruby case statement has changed from 1.8.7 to 1.9.2.

For if statements, 1 is equal to 1.0 in both 1.8.7 and 1.9, but while 1 matches 1.0 in 1.8.7 case statements, it does not in 1.9.2.

Code snippet that shows the difference:

#!/usr/bin/env ruby

puts case 1.0
when 1
  "yay"
else
  "nay"
end

First the output when running the script under 1.8.7:

$ rvm use ruby-1.8.7
Using /usr/local/rvm/gems/ruby-1.8.7-p334
$ ./case.rb
yay

And the same in 1.9.2:

$ rvm use ruby-1.9.2
Using /usr/local/rvm/gems/ruby-1.9.2-p180
$ ./case.rb
nay

Needless to say, I was puzzled by this result, but I was more surprised by the 1.8.7 behavior than 1.9.2. My assumption when I wrote the code was that I was dealing with integer values and since it worked, I forgot about it. Next time you see different behavior between 1.8.7 and 1.9.2 it might be worth reviewing case statements.
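
A minimal sketch of one way to sidestep the difference, assuming the values really are whole-number class labels such as 1.0 and -1.0: convert to an integer before matching, so the case statement only ever compares integers with integers. The helper name label_name is made up for this example and is not part of my classifier code.

```ruby
# Hypothetical helper, not from the original code: map a numeric class
# label to a name. Assumes labels are whole numbers such as 1.0 or -1.0.
def label_name(label)
  case label.to_i # normalise 1.0 to 1 before matching
  when 1  then :positive
  when -1 then :negative
  else :unknown
  end
end

puts label_name(1.0)  # prints "positive"
puts label_name(-1.0) # prints "negative"
puts label_name(1)    # prints "positive"
```

If the labels can be fractional, to_i silently truncates them, so matching on float literals or ranges would be the safer choice in that case.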


CouchDB and the web

This is my presentation from JavaZone 2010

Note that during my presentation, I showed the view section and basic replication directly in Futon instead of showing the fallback in the slides. What I did show was mostly the same, but naturally I showed some variations on the mappers as well.


CouchDB on Amazon EC2 CentOS server with Sprinkle

Read the Getting Started part of Till Klampäckel’s CouchDB on Ubuntu on AWS blog post for some general information. I see no reason to repeat those things here.

Till stresses the need for a security group opening port 80, but you should also enable SSH on port 22, otherwise it will be impossible to install anything. The AMI I use is RightScale’s CentOS 5.2 i386 v4.2.4. If you need a 64-bit image, that should work just as well.

Make sure you have Sprinkle installed on the system you are installing from. If not, gem install sprinkle. Sprinkle is written in Ruby, so if you don’t have Ruby, you should start by installing that. Put this in your Sprinkle file and name it something reasonable. I called it couchdb.rb.


# Sprinkle provisioning and deployment for CouchDB on
# an Amazon EC2 CentOS server

package :spidermonkey do
  source 'http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz' do
    custom_dir 'js/src'
    custom_install "make BUILD_OPT=1 -f Makefile.ref && cp *.{h,tbl} /usr/include/ && cd Linux_All_OPT.OBJ && cp *.h /usr/include/ && mkdir -p /usr/{bin,lib}/ && cp js /usr/bin/ && cp libjs.so /usr/lib/"
  end

  verify do
    has_executable 'js'
    has_file '/usr/include/jsapi.h'
    has_file '/usr/lib/libjs.so'
  end
end

package :erlang_dependencies do
  yum %w( ncurses-devel openssl-devel)
end

package :erlang do
  description 'Erlang, the programming language'
  source 'http://erlang.org/download/otp_src_R13B01.tar.gz'

  verify do
    has_executable '/usr/local/bin/erl'
  end

  requires :erlang_dependencies
end

package :couchdb_dependencies do
  yum %w( curl curl-devel icu libicu-devel )
end

# - CouchDB 0.9.1
package :couchdb, :provides => :database do
  description 'CouchDB'
  version '0.9.1'
  source 'http://mirrorservice.nomedia.no/apache.org/couchdb/0.9.1/apache-couchdb-0.9.1.tar.gz' do
    post :install, 'adduser -r -d /usr/local/var/lib/couchdb -M -s /bin/bash -c "CouchDB Administrator" couchdb'
    post :install, 'touch /usr/local/var/log/couchdb/couch.log'
    post :install, 'chown couchdb /usr/local/var/log/couchdb/couch.log'
    post :install, 'mkdir -p /usr/local/var/lib/couchdb'
    post :install, 'chown couchdb /usr/local/var/lib/couchdb'
    post :install, '/usr/local/etc/rc.d/couchdb start'
    post :install, 'ln -s /usr/local/etc/rc.d/couchdb /etc/init.d/couchdb'
    post :install, 'chkconfig --add couchdb'
  end

  verify do
    has_executable '/usr/local/bin/couchdb'
  end

  requires :couchdb_dependencies
  requires :erlang
  requires :spidermonkey
end

package :rubygems do
  description 'Ruby Gems Package Management System'
  yum 'rubygems'
end

package :couchrest do
  description 'Rest API for CouchDB'
  version '0.33'
  gem 'couchrest'
end

policy :db, :roles => :db do
  requires :database
  requires :couchrest
end

# Deployment
deployment do
  delivery :capistrano do
    set :user, 'root'
    set :use_sudo, false
    set :run_method, :run

    role :db, 'ec2-x-y-z-w.eu-west-1.compute.amazonaws.com'
  end

  # source based package installer defaults
  source do
    prefix '/usr/local' # where all source packages will be configured to install
    archives '/usr/local/sources' # where all source packages will be downloaded to
    builds '/usr/local/build' # where all source packages will be built
  end
end

Replace ec2-x-y-z-w.eu-west-1.compute.amazonaws.com with the public DNS name listed on the Amazon Web Services instance view. You don’t need rubygems and couchrest unless you are going to use Ruby, but I decided to leave them in since CouchRest is a nice library to use when talking to CouchDB from Ruby.

Run it in a shell with sprinkle -s couchdb.rb. It might be interesting to check the powder cloud first, like this: sprinkle -cts couchdb.rb. The expected cloud looks like this:

--> Cloud hierarchy for policy db

Policy db requires package database
Selecting couchdb for virtual package database
  Package couchdb requires couchdb_dependencies
  Package couchdb requires erlang
    Package erlang requires erlang_dependencies
  Package couchdb requires spidermonkey

Policy db requires package couchrest
  Package couchrest requires rubygems

Set up an SSH tunnel to get to the remote Futon. I tend to use port 5994 locally to avoid conflicts with the local CouchDB: ssh -L 5994:localhost:5984 root@ec2-x-y-z-w.eu-west-1.compute.amazonaws.com where, once again, ec2-x-y-z-w.eu-west-1.compute.amazonaws.com should be replaced with the public DNS name listed on the Amazon Web Services instance view. Point your browser to http://localhost:5994/_utils/ for that familiar Futon view.


The CouchDB indexer – lightweight search engine in hours

Have you ever been in a situation where you needed to create a reverse lookup index of some documents you had lying around?

A reverse lookup index is the kind of index used by the search engines (or Googles if you like) of this world. Creating a reverse lookup index isn’t hard, but you would normally expect to spend a couple of weeks writing code to create one when needed, unless you decide to use an existing indexer like Lucene. Using Lucene is a bit heavy if the indexing is not core to your application. If you store your documents in CouchDB, you can create an indexer by writing just 4 lines of JavaScript. You should have at least one more line for safeguarding your input values, but a search engine indexer in 5 lines of JavaScript is IMHO still pretty good.

The prerequisite for this indexer is that the documents you want to index all have a vector field containing a document vector in the form of a hash mapping term to term weight {<term0> => <weight0>, <term1> => <weight1>, …, <termn> => <weightn>}. The weight could be just the number of times a term occurs in the document or some metric indicating how important a word is to a document relative to all possible documents. The traditional weight used in search engines is tf-idf (term frequency multiplied by inverse document frequency). Head over to Wikipedia if you want to learn more about tf-idf. Of course, if you just want to find all documents matching a query, you can ignore the weights completely.
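
As a rough sketch of how such weights could be computed, here is one common tf-idf variant in Ruby: term frequency normalised by document length, multiplied by the natural log of the number of documents over the document frequency. The method name tf_idf_vectors and the input layout are assumptions made for this illustration, and other variants scale or smooth the values differently.

```ruby
# Sketch: turn raw term counts into tf-idf weights.
# term_counts is a hypothetical input: { doc_id => { term => count } }.
def tf_idf_vectors(term_counts)
  n_docs = term_counts.size.to_f

  # Document frequency: in how many documents does each term occur?
  doc_freq = Hash.new(0)
  term_counts.each_value do |counts|
    counts.each_key { |term| doc_freq[term] += 1 }
  end

  term_counts.each_with_object({}) do |(doc_id, counts), vectors|
    total = counts.values.inject(0, :+).to_f
    vectors[doc_id] = counts.each_with_object({}) do |(term, count), vector|
      tf  = count / total
      idf = Math.log(n_docs / doc_freq[term])
      vector[term] = tf * idf
    end
  end
end

counts = {
  'a' => { 'ruby' => 1, 'couchdb' => 1 },
  'b' => { 'ruby' => 2 }
}
vectors = tf_idf_vectors(counts)
p vectors['a']['ruby'] # 0.0, since "ruby" occurs in every document
```

A term that occurs in every document gets weight 0 with this variant, which is the point of the idf factor: such a term says nothing about which document to prefer.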

If you have document vectors, you have all the input data you need to create a reverse lookup index. This is the mapper if you just want to get all documents matching a query. Note that it completely ignores the term weights:


function(doc) {
    if (!('vector' in doc)) return;
    var vector = doc.vector;
    for (var term in vector) {
        emit(term, doc._id);
    }
}

The function operates on each document in the database. The first statement is a simple safeguard ensuring that we don’t try to access vector if the document doesn’t have that property. Since CouchDB is schema free, different types of documents with different fields may be stored in the same database. If the vector property is there, we store it in a variable called vector. For each element in vector, we emit the key which is the term and the id of the document we are operating on. If you run just this mapper, you will get a list of term to single document id mappings. This is a major step forward since we now have a reverse mapping of the database.

To get the reverse database map into something that is quick and easy to look up, we need a reducer. Its purpose is to convert the list of term, document id pairs into a single term to document id list.


function(keys, values) {
    var docs = [];
    for (var i = 0; i < values.length; ++i) {
        docs.push(values[i]);
    }
    return docs;
}

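In this situation, we don’t care about the keys. CouchDB handles that for us. We need an array to store all the document ids. Then we iterate over all the values and push them into our array. Finally we return the array. When the view is queried with grouping, CouchDB reduces the values for each term separately, so we end up with a single document id list for each term. Be aware that for larger views CouchDB may call the reduce function again on the results of earlier reduce calls (a rereduce), in which case the values are themselves arrays; this simple reducer does not handle that case.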
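
To make the data flow concrete, here is a plain-Ruby imitation of what the mapper and the grouped reducer produce together. It does not talk to CouchDB at all; the method names map_doc and build_index and the sample documents are made up for this sketch.

```ruby
# Plain-Ruby imitation of the mapper: one [term, doc_id] pair per term.
def map_doc(doc)
  return [] unless doc.key?('vector')
  doc['vector'].keys.map { |term| [term, doc['_id']] }
end

# Imitation of the grouped reduce: collect the ids emitted for each term.
def build_index(docs)
  docs.each_with_object({}) do |doc, index|
    map_doc(doc).each do |term, doc_id|
      (index[term] ||= []) << doc_id
    end
  end
end

docs = [
  { '_id' => 'page1', 'vector' => { 'couchdb' => 2, 'ruby' => 1 } },
  { '_id' => 'page2', 'vector' => { 'ruby' => 3 } },
  { '_id' => 'note1' } # no vector, so the mapper skips it
]

index = build_index(docs)
p index['ruby']    # ["page1", "page2"]
p index['couchdb'] # ["page1"]
```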

This index can be used to find all documents matching a given set of terms. Note that there is not much sophistication in this method so the only rank score you can get is the number of matching terms. Adding term weights to the index will give you something to use for ranking. Change the emit line in the mapper to

emit(term, [doc._id, vector[term]]);

This will give you a list of document id, term weight pairs for each term instead of just the document ids.
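
A minimal sketch of how the weighted index could be used for ranking, assuming the view output has already been loaded into a Ruby hash of the form { term => [[doc_id, weight], ...] }. The search method and the sample data are hypothetical, and summing the weights is just one simple scoring choice.

```ruby
# index is a hypothetical in-memory copy of the grouped view output:
# { term => [[doc_id, weight], ...] }
def search(index, terms)
  scores = Hash.new(0.0)
  terms.each do |term|
    (index[term] || []).each do |doc_id, weight|
      scores[doc_id] += weight
    end
  end
  # Highest score first; ties are broken by document id.
  scores.sort_by { |doc_id, score| [-score, doc_id] }
end

index = {
  'couchdb' => [['page1', 2.0]],
  'ruby'    => [['page1', 1.0], ['page2', 1.5]]
}

p search(index, %w[ruby couchdb]) # [["page1", 3.0], ["page2", 1.5]]
```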

That mapper is 7 lines of code, 4 lines if you don’t count the function declaration line and lines only containing curly braces. Ignoring the safeguard as well, the mapper body can be reduced to this single line by also skipping the temporary variable assignment:

for (var term in doc.vector) emit(term, [doc._id, doc.vector[term]]);

In the same manner, the reduce function may be reduced to a body of just 3 lines of code

var docs = [];
for (var i = 0; i < values.length; ++i) docs.push(values[i]);
return docs;

That’s the power and beauty of CouchDB map reduce. You can write a search engine indexer in 4 lines of JavaScript. Granted, you need to create vectors of your documents in advance, but that’s just a matter of parsing a text string and splitting on whitespace and/or punctuation into an array and reducing the term array into a hash of <term> => <weight> pairs. Sure, you can do this with a map reduce as well, but that might be overkill since you will only be operating on a single document at a time.
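
The vector creation described above could look something like this in Ruby. This is a sketch that uses raw term counts as weights; lowercasing the text and treating every run of non-alphanumeric characters as a separator are assumptions of mine, and document_vector is a made-up name.

```ruby
# Split a text into terms and count how often each term occurs.
def document_vector(text)
  terms = text.downcase.split(/[^[:alnum:]]+/).reject(&:empty?)
  terms.each_with_object(Hash.new(0)) { |term, vector| vector[term] += 1 }
end

vector = document_vector('CouchDB is fun. Ruby is fun, too!')
# counts: couchdb 1, is 2, fun 2, ruby 1, too 1
p vector['fun'] # 2
```

The resulting hash is what would be stored in the vector field of the document before the view is built.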

One last important point is that while Futon, the CouchDB browser client, will show the expected results, you have to explicitly tell CouchDB to group the result if you want to use this in your application. My database is called pages, the design document is called demos and the view is called index, making the output of the map reduce available as JSON at
http://localhost:5984/pages/_design/demos/_view/index?group=true
Thanks to J. Chris Anderson for clarifying and pointing out the grouping query usage.
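
From Ruby, the grouped view could be queried along these lines. This is a sketch using only the standard library; the method names index_url and fetch_index are made up, and fetch_index assumes a CouchDB running on localhost with the pages database and the demos/index view in place. Passing a term adds a key parameter, which CouchDB expects to be JSON encoded.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the URL for the grouped view, optionally restricted to one term.
def index_url(term = nil, base = 'http://localhost:5984')
  params = { 'group' => 'true' }
  params['key'] = term.to_json if term # keys are sent as JSON strings
  uri = URI.parse("#{base}/pages/_design/demos/_view/index")
  uri.query = URI.encode_www_form(params)
  uri
end

# Fetch the view and return its rows; needs a running CouchDB.
def fetch_index(term = nil)
  JSON.parse(Net::HTTP.get(index_url(term)))['rows']
end

puts index_url
# http://localhost:5984/pages/_design/demos/_view/index?group=true
puts index_url('ruby')
# http://localhost:5984/pages/_design/demos/_view/index?group=true&key=%22ruby%22
```

Each returned row should have a key (the term) and a value (the list built by the reducer).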


Camping with CouchDB

When developing a new system, getting end-to-end functionality and being able to demonstrate it as soon as possible is important. While doing so, it’s also an added benefit if you do not spend a lot of time writing throwaway code.

I have a set of scripts that let me test and use the system that I am developing from the command line. Since the whole system is written in Ruby, writing a script to allow command line interaction is straightforward. The end result of what I am developing will be a service, so functionality equivalent to the scripts must be available in a browser. With Ruby, there are several ways to bring an application to a browser. The usual suspects are Rails, Merb and if they don’t work, one can always revert to using WEBrick and write servlets.

Rails and Merb did not seem right since I have an existing CouchDB backed model. I had read about Camping and wanted to test it out, so I did. This was great. From the time I started investigating Camping until I could show data from my model in the browser took only 45 minutes. When I realized that I could use my own model directly rather than use the Camping model, I was just a few lines from the goal. I haven’t given much thought to whether this is a production ready framework, but that doesn’t really matter at the moment since I don’t have to write much unnecessary code. The benefit is reduced development time while retaining full programmatic control of my model.

To show how easy it is to connect CouchDB and Camping, here’s a simple example that, while not particularly useful on its own, should show a pattern that you can use in your own application.

require 'camping'
require 'couchrest'

Camping.goes :MyCamp

I use CouchRest to simplify CouchDB interaction. The magic is of course in the last line, Camping.goes :MyCamp. That line tells Camping to serve the module called MyCamp.
Time to implement the controller. Note that Camping expects to find the controller definitions in MyCamp::Controllers.

module MyCamp::Controllers
  class MyObject < R '/object/(\w+)'

This construct might confuse people, but R is defined by Camping and the parameter is a path with a regexp in the parentheses. This regexp yields the argument to get below. In this case, any HTTP GET request to server/object/number1 will call get('number1'). Perfectly RESTful.

    def MyObject.set_storage(storage)
      @@storage = storage
    end

This is a way of letting the controller know about our CouchRest model. I could have added a Model encapsulating that, but to me that is just adding another level of indirection that only serves to add confusion and complicate code maintenance since the model already exists.

    def get(id)
      @my_object = @@storage.get(id)
      @my_id = id
      render :mymodel
    end
  end
end

This is the method that is called by an HTTP GET request matching the given pattern, /object/(\w+). render :mymodel results in the execution of the mymodel view.
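
To see what the route pattern captures, here is a stand-alone sketch in plain Ruby. It only imitates the matching step and is not how Camping implements its routing; ROUTE and id_from_path are names made up for this example, and the pattern is anchored to the whole path here.

```ruby
# Same pattern as in the controller, anchored to the whole path.
ROUTE = %r{\A/object/(\w+)\z}

# Returns the captured id, or nil when the path does not match.
def id_from_path(path)
  match = ROUTE.match(path)
  match && match[1]
end

p id_from_path('/object/number1') # "number1"
p id_from_path('/object/a-b')     # nil, since "-" is not matched by \w
p id_from_path('/other/number1')  # nil
```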

module MyCamp::Views
  def mymodel
    body do
      h1 "#{@my_id}"
      ul do
        @my_object['items'].each do |field, value|
          li "#{field}: #{value}"
        end
      end
    end
  end
end

A simple view that iterates over the ‘items’ hash and lists field: value. Note that Camping by default uses Markaby to create HTML programmatically.

db_url = 'http://localhost:5984/'
storage = CouchRest.database("#{db_url}objects")
MyCamp::Controllers::MyObject.set_storage(storage)

Sets up CouchRest to use the CouchDB backend at http://localhost:5984/objects

Run your application with

camping my_camp.rb

If the database contains a document with _id = 'number1' and 'items' = {"a": 1, "b": 2}, your browser will show this at http://localhost:3301/object/number1:

number1

  • a: 1
  • b: 2

There you go, a nice little Camping application backed by CouchDB.

For the official Camping site, go to http://camping.rubyforge.org/

