Case statement pitfall when migrating to Ruby 1.9
Posted: August 1, 2011 Filed under: code example, ruby
I have been using Rubinius 2.0 to run machine learning experiments with libsvm lately. When running in Ruby 1.9.2, I noticed that my classifier always classified all samples as negative. I thought this was caused by issues with libsvm-ruby-swig, so I recompiled libsvm-ruby-swig from scratch, including rerunning swig, but nothing changed. Next, I switched to libsvmffi instead, but the result was the same. Realizing that I actually had some tests running a very simple classifier, and that these tests passed on 1.9.2, made me look closer at the code. What I found was that the behavior of the Ruby case statement has changed from 1.8.7 to 1.9.2.
For if statements, 1 is equal to 1.0 in both 1.8.7 and 1.9, but while 1 matches 1.0 in 1.8.7 case statements, it does not in 1.9.2.
Code snippet that shows the difference:
#!/usr/bin/env ruby
puts case 1.0
     when 1
       "yay"
     else
       "nay"
     end
First the output when running the script under 1.8.7:
$ rvm use ruby-1.8.7
Using /usr/local/rvm/gems/ruby-1.8.7-p334
$ ./case.rb
yay
And the same in 1.9.2:
$ rvm use ruby-1.9.2
Using /usr/local/rvm/gems/ruby-1.9.2-p180
$ ./case.rb
nay
Needless to say, I was puzzled by this result, but I was more surprised by the 1.8.7 behavior than 1.9.2. My assumption when I wrote the code was that I was dealing with integer values and since it worked, I forgot about it. Next time you see different behavior between 1.8.7 and 1.9.2 it might be worth reviewing case statements.
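One way to sidestep the difference is to normalize the value to an integer before it reaches the case statement. The following is a minimal sketch under that assumption; label_for and the label strings are hypothetical names used for illustration only, not part of libsvm or its bindings.

```ruby
# Hypothetical helper: map a classifier prediction to a label.
# Converting to Integer first means the when clauses compare
# Integer against Integer, regardless of how the interpreter
# treats Integer/Float matching in case statements.
def label_for(prediction)
  case prediction.to_i # note: to_i truncates, so 1.9 becomes 1
  when 1
    "positive"
  when -1
    "negative"
  else
    "unknown"
  end
end

puts label_for(1.0)
puts label_for(-1.0)
```

If the predictions can be fractional, rounding or an explicit comparison is probably a better fit than truncation.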
CouchDB and the web
Posted: September 9, 2010 Filed under: code example, couchdb
This is my presentation from JavaZone 2010.
Note that during my presentation, I showed the view section and basic replication directly in Futon instead of showing the fallback in the slides. What I did show was mostly the same, but naturally I showed some variations on the mappers as well.
CouchDB on Amazon EC2 CentOS server with Sprinkle
Posted: August 31, 2009 Filed under: code example, couchdb
Read the Getting Started part of Till Klampäckel’s CouchDB on Ubuntu on AWS blog post for some general information. I see no reason to repeat those things here.
Till stresses the need for a security group opening port 80, but you should also enable SSH on port 22, otherwise it will be impossible to install anything. The AMI I use is RightScale’s CentOS 5.2 i386 v4.2.4. If you need a 64-bit image, that should work just as well.
Make sure you have Sprinkle installed on the system you are installing from. If not, gem install sprinkle. Sprinkle is written in Ruby, so if you don’t have Ruby, you should start by installing that. Put this in your Sprinkle file and name it something reasonable. I called it couchdb.rb.
# Sprinkle provisioning and deployment for CouchDB on
# an Amazon EC2 CentOS server
package :spidermonkey do
  source 'http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz' do
    custom_dir 'js/src'
    custom_install "make BUILD_OPT=1 -f Makefile.ref && cp *.{h,tbl} /usr/include/ && cd Linux_All_OPT.OBJ && cp *.h /usr/include/ && mkdir -p /usr/{bin,lib}/ && cp js /usr/bin/ && cp libjs.so /usr/lib/"
  end
  verify do
    has_executable 'js'
    has_file '/usr/include/jsapi.h'
    has_file '/usr/lib/libjs.so'
  end
end

package :erlang_dependencies do
  yum %w( ncurses-devel openssl-devel )
end

package :erlang do
  description 'Erlang, the programming language'
  source 'http://erlang.org/download/otp_src_R13B01.tar.gz'
  verify do
    has_executable '/usr/local/bin/erl'
  end
  requires :erlang_dependencies
end

package :couchdb_dependencies do
  yum %w( curl curl-devel icu libicu-devel )
end

# - CouchDB 0.9.1
package :couchdb, :provides => :database do
  description 'CouchDB'
  version '0.9.1'
  source 'http://mirrorservice.nomedia.no/apache.org/couchdb/0.9.1/apache-couchdb-0.9.1.tar.gz' do
    post :install, 'adduser -r -d /usr/local/var/lib/couchdb -M -s /bin/bash -c "CouchDB Administrator" couchdb'
    post :install, 'touch /usr/local/var/log/couchdb/couch.log'
    post :install, 'chown couchdb /usr/local/var/log/couchdb/couch.log'
    post :install, 'mkdir -p /usr/local/var/lib/couchdb'
    post :install, 'chown couchdb /usr/local/var/lib/couchdb'
    post :install, '/usr/local/etc/rc.d/couchdb start'
    post :install, 'ln -s /usr/local/etc/rc.d/couchdb /etc/init.d/couchdb'
    post :install, 'chkconfig --add couchdb'
  end
  verify do
    has_executable '/usr/local/bin/couchdb'
  end
  requires :couchdb_dependencies
  requires :erlang
  requires :spidermonkey
end

package :rubygems do
  description 'Ruby Gems Package Management System'
  yum 'rubygems'
end

package :couchrest do
  description 'Rest API for CouchDB'
  version '0.33'
  gem 'couchrest'
  # the gem installer needs rubygems; this matches the cloud hierarchy shown further down
  requires :rubygems
end

policy :db, :roles => :db do
  requires :database
  requires :couchrest
end

# Deployment
deployment do
  delivery :capistrano do
    set :user, 'root'
    set :use_sudo, false
    set :run_method, :run
    role :db, 'ec2-x-y-z-w.eu-west-1.compute.amazonaws.com'
  end

  # source based package installer defaults
  source do
    prefix '/usr/local' # where all source packages will be configured to install
    archives '/usr/local/sources' # where all source packages will be downloaded to
    builds '/usr/local/build' # where all source packages will be built
  end
end
Replace ec2-x-y-z-w.eu-west-1.compute.amazonaws.com with the public DNS name listed on the Amazon Web Services instance view. You don’t need rubygems and couchrest unless you are going to use Ruby, but I decided to leave them in since CouchRest is a nice library to use when talking to CouchDB from Ruby.
Run it in a shell with sprinkle -s couchdb.rb. It might be interesting to check the powder cloud like this first: sprinkle -cts couchdb.rb. The expected cloud looks like this:
--> Cloud hierarchy for policy db
Policy db requires package database
Selecting couchdb for virtual package database
Package couchdb requires couchdb_dependencies
Package couchdb requires erlang
Package erlang requires erlang_dependencies
Package couchdb requires spidermonkey
Policy db requires package couchrest
Package couchrest requires rubygems
Set up an SSH tunnel to get to the remote Futon. I tend to use port 5994 locally to avoid conflicts with the local CouchDB.
ssh -L 5994:localhost:5984 root@ec2-x-y-z-w.eu-west-1.compute.amazonaws.com
Once again, ec2-x-y-z-w.eu-west-1.compute.amazonaws.com should be replaced with the public DNS name listed on the Amazon Web Services instance view. Point your browser to http://localhost:5994/_utils/ for that familiar Futon view.
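If you set up the tunnel often, a small Ruby helper can assemble the command for you. This is only a sketch: tunnel_command and its keyword arguments are hypothetical names, and the defaults simply mirror the ports and user mentioned in this post.

```ruby
require "shellwords"

# Hypothetical helper: build the ssh port-forwarding command as a string.
# local_port is where the remote Futon shows up locally; remote_port is
# the port CouchDB listens on at the server.
def tunnel_command(host, local_port: 5994, remote_port: 5984, user: "root")
  ["ssh", "-L", "#{local_port}:localhost:#{remote_port}", "#{user}@#{host}"].shelljoin
end

puts tunnel_command("ec2-x-y-z-w.eu-west-1.compute.amazonaws.com")
```

The result can be pasted into a shell or handed to Kernel#system.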
The CouchDB indexer – lightweight search engine in hours
Posted: July 9, 2009 Filed under: code example, couchdb
Have you ever been in a situation where you needed to create a reverse lookup index of some documents you had lying around?
A reverse lookup index is the kind of index used by the search engines (or Googles if you like) of this world. Creating a reverse lookup index isn’t hard, but you would normally expect to spend a couple of weeks writing code to create one when needed unless you decide to use an existing indexer like Lucene. Using Lucene is a bit heavy if the indexing is not core to your application. If you store your documents in CouchDB, you can create an indexer writing just 4 lines of JavaScript. You should have at least one more line for safeguarding your input values, but a search engine indexer in 5 lines of JavaScript is IMHO still pretty good.
The prerequisite for this indexer is that the documents you want to index all have a vector field containing a document vector in the form of a hash mapping term to term weight: {<term0> => <weight0>, <term1> => <weight1>, …, <termn> => <weightn>}. The weight could be just the number of times a term occurs in the document or some metric indicating how important a word is to a document relative to all possible documents. The traditional weight used in search engines is tf-idf (term frequency multiplied by inverse document frequency). Head over to Wikipedia if you want to learn more about tf-idf. Of course, if you just want to find all documents matching a query, you can ignore the weights completely.
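To make the shape of such a vector concrete, here is a minimal Ruby sketch that builds one using plain term counts as weights. document_vector is a hypothetical helper, and the tokenization (lowercasing and splitting on anything that is not a letter or digit) is an assumption; real text usually needs more care.

```ruby
# Hypothetical helper: turn a text string into a term => count hash.
def document_vector(text)
  # Lowercase the text and pull out runs of letters and digits as terms.
  terms = text.downcase.scan(/[[:alnum:]]+/)
  # Count how many times each term occurs.
  terms.each_with_object(Hash.new(0)) { |term, vector| vector[term] += 1 }
end

p document_vector("The cat and the hat.")
```

The resulting hash is what would be stored in the vector field of the document before saving it to CouchDB.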
If you have document vectors, you have all the input data you need to create a reverse lookup index. This is the mapper if you just want to get all documents matching a query. Note that it completely ignores the term weights:
function(doc) {
  if (!('vector' in doc)) return;
  var vector = doc.vector;
  for (var term in vector) {
    emit(term, doc._id);
  }
}
The function operates on each document in the database. The first statement is a simple safeguard ensuring that we don’t try to access vector if the document doesn’t have that property. Since CouchDB is schema free, different types of documents with different fields may be stored in the same database. If the vector property is there, we store it in a variable called vector. For each element in vector, we emit the key which is the term and the id of the document we are operating on. If you run just this mapper, you will get a list of term to single document id mappings. This is a major step forward since we now have a reverse mapping of the database.
To get the reverse database map into something that is quick and easy to look up, we need a reducer. Its purpose is to convert the list of term, document id pairs into a single term to document id list.
function(keys, values) {
  var docs = [];
  for (var i = 0; i < values.length; ++i) {
    docs.push(values[i]);
  }
  return docs;
}
In this situation, we don’t care about the keys. CouchDB handles that for us. We need an array to store all the document ids. Then we iterate over all the values and push them into our array. Finally we return the array. When the view is queried with grouping, CouchDB gives us a single document id list for each term. Note that on larger data sets CouchDB may also call the reducer again on its own intermediate results (a rereduce), which this simple version does not handle.
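To see what the mapper and reducer produce together, the same steps can be imitated locally in plain Ruby. This is a sketch of the idea only, not how CouchDB executes views; DOCS, map_doc, reduce_ids and build_index are hypothetical names, and the sample documents are made up.

```ruby
# Made-up sample documents; the last one has no vector and is skipped.
DOCS = [
  { "_id" => "doc1", "vector" => { "couchdb" => 2, "index" => 1 } },
  { "_id" => "doc2", "vector" => { "index" => 3 } },
  { "_id" => "doc3" }
].freeze

# Mirrors the mapper: one [term, document id] row per term in the vector.
def map_doc(doc)
  return [] unless doc.key?("vector")
  doc["vector"].keys.map { |term| [term, doc["_id"]] }
end

# Mirrors the reducer: collect all document ids for one term.
def reduce_ids(values)
  values.dup
end

# Group the mapped rows by term and reduce each group.
def build_index(docs)
  rows = docs.flat_map { |doc| map_doc(doc) }
  rows.group_by(&:first).transform_values { |pairs| reduce_ids(pairs.map(&:last)) }
end

p build_index(DOCS)
```

Each term ends up pointing at the list of documents that contain it, which is the reverse lookup index described above.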
This index can be used to find all documents matching a given set of terms. Note that there is not much sophistication in this method so the only rank score you can get is the number of matching terms. Adding term weights to the index will give you something to use for ranking. Change the emit line in the mapper to
emit(term, [doc._id, vector[term]]);
This will give you a list of document id, term weight pairs for each term instead of just the document ids.
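With weights in the index, a simple ranking is to sum the weights of the query terms for each document. The sketch below assumes the index has already been fetched into a Ruby hash of term => [[document id, weight], ...]; rank and the sample data are hypothetical, and summing weights is just one possible scoring choice.

```ruby
# Hypothetical ranking: score each document by the summed weights of
# the query terms it contains, highest score first.
def rank(index, query_terms)
  scores = Hash.new(0.0)
  query_terms.each do |term|
    # Terms missing from the index simply contribute nothing.
    index.fetch(term, []).each { |doc_id, weight| scores[doc_id] += weight }
  end
  # Sort by descending score, breaking ties on document id.
  scores.sort_by { |doc_id, score| [-score, doc_id] }
end

index = {
  "couchdb" => [["doc1", 2]],
  "index"   => [["doc1", 1], ["doc2", 4]]
}
p rank(index, %w[couchdb index])
```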
That mapper is 7 lines of code, 4 lines if you don’t count the function declaration line and lines only containing curly braces. Ignoring the safeguard as well, the mapper body can be reduced to this single line by also skipping the temporary variable assignment:
for (var term in doc.vector) emit(term, [doc._id, doc.vector[term]]);
In the same manner, the reduce function may be reduced to a body of just 3 lines of code:
var docs = [];
for (var i = 0; i < values.length; ++i) docs.push(values[i]);
return docs;
That’s the power and beauty of CouchDB map reduce. You can write a search engine indexer in 4 lines of JavaScript. Granted, you need to create vectors of your documents in advance, but that’s just a matter of parsing a text string and splitting on whitespace and/or punctuation into an array and reducing the term array into a hash of <term> => <weight> pairs. Sure, you can do this with a map reduce as well, but that might be overkill since you will only be operating on a single document at a time.
One last important point is that while Futon, the CouchDB browser client, will show the expected results, you have to explicitly tell CouchDB to group the result if you want to use this in your application. My database is called pages, the design document is called demos and the view is called index, making the output of the map reduce available as JSON at
http://localhost:5984/pages/_design/demos/_view/index?group=true
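From Ruby, the grouped view can be fetched with nothing but the standard library. This is a rough sketch: view_uri and fetch_index are hypothetical helper names, and fetch_index assumes a CouchDB instance is running at the given address and that the response has the usual rows array with key and value fields.

```ruby
require "uri"
require "json"
require "net/http"

# Hypothetical helper: build the URL of a view, with optional query parameters.
def view_uri(base, db, design, view, params = {})
  uri = URI.join(base, "#{db}/_design/#{design}/_view/#{view}")
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri
end

# Hypothetical helper: fetch the view and turn its rows into a term => value hash.
# Needs a running CouchDB, so it is defined here but not called.
def fetch_index(uri)
  rows = JSON.parse(Net::HTTP.get(uri)).fetch("rows")
  rows.each_with_object({}) { |row, index| index[row["key"]] = row["value"] }
end

puts view_uri("http://localhost:5984/", "pages", "demos", "index", { "group" => true })
```

Calling fetch_index with that URI should return the reverse lookup index as a plain Ruby hash, provided the database and view exist.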
Thanks to J. Chris Anderson for clarifying and pointing out the grouping query usage.
Camping with CouchDB
Posted: March 8, 2009 Filed under: code example | Tags: camping, couchdb, example, ruby
When developing a new system, getting end-to-end functionality and being able to demonstrate it as soon as possible is important. While doing so, it’s also an added benefit if you do not spend a lot of time writing throwaway code.
I have a set of scripts that let me test and use the system that I am developing from the command line. Since the whole system is written in Ruby, writing a script to allow command line interaction is straightforward. The end result of what I am developing will be a service, so functionality equivalent to the scripts must be available in a browser. With Ruby, there are several ways to bring an application to a browser. The usual suspects are Rails, Merb and if they don’t work, one can always revert to using WEBrick and write servlets.
Rails and Merb did not seem right since I have an existing CouchDB backed model. I had read about Camping and wanted to test it out, so I did. This was great. From the time I started investigating Camping until I could show data from my model in the browser took only 45 minutes. When I realized that I could use my own model directly rather than use the Camping model, I was just a few lines from the goal. I haven’t given much thought to whether this is a production ready framework, but that doesn’t really matter at the moment since I don’t have to write much unnecessary code. The benefit is reduced development time while retaining full programmatic control of my model.
To show how easy it is to connect CouchDB and Camping, here’s a simple example that, while not particularly useful on its own, should show a pattern that you can use in your own application.
require 'camping'
require 'couchrest'
Camping.goes :MyCamp
I use CouchRest to simplify CouchDB interaction. The magic is of course in the last line, Camping.goes :MyCamp. That line tells Camping to serve the module called MyCamp.
Time to implement the controller. Note that Camping expects to find the controller definitions in MyCamp::Controllers.
module MyCamp::Controllers
  class MyObject < R '/object/(\w+)'
This construct might confuse people, but R is defined by Camping and the parameter is a path with a regexp in the parentheses. This regexp yields the argument to get below. In this case, any HTTP GET request to server/object/number1 will call get('number1'). Perfectly RESTful.
    def MyObject.set_storage(storage)
      @@storage = storage
    end
This is a way of letting the controller know about our CouchRest model. I could have added a Model encapsulating that, but to me that is just adding another level of indirection that only serves to add confusion and complicate code maintenance since the model already exists.
    def get(id)
      @my_object = @@storage.get(id)
      @my_id = id
      render :mymodel
    end
  end
end
This is the method that is called by an HTTP GET request matching the given pattern /object/(\w+). render :mymodel results in the execution of the mymodel view.
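The way the route pattern turns a request path into the argument for get can be illustrated with a plain Ruby regexp. Camping’s actual routing does more than this; the sketch only shows the capture, and ROUTE and route_argument are hypothetical names that are not part of Camping.

```ruby
# The same pattern as in the controller, anchored to the whole path.
ROUTE = %r{\A/object/(\w+)\z}

# Hypothetical helper: return the captured id, or nil if the path does not match.
def route_argument(path)
  match = ROUTE.match(path)
  match && match[1]
end

p route_argument("/object/number1")
p route_argument("/other/number1")
```

Since \w only matches letters, digits and underscores, an id containing for example a dash would not match this route.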
module MyCamp::Views
  def mymodel
    body do
      h1 "#{@my_id}"
      ul do
        @my_object['items'].each do |field, value|
          li "#{field}: #{value}"
        end
      end
    end
  end
end
Simple view that iterates over the ‘items’ hash and lists field: value. Note that Camping by default uses markaby to create HTML programmatically.
db_url = 'http://localhost:5984/'
# use the local variable db_url; @db_url is never set and would be nil here
storage = CouchRest.database("#{db_url}objects")
MyCamp::Controllers::MyObject.set_storage(storage)
This sets up CouchRest to use the CouchDB backend at http://localhost:5984/objects.
Run your application with
camping my_camp.rb
If the database contains a document with _id = 'number1' and 'items' = {"a": 1, "b": 2}, your browser will show this at http://localhost:3301/object/number1:
number1
- a: 1
- b: 2
There you go, a nice little Camping application backed by CouchDB.
For the official Camping site, go to http://camping.rubyforge.org/