CouchDB on Amazon EC2 CentOS server with Sprinkle
Posted: August 31, 2009 Filed under: code example, couchdb
Read the Getting Started part of Till Klampäckel’s CouchDB on Ubuntu on AWS blog post for some general information. I see no reason to repeat those things here.
Till stresses the need for a security group opening port 80, but you should also enable SSH on port 22, otherwise it will be impossible to install anything. The AMI I use is RightScale’s CentOS 5.2 i386 v4.2.4. If you need a 64-bit image, that should work just as well.
Make sure you have Sprinkle installed on the system you are installing from. If not, gem install sprinkle. Sprinkle is written in Ruby, so if you don’t have Ruby, you should start by installing that. Put this in your Sprinkle file and name it something reasonable. I called it couchdb.rb.
# Sprinkle provisioning and deployment for CouchDB on
# an Amazon EC2 CentOS server
package :spidermonkey do
  source 'http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz' do
    custom_dir 'js/src'
    custom_install "make BUILD_OPT=1 -f Makefile.ref && cp *.{h,tbl} /usr/include/ && cd Linux_All_OPT.OBJ && cp *.h /usr/include/ && mkdir -p /usr/{bin,lib}/ && cp js /usr/bin/ && cp libjs.so /usr/lib/"
  end
  verify do
    has_executable 'js'
    has_file '/usr/include/jsapi.h'
    has_file '/usr/lib/libjs.so'
  end
end
package :erlang_dependencies do
  yum %w( ncurses-devel openssl-devel )
end
package :erlang do
  description 'Erlang, the programming language'
  source 'http://erlang.org/download/otp_src_R13B01.tar.gz'
  verify do
    has_executable '/usr/local/bin/erl'
  end
  requires :erlang_dependencies
end
package :couchdb_dependencies do
  yum %w( curl curl-devel icu libicu-devel )
end
# - CouchDB 0.9.1
package :couchdb, :provides => :database do
  description 'CouchDB'
  version '0.9.1'
  source 'http://mirrorservice.nomedia.no/apache.org/couchdb/0.9.1/apache-couchdb-0.9.1.tar.gz' do
    post :install, 'adduser -r -d /usr/local/var/lib/couchdb -M -s /bin/bash -c "CouchDB Administrator" couchdb'
    post :install, 'touch /usr/local/var/log/couchdb/couch.log'
    post :install, 'chown couchdb /usr/local/var/log/couchdb/couch.log'
    post :install, 'mkdir -p /usr/local/var/lib/couchdb'
    post :install, 'chown couchdb /usr/local/var/lib/couchdb'
    post :install, '/usr/local/etc/rc.d/couchdb start'
    post :install, 'ln -s /usr/local/etc/rc.d/couchdb /etc/init.d/couchdb'
    post :install, 'chkconfig --add couchdb'
  end
  verify do
    has_executable '/usr/local/bin/couchdb'
  end
  requires :couchdb_dependencies
  requires :erlang
  requires :spidermonkey
end
package :rubygems do
  description 'Ruby Gems Package Management System'
  yum 'rubygems'
end
package :couchrest do
  description 'Rest API for CouchDB'
  version '0.33'
  gem 'couchrest'
end
policy :db, :roles => :db do
  requires :database
  requires :couchrest
end
# Deployment
deployment do
  delivery :capistrano do
    set :user, 'root'
    set :use_sudo, false
    set :run_method, :run
    role :db, 'ec2-x-y-z-w.eu-west-1.compute.amazonaws.com'
  end
  # source based package installer defaults
  source do
    prefix '/usr/local' # where all source packages will be configured to install
    archives '/usr/local/sources' # where all source packages will be downloaded to
    builds '/usr/local/build' # where all source packages will be built
  end
end
Replace ec2-x-y-z-w.eu-west-1.compute.amazonaws.com with the public DNS name listed on the Amazon Web Services instance view. You don’t need rubygems and couchrest unless you are going to use Ruby, but I decided to leave them since CouchRest is a nice library to use when talking to CouchDB from Ruby.
Run it in a shell with sprinkle -s couchdb.rb. It might be interesting to check the powder cloud with a test run first: sprinkle -cts couchdb.rb. The expected cloud looks like this:
--> Cloud hierarchy for policy db
Policy db requires package database
Selecting couchdb for virtual package database
Package couchdb requires couchdb_dependencies
Package couchdb requires erlang
Package erlang requires erlang_dependencies
Package couchdb requires spidermonkey
Policy db requires package couchrest
Package couchrest requires rubygems
Set up an SSH tunnel to get to the remote Futon. I tend to use 5994 locally to avoid conflicts with the local CouchDB:
ssh -L 5994:localhost:5984 root@ec2-x-y-z-w.eu-west-1.compute.amazonaws.com
Once again, ec2-x-y-z-w.eu-west-1.compute.amazonaws.com should be replaced with the public DNS name listed on the Amazon Web Services instance view. Point your browser to http://localhost:5994/_utils/ for that familiar Futon view.
Finding the one in an ocean of provisioning and deployment systems
Posted: August 21, 2009 Filed under: deployment | Tags: deployment, provisioning
I currently use some shell scripts for deployment. This works well on my development server. It also works when starting to run things in a data center, but there are limitations and shortcomings. For one, my deployment scripts do not have any verification. I can add that, but then I would be duplicating work that others have already done well. More important is the fact that my deployment scripts assume a prepared installation environment with all the needed packages and gems installed. I want a deployment system that also provisions the target server with all my prerequisites like Apache or nginx, Phusion Passenger, Sinatra and CouchDB. Note that our application does not use Rails: the storage is handled by CouchDB and non-UI work is handled by Rack applications running directly in Passenger. The UI is handled by Sinatra.
We are looking at running our service in Amazon EC2 when we are ready to launch. As we use Fedora and Ubuntu for development, the obvious candidate distros are the same or related distros like CentOS, RHEL and Debian. If the deployment system handles package installation through an abstraction layer, this layer must support both yum and apt.
Chef
Chef is built around a Chef server hosting the deployment scripts (cookbooks) and Chef clients managing the worker nodes. It uses CouchDB in the background for storing information. From what I see, Chef is very powerful and looks like a nice way of handling huge clusters. The power also seems to have the side effect that doing simple things in a smallish cluster is a lot of work. It uses Rake underneath.
Capistrano
Capistrano uses its own DSL in combination with Ruby. Capistrano is Rails-centric, but has instructions showing how to use it for non-Rails deployment. The deployment instructions mostly run shell scripts, which makes it very flexible.
Vlad the Deployer
Vlad seems to be an extension to Rake. It aims to be a better Capistrano. It seems very Subversion oriented. One of the features is turnkey deployment for mongrel+apache+svn. We’re not using Mongrel or svn, and although we use Apache now, we might switch to nginx soon, so this all seems wrong for us.
Puppet
Puppet is an extremely flexible and powerful provisioning system. However, as is often the case, this comes at a cost. Doing simple things like our deployment is very time consuming in Puppet. It uses its own DSL to specify very accurately how the target environment should look after running Puppet. It is capable of running in the background and syncing automatically every 30 minutes or so. The feature set is overkill in our case right now, and if we need those features later I’ll be a happy man who will gladly spend time moving to Puppet.
Sprinkle
Sprinkle uses Ruby directly instead of a task specific DSL. Full support of the main deployment systems is included. What puzzled me at first was that it supports Capistrano and Vlad the Deployer for remote command delivery. Later I realized that Sprinkle is a provisioning system and leaves application deployment to other specialized systems. Nice separation.
Moonshine
One of the requirements of Moonshine is “A server running Ubuntu 8.10”, which effectively disqualifies it for our use. I’m sure supporting other distros is not much work, but I’m not willing to do that work at this point anyway.
Passenger_stack
It might seem weird to include this in the list, but we use Passenger and anything that can get us there faster is considered. This would get us Apache, Passenger and lots of other things we do not use, like MySQL. Since I do not want to install things we don’t need and passenger_stack only works on Ubuntu (and not even Debian), it is disqualified.
Conclusion
With Moonshine and passenger_stack out due to lack of distro support, Vlad the Deployer out due to being mongrel+svn focused and Puppet being too time consuming for simple things, the following candidates are still contestants: Chef, Capistrano and Sprinkle.
Sprinkle has nice dry-run functionality allowing you to see what will happen to your remote servers before you do anything. Capistrano gets praise for ease of application deployment, so the Sprinkle and Capistrano combo sounds interesting. My choice is now down to Sprinkle or Chef for the server provisioning and the winner is Sprinkle. The reason for my choice is that it is simpler and does what I currently need it to do. Migrating to Chef in the future is still an option.
The CouchDB indexer – lightweight search engine in hours
Posted: July 9, 2009 Filed under: code example, couchdb
Have you ever been in a situation where you needed to create a reverse lookup index of some documents you had lying around?
A reverse lookup index is the kind of index used by the search engines (or Googles if you like) of this world. Creating a reverse lookup index isn’t hard, but you would normally expect to spend a couple of weeks writing code to create one when needed unless you decide to use an existing indexer like Lucene. Using Lucene is a bit heavy if the indexing is not core to your application. If you store your documents in CouchDB, you can create an indexer writing just 4 lines of JavaScript. You should have at least one more line for safeguarding your input values, but a search engine indexer in 5 lines of JavaScript is IMHO still pretty good.
The prerequisite for this indexer is that the documents you want to index all have a vector field containing a document vector in the form of a hash mapping term to term weight {<term0> => <weight0>, <term1> => <weight1>, …, <termn> => <weightn>}. The weight could be just the number of times a term occurs in the document or some metric indicating how important a word is to a document relative to all possible documents. The traditional weight used in search engines is tf-idf (term frequency multiplied by inverse document frequency). Head over to Wikipedia if you want to learn more about tf-idf. Of course, if you just want to find all documents matching a query, you can ignore the weights completely.
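To make the tf-idf weight a bit more concrete, here is a minimal sketch of one common variant: the raw term count multiplied by the natural logarithm of (total documents / documents containing the term). The function name tfidfVector and the precomputed docFrequency table are assumptions made up for this illustration, and other tf-idf variants scale things differently.

```javascript
// Sketch only: tfidfVector and docFrequency are hypothetical names.
// counts:       term => number of occurrences in this document
// docFrequency: term => number of documents containing the term (precomputed)
// totalDocs:    total number of documents in the collection
function tfidfVector(counts, docFrequency, totalDocs) {
  var vector = {};
  for (var term in counts) {
    if (!Object.prototype.hasOwnProperty.call(counts, term)) continue;
    // Rare terms get a high idf, terms found in every document get 0.
    var idf = Math.log(totalDocs / docFrequency[term]);
    vector[term] = counts[term] * idf;
  }
  return vector;
}

var weights = tfidfVector(
  { couchdb: 2, the: 5 },
  { couchdb: 10, the: 1000 },
  1000
);
console.log(weights);
```

A term like "the" that occurs in every document ends up with weight 0 no matter how often it appears, while the rarer "couchdb" gets a clearly positive weight.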
If you have document vectors, you have all the input data you need to create a reverse lookup index. This is the mapper if you just want to get all documents matching a query. Note that it completely ignores the term weights:
function(doc) {
  if (!('vector' in doc)) return;
  var vector = doc.vector;
  for (var term in vector) {
    emit(term, doc._id);
  }
}
The function operates on each document in the database. The first statement is a simple safeguard ensuring that we don’t try to access vector if the document doesn’t have that property. Since CouchDB is schema free, different types of documents with different fields may be stored in the same database. If the vector property is there, we store it in a variable called vector. For each element in vector, we emit the key which is the term and the id of the document we are operating on. If you run just this mapper, you will get a list of term to single document id mappings. This is a major step forward since we now have a reverse mapping of the database.
To get the reverse database map into something that is quick and easy to look up, we need a reducer. Its purpose is to convert the list of term, document id pairs into a single term to document id list.
function(keys, values) {
  var docs = [];
  for (var i = 0; i < values.length; ++i) {
    docs.push(values[i]);
  }
  return docs;
}
In this situation, we don’t care about the keys. CouchDB handles that for us. We need an array to store all the document ids. Then we iterate over all the values and push them into our array. Finally we return the array. CouchDB ensures that this is only called once for a single term ensuring we end up with a single document id list for each term.
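One caveat worth hedging: on larger databases CouchDB may also run a reduce function on the output of earlier reduce calls, passing true as a third argument (rereduce). In that case each value is already a list, and the simple reducer would return nested lists. The sketch below is one possible rereduce-aware variant, together with a small stand-in for a grouped view so it can be tried outside CouchDB. The groupedView helper and passing emit as a parameter are assumptions made for local experimenting only; inside CouchDB emit is provided for you.

```javascript
// Mapper from the post, with emit passed in so it runs outside CouchDB.
function map(doc, emit) {
  if (!('vector' in doc)) return;
  for (var term in doc.vector) emit(term, doc._id);
}

// Reducer variant that also handles the rereduce case.
function reduce(keys, values, rereduce) {
  var docs = [];
  for (var i = 0; i < values.length; ++i) {
    if (rereduce) {
      // values[i] is a list returned by an earlier reduce call.
      docs = docs.concat(values[i]);
    } else {
      // values[i] is a single emitted value.
      docs.push(values[i]);
    }
  }
  return docs;
}

// Hypothetical stand-in for querying the view with group=true.
function groupedView(docs) {
  var byKey = Object.create(null); // no inherited keys such as "constructor"
  docs.forEach(function (doc) {
    map(doc, function (key, value) {
      if (!byKey[key]) byKey[key] = [];
      byKey[key].push(value);
    });
  });
  return Object.keys(byKey).sort().map(function (key) {
    return { key: key, value: reduce(null, byKey[key], false) };
  });
}

var rows = groupedView([
  { _id: 'a', vector: { couch: 1, db: 2 } },
  { _id: 'b', vector: { couch: 3 } },
  { _id: 'c' } // no vector, skipped by the safeguard
]);
console.log(JSON.stringify(rows));
```

The local helper only mimics the grouping step. It says nothing about how CouchDB actually batches the values it hands to the reducer.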
This index can be used to find all documents matching a given set of terms. Note that there is not much sophistication in this method so the only rank score you can get is the number of matching terms. Adding term weights to the index will give you something to use for ranking. Change the emit line in the mapper to
emit(term, [doc._id, vector[term]]);
This will give you a list of document id, term weight pairs for each term instead of just the document ids.
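As a rough sketch of how the weighted index could be used for ranking, the function below sums the weights of all matching query terms per document and sorts by that sum. The rank function and the shape of the index argument (term mapped to a list of [document id, weight] pairs, as the weighted view would return per term) are assumptions for illustration, and summing weights is just one simple scoring choice.

```javascript
// Sketch only: rank and the in-memory index shape are hypothetical.
// index: term => list of [docId, weight] pairs
function rank(index, queryTerms) {
  var scores = Object.create(null);
  queryTerms.forEach(function (term) {
    var hasTerm = Object.prototype.hasOwnProperty.call(index, term);
    var postings = hasTerm ? index[term] : [];
    postings.forEach(function (pair) {
      var id = pair[0];
      var weight = pair[1];
      // Add up the weights of every query term found in the document.
      scores[id] = (scores[id] || 0) + weight;
    });
  });
  return Object.keys(scores)
    .map(function (id) { return { id: id, score: scores[id] }; })
    .sort(function (a, b) { return b.score - a.score; });
}

var index = {
  couch: [['a', 1], ['b', 3]],
  db: [['a', 2.5]]
};
console.log(rank(index, ['couch', 'db']));
```

Here document a matches both terms and scores 3.5, so it ranks ahead of document b, which only matches couch with a score of 3.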
That mapper is 7 lines of code, 4 lines if you don’t count the function declaration line and lines only containing curly braces. Ignoring the safeguard as well, the mapper body can be reduced to this single line by also skipping the temporary variable assignment:
for (var term in doc.vector) emit(term, [doc._id, doc.vector[term]]);
In the same manner, the reduce function may be reduced to a body of just 3 lines of code:
var docs = [];
for (var i = 0; i < values.length; ++i) docs.push(values[i]);
return docs;
That’s the power and beauty of CouchDB map reduce. You can write a search engine indexer in 4 lines of JavaScript. Granted, you need to create vectors of your documents in advance, but that’s just a matter of parsing a text string and splitting on whitespace and/or punctuation into an array and reducing the term array into a hash of <term> => <weight> pairs. Sure, you can do this with a map reduce as well, but that might be overkill since you will only be operating on a single document at a time.
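The vector building step described above could look roughly like this. It is a minimal sketch, assuming plain ASCII text and using the term count as weight. The name buildVector is made up for this example.

```javascript
// Sketch only: buildVector is a hypothetical helper, ASCII text assumed.
function buildVector(text) {
  // Lowercase, then split on anything that is not a letter or a digit.
  var terms = text.toLowerCase().split(/[^a-z0-9]+/).filter(function (t) {
    return t.length > 0;
  });
  // Reduce the term array into a hash of term => number of occurrences.
  // Object.create(null) avoids clashes with inherited names like "constructor".
  return terms.reduce(function (vector, term) {
    vector[term] = (vector[term] || 0) + 1;
    return vector;
  }, Object.create(null));
}

var doc = {
  _id: 'page-1',
  vector: buildVector('CouchDB is fun. CouchDB is simple!')
};
console.log(JSON.stringify(doc));
```

A document prepared this way has the vector field the mapper expects, so it can be saved to the database as it is.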
One last important point is that while Futon, the CouchDB browser client, will show the expected results, you have to explicitly tell CouchDB to group the result if you want to use this in your application. My database is called pages, the design document is called demos and the view is called index, making the output of the map reduce available as JSON at
http://localhost:5984/pages/_design/demos/_view/index?group=true
Thanks to J. Chris Anderson for clarifying and pointing out the grouping query usage.
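If you only want the document list for a single term, a view query can be narrowed with the key parameter in addition to group=true. View keys are JSON values, so a string key has to be quoted and then URL-encoded. The helper below is a small sketch for building such a URL. The function name viewUrl is hypothetical, and the database, design document and view names are the ones from this post.

```javascript
// Sketch only: viewUrl is a hypothetical helper for building the query URL.
function viewUrl(base, db, design, view, term) {
  var url = base + '/' + db + '/_design/' + design + '/_view/' + view +
    '?group=true';
  if (term !== undefined) {
    // JSON-encode the key first ("couchdb" with quotes), then URL-encode it.
    url += '&key=' + encodeURIComponent(JSON.stringify(term));
  }
  return url;
}

console.log(viewUrl('http://localhost:5984', 'pages', 'demos', 'index'));
console.log(viewUrl('http://localhost:5984', 'pages', 'demos', 'index', 'couchdb'));
```

The first call reproduces the URL above. The second one adds the encoded key so only the row for that term comes back.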
Meet the people
Posted: June 29, 2009 Filed under: startup
In the beforetime, before I started working in search engine companies, I had some direct customer relations. I didn’t think much about it back then, but I have great memories of the experiences I got from dealing with customers. I tend to be rated extrovert on personality tests, but if that’s true, I’m a conditional extrovert. Granted, I enjoy spending time with friends and socializing, but I’m also quite fond of sitting by myself in front of a keyboard. What I remember from that time was that I was always nervous before meeting the customers, but when we were face to face, everything was fine and I enjoyed their company as well as the learning experience.
During the last years, direct customer relations haven’t been that much of an issue. There was a time while I was working for FAST Search & Transfer that we had fairly regular conference calls with one particular customer. They had valuable deep web content that was unavailable to crawlers. We did a joint project between the two companies to make their content available to our web search index. This whole experience did not have much in common with a traditional customer-provider relationship. I have held training sessions and presentations internally over the years, but while I might be a little nervous before presenting to lots of people, it’s mostly just business as usual. In other words, no customer fronting experiences for me in almost 10 years.
Now that I’m bootstrapping a company, I have started talking to (potential) customers again. I am lucky enough to have a friend as pilot customer so it’s once again not a traditional customer-provider relationship. He is also talking to other potential customers through his network and gathers feedback. Getting that feedback and seeing that it mostly reflects what we expect is great, but I always learn new things whenever I talk to him. These things might seem obvious or minor to him, but they are important for us, either for confirming our assumptions or for showing that our assumptions are wrong.
As I mentioned in the beginning, I’m a conditional extrovert, but meeting customers again makes me realize that I have missed this. Presenting our plans to governmental institutions before applying for a grant and other potential future investors is fun. I’ll be the first to admit that I may be nervous before such presentations, but when the first page of the presentation is up and I start talking it’s more of a rush than anything else. Answering questions from customers and investors is fun too because I believe we are on to something. Moreover, we are talking about something that no one has told me to do. I am the one airing my thoughts to others and they react positively. It’s a great way of building confidence in yourself and what you are doing.
I remember another thing from the beforetime. A salesman in the company I worked for and I were working together at a trade fair when some potential future customers that he had briefly been in touch with before showed up. He talked them through the things we could do for them and generally spent some time discussing various alternatives with them. When they had left, he almost celebrated and said aloud to himself “I’m so good at this.” And he was. He was extremely good at what he was doing. For me, this was a great learning experience. I observed him during this session and others and learned (I hope), but most of all I realized how big an energy rush you can get from good talks with customers.
Lifestyle design
Posted: June 19, 2009 Filed under: startup
I read Jeremy Lattimore’s blog post about lifestyle design on Startup Student the other day and it got me thinking. Sure, I have thought about the concept several times since I decided to found a company rather than taking a regular job, but reading that blog post made me more conscious about my decisions. Furthermore, it forced me to reflect on how starting a company is lifestyle design in many ways.
The stereotypical lifestyle design as described in David Risley’s example is to come up with a fairly cheap, easy to sell product. While I’ve not gotten around to reading Tim Ferriss‘ 4-Hour Workweek (it’s in my house in paper form, honestly), I think this is the primary example Tim uses in his defining book about lifestyle design as well. This approach is somewhat different from the approach used by Jeremy Lattimore, but the final goal is basically the same.
In my case, I had something that I desperately wanted to do. There is a theme that has been at the core of some things I have enjoyed working on over the last few years and I believe there are lots of business opportunities centered around this technology. When I set out, I did not want to start this all by myself. I wanted to have one to three other people heavily involved both to discuss plans and to verify that the idea was good. The one I wanted the most, since I knew he also has an interest in the core technology, was very eager to be involved and things started to happen. Now, we have a business development company involved, a pilot customer and leads to the next batch of customers as well as a possible lead to a large number of long term customers.
So, I’m starting a company and that does not exactly sound like lifestyle design. Starting a company means working long hours, with little or no pay and making things difficult for oneself. This is in many ways true: generating revenue doesn’t happen overnight and the bigger your goals are, the longer it takes for that revenue stream to turn into a profit. I am willing to work long hours and I expect to have to live off my savings for a long time and get some funding to hire more people in the not too distant future, but… In the current situation, I work from home and I work when I want to rather than when some early bird has decided that I should work. I don’t feel bad about taking breaks during the day and I don’t force myself to pretend to work when I don’t think I will get anything done because I’m either too tired or my mind is preoccupied with something else. This does not mean that I’m slacking; I’m just focusing on output rather than time.
Looking back at my weekly sprints, I see that I mostly finish ahead of time and only once did I fail to complete my sprint in time. I actually worked longer hours during the failed sprint than otherwise since it was much harder than expected. Now, as I write this blog post, I am way ahead of the original schedule and we should have no technical problems meeting our first major milestone, first customer live.
Is this lifestyle design? To me, it is. I’ve been coding since I was 11 or so and I started because it was interesting and fun. The service we are developing is something I would like to use myself and I’m having fun developing it. While I don’t have a steady stream of money flowing into the bank at the moment, I believe that this service will make money and generate a profit at some point and while getting there, I am having fun coding and writing blog posts. I expected to have fun along the way when I decided to try out the startup life and I feel in control. I think this is very much in tune with Jeremy Lattimore’s lifestyle design thoughts.
The week of June 15 in links
Posted: June 19, 2009 Filed under: links | Tags: digest
Last week’s links. Lots of RTs as usual. The span of links goes from zombie music all the way to cooking, but most of the links are to more classic geek stuff:
- I wonder how much time I will spend on getting Kindle stuff to work on my netbook. My Asus eee 1000he is my current book reader and I prefer it to paper books. My ebook reader app of choice was eReader, but now they’re refusing to sell me books since I live in Norway so I need to find alternatives. Luckily, I still have a stack of unread books in eReader, but they will run out eventually and then I need to find another source of books. A Kindle app could be the best solution.
- RT @phusion_nl Phusion Passenger 2.2.3 “Bug Fix Edition” released. Passenger makes my set of Racks including Sinatra run in a web server environment (Apache) that I trust. I will also test Passenger with nginx, but that’s a bit down on my todo list.
- RT @startupstudent: New blog post: Why am I doing this?. I commented on the blog post, but I will actually publish my own take on startup life vs lifestyle design inspired by this post later today.
- RT @timbury: RT @StevoMoviemaker RT @MrMarketingMan: NC – loved this. Twitter from your Commodore 64. How cool is that? Both my wife and I still have C64s and last we checked, they worked. Back in the 80s, making the breadbox do things it wasn’t meant for was cool, but we mostly focused on sound and graphics. Any network connection would’ve been a dead slow modem
- RT @ruby_news Testing Rails with Rack::Test. I’m not using Rails, but I use Rack and Sinatra. If you’re developing Rack applications (no matter which framework) and not using Rack::Test you should either start doing so now or let me know which testing framework is even better.
- Another great recipe from Just Bento is the tuna tofu miso burger recipe. I haven’t tried this one, but I will as it looks and sounds delicious.
- RT @civis: 1 egg tamagoyaki (Japanese omelette) | Just Bento – I made one of these for lunch this week and loved it. Dead easy to make when following this recipe and low cal as well.
- RT @ruby_news: Profiling Ruby With Google’s Perftools – I know I’ll be using this in the coming month
- My favorite for the summer hit of 2010 – Shamblin’ Back by Brother D of Mail Order Zombie
- RT @atveit: RT @abiody Abiody ready for (cloud) business from July 1st – Interesting cloud oriented startup.
A fistful of links
Posted: June 10, 2009 Filed under: Uncategorized
Here’s a list of links I tweeted over the last week or so (as requested by @dmpetersson). I considered shamelessly removing the RT info, but came to my senses and carried it over. It clearly shows that the majority of my links are retweets, but then again more people might discover the interesting people who originally tweeted the links and follow them. I will start posting these digests weekly going forward.
- Fedora 11 is here – looks like a nice upgrade from Fedora 10 and remains a good alternative to Ubuntu.
- Benchmarks: You are Doing it Wrong – interesting read about benchmarking web applications
- RT @Venture_Capital Do Young Venture Capitalists Have an Edge? – Experience is good, but so is openness to new ideas
- RT @DanekS How to write a tech press release – don’t show off. Journalists and editors may not understand geek
- RT @tferriss: Find out which interests Google associates with your cookie – who does Google think you are?
- RT @ruby_news: Rack 1.0 Released
- RT @tferriss: How to choose colors for your brand – great samples: via @gmc
Startup Scrum
Posted: April 29, 2009 Filed under: startup | Tags: scrum
We now have a written backlog for our product and we have completed our first few sprints. Velocity has been much higher than expected. This is partly a result of being in a low overhead startup and partly a result of finding good tools and applying KISS whenever possible.
Actually getting the backlog documented felt very good. The backlog and sprints are defined in Google docs to make cooperation between me and my co-founder easier since we are not sharing an office. Instead of maintaining sprint progress on a whiteboard, we maintain it in a shared spreadsheet and update once a day. We do not perform a daily scrum since I am the only one working full-time, but we do a face-to-face sync at least once a week.
The sprint length is one week. Most sprints have finished early, as hinted at with the velocity comment earlier. The most extreme case was when the goal of the sprint was to demonstrate our product in an application similar to what one type of customer would have. The sprint was over Tuesday night. Some time on Monday that week, I realized that I needed an indexer and developed one using CouchDB map reduce. Phusion Passenger also made it easy to get my Rack application to run in Apache. In retrospect, CouchDB, Passenger and Rack saved me at least two weeks.
The current sprint should end with a demo of a customer application running Joomla talking to the Sincerial service and controlling the browser experience. Luckily, Joomla plugin development is straightforward and my rusty PHP foo seems to be sufficient. Too many uncertainties from the beginning and too much non-technical work this week don’t make me overly optimistic about having the demo this week.