CouchDB on Amazon EC2 CentOS server with Sprinkle

Read the Getting Started part of Till Klampäckel’s CouchDB on Ubuntu on AWS blog post for some general information. I see no reason to repeat those things here.

Till stresses the need for a security group opening port 80, but you should also enable SSH on port 22, otherwise it will be impossible to install anything. The AMI I use is RightScale’s CentOS 5.2 i386 v4.2.4. If you need a 64-bit image, that should work just as well.

Make sure you have Sprinkle installed on the system you are installing from. If not, gem install sprinkle. Sprinkle is written in Ruby, so if you don’t have Ruby, you should start by installing that. Put the following in your Sprinkle file and name it something reasonable. I called it couchdb.rb.


# Sprinkle provisioning and deployment for CouchDB on
# an Amazon EC2 CentOS server

package :spidermonkey do
  source 'http://ftp.mozilla.org/pub/mozilla.org/js/js-1.7.0.tar.gz' do
    custom_dir 'js/src'
    custom_install "make BUILD_OPT=1 -f Makefile.ref && cp *.{h,tbl} /usr/include/ && cd Linux_All_OPT.OBJ && cp *.h /usr/include/ && mkdir -p /usr/{bin,lib}/ && cp js /usr/bin/ && cp libjs.so /usr/lib/"
  end

  verify do
    has_executable 'js'
    has_file '/usr/include/jsapi.h'
    has_file '/usr/lib/libjs.so'
  end
end

package :erlang_dependencies do
  yum %w( ncurses-devel openssl-devel)
end

package :erlang do
  description 'Erlang, the programming language'
  source 'http://erlang.org/download/otp_src_R13B01.tar.gz'

  verify do
    has_executable '/usr/local/bin/erl'
  end

  requires :erlang_dependencies
end

package :couchdb_dependencies do
  yum %w( curl curl-devel icu libicu-devel )
end

# - CouchDB 0.9.1
package :couchdb, :provides => :database do
  description 'CouchDB'
  version '0.9.1'
  source 'http://mirrorservice.nomedia.no/apache.org/couchdb/0.9.1/apache-couchdb-0.9.1.tar.gz' do
    post :install, 'adduser -r -d /usr/local/var/lib/couchdb -M -s /bin/bash -c "CouchDB Administrator" couchdb'
    post :install, 'touch /usr/local/var/log/couchdb/couch.log'
    post :install, 'chown couchdb /usr/local/var/log/couchdb/couch.log'
    post :install, 'mkdir -p /usr/local/var/lib/couchdb'
    post :install, 'chown couchdb /usr/local/var/lib/couchdb'
    post :install, '/usr/local/etc/rc.d/couchdb start'
    post :install, 'ln -s /usr/local/etc/rc.d/couchdb /etc/init.d/couchdb'
    post :install, 'chkconfig --add couchdb'
  end

  verify do
    has_executable '/usr/local/bin/couchdb'
  end

  requires :couchdb_dependencies
  requires :erlang
  requires :spidermonkey
end

package :rubygems do
  description 'Ruby Gems Package Management System'
  yum 'rubygems'
end

package :couchrest do
  description 'Rest API for CouchDB'
  version '0.33'
  gem 'couchrest'
end

policy :db, :roles => :db do
  requires :database
  requires :couchrest
end

# Deployment
deployment do
  delivery :capistrano do
    set :user, 'root'
    set :use_sudo, false
    set :run_method, :run

    role :db, 'ec2-x-y-z-w.eu-west-1.compute.amazonaws.com'
  end

  # source based package installer defaults
  source do
    prefix '/usr/local' # where all source packages will be configured to install
    archives '/usr/local/sources' # where all source packages will be downloaded to
    builds '/usr/local/build' # where all source packages will be built
  end
end

Replace ec2-x-y-z-w.eu-west-1.compute.amazonaws.com with the public DNS name listed on the Amazon Web Services instance view. You don’t need rubygems and couchrest unless you are going to use Ruby, but I decided to leave them in since CouchRest is a nice library to use when talking to CouchDB from Ruby.

Run it in a shell with sprinkle -s couchdb.rb. It might be interesting to check the powder cloud first, like this: sprinkle -cts couchdb.rb. The expected cloud looks like this:

--> Cloud hierarchy for policy db

Policy db requires package database
Selecting couchdb for virtual package database
  Package couchdb requires couchdb_dependencies
  Package couchdb requires erlang
    Package erlang requires erlang_dependencies
  Package couchdb requires spidermonkey

Policy db requires package couchrest
  Package couchrest requires rubygems

Set up an SSH tunnel to get to the remote Futon. I tend to use port 5994 locally to avoid conflicts with the local CouchDB: ssh -L 5994:localhost:5984 root@ec2-x-y-z-w.eu-west-1.compute.amazonaws.com where, once again, ec2-x-y-z-w.eu-west-1.compute.amazonaws.com should be replaced with the public DNS name listed on the Amazon Web Services instance view. Point your browser to http://localhost:5994/_utils/ for that familiar Futon view.


Finding the one in an ocean of provisioning and deployment systems

I currently use some shell scripts for deployment. This works well on my development server. It also works when starting to run things in a data center, but there are limitations and shortcomings. For one, my deployment scripts do not have any verification. I can add that, but then I would be duplicating work that others have already done well. More important is the fact that my deployment scripts assume a prepared installation environment with all the needed packages and gems installed. I want a deployment system that also provisions the target server with all my prerequisites like Apache or nginx, Phusion Passenger, Sinatra and CouchDB. Note that our application does not use Rails. The storage is handled by CouchDB and non-UI work is handled by Rack applications running directly in Passenger. The UI is handled by Sinatra.

We are looking at running our service in Amazon EC2 when we are ready to launch. As we use Fedora and Ubuntu for development, the obvious candidate distros are the same or related distros like CentOS, RHEL and Debian. If the deployment system handles package installation through an abstraction layer, that layer must support both yum and apt.

Chef
Chef is built around a Chef server hosting the deployment scripts (cookbooks) and Chef clients managing the worker nodes. It uses CouchDB in the background for storing information. From what I see, Chef is very powerful and looks like a nice way of handling huge clusters. The power also seems to have the side effect that doing simple things in a smallish cluster is a lot of work. It uses Rake underneath.

Capistrano
Capistrano uses its own DSL in combination with Ruby. Capistrano is Rails-centric, but has instructions showing how to use it for non-Rails deployment. The deployment instructions mostly run shell scripts, which makes it very flexible.

Vlad the Deployer
Vlad seems to be an extension to Rake and aims to be a better Capistrano. It seems very Subversion oriented. One of the features is turnkey deployment for mongrel+apache+svn. We’re not using Mongrel or svn, and although we use Apache now we might switch to nginx soon, so this all seems wrong for us.

Puppet
Puppet is an extremely flexible and powerful provisioning system. However, as is often the case, this comes at a cost. Doing simple things like our deployment is very time consuming in Puppet. It uses its own DSL to specify very accurately how the target environment should look after running Puppet. It is capable of running in the background and syncing automatically every 30 minutes or so. The feature set is overkill in our case right now, and if we need those features later I’ll be a happy man who will gladly spend time moving to Puppet.

Sprinkle
Sprinkle uses Ruby directly instead of a task specific DSL. Full support of the main deployment systems is included. What puzzled me at first was that it supports Capistrano and Vlad the deployer for remote command delivery. Later I realized that Sprinkle is a provisioning system and leaves application deployment to other specialized systems. Nice separation.

Moonshine
One of the requirements of Moonshine is “a server running Ubuntu 8.10”, which effectively disqualifies it for our use. I’m sure supporting other distros is not much work, but I’m not willing to do that work at this point anyway.

Passenger_stack
Might seem weird to include this in the list, but we use Passenger and anything that can get us there faster is considered. This would get us apache, passenger and lots of other things we do not use like MySQL. Since I do not want to install things we don’t need and passenger_stack only works on Ubuntu (and not even Debian), it is disqualified.

Conclusion
With Moonshine out due to lack of distro support and Vlad the Deployer out due to being mongrel+svn focused and puppet being too difficult, the following candidates are still contestants: Chef, Capistrano and Sprinkle.

Sprinkle has nice dry-run functionality allowing you to see what will happen to your remote servers before you do anything. Capistrano gets praise for ease of application deployment, so the Sprinkle and Capistrano combo sounds interesting. My choice is now down to Sprinkle or Chef for the server provisioning, and the winner is Sprinkle. The reason for my choice is that it is simpler and does what I currently need it to do. Migrating to Chef in the future is still an option.


The CouchDB indexer – lightweight search engine in hours

Have you ever been in a situation where you needed to create a reverse lookup index of some documents you had lying around?

A reverse lookup index is the kind of index used by the search engines (or Googles if you like) of this world. Creating a reverse lookup index isn’t hard, but you would normally expect to spend a couple of weeks writing code to create one unless you decide to use an existing indexer like Lucene. Using Lucene is a bit heavy if the indexing is not core to your application. If you store your documents in CouchDB, you can create an indexer by writing just 4 lines of JavaScript. You should have at least one more line for safeguarding your input values, but a search engine indexer in 5 lines of JavaScript is IMHO still pretty good.

The prerequisite for this indexer is that the documents you want to index all have a vector field containing a document vector in the form of a hash mapping term to term weight {<term0> => <weight0>, <term1> => <weight1>, …, <termn> => <weightn>}. The weight could be just the number of times a term occurs in the document or some metric indicating how important a word is to a document relative to all possible documents. The traditional weight used in search engines is tf-idf (term frequency multiplied by inverse document frequency). Head over to Wikipedia if you want to learn more about tf-idf. Of course, if you just want to find all documents matching a query, you can ignore the weights completely.
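As a rough illustration of the tf-idf weighting mentioned above, here is a minimal Ruby sketch. It assumes each document is already available as a hash of raw term counts, and it uses one common variant of the formula (raw count multiplied by the natural logarithm of total documents over documents containing the term). The helper name tf_idf_vectors and the sample data are made up for this example.

```ruby
# Hypothetical helper: turns raw term counts into tf-idf weights.
# counts_by_doc maps a document id to a { term => count } hash.
def tf_idf_vectors(counts_by_doc)
  total_docs = counts_by_doc.size.to_f

  # Document frequency: in how many documents does each term occur?
  doc_freq = Hash.new(0)
  counts_by_doc.each_value do |counts|
    counts.each_key { |term| doc_freq[term] += 1 }
  end

  counts_by_doc.each_with_object({}) do |(doc_id, counts), vectors|
    vectors[doc_id] = counts.each_with_object({}) do |(term, count), vector|
      idf = Math.log(total_docs / doc_freq[term])
      vector[term] = count * idf
    end
  end
end

# Made-up sample data
counts = {
  'doc1' => { 'couchdb' => 2, 'index' => 1 },
  'doc2' => { 'couchdb' => 1, 'ruby' => 3 }
}
vectors = tf_idf_vectors(counts)
```

With this variant, a term that occurs in every document (couchdb in the sample) ends up with weight 0, while terms found in only one document keep a positive weight.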

If you have document vectors, you have all the input data you need to create a reverse lookup index. This is the mapper if you just want to get all documents matching a query. Note that it completely ignores the term weights:


function(doc) {
    if (!('vector' in doc)) return;
    var vector = doc.vector;
    for (var term in vector) {
        emit(term, doc._id);
    }
}

The function operates on each document in the database. The first statement is a simple safeguard ensuring that we don’t try to access vector if the document doesn’t have that property. Since CouchDB is schema free, different types of documents with different fields may be stored in the same database. If the vector property is there, we store it in a variable called vector. For each element in vector, we emit the key which is the term and the id of the document we are operating on. If you run just this mapper, you will get a list of term to single document id mappings. This is a major step forward since we now have a reverse mapping of the database.

To get the reverse database map into something that is quick and easy to look up, we need a reducer. Its purpose is to convert the list of term, document id pairs into a single term to document id list.


function(keys, values) {
    var docs = [];
    for (var i = 0; i < values.length; ++i) {
        docs.push(values[i]);
    }
    return docs;
}

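In this situation, we don’t care about the keys. CouchDB handles that for us. We need an array to store all the document ids. Then we iterate over all the values and push them into our array. Finally we return the array. When the view is queried with grouping, CouchDB collects the values per term, so we end up with a single document id list for each term. Note that on larger data sets CouchDB may also call the reducer again on its own partial results (with a third argument, rereduce, set to true), a case this simple version does not handle.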

This index can be used to find all documents matching a given set of terms. Note that there is not much sophistication in this method so the only rank score you can get is the number of matching terms. Adding term weights to the index will give you something to use for ranking. Change the emit line in the mapper to

emit(term, [doc._id, vector[term]]);

This will give you a list of document id, term weight pairs for each term instead of just the document ids.
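To show how the weighted index might be used for ranking, here is a small Ruby sketch. It assumes the grouped view output has already been loaded into a hash mapping each term to its list of [document id, weight] pairs. The rank helper and the sample index are hypothetical, and summing the weights is just one simple scoring choice.

```ruby
# Hypothetical in-memory copy of the grouped view output:
# term => list of [document id, term weight] pairs.
index = {
  'couchdb' => [['doc1', 2.0], ['doc2', 1.0]],
  'ruby'    => [['doc2', 3.0]],
  'index'   => [['doc1', 1.0], ['doc3', 0.5]]
}

# Scores every document matching at least one query term by summing
# the weights of the terms it matches, best score first.
def rank(index, query_terms)
  scores = Hash.new(0.0)
  query_terms.each do |term|
    (index[term] || []).each do |doc_id, weight|
      scores[doc_id] += weight
    end
  end
  scores.sort_by { |doc_id, score| [-score, doc_id] }
end

results = rank(index, %w[couchdb ruby])
```

For the query couchdb ruby, doc2 matches both terms and comes out ahead of doc1, which only matches couchdb.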

That mapper is 7 lines of code, 4 lines if you don’t count the function declaration line and lines only containing curly braces. Ignoring the safeguard as well, the mapper body can be reduced to this single line by also skipping the temporary variable assignment:

for (var term in doc.vector) emit(term, [doc._id, doc.vector[term]]);

In the same manner, the reduce function may be reduced to a body of just 3 lines of code:

var docs = [];
for (var i = 0; i < values.length; ++i) docs.push(values[i]);
return docs;

That’s the power and beauty of CouchDB map reduce. You can write a search engine indexer in 4 lines of JavaScript. Granted, you need to create vectors of your documents in advance, but that’s just a matter of parsing a text string and splitting on whitespace and/or punctuation into an array and reducing the term array into a hash of <term> => <weight> pairs. Sure, you can do this with a map reduce as well, but that might be overkill since you will only be operating on a single document at a time.
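The vector creation described above could look something like this in Ruby. This is only a sketch: the helper name document_vector is made up, the weight is a plain term count, and splitting on anything that is not a letter or digit is an assumption about what counts as a term.

```ruby
# Hypothetical helper: builds a { term => count } document vector by
# lowercasing the text and splitting on anything that is not a letter
# or a digit.
def document_vector(text)
  terms = text.downcase.split(/[^[:alnum:]]+/).reject(&:empty?)
  terms.each_with_object(Hash.new(0)) { |term, vector| vector[term] += 1 }
end

vector = document_vector('CouchDB is fun. Indexing with CouchDB is easy!')
```

The resulting hash can be stored as the vector field of the document before saving it to CouchDB.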

One last important point is that while Futon, the CouchDB browser client, will show the expected results, you have to explicitly tell CouchDB to group the result if you want to use this in your application. My database is called pages, the design document is called demos and the view is called index, making the output of the map reduce available as JSON at
http://localhost:5984/pages/_design/demos/_view/index?group=true
Thanks to J. Chris Anderson for clarifying and pointing out the grouping query usage.
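For use from an application, a lookup against this view might be sketched in Ruby as follows. The database, design document and view names are the ones from this post, while the host, port and the helper names index_uri and lookup are assumptions. The key parameter has to be a JSON value, which is why the term is serialized with to_json.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds the view URL. The default base matches the names used in this
# post; host and port are assumptions.
def index_uri(term = nil, base: 'http://localhost:5984/pages/_design/demos/_view/index')
  params = { 'group' => 'true' }
  # CouchDB expects the key as a JSON value, so a string key needs quotes.
  params['key'] = term.to_json if term
  URI("#{base}?#{URI.encode_www_form(params)}")
end

# Fetches the rows for one term. Needs a running CouchDB to work.
def lookup(term)
  body = Net::HTTP.get(index_uri(term))
  JSON.parse(body)['rows']
end

uri = index_uri('couchdb')
```

Calling lookup('couchdb') against a running server should return the rows for that term only. Leaving out the term gives the URL for the whole grouped index.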


Meet the people

In the beforetime, before I started working in search engine companies, I had some direct customer relations. I didn’t think much about it back then, but I have great memories of the experiences I got from dealing with customers. I tend to be rated extrovert on personality tests, but if that’s true, I’m a conditional extrovert. Granted, I enjoy spending time with friends and socializing, but I’m also quite fond of sitting by myself in front of a keyboard. What I remember from that time was that I was always nervous before meeting the customers, but when we were face to face, everything was fine and I enjoyed their company as well as the learning experience.

During the last years, direct customer relations haven’t been that much of an issue. There was a time while I was working for FAST Search & Transfer that we had fairly regular conference calls with one particular customer. They had valuable deep web content that was unavailable to crawlers. We did a joint project between the two companies to make their content available to our web search index. This whole experience did not have much in common with a traditional customer-provider relationship. I have held training sessions and presentations internally over the years, but while I might be a little nervous before presenting to lots of people, it’s mostly just business as usual. In other words, no customer fronting experiences for me in almost 10 years.

Now that I’m bootstrapping a company, I have started talking to (potential) customers again. I am lucky enough to have a friend as pilot customer so it’s once again not a traditional customer-provider relationship. He is also talking to other potential customers through his network and gathers feedback. Getting that feedback and seeing that it mostly reflects what we expect is great, but I always learn new things whenever I talk to him. These things might seem obvious or minor to him, but they are important for us, either for confirming our assumptions or for showing that our assumptions are wrong.

As I mentioned in the beginning, I’m a conditional extrovert, but meeting customers again makes me realize that I have missed this. Presenting our plans to governmental institutions before applying for a grant and other potential future investors is fun. I’ll be the first to admit that I may be nervous before such presentations, but when the first page of the presentation is up and I start talking it’s more of a rush than anything else. Answering questions from customers and investors is fun too because I believe we are on to something. Moreover, we are talking about something that no one has told me to do. I am the one airing my thoughts to others and they react positively. It’s a great way of building confidence in yourself and what you are doing.

I remember another thing from the beforetime. A salesman in the company I worked for and I were working together at a trade fair when some potential future customers that he had briefly been in touch with before showed up. He talked them through the things we could do for them and generally spent some time discussing various alternatives with them. When they had left, he almost celebrated and said aloud to himself “I’m so good at this.” and he was. He was extremely good at what he was doing. For me, this was a great learning experience. I observed him during this session and others and learned (I hope), but most of all I realized how big an energy rush you can get from good talks with customers.


Lifestyle design

I read Jeremy Lattimore’s blog post about lifestyle design on Startup Student the other day and it got me thinking. Sure, I have thought about the concept several times since I decided to found a company rather than taking a regular job, but reading that blog post made me more conscious about my decisions. Furthermore, it forced me to reflect on how starting a company is lifestyle design in many ways.

The stereotypical lifestyle design as described in David Risley’s example is to come up with a fairly cheap, easy to sell product. While I’ve not gotten around to reading Tim Ferriss’ 4-Hour Workweek (it’s in my house in paper form, honestly), I think this is the primary example Tim uses in his defining book about lifestyle design as well. This approach is somewhat different from the approach used by Jeremy Lattimore, but the final goal is basically the same.

In my case, I had something that I desperately wanted to do. There is a theme that has been at the core of some things I have enjoyed working on over the last few years and I believe there are lots of business opportunities centered around this technology. When I set out, I did not want to start this all by myself. I wanted to have one to three other people heavily involved both to discuss plans and to verify that the idea was good. The one I wanted the most, since I knew he also has an interest in the core technology, was very eager to be involved and things started to happen. Now, we have a business development company involved, a pilot customer and leads to the next batch of customers as well as a possible lead to a large number of long term customers.

So, I’m starting a company and that does not exactly sound like lifestyle design. Starting a company means working long hours, with little or no pay and making things difficult for oneself. This is in many ways true. Generating revenue doesn’t happen overnight, and the bigger your goals are, the longer it takes for that revenue stream to turn into a profit. I am willing to work long hours and I expect to have to live off my savings for a long time and get some funding to hire more people in the not too distant future, but… In the current situation, I work from home and I work when I want to rather than when some early bird has decided that I should work. I don’t feel bad about taking breaks during the day and I don’t force myself to pretend to work when I don’t think I will get anything done because I’m either too tired or my mind is preoccupied with something else. This does not mean that I’m slacking. I’m just focusing on output rather than time.

Looking back at my weekly sprints, I see that I mostly finish ahead of time and only once did I fail to complete my sprint in time. I actually worked longer hours during the failed sprint than otherwise since it was much harder than expected. Now, as I write this blog post, I am way ahead of the original schedule and we should have no technical problems meeting our first major milestone, first customer live.

Is this lifestyle design? To me, it is. I’ve been coding since I was 11 or so and I started because it was interesting and fun. The service we are developing is something I would like to use myself and I’m having fun developing it. While I don’t have a steady stream of money flowing into the bank at the moment, I believe that this service will make money and generate a profit at some point and while getting there, I am having fun coding and writing blog posts. I expected to have fun along the way when I decided to try out the startup life and I feel in control. I think this is very much in tune with Jeremy Lattimore’s lifestyle design thoughts.


The week of June 15 in links

Last week’s links. Lots of RTs as usual. The span of links goes from zombie music all the way to cooking, but most of the links are to more classic geek stuff:

  • I wonder how much time I will spend on getting Kindle stuff to work on my netbook. My Asus eee 1000he is my current book reader and I prefer it to paper books. My ebook reader app of choice was eReader, but now they’re refusing to sell me books since I live in Norway so I need to find alternatives. Luckily, I still have a stack of unread books in eReader, but they will run out eventually and then I need to find another source of books. A Kindle app could be the best solution.
  • RT @phusion_nl Phusion Passenger 2.2.3 “Bug Fix Edition” released. Passenger makes my set of Racks including Sinatra run in a web server environment (Apache) that I trust. I will also test Passenger with nginx, but that’s a bit down on my todo list.
  • RT @startupstudent: New blog post: Why am I doing this?. I commented on the blog post, but I will actually publish my own take on startup life vs lifestyle design inspired by this post later today.
  • RT @timbury: RT @StevoMoviemaker RT @MrMarketingMan: NC – loved this. Twitter from your Commodore 64. How cool is that? Both my wife and I still have C64s and last we checked, they worked. Back in the 80s, making the breadbox do things it wasn’t meant for was cool, but we mostly focused on sound and graphics. Any network connection would’ve been a dead slow modem
  • RT @ruby_news Testing Rails with Rack::Test. I’m not using Rails, but I use Rack and Sinatra. If you’re developing Rack applications (no matter which framework) and not using Rack::Test you should either start doing so now or let me know which testing framework is even better.
  • Another great recipe from Just Bento is the tuna tofu miso burger recipe. I haven’t tried this one, but I will as it looks and sounds delicious.
  • RT @civis: 1 egg tamagoyaki (Japanese omelette) | Just Bento – I made one of these for lunch this week and loved it. Dead easy to make when following this recipe and low cal as well.
  • RT @ruby_news: Profiling Ruby With Google’s Perftools – I know I’ll be using this in the coming month
  • My favorite for the summer hit of 2010 – Shamblin’ Back by Brother D of Mail Order Zombie
  • RT @atveit: RT @abiody Abiody ready for (cloud) business from July 1st – Interesting cloud oriented startup.

A fistful of links

Here’s a list of links I tweeted over the last week or so (as requested by @dmpetersson). I considered shamelessly removing the RT info, but came to my senses and carried them over. Clearly shows that the majority of my links are retweets, but then again more people might discover the interesting people who originally tweeted the links and follow them. I will start posting these digests weekly going forwards.


OAuth and the 4 hour work day

I must admit that I’m quite fond of the 4 hour work day. By that I don’t mean working exactly 4 hours per day, but the principle as explained by Scott Young. Come to think of it, the 4 hour part isn’t important at all, it’s the daily todo lists I like. I understand the desire to limit the actual time working if you hate what you do, but I am excited about my work and rather enjoy spending time on it.

On a given day, I start out making a list of the things I want to achieve on that day. If I get through my todo lists correctly every day in a week, I will successfully complete my (scrum) sprint. Of course, since I’m developing from scratch, estimation is much easier than it would be if I were surrounded by legacy systems. Interestingly, I normally overestimate and end up completing tasks much faster than expected. Maybe the standard software developer estimation factor (multiply the estimate by 3.14) is a result of reasonable estimates meeting legacy systems.

This was meant to be about OAuth and not estimation I suppose. The spec is a bit too much to go into here. OAuth is:

An open protocol to allow secure API authorization in a simple and standard method from desktop and web applications.

Letting OAuth speak for itself rather than trying to explain it with my own words felt good.

When needing to support OAuth authentication on our servers, I started out developing a consumer in Ruby as a Rack application. To test it, I used Agree2 since there is an excellent tutorial using Agree2 as an example service. Following the tutorial, I had a working consumer faster than estimated. The next step was to get my own OAuth Server running as a Rack application. Now this was a gauntlet run. First, I needed to get the communication sequence right. That wasn’t too hard, but I missed my estimates slightly. Next up was to start making things secure by verifying and signing messages. At this point things slowed down substantially. I had to read parts of the spec again and again while staring at my code. I eventually got my Rack based consumer to authorize with my Rack based server. All estimates were off, but at least it was working.

I know that our customers will use various systems to talk to our service. We expect to develop plugins for our service in their systems ourselves, and that means we should do the OAuth support as well. Our pilot customer has a PHP based system so the next step was to make the plugin for that system use OAuth. This was also the time I implemented verification of signed requests on the server side. I’ll cheat again with the description of the signature base string which is the basis for the signature:

The Signature Base String is a consistent reproducible concatenation of the request elements into a single string. The string is used as an input in hashing or signing algorithms. The HMAC-SHA1 signature method provides both a standard and an example of using the Signature Base String with a signing algorithm to generate signatures. All the request parameters MUST be encoded as described in Parameter Encoding (Parameter Encoding) prior to constructing the Signature Base String.

The OAuth spec is pretty specific on most things, but there are a few things that can cause problems when you start signing messages. Getting HTTP GET requests to work when signed was pretty straightforward. HTTP POST was worse. First of all, I am used to posting JSON directly without thinking in key, value pairs. This has interesting side effects when the PHP library treats the JSON as a key and sorts it before the HTTP authentication parameters in the signature base string while the Ruby library sorts it after the HTTP authentication parameters. The PHP Consumer Tutorial I followed suggested you just convert your payload to a simple array. Surprisingly, my JSON ended up not being included in the signature base string generated by the PHP library at all. The Ruby library, however, included it, which is of course the right thing to do. The solution was to use an associative array with key ‘payload’. This was included in the signature base string by both the PHP and Ruby libraries. I had full functionality.
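To make the signature base string a bit more concrete, here is a minimal Ruby sketch of how it can be built and signed with HMAC-SHA1 following the OAuth 1.0 rules. It is not the code of either library discussed here. The helper names and all keys, secrets and values are made up, and URL normalization (lowercasing the host, dropping default ports) is left out.

```ruby
require 'openssl'

# Percent-encodes everything except the unreserved characters
# (letters, digits, '-', '.', '_' and '~'), as OAuth 1.0 requires.
def percent_encode(value)
  value.to_s.b.gsub(/[^A-Za-z0-9\-._~]/) { |char| format('%%%02X', char.ord) }
end

# Joins method, URL and the sorted, encoded parameters with '&'.
def signature_base_string(http_method, url, params)
  normalized = params.map { |name, value| [percent_encode(name), percent_encode(value)] }
                     .sort
                     .map { |pair| pair.join('=') }
                     .join('&')
  [http_method.upcase, percent_encode(url), percent_encode(normalized)].join('&')
end

# HMAC-SHA1 over the base string, Base64 encoded without line breaks.
def hmac_sha1_signature(base_string, consumer_secret, token_secret = '')
  key = "#{percent_encode(consumer_secret)}&#{percent_encode(token_secret)}"
  [OpenSSL::HMAC.digest('sha1', key, base_string)].pack('m0')
end

# All values below are made up for illustration.
params = {
  'oauth_consumer_key' => 'demo-key',
  'oauth_nonce' => 'abc123',
  'oauth_signature_method' => 'HMAC-SHA1',
  'oauth_timestamp' => '1245000000',
  'payload' => '{"name":"Alfa Romeo"}'
}
base = signature_base_string('post', 'http://localhost/items', params)
signature = hmac_sha1_signature(base, 'demo-secret')
```

Because the parameters are sorted by name, payload ends up after the oauth_ parameters here. If the consumer and the server disagree about where a parameter sorts, or whether it is included at all, they build different base strings and the signatures will not match.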

Next up is the handling of space in URLs. PHP is fond of encoding spaces as plus signs: “Alfa Romeo” -> “Alfa+Romeo” while Ruby tends to use the arguably more correct %20 encoding: “Alfa Romeo” -> “Alfa%20Romeo”. To my surprise, I never had any problems with this. I did have URL encoding issues, but they weren’t as simple as this.
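The two encodings of a space are easy to compare side by side. This small Ruby sketch uses only standard library calls and shows that Ruby itself offers both conventions, so which one you get depends on the method a library happens to call.

```ruby
require 'erb'
require 'uri'

name = 'Alfa Romeo'

# Form-style encoding, the same convention PHP's urlencode uses.
plus_encoded = URI.encode_www_form_component(name)

# Percent-style encoding, which is what OAuth signing expects.
percent_encoded = ERB::Util.url_encode(name)

# Decoding either form gives back the original string.
decoded_plus = URI.decode_www_form_component(plus_encoded)
decoded_percent = URI.decode_www_form_component(percent_encoded)
```

Here plus_encoded is Alfa+Romeo and percent_encoded is Alfa%20Romeo. Both decode to the same value, which may be part of the reason the difference is often harmless until the encoded string itself is signed.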

After deploying my Rack applications to Phusion Passenger and Apache, the Ruby library started complaining about “http://localhost/Some Name/something” not being a proper URI. URI(is not URI?) is what it said. Now it was easy to see that the space was the problem, but not how it became a problem. After scanning through the library, I found that it uses the parse method of the standard Ruby uri module to verify correctness of URLs. Furthermore, I found that it uses env['PATH_INFO'] to generate the “/Some Name/something” part of this URL. I tried to use CGI.escape on this value, but that escaped a bit too much. The sad, but working solution was to substitute ' ' with '%20' like this: request.path_info = env['PATH_INFO'].gsub(' ', '%20')
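The behaviour described above can be reproduced outside the OAuth library with a few lines of Ruby. This is a sketch using only the standard uri and cgi modules. The helper names escape_spaces and parseable? are made up for the example.

```ruby
require 'cgi'
require 'uri'

# Same substitution as above, wrapped in a hypothetical helper.
def escape_spaces(path_info)
  path_info.gsub(' ', '%20')
end

# Returns true when the standard uri module accepts the string.
def parseable?(url)
  URI.parse(url)
  true
rescue URI::InvalidURIError
  false
end

path = '/Some Name/something'

raw_ok     = parseable?("http://localhost#{path}")
escaped_ok = parseable?("http://localhost#{escape_spaces(path)}")
too_much   = CGI.escape(path)
```

The raw URL is rejected, the one with %20 parses fine, and CGI.escape turns the path into %2FSome+Name%2Fsomething, escaping the slashes as well, which is why it was no help here.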

Now everything works perfectly with my PHP consumer. A few questions remain. The first and most important question is whether I trust these libraries after struggling like this. The easy answer is that I do. There are no security issues, just some pain and bloodshed needed to get them to work. The next question is what happens when we need to support consumers in ColdFusion Markup Language, ASP and other web scripting languages. I don’t know the answer to that. I guess the experience working with PHP will be helpful, but I also expect other pitfalls to show up. Time will tell.

Now what has this OAuth stuff got to do with the estimation and 4 hour work day principle I started with? Actually a lot. While OAuth isn’t exactly a legacy system, the added complexity was comparable. Getting the handshaking to work properly was straightforward and implementation was similar to what I expected and this is the part I actually estimated. When I got to the cross language part and request signing, I encountered a string of unexpected hurdles much like you would see developing in a legacy system. If I were to follow the 4 hour work day principle mindlessly, I would have had to do a couple of all-nighters in a row. I do believe my estimation was off by a factor of around 3.14 on average.


Authentication and Authorization

We are building a service. This service needs to relate to end-users. The users will mostly interact with our service through third party sites. The users, the aforementioned third party sites and others will probably refer to us as the third party, but I allow myself to be selfish for now. Anyway, the bottom line is that we need authentication for users in our own service and also authorization of third party sites to allow them to access a user’s data only with the permission of the user.

Naturally, when one has a need for authentication, one looks at existing solutions. The classic implement yourself is of course an option and may I add, a strategy that is all the rage with the NIH (Not Invented Here) crowd. I’m not too fond of NIH, but I am a big fan of KISS (Keep It Simple, Stupid). While those principles aren’t direct opposites, simplicity mostly dictates using an existing solution. OpenID, Facebook Connect and Sign in with Twitter exist and are all easy to use through existing libraries.

OpenID is cool. Unfortunately, we expect that most of our users do not have active OpenID accounts from day one. Forcing our users to sign up for an OpenID account in order to register for our service doesn’t exactly seem like a good idea. We do expect that a large percentage of our users will have Facebook accounts, but we are not developing a social networking service so it’s a bit weird to rely on Facebook Connect alone.

Since many of our users will have Facebook accounts, we will probably allow users to register with Facebook Connect. OpenID is however a more neutral system, so we will also support that. We need a simple way for users to register locally without having to register for an additional service as well. The solution to this is to provide OpenID ourselves.

The situation for authorization is a bit more straightforward. OAuth is a very simple and straightforward scheme that seems like a good match for our needs. Other solutions originally developed for single sign-on would work as well, but OAuth is designed for this and very simple to implement as well. Whether people are used to OAuth or not doesn’t really matter since it will only ask them to allow a site to access their data.

To sum up, we will support Facebook Connect, Sign in with Twitter and OpenID. To avoid introducing a local authentication scheme, we will provide OpenID ourselves and make registration simple. For authorization, we will go with OAuth.


Startup Scrum

We now have a written backlog for our product and we have completed our first few sprints. Velocity has been much higher than expected. This is partly a result of being in a low overhead startup and partly a result of finding good tools and applying KISS whenever possible.

Actually getting the backlog documented felt very good. The backlog and sprints are defined in Google docs to make cooperation between me and my co-founder easier since we are not sharing an office. Instead of maintaining sprint progress on a whiteboard, we maintain it in a shared spreadsheet and update once a day. We do not perform a daily scrum since I am the only one working full-time, but we do a face-to-face sync at least once a week.

The sprint length is one week. Most sprints have finished early as hinted to with the velocity comment earlier. The most extreme case was when the goal of the sprint was to demonstrate our product in an application similar to what one type of customer would have. The sprint was over Tuesday night. Some time on Monday that week, I realized that I needed an indexer and developed one using CouchDB map reduce. Phusion Passenger also made it easy to get my Rack application to run in Apache. In retrospect, CouchDB, Passenger and Rack saved me at least two weeks.

The current sprint should end with a demo of a customer application running Joomla talking to the Sincerial service and controlling the browser experience. Luckily, Joomla plugin development is straightforward and my rusty PHP-fu seems to be sufficient. Too many uncertainties from the beginning and too much non-technical work this week don’t make me overly optimistic about having the demo this week.

