The first week of full-time work for the startup is over, and what did I accomplish? I have dumped all of English Wikipedia into a CouchDB database and discovered that mapping over this amount of data on a single server is not possible. I find this weird, so I need to do some more investigation, but that can wait since I don’t really need it right now. Lookups are instant, as they should be, and that’s good enough for me.
I have also gathered the Norwegian Wikipedia and generated a Norwegian web corpus. Wikipedia should be statistically representative of the web, but the first version of the corpus was non-normalized. Since I wanted to do everything in Ruby, I had two options for stemming: wrap the available C Porter stemmer, or write one from scratch according to the rules described on the Norwegian Snowball page. I chose the latter, and since there is an exhaustive test set available with some 25000 or so word-stem pairs, I am confident my Ruby implementation is correct. Since Ruby 1.8 doesn’t come with UTF-8 support and 1.9.1 does, I had to figure out the simplest way of getting a minimum amount of international character handling. Turned out I only needed a UTF-8 length method, and word.scan(/./u).length takes care of that. Not efficient, but this is a temporary solution until I upgrade to 1.9.1. I should probably make my Norwegian Ruby stemmer publicly available. For now, if anyone needs it, give me a ping.
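For the curious, here is a minimal sketch of that length trick. The helper name utf8_length and the sample word are my own picks for illustration and not taken from the stemmer itself; only the scan expression is what I actually use.

```ruby
# encoding: utf-8

# Hypothetical helper name (an assumption for this sketch); the body is
# the expression mentioned above.
def utf8_length(word)
  # The /u flag makes the regex treat the string as UTF-8, so every
  # match of /./ is one whole character rather than one byte.
  word.scan(/./u).length
end

word = "blåbær"          # sample word chosen for illustration
puts utf8_length(word)   # number of characters
puts word.bytesize       # number of bytes, which is larger because å and æ take two bytes each
```

Building an array of every character just to count them is wasteful, which is why this is only a stopgap. Note that /./ does not match newlines, so this is only meant for single words.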
What has taken most of my time is administrative work: cleaning up the business plan, gathering market details and making initial contact with potential customers. As a result of the latter, I revised the business plan to reflect the positive contact established with two possible pilot customers. I hadn’t revised my resume in a while, so I did that as well. Sadly, my phone call to the local representative for governmental funding grants wasn’t too promising, but I sent him the business plan for review. Maybe he’ll see that we are on to something. I also handed the business plan over to the biz dev company we are working with for review and polishing.