Apache Pig is a fantastic language for processing data. It is sometimes incredibly annoying, but it beats the hell out of writing a ton of map reduces and chaining them together. When iterating over joins, an issue that I know that I’m not the only one having ran into is referencing data after a join […]
Hadoop Map-Reduce is a great tool for analyzing and processing large amount of data. There are a few things one needs to keep in mind when working with Hadoop. This is the simple solution to one possibly annoying problem. Hadoop expects reducers to emit something regularly. If a reducer runs for a long time without […]
Note that the pitfall is limited to MRI (standard Ruby) version 1.9.2. MRI 1.9.3, JRuby and Rubinius does not have this behavior. I have been using Rubinius 2.0 to run machine learning experiments with libsvm lately. When running in Ruby 1.9.2, I noticed that my classifier always classified all samples as negative. I though this […]
This is my presentation from JavaZone 2010 Note that during my presentation, I showed the view section and basic replication directly in Futon instead of showing the fallback in the slides. What I did show was mostly the same, but naturally I showed some variations on the mappers as well.
I have now reached the end of my todo list for Pillow. That doesn’t mean it’s finished and ready to be stamped version 1.0. In it’s current incarnation it is fully usable and production ready, but in order to earn a 1.0 it needs to do a bit more. The current resharding always doubles the […]
Several articles and blog posts have been written about functional Ruby. They tend to focus either on whether Ruby is a functional language or how to do functional programming in Ruby. I am not planning to do either. This post will look into the benefits of a functional approach to Ruby and the transition from […]
If there is one thing that has bothered me about my choice of CouchDB as the main storage system for Sincerial, it’s the lack of an automatic system for shard management. In the early days of a startup, a single server is probably capable of handling all the necessary data. However, a successful service that […]