Hadoop Status Reporting from Ruby

Hadoop Map-Reduce is a great tool for analyzing and processing large amounts of data. There are a few things one needs to keep in mind when working with Hadoop. This is a simple solution to one possibly annoying problem.

Hadoop expects reducers to report progress regularly. If a reducer runs for a long time without producing output or updating its status, the task attempt is terminated and retried. The error message in this case is something like “Task attempt X failed to report status for Y seconds”.

I bet some of you are thinking that this should not be a problem since the mappers should do all the work and not the reducers. This is mostly true, but if the job of the reducer is to feed a lot of data to a database that is not write-optimized, things may take a little time.

The trick is to regularly write a status line of the form reporter:status:<message> to STDERR to let Hadoop know that your reducer is healthy and progressing.

Add this line to the input processing loop of your reducer, with count initialized to 0 before the loop:

STDERR.puts("reporter:status:things_ok") if ((count += 1) % 1000).zero?

This emits reporter:status:things_ok every 1000 items, which is a fine magic number. Substitute your favorite magic number, as long as it’s not so big that the task times out between two status lines.
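For context, here is a minimal sketch of how the status line might sit inside a complete reducer loop. The helper name each_record_with_status and its parameters are assumptions made for illustration, not part of any Hadoop API; only the reporter:status: line format comes from the text above. The demonstration feeds the helper from memory so it can be run without a cluster.

```ruby
require "stringio"

# Hypothetical helper (not from the original post): yields each input line to
# the caller's block and writes a status line to `status` after every `every`
# records.
def each_record_with_status(input, status, every = 1000)
  count = 0
  input.each_line do |line|
    yield line.chomp
    count += 1
    status.puts("reporter:status:things_ok") if (count % every).zero?
  end
  count
end

# Small in-memory demonstration; a real reducer would pass STDIN and STDERR.
input  = StringIO.new("a\t1\nb\t2\nc\t3\nd\t4\n")
status = StringIO.new
total  = each_record_with_status(input, status, 2) do |record|
  key, value = record.split("\t", 2)
  puts "#{key}\t#{value}" # slow work, such as a database write, would go here
end
STDERR.puts "processed #{total} records, #{status.string.lines.count} status lines"
```

Passing the input and status streams as arguments keeps the loop easy to exercise in a test, while the reducer script itself would simply call the helper with STDIN and STDERR.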


Case statement pitfall when migrating to Ruby 1.9

I have been using Rubinius 2.0 to run machine learning experiments with libsvm lately. When running in Ruby 1.9.2, I noticed that my classifier always classified all samples as negative. I thought this was caused by issues with libsvm-ruby-swig, so I recompiled libsvm-ruby-swig from scratch, including rerunning swig, but nothing changed. Next, I switched to libsvmffi, but the result was the same. Realizing that I actually had some tests running a very simple classifier, and that these tests passed on 1.9.2, made me look closer at the code. What I found was that the behavior of the Ruby case statement has changed from 1.8.7 to 1.9.2.

In if statements, 1 is equal to 1.0 in both 1.8.7 and 1.9.2. In case statements, however, 1 matches 1.0 in 1.8.7 but not in 1.9.2.

Code snippet that shows the difference:

#!/usr/bin/env ruby

puts case 1.0
when 1
  "yay"
else
  "nay"
end

First, the output of the script when running 1.8.7:

$ rvm use ruby-1.8.7
Using /usr/local/rvm/gems/ruby-1.8.7-p334
$ ./case.rb
yay

And the same in 1.9.2:

$ rvm use ruby-1.9.2
Using /usr/local/rvm/gems/ruby-1.9.2-p180
$ ./case.rb
nay

Needless to say, I was puzzled by this result, but I was more surprised by the 1.8.7 behavior than by the 1.9.2 behavior. My assumption when I wrote the code was that I was dealing with integer values, and since it worked, I forgot about it. Next time you see different behavior between 1.8.7 and 1.9.2, it might be worth reviewing your case statements.
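One way to avoid depending on how a particular interpreter matches an Integer against a Float is to convert the value to the expected type before the case statement. The sketch below first prints the three comparisons involved, then shows a hypothetical helper, label_name, that coerces a class label with Integer() before matching. The helper and its label names are assumptions for illustration, and the sketch makes no claim about why 1.9.2 behaves as shown above.

```ruby
# Comparisons between an Integer and a Float with the same numeric value.
comparisons = {
  "1 == 1.0"    => (1 == 1.0),   # numeric equality
  "1 === 1.0"   => (1 === 1.0),  # case equality, the operator behind `when`
  "1.eql?(1.0)" => 1.eql?(1.0)   # strict equality: same class and same value
}
comparisons.each { |expression, result| puts "#{expression} is #{result}" }

# Hypothetical helper: coerce the label to an Integer first, so that the
# `when` clauses only ever compare Integer with Integer.
def label_name(label)
  case Integer(label)
  when 1  then "positive"
  when -1 then "negative"
  else "unknown"
  end
end

puts label_name(1.0)
puts label_name(-1)
```

Note that Integer() truncates a Float, so this only makes sense when the labels are known to be whole numbers, as they were assumed to be in the classifier code.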


A functional approach to Ruby

Several articles and blog posts have been written about functional Ruby. They tend to focus either on whether Ruby is a functional language or on how to do functional programming in Ruby. I am not planning to do either. This post looks into the benefits of a functional approach to Ruby and the transition from classic object-oriented thinking to functional thinking.

I consider the discussion about whether Ruby is a functional language academic, with no effect on my use of the language. There is no doubt that functional programming is possible in Ruby, but bear in mind that the language does not enforce pure functions without side effects. For a good overview of functional programming in Ruby, I suggest Khaled alHabache’s post Ruby and Functional Programming.

When I originally started developing in Ruby, I was used to object-oriented programming. I tended to make classes for all kinds of data objects and the result looked like a nicer, more readable version of Java code. While this works, it is not the most effective way of developing in Ruby (or other dynamic programming languages).

One of the benefits of Ruby (and many other dynamic programming languages) is the lack of static typing. In classic object-oriented development, you would define member variables and methods to operate on the variables. You don’t have to do that in Ruby since you have a flexible hash class that can store most of what you need. Once you have replaced all member variables with a hash, the hash is your object and the methods of the old class are just functions that operate on your hash.
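As a rough illustration of this idea, the sketch below represents a user as a plain hash and replaces what would have been instance methods with module functions. The module name UserFunctions and the hash keys are made up for this example.

```ruby
# Hypothetical example: a "user" is just a hash, and these functions replace
# what would have been instance methods on a User class.
module UserFunctions
  module_function

  def full_name(user)
    "#{user[:first_name]} #{user[:last_name]}"
  end

  # Returns a new hash instead of modifying the one passed in.
  def with_email(user, email)
    user.merge(email: email)
  end
end

user    = { first_name: "Ada", last_name: "Lovelace" }
updated = UserFunctions.with_email(user, "ada@example.com")

puts UserFunctions.full_name(updated)
puts user.key?(:email) # the original hash is left untouched
```

Because with_email returns a new hash rather than changing its argument, callers never have to worry about who else holds a reference to the original.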

There are situations where a hash doesn’t make sense. If you build an abstraction class, e.g. a storage system abstraction class, you might want to keep some information internally. Connection parameters for the storage system could be kept in a hash, but that doesn’t feel right. Interestingly, since there is normally only a single storage system of a particular type in use, you could make the storage abstraction class a singleton and keep all the connection parameters internally in traditional member variables.
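A storage abstraction of this kind could be sketched with the Singleton module from Ruby's standard library. The class name StorageGateway, its methods and the connection parameters are hypothetical, and no real storage system is contacted.

```ruby
require "singleton"

# Hypothetical storage abstraction: one instance per process, with the
# connection parameters kept in ordinary instance variables.
class StorageGateway
  include Singleton

  def configure(params)
    @host = params.fetch(:host)
    @port = params.fetch(:port)
    self
  end

  def endpoint
    "#{@host}:#{@port}"
  end
end

StorageGateway.instance.configure(host: "db.example.com", port: 5432)
puts StorageGateway.instance.endpoint
```

Every call to StorageGateway.instance returns the same object, so the rest of the code can pass plain hashes around without carrying connection details with them.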

Some information that you would store in a database or on a different server might be used often enough to justify an in-memory cache. Let me stress that I don’t like caches and I try to avoid them whenever possible, since they add complexity and the potential for inconsistent data among different servers. That being said, I do add caches when they are needed and, once again, caches can also be implemented as a hash. Some code needs to control the caches, but since you normally cache data coming either from storage or from a different server or service, you could put them in the abstraction class for that. If you need to write and use common code to manage the caches, you can easily build a mixin module.
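A mixin for hash-based caches might look like the following sketch. The module HashCache, the class ProfileService and the fake lookup inside it are all invented for illustration; a real abstraction class would fetch from storage or a remote service, and could be a singleton as described above.

```ruby
# Hypothetical mixin: adds a hash-backed cache to any class that includes it.
module HashCache
  def cache
    @cache ||= {}
  end

  # Returns the cached value for key, computing it with the block on a miss.
  def cached(key)
    cache.fetch(key) { cache[key] = yield }
  end

  def invalidate(key)
    cache.delete(key)
  end
end

# Hypothetical abstraction class for a remote profile service.
class ProfileService
  include HashCache
  attr_reader :lookups

  def initialize
    @lookups = 0
  end

  def profile(id)
    cached(id) do
      @lookups += 1 # stands in for an expensive remote call
      { id: id, name: "user-#{id}" }
    end
  end
end

service = ProfileService.new
service.profile(7)
service.profile(7)
puts service.lookups
```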

Interestingly, the model you end up with, where you only pass hashes around and keep consistent data in singletons, is similar to the Erlang gen_server behavior. This behavior is a general server template for Erlang, a functional language with immutable variables. The state variable is given as a parameter to all the general server functions. This allows the gen_server to maintain information in a functional setting.
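To make the comparison concrete, here is a rough Ruby sketch of the state-passing style. It is only loosely modeled on gen_server: the module CounterServer and its handle_call function are hypothetical, and each call takes the current state as a parameter and returns a reply together with the new state.

```ruby
# Hypothetical state-passing "server": no instance variables, the state hash
# is passed in and a new state hash is returned with every reply.
module CounterServer
  module_function

  def init
    { count: 0 }.freeze
  end

  def handle_call(request, state)
    case request
    when :increment
      new_state = state.merge(count: state[:count] + 1).freeze
      [new_state[:count], new_state]
    when :read
      [state[:count], state]
    else
      [:unknown_request, state]
    end
  end
end

state = CounterServer.init
_reply, state = CounterServer.handle_call(:increment, state)
reply, state = CounterServer.handle_call(:increment, state)
puts reply
```

Freezing each state hash is a design choice for the sketch: it mimics immutable variables by making accidental in-place modification raise an error.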

When you get used to keeping abstractions in singletons and your data in hashes, you can also use modules instead of classes to modularize your code and keep related functionality together. If you also use blocks to define the exact behavior inside a function, you end up with flexible code that is very easy to reuse.
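A small sketch of a module function whose exact behavior is supplied by a block follows. The module Stats and the request records are invented for this example.

```ruby
# Hypothetical module of related functions; the block decides which value is
# taken from each record, so the same function serves several purposes.
module Stats
  module_function

  def total(records)
    records.inject(0) { |sum, record| sum + yield(record) }
  end
end

requests = [
  { path: "/a", bytes: 120, millis: 15 },
  { path: "/b", bytes: 300, millis: 40 }
]

puts Stats.total(requests) { |request| request[:bytes] }
puts Stats.total(requests) { |request| request[:millis] }
```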

Keeping your object data in a hash and implementing functions without side effects make testing easy. What you get back from a function call depends only on the function parameters, so there is no need to test combinations of operations. In an object-oriented setting, member variables might not be observable, and even if the returned value of a method call is correct, the object might be in an undesired state that you cannot test without modifying the class to make internal variables accessible in your test system. The functional approach is cleaner and requires less test code.

The Sincerial system, being a request handling system, is built mostly functionally. There are singletons guarding the storage system and other cached data. The system also uses classic objects where that makes sense. Ruby is an object-oriented language and I believe in using the language features available when appropriate. This might sound like a contradiction of the whole post, but it’s not. My point is that you should avoid creating traditional classes when a hash can do the job, and use functional programming techniques actively to improve the maintainability, testability and readability of your code.

