Hadoop Status Reporting from Ruby

Hadoop Map-Reduce is a great tool for analyzing and processing large amount of data. There are a few things one needs to keep in mind when working with Hadoop. This is the simple solution to one possibly annoying problem.

Hadoop Logo

Hadoop expects reducers to emit something regularly. If a reducer runs for a long time without output, it will be terminated and retried. The error message in this case is something like “Task attempt X failed to report status for Y seconds”.

I bet some of you are thinking that this should not be a problem since the mappers should do all the work and not the reducers. This is mostly true, but if the job of the reducer is to feed a lot of data to a database that is not write-optimized, things may take a little time.

The trick is to regularly write to STDERR to let Hadoop know that your reducer is healthy and progressing.

Add this line to the input processing loop of your reducer:

STDERR.puts("reporter:status:things_ok") unless (count += 1) % 1000 > 0

This will emit reporter:status:things_ok every 1000 items which is a fine magical number. Substitute your favorite magic number as long as it’s not too big.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s