Hadoop Map-Reduce is a great tool for analyzing and processing large amounts of data. There are a few things to keep in mind when working with Hadoop, though. Here is a simple solution to one particularly annoying problem.
Hadoop expects its tasks to report progress regularly. If a reducer runs for a long time without emitting output or reporting status, Hadoop assumes it has hung, kills it, and schedules a retry. The error message in this case is something like “Task attempt X failed to report status for Y seconds”. The timeout is controlled by the mapred.task.timeout property and defaults to 600 seconds.
I bet some of you are thinking that this should not be a problem, since the mappers should be doing all the work, not the reducers. This is mostly true, but if the job of the reducer is to feed a lot of data into a database that is not write-optimized, a single reduce task can easily run past that timeout without producing any output.
The trick is to write a status line to STDERR at regular intervals, which tells Hadoop that your reducer is healthy and making progress.
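Under the hood, Hadoop Streaming scans each task's standard error for lines that follow its reporter protocol and treats them as control messages rather than log output. The status form is the one we need here; there is also a counter form, shown for completeness:

STDERR.puts("reporter:status:still alive")       # sets the task's status string
STDERR.puts("reporter:counter:MyJob,Rows,1000")  # increments a custom counter

MyJob and Rows are made-up names; the group, counter, and amount are whatever you want to track.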
Add this line to the input processing loop of your reducer (with count initialized to zero before the loop starts):
STDERR.puts("reporter:status:things_ok") if (count += 1) % 1000 == 0
This will emit reporter:status:things_ok every 1000 records, which is a fine magic number. Substitute your favorite magic number, as long as it is small enough that the gap between two reports stays well under the task timeout.
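To put it all together, here is a minimal sketch of a complete streaming reducer built around this trick. It assumes Hadoop Streaming's default tab-separated key/value lines on STDIN; slow_write is a hypothetical stand-in for whatever slow database client you are feeding, so adapt it to your own setup.

#!/usr/bin/env ruby
# Minimal streaming reducer: read records, feed them to a slow sink,
# and ping Hadoop regularly so the task is not killed for inactivity.

REPORT_EVERY = 1000 # magic number: keeps the gap between reports well under the timeout

# Hypothetical stand-in for an expensive insert into a database
# that is not write-optimized.
def slow_write(key, value)
  sleep(0.01)
end

count = 0
STDIN.each_line do |line|
  key, value = line.chomp.split("\t", 2)
  slow_write(key, value)

  # The heartbeat: one status line every REPORT_EVERY records.
  STDERR.puts("reporter:status:#{count} records written") if (count += 1) % REPORT_EVERY == 0
end

STDERR.puts("reporter:status:done after #{count} records")

You can dry-run it outside the cluster by piping sorted mapper output through it, e.g. sort mapped.txt | ruby reducer.rb.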