Iterating over joins in Pig

Apache Pig is a fantastic language for processing data. It is sometimes incredibly annoying, but it beats the hell out of writing a ton of MapReduce jobs and chaining them together. One issue that I know I'm not the only one to have run into is referencing fields after a join in Pig.

Normally, you access fields using the dereference operators . or #, depending on the data type. The period, ., is used for tuples and bags, e.g. tuple.field0, tuple.field1, bag.field0, bag.field1. Maps are dereferenced with a hash, #, e.g. map#'field0', map#'field1'.
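As a minimal sketch of the dereference operators in use, assume a hypothetical input file data.txt whose rows each hold a tuple, a bag and a map. The file name, the aliases t, b and m, and the field names are made up for illustration:

```pig
-- Hypothetical input: every row has a tuple, a bag of tuples and a map.
data = LOAD 'data.txt' AS (
    t: tuple(field0: int, field1: chararray),
    b: bag{r: tuple(field0: int, field1: chararray)},
    m: map[]
);

-- Tuple and bag fields are dereferenced with a period,
-- map values are looked up by key with a hash.
fields = FOREACH data GENERATE t.field0, b.field1, m#'field0';
```

Note that b.field1 does not return a single value: it projects field1 out of every tuple in the bag and gives you a new bag.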

This does not work after a join. This is the iteration you would expect to write after a JOIN:


joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE list0.key;

This will fail with the obscure error: “scalar has more than one row in the output”. Pig reads list0.key as a scalar lookup into the relation list0, which only works when that relation has exactly one row. The confusing error message is a known problem and there is a ticket for it. As can be seen from the ticket, the correct way to iterate over the join is to use the disambiguate operator, ::, instead of the dereference operators, like this:


joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE list0::key;

If you give in to the temptation to skip the relation name and reference the field directly, like this:

joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE key;

You will get the more informative message: “Found more than one match: list0::key, list1::key”.

What you are really doing after a join is addressing columns in relations. For users, addressing columns in a relation with a period would be easier, but using :: might make the underlying code easier to understand.
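To see which column names a join actually produces, you can ask Pig to describe the joined relation. The sketch below assumes two hypothetical input files, list0.txt and list1.txt, with made-up value0 and value1 columns next to the key:

```pig
-- Hypothetical inputs: both relations share the column name "key".
list0 = LOAD 'list0.txt' AS (key: chararray, value0: int);
list1 = LOAD 'list1.txt' AS (key: chararray, value1: int);

joined = JOIN list0 BY key, list1 BY key;

-- Prints the schema of the join; every column is prefixed with
-- the relation it came from, e.g. list0::key and list1::key.
DESCRIBE joined;

-- "key" exists on both sides, so it needs the :: prefix.
-- value0 and value1 are unique and can be used without one.
purified = FOREACH joined GENERATE list0::key AS key, value0, value1;
```

Renaming list0::key with AS key in the projection means later statements can refer to it as plain key again.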


A fistful of links

Here’s a list of links I tweeted over the last week or so (as requested by @dmpetersson). I considered shamelessly removing the RT info, but came to my senses and carried it over. It clearly shows that the majority of my links are retweets, but then again more people might discover the interesting people who originally tweeted the links and follow them. I will start posting these digests weekly going forward.


CouchDB to the rescue

Got CouchDB installed on my Fedora box. This thing is sweet. Working with a RESTful JSON/HTTP storage system is so much easier than old-fashioned SQL databases. If I were to store users and a list of stuff per user, where this stuff could be shared among several users, in a relational db, I would create a table for the users indexed on userid, a table for the stuff indexed by stuffid and a table of userid to stuffid relations. Then of course I would need lots of boilerplate code to work with the database.

In CouchDB, I would have a database of users where a user doc is stored under /users/<userid>. The stuff would be stored as documents under /stuff/<stuffid> and the relations would be stored either in the user document or in a separate database /userstuff/<userid>/. An important difference is that no matter if the relation information was stored in the user database or in a separate database, the document stored at that location would have to be replaced whenever stuff is added or removed for a user. This makes me prefer putting the relations in a separate database rather than keep updating the user document.

It was hard to believe that you could get a simpler interface than pure HTTP, but still I had to test CouchRest by Chris Anderson. This made working with CouchDB even easier. With all this stuff in place, all development is a breeze since I don’t have to spend time on the nitty-gritty repetitive low-level boilerplate stuff.

