Iterating over joins in Pig

Apache Pig is a fantastic language for processing data. It is sometimes incredibly annoying, but it beats the hell out of writing a ton of map reduces and chaining them together. When iterating over joins, an issue that I know that I’m not the only one having ran into is referencing data after a join in pig.

Normally, you access fields using the dereference operators . or # depending on the data type. The period symbol, . is used for tuples and bags, i.e. tuple.field0, tuple.field1, bag.field0, bag.field1. Maps are dereference with a hash, #, i.e. map#’field0′, map#’field1′.

This does not work after a join. The expected iteration after a JOIN:


joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE list0.key;

This will fail with the obscure error: “scalar has more than one row in the output”. This error message is a known problem is and there is a . As can be seen from the ticket, the correct way to iterate over the join is by using the relation operator, :: instead of the dereferencing operators like this:


joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE list0::key;

If you fall for the temptation of skipping the name of the list to get the field from like this:

joined = JOIN list0 BY key, list1 BY key;
purified = FOREACH joined GENERATE key;

You will get the more informative message: “Found more than one match: list0::key, list1::key”.

What you are really doing after a join is addressing columns in relations. For users, addressing columns in a relation with a period would be easier, but using :: might make the underlying code easier to understand.

Advertisements

2 thoughts on “Iterating over joins in Pig

  1. That’s not quite correct. “::” is not an operator at all. It’s a disambiguating string Pig inserts to let you distinguish between fields that come from different sides of the join. If you say “describe joined”, you will see that it does not have fields called “list0” and “list1” — internal fields of those relations get flattened out during a join. So referencing “list0” is illegal — or, due to Pig’s scalar feature, might mean referencing the *relation* list0, and treating it as a scalar. Which is why it complains when it discovers list0 is not, in fact, a scalar. That’s the known bug — bad error messaging. We should warn you earlier since we can figure out what you *probably* mean when you say list0.key .

    You can hit the same error without a join:

    Let’s say you have a relation called “foo” with fields “some_field” and “some_other_field”.

    You want to extract all the values of some_field, so you write:

    newfoo = foreach foo generate foo.some_field;

    You are iterating over foo, so at every iteration, you are working on a single row from foo, which has a schema (“some_field”, “some_other_field”). Note that in this schema, there is no foo in sight. So referencing foo.some_field means you are trying to read foo while iterating over foo.. this causes the universe to collapse onto itself and the dreaded “not a scalar” error pops up.

  2. Thanks for the clarification, Dmitriy. I have looked for explanations both in the Pig wiki and in Alan F Gates’ book “Programming Pig”, but found nothing. Looked at the code now and found this in LOJoin.java:

    newFS = new LogicalSchema.LogicalFieldSchema(((LogicalRelationalOperator)op).getAlias()+”::”+fs.alias ,fs.schema, fs.type, fs.uid);

    Wonder if I had spotted that if I had checked before writing the post.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s