Pig lovers meet TOP

Have you ever needed to get the top n items for a key in Pig? For instance, the three most popular items in each country for an online store? You could always solve this the hard way by calculating a threshold per country and then filtering on that threshold, but that is neither easy to write nor efficient to execute. What you want to do is order the items by popularity per country and then limit to the top three for each country.

Pig has a built-in function that will help you with this: TOP. It combines an order and a limit, and it can be used in a nested FOREACH. The parameters it takes are:

  • limit – the number of items to keep for each group
  • column number – the 0-based index of the column to sort on
  • relation – the name of the relation (the bag of tuples) to operate on
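
To make the three parameters concrete, here is a minimal sketch for the online store case from the introduction. Everything in it is an assumption made for illustration: the file sales.tsv, its three tab-separated columns and the aliases sales, by_country, top3 and top_items do not come from a real data set.

```pig
-- Hypothetical input: one line per item with country, item and popularity.
sales = LOAD 'sales.tsv' AS (country:chararray, item:chararray, popularity:int);
by_country = GROUP sales BY country;

top_items = FOREACH by_country {
    -- limit = 3, column number = 2 (popularity, counting from 0), relation = sales
    top3 = TOP(3, 2, sales);
    GENERATE FLATTEN(top3);
}
```

Note that the column number counts from zero, so popularity, the third field, is column 2.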

Read on for a detailed example of how TOP can be used.

First we need some test data. Each line has a key and a score separated by a tab. Let’s call the file test.tsv and put it in our HDFS home directory with hdfs dfs -put test.tsv.

a 1
b 1
a 2
b 2
c 1
c 2
a 3
c 3
c 4
a 4
a 5

Then we need to read the data and group it by key. Here’s the Pig code for that:

data = LOAD 'test.tsv' AS (key:chararray, score:int);
data_group = GROUP data BY key;

Running this and dumping the result gives us:

(a,{(a,5),(a,4),(a,3),(a,2),(a,1)})
(b,{(b,2),(b,1)})
(c,{(c,4),(c,3),(c,2),(c,1)})

Now iterate over that using a FOREACH and a nested TOP:

data_top = FOREACH data_group {
    top = TOP(3, 1, data);
    GENERATE top;
}

The output of this is:

({(a,3),(a,5),(a,4)})
({(b,1),(b,2)})
({(c,2),(c,4),(c,3)})

Let’s flatten the top:

data_top = FOREACH data_group {
    top = TOP(3, 1, data);
    GENERATE flatten(top);
}

Dumping the result gives us:

(a,3)
(a,5)
(a,4)
(b,1)
(b,2)
(c,2)
(c,4)
(c,3)

The rows are grouped by key, but within each key the scores come back in no particular order, so let’s sort on key and descending score:

top_sorted = ORDER data_top BY key, score DESC;

The output after this is:

(a,5)
(a,4)
(a,3)
(b,2)
(b,1)
(c,4)
(c,3)
(c,2)

Clearly we have only the top 3 items per key. In the case of a, that means we have 5, 4 and 3. We only have 2 and 1 for b, since b only has 2 scores. For c we have 4, 3 and 2.

Here is the full Pig script, as tested on Cloudera’s Pig version 0.12.0-cdh5.0.1:

SET job.name 'nested top test';

data = LOAD 'test.tsv' AS (key:chararray, score:int);
data_group = GROUP data BY key;

data_top = FOREACH data_group {
    top = TOP(3, 1, data);
    GENERATE flatten(top);
}

top_sorted = ORDER data_top BY key, score DESC;
dump top_sorted;

Let’s call the script nested_top_test.pig. We run the script with pig nested_top_test.pig and observe the results.
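
For comparison, the order-then-limit idea from the introduction can also be written out by hand with a nested ORDER and LIMIT. This is a minimal sketch against the same test data, not part of the tested script above; the aliases by_score, top3 and data_top_alt are names made up for this illustration.

```pig
data = LOAD 'test.tsv' AS (key:chararray, score:int);
data_group = GROUP data BY key;

-- The same idea as TOP(3, 1, data), spelled out as two nested steps.
-- by_score, top3 and data_top_alt are hypothetical aliases.
data_top_alt = FOREACH data_group {
    by_score = ORDER data BY score DESC;  -- highest score first within each key
    top3 = LIMIT by_score 3;              -- keep at most three tuples per key
    GENERATE FLATTEN(top3);
}

DUMP data_top_alt;
```

Barring ties on score, this should select the same tuples per key as the TOP version. TOP simply packs both steps into a single call.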

Now you should have the very useful TOP function in your quiver for the next time you need to get the top items for each key in a relation. Have fun!
