Let’s have a quick look at the FILTER command from our Part 1:
grunt> movies_greater_than_four = FILTER movies BY (float)rating>4.0;
Here, we see a (float) cast placed before the column ‘rating’. The cast tells Pig to treat the column as type float. It was needed because Pig was not told the type of the column when the data was loaded.
Following is the command we used to load the data:
grunt> movies = LOAD '/home/hduser/pig/myscripts/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
The load command specified only the column names. We can modify the statement as follows to include the data type of the columns:
grunt> movies = LOAD '/home/hduser/pig/myscripts/movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
In the above statement, name is a chararray (string), rating is a double, and the fields id, year and duration are integers. If the data had been loaded using the above statement, we would not need to cast the column during filtering.
The data types used in the above statement are called scalar data types. The other scalar types include long, float and bytearray.
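To see why the cast was needed in the first place, it can help to inspect the schema of a relation loaded without types. The sketch below assumes the same file path as above; the alias movies_untyped is a hypothetical name used only for illustration. When no type is declared, Pig treats each field as a bytearray.

```pig
-- Load with column names only (no types declared).
movies_untyped = LOAD '/home/hduser/pig/myscripts/movies_data.csv'
    USING PigStorage(',')
    AS (id, name, year, rating, duration);

-- Every field should be reported as bytearray, Pig's default type.
DESCRIBE movies_untyped;

-- An explicit cast is then needed wherever a specific type matters.
highly_rated = FILTER movies_untyped BY (double)rating > 4.0;
```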
To get better at using filters, let’s ask the data a few more questions:
List the movies that were released between 1950 and 1960
grunt> movies_between_50_60 = FILTER movies by year>1950 and year<1960;
List the movies that start with the letter A
grunt> movies_starting_with_A = FILTER movies by name matches 'A.*';
List the movies that have a duration greater than 1 hour (duration is in seconds, so the threshold is 3600)
grunt> movies_duration_2_hrs = FILTER movies by duration > 3600;
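The duration field appears to be stored in seconds (the sample rows and the later division by 60 both suggest this), so a threshold of 3600 corresponds to one hour. A minimal sketch of a two-hour filter follows, with the threshold spelled out as arithmetic so the unit is obvious. The alias movies_over_2_hrs is a hypothetical name, and the typed LOAD statement shown earlier is assumed.

```pig
-- 2 hours * 60 minutes * 60 seconds = 7200 seconds
movies_over_2_hrs = FILTER movies BY duration > 2 * 60 * 60;
DUMP movies_over_2_hrs;
```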
List the movies that have rating between 3 and 4
grunt> movies_rating_3_4 = FILTER movies BY rating>3.0 and rating<4.0;
DESCRIBE
The schema of a relation/alias can be viewed using the DESCRIBE command:
grunt> DESCRIBE movies;
movies: {id: int,name: chararray,year: int,rating: double,duration: int}
ILLUSTRATE
To view the step-by-step execution of a sequence of statements you can use the ILLUSTRATE command:
grunt> ILLUSTRATE movies_duration_2_hrs;
------------------------------------------------------------------------------------------------
| movies | id:int | name:chararray                   | year:int | rating:double | duration:int |
------------------------------------------------------------------------------------------------
|        | 1567   | Barney: Sing & Dance with Barney | 2004     | 2.7           | 3244         |
|        | 3045   | Strange Circus                   | 2005     | 2.8           | 6509         |
------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------
| movies_duration_2_hrs | id:int | name:chararray | year:int | rating:double | duration:int |
---------------------------------------------------------------------------------------------
|                       | 3045   | Strange Circus | 2005     | 2.8           | 6509         |
---------------------------------------------------------------------------------------------
DESCRIBE and ILLUSTRATE are really useful for debugging.
Complex Types
Pig supports three different complex types to handle data. It is important that you understand these types properly as they will be used very often when working with data.
Tuples
A tuple is just like a row in a table. It is an ordered, comma-separated list of fields.
(49539,'The Magic Crystal',2013,3.7,4561)
The above tuple has five fields. A tuple is surrounded by parentheses.
Bags
A bag is an unordered collection of tuples.
{ (49382, 'Final Offer'), (49385, 'Delete') }
The above bag has two tuples. Each tuple has two fields, id and movie name. A bag is surrounded by curly braces.
Maps
A map is a <key, value> store. The key and value are joined together using #.
['name'#'The Magic Crystal', 'year'#2013]
The above map has two keys, name and year, with the values ‘The Magic Crystal’ and 2013. The first value is a chararray and the second one is an integer. A map is surrounded by square brackets.
We will be using the above complex types quite often in our future examples.
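As a rough sketch of how the three complex types can appear in a schema and be accessed, consider the statements below. The file path, the alias names and the field names (info, titles, attrs) are all hypothetical and are not part of the tutorial's dataset. The input is assumed to be tab-separated text in Pig's own notation for tuples, bags and maps.

```pig
-- Hypothetical input line (fields separated by tabs):
-- (49539,The Magic Crystal)  {(49382,Final Offer),(49385,Delete)}  [name#The Magic Crystal]
complex_data = LOAD '/home/hduser/pig/myscripts/complex_data.txt'
    AS (info:tuple(id:int, name:chararray),
        titles:bag{t:tuple(id:int, name:chararray)},
        attrs:map[]);

-- Dot notation reaches into a tuple (or projects a field out of a bag);
-- the # operator looks up a key in a map.
fields = FOREACH complex_data GENERATE info.name, titles.name, attrs#'name';
DUMP fields;
```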
FOREACH
FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example.
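List the movie names and their duration in minutes
grunt> movie_duration = FOREACH movies GENERATE name, (double)duration/60;
The above statement generates a new alias that has the list of movies and their duration in minutes. Casting duration to double before dividing keeps the fractional minutes; dividing the two integers first would truncate them.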
You can check the results using the DUMP command.
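A generated expression has no field name unless one is given. The sketch below uses the AS clause to name the computed column, which makes the schema easier to read in DESCRIBE and lets later statements refer to the field by name. The alias movie_duration_named and the field name duration_minutes are hypothetical, and the typed LOAD statement is assumed.

```pig
-- Name the computed field so it can be referenced later by name.
movie_duration_named = FOREACH movies
    GENERATE name, (double)duration / 60 AS duration_minutes;

DESCRIBE movie_duration_named;
DUMP movie_duration_named;
```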
GROUP
The GROUP keyword is used to group the records of a relation by one or more fields.
List the years and the number of movies released each year.
grunt> grouped_by_year = group movies by year;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
You can check the result by dumping the count_by_year relation on the screen.
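Grouping is where the complex types show up in practice: each group is a tuple holding the key (named group) and a bag of the matching records (named after the original relation). A small sketch, assuming the typed LOAD statement was used; the alias count_by_year_named and the field names release_year and movie_count are hypothetical.

```pig
-- The schema should show a group key plus a bag named movies,
-- roughly: {group: int, movies: {(id, name, year, rating, duration)}}
DESCRIBE grouped_by_year;

-- Same count as above, but with readable field names.
count_by_year_named = FOREACH grouped_by_year
    GENERATE group AS release_year, COUNT(movies) AS movie_count;
DUMP count_by_year_named;
```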
We know in advance that the total number of movies in the dataset is 49590. We can check whether our GROUP operation is correct by verifying the total of the COUNT field. If the sum of the count field is 49590, we can be confident that our grouping has worked correctly.
grunt> group_all = GROUP count_by_year ALL;
grunt> sum_all = FOREACH group_all GENERATE SUM(count_by_year.$1);
grunt> DUMP sum_all;
Of the above three statements, the first one, GROUP ALL, groups all the tuples into one group. This is very useful when we need to perform aggregation operations on the entire set.
The next statement performs a FOREACH on the grouped relation group_all and applies the SUM function to the field in position 1 (positions start from 0). Here, the field in position 1 holds the count of movies for each year. On execution of the DUMP statement, the MapReduce program kicks off and gives us the following result:
(49590)
The above value matches the known fact that the dataset has 49590 movies, so we can conclude that our GROUP operation worked correctly.
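The same GROUP ALL pattern can also count the records of the original relation directly, which gives an independent figure to compare against. This is only a sketch; the aliases movies_all and total_movies are hypothetical names. Note that COUNT skips tuples whose first field is null, so the result could differ from the file's line count if any id values are missing.

```pig
-- Put every record of movies into a single group, then count the bag.
movies_all = GROUP movies ALL;
total_movies = FOREACH movies_all GENERATE COUNT(movies);
DUMP total_movies;
```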
ORDER BY
Let us question the data to illustrate the ORDER BY operation.
List all the movies in ascending order of year.
grunt> asc_movies_by_year = ORDER movies BY year ASC;
grunt> DUMP asc_movies_by_year;
List all the movies in descending order of year.
grunt> desc_movies_by_year = ORDER movies BY year DESC;
grunt> DUMP desc_movies_by_year;
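ORDER BY also accepts more than one sort key, each with its own direction. As an illustrative sketch (the alias movies_by_year_rating is a hypothetical name), the statement below sorts by year from newest to oldest and, within each year, by rating from highest to lowest.

```pig
-- Sort keys are applied left to right; each may be ASC or DESC.
movies_by_year_rating = ORDER movies BY year DESC, rating DESC;
DUMP movies_by_year_rating;
```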
DISTINCT
The DISTINCT statement is used to remove duplicated records. It works only on entire records, not on individual fields.
Let’s illustrate this with an example:
grunt> movies_with_dups = LOAD '/home/hduser/pig/myscripts/movies_with_duplicates.csv' USING PigStorage(',') as (id:int,name:chararray,year:int,rating:double,duration:int);
grunt> DUMP movies_with_dups;
(1,The Nightmare Before Christmas,1993,3.9,4568)
(1,The Nightmare Before Christmas,1993,3.9,4568)
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(5,Night Tide,1963,2.8,5126)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
(9,Nosferatu: Original Version,1929,3.5,5651)
You can see that there are duplicates in this data set. Now let us list the distinct records present in movies_with_dups:
grunt> no_dups = DISTINCT movies_with_dups;
grunt> DUMP no_dups;
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
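Because DISTINCT compares whole records, getting the distinct values of a single field takes one extra step: project that field with FOREACH first, then apply DISTINCT to the result. A minimal sketch, with the hypothetical aliases years_only and distinct_years:

```pig
-- Keep only the year field, so each record is a one-field tuple.
years_only = FOREACH movies_with_dups GENERATE year;

-- DISTINCT now removes repeated years.
distinct_years = DISTINCT years_only;
DUMP distinct_years;
```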
LIMIT
Use the LIMIT keyword to get only a limited number of results from a relation.
grunt> top_10_movies = LIMIT movies 10;
grunt> DUMP top_10_movies;
(1,The Nightmare Before Christmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Storm,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu: Original Version,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
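On its own, LIMIT makes no promise about which records are returned. To get a meaningful "top 10", one common approach is to sort first and then limit the sorted relation. The sketch below picks the ten highest-rated movies; the aliases movies_by_rating and top_10_rated are hypothetical names.

```pig
-- Sort by rating, highest first, then keep the first ten records.
movies_by_rating = ORDER movies BY rating DESC;
top_10_rated = LIMIT movies_by_rating 10;
DUMP top_10_rated;
```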
SAMPLE
Use the SAMPLE keyword to get a random sample of your data.
grunt> sample_10_percent = sample movies 0.1;
grunt> dump sample_10_percent;
Here, 0.1 = 10%
We already know that the file has 49590 records, so we can check the count of records in the sampled relation.
grunt> sample_group_all = GROUP sample_10_percent ALL;
grunt> sample_count = FOREACH sample_group_all GENERATE COUNT(sample_10_percent.$0);
grunt> dump sample_count;
The output is (4937), which is approximately 10% of 49590. SAMPLE is probabilistic, so the exact count varies from run to run.
In this post we have touched upon some important operations used in Pig. I suggest that you try out all the samples as you go through this tutorial, as it is the doing that registers and not the reading. In the next post we will learn a few more operations dealing with data transformation.
All code and data for this post can be downloaded from GitHub.