Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad-hoc data analysis and iterative processing can be achieved easily.
Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, writing MapReduce jobs in Java was the primary way to process data stored on HDFS.
Pig was first built at Yahoo! and later became a top-level Apache project. In this series of posts we will walk through the different features of Pig using a sample dataset.
Dataset
The dataset that we are using here is from one of my projects called Flicksery. Flicksery is a Netflix search engine. The dataset is a simple text file (movies_data.csv) that lists movie names and their details like release year, rating and runtime.
A sample of the dataset is as follows:
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
All code and data for this post can be downloaded from GitHub. The file has a total of 49,590 records.
Installing Pig
Download Pig
$ wget http://mirror.symnds.com/software/Apache/pig/pig-0.12.0/pig-0.12.0.tar.gz
Untar
$ tar xvzf pig-0.12.0.tar.gz
Rename the folder for easier access:
$ mv pig-0.12.0 pig
Update .bashrc to add the following:
export PATH=$PATH:/home/hduser/pig/bin
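To quickly verify the installation, you can ask Pig for its version with the -version flag. This is a sanity check only; the exact output string depends on your build:

$ pig -version
Apache Pig version 0.12.0 (r1529718)
compiled Oct 07 2013, 12:20:14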
Pig can be started in one of the following two modes:
- Local Mode
- Cluster Mode
Using the '-x local' option starts Pig in local mode, whereas executing the pig command without any options starts Pig in cluster (MapReduce) mode. When in local mode, Pig can access files on the local file system. In cluster mode, Pig can access files on HDFS.
Restart your terminal and execute the pig command as follows:
To start in Local Mode:
$ pig -x local
2013-12-25 20:16:26,258 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-25 20:16:26,259 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig/myscripts/pig_1388027786256.log
2013-12-25 20:16:26,281 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-12-25 20:16:26,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
To start in Cluster Mode:
$ pig
2013-12-25 20:19:42,274 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2013-12-25 20:19:42,274 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/pig/myscripts/pig_1388027982272.log
2013-12-25 20:19:42,300 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hduser/.pigbootup not found
2013-12-25 20:19:42,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2013-12-25 20:19:42,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hdfs://localhost:9001
grunt>
Either command presents you with a grunt shell. The grunt shell allows you to execute Pig Latin statements to quickly test out data flows on your data step by step, without having to execute complete scripts. Pig is now installed and we can go ahead and start using it to play with data.
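Besides Pig Latin statements, the grunt shell accepts a few utility commands that are handy while exploring: fs runs Hadoop file system commands against whichever file system Pig is connected to, help lists the available commands, and quit exits the shell.

grunt> fs -ls /
grunt> help
grunt> quit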
Pig Latin
To learn Pig Latin, let’s question the data. Before we start asking questions, we need the data to be accessible in Pig.
Use the following command to load the data:
grunt> movies = LOAD '/home/hduser/pig/myscripts/movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
The above statement is made up of two parts. The part to the left of "=" is called the relation or alias. It looks like a variable, but note that it is not one. When this statement is executed, no MapReduce job is run; Pig only records the logical plan and defers execution until an output is actually requested.
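A quick way to see this lazy behavior is the DESCRIBE statement, which prints the schema of an alias without running a job. Since we did not declare types in the LOAD statement, Pig defaults every field to bytearray, so the output should look roughly like this:

grunt> DESCRIBE movies;
movies: {id: bytearray,name: bytearray,year: bytearray,rating: bytearray,duration: bytearray}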
Since the fields in our records are separated by a comma, we use USING PigStorage(','). If no delimiter is specified, PigStorage defaults to the tab character.
Another thing we have done in the above statement is to name the fields using the 'as' keyword.
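The 'as' clause can optionally declare a type for each field as well. As a sketch, the same LOAD statement with explicit types would look like the following; the rest of this post sticks with the untyped version, which is why the filters further down cast the fields explicitly:

grunt> movies = LOAD '/home/hduser/pig/myscripts/movies_data.csv' USING PigStorage(',') as (id:int, name:chararray, year:int, rating:float, duration:int);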
Now, let’s test to see if the alias has the data we loaded.
grunt> DUMP movies;
Once you execute the above statement, you should see a lot of text on the screen (partial output shown below).
2013-12-25 23:03:04,550 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2013-12-25 23:03:04,633 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2013-12-25 23:03:04,748 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-12-25 23:03:04,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-12-25 23:03:04,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-12-25 23:03:04,853 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
................
HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
1.1.2	0.12.0	hduser	2013-12-25 23:03:04	2013-12-25 23:03:05	UNKNOWN

Success!

Job Stats (time in seconds):
JobId	Alias	Feature	Outputs
job_local_0001	movies	MAP_ONLY	file:/tmp/temp-1685410826/tmp1113990343,

Input(s):
Successfully read records from: "/home/hduser/pig/myscripts/movies_data.csv"

Output(s):
Successfully stored records in: "file:/tmp/temp-1685410826/tmp1113990343"

Job DAG:
job_local_0001
................
(49586,Winter Wonderland,2013,2.8,1812)
(49587,Top Gear: Series 19: Africa Special,2013,,6822)
(49588,Fireplace For Your Home: Crackling Fireplace with Music,2010,,3610)
(49589,Kate Plus Ei8ht,2010,2.7,)
(49590,Kate Plus Ei8ht: Season 1,2010,2.7,)
It is only when the DUMP statement is executed that a MapReduce job is initiated. Since we see our data in the output, we can confirm that it has been loaded successfully.
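Dumping all 49,590 records just to eyeball the data is wasteful. While experimenting, you can use Pig's LIMIT operator to look at only a handful of records; the alias name movies_sample below is just for illustration:

grunt> movies_sample = LIMIT movies 10;
grunt> DUMP movies_sample;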
Now, since we have the data in Pig, let’s start with the questions.
List the movies that have a rating greater than 4
grunt> movies_greater_than_four = FILTER movies BY (float)rating > 4.0;
grunt> DUMP movies_greater_than_four;
The above statement filters the alias movies and stores the result in a new alias, movies_greater_than_four. The movies_greater_than_four alias will contain only the records of movies where the rating is greater than 4. The cast to float is needed because we did not declare field types when loading the data.
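FILTER accepts any boolean expression over the fields, so conditions can be combined. As an illustrative sketch (the alias name is made up), the following finds well-rated movies released after 2000:

grunt> recent_good_movies = FILTER movies BY (float)rating > 4.0 AND (int)year > 2000;
grunt> DUMP recent_good_movies;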
The DUMP command only displays the results on standard output. If you need to store the results to a file, you can use the STORE command:
grunt> STORE movies_greater_than_four INTO '/user/hduser/movies_greater_than_four';
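By default, STORE writes the fields separated by tabs. If you want the output to stay comma-separated like the input, you can pass a delimiter to PigStorage on the write side too:

grunt> STORE movies_greater_than_four INTO '/user/hduser/movies_greater_than_four' USING PigStorage(',');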
In this post we got a good feel for Apache Pig. We loaded some data and executed some basic commands to query it. The next post will dive deeper into Pig Latin, where we will learn some advanced techniques for data analysis.