You can find my screencast for Apache Hadoop Streaming here at www.hadoopscreencasts.com
Apache Hadoop Streaming is a feature that allows developers to write MapReduce applications using languages like Python, Ruby, etc. A language that can read from standard input (STDIN) and write to standard output (STDOUT) can be used to write MapReduce applications.
In this post, I use Ruby to write the map and reduce functions.
First, let’s have some sample data. For a simple test, I have one file, that has just one line with a few repeating words.
The contents of the file (sample.txt) is as follows:
she sells sea shells on the sea shore where she sells fish too
Next, let’s create a ruby file for the map function and call it map.rb
Contents of map.rb
#!/usr/bin/env ruby STDIN.each do |line| line.split.each do |word| puts word + "\t" + "1" end end
In the above map code, we are splitting each line into words and emitting each word as a key with value 1.
Now, let’s create a ruby file for the reduce function and call it reduce.rb
Contents of reduce.rb
#!/usr/bin/env ruby prev_key=nil init_val = 1 STDIN.each do |line| key, value = line.split("\t") if prev_key != nil && prev_key != key puts prev_key + "\t" + init_val.to_s prev_key = key init_val = 1 elsif prev_key == nil prev_key = key elsif prev_key == key init_val = init_val + value.to_i end end puts prev_key + "\t" + init_val.to_s
In the above reduce code we take in each key and sum up the values for that key before printing.
We are now ready to test the map and reduce function locally before we run on the cluster.
Execute the following:
$ cat sample.txt | ruby map.rb | sort | ruby reduce.rb
The output should be as follows:
fish 1 on 1 sea 2 sells 2 she 2 shells 1 shore 1 the 1 too 1 where 1
In the above command we have provided the contents of the file sample.txt as input to the map.rb which in turn provides data to reduce.rb. The data is sorted before it is sent to the reducer.
It looks like are program is working as expected. Now it’s time to deploy this on a Hadoop cluster.
First, let move the sample data to a folder in HDFS:
$ hadoop fs -copyFromLocal sample.txt /user/data/
Once we have our sample data in HDFS we can execute the following command from the hadoop/bin folder to execute our MapReduce job:
$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -file map.rb -mapper map.rb -file reduce.rb -reducer reduce.rb -input /user/data/* -output /user/wc
If everything goes fine you should see the following output on your terminal:
packageJobJar: [map.rb, reduce.rb, /home/hduser/tmp/hadoop-unjar2392048729049303810/] [] /tmp/streamjob3038768339999397115.jar tmpDir=null 13/12/12 10:25:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library 13/12/12 10:25:01 WARN snappy.LoadSnappy: Snappy native library not loaded 13/12/12 10:25:01 INFO mapred.FileInputFormat: Total input paths to process : 1 13/12/12 10:25:01 INFO streaming.StreamJob: getLocalDirs(): [/home/hduser/tmp/mapred/local] 13/12/12 10:25:01 INFO streaming.StreamJob: Running job: job_201312120020_0007 13/12/12 10:25:01 INFO streaming.StreamJob: To kill this job, run: 13/12/12 10:25:01 INFO streaming.StreamJob: /home/hduser/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://localhost:9001 -kill job_201312120020_0007 13/12/12 10:25:01 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201312120020_0007 13/12/12 10:25:02 INFO streaming.StreamJob: map 0% reduce 0% 13/12/12 10:25:05 INFO streaming.StreamJob: map 50% reduce 0% 13/12/12 10:25:06 INFO streaming.StreamJob: map 100% reduce 0% 13/12/12 10:25:13 INFO streaming.StreamJob: map 100% reduce 33% 13/12/12 10:25:14 INFO streaming.StreamJob: map 100% reduce 100% 13/12/12 10:25:15 INFO streaming.StreamJob: Job complete: job_201312120020_0007 13/12/12 10:25:15 INFO streaming.StreamJob: Output: /user/wc
Now, let’s look at the output file generated to see our results:
$ hadoop fs -cat /user/wc/part-00000 fish 1 on 1 sea 2 sells 2 she 2 shells 1 shore 1 the 1 too 1 where 1
The results are just as we expected. We have successfully built and executed Hadoop MapReduce application using streaming written in Ruby.
You can find my screencast for Apache Hadoop Sreaming here at www.hadoopscreencasts.com