Apache Hadoop Streaming

You can find my screencast for Apache Hadoop Streaming here at www.hadoopscreencasts.com

Apache Hadoop Streaming is a feature that allows developers to write MapReduce applications using languages like Python, Ruby, etc. A language that can read from standard input (STDIN) and write to standard output (STDOUT) can be used to write MapReduce applications.

In this post, I use Ruby to write the map and reduce functions.

First, let’s have some sample data. For a simple test, I have one file, that has just one line with a few repeating words.
The contents of the file (sample.txt) is as follows:

she sells sea shells on the sea shore where she sells fish too

Next, let’s create a ruby file for the map function and call it map.rb

Contents of map.rb

#!/usr/bin/env ruby

STDIN.each do |line|
line.split.each do |word|
puts word + "\t" + "1"
end
end

In the above map code, we are splitting each line into words and emitting each word as a key with value 1.

Now, let’s create a ruby file for the reduce function and call it reduce.rb

Contents of reduce.rb

#!/usr/bin/env ruby

prev_key=nil
init_val = 1

STDIN.each do |line|
key, value = line.split("\t")
if prev_key != nil && prev_key != key
puts prev_key + "\t" + init_val.to_s
prev_key = key
init_val = 1
elsif prev_key == nil
prev_key = key
elsif prev_key == key
init_val = init_val + value.to_i
end
end

puts prev_key + "\t" + init_val.to_s

In the above reduce code we take in each key and sum up the values for that key before printing.

We are now ready to test the map and reduce function locally before we run on the cluster.

Execute the following:

$ cat sample.txt | ruby map.rb | sort | ruby reduce.rb

The output should be as follows:

fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1

In the above command we have provided the contents of the file sample.txt as input to the map.rb which in turn provides data to reduce.rb. The data is sorted before it is sent to the reducer.
It looks like are program is working as expected. Now it’s time to deploy this on a Hadoop cluster.

First, let move the sample data to a folder in HDFS:

$ hadoop fs -copyFromLocal sample.txt /user/data/

Once we have our sample data in HDFS we can execute the following command from the hadoop/bin folder to execute our MapReduce job:

$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -file map.rb -mapper map.rb -file reduce.rb -reducer reduce.rb -input /user/data/* -output /user/wc

If everything goes fine you should see the following output on your terminal:

packageJobJar: [map.rb, reduce.rb, /home/hduser/tmp/hadoop-unjar2392048729049303810/] [] /tmp/streamjob3038768339999397115.jar tmpDir=null
13/12/12 10:25:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/12 10:25:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/12 10:25:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/12 10:25:01 INFO streaming.StreamJob: getLocalDirs(): [/home/hduser/tmp/mapred/local]
13/12/12 10:25:01 INFO streaming.StreamJob: Running job: job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: To kill this job, run:
13/12/12 10:25:01 INFO streaming.StreamJob: /home/hduser/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://localhost:9001 -kill job_201312120020_0007
13/12/12 10:25:01 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201312120020_0007
13/12/12 10:25:02 INFO streaming.StreamJob: map 0% reduce 0%
13/12/12 10:25:05 INFO streaming.StreamJob: map 50% reduce 0%
13/12/12 10:25:06 INFO streaming.StreamJob: map 100% reduce 0%
13/12/12 10:25:13 INFO streaming.StreamJob: map 100% reduce 33%
13/12/12 10:25:14 INFO streaming.StreamJob: map 100% reduce 100%
13/12/12 10:25:15 INFO streaming.StreamJob: Job complete: job_201312120020_0007
13/12/12 10:25:15 INFO streaming.StreamJob: Output: /user/wc

Now, let’s look at the output file generated to see our results:

$ hadoop fs -cat /user/wc/part-00000

fish 1
on 1
sea 2
sells 2
she 2
shells 1
shore 1
the 1
too 1
where 1

The results are just as we expected. We have successfully built and executed Hadoop MapReduce application using streaming written in Ruby.

You can find my screencast for Apache Hadoop Sreaming here at www.hadoopscreencasts.com

Apache Hadoop Streaming

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112