Hadoop 2.x Quick Notes :: Part - 1


Bhaskar S 12/24/2014


Overview

We explored Hadoop 1.x almost a year ago. Now it is time to explore Hadoop 2.x.

Not much has changed in the Hadoop Distributed File System (HDFS); the Hadoop MapReduce Distributed Processing Framework, on the other hand, has been completely overhauled and now runs on top of YARN (Yet Another Resource Negotiator).

Hadoop 2.x core consists of the following three main modules:

- Hadoop Distributed File System (HDFS) :: the distributed storage layer
- YARN (Yet Another Resource Negotiator) :: the cluster resource management layer
- Hadoop MapReduce :: the distributed data processing framework, which now runs as an application on top of YARN


Single Node (localhost) Installation and Setup

We will first install Hadoop 2.x on an Ubuntu 14.04 LTS based desktop for development and testing purposes.

Following are the steps to install and set up Hadoop 2.x on a single node (localhost):
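At a high level, the setup boils down to configuring the default filesystem and the replication factor, formatting the NameNode, and starting the HDFS and YARN daemons. The following is a minimal sketch (the hdfs://localhost:9000 address and the $HADOOP_HOME layout are illustrative assumptions, not the exact values from the steps above):

# In $HADOOP_HOME/etc/hadoop/core-site.xml, point the default filesystem at localhost:
#
#   <property>
#     <name>fs.defaultFS</name>
#     <value>hdfs://localhost:9000</value>
#   </property>

# In $HADOOP_HOME/etc/hadoop/hdfs-site.xml, set the replication factor to 1 (single node):
#
#   <property>
#     <name>dfs.replication</name>
#     <value>1</value>
#   </property>

# Format the NameNode (first time only)
hdfs namenode -format

# Start the HDFS daemons (NameNode, DataNode, and SecondaryNameNode)
$HADOOP_HOME/sbin/start-dfs.sh

# Start the YARN daemons (ResourceManager and NodeManager)
$HADOOP_HOME/sbin/start-yarn.sh

# Verify all the daemons are up and running
jps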

This completes the installation, the necessary setup, and the start-up of Hadoop 2.x on a single node (localhost).


Single Node (localhost) Test Drive

Time to test drive HDFS on our single node (localhost) Hadoop 2.x setup.

We will first create a directory called /input in HDFS and then copy a 328MB test file into it.

To do that, execute the following commands:

hdfs dfs -mkdir /input

hdfs dfs -copyFromLocal Downloads/big-file.txt /input
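To confirm that the copy succeeded, list the directory and check the reported file size (a quick sanity check; big-file.txt is the test file referenced above):

# List the contents of /input
hdfs dfs -ls /input

# Display the size of the copied file in a human-readable form
hdfs dfs -du -h /input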

The following screenshot shows the web browser pointing to the NameNode URL http://localhost:50070:

[Screenshot: Browse NameNode (localhost)]

The following screenshot shows the result of clicking on the Datanodes tab:

[Screenshot: Datanode Information (localhost)]

The following screenshot shows the result of clicking on Utilities->Browse the file system->input:

[Screenshot: Browse input Directory (localhost)]

We were successful in using HDFS on our single node (localhost) Hadoop 2.x setup.

Now, it is time to test drive the MapReduce Framework on our single node (localhost) Hadoop 2.x setup.

We will execute the example wordcount MapReduce program (provided as part of the Hadoop 2.x download) on the 328MB test file located under /input in HDFS.

To do that, execute the following command:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input /output

The following is the typical output. Note the three input splits reported by the JobSubmitter, which correspond to the test file spanning three HDFS blocks (the default block size in Hadoop 2.x is 128MB):

14/12/24 20:50:13 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
14/12/24 20:50:14 INFO input.FileInputFormat: Total input paths to process : 1
14/12/24 20:50:14 INFO mapreduce.JobSubmitter: number of splits:3
14/12/24 20:50:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1419887794046_0001
14/12/24 20:50:15 INFO impl.YarnClientImpl: Submitted application application_1419887794046_0001
14/12/24 20:50:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1419887794046_0001/
14/12/24 20:50:15 INFO mapreduce.Job: Running job: job_1419887794046_0001
14/12/24 20:50:21 INFO mapreduce.Job: Job job_1419887794046_0001 running in uber mode : false
14/12/24 20:50:21 INFO mapreduce.Job:  map 0% reduce 0%
14/12/24 20:50:32 INFO mapreduce.Job:  map 25% reduce 0%
14/12/24 20:50:35 INFO mapreduce.Job:  map 38% reduce 0%
14/12/24 20:50:36 INFO mapreduce.Job:  map 50% reduce 0%
14/12/24 20:50:41 INFO mapreduce.Job:  map 58% reduce 0%
14/12/24 20:50:47 INFO mapreduce.Job:  map 65% reduce 0%
14/12/24 20:50:48 INFO mapreduce.Job:  map 65% reduce 11%
14/12/24 20:50:50 INFO mapreduce.Job:  map 66% reduce 11%
14/12/24 20:50:53 INFO mapreduce.Job:  map 73% reduce 11%
14/12/24 20:50:59 INFO mapreduce.Job:  map 78% reduce 11%
14/12/24 20:51:01 INFO mapreduce.Job:  map 100% reduce 11%
14/12/24 20:51:02 INFO mapreduce.Job:  map 100% reduce 100%
14/12/24 20:51:02 INFO mapreduce.Job: Job job_1419887794046_0001 completed successfully
14/12/24 20:51:02 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=3166002
		FILE: Number of bytes written=4292109
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=308297019
		HDFS: Number of bytes written=164423
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Killed map tasks=2
		Launched map tasks=5
		Launched reduce tasks=1
		Data-local map tasks=5
		Total time spent by all maps in occupied slots (ms)=119855
		Total time spent by all reduces in occupied slots (ms)=22947
		Total time spent by all map tasks (ms)=119855
		Total time spent by all reduce tasks (ms)=22947
		Total vcore-seconds taken by all map tasks=119855
		Total vcore-seconds taken by all reduce tasks=22947
		Total megabyte-seconds taken by all map tasks=122731520
		Total megabyte-seconds taken by all reduce tasks=23497728
	Map-Reduce Framework
		Map input records=7172096
		Map output records=43099136
		Map output bytes=437059581
		Map output materialized bytes=703554
		Input split bytes=315
		Combine input records=43243028
		Combine output records=191856
		Reduce input groups=11991
		Reduce shuffle bytes=703554
		Reduce input records=47964
		Reduce output records=11991
		Spilled Records=263802
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=1385
		CPU time spent (ms)=100400
		Physical memory (bytes) snapshot=1002299392
		Virtual memory (bytes) snapshot=7676395520
		Total committed heap usage (bytes)=761790464
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=308296704
	File Output Format Counters 
		Bytes Written=164423

The following screenshot shows the result of clicking on Utilities->Browse the file system->output:

[Screenshot: Browse output Directory (localhost)]
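The same results can also be inspected directly from the command line. Since the job ran with a single reduce task (Launched reduce tasks=1 in the counters above), all the word counts land in one output file (part-r-00000 is the standard reducer output file naming convention):

# List the job output directory; the _SUCCESS marker indicates a completed job
hdfs dfs -ls /output

# Display the first few word count entries produced by the reducer
hdfs dfs -cat /output/part-r-00000 | head -20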

We successfully demonstrated the use of the MapReduce Framework on our single node (localhost) Hadoop 2.x setup.