Hadoop Quick Notes :: Part - 1


Bhaskar S 12/23/2013


Overview

We have been hearing the Big Data buzz for quite some time now. So what is all this buzz about? With more and more of us doing business online and hooked on social media sites, we are inadvertently generating more and more data. How much data are we talking about? As of 2013, we are generating a few exabytes of data per day, and that is expected to grow to a few zettabytes before long! This is Big Data.

How do we store and process this much data? Traditional approaches will not scale to these volumes - enter Apache Hadoop.

Apache Hadoop is an open-source framework that allows for the distributed storage and processing of Big Data sets across a cluster of commodity computers (also referred to as nodes).

Rather than trying to store all the data in a traditional relational datastore, Hadoop distributes and replicates the data across the cluster of nodes.

And to process that distributed data, the application logic is moved close to where the data resides, rather than moving all the distributed data to the application.

The Hadoop core consists of the following two main modules:

HDFS (Hadoop Distributed File System) :: the storage module that distributes and replicates large data sets across the nodes of the cluster

MapReduce :: the processing module that distributes computation across the nodes of the cluster so that the application logic runs close to the data


Installation and Setup

Download the latest stable version of Hadoop 1.x from the project site at the URL hadoop.apache.org.

The current stable 1.x release at the time of this writing is 1.2.1.
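As a concrete sketch, the following shell session shows one way to fetch and unpack the release. The mirror URL and the install location under /opt are assumptions for illustration and may differ in your environment:

    # Fetch the Hadoop 1.2.1 release tarball (mirror URL is an assumption)
    $ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz

    # Unpack it under /opt (install location is an assumption)
    $ sudo tar -xzf hadoop-1.2.1.tar.gz -C /opt

    # Many guides export HADOOP_HOME to point at the unpacked directory
    $ export HADOOP_HOME=/opt/hadoop-1.2.1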

We will install Hadoop on a 3-node Ubuntu 13.10 based cluster. The following is how we will name the three nodes:
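For illustration, assume the three nodes are named hadoop-master, hadoop-slave-1, and hadoop-slave-2; these hostnames (and the private IP addresses below) are placeholders, so substitute the values for your environment. Each node's /etc/hosts would then carry entries along these lines:

    # /etc/hosts entries on every node (names and addresses are assumptions)
    192.168.1.10    hadoop-master
    192.168.1.11    hadoop-slave-1
    192.168.1.12    hadoop-slave-2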

The following diagram illustrates our 3-node Hadoop cluster:

[Diagram: 3-Node Hadoop Cluster]

Following are the steps to install and set up Hadoop on all three nodes in the cluster:
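A rough sketch of a typical Hadoop 1.2.1 setup follows, assuming the hypothetical hostnames above, Hadoop unpacked under /opt/hadoop-1.2.1, and the OpenJDK 7 runtime. Every path, hostname, and port here is an assumption for illustration, not a definitive procedure:

    # On every node: point Hadoop at the JDK in conf/hadoop-env.sh
    # (the JDK path is an assumption for illustration)
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

    # On every node: conf/core-site.xml names the HDFS NameNode
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://hadoop-master:9000</value>
        </property>
    </configuration>

    # On every node: conf/hdfs-site.xml sets the HDFS replication factor
    # (2 copies across the 3-node cluster is an assumption)
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
    </configuration>

    # On every node: conf/mapred-site.xml names the JobTracker
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>hadoop-master:9001</value>
        </property>
    </configuration>

    # On the master node: conf/masters lists the SecondaryNameNode host,
    # conf/slaves lists the DataNode/TaskTracker hosts
    $ echo 'hadoop-master' > /opt/hadoop-1.2.1/conf/masters
    $ printf 'hadoop-slave-1\nhadoop-slave-2\n' > /opt/hadoop-1.2.1/conf/slaves

    # On the master node: format HDFS once, then start all the daemons
    # (the start scripts need passwordless ssh from the master to all nodes)
    $ /opt/hadoop-1.2.1/bin/hadoop namenode -format
    $ /opt/hadoop-1.2.1/bin/start-all.sh

With dfs.replication set to 2, each HDFS block is stored on two of the three nodes, which is what lets the cluster tolerate the loss of a single node.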

This completes the installation and the necessary setup of our 3-node Hadoop cluster.
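As a quick sanity check (again using the hypothetical layout above), jps should show the expected daemons on each node, and a simple HDFS operation should succeed:

    # On the master node, expect NameNode, SecondaryNameNode and JobTracker
    $ jps

    # On each slave node, expect DataNode and TaskTracker
    $ jps

    # Create and list a directory in HDFS
    $ /opt/hadoop-1.2.1/bin/hadoop fs -mkdir /test
    $ /opt/hadoop-1.2.1/bin/hadoop fs -ls /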