Hadoop Tutorial

Learn how to get started with Hadoop (2.7.3). Demonstrates single process and single node distributed execution.

hadoop big data tutorial

“We are all different. Don’t judge, understand instead.”
― Roy T. Bennett, The Light in the Heart

1. Introduction

Hadoop is a toolkit for big-data processing. It uses a cluster of computers to split data into multiple chunks and process each chunk on one machine and re-assemble the output.

Let us now get started with Hadoop.

Download the Hadoop binary distribution from Apache servers.

2. Single Node Operation

Let us run a Hadoop job on a single node for understanding the basics of Hadoop processing. In this mode, the Hadoop data processing job runs on a single node as a single java process (no distributed computing).

We run a Hadoop job to find and count instances of a regular expression on a bunch of files.

2.1. Setup

Create a directory to be used as the working directory and change to that directory.

mkdir hadoop-getting-started && cd hadoop-getting-started

Extract the Hadoop binary tarball into this directory. A sub-directory such as hadoop-2.7.3 will be created containing the distribution.

tar xvzf hadoop-2.7.3.tar.gz

Ensure that JAVA_HOME is set and exported.

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

Run the Hadoop binary to ensure everything is setup correctly.

JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar

Note: The setting JAVA=java was required in the previous invocation since it appeared from the bin/hadoop executable shell script that this variable was being used. However there was no mention of this setting in the Hadoop docs (or at least, I didn’t see any). And the command would not run without this variable being set.

If everything has been setup correctly, you should see the following output from the above invocation.

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
...

2.2. Running the Job

Let us run a sample job: attempting to find regular expression matches in files.

mkdir input
cp hadoop-2.7.3/etc/hadoop/*.xml input/
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'a+'

The job should now start, print a bunch of diagnostic output and finish with a few seconds. Check the output:

cat output/*

This command looks through all the files in the input directory and shows you the regular expression matches found in the input files.

3. Distributed Processing on Single Node

This example runs as a distributed system (where each Hadoop daemon is a separate process), but on a single node. It serves to lay out the processing structure for a fully-distributed system on a cluster.

3.1. Password-less SSH

Generate a key-pair with the following command. It writes the private key of an RSA key-pair into ~/.ssh/id_rsa and the public key into ~/.ssh/id_rsa.pub.

You should enter a passphrase to protect the key. (And remember it!)

ssh-keygen

Append the public key file ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys file.

cat joe.pub >> ~/.ssh/authorized_keys

Ensure that ~/.ssh/authorized_keys is private (only you can read/write the file).

chmod 600 ~/.ssh/authorized_keys

Add the private key file to your list of SSH identities. You will need to enter your passphrase here to do that.

ssh-add joe

You should now be able to login to the localhost without a password. (If it doesn’t work, let me know in the comments.)

ssh localhost

3.2. Format HDFS

Format the Hadoop Distributed File System (HDFS)

./hadoop-2.7.3/bin/hdfs namenode -format

3.3. Start Daemons

Ensure JAVA_HOME is properly set in hadoop-2.7.3/etc/hadoop/hadoop-env.sh. Put this line somewhere in the file.

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71

This step assumes that password-less SSH is properly working.

$ sbin/start-dfs.sh

Something like the following will be printed as the diagnostic output:

localhost: starting namenode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
localhost: starting datanode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
Starting secondary namenodes [0.0.0.0]
...

Check that the processes are running with a ps -ef. You should see a namenode, a datanode and a secondardnamenode as follows:

/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_namenode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_datanode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_secondarynamenode ...

3.4. Copy Input Files

Create directories in HDFS to copy files to (assuming joe is your username).

hadoop-2.7.3/bin/hdfs dfs -mkdir /user
hadoop-2.7.3/bin/hdfs dfs -mkdir /user/joe
hadoop-2.7.3/bin/hdfs dfs -mkdir input

Check that the input directory has been created.

hadoop-2.7.3/bin/hdfs dfs -ls

# prints
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2017-03-16 17:33 input

Copy files to be processed to HDFS.

hadoop-2.7.3/bin/hdfs dfs -put hadoop-2.7.3/etc/hadoop/*.xml input

3.5. Run Hadoop Job

The job can now be run as follows (search for word “honour” in the input text files):

JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'honour'

View the generated output files with the command:

./hadoop-2.7.3/bin/hdfs dfs -cat output/*

# prints
59      honour

And that’s it. Stop the daemons with the command:

./hadoop-2.7.3/sbin/stop-dfs.sh

Summary

Hadoop is a big-data processing platform that can run on clusters of commodity hardware. In this tutorial we learnt how to get started with Hadoop by running it as a single process, and on a single node in distributed mode. In further articles, we will explore how to run it on a cluster of nodes.