“We are all different. Don’t judge, understand instead.”
― Roy T. Bennett, The Light in the Heart
1. Introduction
Hadoop is a toolkit for big-data processing. It uses a cluster of computers to split data into chunks, process each chunk on one machine, and re-assemble the outputs.
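The split-process-reassemble idea can be mimicked with nothing but standard shell tools. This is just a toy sketch of the MapReduce flow, not how Hadoop itself works: the "map" step emits one word per line, the sort plays the role of the "shuffle" that groups identical keys, and `uniq -c` is the "reduce" that counts each group.

```shell
# Toy MapReduce: word count over a tiny in-line dataset.
printf 'to be or not to be\n' |
  tr ' ' '\n' |   # map: emit one word (key) per record
  sort |          # shuffle: bring identical keys together
  uniq -c         # reduce: count the records in each group
```

This prints each distinct word with its count (`be` and `to` twice, `not` and `or` once), which is exactly the shape of output a MapReduce word-count job produces.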
Let us now get started with Hadoop.
Download the Hadoop binary distribution from Apache servers.
2. Single Node Operation
Let us run a Hadoop job on a single node to understand the basics of Hadoop processing. In this mode, the job runs on a single node as a single Java process (no distributed computing).
We will run a Hadoop job that finds and counts instances of a regular expression across a set of files.
2.1. Setup
Create a directory to be used as the working directory and change to that directory.
mkdir hadoop-getting-started && cd hadoop-getting-started
Extract the Hadoop binary tarball into this directory. A sub-directory such as hadoop-2.7.3 will be created containing the distribution.
tar xvzf hadoop-2.7.3.tar.gz
Ensure that JAVA_HOME is set and exported.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71
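(The JDK path above is just an example; use the path where your JDK is installed. A quick hedge before going further is to confirm that JAVA_HOME actually points at a usable JDK:)

```shell
# Warn if JAVA_HOME does not point at a java binary.
if [ -x "$JAVA_HOME/bin/java" ]; then
  "$JAVA_HOME/bin/java" -version
else
  echo "JAVA_HOME ($JAVA_HOME) does not look like a JDK" >&2
fi
```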
Run the Hadoop binary to ensure everything is set up correctly.
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
Note: The setting JAVA=java was required in the above invocation: the bin/hadoop executable shell script appears to use this variable, and the command would not run without it being set. However, the Hadoop docs make no mention of this setting (or at least, I didn’t see any).
If everything has been set up correctly, you should see the following output from the above invocation.
An example program must be given as the first argument. Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  ...
2.2. Running the Job
Let us run a sample job: attempting to find regular expression matches in files.
mkdir input
cp hadoop-2.7.3/etc/hadoop/*.xml input/
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'a+'
The job should now start, print a bunch of diagnostic output and finish within a few seconds. Check the output:
cat output/*
This command prints the results of the job: the regular expression matches (with their counts) that were found in the files in the input directory.
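As a sanity check, the same pattern can be run through ordinary grep on the local copies of the files. This mirrors what the job reports (each maximal run of 'a' counted per match string), though it is only a rough cross-check, not part of the Hadoop workflow:

```shell
# Count maximal runs of 'a' per distinct match string with plain grep.
# -o prints each match on its own line, -h suppresses filename prefixes.
grep -ohE 'a+' input/*.xml | sort | uniq -c | sort -rn
```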
3. Distributed Processing on Single Node
This example runs as a distributed system (where each Hadoop daemon is a separate process), but on a single node. It serves to lay out the processing structure for a fully-distributed system on a cluster.
3.1. Password-less SSH
Generate a key-pair with the following command. It writes the private key of an RSA key-pair into ~/.ssh/id_rsa and the public key into ~/.ssh/id_rsa.pub. You should enter a passphrase to protect the key. (And remember it!)
ssh-keygen
Append the public key file ~/.ssh/id_rsa.pub to the ~/.ssh/authorized_keys file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Ensure that ~/.ssh/authorized_keys is private (only you can read/write the file).
chmod 600 ~/.ssh/authorized_keys
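(On Linux, you can confirm the permissions took effect with stat; it should print 600. This uses the GNU coreutils `-c` format option, so the exact flag may differ on other systems.)

```shell
# Print the octal permission bits of the authorized_keys file.
stat -c '%a' ~/.ssh/authorized_keys
```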
Add the private key file to your list of SSH identities. You will need to enter your passphrase here to do that.
ssh-add ~/.ssh/id_rsa
You should now be able to login to the localhost without a password. (If it doesn’t work, let me know in the comments.)
ssh localhost
3.2. Format HDFS
Format the Hadoop Distributed File System (HDFS).
./hadoop-2.7.3/bin/hdfs namenode -format
3.3. Start Daemons
Ensure JAVA_HOME is properly set in hadoop-2.7.3/etc/hadoop/hadoop-env.sh. Put this line somewhere in the file.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71
This step assumes that password-less SSH is properly working.
./hadoop-2.7.3/sbin/start-dfs.sh
Something like the following will be printed as the diagnostic output:
localhost: starting namenode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
localhost: starting datanode, logging to /home/hadoop/server/hadoop-2.7.3/lo...
Starting secondary namenodes [0.0.0.0]
...
Check that the processes are running with ps -ef. You should see a namenode, a datanode and a secondarynamenode as follows:
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_namenode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_datanode ...
/usr/lib/jvm/jdk1.8.0_71/bin/java -Dproc_secondarynamenode ...
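One way to count the daemons in a single command is shown below; the [D] bracket in the pattern keeps the grep process itself out of the results, and once start-dfs.sh has finished you should see 3.

```shell
# Count HDFS daemon processes by their -Dproc_ markers.
ps -ef | grep -cE '[D]proc_(namenode|datanode|secondarynamenode)'
```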
3.4. Copy Input Files
Create directories in HDFS to copy files to (assuming joe is your username).
hadoop-2.7.3/bin/hdfs dfs -mkdir /user
hadoop-2.7.3/bin/hdfs dfs -mkdir /user/joe
hadoop-2.7.3/bin/hdfs dfs -mkdir input
Check that the input directory has been created.
hadoop-2.7.3/bin/hdfs dfs -ls
# prints:
# Found 1 items
# drwxr-xr-x - hadoop supergroup 0 2017-03-16 17:33 input
Copy files to be processed to HDFS.
hadoop-2.7.3/bin/hdfs dfs -put hadoop-2.7.3/etc/hadoop/*.xml input
3.5. Run Hadoop Job
The job can now be run as follows (search for the word “honour” in the input text files):
JAVA=java ./hadoop-2.7.3/bin/hadoop jar hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'honour'
View the generated output files with the command:
./hadoop-2.7.3/bin/hdfs dfs -cat output/*
# prints: 59 honour
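Since the files in HDFS are copies of local files, the count can be cross-checked with plain grep on the originals; the total should agree with what the job reported. (This is just a sanity check, not part of the Hadoop workflow.)

```shell
# Count occurrences of "honour" in the local copies of the uploaded files.
grep -o 'honour' hadoop-2.7.3/etc/hadoop/*.xml | wc -l
```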
And that’s it. Stop the daemons with the command:
./hadoop-2.7.3/sbin/stop-dfs.sh
Summary
Hadoop is a big-data processing platform that can run on clusters of commodity hardware. In this tutorial we learnt how to get started with Hadoop: first by running a job as a single process, then on a single node in distributed mode. In further articles, we will explore how to run it on a cluster of nodes.