Apache Spark – Set Up a Cluster on AWS

We demonstrate how to set up an Apache Spark cluster on a single AWS EC2 node and run a couple of jobs.

Apache Spark for Big Data

“If the facts don’t fit the theory, change the facts.”
― Albert Einstein

1. Introduction

Apache Spark is one of the newer kids on the block in big data processing.

While reusing major components of the Apache Hadoop framework, Apache Spark lets you execute big data processing jobs that do not fit neatly into the MapReduce paradigm. It supports many patterns similar to the Java 8 Streams functionality, while letting you run these jobs on a cluster.

You have a data processing job working nicely with Java 8 Streams, but need more horsepower and memory than a single machine can provide? Apache Spark is your friend.

In this article, we delve into the basics of Apache Spark and show you how to set up a single-node cluster using the computing resources of Amazon EC2. For the purposes of the demonstration, we set up a single server and run the master and slave on the same node. The same single-node arrangement is also good for getting your feet wet with Apache Spark on a laptop.

2. Create AWS Instance

Setting up an AWS EC2 instance is quite straightforward, and we have covered it in our guide to setting up a Hadoop cluster. The procedure is the same up to the point where the instance is running on EC2. Follow the steps in that guide until the instance is launched, then come back here to continue with Apache Spark.

3. Instance Setup

Once the instance is up and running on AWS EC2, we need to set up the requirements for Apache Spark.

3.1. Install Java

Install Java on the node using the Ubuntu package openjdk-8-jdk-headless:

sudo apt-get -y install openjdk-8-jdk-headless

3.2. Install Apache Spark

Next, head over to the Apache Spark website and download the latest version. At the time of writing, the latest version is 2.1.0. We have chosen to install Spark with Hadoop 2.7 (the default).

Download and unpack the Apache Spark package.

mkdir ~/server
cd ~/server
wget <Link to Apache Spark Binary Distribution>
tar xvzf spark-2.1.0-bin-hadoop2.7.tgz

After unpacking, you have just one step left to complete the installation: setting JAVA_HOME.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
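Before starting any daemons, it can help to confirm that JAVA_HOME really points at a Java installation. The following is a minimal Python 3 sketch, not part of Spark itself; the helper name find_java is hypothetical, and the check assumes the usual layout in which the java binary lives under bin/ inside JAVA_HOME.

```python
import os


def find_java(env=None):
    """Return the path to the java binary under JAVA_HOME, or None."""
    # Hypothetical helper for illustration; env defaults to the real environment.
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        return None
    # Assumption: the JDK keeps its launcher at <JAVA_HOME>/bin/java.
    candidate = os.path.join(java_home, "bin", "java")
    # Require that the file exists and is executable by the current user.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    path = find_java()
    if path:
        print("JAVA_HOME looks usable: " + path)
    else:
        print("JAVA_HOME is unset or does not contain bin/java")
```

Note that an export typed at the prompt only lasts for the current shell session, so run the check from the same shell you will use to start the master and slave.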

And that’s it for installation! The friendly folks at Apache Spark have certainly made our lives easy, haven’t they?

4. Startup Master

Let us now fire up the Apache Spark master. The master is in charge of the cluster. This is where you submit jobs, and this is where you go for the status of the cluster. Start the master as follows:

cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh

Once the master is running, navigate to port 8080 on the Node’s Public DNS and you get a snapshot of the cluster.

Apache Spark Master Node

The URL highlighted in red in the screenshot is the Spark URL for the cluster; it has the form spark://<host>:7077. Copy it down as you will need it to start the slave.
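Since the Spark URL is copied by hand and reused in several later commands, a quick sanity check on its shape can catch copy-and-paste slips. Here is a small Python 3 sketch using only the standard library; parse_master_url is a hypothetical helper name, and requiring an explicit port is a choice made for this sketch rather than a rule taken from Spark.

```python
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a spark://host:port URL into (host, port); raise ValueError if malformed."""
    parts = urlparse(url.strip())
    if parts.scheme != "spark":
        raise ValueError("expected a spark:// URL, got: %r" % url)
    if not parts.hostname:
        raise ValueError("no host found in: %r" % url)
    # Assumption for this sketch: insist on an explicit port such as 7077.
    if parts.port is None:
        raise ValueError("no port found in: %r" % url)
    return parts.hostname, parts.port


if __name__ == "__main__":
    # The URL used throughout this article; yours will differ.
    host, port = parse_master_url(
        "spark://ip-172-31-30-53.us-west-1.compute.internal:7077")
    print(host, port)
```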

5. Slave Startup

Ensure that JAVA_HOME is set properly and run the following command.

cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-31-30-53.us-west-1.compute.internal:7077

And with that, your cluster should be functioning. Hit the status page again at port 8080 to check. You should see the slave listed under Workers, along with the number of cores available and the memory.
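If the status page does not load, a first thing to rule out is whether anything is listening on the port at all. The sketch below is a plain TCP reachability check in Python 3; port_is_open is a hypothetical helper, and "localhost" assumes the script runs on the EC2 node itself.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # 8080 is the status page port used in this article.
    state = "open" if port_is_open("localhost", 8080) else "closed"
    print("port 8080 is " + state)
```

If the port is open on the node but the page still does not load from your own machine, the instance's security group may not be allowing inbound traffic on 8080.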

6. Run Job using Pyspark

Let us now run a job using the Python shell provided by Apache Spark. Starting up the shell needs the Spark Cluster URL mentioned earlier.

cd ~/server
./spark-2.1.0-bin-hadoop2.7/bin/pyspark --master spark://ip-172-31-30-53.us-west-1.compute.internal:7077

After a brief startup, you should see the pyspark prompt “>>>”.

For the purpose of testing, we are using a data file containing salaries of baseball players from 1985 through 2016. It contains 26429 records.

Here is a sample session with the pyspark shell using the file Salaries.csv

>>> a = sc.textFile('Salaries.csv')
>>> a.count()
26429
>>> a.filter(lambda x : '2005' in x).count()
837
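The filter in this session keeps every line that contains the text 2005 anywhere, so a line can match on a field other than the year. The following plain Python sketch (no cluster needed) contrasts that substring test with a test on a single column. The sample rows are made up for illustration, and the column order, with the year first, is an assumption about the file layout rather than something shown in this article.

```python
# Made-up sample rows for illustration only; the column order
# (year first, salary last) is an assumption about the file layout.
lines = [
    "yearID,teamID,lgID,playerID,salary",
    "2005,AAA,NL,player01,500000",
    "2005,BBB,AL,player02,750000",
    "1999,CCC,NL,player03,2005000",
]


def contains_2005(line):
    """Same predicate as the pyspark session: substring match anywhere in the line."""
    return "2005" in line


def year_is_2005(line):
    """Stricter predicate: compare only the first comma-separated field."""
    return line.split(",")[0] == "2005"


substring_count = len([l for l in lines if contains_2005(l)])
column_count = len([l for l in lines if year_is_2005(l)])

print("substring match:", substring_count)   # 3: the 1999 row matches on its salary
print("year column match:", column_count)    # 2: only rows whose first field is 2005
```

Either predicate can be passed to the RDD in the pyspark shell, for example a.filter(year_is_2005).count(), which makes it easy to compare the two counts on the real file.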

7. Python Code to Run Job

Let us now see how to run some sample Python code on the Spark cluster. The following code is similar to the pyspark session above.

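from pyspark import SparkContext

# Path to the input file, relative to the working directory.
dataFile = "../data/Salaries.csv"

# Connect to the master using the Spark URL copied from the status page.
sc = SparkContext("spark://ip-172-31-30-53.us-west-1.compute.internal:7077", "Simple App")
a = sc.textFile(dataFile)

# print() with a single formatted string works on both Python 2 and Python 3.
print("Count of records:  {}".format(a.count()))

print("Count of 2005 records:  {}".format(a.filter(lambda x: '2005' in x).count()))

# Release the cluster resources held by this application.
sc.stop()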

Along with a bunch of diagnostic output, the code prints:

Count of records:  26429
Count of 2005 records:  837

Summary

And that, my friends, is a simple and complete Apache Spark tutorial. We covered the basics of setting up Apache Spark on an AWS EC2 instance. We ran both the master and slave daemons on the same node. Finally, we demonstrated an interactive pyspark session as well as some Python code to run jobs on the cluster.