Apache Spark – Setup Cluster on AWS

We demonstrate how to set up an Apache Spark cluster on a single AWS EC2 node and run a couple of jobs.

“If the facts don’t fit the theory, change the facts.”
― Albert Einstein

1. Introduction

Apache Spark is the newest kid on the big-data block.

While reusing major components of the Apache Hadoop framework, Apache Spark lets you execute big-data processing jobs that do not fit neatly into the MapReduce paradigm. It supports many of the same patterns as the Java 8 Streams API, while letting you run those jobs on a cluster.
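To get a taste of that API (a minimal sketch, not taken from the article; the app name, master URL, and file paths are placeholders), here is a word count written against Spark's Java API. The flatMap/mapToPair/reduceByKey chain reads much like a Java 8 Streams pipeline, but each stage runs partitioned across the cluster:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process using all cores; a real cluster
        // would use the master's URL instead (e.g. spark://host:7077).
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.textFile("input.txt")                                         // placeholder input path
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1) pairs
          .reduceByKey(Integer::sum)                                     // sum counts per word
          .saveAsTextFile("counts");                                     // placeholder output dir

        sc.close();
    }
}

The only change needed to move this job from a laptop to a cluster is pointing setMaster at the cluster's master URL (or dropping the call and passing --master to spark-submit).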

Continue reading “Apache Spark – Setup Cluster on AWS”

How to Setup an Apache Hadoop Cluster on AWS EC2

This article demonstrates how to get Apache Hadoop up and running on an Amazon EC2 cluster.

Introduction

Let's talk about how to set up an Apache Hadoop cluster on AWS.

In a previous article, we discussed setting up a Hadoop processing pipeline on a single node (a laptop), which involved running all of Hadoop's components on one machine. In the setup we discuss here, we set up a multi-node cluster to run processing jobs.
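As an illustration of what changes between the two setups (a hypothetical sketch, not from the article; the hostname and port are placeholders), the defining difference of a multi-node cluster is that every client and worker points at the NameNode through fs.defaultFS, normally configured in core-site.xml. The same setting can be exercised programmatically to verify that the cluster is reachable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: in practice this lives in core-site.xml
        // and names the master node running the NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");

        // Listing the HDFS root confirms the NameNode is reachable.
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}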

Continue reading “How to Setup an Apache Hadoop Cluster on AWS EC2”

Hadoop Tutorial

Learn how to get started with Hadoop (2.7.3). Demonstrates single-process and single-node distributed execution.

“We are all different. Don’t judge, understand instead.”
― Roy T. Bennett, The Light in the Heart

1. Introduction

Hadoop is a toolkit for big-data processing. It uses a cluster of computers: data is split into chunks, each chunk is processed on a separate machine, and the outputs are reassembled into a final result.
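To make the split/process/reassemble flow concrete, here is a minimal sketch of the classic MapReduce word count (illustrative code, not the tutorial's own). The map phase runs independently on each machine's chunk of the input; the framework then routes all counts for a given word to one machine for the reduce phase:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in this machine's chunk.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for a word arrive together; sum them
    // to reassemble the final output.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A small driver class then wires these two into a Job and points it at input and output paths in HDFS.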

Continue reading “Hadoop Tutorial”