Python Pandas Tutorial – Series

Learn the basics of pandas Series in this beginner tutorial.

“Come friends, it’s not too late to seek a newer world.”
― Alfred Tennyson

1. Introduction

Pandas is a powerful toolkit providing data-analysis tools and structures for the Python programming language.

Among the most important data structures provided by pandas is the Series. In this article, we introduce the Series class from a beginner’s perspective. That means you do not need to know anything about pandas or data analysis to understand this tutorial.

Java – Pivot Table using Streams

Implement a Pivot Table in Java using Java 8 Streams and Collections.

“Money may not buy happiness, but I’d rather cry in a Jaguar than on a bus.”
― Françoise Sagan

1. Introduction

Today let us see how we can implement a pivot table using Java 8 Streams. Raw data by itself does not deliver much insight to humans. We need some kind of data aggregation to discern patterns in raw data. A pivot table is one such instrument. Other, more visual methods of aggregation include graphs and charts.
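To give a flavour of the idea, here is a minimal sketch of a pivot table built with nested `Collectors.groupingBy` calls. The `Sale` class, its fields, and the `pivot` method are hypothetical names chosen for illustration, and the full article may well take a different approach.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PivotSketch {
    // Hypothetical row type: one sale of some amount in a region and quarter.
    static final class Sale {
        final String region;
        final String quarter;
        final double amount;

        Sale(String region, String quarter, double amount) {
            this.region = region;
            this.quarter = quarter;
            this.amount = amount;
        }
    }

    // Rows are keyed by region, columns by quarter, and each cell holds the
    // sum of the amounts that fall into that (region, quarter) group.
    static Map<String, Map<String, Double>> pivot(List<Sale> sales) {
        return sales.stream().collect(
            Collectors.groupingBy(s -> s.region, TreeMap::new,
                Collectors.groupingBy(s -> s.quarter, TreeMap::new,
                    Collectors.summingDouble(s -> s.amount))));
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale("East", "Q1", 10.0),
            new Sale("East", "Q1", 20.0),
            new Sale("East", "Q2", 5.0),
            new Sale("West", "Q1", 7.0));
        // Print one line per region with its per-quarter totals.
        pivot(sales).forEach((region, byQuarter) ->
            System.out.println(region + " -> " + byQuarter));
    }
}
```

The `TreeMap` factories are only there to keep rows and columns in sorted order; the two-argument `groupingBy` works just as well when ordering does not matter.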

Apache Spark – Setup Cluster on AWS

We demonstrate how to set up an Apache Spark cluster on a single AWS EC2 node and run a couple of jobs.

“If the facts don’t fit the theory, change the facts.”
― Albert Einstein

1. Introduction

Apache Spark is the newest kid on the block talking big data.

While re-using major components of the Apache Hadoop Framework, Apache Spark lets you execute big data processing jobs that do not neatly fit into the Map-Reduce paradigm. It provides support for many patterns similar to the Java 8 Streams functionality, while letting you run these jobs on a cluster.

How to Setup an Apache Hadoop Cluster on AWS EC2

This article demonstrates how to get Apache Hadoop up and running on an Amazon EC2 cluster.

Introduction

Let’s talk about how to set up an Apache Hadoop cluster on AWS.

In a previous article, we discussed setting up a Hadoop processing pipeline on a single node (laptop). That involved running all the components of Hadoop on a single machine. In the setup we discuss here, we set up a multi-node cluster to run processing jobs.

Java Regex – Simple Patterns

Learn the basics of regular expression pattern matching in Java. Covers a few basic regular expression constructs.

“Life’s under no obligation to give us what we expect.”
― Margaret Mitchell

Introduction

Regular Expressions are the bread and butter of string pattern matching. They allow you to express a search pattern in general terms without being too specific about what you are searching for.

This article covers the basics of string pattern matching using regular expressions.
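As a small taste, the sketch below uses the standard `java.util.regex` classes to pull every run of digits out of a string. The class name `RegexSketch`, the method `findNumbers`, and the pattern itself are illustrative choices, not taken from the full article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSketch {
    // "\\d+" means "one or more digits": a general description of what we
    // want, rather than a specific string to look for.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collect every non-overlapping match of the pattern, left to right.
    static List<String> findNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("Order 66 shipped 3 items on day 12"));
        // Pattern.matches requires the whole input to match the pattern.
        System.out.println(Pattern.matches("[a-z]+", "hello"));
    }
}
```

Compiling the pattern once and reusing it avoids re-parsing the expression on every call.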

Sort Large CSV File using SQLite

Sort a large CSV file by loading it into SQLite, which makes the data much faster and easier to process.

“When you’re at the end of your rope, tie a knot and hold on.”
― Theodore Roosevelt

1. Review

We are trying to sort a large CSV file. The file contains a couple of million rows – not large by “big-data” standards, but large enough to cause problems when working with it.

Sorting a Large CSV File

Large CSV files present a challenge when the need arises to sort them. Learn how to do that using a database.

“All of life is a constant education.”
― Eleanor Roosevelt, The Wisdom of Eleanor Roosevelt

1. Introduction

Let us explore some ways of sorting large data sets.

By large, I don’t mean typical “big-data” sizes, which might run to billions of rows; we are not exploring that realm today. Instead, I am talking about sorting a rather large CSV file of maybe a couple of million rows.

Excel Pivot Table using Apache POI

Create an Excel Pivot Table from Java using Apache POI.

“A foolish faith in authority is the worst enemy of truth.”
― Albert Einstein

1. Introduction

A Pivot Table is a tool used in Excel for summarizing data. It helps group data using user-selected criteria and compute group summaries using functions such as total, average, count, etc.

Hadoop Tutorial

Learn how to get started with Hadoop (2.7.3). Demonstrates single-process and single-node distributed execution.

“We are all different. Don’t judge, understand instead.”
― Roy T. Bennett, The Light in the Heart

1. Introduction

Hadoop is a toolkit for big-data processing. It uses a cluster of computers to split data into multiple chunks, process each chunk on one machine, and re-assemble the output.
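The split, process, and re-assemble idea can be illustrated on a single machine with plain Java. The sketch below is only an analogy and does not use any Hadoop API: the class `ChunkedWordCount` and its methods are hypothetical, and a parallel stream stands in for the machines of a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ChunkedWordCount {
    // Split: cut the input into chunks of at most chunkSize lines.
    static List<List<String>> split(List<String> lines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            chunks.add(lines.subList(i, Math.min(i + chunkSize, lines.size())));
        }
        return chunks;
    }

    // Process: count the words of one chunk, independently of all others.
    static Map<String, Integer> countWords(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Re-assemble: merge the per-chunk counts into one overall result.
    static Map<String, Integer> wordCount(List<String> lines, int chunkSize) {
        List<Map<String, Integer>> partials = split(lines, chunkSize)
            .parallelStream()                 // chunks may be processed concurrently
            .map(ChunkedWordCount::countWords)
            .collect(Collectors.toList());
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big cluster", "data data");
        System.out.println(wordCount(lines, 2));
    }
}
```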

Introductory Spring Boot Rest Service

A very basic Spring MVC Web Application. Illustrates the outline of a Spring project.

“The only true wisdom is in knowing you know nothing.”
― Socrates

1. Introduction

Getting started with Spring and Spring Boot? Welcome!

This is a beginner-level tutorial for implementing a Hello World REST service using Spring Boot. The emphasis is on getting the basic application outline worked out, one to which we can add various enterprise features in further articles.

The example presented is a simple REST controller which returns a message to the user.
