big-data Archives | Novixys Software Dev Blog

Java – Reading a Large File Efficiently

Efficiently read text and binary files in Java

“The world is a book and those who do not travel read only one page.”
― Augustine of Hippo

1. Introduction

What’s the easiest and most efficient way to read a large file in Java? One way is to read the whole file into memory at once. Let us examine some issues that arise when doing so.

How to Convert Large CSV to JSON

Learn how to use Jackson Streaming to convert a large CSV to JSON.

“We are a way for the cosmos to know itself.”
― Carl Sagan, Cosmos

1. Introduction

Today, let us look into converting a large CSV to JSON without running into memory issues. A previous article showed how to parse CSV and output the data to JSON using Jackson. However, because that code loads the entire data set into memory, it runs into problems with large CSV files.

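The streaming idea can be sketched briefly. The article itself uses Jackson Streaming in Java; the sketch below swaps in Python's standard csv and json modules purely to illustrate the principle of converting one row at a time instead of loading everything first. The function name csv_to_json_stream and the sample data are hypothetical and not taken from the article.

```python
import csv
import io
import json


def csv_to_json_stream(csv_in, json_out):
    """Copy rows from csv_in to json_out as a JSON array, one row at a time.

    Only the current row is held in memory, so the input size does not
    affect memory use. Returns the number of rows written.
    """
    reader = csv.DictReader(csv_in)  # lazily yields one dict per CSV row
    json_out.write("[")
    count = 0
    for row in reader:
        if count:
            json_out.write(",")  # separator between array elements
        json_out.write("\n  " + json.dumps(row))
        count += 1
    json_out.write("\n]\n")
    return count


# In-memory streams stand in for large files opened with open(...).
sample = io.StringIO("id,name\n1,alpha\n2,beta\n")
result = io.StringIO()
rows_written = csv_to_json_stream(sample, result)
```

With real files, the same function would take two file objects returned by open(), and the output would be written incrementally as the input is read.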

Python Pandas Tutorial – Series Methods

We cover commonly used methods of the pandas Series object in this article.

“The truth is not for all men but only for those who seek it.”
― Ayn Rand

1. Introduction

The Series is one of the most common pandas data structures. It is similar to a Python list and is used to represent a column of data. After looking into the basics of creating and initializing a pandas Series object, we now delve into some common usage patterns and methods.

Python Pandas Tutorial – DataFrame

Learn the basics of creating a DataFrame in this tutorial series on pandas.

1. Introduction

This is the next part of the pandas tutorial. In a previous article, we covered the pandas Series class. Today we are getting started with the main pandas data structure, the DataFrame.

Python Pandas Tutorial – Series

Learn the basics of pandas Series in this beginner tutorial.

“Come friends, it’s not too late to seek a newer world.”
― Alfred Tennyson

1. Introduction

Pandas is a powerful toolkit providing data analysis tools and structures for the Python programming language.

Among the most important data structures provided by pandas is the Series. In this article, we introduce the Series class from a beginner’s perspective. That means you do not need to know anything about pandas or data analysis to understand this tutorial.

Apache Spark – Setup Cluster on AWS

We demonstrate how to set up an Apache Spark cluster on a single AWS EC2 node and run a couple of jobs.

“If the facts don’t fit the theory, change the facts.”
― Albert Einstein

1. Introduction

Apache Spark is the newest kid on the block talking big data.

While re-using major components of the Apache Hadoop Framework, Apache Spark lets you execute big data processing jobs that do not neatly fit into the Map-Reduce paradigm. It provides support for many patterns similar to the Java 8 Streams functionality, while letting you run these jobs on a cluster.

Hadoop Tutorial

Learn how to get started with Hadoop (2.7.3). The tutorial demonstrates single-process execution and single-node distributed execution.

“We are all different. Don’t judge, understand instead.”
― Roy T. Bennett, The Light in the Heart

1. Introduction

Hadoop is a toolkit for big-data processing. It splits the data into multiple chunks, processes each chunk on one machine in a cluster of computers, and reassembles the output.
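The split, process and reassemble pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea and not Hadoop code: a thread pool stands in for the machines of a cluster, and the function names (split_into_chunks, count_words, word_count) and the sample lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Split the input into consecutive chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """Process one chunk independently: count the words it contains."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def word_count(lines, chunk_size=2, workers=2):
    """Split the data, process each chunk on a worker, then merge the results."""
    chunks = split_into_chunks(lines, chunk_size)
    # Each worker thread plays the role of one machine in the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # reassemble the partial outputs
    return total


lines = ["big data big", "data cluster", "big cluster"]
totals = word_count(lines)
```

Because each chunk is counted without reference to the others, the per-chunk step can run anywhere; only the final merge needs to see all partial results.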
