How to Convert Large CSV to JSON

Learn how to use Jackson Streaming to convert a large CSV to JSON.

“We are a way for the cosmos to know itself.”
― Carl Sagan, Cosmos

1. Introduction

In this article, we look at converting a large CSV file to JSON without running into memory issues. This previous article showed how to parse CSV and output the data to JSON using Jackson. However, since that code loads the entire data set into memory, it runs into trouble on large CSV files, failing with:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
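To see why, consider a minimal sketch of a non-streaming conversion in the spirit of that earlier approach (the class name, method name and comma-split here are illustrative, not the previous article's actual code). Every line and every row object must fit in memory at once before a single write call produces the JSON:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NaiveCsvToJson {
    // Loads the ENTIRE CSV into memory before writing -- this is the
    // pattern that triggers OutOfMemoryError on large files.
    public static void convert(Path csvFile, Path jsonFile) throws IOException {
        List<String> lines = Files.readAllLines(csvFile);   // whole file in RAM
        String[] headers = lines.get(0).split(",");
        List<Map<String, String>> rows = new ArrayList<>(); // every row in RAM
        for (int r = 1; r < lines.size(); r++) {
            String[] values = lines.get(r).split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                row.put(headers[i], i < values.length ? values[i] : null);
            }
            rows.add(row);
        }
        // One big write of the fully materialized list
        new ObjectMapper().writeValue(jsonFile.toFile(), rows);
    }
}
```

On a 615 MB input, the `lines` and `rows` lists alone can exceed the default heap.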

2. Use Jackson Streaming API

The way to avoid such an OutOfMemoryError is to hold as little data in memory as possible and use the Jackson Streaming API to write JSON as the data is generated. This section outlines how to use the Jackson Streaming API.

Begin by creating a JsonFactory.

JsonFactory fac = new JsonFactory();

Next create a JsonGenerator and specify the output file to write to.

JsonGenerator gen = fac.createGenerator(new File(jsonFile),
                                        JsonEncoding.UTF8);

Once the JsonGenerator is created, you can use writeStartArray(), writeEndArray(), writeStartObject() and writeEndObject() to enclose the JSON writing code. For example, the following code writes an array of objects.

/* Begin writing JSON array */
gen.writeStartArray();
while (/* generating data to write */) {
    gen.writeStartObject();

    /* collect data for each object to write */
    /* write data */
    gen.writeStringField(name, value);
    ...

    /* complete object */
    gen.writeEndObject();
}
/* complete array */
gen.writeEndArray();

The data generation part could come from a database or some other source. In this article, we demonstrate converting CSV to JSON.
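As a self-contained illustration of streaming from a non-CSV source (the class name and the synthetic id field here are ours, made up for the example), the following sketch writes an arbitrary number of objects while keeping only the current object in memory:

```java
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import java.io.File;
import java.io.IOException;

public class StreamingJsonDemo {
    // Streams `count` objects to jsonFile; memory use is constant
    // regardless of count, since nothing is accumulated.
    public static void write(File jsonFile, int count) throws IOException {
        JsonFactory fac = new JsonFactory();
        try (JsonGenerator gen = fac.createGenerator(jsonFile, JsonEncoding.UTF8)) {
            gen.writeStartArray();
            for (int i = 0; i < count; i++) { // stand-in for a DB cursor, etc.
                gen.writeStartObject();
                gen.writeStringField("id", Integer.toString(i));
                gen.writeEndObject();
            }
            gen.writeEndArray();
        }
    }
}
```

The try-with-resources block ensures the generator is flushed and closed even if writing fails partway.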

3. Parsing CSV using Regex

The simple case parses the CSV using a regex. This method works only for CSV that has no quoted fields, no embedded commas or newlines within fields, and none of the other assorted CSV complications. In the code below, pattern is a pre-compiled java.util.regex.Pattern that splits each line on commas.
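To make that limitation concrete, here is a small standalone demonstration (the class name is ours) of splitting with a pre-compiled comma pattern. It works for simple rows but breaks apart a quoted field that contains a comma:

```java
import java.util.regex.Pattern;

public class RegexSplitDemo {
    // Split on every comma, with no awareness of quoting.
    static final Pattern COMMA = Pattern.compile(",");

    public static String[] split(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        // Simple CSV splits cleanly into 5 fields:
        String[] ok = split("1985,ATL,NL,barkele01,870000");
        System.out.println(ok.length);       // 5

        // A quoted field with an embedded comma is wrongly split in two,
        // yielding 3 pieces instead of 2:
        String[] bad = split("\"Wells Fargo & Company, Inc.\",VA");
        System.out.println(bad.length);      // 3
    }
}
```

This is why the regex approach is reserved for CSV known to be free of quoting.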

JsonFactory fac = new JsonFactory();
Pattern pattern = Pattern.compile(",");
try (BufferedReader in = new BufferedReader(new FileReader(csvFile));
     JsonGenerator gen = fac.createGenerator(new File(jsonFile),
                                             JsonEncoding.UTF8)
     .useDefaultPrettyPrinter()) {
    String[] headers = pattern.split(in.readLine());
    gen.writeStartArray();
    String line;
    while ((line = in.readLine()) != null) {
        gen.writeStartObject();
        String[] values = pattern.split(line);
        for (int i = 0 ; i < headers.length ; i++) {
            String value = i < values.length ? values[i] : null;
            gen.writeStringField(headers[i], value);
        }
        gen.writeEndObject();
    }
    gen.writeEndArray();
}

We use this code to convert CSV that looks like this:

yearID,teamID,lgID,playerID,salary
1985,ATL,NL,barkele01,870000
1985,ATL,NL,bedrost01,550000
...

And the output JSON:

[ {
  "yearID" : "1985",
  "teamID" : "ATL",
  "lgID" : "NL",
  "playerID" : "barkele01",
  "salary" : "870000"
}, {
  "yearID" : "1985",
  "teamID" : "ATL",
  "lgID" : "NL",
  "playerID" : "bedrost01",
  "salary" : "550000"
}, {
  "yearID" : "1985",
  "teamID" : "ATL",
  "lgID" : "NL",
  "playerID" : "benedbr01",
  "salary" : "545000"
}, {
...

4. More Complete CSV Parser

For CSV that contains quoted fields, embedded newlines, multi-line fields and commas within fields, we need a more complete CSV parser. We use the CSV parser presented in this article. Again, we parse in a loop, keep only a little data in memory, and write it out to JSON as it is read.

JsonFactory fac = new JsonFactory();
try (InputStream in = new FileInputStream(csvFile);
     JsonGenerator gen = fac.createGenerator(new File(jsonFile),
                                             JsonEncoding.UTF8)
     .useDefaultPrettyPrinter()) {
    CSV csv = new CSV(true, ',', in);
    List<String> fieldNames = null;
    if (csv.hasNext()) fieldNames = new ArrayList<>(csv.next());
    gen.writeStartArray();
    while (csv.hasNext()) {
        gen.writeStartObject();
        List<String> x = csv.next();
        for (int i = 0 ; i < fieldNames.size() ; i++) {
            gen.writeStringField(fieldNames.get(i), x.get(i));
        }
        gen.writeEndObject();
    }
    gen.writeEndArray();
}
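If you do not want to pull in the parser from that article, the quote handling can be sketched in a few lines. The following minimal parser (the MiniCsv name and parseRecord method are hypothetical, our own illustration rather than the CSV class used above) handles quoted fields, embedded commas and doubled quotes, assuming the caller has already accumulated a full record, including any embedded newlines, into one string:

```java
import java.util.ArrayList;
import java.util.List;

public class MiniCsv {
    // Minimal RFC 4180-style record parser. Inside quotes, commas and
    // newlines are literal characters; a doubled quote ("") is an
    // escaped quote. Outside quotes, a comma ends the field.
    public static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        cur.append('"');   // escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    cur.append(c);         // commas/newlines kept verbatim
                }
            } else if (c == '"') {
                inQuotes = true;           // opening quote
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());        // last field has no trailing comma
        return fields;
    }
}
```

For example, parseRecord("a,\"b,c\",d") yields the three fields a, b,c and d, which the regex approach above would split into four.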

Here is the data we parsed. (Incidentally, this data is from the Financial Services Consumer Complaint Database.) The data file is about 615 MB in size and caused an OutOfMemoryError when we attempted to load it whole.

Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
07/29/2013,Consumer Loan,Vehicle loan,Managing the loan or lease,,,,Wells Fargo & Company,VA,24540,,N/A,Phone,07/30/2013,Closed with explanation,Yes,No,468882
07/29/2013,Bank account or service,Checking account,Using a debit or ATM card,,,,Wells Fargo & Company,CA,95992,Ol
...

And the JSON output:

[ {
  "Date received" : "07/29/2013",
  "Product" : "Consumer Loan",
  "Sub-product" : "Vehicle loan",
  "Issue" : "Managing the loan or lease",
  "Sub-issue" : "",
  "Consumer complaint narrative" : "",
  "Company public response" : "",
  "Company" : "Wells Fargo & Company",
  "State" : "VA",
  "ZIP code" : "24540",
  "Tags" : "",
  "Consumer consent provided?" : "N/A",
  "Submitted via" : "Phone",
  "Date sent to company" : "07/30/2013",
  "Company response to consumer" : "Closed with explanation",
  "Timely response?" : "Yes",
  "Consumer disputed?" : "No",
  "Complaint ID" : "468882"
}, {
  "Date received" : "07/29/2013",
  "Product" : "Bank account or service",
  "Sub-product" : "Checking account",
  "Issue" : "Using a debit or ATM card",
  "Sub-issue" : "",
  "Consumer complaint narrative" : "",
  "Company public response" : "",
  "Company" : "Wells Fargo & Company",
...

5. Summary

This article showed how to use the Jackson Streaming API to convert a large CSV file into JSON. Loading all of the data into memory can cause an OutOfMemoryError, so we streamed the data out as JSON instead. For CSV without quoted fields and similar complications, we used a simple regex-based parser. We then demonstrated a more complete CSV parser that handles quoted fields, embedded commas and embedded newlines.