Sorting a Large CSV File

Large CSV files present a challenge when need arises to sort. Learn how to do that using a database.

Sort Large CSV files

“All of life is a constant education.”
― Eleanor Roosevelt, The Wisdom of Eleanor Roosevelt

1. Introduction

Let us explore some ways of sorting large data sets.

By large, I don’t mean typical “big-data” sizes – which might consist of billions of rows. Such data sets fall into the realm of “big data” which we are not exploring today. Instead I am talking of sorting a rather large CSV file – maybe a couple of million rows.

2. The Croaking Excel

First off, I tried the most obvious: loading it into Excel. And promptly received an error the first time.

Barfing Excel trying to open large CSV

Second time Excel rebuked me with the following message.

Sorting a Large CSV in Excel

Great job, Excel Developers! You not only can’t open my file, you want to chastise me for attempting a second time. Blah, Blah, Blah! I didn’t even read the wall of text fully. Does you really need so much text to say “Too Large”?

3. Awk and Sort: Linux Command Line

Let’s see. What other methods can I use? How about using the Linux command line? Split the CSV into fields with Awk and use sort.

Out of luck here too.

The CSV includes fields with quotes, line separators and commas. Too complex to use regular expressions with or without Awk.

4. Loading into an ArrayList

Looks like I cannot get it done without programming.

Let us write a Java program to parse the CSV and load the data into a List of Lists. Here is the main loop where the data is being loaded. (It uses the CSV module which handles Byte-Order-Marker, quoted fields, multi-line fields and more.)

List<String> colNames = null;
List<List<String>> data = new ArrayList<>();
long rowNum = 0;
try (InputStream in = new FileInputStream(csvFile);) {
    CSV csv = new CSV(true, ',', in);
    if ( csv.hasNext() ) colNames = new ArrayList<String>(csv.next());
    while (csv.hasNext()) {
        rowNum++;
        if ( rowNum / 1000 * 1000 == rowNum )
            System.err.printf("\rRow: %d", rowNum);
        data.add(new ArrayList<String>(csv.next()));
    }
    System.err.printf("\rRow: %d\n", rowNum);
}

Unfortunately that didn’t work too. I got an exception:

java.lang.OutOfMemoryError: Java heap space

Yeah, too big to load it all into memory too. Let us see now, what other options do we have?

5. Load into an SQL Database

Let us load the data into an SQL database. Since I am doing this in Java, the first option that came to my mind was: the H2 database. The H2 database is a nifty piece of software which lets you add SQL capabilities to your application with a single JAR. Let’s roll with it.

The POM segment:

    <dependency>
      <groupId>com.h2database</groupId>
      <artifactId>h2</artifactId>
      <version>1.4.194</version>
    </dependency>

Load the JDBC driver:

Class.forName("org.h2.Driver");

The H2 database provides a function CSVREAD() for directly reading a CSV file. From my (limited) experiments, it appears to be able to deal with quoted fields and other complications of CSV.

To write the CSV output after sorting, we use the CSVWRITE() function which can export the output of a query as CSV.

String dbUrl = "jdbc:h2:" + dbFile;
try (Connection con = DriverManager.getConnection(dbUrl);) {
    con
        .createStatement()
        .execute("create table sample as select * from CSVREAD('" + csvFile + "')");
    StringBuilder sbuf = new StringBuilder();
    sbuf.append("CALL CSVWRITE('sample_out.csv', 'SELECT * FROM sample order by cast(SCORE as int)'");
    con
        .createStatement()
        .execute(sbuf.toString());
}

And that did the job. Though the CSV output was somewhat not to my liking; CSVWRITE() enclosed all fields in quotes regardless of whether it was required or not.

It took about 11 minutes (wall-clock time) for a file of about 1.2 million rows. (The sample data in question was the StackOverflow questions CSV  data set from here.)

"ID","OWNERUSERID","CREATIONDATE","CLOSEDDATE","SCORE","TITLE","BODY"
...
"38185560","2377949","2016-07-04T13:08:37Z","NA","0","Join two query results where one is created from column values","<p>I have a table tables </p>
...

Summary

Sorting a large CSV file with a few million rows is not as straightforward as it appears. Excel cannot load CSV files of this size. So we are left with having to load the CSV into a database and sort it there. Sample code was presented to do this using the H2 database.

In the next part of this article, we examine how the SQLite database performs in this regard.

XML Import Wizard

Quick Wizard to convert XML to Excel, Access and CSV. Can merge multiple XML files. Try it today.