Sort Large CSV File using SQLite

Sorting a large CSV file by loading it into SQLite. Much faster and easier to process.

“When you’re at the end of your rope, tie a knot and hold on.”
― Theodore Roosevelt

1. Review

We are trying to sort a large CSV file. The file contains a couple of million rows – not large by “big-data” standards, but large enough to face problems working with it.

In the first part of this article, we saw how Excel failed with a memory error unable to load the CSV file.

An attempt to load the data into an ArrayList failed with an OutOfMemoryException.

2. Using SQLite

The first database option we tried for loading the CSV was H2.

Another option for a “desktop” database is SQLite. A very popular embedded database in the C/C++ world that is reputed to be very fast. Let’s use that to load our data and manipulate it in our application.

SQLite supports direct loading of a CSV file using the “mode” and “import” commands. However, it did not quite work right for me. So I am going to be writing code using JDBC to import and sort the data.

SQLite’s CSV import also cannot handle quoted fields, multi-line fields and other idiosyncrasies found in some CSV files.

3. SQLite with JDBC

SQLite is a database library suitable for embedding within C/C++ programs. There is no “official” JDBC driver for it. However, some nice folks have developed one for it which we are going to use. It seems to be in active development (as of Mar 2017) as far as I can see.

Download the JDBC driver from here. Or use this POM entry.

<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.16.1</version>
</dependency>

4. Create Table from Header

Generally, the first line of a CSV file contains column headers. We are going to use this header line to create a table.

One nice thing about SQLite is that you can skip the column type when creating a table. SQLite uses dynamic typing and you can skip defining the type of a column when creating a table. The following is a valid SQLite table definition:

create table sample(
  yearID,
  teamID,
  lgID,
  playerID,
  salary
)

With this advantage, the code for creating a table is simple.

StringBuffer sbuf = new StringBuffer();
sbuf.append("create table sample(\n");
for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
    String name = fieldNames.get(i);
    sbuf.append("  ").append(name);
    if ( i < n - 1 ) sbuf.append(",\n");
    else sbuf.append("\n");
}
sbuf.append(")");
con.createStatement().execute(sbuf.toString());

5. Transactions

In SQLite, by default each statement is executed within the scope of a transaction. That is each statement can be assumed to be enclosed within a begin and commit block as follows:

begin transaction
sql statement ...
commit transaction

While this increases the reliability of the code and the data, it makes things very, very slow for us. So what we are going to do is to commit every N number of transactions.

if ( COMMIT_EVERY > 0 && uncommitted >= COMMIT_EVERY ) {
    con.createStatement().execute("commit transaction");
    con.createStatement().execute("begin transaction");
    uncommitted = 0;
}

With this change, I saw a 10x improvement in performance.

6. Inserting the Data

We use a PreparedStatement to insert the data into the table.

StringBuffer sbuf = new StringBuffer();
StringBuffer vbuf = new StringBuffer();
sbuf.append("insert into sample(");
vbuf.append("values(");
for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
    String name = fieldNames.get(i);
    sbuf.append(name);
    if ( i < n - 1 ) {
        sbuf.append(", ");
        vbuf.append("?, ");
    } else {
        vbuf.append("?)\n");
        sbuf.append(")\n");
    }
}
sbuf.append(vbuf.toString());
return con.prepareStatement(sbuf.toString());

7. Sort and Extract to CSV

After the data is inserted, we can sort the data with the required clause(s) and extract the data to CSV using a ResultSet.

String sortClause = "order by CAST(yearID as Integer) DESC, CAST(Salary as Integer) DESC, teamID";
String selectStmt = "select * from sample\n" + sortClause;
ResultSet rset = con.createStatement().executeQuery(selectStmt);

Main loop for extracting and exporting the data:

while (rset.next()) {
    rowNum++;
    for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
	out.print(encodeValue(rset.getString(i+1)));
	if ( i < n - 1 ) out.print(',');
	else out.println();
    }
}

Encoding a field value to CSV:

static private Pattern rxquote = Pattern.compile("\"");

static private String encodeValue(String value) {
    boolean needQuotes = false;
    if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
	 value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
	needQuotes = true;
    Matcher m = rxquote.matcher(value);
    if ( m.find() ) { needQuotes = true; value = m.replaceAll("\"\""); }
    if ( needQuotes ) return "\"" + value + "\"";
    else return value;
}

And here is the nicely sorted output data. It is sorted by yearID, salary and teamID.

yearID,teamID,lgID,playerID,salary
2016,LAD,NL,kershcl01,33000000
2016,ARI,NL,greinza01,31799030
2016,BOS,AL,priceda01,30000000
2016,DET,AL,cabremi01,28000000
...

For comparison purposes, the same data file which took 11 mins with a H2 database took about 5 mins with SQLite. Draw from that what conclusions you will. Probably too less data points to draw a conclusion. The only thing worse than generalizing from one case is generalizing from no cases at all.

Summary

And that completes our exercise in sorting a large CSV file. Using a database seems to be the easiest way out. We have a choice of using H2 database or SQLite. Not much to choose between either.

XML to MySQL

Merge and Import data from multiple XML files. Handles large XML files easily. Creates tables automatically. Try it today.