How to Read CSV File in Java

Reading a CSV file in Java including handling BOM (Byte-Order-Marker), quoted fields, multi-line fields and more.

“The reason I talk to myself is because I’m the only one whose answers I accept.”
― George Carlin

1. Introduction

CSV files are extensively used in data interchange between applications. Especially useful when the only structure to the data being exchanged is rows and columns. This format is particularly popular as the data can be imported into Microsoft Excel and used for charts and visualization.

In this article we present an easy-to-use class for parsing and reading CSV data in Java. The class allows retrieval of each row of the CSV file as an array of columns. This row can then be processed further for filtering, inserting into a database, etc.

You might be thinking – Why not just use String.split() to split row data into fields? It is, after all, readily available and returns the data as an array. Well, the answer is, CSV parsing has a lot more nuances than is possible to handle using just String.split().

2. Excel Compatibility

The CSV format defines certain conventions which are commonly used in applications that import and export CSV data. One of the most common applications used for visualizing CSV data is Excel. Many applications including follow these conventions and hence a CSV reader must take these into consideration. Some of these are:

Commas are used to separate fields. And a Carriage-Return Line-Feed (CRLF) combination is used to separate rows.

When commas need to be included as a part of a field value, it must be quoted with double-quotes (").

Multi-line fields can be present in the CSV file and these fields must also be quoted.

Double quotes can be included within a field by repeating the double-quote character.

3. Read and Strip BOM

A CSV file generated from an application on Windows might include a BOM (Byte Order Mark) character at the very beginning of the file. This character, if present, can be used to determine the encoding of the file from among UTF-8, UTF-16BE (Big Ending) or UTF-16LE (Little Endian). The CSV Reader module (presented next) uses a routine to strip the BOM from the CSV file.

4. CSV Reader Class

Here is the class used to parse and read CSV data. Download it here.

package sample;

import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.BufferedReader;

import java.util.List;
import java.util.ArrayList;

/*
 * @Author Jay Sridhar
 */
public class CSV
{
  static final private int NUMMARK = 10;
  static final private char COMMA = ',';
  static final private char DQUOTE = '"';
  static final private char CRETURN = '\r';
  static final private char LFEED = '\n';
  static final private char SQUOTE = '\'';
  static final private char COMMENT = '#';

  /**
   * Should we ignore multiple carriage-return/newline characters
   * at the end of the record?
   */
  private boolean stripMultipleNewlines;

  /**
   * What should be used as the separator character?
   */
  private char separator;
  private ArrayList<String> fields;
  private boolean eofSeen;
  private Reader in;

  static public Reader stripBom(InputStream in)
    throws java.io.IOException,
           java.io.UnsupportedEncodingException
  {
    PushbackInputStream pin = new PushbackInputStream(in, 3);
    byte[] b = new byte[3];
    int len = pin.read(b, 0, b.length);
    if ( (b[0] & 0xFF) == 0xEF && len == 3 ) {
      if ( (b[1] & 0xFF) == 0xBB && 
           (b[2] & 0xFF) == 0xBF ) {
        return new InputStreamReader(pin, "UTF-8");
      } else {
        pin.unread(b, 0, len);
      }
    }
    else if ( len >= 2 ) {
      if ( (b[0] & 0xFF) == 0xFE &&
           (b[1] & 0xFF) == 0xFF ) {
        return new InputStreamReader(pin, "UTF-16BE");
      } else if ( (b[0] & 0xFF) == 0xFF &&
                  (b[1] & 0xFF) == 0xFE ) {
        return new InputStreamReader(pin, "UTF-16LE");
      } else {
        pin.unread(b, 0, len);
      }
    } else if ( len > 0 ) {
      pin.unread(b, 0, len);
    }
    return new InputStreamReader(pin, "UTF-8");
  }

  public CSV(boolean stripMultipleNewlines,
             char separator,
             Reader input)
  {
    this.stripMultipleNewlines = stripMultipleNewlines;
    this.separator = separator;
    this.fields = new ArrayList<String>();
    this.eofSeen = false;
    this.in = new BufferedReader(input);
  }

  public CSV(boolean stripMultipleNewlines,
             char separator,
             InputStream input)
    throws java.io.IOException,
           java.io.UnsupportedEncodingException
  {
    this.stripMultipleNewlines = stripMultipleNewlines;
    this.separator = separator;
    this.fields = new ArrayList<String>();
    this.eofSeen = false;
    this.in = new BufferedReader(stripBom(input));
  }

  public boolean hasNext() throws java.io.IOException
  {
    if ( eofSeen ) return false;
    fields.clear();
    eofSeen = split( in, fields );
    if ( eofSeen ) return ! fields.isEmpty();
    else return true;
  }

  public List<String> next()
  {
    return fields;
  }

  // Returns true if EOF seen.
  static private boolean discardLinefeed(Reader in,
                                         boolean stripMultiple)
    throws java.io.IOException
  {
    if ( stripMultiple ) {
      in.mark(NUMMARK);
      int value = in.read();
      while ( value != -1 ) {
        char c = (char)value;
        if ( c != CRETURN && c != LFEED ) {
          in.reset();
          return false;
        } else {
          in.mark(NUMMARK);
          value = in.read();
        }
      }
      return true;
    } else {
      in.mark(NUMMARK);
      int value = in.read();
      if ( value == -1 ) return true;
      else if ( (char)value != LFEED ) in.reset();
      return false;
    }
  }

  private boolean skipComment(Reader in)
    throws java.io.IOException
  {
    /* Discard line. */
    int value;
    while ( (value = in.read()) != -1 ) {
      char c = (char)value;
      if ( c == CRETURN )
        return discardLinefeed( in, stripMultipleNewlines );
    }
    return true;
  }

  // Returns true when EOF has been seen.
  private boolean split(Reader in,ArrayList<String> fields)
    throws java.io.IOException
  {
    StringBuilder sbuf = new StringBuilder();
    int value;
    while ( (value = in.read()) != -1 ) {
      char c = (char)value;
      switch(c) {
      case CRETURN:
        if ( sbuf.length() > 0 ) {
          fields.add( sbuf.toString() );
          sbuf.delete( 0, sbuf.length() );
        }
        return discardLinefeed( in, stripMultipleNewlines );

      case LFEED:
        if ( sbuf.length() > 0 ) {
          fields.add( sbuf.toString() );
          sbuf.delete( 0, sbuf.length() );
        }
        if ( stripMultipleNewlines )
          return discardLinefeed( in, stripMultipleNewlines );
        else return false;

      case DQUOTE:
        {
          // Processing double-quoted string ..
          while ( (value = in.read()) != -1 ) {
            c = (char)value;
            if ( c == DQUOTE ) {
              // Saw another double-quote. Check if
              // another char can be read.
              in.mark(NUMMARK);
              if ( (value = in.read()) == -1 ) {
                // Nope, found EOF; means End of
                // field, End of record and End of
                // File
                if ( sbuf.length() > 0 ) {
                  fields.add( sbuf.toString() );
                  sbuf.delete( 0, sbuf.length() );
                }
                return true;
              } else if ( (c = (char)value) == DQUOTE ) {
                // Found a second double-quote
                // character. Means the double-quote
                // is included.
                sbuf.append( DQUOTE );
              } else if ( c == CRETURN ) {
                // Found End of line. Means End of
                // field, and End of record.
                if ( sbuf.length() > 0 ) {
                  fields.add( sbuf.toString() );
                  sbuf.delete( 0, sbuf.length() );
                }
                // Read and discard a line-feed if we
                // can indeed do so.
                return discardLinefeed( in,
                                        stripMultipleNewlines );
              } else if ( c == LFEED ) {
                // Found end of line. Means End of
                // field, and End of record.
                if ( sbuf.length() > 0 ) {
                  fields.add( sbuf.toString() );
                  sbuf.delete( 0, sbuf.length() );
                }
                // No need to check further. At this
                // point, we have not yet hit EOF, so
                // we return false.
                if ( stripMultipleNewlines )
                  return discardLinefeed( in, stripMultipleNewlines );
                else return false; 
              } else {
                // Not one of EOF, double-quote,
                // newline or line-feed. Means end of
                // double-quote processing. Does NOT
                // mean end-of-field or end-of-record.
                // System.err.println("EOR on '" + c +
                // "'");
                in.reset();
                break;
              }
            } else {
              // Not a double-quote, so no special meaning.
              sbuf.append( c );
            }
          }
          // Hit EOF, and did not see the terminating double-quote.
          if ( value == -1 ) {
            // We ignore this error, and just add whatever
            // left as the next field.
            if ( sbuf.length() > 0 ) {
              fields.add( sbuf.toString() );
              sbuf.delete( 0, sbuf.length() );
            }
            return true;
          }
        }
        break;

      default:
        if ( c == separator ) {
          fields.add( sbuf.toString() );
          sbuf.delete(0, sbuf.length());
        } else {
          /* A comment line is a line starting with '#' with
           * optional whitespace at the start. */
          if ( c == COMMENT && fields.isEmpty() &&
               sbuf.toString().trim().isEmpty() ) {
            boolean eof = skipComment(in);
            if ( eof ) return eof;
            else sbuf.delete(0, sbuf.length());
            /* Continue with next line if not eof. */
          } else sbuf.append(c);
        }
      }
    }
    if ( sbuf.length() > 0 ) {
      fields.add( sbuf.toString() );
      sbuf.delete( 0, sbuf.length() );
    }
    return true;
  }
}

5. Read CSV File into an Array

The following code illustrates how to read a CSV file and load the rows and columns into a List. The file is opened with an InputStream so the Byte Order Marker can be automatically detected and discarded. Also the first row of the CSV file is assumed to be column headers and loaded into a separate array.

List<List<String>> rows = new ArrayList<>();
try (InputStream in = new FileInputStream(csvFile);) {
  CSV csv = new CSV(true, ',', in);
  List<String> colNames = null;
  if ( csv.hasNext() ) colNames = new ArrayList<String>(csv.next());
  while (csv.hasNext()) {
    List<String> fields = new ArrayList<String>(csv.next());
    rows.add(fields);
  }
}

6. Parse CSV Data from a String

You can also use the class to read CSV data from a String by creating a StringReader instead of an InputStream. The following code reads and prints numbered columns from the CSV string.

try (Reader in = new StringReader(csvStr);) {
    CSV csv = new CSV(true, ',', in);
    int nrows = 0;
    while (csv.hasNext()) {
	List<String> fields = csv.next();
	for (int i = 0 ; i < fields.size() ; i++) {
	    System.out.printf("%-3d: %s%n", (i+1), fields.get(i));
	}
	System.out.println();
	nrows++;
    }
}

7. Quoted Commas

Commas can be quoted with the double quote character and included in a field.

@Test
public void quotedCommaTest() throws Exception
{
    try(Reader in = new StringReader("A,B,\"C,D\",e");) {
	CSV csv = new CSV(true, ',', in);
	assertTrue(csv.hasNext());
	List<String> fields = csv.next();
	assertTrue(fields.size() == 4);
	List<String> expected = Arrays.asList("A", "B", "C,D", "e");
	assertThat(fields, is(expected));
    }
}

8. Dealing with Double Quotes

Single instances of double-quotes are read and stripped away from the field value; they are assumed to quote text containing commas and new-lines.

@Test
public void quotesStrippedTest() throws Exception
{
  try(Reader in = new StringReader("A,B,I said \"How are you?\",d");) {
    CSV csv = new CSV(true, ',', in);
    assertTrue(csv.hasNext());
    List<String> fields = csv.next();
    assertTrue(fields.size() == 4);
    List<String> expected = Arrays.asList("A",
					  "B",
					  "I said How are you?",
					  "d");
    assertThat(fields, is(expected));
  }
}

To include double quotes as part of the field, it must be repeated and enclosed within double quotes.

@Test
public void includeQuotesTest() throws Exception
{
  try(Reader in = new StringReader("A,B,\"I said \"\"How about you?\"\"\",d");) {
    CSV csv = new CSV(true, ',', in);
    assertTrue(csv.hasNext());
    List<String> fields = csv.next();
    assertTrue(fields.size() == 4);
    List<String> expected = Arrays.asList("A",
					  "B",
					  "I said \"How about you?\"",
					  "d");
    assertThat(fields, is(expected));
  }
}

9. Multiline Columns

Text spanning across lines is sometimes included within CSV data. The class handles this case also without hiccups.

@Test
public void multilineFieldQuoteTest() throws Exception
{
  try(Reader in = new StringReader("A,B,\"I said\n\"\"How are you?\"\"\",d");) {
    CSV csv = new CSV(true, ',', in);
    assertTrue(csv.hasNext());
    List<String> fields = csv.next();
    // fields.stream().forEach(System.out::println);
    assertTrue(fields.size() == 4);
    List<String> expected = Arrays
      .asList("A",
	      "B",
	      "I said\n\"How are you?\"",
	      "d");
    assertThat(fields, is(expected));
  }
}

Summary

Reading CSV data from a file is sometimes required in an application. A simplistic view would be to use String.split() to read the data, but this does not cover all the edge cases. These include commas within fields, double quotes, and multi-line text. Download and drop in the included class for a simple solution to parsing CSV files.

XML to MySQL

Merge and Import data from multiple XML files. Handles large XML files easily. Creates tables automatically. Try it today.