How do I Create a Java String from the Contents of a File?

1. Introduction

Here we present a few ways to read the whole contents of a file into a String in Java.

2. Using java.nio.file in Java 7+

It is quite simple to load the whole file into a String using the java.nio.file package:

String str = new String(Files.readAllBytes(Paths.get(pathname)),
                        StandardCharsets.UTF_8);

Here is how the code works. First, read all the bytes into a byte array using the Files and Paths classes from the java.nio.file package.

byte[] buf = Files.readAllBytes(Paths.get(pathname));

Convert the byte buffer into a String by specifying the character set.

String str = new String(buf, StandardCharsets.UTF_8);
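
Put together as a complete program, the two steps might look like the sketch below. The class name NioReadExample, the helper readFile and the temporary file are illustrative assumptions, not part of the original code.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NioReadExample {
    // Hypothetical helper: reads every byte of the file, then decodes
    // the bytes with the given character set.
    static String readFile(Path path, Charset charset) throws IOException {
        byte[] buf = Files.readAllBytes(path);
        return new String(buf, charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the example is self-contained.
        Path tmp = Files.createTempFile("nio-read", ".txt");
        try {
            Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readFile(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Because readAllBytes loads the entire file into memory, this approach suits files that comfortably fit in the heap.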

3. Scan for end-of-input

Another way to read the whole file into a String is to use the Scanner class. Create a scanner with the file as input, set the appropriate delimiter and read the next token.

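Note: the delimiter used in the code below, \A, is the regular expression for the beginning of input. Since it cannot match anywhere other than the beginning of input, next() returns the whole file as a single token. For an empty file there is no token at all, and next() throws NoSuchElementException.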

Scanner scanner = null;
try {
    scanner = new Scanner(new File(pathname), "UTF-8");
    return scanner.useDelimiter("\\A").next();
} finally {
    if ( scanner != null ) scanner.close();
}
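
A self-contained variant of the Scanner approach is sketched below. It assumes Java 7+ for try-with-resources and adds a guard for empty files; the class name ScannerReadExample and the helper readFile are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

class ScannerReadExample {
    // Hypothetical helper: returns the whole file as one token.
    static String readFile(File file) throws IOException {
        // try-with-resources closes the scanner automatically.
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            scanner.useDelimiter("\\A");
            // An empty file yields no token, so next() would throw
            // NoSuchElementException; return "" in that case instead.
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("scanner-read", ".txt");
        try {
            Files.write(tmp.toPath(), "first\nsecond\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readFile(tmp));
        } finally {
            tmp.delete();
        }
    }
}
```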

4. Memory Mapped File Reading

This method maps the file contents directly into memory using the MappedByteBuffer class. Memory mapping might lead one to expect better performance. However, this advantage is only available if the buffer is used directly. In our case, creating a String means every byte is decoded and copied anyway, so the speed advantage of memory-mapped buffers is probably not visible.

private static String readFile3(String pathname)
    throws java.io.IOException
{
    File f = new File(pathname);
    RandomAccessFile file = new RandomAccessFile(pathname, "r");
    MappedByteBuffer buffer = file.getChannel().map(MapMode.READ_ONLY,
                                                    0,
                                                    f.length());
    // The mapping remains valid after the file is closed.
    file.close();
    // Decode the mapped bytes as UTF-8 and copy them into a String.
    return StandardCharsets.UTF_8.decode(buffer).toString();
}

ByteBuffer provides a method asCharBuffer() which returns a “view” of the byte buffer as a character buffer. However, this method offers no way to specify the encoding for converting bytes to characters: it simply treats each pair of bytes as one 16-bit char in the buffer's byte order. The correct way to convert a ByteBuffer to a CharBuffer is to use Charset.decode(ByteBuffer) with the appropriate Charset instance.
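
The difference between the two conversions can be seen in a small experiment. The sketch below uses hypothetical names (CharViewExample, viaAsCharBuffer, viaDecode) and a two-byte UTF-8 input chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CharViewExample {
    // Reinterprets the bytes as 16-bit chars; no charset is involved.
    static String viaAsCharBuffer(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    // Decodes the bytes as UTF-8 text.
    static String viaDecode(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "hi".getBytes(StandardCharsets.UTF_8); // 0x68 0x69
        // A wrapped buffer is big-endian by default, so the two bytes
        // are fused into the single char U+6869.
        System.out.println(viaAsCharBuffer(bytes).length()); // prints 1
        System.out.println(viaDecode(bytes));                // prints hi
    }
}
```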

5. Simple Way Using java.io

Of course, there is always the “old” way (pre-Java 7) of reading a whole file into a String: reading the characters in a loop and appending to a buffer. Note that FileReader(String) decodes the file using the default charset rather than one you specify.

private static String readFile3(String pathname)
    throws java.io.IOException
{
    FileReader in = null;
    try {
        in = new FileReader(pathname);
        char[] buf = new char[2048];
        int len;
        StringBuilder sbuf = new StringBuilder();
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            sbuf.append(buf, 0, len);
        }
        return sbuf.toString();
    } finally {
        if ( in != null ) in.close();
    }
}
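
FileReader(String) relies on the default charset, so when the file's encoding is known it may be safer to name it explicitly. The following sketch does so with InputStreamReader and, assuming Java 7+, try-with-resources; the class name ReaderLoopExample and the helper readFile are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderLoopExample {
    // Hypothetical helper: same read loop as above, with an explicit charset.
    static String readFile(String pathname, Charset charset) throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(pathname), charset)) {
            char[] buf = new char[2048];
            int len;
            StringBuilder sbuf = new StringBuilder();
            while ((len = in.read(buf, 0, buf.length)) != -1) {
                sbuf.append(buf, 0, len);
            }
            return sbuf.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("reader-loop", ".txt");
        try {
            Files.write(tmp, "some text\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readFile(tmp.toString(), StandardCharsets.UTF_8));
        } finally {
            Files.delete(tmp);
        }
    }
}
```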

6. Benchmarking Various Approaches

Since we have several methods of reading a whole file into a string, it is interesting to see how the methods stack up against one another in performance. To this end, we implemented a simple benchmarking method using System.currentTimeMillis(). The following shows, for each method, the total time for 1000 runs and the average time per run.

simple       235 ms for 1000 iters: 0.235000 ms/op
nio          213 ms for 1000 iters: 0.213000 ms/op
scanner      629 ms for 1000 iters: 0.629000 ms/op
mmap         285 ms for 1000 iters: 0.285000 ms/op

For the set of conditions under which the application ran, the NIO method was the fastest, followed by the Simple method. The Scanner method was the slowest, which is somewhat expected since a regular expression search is involved. A note of warning: do not use these benchmark numbers alone to pick a method. Rather, use the method that best fits the way the rest of your code is written.
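
The benchmarking code itself is not shown here. A minimal sketch of a timing loop built on System.currentTimeMillis() might look as follows; the names TimingExample, timeMillis and report, and the placeholder task, are assumptions made for illustration.

```java
import java.util.Locale;

class TimingExample {
    // Runs the task `iters` times and returns the elapsed wall-clock
    // time in milliseconds, measured with System.currentTimeMillis().
    static long timeMillis(Runnable task, int iters) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iters; i++) {
            task.run();
        }
        return System.currentTimeMillis() - start;
    }

    // Formats one result line in the same layout as the table above.
    static String report(String name, long totalMs, int iters) {
        return String.format(Locale.ROOT, "%-12s %d ms for %d iters: %f ms/op",
                             name, totalMs, iters, (double) totalMs / iters);
    }

    public static void main(String[] args) {
        int iters = 1000;
        // Placeholder task; a real run would call one of the file readers.
        Runnable task = () -> { new StringBuilder("x").reverse(); };
        long total = timeMillis(task, iters);
        System.out.println(report("placeholder", total, iters));
    }
}
```

Timings from such a loop include JIT warm-up and file-system caching effects, which is one more reason to treat the numbers as rough.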

Conclusion

You are now aware of various methods of reading the whole contents of a file into a String. Pick whatever suits you best and use it!

How to Convert UTF-16 Text File to UTF-8 in Java?

1. Introduction

In this article, we show how to convert a text file from UTF-16 encoding to UTF-8. Such a conversion might be required because certain tools can only read UTF-8 text. Furthermore, the conversion procedure demonstrated here can be applied to convert a text file from any supported encoding to another.

UTF-8 is a character encoding that can represent all characters (or code points) defined by Unicode. It is designed to be backward compatible with ASCII: every ASCII file is also a valid UTF-8 file.

UTF-16 is another character encoding that encodes each character in one or two 16-bit code units, whereas UTF-8 encodes each character in one to four 8-bit code units.
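
To make the code-unit counts concrete, the sketch below measures three sample characters. The class and method names are hypothetical; the UTF-16 count relies on the fact that a Java String stores its text as UTF-16 code units, so length() returns that count directly.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitExample {
    // Number of 8-bit code units (bytes) in the UTF-8 encoding.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units in the UTF-16 encoding.
    static int utf16Units(String s) {
        return s.length();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, Latin capital letter A
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, an emoji outside the 16-bit range
        };
        for (String s : samples) {
            System.out.println(utf8Units(s) + " UTF-8 unit(s), "
                               + utf16Units(s) + " UTF-16 unit(s)");
        }
        // prints:
        // 1 UTF-8 unit(s), 1 UTF-16 unit(s)
        // 3 UTF-8 unit(s), 1 UTF-16 unit(s)
        // 4 UTF-8 unit(s), 2 UTF-16 unit(s)
    }
}
```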

2. Supported Character Sets

You can find the character sets supported by the JVM using the class java.nio.charset.Charset as follows:

// availableCharsets() returns a sorted map from canonical name to Charset.
for (String name : Charset.availableCharsets().keySet()) {
    System.out.println(name);
}

// prints the following
Big5
Big5-HKSCS
CESU-8
IBM-Thai
...
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
...
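
When only one particular encoding matters, listing every charset is unnecessary. A possible check is sketched below; the class CharsetCheckExample, the helper charsetOrDefault and the name X-NO-SUCH-CHARSET are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheckExample {
    // Returns the named charset, or the fallback if this JVM lacks it.
    static Charset charsetOrDefault(String name, Charset fallback) {
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        // UTF-16 is one of the charsets every Java platform must support.
        System.out.println(charsetOrDefault("UTF-16", StandardCharsets.UTF_8));
        // A name no JVM is expected to know, so the fallback is returned.
        System.out.println(charsetOrDefault("X-NO-SUCH-CHARSET", StandardCharsets.UTF_8));
    }
}
```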

3. Conversion Using java.io Classes

Java provides the java.io.InputStreamReader class as a bridge from byte streams to character streams. Open the file using this class to read characters decoded from the specified encoding:

Reader in = new InputStreamReader(new FileInputStream(infile), "UTF-16");

Analogously, the class java.io.OutputStreamWriter acts as a bridge from character streams to byte streams. Create a Writer with this class to write characters to the file as bytes in the specified encoding:

Writer out = new OutputStreamWriter(new FileOutputStream(outfile), "UTF-8");

With the Reader and Writer in place, it is trivial to copy data from the input file to the output file:

char[] cbuf = new char[2048];
int len;
while ((len = in.read(cbuf, 0, cbuf.length)) != -1) {
    out.write(cbuf, 0, len);
}
// Closing the Writer flushes any characters still buffered in it.
out.close();
in.close();

And that’s it! You have successfully read and converted data from UTF-16 to UTF-8. You can use this code to perform the conversion between any two character sets supported by your JVM.
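
Assembled into one method, the conversion could look like the sketch below. It assumes Java 7+ for try-with-resources, which closes (and therefore flushes) both streams even if an exception occurs; the class ConvertExample, the method convert and the temporary files are hypothetical.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConvertExample {
    // Hypothetical helper: copies infile (UTF-16) to outfile (UTF-8).
    static void convert(String infile, String outfile) throws IOException {
        try (Reader in = new InputStreamReader(
                 new FileInputStream(infile), StandardCharsets.UTF_16);
             Writer out = new OutputStreamWriter(
                 new FileOutputStream(outfile), StandardCharsets.UTF_8)) {
            char[] cbuf = new char[2048];
            int len;
            while ((len = in.read(cbuf, 0, cbuf.length)) != -1) {
                out.write(cbuf, 0, len);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("utf16", ".txt");
        Path dst = Files.createTempFile("utf8", ".txt");
        try {
            Files.write(src, "sample text\n".getBytes(StandardCharsets.UTF_16));
            convert(src.toString(), dst.toString());
            System.out.print(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8));
        } finally {
            Files.delete(src);
            Files.delete(dst);
        }
    }
}
```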

4. Using String for Converting Bytes

Sometimes, you may have a byte array which you need converted and output in a specific encoding. You can use the String class for these cases as shown below. First convert the byte array into a String:

String str = new String(bytes, 0, len, "UTF-16");

Next, obtain the bytes in the required encoding by using the String.getBytes(String) method:

byte[] outbytes = str.getBytes("UTF-8");

Write the byte array to an OutputStream:

OutputStream out = new FileOutputStream(outfile);
out.write(outbytes);
out.close();
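
The first two steps can be folded into a small helper, sketched here with Charset constants instead of charset names. The class TranscodeExample and the method transcode are hypothetical, and the input is assumed to be a complete, well-formed byte sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeExample {
    // Decodes the bytes from one charset and re-encodes them in another.
    static byte[] transcode(byte[] bytes, Charset from, Charset to) {
        return new String(bytes, from).getBytes(to);
    }

    public static void main(String[] args) {
        // "h" followed by U+00E9 (e with acute accent), encoded as UTF-16BE.
        byte[] utf16 = "h\u00e9".getBytes(StandardCharsets.UTF_16BE);
        byte[] utf8 = transcode(utf16, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        System.out.println(utf16.length + " bytes in, " + utf8.length + " bytes out");
        // prints: 4 bytes in, 3 bytes out
    }
}
```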

Note that while you could use the String class as shown to convert bytes, you should prefer the Reader/Writer combination when possible to avoid problems with multi-byte characters. Specifically, if the byte array is one chunk of a larger stream, it may contain an incomplete multi-byte character at the beginning or the end. The String constructor does not report this as an error; it silently substitutes the replacement character U+FFFD, which corrupts the text.
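
The effect of a chunk boundary falling inside a character can be reproduced in a few lines. In this sketch the class SplitChunkExample and the helper decodeChunk are hypothetical, and the chunk is deliberately cut one byte short.

```java
import java.nio.charset.StandardCharsets;

class SplitChunkExample {
    // Decodes one chunk of a UTF-16BE byte array on its own.
    static String decodeChunk(byte[] bytes, int offset, int len) {
        return new String(bytes, offset, len, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] bytes = "hi".getBytes(StandardCharsets.UTF_16BE); // 00 68 00 69
        String whole = decodeChunk(bytes, 0, bytes.length);
        // Three bytes: a complete 'h' plus half of the 'i'.
        String cut = decodeChunk(bytes, 0, 3);
        System.out.println(whole.equals("hi"));         // prints true
        System.out.println(cut.indexOf('\uFFFD') >= 0); // prints true
    }
}
```

A Reader avoids this because its decoder keeps the leftover bytes of a partial character until the rest arrives.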

Conclusion

When you need to convert text from one character encoding to another in Java, you have several options:

  • Using InputStreamReader and OutputStreamWriter bridge classes for conversion.
  • Using the String class directly with specified encoding.
  • A more advanced option is to use the CharsetEncoder and CharsetDecoder classes (not presented in this article).