How to Convert UTF-16 Text File to UTF-8 in Java?

1. Introduction

In this article, we show how to convert a text file from UTF-16 encoding to UTF-8. Such a conversion might be required because certain tools can only read UTF-8 text. Furthermore, the conversion procedure demonstrated here can be applied to convert a text file from any supported encoding to another.

UTF-8 is a character encoding that can represent all characters (or code points) defined by Unicode. It is designed to be backward compatible with legacy encodings such as ASCII.

UTF-16 is another character encoding that encodes characters in one or two 16-bit code units whereas UTF-8 encodes characters in a variable number of 8-bit code units.

2. Supported Character Sets

You can find the characters sets supported by the JVM using the class java.nio.charset.Charset as follows:

for (Map.Entry e : Charset.availableCharsets().entrySet()) {
     System.out.println(e.getKey());
}

// prints the following
Big5
Big5-HKSCS
CESU-8
IBM-Thai
...
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
...

3. Conversion Using java.io Classes

Java provides java.io.InputStreamReader class as a bridge between byte streams to character streams. Open the file using this class to be able to read character buffers in the specified encoding:

Reader in = new InputStreamReader(new FileInputStream(infile), "UTF-16");

Analogously, the class java.io.OutputStreamWriter acts as a bridge between characters streams and bytes streams. Create a Writer with this class to be able to write bytes to the file:

Writer out = new OutputStreamWriter(new FileOutputStream(outfile), "UTF-8");

With the Reader and Writer in place, it is trivial to copy data from the input file to the output file:

char cbuf[] = new char[2048];
int len;
while ((len = in.read(cbuf, 0, cbuf.length)) != -1) {
    out.write(cbuf, 0, len);
}

And that’s it! You have successfully read and converted data from UTF-16 to UTF-8. You can use this code to perform the conversion between any two character sets supported by your JVM.

4. Using String for Converting Bytes

Sometimes, you may have a byte array which you need converted and output in a specific encoding. You can use the String class for these cases as shown below. First convert the byte array into a String:

String str = new String(bytes, 0, len, "UTF-16");

Next, obtain the bytes in the required encoding by using the String.getBytes(String) method:

byte[] outbytes = str.getBytes("UTF-8");

Write the byte array to an OutputStream:

OutputStream out = new FileOutputStream(outfile);
out.write(outbytes);
out.close();

Note that while you could use the String class as shown to convert bytes, you should prefer using Reader/Writer combination when possible to avoid problems with multi-byte characters. Specifically, the byte array you have read may contain an incomplete multi-byte character at the beginning or the end. This may lead to character encoding errors.

Conclusion

When you need to convert text from one character encoding to another in Java, you have several options:

  • Using InputStreamReader and OutputStreamWriter bridge classes for conversion.
  • Using the String class directly with specified encoding.
  • A more advanced option is to use CharsetEncoder and CharsetDecoder class (not presented in this article).

Leave a Reply

Your email address will not be published. Required fields are marked *