Encode a String to UTF-8 in Java
When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.
UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent every valid Unicode code point.
A code point can represent a single character, but it can also have other meanings, such as formatting. "Variable-width" means that different code points are encoded with different numbers of bytes (between one and four). As a space-saving measure, commonly used code points are generally represented with fewer bytes than those used less frequently.
UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
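To make the variable width concrete, here is a minimal sketch that counts the UTF-8 bytes for four sample characters. The class and method names (Utf8Width, utf8Length) are made up for illustration, and the characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; the class and method names are assumptions, not part of the article.
class Utf8Width {

    // Returns how many bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, in the ASCII range
        System.out.println(utf8Length("\u00DF"));       // U+00DF, the German sharp s
        System.out.println(utf8Length("\u3042"));       // U+3042, hiragana a
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, an emoji outside the BMP
    }
}
```

The four calls should report one, two, three and four bytes respectively, which covers the whole range UTF-8 uses.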
Note: Java encodes all Strings into UTF-16, which uses a minimum of two bytes to store code points. Why would we need to convert to UTF-8 then?
Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn’t support as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to a more universal UTF-8.
We’ll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis, such as č, ß and あ, simulating user input.
Let’s write out a couple of Strings:
String serbianString = "Šta radiš?"; // What are you doing?
String germanString = "Wie heißen Sie?"; // What's your name?
String japaneseString = "よろしくお願いします"; // Pleased to meet you.
Now, let’s use the String(byte[] bytes, Charset charset) constructor to recreate these Strings with a different Charset, simulating ASCII input that arrived to us in the first place:
// getBytes() without arguments encodes with the platform's default Charset
String asciiSerbianString = new String(serbianString.getBytes(), StandardCharsets.US_ASCII);
String asciiGermanString = new String(germanString.getBytes(), StandardCharsets.US_ASCII);
String asciiJapaneseString = new String(japaneseString.getBytes(), StandardCharsets.US_ASCII);

System.out.println(asciiSerbianString);
System.out.println(asciiGermanString);
System.out.println(asciiJapaneseString);
Once we’ve created these Strings by decoding their bytes as ASCII, we can print them:
The first two Strings contain only a few characters that aren’t valid ASCII, so most of their text survives. The final one doesn’t contain a single ASCII character, so none of it does.
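The sketch below shows one way to observe this damage programmatically. It assumes the input bytes are UTF-8 (the snippet above relies on the platform default instead), and the class and method names are hypothetical. Bytes outside the ASCII range cannot be decoded as US-ASCII, so the String constructor substitutes the replacement character U+FFFD for each of them.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; names are assumptions, not part of the article.
class AsciiDecodeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as US-ASCII.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    // Counts how many U+FFFD replacement characters the text contains.
    static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == 0xFFFD).count();
    }

    public static void main(String[] args) {
        // The Serbian sample String, written with escapes for the two non-ASCII letters
        String garbled = decodeUtf8BytesAsAscii("\u0160ta radi\u0161?");
        System.out.println(garbled);
        System.out.println(countReplacementChars(garbled));
    }
}
```

Each of the two non-ASCII letters takes two bytes in UTF-8, so the Serbian String ends up with four replacement characters while its ASCII letters come through untouched.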
To avoid this issue, we can assume that not all input might already be encoded to our liking — and encode it to iron out such cases ourselves. There are several ways we can go about encoding a String to UTF-8 in Java.
Encoding a String in Java means converting its characters into a sequence of bytes according to the rules of a Charset. Decoding is the reverse: interpreting a byte array with a given Charset to form a String instance.
Using the getBytes() method
The String class offers a getBytes() method, which encodes the String's characters and returns the result as a new byte array. Since the same text produces different bytes under different encodings, we can pass a Charset to getBytes() to choose the encoding.
By default, without a Charset argument, the bytes are encoded using the platform’s default Charset, which might not be UTF-8 (on Java 18 and later the default is UTF-8 unless configured otherwise). Let’s get the UTF-8 bytes of a String and print them out:
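If you're unsure what the platform default is on a given machine, a small check like the following can help. It is only a sketch with made-up names; it relies on Charset.defaultCharset(), which reports the Charset that the no-argument getBytes() uses.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Illustrative sketch; the class and method names are assumptions.
class DefaultCharsetDemo {

    // True when getBytes() and getBytes(defaultCharset) produce the same bytes.
    static boolean defaultMatchesExplicit(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // The printed name varies by JVM version and configuration
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(defaultMatchesExplicit("\u0160ta radi\u0161?"));
    }
}
```

Because the default can differ between machines, passing the Charset explicitly keeps the result the same everywhere.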
String serbianString = "Šta radiš?"; // What are you doing?
byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);

// Print each byte as a signed decimal value
for (byte b : bytes) {
    System.out.print(String.format("%s ", b));
}
-59 -96 116 97 32 114 97 100 105 -59 -95 63
These are the bytes of the UTF-8 encoding, printed as signed values (Java's byte type ranges from -128 to 127), and they’re not really useful to human eyes. Though, again, we can use String’s constructor to make a human-readable String from this very sequence. Without a Charset argument, the constructor decodes the bytes using the platform’s default Charset, so this works as long as that default is UTF-8:
// Decodes with the platform's default Charset
String utf8String = new String(bytes);
System.out.println(utf8String);
Note: To avoid depending on the platform default, you can pass the Charset used for decoding directly to the String constructor:
String utf8String = new String(bytes, StandardCharsets.UTF_8);
This outputs the exact same String we started with, reconstructed from its UTF-8 bytes:
Šta radiš?
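The negative numbers in the byte dump are easier to compare with UTF-8 tables when shown as unsigned hexadecimal. The following is a minimal sketch of that conversion; the HexDump class and its toHex method are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are assumptions.
class HexDump {

    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Masking with 0xFF turns the signed byte into a value from 0 to 255
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u0160ta radi\u0161?".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bytes)); // c5 a0 74 61 20 72 61 64 69 c5 a1 3f
    }
}
```

Here -59 and -96 become c5 and a0, the two-byte UTF-8 sequence for Š.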
Encode a String to UTF-8 with Java 7 StandardCharsets
Java 7 introduced the StandardCharsets class, which provides constants for several Charsets, such as US_ASCII, ISO_8859_1, UTF_8 and UTF_16, among others.
Each Charset has an encode() and a decode() method. The encode() method accepts a CharBuffer or a String, while decode() accepts a ByteBuffer and returns a CharBuffer. In practical terms, this means we can pass a String straight into the encode() method of a Charset.
The encode() method returns a ByteBuffer — which we can easily turn into a String again.
Earlier, when we used the getBytes() method, we stored the bytes in a byte array. With a Charset from StandardCharsets, things are a bit different: the encoded bytes come back in a ByteBuffer, which we then decode to get a String again. Let’s see how this works in code:
String japaneseString = "よろしくお願いします"; // Pleased to meet you.
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);

// Decode the buffer directly; byteBuffer.array() may include unused trailing bytes
String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();
System.out.println(utf8String);
Running this code results in:
よろしくお願いします
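If we need a plain byte array rather than a ByteBuffer, it's safer to copy only the bytes between the buffer's position and limit, since the backing array can be larger than the encoded content. The sketch below shows one way to do that; the class and method names are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are assumptions.
class ByteBufferBytes {

    // Encodes the text and copies exactly the encoded bytes out of the buffer.
    static byte[] encodeUtf8(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Wraps the bytes in a buffer and decodes them back into a String.
    static String decodeUtf8(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String original = "Wie hei\u00DFen Sie?";
        byte[] bytes = encodeUtf8(original);
        System.out.println(bytes.length);
        System.out.println(decodeUtf8(bytes).equals(original));
    }
}
```

The array returned by encodeUtf8() should match what getBytes(StandardCharsets.UTF_8) produces for the same String.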
Encode a String to UTF-8 with Apache Commons
The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.
For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.
Using Maven, let’s add the commons-codec dependency to our pom.xml file:
<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.15</version>
</dependency>
Alternatively if you’re using Gradle:
implementation 'commons-codec:commons-codec:1.15'
Now, we can use the utility classes of Apache Commons Codec. Here, we’ll be using its StringUtils class (org.apache.commons.codec.binary.StringUtils).
It allows us to convert Strings to and from bytes using various encodings required by the Java specification. This class is null-safe and thread-safe, so we’ve got an extra layer of protection when working with Strings.
To encode a String to UTF-8 with Apache Commons Codec’s StringUtils class, we can use the getBytesUtf8() method, which functions much like the getBytes() method with a specified Charset:
String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytesUtf8(germanString);
String utf8String = StringUtils.newStringUtf8(bytes);
System.out.println(utf8String);
Or, you can use the regular StringUtils class from the commons-lang3 dependency:
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
implementation group: 'org.apache.commons', name: 'commons-lang3', version: $
And now, we can use much the same approach as with regular Strings:
String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytes(germanString, StandardCharsets.UTF_8);
String utf8String = StringUtils.toEncodedString(bytes, StandardCharsets.UTF_8);
System.out.println(utf8String);
This approach is also thread-safe and null-safe.
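To illustrate what null-safety means here without pulling in a dependency, the following sketch wraps the plain JDK calls in null checks. It is not the Apache Commons implementation; NullSafeUtf8 and its methods are hypothetical helpers written only for this example, and returning null for null input is a design choice assumed here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper written for illustration; not part of Apache Commons.
class NullSafeUtf8 {

    // Returns the UTF-8 bytes of the text, or null when the text is null.
    static byte[] utf8BytesOrNull(String text) {
        return text == null ? null : text.getBytes(StandardCharsets.UTF_8);
    }

    // Returns the String decoded from UTF-8 bytes, or null when the bytes are null.
    static String utf8StringOrNull(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = null;
        // Calling input.getBytes(...) directly would throw a NullPointerException
        System.out.println(utf8BytesOrNull(input) == null);
        System.out.println(utf8StringOrNull(utf8BytesOrNull("Wie hei\u00DFen Sie?")));
    }
}
```

A helper like this lets callers pass possibly-null input through an encode and decode step without guarding every call site.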
Conclusion
In this tutorial, we’ve taken a look at how to encode a Java String to UTF-8. We’ve covered a few approaches: encoding with getBytes() and decoding with the String constructor, using the Java 7 StandardCharsets class, and using Apache Commons.
Byte Encodings and Strings
If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.
The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java.
The StringConverter program starts by creating a String containing Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.
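The length difference can be checked directly. The sketch below uses the StandardCharsets constants rather than encoding names, which also removes the need for the try block since no UnsupportedEncodingException can be thrown. It assumes ISO-8859-1 as a stand-in for the default encoding, because that matches the single-byte values in the tutorial's defaultBytes output; the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch; the class and method names are assumptions.
class LengthDemo {

    // The same five characters as the tutorial's original String.
    static final String ORIGINAL = "A\u00ea\u00f1\u00fcC";

    // Number of bytes the text occupies in the given Charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(ORIGINAL.length());                                 // characters
        System.out.println(byteLength(ORIGINAL, StandardCharsets.UTF_8));      // UTF-8 bytes
        System.out.println(byteLength(ORIGINAL, StandardCharsets.ISO_8859_1)); // ISO-8859-1 bytes
    }
}
```

The three accented letters each take two bytes in UTF-8 but one byte in ISO-8859-1, so the UTF-8 array is three bytes longer.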
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
    // Print every element with its index and its value in hexadecimal
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
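The byteToHex method itself is not shown in this section. A minimal sketch of what such a helper might look like follows; it is an assumption about its behavior based on the output above (two lowercase hex digits per byte), not the actual contents of UnicodeFormatter.java.

```java
// Hypothetical stand-in for UnicodeFormatter.byteToHex; the real source may differ.
class HexFormatterSketch {

    private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();

    // Converts one byte into its two-digit lowercase hexadecimal form.
    static String byteToHex(byte b) {
        char[] pair = {
            HEX_DIGITS[(b >> 4) & 0x0F], // high four bits
            HEX_DIGITS[b & 0x0F]         // low four bits
        };
        return new String(pair);
    }

    public static void main(String[] args) {
        System.out.println("0x" + byteToHex((byte) 0x41));
        System.out.println("0x" + byteToHex((byte) 0xC3));
    }
}
```

Masking after the shift matters because Java bytes are signed: without it, a value such as 0xC3 would produce a negative array index.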