Fun with Unicode in Java
Normally we don’t pay much attention to character encoding in Java. However, when we mix byte and char streams, things can get confusing unless we know the charset basics. Many tutorials and posts about character encoding are heavy on theory with few real examples. In this post, we try to demystify Unicode with easy-to-grasp examples.
Encode and Decode
Before diving into Unicode, let’s first understand the terms encode and decode. Suppose we capture a video in MPEG format: the encoder in the camera encodes the pixels into bytes, and on playback the decoder converts the bytes back into pixels. A similar process plays out when we create a text file. For example, when the letter H is typed in a text editor, the OS encodes the keystroke as the byte 0x48 and passes it to the editor. The editor holds the bytes in its buffer and passes them on to the windowing system, which decodes and displays the byte 0x48 as H. When the file is saved, 0x48 is written to the file.
In short, an encoder converts items such as pixels, audio streams or characters into binary bytes, and a decoder converts the bytes back to their original form.
Encode and Decode Java String
Let’s go ahead and encode some strings in Java.
The String.getBytes() method encodes the string into bytes (binary) using the US_ASCII charset, and the printBytes() method outputs the bytes in hex format. The hex output 0x48 0x65 0x6c 0x6c 0x6f is the binary form of the string Hello in ASCII.
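A minimal sketch of the encoding step; printBytes() here is a small helper written for this post, not a JDK method:

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Helper that prints each byte in hex, e.g. 0x48
    static void printBytes(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("0x%02x ", b);
        }
        System.out.println();
    }

    public static void main(String[] args) {
        byte[] bytes = "Hello".getBytes(StandardCharsets.US_ASCII); // encode chars to bytes
        printBytes(bytes); // 0x48 0x65 0x6c 0x6c 0x6f
    }
}
```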
Next, let’s see how to decode the bytes back as string.
Here we decode a byte array filled with 0x48 0x65 0x6c 0x6c 0x6f into a new string. The String constructor decodes the bytes with the US_ASCII charset, and the result is displayed as Hello.
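The decoding step might look like this:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        byte[] bytes = {0x48, 0x65, 0x6c, 0x6c, 0x6f};
        String str = new String(bytes, StandardCharsets.US_ASCII); // decode bytes to chars
        System.out.println(str); // Hello
    }
}
```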
We can omit the StandardCharsets.US_ASCII argument in new String(bytes) and str.getBytes(). The results will be the same, as the default charset of Java is UTF-8, which uses the same byte values for English letters as US_ASCII.
The ASCII encoding scheme is quite simple: each character is mapped to a single byte, for example, H is encoded as 0x48, e as 0x65 and so on. It can handle the English alphabet, numbers and control characters such as backspace, carriage return etc., but it cannot handle many Western or Asian language characters.
Say Hello in Mandarin
Hello in Mandarin is nĭ hăo. It is written using two characters 你 (nĭ) and 好 (hăo). Let’s encode and decode single character 你 (nĭ).
Encoding the character 你 with the UTF-8 charset returns an array of 3 bytes xe4 xbd xa0, which, on decoding, returns 你.
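A sketch of that round trip:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        String ni = "\u4f60"; // 你
        byte[] bytes = ni.getBytes(StandardCharsets.UTF_8); // encode
        for (byte b : bytes) {
            System.out.printf("x%02x ", b); // xe4 xbd xa0
        }
        System.out.println();
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // decode: 你
    }
}
```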
Let’s do the same with another standard character set UTF_16.
The UTF_16 charset encodes 你 into 4 bytes, xfe xff x4f x60, while UTF_8 manages it with 3 bytes. The first two bytes, xfe xff, are the byte order mark (BOM), which marks the stream as big-endian; the remaining two bytes, x4f x60, are the actual code unit.
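We can see the BOM by comparing UTF_16 with UTF_16BE, which writes the same code unit without a BOM:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        String ni = "\u4f60"; // 你
        // UTF_16 prepends a byte order mark: fe ff
        for (byte b : ni.getBytes(StandardCharsets.UTF_16)) {
            System.out.printf("x%02x ", b); // xfe xff x4f x60
        }
        System.out.println();
        // UTF_16BE writes just the code unit, no BOM
        for (byte b : ni.getBytes(StandardCharsets.UTF_16BE)) {
            System.out.printf("x%02x ", b); // x4f x60
        }
        System.out.println();
    }
}
```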
Just for the heck of it, try to encode 你 with US_ASCII: it returns the single byte x3f, which decodes to the ? character. This is because ASCII is a single-byte encoding scheme that can’t handle characters outside the English set, so the encoder substitutes ? for anything it cannot represent.
Introducing Unicode
Unicode is a coded character set (or simply character set) capable of representing most of the world’s writing systems. A recent version of Unicode contains around 138,000 characters covering 150 modern and historic languages and scripts, as well as symbol sets and emoji. The table below shows how some characters from different languages are represented in Unicode.
Character | Code Point | UTF_8 | UTF_16 | Language
---|---|---|---|---
a | U+0061 | 61 | 00 61 | English
Z | U+005A | 5a | 00 5a | English
â | U+00E2 | c3 a2 | 00 e2 | Latin
Δ | U+0394 | ce 94 | 03 94 | Greek
ع | U+0639 | d8 b9 | 06 39 | Arabic
你 | U+4F60 | e4 bd a0 | 4f 60 | Chinese
好 | U+597D | e5 a5 bd | 59 7d | Chinese
ಡ | U+0CA1 | e0 b2 a1 | 0c a1 | Kannada
ತ | U+0CA4 | e0 b2 a4 | 0c a4 | Kannada
Each character or symbol is represented by a unique code point. Unicode has 1,112,064 code points, of which around 138,000 are presently assigned. A Unicode code point is written as U+xxxx, where U signifies Unicode. The String.codePointAt(int index) method returns the code point of the character at the given index.
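For example:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "a\u4f60"; // 'a' followed by 你
        System.out.printf("U+%04X%n", s.codePointAt(0)); // U+0061
        System.out.printf("U+%04X%n", s.codePointAt(1)); // U+4F60
    }
}
```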
A charset can have one or more encoding schemes. Unicode has multiple encoding schemes, such as UTF_8, UTF_16, UTF_16LE and UTF_16BE, that map code points to bytes.
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes. In the above table, we can see that the length of the encoded bytes varies from 1 to 3 bytes for UTF-8. The majority of web pages use UTF-8.
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII. Valid ASCII text is therefore also valid UTF-8-encoded Unicode.
UTF-16
UTF-16 (16-bit Unicode Transformation Format) is another encoding scheme capable of handling all characters of the Unicode character set. The encoding is variable-length, as code points are encoded with one or two 16-bit code units (i.e. a minimum of 2 bytes and a maximum of 4 bytes).
Many systems, such as Windows, Java and JavaScript, use UTF-16 internally. It is also often used for plain text and word-processing data files on Windows, but rarely for files on Unix/Linux or macOS.
Java internally uses UTF-16. From Java 9 onwards, to reduce the memory taken by String objects, it uses either ISO-8859-1/Latin-1 (one byte per character) or UTF-16 (two bytes per character), based on the contents of the string. See JEP 254 (Compact Strings).
However, don’t confuse the internal charset with the Java default charset, which is UTF-8. For example, strings live in heap memory as UTF-16, but the method String.getBytes() returns bytes encoded in UTF-8, the default charset.
You can use CharInfo.java to display character details of a string.
- A character set is a collection of characters. Numbers, alphabets and Chinese characters are examples of character sets.
- A coded character set is a character set in which each character has an assigned integer value. Unicode, US-ASCII and ISO-8859-1 are examples of coded character sets.
- A code point is an integer assigned to a character in a coded character set.
- A character encoding maps between the code points of a coded character set and sequences of bytes. One coded character set may have one or more character encodings. For example, ASCII has one encoding scheme, while Unicode has multiple encoding schemes — UTF-8, UTF-16, UTF_16BE, UTF_16LE etc.
Java IO
Use the char stream IO classes Reader and Writer when dealing with text and text files. As already explained, the default charset of the Java platform is UTF-8: text written using a Writer is encoded in UTF-8, and a Reader reads text as UTF-8.
Using the java.io package, we can write and read a text file in the default charset as below.
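A minimal sketch, assuming the platform default charset can represent the text (the file name hello.txt is arbitrary):

```java
import java.io.*;

public class DefaultCharsetIO {
    public static void main(String[] args) throws IOException {
        File file = new File("hello.txt"); // hypothetical file name
        // FileWriter encodes chars using the default charset
        try (Writer writer = new FileWriter(file)) {
            writer.write("Hello \u4f60\u597d"); // Hello 你好
        }
        // FileReader decodes bytes using the default charset
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            System.out.println(reader.readLine()); // Hello 你好
        }
        file.delete();
    }
}
```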
The above example uses the char stream classes Writer and Reader directly, which use the default character set (UTF-8).
To encode/decode in a non-default charset, use the byte-oriented classes together with a bridge class that converts between bytes and chars. For example, to read a file as raw bytes, use FileInputStream and wrap it with InputStreamReader, a bridge that decodes the bytes to chars in the specified charset. Similarly, for output, use OutputStreamWriter (the bridge) over FileOutputStream (the byte output).
The following example writes a file in the UTF_16BE charset and reads it back.
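A sketch of the bridge classes in action (the file name nihou-16.txt is arbitrary):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16beFileIO {
    public static void main(String[] args) throws IOException {
        File file = new File("nihou-16.txt"); // hypothetical file name
        // OutputStreamWriter bridges chars -> bytes in the chosen charset
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_16BE)) {
            writer.write("\u4f60\u597d"); // 你好 stored as 4f 60 59 7d
        }
        // InputStreamReader bridges bytes -> chars in the chosen charset
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_16BE)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                System.out.printf("U+%04X ", ch); // U+4F60 U+597D
            }
            System.out.println();
        }
        file.delete();
    }
}
```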
Transcoding
Transcoding is the direct digital-to-digital conversion from one encoding to another, such as UTF-8 to UTF-16. We regularly encounter transcoding with video, audio and image files, but rarely with text files.
Imagine we receive a stream of bytes over the network encoded in CP-1252 (Windows-1252) or ISO 8859-1 and want to save it to a text file in UTF-8.
There are a couple of options to transcode from one charset to another. The easiest way is to use the String class.
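A sketch of the String approach; the sample bytes here are made up for illustration (0xe9 is é in Windows-1252):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeWithString {
    public static void main(String[] args) {
        byte[] cp1252Bytes = {(byte) 0x48, (byte) 0xe9}; // "Hé" in Windows-1252
        String text = new String(cp1252Bytes, Charset.forName("windows-1252")); // decode
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);               // re-encode
        for (byte b : utf8Bytes) {
            System.out.printf("x%02x ", b); // x48 xc3 xa9
        }
        System.out.println();
    }
}
```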
While this is quite fast, it suffers when we deal with large byte streams, as heap memory gets allocated for multiple large strings. A better option is to use the java.io classes as shown below:
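A sketch of the streaming approach; in-memory streams stand in here for the real network input and output file, and the sample bytes are made up for illustration:

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeWithStreams {
    public static void main(String[] args) throws IOException {
        byte[] cp1252Bytes = {(byte) 0x48, (byte) 0xe9}; // "Hé" in Windows-1252
        try (Reader in = new InputStreamReader(
                 new ByteArrayInputStream(cp1252Bytes), Charset.forName("windows-1252"));
             ByteArrayOutputStream sink = new ByteArrayOutputStream();
             Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // decode CP-1252, re-encode UTF-8, chunk by chunk
            }
            out.flush();
            for (byte b : sink.toByteArray()) {
                System.out.printf("x%02x ", b); // x48 xc3 xa9
            }
            System.out.println();
        }
    }
}
```

Because the data is processed a buffer at a time, the whole payload never has to fit in memory as a single String.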
See Transcode.java for transcoding example and Char Server for a rough take on encoding between server and socket.
Play with Unicode in Linux terminal
We can work with text encodings in a Linux terminal using some simple commands. Note that the Linux terminal can display ASCII and UTF-8 files but not UTF-16.
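For example, a rough sketch using common tools (the file names are arbitrary; iconv and xxd are assumed to be installed):

```shell
# Write 你好 as UTF-8, then transcode it to UTF-16BE with iconv
printf '你好' > nihou-8.txt
iconv -f UTF-8 -t UTF-16BE nihou-8.txt > nihou-16.txt

# Dump the raw bytes of each file
xxd nihou-8.txt     # e4bd a0e5 a5bd  (3 bytes per character)
xxd nihou-16.txt    # 4f60 597d       (2 bytes per character)

# 'file' guesses the encoding of each file
file nihou-8.txt nihou-16.txt
```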
Further Reading
Some good posts about Unicode usage in Java.
Convert Unicode to UTF-8 in Java
Before moving onto their conversions, let us learn about Unicode and UTF-8.
Unicode is an international character-encoding standard capable of representing a majority of the written languages all over the globe. Unicode code points are usually written in hexadecimal. In Java, a char is a 16-bit UTF-16 code unit: the lowest value is \u0000 and the highest is \uFFFF, and characters beyond this range are represented by pairs of chars (surrogate pairs).
UTF-8 is a variable-width character encoding. It can be as compact as ASCII but can also represent any Unicode character, with some increase in file size. UTF stands for Unicode Transformation Format; the ‘8’ signifies that it uses 8-bit blocks (bytes) to represent a character. The number of bytes needed to represent a character varies from 1 to 4.
In order to convert Unicode to UTF-8 in Java, we use the getBytes() method. The getBytes() method encodes a String into a sequence of bytes and returns a byte array.
Declaration — The getBytes() method is declared as follows.
public byte[] getBytes(String charsetName) throws UnsupportedEncodingException
where charsetName is the specific charset by which the String is encoded into an array of bytes.
Let us see a program to convert Unicode to UTF-8 in Java using the getBytes() method.
Example
public class Example {
    public static void main(String[] args) throws Exception {
        String str1 = "\u0000";
        String str2 = "\uFFFF";
        byte[] arr = str1.getBytes("UTF-8");
        byte[] brr = str2.getBytes("UTF-8");
        System.out.println("UTF-8 for \\u0000");
        for (byte a : arr) {
            System.out.print(a);
        }
        System.out.println("\nUTF-8 for \\uffff");
        for (byte b : brr) {
            System.out.print(b);
        }
    }
}
Output
UTF-8 for \u0000
0
UTF-8 for \uffff
-17-65-65
Let us understand the above program. We have created two Strings.
String str1 = "\u0000"; String str2 = "\uFFFF";
String str1 is assigned \u0000 which is the lowest value in Unicode. String str2 is assigned \uFFFF which is the highest value in Unicode.
To convert them into UTF-8, we use the getBytes("UTF-8") method. This gives us an array of bytes as follows:
byte[] arr = str1.getBytes("UTF-8"); byte[] brr = str2.getBytes("UTF-8");
Then to print the byte array, we use an enhanced for loop as follows −
for (byte a : arr) {
    System.out.print(a);
}
for (byte b : brr) {
    System.out.print(b);
}
Byte Encodings and Strings
If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.
The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java .
The StringConverter program starts by creating a String containing Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as AêñüC.
To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();
    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java . Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43