Fun with Unicode in Java
Normally we don’t pay much attention to character encoding in Java. However, when we crisscross byte and char streams, things can get confusing unless we know the charset basics. Many tutorials and posts about character encoding are heavy in theory with little real examples. In this post, we try to demystify Unicode with easy to grasp examples.
Encode and Decode
Before diving into Unicode, first let’s understand terms — encode and decode. Suppose we capture a video in mpeg format, the encoder in the camera encodes the pixels into bytes and when played back, the decoder coverts back the bytes to pixels. Similar process plays out when we create a text file. For example, when letter H is typed in a text editor, the OS encodes the keystroke as byte 0x48 and pass it to editor. The editor holds the bytes in its buffer and pass it on to windowing system which decodes and displays the byte 0x48 as H. When file is saved, 0x48 gets into the file.
In short, encoder converts items such as pixels, audio stream or characters as binary bytes and decoder reconverts the bytes back original form.
Encode and Decode Java String
Let’s go ahead and encode some strings in Java.
The String.getBytes() method encodes the string as bytes (binary) using US_ASCII charset and printBytes() method outputs bytes in hex format. The hex output 0x48 0x65 0x6c 0x6c 0x6f is binary of form string Hello in ASCII.
Next, let’s see how to decode the bytes back as string.
Here we decode byte array filled with 0x48 0x65 0x6c 0x6c 0x6f as a new string. The String class decodes the bytes with US_ASCII charset which is displayed as Hello .
We can omit StandardCharsets.US_ASCII argument in new String(bytes) and str.getBytes() . The results will be same as default charset of Java is UTF-8 which use same hex value for English alphabets as US_ASCII.
The ASCII encoding scheme is quite simple where each character is mapped to a single byte, for example, H is encoded as 0x48, e as 0x65 and so on. It can handle English character set, numbers and control characters such as backspace, carriage return etc., but not many western or asian language characters etc.
Say Hello in Mandarin
Hello in Mandarin is nĭ hăo. It is written using two characters 你 (nĭ) and 好 (hăo). Let’s encode and decode single character 你 (nĭ).
Encoding the character 你 with UTF-8 character set returns an array of 3 bytes xe4 xbd xa0 , which on decode, returns 你.
Let’s do the same with another standard character set UTF_16.
Character set UTF_16 encodes 你 into 4 bytes — xfe xff x4f x60 while UTF_8 manages it with 3 bytes.
Just for heck of it, try to encode 你 with US_ASCII and it returns single byte x3f which decodes to ? character. This is because ASCII is single byte encoding scheme which can’t handle characters other than English alphabets.
Introducing Unicode
Unicode is coded character set (or simply character set) capable of representing most of the writing systems. The recent version of Unicode contains around 138,000 characters covering 150 modern and historic languages and scripts, as well as symbol sets and emoji. The below table shows how some characters from different languages are represent in Unicode.
Character | Code Point | UTF_8 | UTF_16 | Language |
---|---|---|---|---|
a | U+0061 | 61 | 00 61 | English |
Z | U+005A | 5a | 00 5a | English |
â | U+00E2 | c3 a2 | 00 e2 | Latin |
Δ | U+0394 | ce 94 | 03 94 | Latin |
ع | U+0639 | d8 b9 | 06 39 | Arabic |
你 | U+4F60 | e4 bd a0 | 4f 60 | Chinese |
好 | U+597D | e5 a5 bd | 59 7d | Chinese |
ಡ | U+0CA1 | e0 b2 a1 | 0c a1 | Kannada |
ತ | U+0CA4 | e0 b2 a4 | 0c a4 | Kannada |
Each character or symbol is represented by an unique Code point. Unicode has 1,112,064 code points out of which around 138,000 are presently defined. Unicode code point is represented as U+xxxx where U signifies it as Unicode. The String.codePointAt(int index) method returns code point for character.
A charset can have one or more encoding schemes and Unicode has multiple encoding schemes such as UTF_8, UTF_16, UTF_16LE and UTF_16BE that maps code point to bytes.
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a variable width character encoding capable of encoding all valid Unicode code points using one to four 8-bit bytes. In the above table, we can see that the length of encoded bytes varies from 1 to 3 bytes for UTF-8. Majority of web pages use UTF-8.
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII. The valid ASCII text is valid UTF-8-encoded Unicode as well.
UTF-16
UTF-16 (16-bit Unicode Transformation Format) is another encoding scheme capable of handling all characters of Unicode character set. The encoding is variable-length, as code points are encoded with one or two 16-bit code units (i.e minimum 2 bytes and maximum 4 bytes).
Many systems such as Windows, Java and JavaScript, internally, uses UTF-16. It is also often used for plain text and for word-processing data files on Windows, but rarely used for files on Unix/Linux or macOS.
Java internally uses UTF-16. From Java 9 onwards, to reduce the memory taken by String objects, it uses either ISO-8859-1/Latin-1 (one byte per character) or UTF-16 (two bytes per character) based upon the contents of the string. JEPS 254.
However don’t confuse the internal charset with Java default charset which is UTF-8. For example, the Strings live in heap memory as UTF-16, however the method String.getBytes() returns bytes encoded as UTF-8, the default charset.
You can use CharInfo.java to display character details of a string.
- Character set is collection of characters. Numbers, alphabets and Chinese characters are examples of character sets.
- Coded character set is a character set in which each character has an assigned int value. Unicode, US-ASCII and ISO-8859-1 are examples of coded character set.
- Code Point is an integer assigned to a character in a coded character set.
- Character encoding maps between code points of a coded character set and sequences of bytes. One coded character set may have one or more character encodings . For example, ASCII has one encoding scheme while Unicode has multiple encoding schemes — UTF-8, UTF-16, UTF_16BE, UTF_16LE etc.
Java IO
Use char stream IO classes Reader and Writer while dealing with text and text files. As already explained, the default charset of Java platform is UTF-8 and text written using Writer class is encoded in UTF-8 and Reader class reads the text in UTF-8.
Using java.io package, we can write and read a text file in default charset as below.
The above example, uses char stream classes — Writer and Reader — directly that uses default character set (UTF-8).
To encode/decode in non-default charset use byte oriented classes and use a bridge class to convert it char oriented class. For example, to read file as raw bytes use FileInputStream and wrap it with InputStreamReader , a bridge that can encode the bytes to chars in specified charset. Similarly for output, use OutputStreamWriter (bridge) and FileOutputWriter (byte output)
Following example, writes a file in UTF_16BE charset and reads it back.
Transcoding is the direct digital-to-digital conversion from an encoding to another, such as UTF-8 to UTF-16. We regularly encounter transcoding in video, audio and image files but rarely with text files.
Imagine, we receive a stream of bytes over the network encoded in CP-1252 (Windows-1252) or ISO 8859-1 and want to save it to text file in UTF 8.
There are couple of options to transcode from one charset to another. The easiest way to transcode it to use String class.
While this quite fast, it suffers when we deal with large set of byte as heap memory gets allocated to multiple large strings. Better option is to use java.io classes as shown below:
See Transcode.java for transcoding example and Char Server for a rough take on encoding between server and socket.
Play with Unicode in Linux terminal
We can work with text encoding in Linux terminal with some simple commands. Note that Linux terminal can display ASCII and UTF-8 files but not UTF-16.
delta-8.txt delta-16.txt nihou-8.txt nihou-16.txt nihou-16le.txt
Further Reading
Some good posts about Unicode usage in Java.
Java — Converting from unicode to a string?
However, now I want to retrieve the number above, add 3, and print out the new unicode character that the numbers 0003 corresponds to. Is there a way for me to retrieve the ACTUAL string of unichar? As in «\u0000»? That way I could substring just the «0000», convert it to an int, add 3, and reverse the entire process.
3 Answers 3
I think you’re looking for String#codePointAt :
Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length()- 1.
If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
// String containing smiling face with smiling eyes emoji String str = "😊"; // Get the code point int cp = str.codePointAt(0); // Show it System.out.println(str + ", code point = U+" + toHex(cp)); // Increase it ++cp; // Get the updated string (from an array of code points) String updated = new String(new int[] < cp >, 0, 1); // Show it System.out.println(updated + ", code point = U+" + toHex(cp));
( toHex is just return Integer.toString(n, 16).toUpperCase(); )
😊, code point = U+1F60A
😋, code point = U+1F60B
Convert UTF-8 encoded string to human readable string
How to convert any UTF8 strings to readable strings. Like : ⬠(in UTF8) is € I tried using Charset but not working.
I just want to convert unreadable strings which are in UTF8 format to reable string (ASCII or other readable charset).
You cannot convert «â¬» to «€». You can convert «âBPH¬» to «€» though. but you don’t need to as long as you don’t do encoding screwups like this in the first place.
5 Answers 5
You are encoding a string to ISO-8859-15 with byte[] b = «Üü?öäABC».getBytes(«ISO-8859-15»); then you are decoding it with UTF-8 System.out.println(new String(b, «UTF-8»)); . You have to decode it the same way with ISO-8859-15.
Well, yes, the line that works is doing it right System.out.println(new String(b, «ISO-8859-15»)); . It decodes an ISO-8859-15 encoded string with ISO-8859-15 decoder. The other line decodes an ISO-8859-15 encoded string with UTF-8. Of course it won’t work.
It is completely pointless to encode something as x and then decode it right after as x. It will not do anything, in the best case it will lose even more information.
I was just explaining the code in the question but the question was modified. Your comment and the down vote do not make any sense.
What he was doing in the old code was completely legit way to attempt to repair data. What you are suggesting in the answer is NO-OP that never makes sense. Just so I didn’t misunderstand — you are suggesting to encode string as ISO-8859-15 and then decoding the resulting bytes as ISO-8859-15. It doesn’t take much thinking to see that this will not do anything.
This is not «UTF-8» but completely broken and unrepairable data. Strings do not have encodings. It makes no sense to say «UTF-8» string in this context. String is a string of abstract characters — it doesn’t have any encodings except as an internal implementation detail that is not our concern and not related to your problem.
This is not true. Strings always have an encoding. Even in memory, the logical characters have to be physically encoded. Java strings use UTF-16 in memory. If you have a String that contains UTF-16 encoded UTF-8 octets, then you could copy the character values as-is to a Byte array and then decode them back to a normal UTF-16 encoded string using the String constructor that takes a byte array and encoding as input.
@RemyLebeau I guess you read only the first sentence of my answer. The internal encoding is not ever relevant except when dealing with astral planes — in that case the choice of UTF-16 leaks to users. The data type to store binary data (encoded text for example) is byte[] , not String.
A string in java is already an unicode representation. When you call one of the getBytes methods on it you get an encoded representation (as bytes, thus binary values) in a specific encoding — ISO-8859-15 in your example. If you want to convert this byte array back to an unicode string you can do that with one of the string constructors accepting a byte array, like you did, but you must do so using the exact same encoding the byte array was originally generated with. Only then you can convert it back to an unicode string (which has no encoding, and doesn’t need one).
Beware of the encoding-less methods, both the string constructor and the getBytes method, since they use the default encoding of the platform the code is running on, which might not be what you want to achieve.