- Converting Between Character Encodings with Java
- Review of Types
- char Notes
- String Notes
- Unicode Ranges
- Single-byte example
- Multi-byte example
- Java Text File Encoding
- Is there a way to check the charset encoding of a .txt file with Java?
- How to detect the character encoding of a file?
- How to determine text encoding
Converting Between Character Encodings with Java
This assumes a reasonable level of familiarity with Unicode.
The example we will mainly use here is a string of Japanese text which roughly translates as “Initialize settings”. It was inspired by this Stack Overflow question: How to convert hex string to Shift-JIS encoding in Java?
There are certainly libraries out there which can help with the job of translating between character encodings, but I wanted to take a closer look at how this happens using Java.
Review of Types
The following Java types are of most interest here:
| Type | Signed? | Size | Range | Notes |
|---|---|---|---|---|
| byte | yes | 8 bits | -128 to 127 | |
| char | no | 16 bits | 0 to 65,535 | The only unsigned numeric primitive. |
| int | yes | 32 bits | approx. -2.1bn to 2.1bn | |
| String | n/a | n/a | n/a | See notes. |
char Notes
Yes, a char is stored as a 16-bit unsigned integer, representing a UTF-16 code unit (which, for BMP characters, is the Unicode code point itself). More on that below.
This is why you can assign an in-range integer literal to a char and perform arithmetic on char values, but you cannot assign an int variable to a char directly: that fails with a compile-time error (lossy conversion from int to char).
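The article's original snippets are not reproduced here; a minimal sketch of the two cases looks something like this:

```java
// A char is an unsigned 16-bit integer, so these all compile:
char c = 65;        // an in-range integer literal assigns directly; c is 'A'
char d = 'A';
d++;                // arithmetic on a char is allowed; d is now 'B'
int i = c + 1;      // a char widens to int implicitly; i is 66

// But this does not compile ("incompatible types: possible lossy
// conversion from int to char"), because an int may not fit in 16 bits:
int n = 65;
// char e = n;      // compile-time error
char e = (char) n;  // an explicit cast is required
```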
String Notes
Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[] . In Java 9 that changed to using a more compact format by default, as presented in JEP 254: Compact Strings:
Java changed its internal representation of the String class…
…from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.
These changes were purely internal. But it's worth noting that, from Java 9 onwards, Java uses a byte[] to store strings internally. And Java has never used UTF-8 for its internal representation of strings: it used to use only UTF-16, and it now uses ISO-8859-1/Latin-1 or UTF-16 as noted above.
Unicode Ranges
Early versions of Unicode defined 65,536 possible values from U+0000 to U+FFFF. These are often referred to as the Basic Multilingual Plane (BMP). This range was handled in earlier versions of Java by the char primitive: a single char represents a single BMP symbol.
Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF, which requires 21 bits (1,114,112 possible values). Characters outside the BMP range are referred to as "supplementary characters".
Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). But, again, as noted above, from Java 9 onwards the underlying storage is actually a byte array.
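For example (a sketch, not taken from the original article), the emoji U+1F600 lies outside the BMP, so in a String it occupies two char values, one from each surrogate range:

```java
String s = new String(Character.toChars(0x1F600));    // "😀"
System.out.println(s.length());                        // 2 (UTF-16 code units, not characters)
System.out.println(s.codePointCount(0, s.length()));   // 1
System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));  // D83D DE00
```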
Single-byte example
Taking the letter A, we know that it has a Unicode value of U+0041.
Consider the Java string String str = "A";
We can see what bytes make up that string as follows:
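The snippet itself is not shown above; assuming the bb variable name that the article refers to later, it would look something like this:

```java
import java.nio.charset.StandardCharsets;

String str = "A";
byte[] bb = str.getBytes(StandardCharsets.UTF_8);  // explicit charset, not the JVM default
```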
We provide an explicit charset instead of relying on the default charset of the JVM. We can use a string for the charset name, instead:
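A sketch of that overload:

```java
byte[] bb = str.getBytes("UTF-8");  // the charset is specified by name here
```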
In which case, we also need to handle the UnsupportedEncodingException. A list of charset names can be found in the IANA Charset Registry.
For the above example, our byte array contains a single element: the decimal integer value 65.
We can convert that from an integer to a hex value as follows:
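Something along these lines (using Integer.toHexString, which the article later calls "our previous approach"):

```java
String hexString = Integer.toHexString(bb[0]);  // "41"
```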
This gives us "41", which matches the Unicode value of U+0041, since the UTF-8 single-byte code point values correspond to the Unicode (and ASCII) values.
We can also convert from the hex string back to the original integer (65):
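A sketch using Integer.valueOf with a radix of 16:

```java
int decimal = Integer.valueOf(hexString, 16);  // 65
```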
(If you were trying to convert hex values outside the int range, you would need to use the equivalent Long methods, toHexString and valueOf.)
Multi-byte example
Consider the Java string String str = "設"; (the first character of the Japanese string mentioned at the start of this article). This is Unicode character U+8A2D. It has a UTF-8 encoding of 0xE8 0xA8 0xAD, and a Shift_JIS encoding of 0x90 0xDD.
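Getting the bytes in the same way as before (a reconstruction of the elided snippet, reusing the bb variable name):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String str = "設";
byte[] bb = str.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bb));  // [-24, -88, -83]
```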
This gives us a three-byte array: [-24, -88, -83]. Where did these numbers come from? Why are they negative? How do they relate to the UTF-8 encoding of 0xE8 0xA8 0xAD?
If we try our previous approach, String hexString = Integer.toHexString(bb[0]);, we get ffffffe8, which doesn't look right at all.
Because Java’s byte is a signed 8-bit integer, we first have to convert it to an unsigned integer.
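The article's snippet for this step is not shown above; the usual approach (a sketch) is to mask with 0xFF, or use Byte.toUnsignedInt, before converting to hex:

```java
// -24 & 0xFF == 232 == 0xE8, so masking recovers the unsigned byte value
String hex = Integer.toHexString(bb[0] & 0xFF);   // "e8"

// Or convert the whole array:
StringBuilder sb = new StringBuilder();
for (byte b : bb) {
    sb.append(String.format("%02x ", Byte.toUnsignedInt(b)));
}
System.out.println(sb.toString().trim());         // e8 a8 ad
```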
Java Text File Encoding
Yes, there are a number of ways to do character encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There's also cpdetector and a project by IBM called ICU4J. I'd take a look at the latter, as it seems to be more reliable than the other two. These libraries work by statistical analysis of the binary file; ICU4J will also give you a confidence level for the character encoding it detects, so you can use this in the case above. It works pretty well.
UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it’s a pretty good bet that the file is in that encoding — but it’s not a dead certainty. You may well also find that the file is in one of those encodings, but doesn’t have a byte order mark.
I don’t know much about ISO-8859-2, but I wouldn’t be surprised if almost every file is a valid text file in that encoding. The best you’ll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
There's no such thing as reading a file "as is" and getting text out: a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
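To illustrate the point (a sketch, not part of the original answer), decoding the same bytes with different charsets produces different text:

```java
import java.nio.charset.StandardCharsets;

byte[] bytes = { (byte) 0xE8, (byte) 0xA8, (byte) 0xAD };
System.out.println(new String(bytes, StandardCharsets.UTF_8));       // 設
System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));  // è¨ followed by a soft hyphen (U+00AD)
```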
You can use ICU4J (http://icu-project.org/apiref/icu4j/)
```java
import java.io.File;
import java.io.FileInputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

String charset = "ISO-8859-1"; // default charset, put whatever you want

// 'file' is the java.io.File you want to examine.
// Create a FileInputStream object.
FileInputStream fin = new FileInputStream(file.getPath());

/*
 * Create a byte array large enough to hold the content of the file.
 * Use File.length to determine the size of the file in bytes.
 */
byte[] fileContent = new byte[(int) file.length()];

/*
 * To read the content of the file into the byte array, use the
 * int read(byte[] byteArray) method of the Java FileInputStream class.
 */
fin.read(fileContent);
byte[] data = fileContent;

CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    // Here you have the encoding name and the confidence.
    // In my case, if the confidence is > 50 I return that encoding, else I return the default value.
    if (confidence > 50) {
        charset = cm.getName();
    }
}
```
Remember to add the try/catch blocks you need (the FileInputStream constructor and read both throw IOException).
I hope this works for you.
Is there a way to check the charset encoding of a .txt file with Java?
You cannot know with absolute certainty which charset is used in the general case. I found this to be a good read.
Especially the section "Automatic detection of encoding".
Um, theoretically, how would you know if it is Unicode?
This is the real question. Truthfully, you cannot know, but you can make a decent guess.
See: Java : How to determine the correct charset encoding of a stream for more details. 🙂
How to detect the character encoding of a file?
ICU4J’s CharsetDetector will help you.
```java
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(path));
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
String charsetName = cd.detect().getName();
```
By the way, what kind of character caused the error, and what kind of error was it? I think ICU4J could have the same problem, depending on the character and the error.
Apache Tika is a content analysis toolkit that is mainly useful for determining file types (as opposed to encoding schemes), but it does return content encoding information for text file types. I don't know if its algorithms are as advanced as JCharDet's, but it might be worth a try.
How to determine text encoding
This question is a duplicate of several previous ones. There are at least two libraries for Java that attempt to guess the encoding (although keep in mind that there is no way to guess right 100% of the time).
Of course, if you know the encoding will only be one of three or four options, you might be able to write a more accurate guessing algorithm.
The short answer is: you cannot.
Even in UTF-8, the BOM is entirely optional, and it's often recommended not to use it, since many apps do not handle it properly and just display it as if it were a printable character. The original purpose of byte order marks was to indicate the endianness of UTF-16 files.
This said, most apps that handle Unicode implement some sort of guessing algorithm. Read the beginning of the file and look for certain signatures.
If you don't know the encoding and don't have any indicators (like a BOM), it's not always possible to accurately "guess" the encoding. Some pointers exist that can give you hints.
For example, an ISO-8859-1 file will (usually) not contain any 0x00 bytes, whereas a UTF-16 file will have loads of them.
The most common solution is to let the user select the encoding if you cannot detect it.
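For what it's worth, a minimal sketch of the kind of guessing described above (check for a BOM, then fall back to a crude heuristic on 0x00 bytes); the method name and thresholds here are illustrative only:

```java
static String guessEncoding(byte[] b) {
    // UTF-8 BOM: EF BB BF
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
        return "UTF-8";
    }
    // UTF-16 BOMs: FE FF (big-endian), FF FE (little-endian)
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
        return "UTF-16BE";
    }
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
        return "UTF-16LE";
    }
    // Lots of 0x00 bytes usually means some flavour of UTF-16 without a BOM.
    int zeros = 0;
    for (byte value : b) {
        if (value == 0) {
            zeros++;
        }
    }
    if (zeros > b.length / 4) {
        return "UTF-16";
    }
    return "ISO-8859-1";  // fallback guess; let the user override it
}
```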