Encode String to UTF-8
I have a String with a «ñ» character and I have some problems with it. I need to encode this String to UTF-8 encoding. I have tried it by this way, but it doesn’t work:
byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");
It’s unclear what exactly you’re trying to do. Does myString correctly contain the ñ character and you have problems converting it to a byte array (in that case see answers from Peter and Amir), or is myString corrupted and you’re trying to fix it (in that case, see answers from Joachim and me)?
I need to send myString to a server with utf-8 encoding and I need to convert the «ñ» character to utf-8 encoding.
Well, if that server expects UTF-8 then what you need to send it are bytes, not a String. So as per Peter’s answer, specify the encoding in the first line and drop the second line.
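For example, a minimal sketch of what that looks like, assuming the server connection is a plain Socket (the host, port and myString are placeholders, and this would live inside a method that handles IOException):

import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

try (Socket socket = new Socket("example.com", 8080);        // hypothetical host and port
     OutputStream out = socket.getOutputStream()) {
    byte[] utf8 = myString.getBytes(StandardCharsets.UTF_8);  // encode explicitly, once
    out.write(utf8);                                           // send bytes, not a String
    out.flush();
}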
@Michael: I agree that it isn’t clear what the real intent is here. There seem to be a lot of questions where people are trying to do explicit conversions between Strings and bytes rather than letting the I/O layer handle the encoding.
@Michael: Thanks, I suppose that makes sense. But it also makes it harder than it needs to be, doesn’t it? I am not very fond of languages that work that way, and so try to avoid working with them. I think Java’s model of Strings of characters instead of bytes makes things a whole lot easier. Perl and Python also share the “everything is Unicode strings” model. Yes, in all three you can still get at bytes if you work at it, but in practice it seems rare that you truly need to: that’s quite low-level. Plus it feels kinda like brushing a cat the wrong direction, if you know what I mean. 🙂
11 Answers
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString);
@Alex: it’s not possible to have a UTF-8 encoded Java String. You want bytes, so either use the ByteBuffer directly (which could even be the best solution if your goal is to send it over a network connection) or call array() on it to get a byte[]
Something else that may be helpful is to use Guava’s Charsets.UTF_8 constant instead of a String name that may throw an UnsupportedEncodingException. String -> bytes: myString.getBytes(Charsets.UTF_8) , and bytes -> String: new String(myByteArray, Charsets.UTF_8) .
The array returned by array() will most likely be bigger than needed and padded, as it is the ByteBuffer’s internal array. Better to use myString.getBytes(StandardCharsets.UTF_8), which returns a new array of exactly the right size.
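If you do go the ByteBuffer route, here is a hedged sketch of extracting exactly the encoded bytes, next to the simpler getBytes call (myString as in the question):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer buffer = StandardCharsets.UTF_8.encode(myString);
byte[] exact = new byte[buffer.remaining()]; // only the bytes actually written by the encoder
buffer.get(exact);

// Simpler alternative: returns a new array that is already the right size.
byte[] bytes = myString.getBytes(StandardCharsets.UTF_8);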
String objects in Java use the UTF-16 encoding that can’t be modified*.
The only thing that can have a different encoding is a byte[] . So if you need UTF-8 data, then you need a byte[] . If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).
* As a matter of implementation, String can internally use an ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn’t visible to users of String (i.e. you’ll never notice unless you dig into the source code or use reflection to dig into a String object).
Technically speaking, byte[] doesn’t have any encoding. Byte array PLUS encoding can give you string though.
@Peter: true. But attaching an encoding only makes sense for byte[] , it doesn’t make sense for String (unless the encoding is UTF-16, in which case it makes sense but it’s still unnecessary information).
String objects in Java use the UTF-16 encoding that can’t be modified. Do you have an official source for this quote?
@AhmadHajjar docs.oracle.com/javase/10/docs/api/java/lang/… : «The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.»
Thanks to you and rzymek for your helpful answers! You both saved me time: you with the theoretical part and rzymek with the practical part.
import static java.nio.charset.StandardCharsets.*;

byte[] ptext = myString.getBytes(ISO_8859_1);
String value = new String(ptext, UTF_8);
This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException .
If you’re using an older Java version you can declare the charset constants yourself:
import java.nio.charset.Charset;

public class StandardCharsets {
    public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
    public static final Charset UTF_8 = Charset.forName("UTF-8");
    // ...
}
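To make the checked-exception point concrete, a small sketch contrasting the two getBytes overloads, using the UTF_8 constant from above (myString as in the question; not part of the original answer):

// Charset overload: no checked exception to handle.
byte[] utf8 = myString.getBytes(UTF_8);

// String-named overload: forces a try/catch that can never meaningfully fire for UTF-8.
try {
    byte[] alsoUtf8 = myString.getBytes("UTF-8");
} catch (java.io.UnsupportedEncodingException e) {
    throw new AssertionError("UTF-8 is guaranteed to be supported", e);
}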
This is the right answer. If someone wants to stay with the String datatype, this gives it to them in the right format; the rest of the answers point to the byte-based approach.
Correct answer for me too. One thing though: when I used it as above, German characters changed to ?. So I used this instead: byte[] ptext = myString.getBytes(UTF_8); String value = new String(ptext, UTF_8); This worked fine.
The code sample doesn’t make sense. If you first convert to ISO-8859-1, then that array of bytes is not UTF-8, so the next line is totally incorrect. It will work for ASCII strings, of course, but then you could just as well make a simple copy: String value = new String(myString); .
Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes() . getBytes() uses the so-called «default encoding», which may not be UTF-8.
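To illustrate why relying on the default encoding is risky, a small sketch (Charset.defaultCharset() reports the platform default; StandardCharsets needs Java 7+):

// The default charset is platform-dependent, e.g. it is often not UTF-8 on older Windows setups.
System.out.println(java.nio.charset.Charset.defaultCharset());

// Explicit is safe everywhere:
byte[] utf8 = myString.getBytes(java.nio.charset.StandardCharsets.UTF_8);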
@Michael: he is clearly having trouble getting bytes from a string. How is getBytes(encoding) missing the point? I think the second line is there just to check whether he can convert it back.
I interpret it as having a broken String and trying to «fix» it by converting to bytes and back (common misunderstanding). There’s no actual indication that the second line is just checking the result.
@Peter: you’re right, we’d need clarification from Alex what he really means. Can’t rescind the downvote though unless the answer is edited.
A Java String is internally always encoded in UTF-16 — but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
So if you have an encoding problem, by the time you have a String, it’s too late to fix it. You need to fix the place where you create that String from a file, DB or network connection.
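For example, a minimal sketch of declaring the encoding at the point where the String is created (the file name and someInputStream are hypothetical; IOException handling omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reading a whole file as UTF-8:
String fromFile = new String(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);

// Wrapping an arbitrary stream (e.g. a network connection) with an explicit charset:
BufferedReader reader = new BufferedReader(
        new InputStreamReader(someInputStream, StandardCharsets.UTF_8));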
It’s a common mistake to believe that strings are internally encoded as UTF-16. Usually they are, but even if so, it is only an implementation-specific detail of the String class. Since the internal storage of the character data is not accessible through the public API, a specific String implementation may decide to use any other encoding.
@jarnbjo: The API explicitly states «A String represents a string in the UTF-16 format». Using anything else as internal format would be highly inefficient, and all actual implementations I know do use UTF-16 internally. So unless you can cite one that doesn’t, you’re engaging in pretty absurd hairsplitting.
The JVM (as far as it is relevant to the VM at all) uses UTF-8 for string encoding, e.g. in the class files. The implementation of java.lang.String is decoupled from the JVM and I could easily implement the class for you using any other encoding for the internal representation if that is really necessary for you to realize that your answer is incorrect. Using UTF-16 as the internal format is in most cases highly inefficient as well when it comes to memory consumption and I don’t see why e.g. Java implementations for embedded hardware wouldn’t optimize for memory instead of performance.
@jarnbjo: And once more: as long as you cannot give a concrete example of a JVM whose standard API implementation does internally use something other than UTF-16 to implement Strings, my statement is correct. And no, the String class is not really decoupled from the JVM, due to things like intern() and the constant pool.
Byte Encodings and Strings
If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.
The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java .
The StringConverter program starts by creating a String containing Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as: AêñüC
To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();

    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.
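A quick illustrative sketch of that length difference (not part of the StringConverter example):

String s = "Añ€";  // 'A' is 1 byte, 'ñ' is 2 bytes, '€' is 3 bytes in UTF-8
System.out.println(s.length());                                                  // 3 (chars)
System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);  // 6 (bytes)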
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java . Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" + UnicodeFormatter.byteToHex(array[k]));
    }
}
The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
Where to get «UTF-8» string literal in Java?
«UTF-8» appears in the code rather often, and it would be much better to refer to some static final variable instead. Do you know where I can find such a variable in the JDK? BTW, on second thought, such constants are bad design: Public Static Literals Are Not a Solution for Data Duplication
That’s some really bad advice from your link. He wants you to make a wrapper class for every possible string constant you might use?
11 Answers
In Java 1.7+, java.nio.charset.StandardCharsets defines Charset constants, including UTF_8 .
import java.nio.charset.StandardCharsets;
...
StandardCharsets.UTF_8.name();
For Android: minSdk 19
You don’t really need to call name() at all. You can directly pass the Charset object into the InputStreamReader constructor.
And there are other libs out there which do require a String , perhaps because of legacy reasons. In such cases, I keep a Charset object around, typically derived from StandardCharsets , and use name() if needed.
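A short sketch of both cases (inputStream and legacyApi are hypothetical names, just for illustration):

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8); // pass the Charset directly
legacyApi.setEncoding(StandardCharsets.UTF_8.name());                       // fall back to the name only where a String is required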
Now I use the org.apache.commons.lang3.CharEncoding.UTF_8 constant from commons-lang.
The Google Guava library (which I’d highly recommend anyway, if you’re doing work in Java) has a Charsets class with static fields like Charsets.UTF_8 , Charsets.UTF_16 , etc.
Since Java 7 you should just use java.nio.charset.StandardCharsets instead for comparable constants.
Note that these constants aren’t strings, they’re actual Charset instances. All standard APIs that take a charset name also have an overload that takes a Charset object, which you should use instead.