Java stringbuffer to utf 8

Converting Between Character Encodings with Java

This assumes a reasonable level of familiarity with Unicode.

The example we will mainly use here is the string of Japanese text:

…which roughly translates as “Initialize settings”. Inspired by this Stack Overflow question: How to convert hex string to Shift-JIS encoding in Java?

There are certainly libraries out there which can help with the job of translating between character encodings, but I wanted to take a closer look at how this happens using Java.

Review of Types

The following Java types are of most interest here:

Type Signed? Size Range Notes
byte yes 8 bits -128 — 127
char no 16 bits 0 — 65,535 The only unsigned number primitive.
int yes 32 bits -2.1bn — 2.1bn
String n/a n/a n/a See notes.

char Notes

Yes, a char is stored as a 16-bit unsigned integer, representing a Unicode code point. More on that below.

This is why you can do things like this:

but not this, due to a compile time error (lossy conversion from int to char ):

String Notes

Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[] . In Java 9 that changed to using a more compact format by default, as presented in JEP 254: Compact Strings:

Java changed its internal representation of the String class…

…from a UTF-16 char array to a byte array plus an encoding-flag field.

The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.

These changes were purely internal. But it’s worth noting that internally from Java 9 onwards, Java uses a byte[] to store strings. And Java has never used UTF-8 for its internal representation of strings. It used to only use UTF-16 — and it now uses ISO-8859- and UTF-16 as noted above.

Unicode Ranges

Early versions of Unicode defined 65,536 possible values from U+0000 to U+FFFF . These are often referred to as the Base Multilingual Plane (BMP). This was handled by earlier versions of Java by the char primitive. A single char represents a single BMP symbol.

Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF — which is 21 bits of data (approximately 1 million possible values). Characters outside of the BMP range are referred to as “supplementary characters”.

Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range, ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ). But, again, as noted above the underlying storage used by Java is actually a byte array.

Single-byte example

Taking the letter A , we know that has a Unicode value of U+0041 .

Consider the Java string String str = «A»;

We can see what bytes make up that string as follows:

We provide an explicit charset instead of relying on the default charset of the JVM. We can use a string for the charset name, instead:

In which case, we also need to handle the UnsupportedEncodingException . And a list of charset names can be found in the IANA Charset Registry.

For the above example, our byte array contains the decimal integer value 65 .

We can convert that from an integer to a hex value as follows:

This gives us «41» — which matches the Unicode value of U+0041 , since the UTF-8 single-byte code point values correspond to the Unicode values (and ASCII values).

We can also convert from the hex string back to the original integer ( 65 ):

(If you were trying to convert hex values outside the int range, you would need to use the Long equivalent methods of toHexString and valueOf .)

Multi-byte example

Consider the Java string String str = «設»; — the first character in the Japanese string mentioned at the start of this article. This is Unicode character U+8A2D . It has a UTF-8 encoding of 0xE8 0xA8 0xAD — and a Shift_JIS encoding of 0x90 0xDD .

…gives us a three-byte array: [ -24, -88, -83 ] . Where did these numbers come from? Why are they negative values? How do they relate to the UTF-8 encoding of 0xE8 0xA8 0xAD ?

If we try our previous approach String hexString = Integer.toHexString(bb[0]); , we get ffffffe8 , which doesn’t look right at all.

Because Java’s byte is a signed 8-bit integer, we first have to convert it to an unsigned integer.

Источник

Encode a String to UTF-8 in Java

announcement - icon

The Kubernetes ecosystem is huge and quite complex, so it’s easy to forget about costs when trying out all of the exciting tools.

To avoid overspending on your Kubernetes cluster, definitely have a look at the free K8s cost monitoring tool from the automation platform CAST AI. You can view your costs in real time, allocate them, calculate burn rates for projects, spot anomalies or spikes, and get insightful reports you can share with your team.

Connect your cluster and start monitoring your K8s costs right away:

We rely on other people’s code in our own work. Every day.

It might be the language you’re writing in, the framework you’re building on, or some esoteric piece of software that does one thing so well you never found the need to implement it yourself.

The problem is, of course, when things fall apart in production — debugging the implementation of a 3rd party library you have no intimate knowledge of is, to say the least, tricky.

Lightrun is a new kind of debugger.

It’s one geared specifically towards real-life production environments. Using Lightrun, you can drill down into running applications, including 3rd party dependencies, with real-time logs, snapshots, and metrics.

Learn more in this quick, 5-minute Lightrun tutorial:

announcement - icon

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server’s performance, with most of the profiling work done separately — so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you’ll have results within minutes:

announcement - icon

DbSchema is a super-flexible database designer, which can take you from designing the DB with your team all the way to safely deploying the schema.

The way it does all of that is by using a design model, a database-independent image of the schema, which can be shared in a team using GIT and compared or deployed on to any database.

And, of course, it can be heavily visual, allowing you to interact with the database using diagrams, visually compose queries, explore the data, generate random data, import data or build HTML5 database reports.

announcement - icon

The Kubernetes ecosystem is huge and quite complex, so it’s easy to forget about costs when trying out all of the exciting tools.

To avoid overspending on your Kubernetes cluster, definitely have a look at the free K8s cost monitoring tool from the automation platform CAST AI. You can view your costs in real time, allocate them, calculate burn rates for projects, spot anomalies or spikes, and get insightful reports you can share with your team.

Connect your cluster and start monitoring your K8s costs right away:

We’re looking for a new Java technical editor to help review new articles for the site.

1. Overview

When dealing with Strings in Java, we sometimes need to encode them into a specific charset.

Further reading:

Guide to Character Encoding

Guide to Java URL Encoding/Decoding

Java Base64 Encoding and Decoding

How to do Base64 encoding and decoding in Java, using the new APIs introduced in Java 8 as well as Apache Commons.

This tutorial is a practical guide showing different ways to encode a String to the UTF-8 charset.

For a more technical deep-dive, see our Guide to Character Encoding.

2. Defining the Problem

To showcase the Java encoding, we’ll work with the German String “Entwickeln Sie mit Vergnügen”:

String germanString = "Entwickeln Sie mit Vergnügen"; byte[] germanBytes = germanString.getBytes(); String asciiEncodedString = new String(germanBytes, StandardCharsets.US_ASCII); assertNotEquals(asciiEncodedString, germanString);

This String encoded using US_ASCII gives us the value “Entwickeln Sie mit Vergn?gen” when printed because it doesn’t understand the non-ASCII ü character.

But when we convert an ASCII-encoded String that uses all English characters to UTF-8, we get the same string:

String englishString = "Develop with pleasure"; byte[] englishBytes = englishString.getBytes(); String asciiEncondedEnglishString = new String(englishBytes, StandardCharsets.US_ASCII); assertEquals(asciiEncondedEnglishString, englishString);

Let’s see what happens when we use the UTF-8 encoding.

3. Encoding With Core Java

Let’s start with the core library.

Strings are immutable in Java, which means we cannot change a String character encoding. To achieve what we want, we need to copy the bytes of the String and then create a new one with the desired encoding.

First, we get the String bytes, and then we create a new one using the retrieved bytes and the desired charset:

String rawString = "Entwickeln Sie mit Vergnügen"; byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8); String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8); assertEquals(rawString, utf8EncodedString);

4. Encoding With Java 7 StandardCharsets

Alternatively, we can use the StandardCharsets class introduced in Java 7 to encode the String.

First, we’ll encode the String into bytes, and second, we’ll decode it into a UTF-8 String:

String rawString = "Entwickeln Sie mit Vergnügen"; ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString); String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString(); assertEquals(rawString, utf8EncodedString);

5. Encoding With Commons-Codec

Besides using core Java, we can alternatively use Apache Commons Codec to achieve the same results.

Apache Commons Codec is a handy package containing simple encoders and decoders for various formats.

First, let’s start with the project configuration.

When using Maven, we have to add the commons-codec dependency to our pom.xml:

 commons-codec commons-codec 1.14 

Then, in our case, the most interesting class is StringUtils, which provides methods to encode Strings.

Using this class, getting a UTF-8 encoded String is pretty straightforward:

String rawString = "Entwickeln Sie mit Vergnügen"; byte[] bytes = StringUtils.getBytesUtf8(rawString); String utf8EncodedString = StringUtils.newStringUtf8(bytes); assertEquals(rawString, utf8EncodedString);

6. Conclusion

Encoding a String into UTF-8 isn’t difficult, but it’s not that intuitive. This article presents three ways of doing it, using either core Java or Apache Commons Codec.

As always, the code samples can be found over on GitHub.

announcement - icon

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server’s performance, with most of the profiling work done separately — so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you’ll have results within minutes:

Источник

Читайте также:  Java learn step by step
Оцените статью