Character source code java

Class Character

The Character class wraps a value of the primitive type char in an object. An object of class Character contains a single field whose type is char.

In addition, this class provides a large number of static methods for determining a character’s category (lowercase letter, digit, etc.) and for converting characters from uppercase to lowercase and vice versa.
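As a minimal sketch of the static methods mentioned above, the following example classifies a char and swaps its case. The class name CharacterBasics and the helper methods describe and swapCase are hypothetical names chosen for illustration; only the Character calls come from the JDK.

```java
class CharacterBasics {
    /** Describes a char using a few of the static classification methods. */
    static String describe(char c) {
        if (Character.isDigit(c)) return "digit";
        if (Character.isUpperCase(c)) return "uppercase letter";
        if (Character.isLowerCase(c)) return "lowercase letter";
        if (Character.isWhitespace(c)) return "whitespace";
        return "other";
    }

    /** Swaps the case of a cased character and leaves everything else unchanged. */
    static char swapCase(char c) {
        if (Character.isUpperCase(c)) return Character.toLowerCase(c);
        if (Character.isLowerCase(c)) return Character.toUpperCase(c);
        return c;
    }

    public static void main(String[] args) {
        Character boxed = Character.valueOf('q'); // wraps the primitive in an object
        System.out.println(describe(boxed));      // auto-unboxed back to char
        System.out.println(swapCase('Q'));
        System.out.println(describe('7'));
    }
}
```

These helpers take a char, so they only cover characters in the Basic Multilingual Plane.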

Unicode Conformance

The fields and methods of class Character are defined in terms of character information from the Unicode Standard, specifically the UnicodeData file that is part of the Unicode Character Database. This file specifies properties including name and category for every assigned Unicode code point or character range. The file is available from the Unicode Consortium at http://www.unicode.org.

Character information is based on the Unicode Standard, version 15.0.

The Java platform has supported different versions of the Unicode Standard over time. Upgrades to newer versions of the Unicode Standard occurred in the following Java releases, each indicating the new version:

Shows Java releases and supported Unicode versions
Java release Unicode version
Java SE 20 Unicode 15.0
Java SE 19 Unicode 14.0
Java SE 15 Unicode 13.0
Java SE 13 Unicode 12.1
Java SE 12 Unicode 11.0
Java SE 11 Unicode 10.0
Java SE 9 Unicode 8.0
Java SE 8 Unicode 6.2
Java SE 7 Unicode 6.0
Java SE 5.0 Unicode 4.0
Java SE 1.4 Unicode 3.0
JDK 1.1 Unicode 2.0
JDK 1.0.2 Unicode 1.1.5

Variations from these base Unicode versions, such as recognized appendixes, are documented elsewhere.

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF; the Unicode scalar values are the code points in this range other than the surrogates U+D800 to U+DFFF. (Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF) and the second from the low-surrogates range (\uDC00-\uDFFF).

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
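The difference between the char and int overloads can be sketched as follows, using the two values quoted in the list above. The class name SurrogateDemo and its helper methods are hypothetical and exist only for illustration.

```java
class SurrogateDemo {
    // The supplementary code point used as an example in the text (a CJK ideograph).
    static final int CJK_CODE_POINT = 0x2F81A;

    /** The char overload sees a lone high surrogate, which is not a letter by itself. */
    static boolean loneHighSurrogateIsLetter() {
        return Character.isLetter('\uD840');
    }

    /** The int overload sees the whole code point. */
    static boolean supplementaryIsLetter() {
        return Character.isLetter(CJK_CODE_POINT);
    }

    /** Returns the UTF-16 code units (one or two char values) for a code point. */
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(loneHighSurrogateIsLetter()); // false
        System.out.println(supplementaryIsLetter());     // true

        char[] units = utf16Units(CJK_CODE_POINT);
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true
    }
}
```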

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.

This is a value-based class; programmers should treat instances that are equal as interchangeable and should not use instances for synchronization, or unpredictable behavior may occur. For example, in a future release, synchronization may fail.
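One practical reading of this advice is to compare Character objects by value and not by reference. The sketch below assumes a hypothetical helper named sameValue in a hypothetical class CharacterEquality; it is an illustration of the guideline, not a pattern prescribed by the specification.

```java
class CharacterEquality {
    /** Compares two Character objects by value; also tolerates null arguments. */
    static boolean sameValue(Character a, Character b) {
        return java.util.Objects.equals(a, b);
    }

    public static void main(String[] args) {
        Character first = Character.valueOf('\u00e9');
        Character second = Character.valueOf('\u00e9');

        // Value comparison is well defined for equal instances.
        System.out.println(sameValue(first, second)); // true

        // Reference comparison (first == second) depends on whether the two
        // references happen to point at the same object, so it is avoided here.
    }
}
```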


Character and String APIs

The Character class encapsulates the char data type. In J2SE 5.0, many methods were added to the Character class to support supplementary characters. These methods fall into two categories: methods that convert between char and code point values, and methods that verify the validity of code points or map them.

This section describes a subset of the available methods in the Character class. For the complete list of available APIs, see the Character class specification.

Conversion Methods and the Character Class

The following table includes the most useful conversion methods, or methods that facilitate conversion, in the Character class. The codePointAt and codePointBefore methods are included in this list because text is generally found in a sequence, such as a String, and these methods can be used to extract the desired code point from it.

Method(s) Description
toChars(int codePoint, char[] dst, int dstIndex)
toChars(int codePoint)
Converts the specified Unicode code point to its UTF-16 representation and places it in a char array. Sample usage: Character.toChars(0x10400)
toCodePoint(char high, char low) Converts the specified surrogate pair to its supplementary code point value.
codePointAt(char[] a, int index)
codePointAt(char[] a, int index, int limit)
codePointAt(CharSequence seq, int index)
Returns the Unicode code point at the specified index. The first two methods take a char array, and the second of them enforces an upper limit on the index. The third method takes a CharSequence.
codePointBefore(char[] a, int index)
codePointBefore(char[] a, int index, int start)
codePointBefore(CharSequence seq, int index)
Returns the Unicode code point before the specified index. The first two methods accept a char array, and the second of them enforces a lower limit on the index. The third method accepts a CharSequence.
charCount(int codePoint) Returns the value 1 for characters that can be represented by a single char. Returns the value 2 for supplementary characters that require two char values.
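The conversion methods in the table can be combined as in the following sketch. The class name ConversionDemo and the helpers roundTrip, firstCodePoint, and lastCodePoint are hypothetical; the sample value 0x10400 is the one used in the table.

```java
class ConversionDemo {
    /** Converts a code point to UTF-16 code units and back again. */
    static int roundTrip(int codePoint) {
        char[] units = Character.toChars(codePoint);
        if (units.length == 2) {
            // A supplementary character: recombine the surrogate pair.
            return Character.toCodePoint(units[0], units[1]);
        }
        return units[0]; // a BMP character fits in one char
    }

    /** Returns the code point that starts at index 0 of a non-empty sequence. */
    static int firstCodePoint(CharSequence text) {
        return Character.codePointAt(text, 0);
    }

    /** Returns the code point that ends just before the end of a non-empty sequence. */
    static int lastCodePoint(CharSequence text) {
        return Character.codePointBefore(text, text.length());
    }

    public static void main(String[] args) {
        char[] units = Character.toChars(0x10400);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D801 DC00
        System.out.println(Character.charCount(0x10400));                 // 2
        System.out.println(Character.charCount('A'));                     // 1

        String text = "A" + new String(units);
        System.out.println(Integer.toHexString(firstCodePoint(text)));    // 41
        System.out.println(Integer.toHexString(lastCodePoint(text)));     // 10400
    }
}
```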

Verification and Mapping Methods in the Character Class

Some of the earlier methods that take the char primitive data type, such as isLowerCase(char) and isDigit(char), were supplanted by methods that support supplementary characters, such as isLowerCase(int) and isDigit(int). The earlier methods are still supported but do not work with supplementary characters. To create a global application and ensure that your code works seamlessly with any language, it is recommended that you use the newer forms of these methods.

Note that, for performance reasons, most methods that accept a code point do not verify the validity of the code point parameter. You can use the isValidCodePoint method for that purpose.
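Because the check is left to the caller, code that receives code points from an untrusted source may want to validate them first. A minimal sketch, assuming a hypothetical wrapper named isLetterChecked in a hypothetical class CodePointValidation:

```java
class CodePointValidation {
    /**
     * Rejects values outside the code point range before classifying them.
     * Throwing IllegalArgumentException is a design choice of this sketch.
     */
    static boolean isLetterChecked(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                    "Not a valid code point: 0x" + Integer.toHexString(codePoint));
        }
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isLetterChecked('a'));     // true
        System.out.println(isLetterChecked(0x2F81A)); // true
        try {
            isLetterChecked(0x110000); // one past the largest code point
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```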

The following table lists some of the verification and mapping methods in the Character class.

Method(s) Description
isValidCodePoint(int codePoint) Returns true if the code point is within the range of 0x0000 to 0x10FFFF, inclusive.
isSupplementaryCodePoint(int codePoint) Returns true if the code point is within the range of 0x10000 to 0x10FFFF, inclusive.
isHighSurrogate(char) Returns true if the specified char is within the high surrogate range of \uD800 to \uDBFF, inclusive.
isLowSurrogate(char) Returns true if the specified char is within the low surrogate range of \uDC00 to \uDFFF, inclusive.
isSurrogatePair(char high, char low) Returns true if the specified high and low surrogate code values represent a valid surrogate pair.
codePointCount(CharSequence, int, int)
codePointCount(char[], int, int)
Returns the number of Unicode code points in the specified range of the CharSequence or char array.
isLowerCase(int)
isUpperCase(int)
Returns true if the specified Unicode code point is a lowercase or an uppercase character, respectively.
isDefined(int) Returns true if the specified Unicode code point is defined in the Unicode standard.
isJavaIdentifierStart(char)
isJavaIdentifierStart(int)
Returns true if the specified character or Unicode code point is permissible as the first character in a Java identifier.
isLetter(int)
isDigit(int)
isLetterOrDigit(int)
Returns true if the specified Unicode code point is a letter, a digit, or either a letter or a digit, respectively.
getDirectionality(int) Returns the Unicode directionality property for the given Unicode code point.
Character.UnicodeBlock.of(int codePoint) Returns the object representing the Unicode block that contains the given Unicode code point or returns null if the code point is not a member of a defined block.
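To show how several of these methods fit together, the sketch below walks a CharSequence one code point at a time, not one char at a time. The class name CodePointScan and the methods countLetters and countSupplementary are hypothetical names used for illustration.

```java
class CodePointScan {
    /** Counts letters, treating each surrogate pair as one code point. */
    static int countLetters(CharSequence text) {
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            if (Character.isLetter(codePoint)) {
                letters++;
            }
            i += Character.charCount(codePoint); // advance by 1 or 2 char values
        }
        return letters;
    }

    /** Counts the code points that lie outside the Basic Multilingual Plane. */
    static int countSupplementary(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            if (Character.isSupplementaryCodePoint(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x2F81A)) + "1";
        System.out.println(text.length());                                    // 4
        System.out.println(Character.codePointCount(text, 0, text.length())); // 3
        System.out.println(countLetters(text));                               // 2
        System.out.println(countSupplementary(text));                         // 1
        System.out.println(Character.UnicodeBlock.of('a'));                   // BASIC_LATIN
    }
}
```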

Methods in the String Classes

The String, StringBuffer, and StringBuilder classes also have constructors and methods that work with supplementary characters. The following table lists some of the commonly used methods. For the complete list of available APIs, see the javadoc for the String, StringBuffer, and StringBuilder classes.

Constructor or Methods Description
String(int[] codePoints, int offset, int count) Allocates a new String instance that contains characters from a subarray of a Unicode code point array.
String.codePointAt(int index)
StringBuffer.codePointAt(int index)
StringBuilder.codePointAt(int index)
Returns the Unicode code point at the specified index.
String.codePointBefore(int index)
StringBuffer.codePointBefore(int index)
StringBuilder.codePointBefore(int index)
Returns the Unicode code point before the specified index.
String.codePointCount(int beginIndex, int endIndex)
StringBuffer.codePointCount(int beginIndex, int endIndex)
StringBuilder.codePointCount(int beginIndex, int endIndex)
Returns the number of Unicode code points in the specified range.
StringBuffer.appendCodePoint(int codePoint)
StringBuilder.appendCodePoint(int codePoint)
Appends the string representation of the specified code point to the sequence.
String.offsetByCodePoints(int index, int codePointOffset)
StringBuffer.offsetByCodePoints(int index, int codePointOffset)
StringBuilder.offsetByCodePoints(int index, int codePointOffset)
Returns the index that is offset from the given index by the given number of code points.
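The following sketch exercises several of the methods in this table. The class name StringCodePointDemo and the helpers fromCodePoints, reverseByCodePoint, and firstCodePoints are hypothetical; they assume well-formed UTF-16 input and a code point count that does not exceed the length of the text.

```java
class StringCodePointDemo {
    /** Builds a String from code points with the String(int[], int, int) constructor. */
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    /** Reverses a string by code point so that surrogate pairs stay intact. */
    static String reverseByCodePoint(String s) {
        StringBuilder reversed = new StringBuilder(s.length());
        for (int i = s.length(); i > 0; ) {
            int codePoint = s.codePointBefore(i);
            reversed.appendCodePoint(codePoint);
            i -= Character.charCount(codePoint);
        }
        return reversed.toString();
    }

    /** Returns the prefix that holds the first n code points of s. */
    static String firstCodePoints(String s, int n) {
        int end = s.offsetByCodePoints(0, n); // char index after n code points
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = fromCodePoints('A', 0x10400, 'B');
        System.out.println(s.length());                          // 4
        System.out.println(s.codePointCount(0, s.length()));     // 3
        System.out.println(firstCodePoints(s, 2).length());      // 3
        System.out.println(
                reverseByCodePoint(s).equals(fromCodePoints('B', 0x10400, 'A'))); // true
    }
}
```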
