Java regex and Unicode

Contents

Unicode Support
Matching a Specific Code Point
Unicode Character Properties
Scripts
Blocks
General Category

Unicode Support

As of the JDK 7 release, Regular Expression pattern matching has expanded functionality to support Unicode 6.0.

Matching a Specific Code Point

You can match a specific Unicode code point using an escape sequence of the form \uFFFF, where FFFF is the four-digit hexadecimal value of the code point you want to match. For example, \u6771 matches the Han character for east.

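Alternatively, you can specify a code point using Perl-style hex notation, \x{h...h}, where h...h is the hexadecimal value of the code point. Unlike \uFFFF, this form is not limited to four hex digits, so it can also express supplementary code points above U+FFFF.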
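The two notations can be compared in a small sketch. The class name CodePointMatchDemo and the helper forCodePoint are hypothetical names introduced for illustration; only Pattern, Integer.toHexString, and Character.toChars come from the JDK. Note that a backslash meant for the regex engine has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

final class CodePointMatchDemo {

    // Hypothetical helper: builds a pattern that matches exactly one code
    // point using the Perl-style hex notation. The doubled backslash in the
    // literal produces a single backslash in the regex text.
    static Pattern forCodePoint(int codePoint) {
        return Pattern.compile("\\x{" + Integer.toHexString(codePoint) + "}");
    }

    public static void main(String[] args) {
        String east = "\u6771"; // the Han character for east

        // Four-digit escape, interpreted by the regex engine itself.
        System.out.println(Pattern.matches("\\u6771", east));

        // The same code point written in the braces form.
        System.out.println(forCodePoint(0x6771).matcher(east).matches());

        // A supplementary code point (U+1F600) takes two chars in a String
        // but is still matched as a single code point.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(forCodePoint(0x1F600).matcher(supplementary).matches());
    }
}
```

All three checks are expected to print true on JDK 7 or later.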

Unicode Character Properties

Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}. You can match a single character not belonging to a particular category with the expression \P{prop}.

The three supported property types are scripts, blocks, and a "general" category.
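As a minimal sketch of the lowercase and uppercase forms, the following uses the general category L (letter) as the property. The class and method names are hypothetical and chosen only for this example.

```java
import java.util.regex.Pattern;

final class PropertyNegationDemo {

    // True if the input is exactly one code point that is a letter.
    static boolean isSingleLetter(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // True if the input is exactly one code point that is NOT a letter.
    static boolean isSingleNonLetter(String s) {
        return Pattern.matches("\\P{L}", s);
    }

    // Removes every non-letter, keeping letters from any script.
    static String lettersOnly(String s) {
        return s.replaceAll("\\P{L}", "");
    }

    public static void main(String[] args) {
        System.out.println(isSingleLetter("\u6771"));
        System.out.println(isSingleNonLetter("7"));
        System.out.println(lettersOnly("a1-b2!"));
    }
}
```

The uppercase P form is often the more practical one, since it lets a single replaceAll call strip everything outside a property.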

Scripts

To determine if a code point belongs to a specific script, you can either use the script keyword, or the sc short form, for example, \p{script=Hiragana}. Alternatively, you can prefix the script name with the string Is, such as \p{IsHiragana}.

Valid script names supported by Pattern are those accepted by UnicodeScript.forName.
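The script forms can be tried side by side in a short sketch. Hiragana is used as the sample script, and the class name ScriptMatchDemo and its members are hypothetical; Character.UnicodeScript is the JDK class that provides forName and of.

```java
import java.util.regex.Pattern;

final class ScriptMatchDemo {

    // Three spellings of the same script test.
    static final Pattern KEYWORD = Pattern.compile("\\p{script=Hiragana}");
    static final Pattern SHORT_FORM = Pattern.compile("\\p{sc=Hiragana}");
    static final Pattern IS_PREFIX = Pattern.compile("\\p{IsHiragana}");

    // True only if the input is one Hiragana code point under all three forms.
    static boolean isHiragana(String s) {
        return KEYWORD.matcher(s).matches()
            && SHORT_FORM.matcher(s).matches()
            && IS_PREFIX.matcher(s).matches();
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042"; // Hiragana letter A
        String katakanaA = "\u30A2"; // Katakana letter A

        System.out.println(isHiragana(hiraganaA));
        System.out.println(isHiragana(katakanaA));

        // The same script name can be resolved directly through the JDK enum.
        System.out.println(Character.UnicodeScript.forName("Hiragana"));
        System.out.println(Character.UnicodeScript.of(hiraganaA.codePointAt(0)));
    }
}
```

Checking a candidate name with Character.UnicodeScript.forName before building a pattern is one way to fail early on a misspelled script name, since forName throws IllegalArgumentException for unknown names.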

Blocks

A block can be specified using the block keyword, or the blk short form, for example, \p{block=Mongolian}. Alternatively, you can prefix the block name with the string In, such as \p{InMongolian}.

Valid block names supported by Pattern are those accepted by UnicodeBlock.forName.
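A comparable sketch for blocks, using Mongolian as the sample block. The class name BlockMatchDemo and its members are hypothetical; Character.UnicodeBlock is the JDK class behind forName and of.

```java
import java.util.regex.Pattern;

final class BlockMatchDemo {

    // Three spellings of the same block test.
    static final Pattern KEYWORD = Pattern.compile("\\p{block=Mongolian}");
    static final Pattern SHORT_FORM = Pattern.compile("\\p{blk=Mongolian}");
    static final Pattern IN_PREFIX = Pattern.compile("\\p{InMongolian}");

    // True only if the input is one code point from the Mongolian block
    // under all three forms.
    static boolean isInMongolianBlock(String s) {
        return KEYWORD.matcher(s).matches()
            && SHORT_FORM.matcher(s).matches()
            && IN_PREFIX.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mongolianA = "\u1820"; // Mongolian letter A

        System.out.println(isInMongolianBlock(mongolianA));
        System.out.println(isInMongolianBlock("a"));

        // The block of a code point can also be looked up without a regex.
        System.out.println(Character.UnicodeBlock.of(mongolianA.codePointAt(0)));
    }
}
```

Note the different prefixes: Is introduces a script (or a category), while In introduces a block.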

General Category

Categories can be specified with the optional prefix Is. For example, \p{IsL} matches the category of Unicode letters, just as \p{L} does. Categories can also be specified by using the general_category keyword, or the short form gc. For example, an uppercase letter can be matched using \p{general_category=Lu} or \p{gc=Lu}.

Supported categories are those of The Unicode Standard in the version specified by the Character class.
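The category spellings can be checked against each other in a brief sketch. The class name GeneralCategoryDemo and its members are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GeneralCategoryDemo {

    // Four ways to write "uppercase letter" (category Lu).
    static final String[] UPPERCASE_FORMS = {
        "\\p{Lu}", "\\p{IsLu}", "\\p{general_category=Lu}", "\\p{gc=Lu}"
    };

    // True only if every spelling accepts the input as one uppercase letter.
    static boolean isUppercaseInAllForms(String s) {
        for (String regex : UPPERCASE_FORMS) {
            if (!Pattern.matches(regex, s)) {
                return false;
            }
        }
        return true;
    }

    // Counts uppercase letters anywhere in the text.
    static int countUppercase(String text) {
        Matcher m = Pattern.compile("\\p{gc=Lu}").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isUppercaseInAllForms("A"));
        System.out.println(isUppercaseInAllForms("\u0416")); // Cyrillic capital Zhe
        System.out.println(isUppercaseInAllForms("a"));
        System.out.println(countUppercase("Hello World"));
    }
}
```

Because the category is a Unicode property and not an ASCII range, the same pattern handles Latin, Cyrillic, and other cased scripts without changes.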
