Unicode Support
As of the JDK 7 release, regular expression pattern matching supports Unicode 6.0.
Matching a Specific Code Point
You can match a specific Unicode code point using an escape sequence of the form \uFFFF, where FFFF is the hexadecimal value of the code point you want to match. For example, \u6771 matches the Han character for east.
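Alternatively, you can specify a code point using Perl-style hex notation, \x{h...h}, where h...h is the hexadecimal value of the code point. For example, \x{6771} matches the same Han character. Unlike the \uFFFF form, which takes exactly four hexadecimal digits, this notation can also name supplementary code points above U+FFFF directly.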
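The two notations can be compared side by side in a short sketch. The class name CodePointMatch and the helper forCodePoint are illustrative names chosen here, not part of any API; only Pattern, Integer.toHexString, and Character.toChars come from the JDK.

```java
import java.util.regex.Pattern;

class CodePointMatch {
    // Hypothetical helper: builds a Perl-style hex pattern for any code point.
    static Pattern forCodePoint(int codePoint) {
        // "\\x{" in Java source reaches the regex engine as a backslash, x, and an opening brace
        return Pattern.compile("\\x{" + Integer.toHexString(codePoint) + "}");
    }

    public static void main(String[] args) {
        // The Han character for east, obtained from its code point
        String east = new String(Character.toChars(0x6771));

        // Four-digit form: the doubled backslash keeps the escape for the regex engine
        System.out.println(Pattern.matches("\\u6771", east));              // true

        // Perl-style hex form for the same code point
        System.out.println(forCodePoint(0x6771).matcher(east).matches());  // true

        // A supplementary code point occupies two chars in a String,
        // but the pattern still treats it as a single code point
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(forCodePoint(0x1F600).matcher(supplementary).matches()); // true
    }
}
```

Note the doubled backslash in the string literals: with a single backslash, the Java compiler would translate the Unicode escape itself before the regex engine ever saw it.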
Unicode Character Properties
Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}. You can match a single character not belonging to a particular category with the expression \P{prop}.
The three supported property types are scripts, blocks, and a "general" category.
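As a minimal sketch of the lowercase and uppercase forms, the following uses the general category Lu (uppercase letter) as the property. The class and method names are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class PropertyNegation {
    // Lowercase p: a character that has the property
    static final Pattern UPPER = Pattern.compile("\\p{Lu}");
    // Uppercase P: a character that does not have the property
    static final Pattern NOT_UPPER = Pattern.compile("\\P{Lu}");

    static boolean isUpper(String s) {
        return UPPER.matcher(s).matches();
    }

    static boolean isNotUpper(String s) {
        return NOT_UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUpper("A"));     // true
        System.out.println(isUpper("a"));     // false
        System.out.println(isNotUpper("a"));  // true
        System.out.println(isNotUpper("A"));  // false
    }
}
```

Each pattern consumes exactly one character, so for any single-character input exactly one of the two helpers returns true.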
Scripts
To determine if a code point belongs to a specific script, you can either use the script keyword, or the sc short form, for example, \p{script=Hiragana} or \p{sc=Hiragana}. Alternatively, you can prefix the script name with the string Is, such as \p{IsHiragana}.
Valid script names supported by Pattern are those accepted by Character.UnicodeScript.forName.
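The three spellings of a script property can be checked against each other with a small sketch like the one below. Hiragana is used as the sample script, and ScriptMatch and allMatch are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

class ScriptMatch {
    // Three spellings of the same script property
    static final Pattern LONG_FORM = Pattern.compile("\\p{script=Hiragana}");
    static final Pattern SHORT_FORM = Pattern.compile("\\p{sc=Hiragana}");
    static final Pattern IS_FORM = Pattern.compile("\\p{IsHiragana}");

    // True only if every spelling matches the input
    static boolean allMatch(String s) {
        return LONG_FORM.matcher(s).matches()
            && SHORT_FORM.matcher(s).matches()
            && IS_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String hiraganaA = new String(Character.toChars(0x3042)); // HIRAGANA LETTER A
        String katakanaA = new String(Character.toChars(0x30A2)); // KATAKANA LETTER A

        System.out.println(allMatch(hiraganaA)); // true
        System.out.println(allMatch(katakanaA)); // false

        // Script names can be checked ahead of time with the same lookup
        System.out.println(Character.UnicodeScript.forName("Hiragana"));
        System.out.println(Character.UnicodeScript.of(0x3042));
    }
}
```

A name that Character.UnicodeScript.forName rejects with an IllegalArgumentException is not a valid script name, so that call is a convenient way to validate a name before building a pattern from it.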
Blocks
A block can be specified using the block keyword, or the blk short form, for example, \p{block=Mongolian} or \p{blk=Mongolian}. Alternatively, you can prefix the block name with the string In, such as \p{InMongolian}.
Valid block names supported by Pattern are those accepted by Character.UnicodeBlock.forName.
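Block properties follow the same pattern as scripts, as this sketch suggests. The Mongolian block is used as the sample, and BlockMatch and allMatch are illustrative names, not part of the JDK.

```java
import java.util.regex.Pattern;

class BlockMatch {
    // Three spellings of the same block property
    static final Pattern LONG_FORM = Pattern.compile("\\p{block=Mongolian}");
    static final Pattern SHORT_FORM = Pattern.compile("\\p{blk=Mongolian}");
    static final Pattern IN_FORM = Pattern.compile("\\p{InMongolian}");

    // True only if every spelling matches the input
    static boolean allMatch(String s) {
        return LONG_FORM.matcher(s).matches()
            && SHORT_FORM.matcher(s).matches()
            && IN_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mongolianA = new String(Character.toChars(0x1820)); // MONGOLIAN LETTER A

        System.out.println(allMatch(mongolianA)); // true
        System.out.println(allMatch("a"));        // false, Basic Latin block

        // Block names can be checked ahead of time with the same lookup
        System.out.println(Character.UnicodeBlock.forName("Mongolian"));
        System.out.println(Character.UnicodeBlock.of(0x1820));
    }
}
```

The prefix differs from the script case: blocks use In, while scripts and general categories use Is.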
General Category
Categories can be specified with the optional prefix Is. For example, both \p{L} and \p{IsL} match the category of Unicode letters. Categories can also be specified by using the general_category keyword, or the short form gc. For example, an uppercase letter can be matched using \p{general_category=Lu} or \p{gc=Lu}.
Supported categories are those of The Unicode Standard in the version specified by the Character class.
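The equivalent spellings of a general category can be exercised with a sketch along these lines. CategoryMatch, isLetter, and isUppercase are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class CategoryMatch {
    // Letter category, with and without the optional Is prefix
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s)
            && Pattern.matches("\\p{IsL}", s);
    }

    // Uppercase letter, using the bare, keyword, and short keyword forms
    static boolean isUppercase(String s) {
        return Pattern.matches("\\p{Lu}", s)
            && Pattern.matches("\\p{general_category=Lu}", s)
            && Pattern.matches("\\p{gc=Lu}", s);
    }

    public static void main(String[] args) {
        System.out.println(isLetter("a"));     // true
        System.out.println(isUppercase("a"));  // false
        System.out.println(isUppercase("A"));  // true
        System.out.println(isLetter("1"));     // false

        // Character.getType reports the same category as an int constant
        System.out.println(Character.getType('A') == Character.UPPERCASE_LETTER); // true
    }
}
```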