PHP Unicode to bytes

Contents

# Unicode Support in PHP
# How to use :
# Output :
# Converting Unicode characters to their numeric value and/or HTML entities using PHP
# How to use :
# Output :
# Intl extension for Unicode support
utf8_encode
Description
Parameters
Return Values
Changelog
Examples
Notes
See Also
User Contributed Notes 24 notes
hidehalo / unicode.php

# Unicode Support in PHP

The following helpers convert a string to JSON-style \uXXXX escape sequences and back.

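if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        // json_encode() wraps the result in double quotes; strip them.
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        // Re-wrap in double quotes so json_decode() sees a JSON string.
        return json_decode(sprintf('"%s"', $str));
    }
}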

# How to use :

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));

# Output :

Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
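The JSON approach above also handles characters outside the Basic Multilingual Plane, which json_encode() writes as a UTF-16 surrogate pair. The following is a minimal sketch of that behaviour; the helper name codepoint_escape is hypothetical and simply mirrors codepoint_encode. Note that json_encode() also escapes quotes, backslashes and forward slashes, so this round trip is best suited to plain text.

```php
<?php
// Hypothetical helper mirroring codepoint_encode(): escape non-ASCII as \uXXXX.
function codepoint_escape(string $str): string
{
    return substr(json_encode($str), 1, -1);
}

$bmp    = codepoint_escape("\u{6211}");   // U+6211, inside the BMP
$astral = codepoint_escape("\u{1F600}");  // U+1F600, outside the BMP

echo $bmp, "\n";     // \u6211
echo $astral, "\n";  // \ud83d\ude00 (a UTF-16 surrogate pair)

// Decoding joins the surrogate pair back into one 4-byte UTF-8 character.
$decoded = json_decode('"' . $astral . '"');
echo bin2hex($decoded), "\n"; // f09f9880
```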

# Converting Unicode characters to their numeric value and/or HTML entities using PHP

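The following polyfills define a few mbstring-style functions on top of iconv when they are missing, and add helpers for converting between characters, code points and numeric HTML entities.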

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        // Fall back to iconv's internal encoding setting.
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // UCS-4BE is the code point as a 32-bit big-endian integer.
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        // Replace every non-ASCII character with a numeric entity.
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

# How to use :

echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));

# Output :

Get string from numeric DEC value
string(4) "ď"
string(2) "ď"

Get string from numeric HEX value
string(4) "ď"
string(2) "ď"

Get numeric value of character as DEC int
int(50319)
int(271)

Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"

Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"

Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
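To see the raw bytes behind a character such as ď (U+010F), the built-in html_entity_decode(), unpack() and bin2hex() functions are enough. This is a small sketch that uses only core functions; the variable names are illustrative.

```php
<?php
// Decode a numeric entity to a UTF-8 string, then list its raw bytes.
$char  = html_entity_decode('&#x10F;', ENT_QUOTES | ENT_HTML5, 'UTF-8'); // U+010F
$bytes = array_values(unpack('C*', $char)); // one unsigned integer per byte

printf("length in bytes: %d\n", strlen($char));        // length in bytes: 2
echo implode(' ', array_map('dechex', $bytes)), "\n";  // c4 8f
echo bin2hex($char), "\n";                             // c48f
```

The two bytes c4 8f match the 'UCS-4BE' variant of the output above, where 0xC48F (50319) is the UTF-8 byte sequence read as a number rather than the code point 0x010F (271).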

# Intl extension for Unicode support

Native string functions operate on single bytes, so they do not work well with Unicode. The iconv and mbstring extensions offer some support for Unicode, while the Intl extension offers full support. Intl is a wrapper for the de facto standard ICU library; see http://site.icu-project.org

ICU offers full internationalization, of which Unicode is only a small part. You can transcode easily:

\UConverter::transcode($sString, 'UTF-8', 'UTF-8'); // strip bad bytes against attacks

But do not dismiss iconv just yet. Consider:

\iconv('UTF-8', 'ASCII//TRANSLIT', "Cliënt"); // output: "Client"
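Before stripping or transliterating anything, it can help to know whether the input is valid UTF-8 in the first place. The sketch below assumes nothing beyond PCRE, which is always compiled into PHP; the function name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: check a string for UTF-8 validity without Intl or mbstring.
function is_valid_utf8(string $s): bool
{
    // With the /u modifier PCRE refuses subjects that are not valid UTF-8,
    // so preg_match() returns false instead of 1.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("Cli\xC3\xABnt")); // bool(true): "Cliënt" in UTF-8
var_dump(is_valid_utf8("Cli\xEBnt"));     // bool(false): bare ISO-8859-1 byte
```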


utf8_encode

This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

This function converts the string $string from the ISO-8859-1 encoding to UTF-8.

Note:

This function does not attempt to guess the current encoding of the provided string; it assumes the string is encoded as ISO-8859-1 (also known as "Latin 1") and converts it to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this never results in an error, but it will not result in a useful string if a different encoding was intended.

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers will interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), instead of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.
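The reason the conversion can never fail is that each ISO-8859-1 byte is itself a code point in the range U+0000 to U+00FF. The following is a byte-by-byte sketch of that mapping, written only to illustrate the idea; latin1_to_utf8 is a hypothetical name and this is not how PHP implements utf8_encode internally.

```php
<?php
// Hypothetical helper illustrating the ISO-8859-1 -> UTF-8 mapping.
function latin1_to_utf8(string $latin1): string
{
    $out = '';
    for ($i = 0, $n = strlen($latin1); $i < $n; $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 0x80) {
            $out .= $latin1[$i]; // ASCII bytes are unchanged
        } else {
            // U+0080..U+00FF become two bytes: 110000xx 10xxxxxx
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

echo bin2hex(latin1_to_utf8("\x5A\x6F\xEB")), "\n"; // 5a6fc3ab
echo bin2hex(latin1_to_utf8("\x80")), "\n";         // c280
```

The second line shows the Windows-1252 problem: byte 0x80 becomes the control character U+0080 (c2 80), not the Euro sign.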

Parameters

string
An ISO-8859-1 string.

Return Values

Returns the UTF-8 translation of string.

Changelog

Version	Description
8.2.0	This function has been deprecated.
7.2.0	This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples

Example #1 Basic example

<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab

Notes

Note: Deprecation and alternatives

This function is deprecated as of PHP 8.2.0, and will be removed in a future version. Existing uses should be checked and replaced with appropriate alternatives.

Similar functionality can be achieved with mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.

<?php
$iso8859_1_string = "\xEB"; // 'ë' (e with diaeresis) in ISO-8859-1
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$iso8859_7_string = "\xEB"; // the same string in ISO-8859-7 represents 'λ' (Greek lower-case lambda)
$utf8_string = mb_convert_encoding($iso8859_7_string, 'UTF-8', 'ISO-8859-7');
echo bin2hex($utf8_string), "\n";

$windows_1252_string = "\x80"; // '€' (Euro sign) in Windows-1252, but not in ISO-8859-1
$utf8_string = mb_convert_encoding($windows_1252_string, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

c3ab
cebb
e282ac

Other options which may be available depending on the extensions installed are UConverter::transcode() and iconv().

The following all give the same result:

<?php
$iso8859_1_string = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1

$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";

$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = UConverter::transcode($iso8859_1_string, 'UTF8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = iconv('ISO-8859-1', 'UTF-8', $iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab
5a6fc3ab
5a6fc3ab
5a6fc3ab

User Contributed Notes 24 notes

Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.
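To make the garbling concrete, here is a small sketch of what happens when text that is already UTF-8 is converted from ISO-8859-1 a second time. It assumes the iconv extension is available and uses iconv('ISO-8859-1', 'UTF-8', ...) as a stand-in for utf8_encode, since the two give the same result for this input.

```php
<?php
$utf8 = "Zo\xC3\xAB"; // 'Zoë', already encoded as UTF-8

// Mistake: treat the UTF-8 bytes as ISO-8859-1 and convert them again.
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($utf8), "\n";    // 5a6fc3ab
echo bin2hex($garbled), "\n"; // 5a6fc383c2ab, displayed as "ZoÃ«"

// Reversing the mistaken step restores the original bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $garbled);
echo bin2hex($repaired), "\n"; // 5a6fc3ab
```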

Here's some code that addresses the issue that Steven describes in the previous comment:

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */

$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
    "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
    "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
    "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
    "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
    "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
    "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
    "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
    "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
    "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
    "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
    "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
    "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
    "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
    "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
    "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
    "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
    "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
    "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
    "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
    "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
    "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
    "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
    "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
    "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
    "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
    "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
);

function cp1252_to_utf8($str)
{
    global $cp1252_map;
    // Convert as ISO-8859-1 first, then swap the C1 control characters
    // for the printable characters Windows-1252 defines at those bytes.
    return strtr(utf8_encode($str), $cp1252_map);
}

For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversion.

If the string to be converted to UTF-8 is in something other than ISO-8859-1 (such as ISO-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() instead of trying to devise complex str_replace statements.

If you haven't guessed already: when converting in the other direction (UTF-8 to ISO-8859-1, as utf8_decode does), a UTF-8 character that has no representation in the ISO-8859-1 code page is returned as a ?. You might want to wrap a function around this to make sure you aren't saving a bunch of question marks into your database.

If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:
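The function from that note is not included here. As a stand-in, the following is a minimal sketch of one way to do it, assuming the iconv extension is available; the name array_to_utf8 and its parameters are hypothetical. Only string values are converted; array keys and non-string values are left untouched.

```php
<?php
// Hypothetical helper: convert every string value in a (possibly nested) array.
function array_to_utf8(array $data, string $from = 'ISO-8859-1'): array
{
    // $data is a local copy, so modifying it by reference is safe.
    array_walk_recursive($data, function (&$value) use ($from) {
        if (is_string($value)) {
            $value = iconv($from, 'UTF-8', $value);
        }
    });
    return $data;
}

$row  = ['name' => "Zo\xEB", 'tags' => ["caf\xE9", 42]];
$utf8 = array_to_utf8($row);

echo bin2hex($utf8['name']), "\n";    // 5a6fc3ab
echo bin2hex($utf8['tags'][0]), "\n"; // 636166c3a9
var_dump($utf8['tags'][1]);           // int(42)
```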


hidehalo / unicode.php


class Unicode // class name assumed; the enclosing class is not shown in the source
{
    /**
     * Convert a single UTF-8 encoded character to its Unicode code point.
     *
     * @param string  $symbol
     * @param integer $bytes  number of bytes in the character's UTF-8 sequence
     * @return integer $ascii
     */
    public function getUnicode($symbol, $bytes = 1)
    {
        $offset = 0;
        $highChar = substr($symbol, $offset, 1);
        $ascii = ord($highChar);
        if ($bytes > 1) {
            // Keep only the payload bits of the lead byte.
            $code = ($ascii) & ((1 << (7 - $bytes)) - 1);
            for ($i = 1; $i < $bytes; $i++) {
                $char = substr($symbol, $offset + $i, 1);
                // Each continuation byte contributes its low 6 bits.
                $code = ($code << 6) | (ord($char) & 0x3f);
            }
            $ascii = $code;
        }
        return $ascii;
    }

    /**
     * Get the number of bytes in a character's UTF-8 sequence from its lead byte.
     *
     * @param string $symbol
     * @return integer $bytesNumber
     */
    public function getBytesNumber($symbol)
    {
        $ascii = ord($symbol);
        $bytesNumber = 1;
        if ($ascii > 0x7f) {
            // Valid UTF-8 uses at most 4 bytes (RFC 3629), so the high
            // nibble of the lead byte is enough to tell the length.
            switch ($ascii & 0xf0) {
                case 0xf0:
                    $bytesNumber = 4;
                    break;
                case 0xe0:
                    $bytesNumber = 3;
                    break;
                case 0xc0:
                case 0xd0:
                    $bytesNumber = 2;
                    break;
            }
        }
        return $bytesNumber;
    }
}
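As a usage sketch, the same lead-byte logic can be written as two standalone functions and checked against known code points. The function names below are hypothetical, the input is assumed to be a non-empty, well-formed UTF-8 character, and continuation bytes are not validated.

```php
<?php
// Hypothetical helper: length of a UTF-8 sequence, read from its lead byte.
function utf8_sequence_length(string $char): int
{
    $lead = ord($char[0]);
    if ($lead < 0x80)           return 1; // 0xxxxxxx
    if (($lead & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($lead & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($lead & 0xF8) === 0xF0) return 4; // 11110xxx
    throw new InvalidArgumentException('Not a valid UTF-8 lead byte');
}

// Hypothetical helper: decode the first character of $char to its code point.
function utf8_code_point(string $char): int
{
    $len = utf8_sequence_length($char);
    if ($len === 1) {
        return ord($char[0]);
    }
    // Payload bits of the lead byte, then 6 bits per continuation byte.
    $code = ord($char[0]) & ((1 << (7 - $len)) - 1);
    for ($i = 1; $i < $len; $i++) {
        $code = ($code << 6) | (ord($char[$i]) & 0x3F);
    }
    return $code;
}

printf("U+%04X\n", utf8_code_point("\u{6211}"));  // U+6211
printf("U+%04X\n", utf8_code_point("\u{010F}"));  // U+010F
printf("U+%04X\n", utf8_code_point("\u{1F600}")); // U+1F600
```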
