UTF-8 decoding in PHP

utf8_decode

utf8_decode - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters.

Description

utf8_decode(string $string): string

This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, code points above U+00FF), are replaced with ?.

Note:

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), in place of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.
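As a rough illustration of the note above, the sketch below converts the Euro sign to both target encodings with mb_convert_encoding. It assumes the mbstring extension is installed and left at its default settings; the variable names are illustrative only.

```php
<?php
// The Euro sign encoded as UTF-8 (three bytes).
$euro = "\xE2\x82\xAC";

// ISO-8859-1 has no Euro sign, so mbstring falls back to its default
// substitute character, "?" (byte 0x3F).
$latin1 = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');

// Windows-1252 places the Euro sign at byte 0x80.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');

echo bin2hex($latin1), "\n"; // 3f
echo bin2hex($cp1252), "\n"; // 80
```

If the consumer of the text really expects Windows-1252, naming that encoding explicitly preserves the character instead of degrading it to a question mark.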

Parameters

string
    A UTF-8 encoded string.

Return Values

Returns the ISO-8859-1 translation of string.

Changelog

Version Description
7.2.0 This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples

Example #1 Basic examples

<?php
// Convert the string 'Zoë' from UTF-8 to ISO 8859-1
$utf8_string = "\x5A\x6F\xC3\xAB";
$iso8859_1_string = utf8_decode($utf8_string);
echo bin2hex($iso8859_1_string), "\n";

// Invalid UTF-8 sequences are replaced with '?'
$invalid_utf8_string = "\xC3";
$iso8859_1_string = utf8_decode($invalid_utf8_string);
var_dump($iso8859_1_string);

// Characters which don't exist in ISO 8859-1, such as
// '€' (Euro Sign) are also replaced with '?'
$utf8_string = "\xE2\x82\xAC";
$iso8859_1_string = utf8_decode($utf8_string);
var_dump($iso8859_1_string);
?>

The above example will output:

5a6feb
string(1) "?"
string(1) "?"

See Also

  • utf8_encode() - Converts a string from ISO-8859-1 to UTF-8
  • mb_convert_encoding() - Converts a string from one character encoding to another
  • UConverter::transcode() - Converts a string from one character encoding to another
  • iconv() - Converts a string from one character encoding to another

PHP 8.2: utf8_encode and utf8_decode functions deprecated

Despite their names, the utf8_encode and utf8_decode functions only convert strings between the ISO-8859-1 (also known as "Latin 1") and UTF-8 encodings. They do not attempt to detect the actual character encoding of a given text, and always convert between ISO-8859-1 and UTF-8, even if the source text is not encoded in ISO-8859-1.

Although PHP includes the utf8_encode and utf8_decode functions in its standard library, they cannot be used to detect other character encodings such as Windows-1252, UTF-16, and UTF-32, or to convert them to UTF-8. Passing arbitrary text to utf8_encode is prone to bugs that raise no warnings or errors but silently produce wrong results.

Some frequent examples of bugs include:

  • The Euro sign (€, UTF-8 byte sequence \xE2\x82\xAC), when passed to utf8_encode as utf8_encode("€"), results in the garbled text (also called "Mojibake") â¬, with an invisible U+0082 control character between the two visible characters.
  • The German Eszett character (ß, UTF-8 byte sequence \xC3\x9F), when passed through utf8_encode("ß"), results in Ã followed by an invisible U+009F control character.

Neither of the examples above emits a warning or an error, although the resulting text is wrong.
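The two cases above can be reproduced without calling the deprecated function. The sketch below assumes the mbstring extension is available; treat_as_latin1 is a hypothetical helper name that mirrors what utf8_encode does, namely treating every input byte as one ISO-8859-1 character.

```php
<?php
// Hypothetical helper: same conversion as utf8_encode(), done with mbstring.
function treat_as_latin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$euro   = "\xE2\x82\xAC"; // "€" that is already encoded as UTF-8
$eszett = "\xC3\x9F";     // "ß" that is already encoded as UTF-8

// Every input byte becomes its own character, so the result is longer
// than the input and no longer decodes to the original text.
echo bin2hex(treat_as_latin1($euro)), "\n";   // c3a2c282c2ac
echo bin2hex(treat_as_latin1($eszett)), "\n"; // c383c29f
```

Dumping the bytes with bin2hex makes the damage visible even when a terminal hides the control characters.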

Because of the misleading function names, lack of error messages and warnings, and the lack of support for character encodings other than ISO-8859-1, utf8_encode and utf8_decode functions are deprecated in PHP 8.2.

Calling utf8_encode or utf8_decode emits a deprecation notice in PHP 8.2, and the functions will be removed in PHP 9.0.

utf8_encode('foo');
utf8_decode('foo');

Deprecated: Function utf8_encode() is deprecated in ... on line ...
Deprecated: Function utf8_decode() is deprecated in ... on line ...
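One possible way to find the affected call sites is to collect deprecation notices while a test suite runs. The following is only a sketch: the $deprecations array is a hypothetical collection point, and the notice is simulated with trigger_error so that the snippet behaves the same on any PHP version.

```php
<?php
// Collected "file:line message" entries (hypothetical storage for this sketch).
$deprecations = [];

// Register a handler for engine-level and user-level deprecations only.
set_error_handler(
    function (int $errno, string $errstr, string $errfile, int $errline) use (&$deprecations): bool {
        $deprecations[] = sprintf('%s:%d %s', $errfile, $errline, $errstr);
        return true; // mark the notice as handled
    },
    E_DEPRECATED | E_USER_DEPRECATED
);

// Simulated notice; on PHP 8.2+ a real utf8_encode() call raises E_DEPRECATED.
trigger_error('Function utf8_encode() is deprecated (simulated)', E_USER_DEPRECATED);

restore_error_handler();

print_r($deprecations);
```

Each entry records the file and line that raised the notice, which gives a list of places to review.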

Replacements for the deprecated functions

The utf8_encode function converts an ISO-8859-1 encoded string to UTF-8. Most utf8_encode calls in legacy PHP applications use the function as an extra safeguard against potentially malformed text, but as the examples above show, it often produces undesired output rather than fixing anything.

Similarly, calling utf8_decode on a string converts it to the ISO-8859-1 character encoding. The majority of web applications, web sites, and text formats in fact expect UTF-8 encoded text, not ISO-8859-1.

It is worth reevaluating the need for each utf8_encode and utf8_decode call before replacing it, because more often than not these calls are unnecessary and only cause undesired results.
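Where a call turns out to be a "just in case" safeguard, one option is to convert only input that is not already valid UTF-8. This is a minimal sketch that assumes the mbstring extension is installed and that any non-UTF-8 input really is ISO-8859-1; latin1_to_utf8_if_needed is a hypothetical name.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 untouched and converts the rest,
// assuming (not detecting) that the rest is ISO-8859-1.
function latin1_to_utf8_if_needed(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8; converting again would corrupt it
    }

    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(latin1_to_utf8_if_needed("Zo\xC3\xAB")), "\n"; // 5a6fc3ab (unchanged)
echo bin2hex(latin1_to_utf8_if_needed("Zo\xEB")), "\n";     // 5a6fc3ab (converted)
```

The check is a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the real source encoding remains the more reliable fix.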

PHP does not bundle multi-byte character encoding functions in its core, but the mbstring, intl, and iconv extensions provide robust and accurate functionality to detect and convert character encodings. Both mbstring and iconv are core extensions; mbstring in particular is widely used in modern PHP applications and can be polyfilled as well.
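Because the available extensions differ between installations, a converter can pick whichever one is loaded at run time. The sketch below is one possible arrangement, not a recommended order of preference; latin1_to_utf8 is a hypothetical function name.

```php
<?php
// Hypothetical dispatcher: converts ISO-8859-1 to UTF-8 with the first
// available extension (mbstring, then iconv, then intl).
function latin1_to_utf8(string $s): string
{
    if (function_exists('mb_convert_encoding')) {
        return (string) mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
    }

    if (function_exists('iconv')) {
        $out = iconv('ISO-8859-1', 'UTF-8', $s);
        if ($out !== false) {
            return $out;
        }
    }

    if (class_exists('UConverter')) {
        $out = UConverter::transcode($s, 'UTF-8', 'ISO-8859-1');
        if ($out !== false) {
            return $out;
        }
    }

    throw new RuntimeException('No extension available to convert ISO-8859-1 to UTF-8.');
}

echo bin2hex(latin1_to_utf8("Zo\xEB")), "\n"; // 5a6fc3ab
```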

Replacements for utf8_encode

If the actual use case of an existing utf8_encode call is to convert a known ISO-8859-1 string to UTF-8, the iconv, intl, or mbstring extensions can perform the conversion properly. Alternatively, the conversion can be done directly in user-land PHP, albeit with a small performance penalty.

When the intent of a utf8_encode call is to automatically detect the character encoding and convert it to UTF-8, even though the function never detected character encodings in the first place, the replacement is to detect the character encoding first and then convert it to UTF-8.

Approach | ISO-8859-1 to UTF-8 | Any encoding to UTF-8
PHP Standard Functions | ISO-8859-1 to UTF-8 using Standard PHP Functions | N/A
With mbstring | ISO-8859-1 to UTF-8 using mbstring | Any encoding to UTF-8 using mbstring
With intl | ISO-8859-1 to UTF-8 using intl | N/A
With iconv | ISO-8859-1 to UTF-8 using iconv | N/A

ISO-8859-1 to UTF-8 using Standard PHP Functions

The symfony/polyfill-php72 library provides a PHP function that mimics the utf8_encode functionality using standard PHP functions. For better readability and to convey the meaning of the function, it is renamed to iso8859_1_to_utf8 in the example below.

function iso8859_1_to_utf8(string $s): string {
    // Double the buffer: the second half holds the input, the first half
    // receives the (possibly longer) output.
    $s .= $s;
    $len = \strlen($s);

    for ($i = $len >> 1, $j = 0; $i < $len; ++$i, ++$j) {
        switch (true) {
            case $s[$i] < "\x80": $s[$j] = $s[$i]; break;
            case $s[$i] < "\xC0": $s[$j] = "\xC2"; $s[++$j] = $s[$i]; break;
            default: $s[$j] = "\xC3"; $s[++$j] = \chr(\ord($s[$i]) - 64); break;
        }
    }

    return substr($s, 0, $j);
}

With the function above declared in application code, it is now possible to replace all utf8_encode calls with the new iso8859_1_to_utf8 function to avoid the deprecation notice:

- utf8_encode($string);
+ iso8859_1_to_utf8($string);
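The three cases in the polyfill follow from the UTF-8 bit layout. As a sketch of the arithmetic for a single byte (latin1_byte_to_utf8 is a hypothetical helper and not part of the polyfill):

```php
<?php
// Hypothetical helper: encodes ONE ISO-8859-1 byte as UTF-8.
function latin1_byte_to_utf8(string $byte): string
{
    // In ISO-8859-1 the byte value equals the Unicode code point.
    $code = ord($byte);

    if ($code < 0x80) {
        return $byte; // ASCII range: identical in both encodings
    }

    // Two-byte UTF-8 form: 110xxxxx 10xxxxxx.
    // The lead byte is 0xC2 for 0x80-0xBF and 0xC3 for 0xC0-0xFF.
    return chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
}

echo bin2hex(latin1_byte_to_utf8("\xA9")), "\n"; // c2a9
echo bin2hex(latin1_byte_to_utf8("\xEB")), "\n"; // c3ab
```

For bytes from 0xC0 upward, 0x80 | ($code & 0x3F) equals the byte value minus 64, which is the subtraction that appears in the polyfill's default branch.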

ISO-8859-1 to UTF-8 using mbstring

The mbstring extension, one of the most widely used optional PHP extensions, provides a cleaner and more straightforward way to convert ISO-8859-1 encoded strings to UTF-8. It can replace the utf8_encode function deprecated in PHP 8.2.

- utf8_encode($string);
+ mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

Any encoding to UTF-8 using mbstring

When the actual character encoding of the input text is unknown, forcing PHP to detect it may lead to erroneous results. However, it is possible to make a reasonable guess at the source character encoding and convert it to UTF-8 using the mbstring extension.

- utf8_encode($string);
+ mb_convert_encoding($string, 'UTF-8', mb_list_encodings());
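A narrower variation on the same idea is to detect against a short list of encodings the application can plausibly receive, and to fail loudly when none of them fits. The sketch below assumes the mbstring extension; the function name and the default candidate list are hypothetical and should be adapted.

```php
<?php
// Hypothetical helper: guesses the source encoding from a short candidate
// list, then converts to UTF-8.
function guess_and_convert_to_utf8(string $s, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument) rejects encodings in which $s is invalid.
    $detected = mb_detect_encoding($s, $candidates, true);

    if ($detected === false) {
        throw new InvalidArgumentException('Could not detect the input encoding.');
    }

    return mb_convert_encoding($s, 'UTF-8', $detected);
}

// "\xEB" alone is not valid UTF-8, so the input is treated as ISO-8859-1.
echo bin2hex(guess_and_convert_to_utf8("Zo\xEB")), "\n"; // 5a6fc3ab
```

Detection remains a heuristic, and the result for ambiguous input may differ between PHP versions, so a known source encoding should always take priority over a guess.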

ISO-8859-1 to UTF-8 using intl

The UConverter class in the intl extension also provides a way to convert text from one character encoding to another. Its signature is similar to that of its mbstring counterpart. Using UConverter::transcode, it is possible to replicate the utf8_encode functionality:

- utf8_encode($string);
+ UConverter::transcode($string, 'UTF8', 'ISO-8859-1');

ISO-8859-1 to UTF-8 using iconv

Applications that can use the iconv extension can replace utf8_encode with the iconv function:

- utf8_encode($string);
+ iconv('ISO-8859-1', 'UTF-8', $string);

Replacements for utf8_decode

The utf8_decode function converts a UTF-8 encoded string to ISO-8859-1. With utf8_decode deprecated, the same functionality can be replicated using PHP standard functions, the mbstring extension, the intl extension, or the iconv extension.

Approach | UTF-8 to ISO-8859-1
PHP Standard Functions | UTF-8 to ISO-8859-1 using Standard PHP Functions
With mbstring | UTF-8 to ISO-8859-1 using mbstring
With intl | UTF-8 to ISO-8859-1 using intl
With iconv | UTF-8 to ISO-8859-1 using iconv

UTF-8 to ISO-8859-1 using Standard PHP Functions

Similar to the utf8_encode polyfill, the symfony/polyfill-php72 library provides a PHP function that mimics the utf8_decode functionality:

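function utf8_to_iso8859_1(string $string): string {
    $s = (string) $string;
    $len = \strlen($s);

    for ($i = 0, $j = 0; $i < $len; ++$i, ++$j) {
        switch ($s[$i] & "\xF0") {
            case "\xC0":
            case "\xD0":
                // Two-byte sequence: keep it only if it fits in one byte.
                $c = (\ord($s[$i] & "\x1F") << 6) | \ord($s[++$i] & "\x3F");
                $s[$j] = $c < 256 ? \chr($c) : '?';
                break;
            case "\xF0":
                ++$i;
                // no break
            case "\xE0":
                // Three- and four-byte sequences never fit in ISO-8859-1.
                $s[$j] = '?';
                $i += 2;
                break;
            default:
                $s[$j] = $s[$i];
        }
    }

    return substr($s, 0, $j);
}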

With the function above included, it is now possible to replace utf8_decode calls with the new utf8_to_iso8859_1 function:

- utf8_decode($string);
+ utf8_to_iso8859_1($string);

UTF-8 to ISO-8859-1 using mbstring

Using mbstring , the following example replaces the deprecated utf8_decode function with mb_convert_encoding :

- utf8_decode($string);
+ mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');
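utf8_decode replaces unrepresentable characters with ?, whereas mbstring uses whatever substitute character is currently configured. The sketch below pins the substitute to ? for the duration of one call and then restores the previous setting; utf8_to_latin1 is a hypothetical name and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: UTF-8 to ISO-8859-1 with "?" as the substitute,
// independent of the application's current mbstring setting.
function utf8_to_latin1(string $s): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character(0x3F);         // 0x3F is "?"

    try {
        return (string) mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    } finally {
        mb_substitute_character($previous); // always restore the setting
    }
}

echo bin2hex(utf8_to_latin1("Zo\xC3\xAB")), "\n"; // 5a6feb
var_dump(utf8_to_latin1("\xE2\x82\xAC"));         // string(1) "?"
```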

UTF-8 to ISO-8859-1 using intl

With the help of UConverter::transcode in the intl extension, the following example shows a utf8_decode replacement:

- utf8_decode($string);
+ UConverter::transcode($string, 'ISO-8859-1', 'UTF8', ['to_subst' => '?']);

UTF-8 to ISO-8859-1 using iconv

The iconv function can also mimic and replace utf8_decode, avoiding the deprecation in PHP 8.2:

- utf8_decode($string);
+ iconv('UTF-8', 'ISO-8859-1', $string);
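Unlike utf8_decode, iconv does not necessarily substitute ? for characters that ISO-8859-1 cannot represent. On glibc-based systems it raises a notice and returns false instead, and other iconv implementations may behave differently. A defensive wrapper might look like the sketch below; utf8_to_latin1_iconv is a hypothetical name.

```php
<?php
// Hypothetical wrapper: turns an iconv failure into an exception.
function utf8_to_latin1_iconv(string $s): string
{
    // "@" silences the notice iconv raises on failure; the false return
    // value is reported through the exception instead.
    $out = @iconv('UTF-8', 'ISO-8859-1', $s);

    if ($out === false) {
        throw new UnexpectedValueException('Input cannot be converted to ISO-8859-1.');
    }

    return $out;
}

echo bin2hex(utf8_to_latin1_iconv("Zo\xC3\xAB")), "\n"; // 5a6feb

try {
    // The Euro sign has no ISO-8859-1 equivalent.
    echo bin2hex(utf8_to_latin1_iconv("\xE2\x82\xAC")), "\n";
} catch (UnexpectedValueException $e) {
    echo 'Conversion failed: ', $e->getMessage(), "\n";
}
```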

Backwards Compatibility Impact

The utf8_encode and utf8_decode functions are sometimes used in legacy PHP applications and in applications that process incoming data and files with various character encodings. They are deprecated in PHP 8.2 and will be removed in PHP 9.0 because they are misleadingly named and prone to unexpected and undesired results that emit no warnings or errors.

In PHP 8.2 and later, every call to these functions emits a deprecation notice.

Both functions are to be removed in PHP 9.0.

A large number of applications that use these functions use them without being aware that they only work with ISO-8859-1 character encoding and nothing else for the source character encoding. It is possible that the ideal fix for the deprecation is to see why these functions are used in the first place, and determine if they are absolutely necessary.

Depending on the availability of PHP extensions and the willingness to use a somewhat slower PHP implementation, it is possible to replace utf8_encode and utf8_decode function calls.


Converting text between UTF-8 and WINDOWS-1251

Encoding problems often come up when writing parsers and when reading data from XML and CSV files. Below are some ways to solve them.

windows-1251 to UTF-8

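// With iconv (the //IGNORE suffix only has an effect on the target encoding):
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;

// Or, alternatively, with mbstring (use one of the two, not both):
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;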

UTF-8 to windows-1251

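// With iconv (characters missing from windows-1251 are dropped):
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;

// Or, alternatively, with mbstring (use one of the two, not both):
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;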

When nothing else helps

// Undo double encoding: text that was windows-1251, got misread as
// windows-1252, and was then saved as UTF-8.
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;
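To see why this two-step conversion can help, the sketch below first produces that kind of mojibake on purpose and then reverses it. It uses mbstring rather than iconv (an assumption made for portability), and the sample word is arbitrary.

```php
<?php
$original = 'Привет'; // UTF-8 source text

// Step 1: the text as it would be stored in windows-1251.
$cp1251Bytes = mb_convert_encoding($original, 'Windows-1251', 'UTF-8');

// Step 2: the damage. The bytes are misread as windows-1252 and re-encoded.
$mojibake = mb_convert_encoding($cp1251Bytes, 'UTF-8', 'Windows-1252');

// Step 3: the repair. Go back to the raw bytes, then decode them correctly.
$rawBytes = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
$repaired = mb_convert_encoding($rawBytes, 'UTF-8', 'Windows-1251');

echo $mojibake, "\n"; // Ïðèâåò
echo $repaired, "\n"; // Привет
```

The repair only works when the damage followed exactly this path; applied to healthy text, it would corrupt it.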

Sometimes it gets absurd, but this works:

// Round trip through windows-1251: drops every character that
// windows-1251 cannot represent.
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;

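file_get_contents / cURL

There are cases where text fetched with file_get_contents() or cURL shows up as mojibake instead of readable characters. The cause here is not the encoding of the text itself but the missing BOM: without it, or another charset declaration, the viewer guesses the wrong encoding.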

$text = file_get_contents('https://example.com');
// Prepend the UTF-8 byte order mark (BOM).
$text = "\xEF\xBB\xBF" . $text;
echo $text;
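Prepending the mark unconditionally can leave two BOMs when the source already has one. As a small sketch (the constant and function names are hypothetical), the mark can be added or removed only when needed:

```php
<?php
// UTF-8 byte order mark.
const UTF8_BOM = "\xEF\xBB\xBF";

// Adds the BOM unless the text already starts with one.
function with_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? $text : UTF8_BOM . $text;
}

// Removes a leading BOM, for consumers that cannot handle it.
function without_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? substr($text, 3) : $text;
}

echo bin2hex(with_utf8_bom('abc')), "\n";                // efbbbf616263
echo bin2hex(with_utf8_bom(with_utf8_bom('abc'))), "\n"; // efbbbf616263
```

When the text is sent to a browser, declaring the charset with header('Content-Type: text/html; charset=utf-8') is a common alternative to a BOM.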

There are also cases where file_get_contents() returns what looks like binary garbage. This is GZIP-compressed text, returned because the function does not send the proper request headers. A solution using cURL:

function getcontents($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Request gzip and let cURL decompress the response transparently.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    // Note: the next two options disable TLS certificate verification.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

echo getcontents('https://example.com');
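When switching to cURL is not an option, a body that has already been fetched can be decompressed after the fact. The sketch below assumes the zlib extension (gzencode, gzdecode) is available; maybe_gunzip is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decompresses the body only if it looks like gzip.
function maybe_gunzip(string $body): string
{
    // Gzip streams start with the magic bytes 0x1F 0x8B.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }

    // "@" silences the warning gzdecode raises on corrupt data.
    $decoded = @gzdecode($body);

    // Fall back to the original bytes if decompression fails.
    return $decoded === false ? $body : $decoded;
}

echo maybe_gunzip(gzencode('hello')), "\n"; // hello
echo maybe_gunzip('plain text'), "\n";      // plain text
```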
