Encoding php xml utf 8

Contents

Converting text between UTF-8 and Windows-1251
windows-1251 to UTF-8
UTF-8 to windows-1251
When nothing else helps
file_get_contents / cURL
Character Encoding
utf8_encode
Description
Parameters
Return Values
Changelog
Examples
Notes
See Also
User Contributed Notes

Converting text between UTF-8 and Windows-1251

Encoding problems often come up when writing parsers or reading data from XML and CSV files. Below are several ways to solve them.

windows-1251 to UTF-8

// Option 1: iconv (the //IGNORE suffix only has meaning on the target encoding)
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;

// Option 2: mbstring
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;

UTF-8 to windows-1251

// Option 1: iconv; characters that Windows-1251 cannot represent are dropped
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;

// Option 2: mbstring
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;
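A parser often cannot be sure whether its input has already been converted. The following is a minimal sketch, not a definitive solution: the helper name cp1251_to_utf8() is hypothetical, and the mb_check_encoding() test is only a heuristic, since a short Windows-1251 string can happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: convert Windows-1251 text to UTF-8 unless it already is valid UTF-8.
function cp1251_to_utf8(string $text): string
{
    // Heuristic only: valid UTF-8 input is assumed to be converted already.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1251');
}

$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2"; // "Привет" in Windows-1251
$utf8   = cp1251_to_utf8($cp1251);

echo bin2hex($utf8), "\n";                 // d09fd180d0b8d0b2d0b5d182
echo bin2hex(cp1251_to_utf8($utf8)), "\n"; // same bytes: input was already UTF-8
echo bin2hex(mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8')), "\n"; // cff0e8e2e5f2
```

Printing the bytes with bin2hex() makes it possible to check a conversion without depending on how the terminal or browser renders the text.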

When nothing else helps

// Undo double encoding: Windows-1251 bytes that were misread as Windows-1252 and then saved as UTF-8
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;

Sometimes it gets absurd, but this works:

// Round trip through Windows-1251: drops every character that Windows-1251 cannot represent
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8', $text);
echo $text;
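To see why the two-step cp1252/cp1251 conversion can work, it helps to reproduce the damage first. The sketch below assumes the text was originally Windows-1251, was wrongly decoded as Windows-1252, and was then stored as UTF-8. The variable names are illustrative only.

```php
<?php
// "Привет" as correct UTF-8.
$original = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

// Simulate the damage: the Windows-1251 bytes are wrongly treated as Windows-1252
// and converted to UTF-8, which yields "Ïðèâåò".
$cp1251_bytes = iconv('UTF-8', 'CP1251', $original);
$garbled      = iconv('CP1252', 'UTF-8', $cp1251_bytes);

// Repair: reverse the wrong step, then decode the bytes as Windows-1251.
$repaired = iconv('CP1251', 'UTF-8', iconv('UTF-8', 'CP1252', $garbled));

var_dump($garbled === $original);  // bool(false)
var_dump($repaired === $original); // bool(true)
```

The repair only succeeds when the damage followed exactly this path. Text that was garbled in some other way needs a different pair of encodings.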

file_get_contents / cURL

Sometimes file_get_contents() or cURL returns mojibake (ÐÐ»Ð¼Ð°Ð·Ð½ÑÐµ Ð±Ð¾ÑÑ). The data itself is not mis-encoded: it is UTF-8 displayed as a single-byte encoding because nothing marks it as UTF-8. Prepending a BOM fixes this:

$text = file_get_contents('https://example.com');
$text = "\xEF\xBB\xBF" . $text; // prepend the UTF-8 BOM
echo $text;

In other cases file_get_contents() returns unreadable binary data. This is GZIP-compressed text: the function does not send the right request headers and does not decompress the response. A solution using cURL:

function getcontents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip'); // request GZIP and decompress the response
    // Note: the next two options disable TLS certificate checks; remove them for untrusted hosts.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

echo getcontents('https://example.com');
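When cURL is not available, a body that has already been fetched can be checked for the GZIP signature and decompressed with gzdecode() from the zlib extension. This is a minimal sketch: the helper name decode_if_gzip() is hypothetical, and the compressed body is simulated with gzencode() instead of a real HTTP request.

```php
<?php
// Hypothetical helper: decompress a response body only if it starts with the GZIP magic bytes.
function decode_if_gzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not GZIP, return unchanged
    }
    $decoded = gzdecode($body);
    return $decoded === false ? $body : $decoded;
}

// Simulated response body; a real one would come from file_get_contents($url).
$compressed = gzencode('<html>example</html>');

echo decode_if_gzip($compressed), "\n";  // <html>example</html>
echo decode_if_gzip('plain text'), "\n"; // plain text
```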


Character Encoding

PHP’s XML extension supports the Unicode character set through different character encodings. There are two types of character encodings, source encoding and target encoding. PHP’s internal representation of the document is always encoded with UTF-8.

Source encoding is done when an XML document is parsed. Upon creating an XML parser, a source encoding can be specified (this encoding cannot be changed later in the XML parser’s lifetime). The supported source encodings are ISO-8859-1, US-ASCII and UTF-8. The former two are single-byte encodings, which means that each character is represented by a single byte. UTF-8 can encode characters composed of a variable number of bits (up to 21) in one to four bytes. The default source encoding used by PHP is ISO-8859-1.

Target encoding is done when PHP passes data to XML handler functions. When an XML parser is created, the target encoding is set to the same as the source encoding, but this may be changed at any point. The target encoding will affect character data as well as tag names and processing instruction targets.

If the XML parser encounters characters outside the range that its source encoding is capable of representing, it will return an error.

If PHP encounters characters in the parsed XML document that cannot be represented in the chosen target encoding, the problem characters will be "demoted". Currently, this means that such characters are replaced by a question mark.
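The difference between source and target encoding can be observed with a short script. The following is a sketch that assumes the XML extension is installed. The document and the variable names are made up for illustration.

```php
<?php
// Source encoding is fixed when the parser is created.
$parser = xml_parser_create('UTF-8');

// Target encoding can be changed afterwards; handlers now receive ISO-8859-1.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1');

$received = '';
xml_set_character_data_handler($parser, function ($parser, string $data) use (&$received): void {
    $received .= $data; // the handler may be called several times per text node
});

// "Zoë €" in UTF-8: 'ë' exists in ISO-8859-1, '€' does not.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Zo\xC3\xAB \xE2\x82\xAC</name>";
xml_parse($parser, $xml, true);

echo bin2hex($received), "\n"; // 5a6feb203f: "Zo", 0xEB for 'ë', a space, and "?" for '€'
```

The euro sign has no ISO-8859-1 code point, so it arrives in the handler as a question mark, while 'ë' is delivered as the single byte 0xEB.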


utf8_encode

This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

utf8_encode(string $string): string

This function converts the string string from the ISO-8859-1 encoding to UTF-8.

Note:

This function does not attempt to guess the current encoding of the provided string; it assumes the string is encoded as ISO-8859-1 (also known as "Latin 1") and converts it to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this never results in an error, but it will not result in a useful string if a different encoding was intended.

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers interpret ISO-8859-1 pages as Windows-1252. Windows-1252, however, contains additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), in place of certain ISO-8859-1 control codes. This function does not convert such Windows-1252 characters correctly. Use a different function if conversion from Windows-1252 is required.

Parameters

string
An ISO-8859-1 string.

Return Values

Returns the string string converted to UTF-8.

Changelog

Version	Description
8.2.0	This function has been deprecated.
7.2.0	This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples

Example #1 Basic example

<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab

Notes

Note: Deprecation and alternatives

This function is deprecated as of PHP 8.2.0 and will be removed in a future version. Existing uses should be checked and replaced with appropriate alternatives.

Similar functionality can be achieved with mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.

<?php
$iso8859_1_string = "\xEB"; // 'ë' (e with diaeresis) in ISO-8859-1
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$iso8859_7_string = "\xEB"; // the same byte in ISO-8859-7 represents 'λ' (Greek lower-case lambda)
$utf8_string = mb_convert_encoding($iso8859_7_string, 'UTF-8', 'ISO-8859-7');
echo bin2hex($utf8_string), "\n";

$windows_1252_string = "\x80"; // '€' (Euro sign) in Windows-1252, but not in ISO-8859-1
$utf8_string = mb_convert_encoding($windows_1252_string, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

c3ab
cebb
e282ac

Other options, which may be available depending on the installed extensions, are UConverter::transcode() and iconv().

The following all give the same result:

<?php
$iso8859_1_string = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1

$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";

$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = UConverter::transcode($iso8859_1_string, 'UTF8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = iconv('ISO-8859-1', 'UTF-8', $iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab
5a6fc3ab
5a6fc3ab
5a6fc3ab

See Also

utf8_decode() - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters
mb_convert_encoding() - Converts a string from one character encoding to another
UConverter::transcode() - Converts a string from one character encoding to another
iconv() - Converts a string from one character encoding to another

User Contributed Notes

Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.

Here’s some code that addresses the issue that Steven describes in the previous comment:

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */

$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
    "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
    "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
    "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
    "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
    "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
    "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
    "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
    "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
    "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
    "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
    "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
    "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
    "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
    "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
    "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
    "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
    "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
    "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
    "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
    "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
    "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
    "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
    "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
    "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
    "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
    "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
);

function cp1252_to_utf8($str)
{
    global $cp1252_map;
    return strtr(utf8_encode($str), $cp1252_map);
}
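On installations with the mbstring extension, a comparable result for the printable Windows-1252 characters can be obtained without the deprecated utf8_encode(). This is only a sketch for comparison: bytes that Windows-1252 leaves undefined may be handled differently from the map above.

```php
<?php
// "€", a space, then "Hi" in curly double quotes, as Windows-1252 bytes.
$cp1252 = "\x80 \x93Hi\x94";

$utf8 = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n"; // e282ac20e2809c4869e2809d
```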

For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.

If the string to be converted to UTF-8 is in something other than ISO-8859-1 (such as ISO-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.

If you haven’t guessed already: if a UTF-8 character has no representation in the ISO-8859-1 codepage, a ? will be returned. You might want to wrap a function around this to make sure you aren’t saving a bunch of ? characters into your database.
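One way to build such a wrapper is to convert the string, convert it back, and compare the result with the input. The sketch below is an assumption about how this could be done, not the note author's code: the function name utf8_to_latin1_strict() is hypothetical, and it uses mb_convert_encoding() in place of the deprecated utf8_decode().

```php
<?php
// Hypothetical wrapper: refuse to convert when ISO-8859-1 cannot hold every character.
function utf8_to_latin1_strict(string $utf8): string
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    // A lossless conversion must survive the trip back to UTF-8 unchanged.
    if (mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') !== $utf8) {
        throw new InvalidArgumentException('String contains characters outside ISO-8859-1');
    }
    return $latin1;
}

echo bin2hex(utf8_to_latin1_strict("Zo\xC3\xAB")), "\n"; // 5a6feb

try {
    utf8_to_latin1_strict("\xE2\x82\xAC"); // the euro sign is not in ISO-8859-1
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

A literal question mark in the input passes the check, because it survives the round trip unchanged. Only characters that were actually replaced trigger the exception.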

If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:
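A minimal sketch of such a helper follows. The name array_to_utf8() and the $from parameter are hypothetical, the input is assumed to be ISO-8859-1 by default, and mb_convert_encoding() is used in place of the deprecated utf8_encode().

```php
<?php
// Hypothetical helper: convert every string value in a (possibly nested) array to UTF-8.
function array_to_utf8(array $values, string $from = 'ISO-8859-1'): array
{
    foreach ($values as $key => $value) {
        if (is_array($value)) {
            $values[$key] = array_to_utf8($value, $from); // recurse into nested arrays
        } elseif (is_string($value)) {
            $values[$key] = mb_convert_encoding($value, 'UTF-8', $from);
        }
        // Non-string values (integers, null, objects) are left untouched.
    }
    return $values;
}

$result = array_to_utf8(['name' => "Zo\xEB", 'tags' => ["\xEB", 42]]);
echo bin2hex($result['name']), "\n";    // 5a6fc3ab
echo bin2hex($result['tags'][0]), "\n"; // c3ab
```

Array keys are not converted in this sketch. If keys can contain non-ASCII text, they would need the same treatment.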
