Utf8 to ascii php

Contents

gjuric/utf8_to_ascii README
Converting UTF-8 to ASCII
Solution
Other solutions
Converting UTF-8 strings to ASCII using the ICU Transliterator
The desired result
The obvious choice: iconv
Transliteration
International Components for Unicode (ICU)
Using the ICU Transliterator

gjuric/utf8_to_ascii

README

UTF8 TO ASCII

US-ASCII transliterations of Unicode text

Ported from Sean M. Burke's Text::Unidecode Perl module
http://search.cpan.org/~sburke/Text-Unidecode-0.04/
http://interglacial.com/~sburke/

Use is simple;

Some notes;

- Make sure the input you provide is well-formed UTF-8!
  http://phputf8.sourceforge.net/#UTF_8_Validation_and_Cleaning

- For European languages, it should replace Unicode characters with corresponding ASCII characters and produce a readable result. For other languages, the results will be less meaningful - it's a "dumb" character-for-character replacement. True transliteration is a little more complex than this; see: http://en.wikipedia.org/wiki/Transliteration

- For any characters for which there's no replacement character available, a (default) '?' will be inserted. The second argument can be used to define an alternative replacement char

- Don't panic about all the files in the db subdirectory - they are not all loaded at once - in fact they are only loaded if they are needed to convert a given character (i.e. which files get loaded depends on the input)

For a little more see; http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/
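The notes above describe a per-character table lookup whose tables are loaded only when the input needs them. The following is a rough sketch of that idea, not the library's actual code: the function names `sketch_load_bank` and `sketch_to_ascii`, the grouping of code points into banks of 256, and the tiny inline tables are all assumptions made for illustration, and the mbstring extension is assumed to be available for `mb_ord`.

```php
<?php
/**
 * Hypothetical stand-in for loading one table file: returns the replacement
 * table for a bank of 256 code points and caches it after the first request.
 */
function sketch_load_bank(int $bank): array
{
    static $cache = [];
    if (!isset($cache[$bank])) {
        // Tiny inline tables used only for this sketch (code point => ASCII text).
        $tables = [
            0x00 => [0xE7 => 'c', 0xE9 => 'e', 0xEB => 'e', 0xF8 => 'o'],
            0x20 => [0x2019 => "'"],
        ];
        $cache[$bank] = $tables[$bank] ?? [];
    }
    return $cache[$bank];
}

/**
 * Replaces every non-ASCII character with its table entry, or with $unknown
 * when no entry exists. The input is assumed to be well-formed UTF-8.
 */
function sketch_to_ascii(string $str, string $unknown = '?'): string
{
    $out = '';
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        if ($codePoint < 0x80) {
            $out .= $char; // already ASCII, copy as-is
            continue;
        }
        // Only the bank that contains this code point is looked up.
        $table = sketch_load_bank($codePoint >> 8);
        $out .= $table[$codePoint] ?? $unknown;
    }
    return $out;
}

echo sketch_to_ascii("Fran\u{E7}ois"), PHP_EOL;
echo sketch_to_ascii("10 \u{20AC}", '_'), PHP_EOL;
```

Because the lookup is keyed on the code point alone, the result is only as good as the tables, which is the limitation the README points out for non-European scripts.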

Converting UTF-8 to ASCII

%D8%A8%D8%B2%D8%B1%DA%AF-%D8%AA%D8%B1%DB%8C%D9%86-%D9%88%D8%B1%D8%B2%D8%B4%DA%A9%D8%A7%D8%B1%D8%A7%D9%86-%D8%AA%D8%A7%D8%B1%DB%8C%D8%AE-%D8%A7%D9%84%D9%85%D9%BE%DB%8C%DA%A9%D8%AA%D8%B5%D8%A7%D9%88%DB%8C%D8%B1

I am really confused. Does anyone know what the problem is?

UPDATE: I also tried iconv:

echo iconv("UTF-8", "ASCII", $str), PHP_EOL;

Notice: iconv(): Detected an illegal character in input string

Solution

%D8 is not an ASCII encoding of anything. ASCII has 128 characters (codes 0 to 127), or 256 if you use one of the 8-bit "extended" variants (see http://www.asciitable.com/).

So special characters such as Ø have no equivalent. mb_convert_encoding handles this by replacing them with ?, whereas iconv raises an error.
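The difference between the two functions can be seen with a single character. This is a small sketch, assuming the mbstring extension is loaded; the sample character and variable names are chosen for illustration. The iconv result is printed but not relied on, because it depends on the iconv implementation PHP was built against.

```php
<?php
// Sample input for illustration: "Ø" (U+00D8) has no ASCII equivalent.
$input = "\u{00D8}";

// Pin the substitution character to '?' (0x3F) so the result does not
// depend on local ini settings.
mb_substitute_character(0x3F);
$viaMbstring = mb_convert_encoding($input, 'ASCII', 'UTF-8');
var_dump($viaMbstring);

// Plain iconv typically emits a notice and returns false for this input;
// the @ operator only hides the notice in this demonstration.
$viaIconv = function_exists('iconv') ? @iconv('UTF-8', 'ASCII', $input) : null;
var_dump($viaIconv);
```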

The output you are looking for looks more like URL encoding.
Try this:
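The answer's own snippet is not preserved in this copy. As a minimal sketch of what URL encoding does to UTF-8 text, the example below percent-encodes the bytes of a short Persian word with `urlencode`; the sample word and variable names are assumptions chosen because the word's bytes match the first segment of the expected output.

```php
<?php
// Sample input for illustration: the Persian word "بزرگ", written with
// escapes so the example does not depend on the file's encoding.
$str = "\u{0628}\u{0632}\u{0631}\u{06AF}";

// urlencode() works on bytes: each non-ASCII byte becomes %XX.
$encoded = urlencode($str);
echo $encoded, PHP_EOL;
```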

Other solutions

In my opinion, the problem in this case is that the input string is wrong, and no conversion between ASCII and UTF-8 is needed.

$out = '%D8%A8%D8%B2%D8%B1%DA%AF-%D8%AA%D8%B1%DB%8C%D9%86-%D9%88%D8%B1%D8%B2%D8%B4%DA%A9%D8%A7%D8%B1%D8%A7%D9%86-%D8%AA%D8%A7%D8%B1%DB%8C%D8%AE-%D8%A7%D9%84%D9%85%D9%BE%DB%8C%DA%A9%D8%AA%D8%B5%D8%A7%D9%88%DB%8C%D8%B1';

When we try to detect the encoding of this string with

echo mb_detect_encoding($out);

then we can see that it is, of course, ASCII. But this string obviously looks like the result of the urlencode function. Let's use the urldecode function to check what the encoding of the decoded value is

$decoded = urldecode($out); echo mb_detect_encoding($decoded);

The output shows that $decoded is UTF-8, so trying to run this code from the question

$str = "Ø§ÙˆÙ‚Ø§Øª-Ø´Ø±Ø¹ÛŒ-Ø¬Ù…Ø¹Ù‡-8-Ù…Ø±Ø¯Ø§Ø¯-Ù…Ø§Ù‡-Ø¨Ù‡-Ø§ÙÙ‚-Ø§Ø±Ø¯Ø¨ÛŒÙ„";
echo mb_convert_encoding($str, "ASCII");

makes no sense, because this text has no ASCII representation.

I was also curious what the encoding of $str from the question is, so I prepared something like this to find out whether I could get the $str value from the $decoded value

foreach (mb_list_encodings() as $chr)

I was surprised that I did not find any encoding that gives me something similar to the $str value. I tried going further and checking conversions, as in this code

foreach (mb_list_encodings() as $chr) {
    foreach (mb_list_encodings() as $chr2) {
        $test = mb_convert_encoding($decoded, $chr, $chr2);
    }
}

and I finally found that some values are similar, but not equal. I did the same with the original $str, but also without success (I did not get the requested output from the question).

foreach (mb_list_encodings() as $chr) {
    foreach (mb_list_encodings() as $chr2) {
        // try with and without urlencode
        $test = urlencode(mb_convert_encoding($str, $chr, $chr2));
    }
}

Of course, when we do this

$newOutput = urlencode($decoded);

then we get the $out value.
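The whole round trip can be checked in a few lines. This sketch assumes the mbstring extension is loaded and uses only the first two segments of the string from the answer to keep it short; `mb_check_encoding` is used here instead of `mb_detect_encoding` because it gives a plain yes/no answer.

```php
<?php
// Shortened sample taken from the start of the $out value above.
$out = '%D8%A8%D8%B2%D8%B1%DA%AF-%D8%AA%D8%B1%DB%8C%D9%86';

// The percent-encoded form contains only ASCII bytes.
$isAsciiBefore = preg_match('/[^\x00-\x7F]/', $out) === 0;

// Decoding yields raw bytes; check whether they form valid UTF-8.
$decoded = urldecode($out);
$isUtf8After = mb_check_encoding($decoded, 'UTF-8');

// Encoding the decoded text again should reproduce the original string.
$roundTrips = urlencode($decoded) === $out;

var_dump($isAsciiBefore, $isUtf8After, $roundTrips);
```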

The conclusion is that a conversion between ASCII and UTF-8 is clearly not needed in this case, and the input string may be wrong (perhaps because of some unnecessary conversion from UTF-8 to something I cannot recognize).

Converting UTF-8 strings to ASCII using the ICU Transliterator

With the general availability and widespread support of UTF-8, character encoding issues are thankfully becoming a problem of the past. Unfortunately, there are still tons of legacy systems out there that don’t support it.

I ran into this exact problem quite recently. I had built a "Book an appointment" form for a client. All user input, including the customer’s name, is sent to the client’s legacy CRM via a proprietary HTTP API. It turned out that said CRM only accepts ASCII ☹️. That’s right: just ASCII, not even Extended ASCII. Any attempt to send a string with non-ASCII characters resulted in an HTTP 400 error.

That meant that people with names like Bjørn or François couldn’t use that form, because those names contain non-ASCII characters. Naturally, it is not acceptable by any means to exclude Bjørn and François from using our form just because their names contain letters that don’t appear on a 1960s teletypewriter.

I consulted with the client, but sadly the problem couldn’t (or wouldn’t) be fixed on their end, and they asked if I could provide a solution. So I needed to come up with a way to transform or convert the user’s input into ASCII.

The desired result

First, let’s define what the actual desired result is. I’ll be using this fictitious name: Daniël Renée François Bjørn in’t Veld. Apart from "Veld", every word in this string contains a non-ASCII character. If we need to convert this string to ASCII, we should find characters that look similar. To be precise, I want the end result to be: Daniel Renee Francois Bjorn in't Veld. In my opinion that is as close as we can get.

At this point I want to stress that if you have a viable way to refrain from having to convert user input (e.g., someone’s name), you absolutely should! In other words: if someone is called Bjørn, please go out of your way to make sure your systems call them Bjørn. Someone’s name is part of their identity and not something you want to mess up. I for one already get annoyed when a system autocapitalises my surname into "Van Raaij". Imagine my frustration if I were to be called "B@rt" just because a system doesn’t have the letter a in its character set.

That being said: given the choice between a) not being able to use a form or service at all or b) being called Bjorn, I’m sure that Bjørn would choose the latter.

Enough talk, let’s code! Converting a UTF-8 string to ASCII can’t be hard, right?

Note: I’ll be using PHP, but the examples are applicable to other languages as well.

The obvious choice: iconv

iconv - Convert string to requested character encoding

iconv ( string $in_charset , string $out_charset , string $str ) : string

Performs a character set conversion on the string str from in_charset to out_charset.

As the documentation states, there are three ‘modes’ in which iconv can operate: plain, IGNORE and TRANSLIT. Let’s not waste any time and put it to the test:

$name = 'Daniël Renée François Bjørn in’t Veld';
$plain = iconv("UTF-8", "ASCII", $name);
$ignore = iconv("UTF-8", "ASCII//IGNORE", $name);
$translit = iconv("UTF-8", "ASCII//TRANSLIT", $name);
var_dump($plain, $ignore, $translit);

Notice: iconv(): Detected an illegal character in input string in /in/RREJl on line 4
bool(false)
string(32) "Danil Rene Franois Bjrn int Veld"
string(37) "Dani?l Ren?e Fran?ois Bj?rn in't Veld"

The plain mode triggered an E_NOTICE and returned false . It means that iconv detected one or more characters that it couldn’t fit into the output charset, and it just gave up;
The IGNORE mode simply discarded the characters it couldn’t fit into ASCII;
The TRANSLIT mode tried to replace the non-ASCII characters with similar-looking ASCII characters, but mostly failed. Except for ’ (the Right Single Quotation Mark, which is not uncommon in Dutch surnames), they’re all replaced by a question mark.

The PHP docs warn that this may happen: "TRANSLIT conversion is likely to fail for characters which are illegal for the out_charset." And if you read the comments in the documentation you’ll find that iconv’s TRANSLIT mode behaves very inconsistently between different systems. So apparently we can’t rely on iconv’s TRANSLIT mode at all.

Technically I could’ve used the IGNORE mode of iconv and been done with it. Its output no longer contains any non-ASCII characters, so my API call wouldn’t fail. But it’s not the result I set out for. Again: if my name is Bjørn, I want to be called Bjørn. I can live with "Bjorn", but not "Bjrn" and certainly not "Bj?rn".
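Whichever conversion ends up being used, the API in this story needs one guarantee: no byte above 0x7F. A minimal sketch of such a guard follows; the function name `is_ascii` is hypothetical and not part of PHP or of the original post.

```php
<?php
/**
 * Hypothetical guard: returns true when $value contains only 7-bit ASCII bytes.
 * The pattern is matched byte by byte (no /u modifier), so it also works on
 * strings that are not valid UTF-8.
 */
function is_ascii(string $value): bool
{
    return preg_match('/[^\x00-\x7F]/', $value) === 0;
}

var_dump(is_ascii('Bjorn'));
var_dump(is_ascii("Bj\u{F8}rn"));
```

A check like this is cheap enough to run right before the request is sent, independent of how the string was converted.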

Transliteration

Although iconv’s TRANSLIT mode doesn’t seem usable, I feel we are on the right track with transliteration. So what exactly is transliteration?

Transliteration, in the general sense of the word, is "conversion of a text from one script to another that involves swapping letters in predictable ways" (Wikipedia). It is, for example, the conversion of Игорь Стравинский (Cyrillic script) to Igor Stravinsky (Latin script).

Now think of a character set as a script, and immediately it makes sense to use transliteration to convert text from one character set to another. The character ø is in the UTF-8 ‘script’ but not in ASCII. Transliterating UTF-8 to ASCII would mean finding the ASCII character that represents that character as well as possible.

Is it possible to perform these kinds of transliteration programmatically? Yes, it is!
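In its simplest form, "swapping letters in predictable ways" is a lookup table. The sketch below uses a hand-written table that covers only the characters in the sample name; the constant `SAMPLE_MAP` and the function `naive_transliterate` are illustrative names, not an existing API, and a real table would need far more entries.

```php
<?php
// Hand-written table covering only the sample name (an assumption for
// illustration, not a complete mapping).
const SAMPLE_MAP = [
    "\u{00EB}" => 'e', // ë
    "\u{00E9}" => 'e', // é
    "\u{00E7}" => 'c', // ç
    "\u{00F8}" => 'o', // ø
    "\u{2019}" => "'", // ’ right single quotation mark
];

function naive_transliterate(string $utf8): string
{
    // strtr() with an array replaces each key it finds and never rescans
    // text it has already replaced.
    return strtr($utf8, SAMPLE_MAP);
}

echo naive_transliterate("Bj\u{F8}rn in\u{2019}t Veld"), PHP_EOL;
```

Any character missing from the table passes through unchanged, which is why maintaining such a table by hand does not scale.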

International Components for Unicode (ICU)

Enter ICU. The International Components for Unicode constitute a "cross-platform Unicode based globalisation library" with components for "locale-sensitive string comparison, date/time/number/currency/message formatting, text boundary detection, character set conversion and so on". It’s built and provided by the Unicode Consortium as C/C++ and Java libraries, but wrappers exist for plenty of other languages, including PHP. In PHP it’s better known as the Internationalization extension, or ext-intl.

Speaking of which, this sentence on the ICU Related Projects page made me smile:

"The upcoming PHP 6 language is expected to support Unicode through ICU4C".

I could probably write a blog post for each and every component in the ICU library (I find internationalisation mighty interesting), but let’s focus and see if the ICU Transliterator can help us in our quest to correctly convert UTF-8 to ASCII.

Using the ICU Transliterator

Let’s dive right in. The PHP function we’re looking for is transliterator_transliterate :

transliterator_transliterate — Transliterate a string

transliterator_transliterate ( mixed $transliterator , string $subject [, int $start [, int $end ]] )

Transforms a string or part thereof using an ICU transliterator.

Note: I’m using the procedural function here for brevity, but PHP also provides a Transliterator class.

The function call looks pretty straightforward at first, but the $transliterator parameter is where it gets a bit tricky. The docs are fairly brief and don’t give much guidance, but fortunately the ICU docs provide some insights:

Latin-ASCII: Converts non-ASCII-range punctuation, symbols, and Latin letters in an approximate ASCII-range equivalent.

$name = 'Daniël Renée François Bjørn in’t Veld';
$translitRules = 'Latin-ASCII';
$nameAscii = transliterator_transliterate($translitRules, $name);
var_dump($nameAscii);
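In application code the call above usually needs a little protection around it. The following is a sketch under stated assumptions, not the author's implementation: the helper name `to_ascii_or_null` is hypothetical, it returns null when ext-intl is missing or the transliterator cannot be created, and it replaces anything Latin-ASCII leaves untouched (for example Cyrillic letters) with a question mark so the result is guaranteed to be ASCII.

```php
<?php
/**
 * Hypothetical helper: transliterates UTF-8 text to ASCII with ICU's
 * Latin-ASCII transform. Returns null when ext-intl is unavailable or the
 * transliteration fails.
 */
function to_ascii_or_null(string $utf8): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null; // ext-intl is not loaded
    }

    $ascii = transliterator_transliterate('Latin-ASCII', $utf8);
    if ($ascii === false) {
        return null;
    }

    // Latin-ASCII targets Latin letters, punctuation and symbols; replace
    // whatever non-ASCII characters remain with '?'.
    return preg_replace('/[^\x00-\x7F]/u', '?', $ascii);
}

var_dump(to_ascii_or_null('Daniël Renée François Bjørn in’t Veld'));
```

Returning null instead of a partly converted string lets the caller decide whether to fall back to another strategy or reject the input.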
