PHP string encoding UTF-8

Contents

mb_convert_encoding
Parameters
Return Values
Errors/Exceptions
Changelog
Examples
See Also
User Contributed Notes 35 notes

mb_convert_encoding

Converts string from from_encoding, or the current internal encoding, to to_encoding. If string is an array, all its string values will be converted recursively.

Parameters

string
The string or array to be converted.

to_encoding
The desired encoding of the result.

from_encoding
The current encoding used to interpret string. Multiple encodings may be specified as an array or comma-separated list, in which case the correct encoding will be guessed using the same algorithm as mb_detect_encoding().

If from_encoding is omitted or null, the mbstring.internal_encoding setting will be used if set, otherwise the default_charset setting.

See supported encodings for valid values of to_encoding and from_encoding.

Return Values

The encoded string or array on success, or false on failure.

Errors/Exceptions

As of PHP 8.0.0, a ValueError is thrown if the value of to_encoding or from_encoding is an invalid encoding. Prior to PHP 8.0.0, an E_WARNING was emitted instead.
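The behaviour described above can be handled in one place. The following is a minimal sketch, not part of the manual: convert_or_null() is a hypothetical helper name, and returning null on failure is an assumption about what the caller wants.

```php
<?php
// Hypothetical helper: hides the difference between the PHP 8 ValueError
// and the pre-8.0 "warning + false" behaviour behind a nullable return.
function convert_or_null(string $text, string $to, string $from): ?string
{
    try {
        $result = mb_convert_encoding($text, $to, $from);
    } catch (ValueError $e) {
        // PHP >= 8.0: one of the encoding names is not recognised.
        return null;
    }

    // PHP < 8.0 reports failure by returning false instead of throwing.
    return $result === false ? null : $result;
}

$ok  = convert_or_null('abc', 'UTF-8', 'ISO-8859-1');      // 'abc'
$bad = convert_or_null('abc', 'UTF-8', 'NOT-AN-ENCODING'); // null on PHP 8
```

Catching ValueError is harmless on older versions, because a catch clause naming a class that does not exist simply never matches.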

Changelog

Version	Description
8.0.0	mb_convert_encoding() will now throw a ValueError when to_encoding is passed an invalid encoding.
8.0.0	mb_convert_encoding() will now throw a ValueError when from_encoding is passed an invalid encoding.
8.0.0	from_encoding is nullable now.
7.2.0	This function now also accepts an array as string. Formerly, only strings were supported.
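As an illustration of the array support added in 7.2.0, here is a small sketch. The $row data is made up for the example, and the input bytes are assumed to be ISO-8859-1.

```php
<?php
// Made-up record whose string values are ISO-8859-1 bytes.
$row = [
    'name' => "Caf\xE9",               // "Café" in ISO-8859-1
    'tags' => ["na\xEFve", 'plain'],   // nested arrays are walked as well
];

// One call converts every string value, including the nested ones.
$utf8 = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-1');

// $utf8['name']    is now "Caf\xC3\xA9"  (UTF-8 bytes for "Café")
// $utf8['tags'][0] is now "na\xC3\xAFve" (UTF-8 bytes for "naïve")
```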

Examples

Example #1 mb_convert_encoding() example

<?php
/* Convert internal character encoding to SJIS */
$str = mb_convert_encoding($str, "SJIS");

/* Convert EUC-JP to UTF-7 */
$str = mb_convert_encoding($str, "UTF-7", "EUC-JP");

/* Auto detect encoding from JIS, eucjp-win, sjis-win, then convert str to UCS-2LE */
$str = mb_convert_encoding($str, "UCS-2LE", "JIS, eucjp-win, sjis-win");

/* If mbstring.language is "Japanese", "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "EUC-JP", "auto");
?>

User Contributed Notes 35 notes

For my last project I needed to convert several CSV files from Windows-1250 to UTF-8. After several days of searching I found a function that partially solved my problem, but it still did not convert all the characters. So I made this:

I've been trying to find the charset of a Norwegian (with a lot of ø, æ, å) txt file written on a Mac. I found it this way:

<?php
$text = "A strange string to pass, maybe with some ø, æ, å characters.";

// Try every encoding mbstring knows and print the result next to its name.
foreach (mb_list_encodings() as $chr) {
    echo mb_convert_encoding($text, 'UTF-8', $chr) . " : " . $chr . "\n";
}
?>

The line that looks right tells you the encoding the file was written in.

Hey guys. For everybody who's looking for a function that converts an ISO string to UTF-8 or a UTF-8 string to ISO, here's your solution:

public function encodeToUtf8($string) {
    return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}

public function encodeToIso($string) {
    return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}

For me these functions are working fine. Give it a try.

aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:

mbstring.substitute_character = "none"

in your php.ini. Be sure to include the quotes around none. Or at run-time with

<?php
ini_set('mbstring.substitute_character', "none");
?>
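The same setting can also be changed for a single conversion with mb_substitute_character(). The sketch below is an illustration rather than part of the note; the sample string and the choice to restore the previous value afterwards are assumptions.

```php
<?php
// Remember the current substitute character (an int code point or a keyword).
$previous = mb_substitute_character();

// Drop characters that cannot be represented in the target encoding.
mb_substitute_character('none');

// U+20AC (euro sign) does not exist in ISO-8859-1, so it is discarded.
$latin1 = mb_convert_encoding("a\u{20AC}b", 'ISO-8859-1', 'UTF-8');

// Put the previous behaviour back so later conversions are unaffected.
mb_substitute_character($previous);
```

With the default setting the euro sign would have been replaced by "?"; with 'none' the result is just "ab".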

My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)

Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding.

<?php
function convert_to($source, $target_encoding)
{
    // detect the character encoding of the incoming file
    $encoding = mb_detect_encoding($source, "auto");

    // escape all of the question marks so we can remove artifacts from
    // the unicode conversion process
    $target = str_replace("?", "[question_mark]", $source);

    // convert the string to the target encoding
    $target = mb_convert_encoding($target, $target_encoding, $encoding);

    // remove any question marks that have been introduced because of illegal characters
    $target = str_replace("?", "", $target);

    // replace the token string "[question_mark]" with the symbol "?"
    $target = str_replace("[question_mark]", "?", $target);

    return $target;
}
?>

Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)
-A

Many people below talk about using
<?php
mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
?>
to convert non-ASCII characters into HTML-readable entities. Because my web server was out of my control, I was unable to set the database character set, and whenever PHP made a copy of the $s variable that it had pulled out of the database, it would convert it to nasty latin1 automatically and not leave it in its beautiful UTF-8 glory.

So [insert korean characters here] turned into .

I found myself needing to pass by reference (call-time pass-by-reference is deprecated or nonexistent in recent versions of PHP), so instead of
<?php
mb_convert_encoding(&$s, 'HTML-ENTITIES', 'UTF-8');
?>
which worked perfectly until I upgraded, I had to use
<?php
call_user_func_array('mb_convert_encoding', array(&$s, 'HTML-ENTITIES', 'UTF-8'));
?>

Hope it helps someone else out

To add to the Flash conversion comment below, here’s how I convert back from what I’ve stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:

function htmltoflash($htmlstr)
{
    return str_replace("&lt;br /&gt;", "\n",
        str_replace("<", "&lt;",
            str_replace(">", "&gt;",
                mb_convert_encoding(html_entity_decode($htmlstr),
                    "UTF-8", "ISO-8859-1"))));
}

When you need to convert from HTML-ENTITIES but your UTF-8 string is partially broken (not all characters are valid UTF-8), passing the string to mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES'); corrupts the characters even more. In this case you need to replace the HTML entities one by one so that the correctly encoded characters are preserved. I wrote this closure for the job:
<?php
$decode_entities = function ($string) {
    // collect every entity that occurs in the string
    preg_match_all("/&#?\w+;/", $string, $entities, PREG_SET_ORDER);
    $entities = array_unique(array_column($entities, 0));
    // decode each entity on its own and substitute it back
    foreach ($entities as $entity) {
        $decoded = mb_convert_encoding($entity, 'UTF-8', 'HTML-ENTITIES');
        $string = str_replace($entity, $decoded, $string);
    }
    return $string;
};
?>

If you are trying to generate a CSV (with extended chars) to be opened in Excel for Mac, the only thing that worked for me was:

I also tried this:

<?php
// Field separation OK, chars wrong
iconv('MACINTOSH', 'UTF8', $CSV);
// Field separation wrong, chars OK
chr(255) . chr(254) . mb_convert_encoding($CSV, 'UCS-2LE', 'UTF-8');
?>

But the first one didn't show extended chars correctly, and the second one didn't separate fields correctly.

If you have what looks like ISO-8859-1, but it includes "smart quotes" courtesy of Microsoft software, or people cutting and pasting content from Microsoft software, then what you're actually dealing with is probably Windows-1252. Try this:

<?php
$cleanText = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
?>

The annoying part is that the auto detection (i.e. the mb_detect_encoding function) will often think Windows-1252 is ISO-8859-1. Close, but no cigar. This is critical if you're then trying to do unserialize on the resulting text, because the byte count of the string needs to be perfect.
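To make the difference concrete, here is a short sketch (the sample bytes are made up) that converts the same input under both assumptions. Bytes 0x93 and 0x94 are curly double quotes in Windows-1252 but C1 control characters in ISO-8859-1.

```php
<?php
// "quoted" wrapped in Windows-1252 smart quotes (bytes 0x93 and 0x94).
$bytes = "\x93quoted\x94";

// Interpreted as Windows-1252: U+201C and U+201D, three UTF-8 bytes each.
$fromCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

// Interpreted as ISO-8859-1: U+0093 and U+0094, two UTF-8 bytes each.
$fromLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// The byte lengths differ (12 versus 10), which is exactly what breaks
// unserialize() when the wrong source encoding is assumed.
$lengths = [strlen($fromCp1252), strlen($fromLatin1)];
```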

Text-encoding HTML-ENTITIES will be deprecated as of PHP 8.2.

To convert all non-ASCII characters into entities (to produce pure 7-bit HTML output), I was using:

<?php
echo mb_convert_encoding(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), 'HTML-ENTITIES', 'UTF-8');
?>

I can get the identical result with:

<?php
echo mb_encode_numericentity(htmlentities($text, ENT_QUOTES, 'UTF-8'), [0x80, 0x10FFFF, 0, ~0], 'UTF-8');
?>

The output contains well-known named entities for some often used characters and numeric entities for the rest.
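For reference, this is roughly what mb_encode_numericentity() does on its own, without the htmlentities() step. It is a sketch with a made-up sample string; the conversion map uses 0x1FFFFF as the mask instead of ~0, which is an equivalent choice for this range as far as I can tell.

```php
<?php
// Conversion map: [first code point, last code point, offset, mask].
// Everything from U+0080 upwards becomes a decimal numeric entity.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

$utf8 = "caf\u{E9} \u{20AC}5";   // "café €5"

// ASCII characters pass through; é and € become &#233; and &#8364;.
$ascii = mb_encode_numericentity($utf8, $convmap, 'UTF-8');

// The same map turns the entities back into UTF-8 characters.
$roundTrip = mb_decode_numericentity($ascii, $convmap, 'UTF-8');
```

Because no htmlentities() call is involved here, every non-ASCII character comes out as a numeric entity rather than a named one.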

/**
 * Convert Windows-1250 to UTF-8
 * Based on https://www.php.net/manual/en/function.mb-convert-encoding.php#112547
 */
class TextConverter
{
    private const ENCODING_TO = 'UTF-8';
    private const ENCODING_FROM = 'ISO-8859-2';

    // Windows-1250 bytes remapped to the ISO-8859-2 byte of the same character
    private array $mapChrChr = [
        0x8A => 0xA9,
        0x8C => 0xA6,
        0x8D => 0xAB,
        0x8E => 0xAE,
        0x8F => 0xAC,
        0x9C => 0xB6,
        0x9D => 0xBB,
        0xA1 => 0xB7,
        0xA5 => 0xA1,
        0xBC => 0xA5,
        0x9F => 0xBC,
        0xB9 => 0xB1,
        0x9A => 0xB9,
        0xBE => 0xB5,
        0x9E => 0xBE
    ];

    // Windows-1250 bytes mapped to replacement strings; execute() passes the
    // result through html_entity_decode(). The entries are not part of this listing.
    private array $mapChrString = [];

    /**
     * @param $text
     * @return string
     */
    public function execute($text): string
    {
        $map = $this->prepareMap();

        return html_entity_decode(
            mb_convert_encoding(strtr($text, $map), self::ENCODING_TO, self::ENCODING_FROM),
            ENT_QUOTES,
            self::ENCODING_TO
        );
    }

    /**
     * @return array
     */
    private function prepareMap(): array
    {
        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), chr($v)];
        }, $this->mapChrChr);

        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), $v];
        }, $this->mapChrString);

        // combine both maps into one byte => replacement table for strtr()
        return array_merge(...$maps);
    }

    /**
     * @param callable $function
     * @param array $array
     * @return array
     */
    private function arrayMapAssoc(callable $function, array $array): array
    {
        return array_column(
            array_map(
                $function,
                array_keys($array),
                $array
            ),
            1,
            0
        );
    }
}

If you are attempting to convert "UTF-8" text to "ISO-8859-1" and the result always comes back as "ASCII", place the following line of code before the mb_convert_encoding call:

It is necessary to force a specific search order for the conversion to work.

It appears that when dealing with an unknown "from encoding", the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".

Instead of ini_set(), you can try this:

Clean a string for use as a filename by simply replacing all unwanted characters with an underscore (converting to ASCII reduces it to 7 bits). It removes slightly more characters than necessary. Hope it's useful.
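The idea described above could look something like the following. This is a minimal sketch and not the note author's code: clean_filename() is a hypothetical name, the input is assumed to be UTF-8, and the whitelist of allowed characters is my own choice.

```php
<?php
// Hypothetical helper, assuming UTF-8 input.
function clean_filename(string $name): string
{
    // Characters that have no ASCII equivalent are replaced by the current
    // substitute character ("?" unless it has been reconfigured).
    $ascii = mb_convert_encoding($name, 'ASCII', 'UTF-8');

    // Replace everything outside a conservative whitelist with "_".
    return (string) preg_replace('/[^A-Za-z0-9._-]/', '_', $ascii);
}

$safe = clean_filename("r\u{E9}sum\u{E9} 2024.txt");
```

Like the approach in the note, this removes more than strictly necessary: spaces and the substituted characters all end up as underscores.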

For those wanting to convert from $set to MacRoman, use iconv():

$string = iconv('UTF-8', 'macintosh', $string);

('macintosh' is the IANA name for the MacRoman character set.)

Why did you use the PHP HTML encode functions? mbstring has its own encoding, which is (as far as I tested it) much more useful:

$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

// convert UTF8 to DOS = CP850
//
// $utf8_text=UTF8-Formatted text;
// $dos=CP850-Formatted text;

$dos = mb_convert_encoding($utf8_text, "CP850", mb_detect_encoding($utf8_text, "UTF-8, CP850, ISO-8859-15", true));

Another sample of recoding without MultiByte enabling.
(Russian koi->win, if input in win-encoding already, function recode() returns unchanged string)

// 0 - win
// 1 - koi
function detect_encoding($str)
{
    $win = 0;
    $koi = 0;

    // count bytes that fall in the ranges typical for each encoding
    for ($i = 0; $i < strlen($str); $i++) {
        if (ord($str[$i]) > 224 && ord($str[$i]) < 255) $win++;
        if (ord($str[$i]) > 192 && ord($str[$i]) < 223) $koi++;
    }

    if ($win < $koi) {
        return 1;
    } else return 0;
}

// recodes koi to win
function koi_to_win($string)
{

$kw = array( 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 , 158 , 159 , 160 , 161 , 162 , 163 , 164 , 165 , 166 , 167 , 168 , 169 , 170 , 171 , 172 , 173 , 174 , 175 , 176 , 177 , 178 , 179 , 180 , 181 , 182 , 183 , 184 , 185 , 186 , 187 , 188 , 189 , 190 , 191 , 254 , 224 , 225 , 246 , 228 , 229 , 244 , 227 , 245 , 232 , 233 , 234 , 235 , 236 , 237 , 238 , 239 , 255 , 240 , 241 , 242 , 243 , 230 , 226 , 252 , 251 , 231 , 248 , 253 , 249 , 247 , 250 , 222 , 192 , 193 , 214 , 196 , 197 , 212 , 195 , 213 , 200 , 201 , 202 , 203 , 204 , 205 , 206 , 207 , 223 , 208 , 209 , 210 , 211 , 198 , 194 , 220 , 219 , 199 , 216 , 221 , 217 , 215 , 218 );
$wk = array( 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 , 158 , 159 , 160 , 161 , 162 , 163 , 164 , 165 , 166 , 167 , 168 , 169 , 170 , 171 , 172 , 173 , 174 , 175 , 176 , 177 , 178 , 179 , 180 , 181 , 182 , 183 , 184 , 185 , 186 , 187 , 188 , 189 , 190 , 191 , 225 , 226 , 247 , 231 , 228 , 229 , 246 , 250 , 233 , 234 , 235 , 236 , 237 , 238 , 239 , 240 , 242 , 243 , 244 , 245 , 230 , 232 , 227 , 254 , 251 , 253 , 255 , 249 , 248 , 252 , 224 , 241 , 193 , 194 , 215 , 199 , 196 , 197 , 214 , 218 , 201 , 202 , 203 , 204 , 205 , 206 , 207 , 208 , 210 , 211 , 212 , 213 , 198 , 200 , 195 , 222 , 219 , 221 , 223 , 217 , 216 , 220 , 192 , 209 );

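    $end = strlen($string);
    $pos = 0;
    // walk the string byte by byte and remap every byte above 128
    while ($pos < $end) {
        $c = ord($string[$pos]);
        if ($c > 128) {
            $string[$pos] = chr($kw[$c - 128]);
        }
        $pos++;
    }
    return $string;
}

function recode($str)
{
    $enc = detect_encoding($str);
    if ($enc == 1) {
        $str = koi_to_win($str);
    }
    return $str;
}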
