A Guide to UTF-8 Encoding in PHP and MySQL

Once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8.

Indeed, navigating through UTF-8 related issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned.

By Francisco Clariá

Verified Expert in Engineering

Francisco is an engineer focused on cross-platform apps (Ionic/Cordova) and specialized in hardware-software technology integration.

As a MySQL or PHP developer, once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8 encoding.

Unicode is a widely used computing industry standard that maps unique numeric code values to the characters in most of today’s written character sets, to aid system interoperability and data interchange.

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 encodes each character using one to four bytes. The first 128 characters of Unicode correspond one-to-one with ASCII, making valid ASCII text also valid UTF-8-encoded text. It is for this reason that systems that are limited to use of the English character set are insulated from the complexities that can otherwise arise with UTF-8.

For example, the Unicode hexadecimal code for the letter A is U+0041, which in UTF-8 is simply encoded with the single byte 41. In comparison, the character in the accompanying figure [image: utf8 symbol] falls outside the ASCII range and therefore requires more than one byte.
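
To make the byte counts concrete, here is a minimal PHP sketch that prints the UTF-8 bytes of a few characters. The helper name utf8_hex and the sample characters are illustrative choices (they are not the character from the original figure), and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: show the bytes of a UTF-8 string as spaced hex pairs.
function utf8_hex(string $s): string
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

echo utf8_hex('A'), "\n";          // 41          (U+0041, one byte)
echo utf8_hex("\u{00E9}"), "\n";   // C3 A9       (U+00E9, two bytes)
echo utf8_hex("\u{20AC}"), "\n";   // E2 82 AC    (U+20AC, three bytes)
echo utf8_hex("\u{1F600}"), "\n";  // F0 9F 98 80 (U+1F600, four bytes)
```

Only the first of these is valid ASCII; the others are exactly the kind of input that exposes encoding mistakes further down the stack.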

On a previous job, we began running into data encoding issues when displaying bios of artists from all over the world. It soon became apparent that there were problems with the stored data, as sometimes the data was correctly encoded and sometimes it was not.

This led programmers to implement a hodge-podge of patches, sometimes with JavaScript, sometimes with HTML charset meta tags, sometimes with PHP, and so on. Soon, we ended up with a list of 600,000 artist bios with double- or triple-encoded information, with data being stored in different ways depending on who programmed the feature or implemented the patch. A classic technical rat’s nest.

Indeed, navigating through UTF-8 data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these UTF-8 issues when working with PHP and MySQL in particular, based on practical experience and lessons learned (and with thanks, in part, to information discovered here and here along the way).

Specifically, we’ll cover the following in this post:

  • Mods you’ll need to make to your php.ini file and PHP code.
  • Mods you’ll need to make to your my.ini file and other MySQL-related issues to be aware of (including config mods needed if you’re using Sphinx)
  • How to migrate data from a MySQL database previously encoded in latin1 to instead use a UTF-8 encoding

PHP UTF-8 Encoding – modifications to your php.ini file:

The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set:

 default_charset = "UTF-8"

(Note: You can subsequently use phpinfo() to verify that this has been set properly.)
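
If scanning the phpinfo() output is inconvenient, the same check can be scripted. The following is a minimal sketch, assuming that a runtime override via ini_set() is acceptable as a stopgap when php.ini cannot be edited right away; fixing php.ini remains the real solution.

```php
<?php
// Read the effective value of the default_charset directive.
$charset = ini_get('default_charset');

if (strcasecmp((string) $charset, 'UTF-8') !== 0) {
    // Assumption: overriding at runtime is acceptable as a temporary measure.
    ini_set('default_charset', 'UTF-8');
}

echo 'default_charset = ', ini_get('default_charset'), "\n";
```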

OK cool, so now PHP and UTF-8 should work just fine together. Right?

Well, not exactly. In fact, not even close.

While this change will ensure that PHP always outputs UTF-8 as the character encoding (in browser response Content-type headers), you still need to make a number of modifications to your PHP code to make sure that it properly processes and generates UTF-8 characters.

PHP UTF-8 Encoding – modifications to your code:

To be sure that your PHP code plays well in the UTF-8 data encoding sandbox, here are the things you need to do:

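  1. Set UTF-8 as the character set for all headers output by your PHP code. In every PHP output header, specify UTF-8 as the encoding:

         header('Content-Type: text/html; charset=utf-8');

  2. Strip characters that are not valid in XML from any XML that you generate, since not every character is accepted in an XML document:

         // Replace runs of characters outside the ranges tab, LF, CR,
         // U+0020-U+D7FF and U+E000-U+FFFD with a single space.
         // Note: this also replaces characters above U+FFFD, such as emoji.
         function utf8_for_xml($string) {
             return preg_replace('/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
         }

         $safeString = utf8_for_xml($yourUnsafeString);

  3. Specify UTF-8 as the encoding in all calls to htmlspecialchars:

         htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")

  4. Set UTF-8 as the default character set for all MySQL connections:

         $link = mysql_connect('localhost', 'user', 'password');
         mysql_set_charset('utf8', $link);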

Note that the mysql extension, including mysql_set_charset, has been deprecated since PHP 5.5.0 and was removed in PHP 7.0.0; mysqli::set_charset should be used instead:

     $mysqli = new mysqli("localhost", "my_user", "my_password", "test");

     /* check connection */
     if (mysqli_connect_errno()) {
         printf("Connect failed: %s\n", mysqli_connect_error());
         exit();
     }

     /* change character set to utf8 */
     if (!$mysqli->set_charset("utf8")) {
         printf("Error loading character set utf8: %s\n", $mysqli->error);
     } else {
         printf("Current character set: %s\n", $mysqli->character_set_name());
     }

     $mysqli->close();

Finally, always use multibyte-aware versions of PHP’s string manipulation functions. The standard functions (strlen, substr, and so on) count bytes rather than characters, so they miscount or split multibyte UTF-8 characters:

  • The iconv functions that are available by default with PHP provide multibyte-compatible versions of many of these functions (e.g., iconv_strlen, etc.). Remember, though, that the strings you provide to these functions must themselves be properly encoded.
  • There is also the mbstring extension to PHP (information on enabling and configuring it is available here). This extension provides a comprehensive set of functions that properly account for multibyte encoding.
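
The difference between the byte-oriented and multibyte-aware functions is easy to see on a short string. This is a minimal sketch, assuming the mbstring and iconv extensions are both enabled; the sample word is an arbitrary choice.

```php
<?php
$word = "na\u{00EF}ve"; // "naïve": the ï occupies two bytes in UTF-8

$bytes      = strlen($word);                // counts bytes
$chars      = mb_strlen($word, 'UTF-8');    // counts characters
$iconvChars = iconv_strlen($word, 'UTF-8'); // counts characters

$broken = substr($word, 0, 3);              // cuts the ï in half
$intact = mb_substr($word, 0, 3, 'UTF-8');  // keeps "naï" whole

printf("%d bytes, %d characters\n", $bytes, $chars);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
```

A string truncated with substr, like $broken here, is no longer valid UTF-8, which is one common way stored data ends up corrupted.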

MySQL UTF-8 Encoding – modifications to your my.ini file:

On the MySQL/UTF-8 side of things, modifications to the my.ini file are required as follows:

  1. Set the following config parameters after each corresponding tag (note that MySQL spells the character set name utf8, without a hyphen):

         [client]
         default-character-set=utf8

         [mysql]
         default-character-set=utf8

         [mysqld]
         character-set-client-handshake = false #force encoding to utf8
         character-set-server=utf8
         collation-server=utf8_general_ci

         [mysqld_safe]
         default-character-set=utf8

  2. Restart MySQL, then verify the settings from the MySQL client:

         mysql> show variables like 'char%';

     Every character set variable except character_set_filesystem should report utf8:

         | character_set_client     | utf8                       |
         | character_set_connection | utf8                       |
         | character_set_database   | utf8                       |
         | character_set_filesystem | binary                     |
         | character_set_results    | utf8                       |
         | character_set_server     | utf8                       |
         | character_set_system     | utf8                       |
         | character_sets_dir       | /usr/share/mysql/charsets/ |

MySQL UTF-8 Encoding – other things to consider:

  • MySQL’s utf8 character set is actually a partial implementation of UTF-8. Specifically, it uses a maximum of three bytes per character, whereas four bytes are required to encode the full Unicode range. This is fine for every character in the Basic Multilingual Plane, which covers the scripts of most living languages, but astral symbols (whose code points range from U+010000 to U+10FFFF, emoji among them) require a four-byte encoding that utf8 does not support. MySQL 5.5.3 addressed this by adding the utf8mb4 character set, which uses a maximum of four bytes per character and thereby supports the full UTF-8 encoding. So if you’re using MySQL 5.5.3 or later, use utf8mb4 instead of utf8 as your database/table/column character set. More info is available here.
  • If the connecting client has no way to specify the encoding for its communication with MySQL, after the connection is established you may have to run the following command/query:

         SET NAMES utf8;
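
To judge whether the three-byte limit actually affects your data, you can test strings for code points above U+FFFF before they reach the database. This is a minimal sketch; the function name needs_utf8mb4 is hypothetical, and the check assumes the input is valid UTF-8 (preg_match with the /u modifier fails on invalid input, which this helper reports as false).

```php
<?php
// Hypothetical helper: true if the string contains a code point above U+FFFF,
// i.e. one that UTF-8 encodes in four bytes.
function needs_utf8mb4(string $s): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(needs_utf8mb4("caf\u{00E9}"));       // bool(false): two-byte character
var_dump(needs_utf8mb4("\u{20AC}100"));       // bool(false): three-byte character
var_dump(needs_utf8mb4("thanks \u{1F600}"));  // bool(true): four-byte character
```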

MySQL UTF-8 Encoding – if you use Sphinx:

Add the following to the source definition in your Sphinx configuration file:

     sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
     sql_query_pre = SET NAMES utf8

Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here’s how to convert the latin1 to UTF-8:

  1. Make sure you’ve made all the modifications to the configuration settings in your my.ini file, as described above.
  2. Execute the following command:

         ALTER SCHEMA `your-db-name` DEFAULT CHARACTER SET utf8;

  3. Verify that the character set variables now report utf8:

         mysql> show variables like 'char%';

  4. Dump the table you want to convert, using latin1 as the dump character set:

         mysqldump -u USERNAME -pDB_PASSWORD --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert DATABASENAME --tables TABLENAME > DUMP_FILE_TABLE.sql

     For example:

         mysqldump -u root --opt --skip-set-charset --default-character-set=latin1 --skip-extended-insert artists-database --tables tbl_artist > tbl_artist.sql

  5. Replace the charset declaration in the dump file:

         perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' DUMP_FILE_TABLE.sql

  6. Restore the dump into the database. Back up your data first, since from this point on the stored data is modified:

         mysql> source "DUMP_FILE_TABLE.sql";

  7. Find and repair any records that did not convert properly:

     See if there are any records with multi-byte characters (if this query returns zero, then there don’t appear to be any records with multi-byte characters in your table and you can skip the remaining repair steps).

         mysql> select count(*) from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD);

     Copy the rows with multi-byte characters into a temporary table:

         create table temptable ( select * from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD));

     Set the column’s character set to latin1, which undoes the extra round of encoding:

         alter table temptable modify temptable.ArtistName varchar(128) character set latin1;

Note: Be sure to use the correct field type for your table. In the example above, for our table, the correct field type for ‘ArtistName’ was varchar(128), but the field in your table could be text or any other type. Be sure to specify it properly!

The problem is that now, if we set the column encoding back to UTF-8, MySQL will run the latin1 to UTF-8 data encoding for us again and we’ll be back to where we started. To avoid this, we change the column type to blob and THEN we set it to UTF-8. This exploits the fact that MySQL will not attempt to encode a blob. We are thereby able to “fool” the MySQL charset conversion to avoid the double encoding issue, e.g.:

     alter table temptable modify temptable.ArtistName blob;
     alter table temptable modify temptable.ArtistName varchar(128) character set utf8;

Remove the rows that contain only single-byte characters from the temporary table:

     delete from temptable where LENGTH(MY_FIELD) = CHAR_LENGTH(MY_FIELD);

Write the repaired rows back into the original table:

     replace into MY_TABLE (select * from temptable);
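
The double encoding that this procedure repairs can be reproduced in a few lines of PHP, which may help when inspecting suspicious rows. This is a sketch for illustration only and is not part of the migration itself. It assumes the mbstring extension is enabled, and it uses ISO-8859-1 as a stand-in for MySQL’s latin1, which is really a Windows-1252 variant; the two agree for the bytes used here.

```php
<?php
$original = "\u{00E9}"; // "é", stored correctly as the UTF-8 bytes C3 A9

// The mistake: treat those bytes as latin1 text and encode them as UTF-8 again.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The repair: convert back to latin1, which restores the original bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // c3a9
echo bin2hex($double), "\n";   // c383c2a9
echo bin2hex($repaired), "\n"; // c3a9
```

The double-encoded value is four bytes but two characters, which is why comparing LENGTH with CHAR_LENGTH singles out such rows.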

Source code and resource files

One other thing to remember and verify is that your source code files, resource files, and so on, are all being saved properly with UTF-8 data encoding. Otherwise, any “special” characters in these files may not be handled correctly.

In NetBeans, for example, you can right-click on your project, choose Properties, and then under “Sources” you will find the data encoding option (it usually defaults to UTF-8, but it’s worth checking).

Or in Windows Notepad, use the “Save As…” option in the File menu, and select the UTF-8 encoding option at the bottom of the dialog. (Note that the “Unicode” option that Notepad provides is actually UTF-16, so that’s not what you want.)
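
Editor settings aside, you can also check files programmatically. The following is a minimal sketch, assuming the mbstring extension is enabled; the helper name is_valid_utf8 and the file path in the comment are placeholders.

```php
<?php
// Hypothetical helper: report whether raw file contents are valid UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// Usage with a real file (the path is a placeholder):
// $ok = is_valid_utf8(file_get_contents('path/to/source.php'));

var_dump(is_valid_utf8("caf\u{00E9}")); // bool(true): é encoded as UTF-8 (C3 A9)
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): é as a lone latin1 byte
```

A file saved as latin1 that contains any accented character will fail this check, while pure-ASCII files pass regardless of which of the two encodings was chosen.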

Wrap-up

Although it can be somewhat tedious, taking the time to go through these steps to systematically address your MySQL and PHP UTF-8 data encoding issues can ultimately save you a great deal of time and grief. In the long run, this type of methodical approach is far superior to the all-too-common tendency to just keep patching the system.

Hopefully this guide underscores how important it is to settle on a character set when you first set up a project environment, and to work in an environment that properly accounts for character encoding whenever it manipulates text and strings.

Understanding the basics

What is UTF-8 character set?

Defined by the Unicode standard, UTF-8 is a variable-width character encoding built from 8-bit units and capable of storing any Unicode character. It is backwards compatible with ASCII.

What does UTF-8 stand for?

UTF is short for Unicode Transformation Format, while the “8” suffix denotes the use of 8-bit blocks to represent characters.

How to insert Unicode characters in MySQL using PHP?

In order to insert Unicode characters in MySQL, you need to create a table with Unicode support, select the appropriate encoding/collation settings, and specify the charset in the MySQL connection. Then, you can proceed and employ PHP code to insert Unicode as you please.
