PHP filesystem and UTF-8

How do I use filesystem functions in PHP, using UTF-8 strings?

Just urlencode the string desired as a filename. All characters returned from urlencode are valid in filenames (NTFS/HFS/UNIX), then you can just urldecode the filenames back to UTF-8 (or whatever encoding they were in).

Caveats (all apply to the solutions below as well):

  • After url-encoding, the filename must be less than 255 characters (probably bytes).
  • Unicode has multiple representations for many characters (precomposed forms versus sequences with combining characters), and each one encodes to different UTF-8 bytes. If you don’t normalize your UTF-8, you may have trouble searching with glob or reopening an individual file.
  • You can’t rely on scandir or similar functions for alpha-sorting. You must urldecode the filenames then use a sorting algorithm aware of UTF-8 (and collations).
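
A minimal sketch of the url-encoding approach with the first two caveats applied. The helper names `encodeFilename` and `decodeFilename` are hypothetical, the 255-byte limit is taken from the caveat above, and normalization only happens if the optional intl extension (which provides `Normalizer`) is loaded:

```php
<?php
// Hypothetical helpers illustrating the urlencode approach; not a library API.

/** Turn a UTF-8 string into an ASCII-only filename. */
function encodeFilename(string $utf8Name): string
{
    // Normalize to NFC so the same visible name always yields the same bytes.
    // Normalizer comes from the optional intl extension (assumption: may be absent).
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($utf8Name, Normalizer::FORM_C);
        if ($normalized !== false) {
            $utf8Name = $normalized;
        }
    }

    $encoded = urlencode($utf8Name);

    // Each non-ASCII byte becomes three characters, so check the encoded length.
    if (strlen($encoded) > 255) {
        throw new LengthException('Encoded filename exceeds 255 bytes');
    }
    return $encoded;
}

/** Recover the original UTF-8 name from a stored filename. */
function decodeFilename(string $storedName): string
{
    return urldecode($storedName);
}

$stored = encodeFilename('Depósito.txt');
echo $stored, "\n";                 // Dep%C3%B3sito.txt
echo decodeFilename($stored), "\n"; // Depósito.txt
```

For the sorting caveat, the decoded names would still need a collation-aware sort; that part is not shown here.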

Worse Solutions

The following are less attractive solutions, more complicated and with more caveats.

On Windows, the PHP filesystem wrapper expects and returns ISO-8859-1 strings for file/directory names. This gives you two choices:

  1. Use UTF-8 freely in your filenames, but understand that non-ASCII characters will appear incorrect outside PHP. A non-ASCII UTF-8 char will be stored as multiple single ISO-8859-1 characters. E.g. ó will appear as Ã³ in Windows Explorer.
  2. Limit your file/directory names to characters representable in ISO-8859-1. In practice, you’ll pass your UTF-8 strings through utf8_decode before using them in filesystem functions, and pass the entries scandir gives you through utf8_encode to get the original filenames in UTF-8.
  • If any byte passed to a filesystem function matches an invalid Windows filesystem character in ISO-8859-1, you’re out of luck.
  • Windows may use an encoding other than ISO-8859-1 in non-English locales. I’d guess it will usually be one of ISO-8859-#, but this means you’ll need to use mb_convert_encoding instead of utf8_decode.
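
The two choices can be illustrated with a short sketch. It assumes the mbstring extension is loaded and uses `mb_convert_encoding` throughout, since `utf8_decode` and `utf8_encode` are deprecated as of PHP 8.2; the helper names and the default `ISO-8859-1` target are illustrative assumptions, not something PHP prescribes:

```php
<?php
// Sketch only: assumes mbstring is loaded and that the Windows code page in
// use is ISO-8859-1 compatible, as the text above supposes.

/** UTF-8 -> bytes to hand to filesystem functions. */
function toFilesystemEncoding(string $utf8, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8, $fsEncoding, 'UTF-8');
}

/** Bytes returned by scandir() and friends -> UTF-8. */
function fromFilesystemEncoding(string $bytes, string $fsEncoding = 'ISO-8859-1'): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $fsEncoding);
}

// Choice 2: convert before touching the filesystem.
$onDisk = toFilesystemEncoding('Depósito');
echo bin2hex($onDisk), "\n"; // 446570f37369746f

// Choice 1: raw UTF-8 bytes interpreted as ISO-8859-1 show up as mojibake.
echo fromFilesystemEncoding('ó'), "\n"; // Ã³
```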

This nightmare is why you should probably just transliterate to create filenames.

ISO-8859-1 is not more useful on Windows than ISO-8859-2 or ISO-8859-3. If you want to be safe, go with the 7-bit ASCII.

This answer doesn’t work for me. mkdir('Depósito') creates Dep%C3%B3sito, which I can’t really believe is what the OP wants, even though he accepted this answer. See Umberto Salsi’s answer for what is really going on and how to build a proper solution with setlocale() and iconv().

Under Unix and Linux (and possibly under OS X too), the current file system encoding is given by the LC_CTYPE locale parameter (see the function setlocale()). For example, it may evaluate to something like en_US.UTF-8, which means the encoding is UTF-8. File names and their paths can then be created with fopen() or retrieved by dir() in this encoding.
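
As a rough illustration, the encoding part of LC_CTYPE can be read back like this. `localeCodeset` is a hypothetical helper, and the sketch assumes the usual `language_COUNTRY.codeset[@modifier]` shape of locale names:

```php
<?php
// Hypothetical helper: extracts the codeset part of a locale string such as
// "en_US.UTF-8" or "Italian_Italy.1252". Returns null when there is none ("C").
function localeCodeset(string $locale): ?string
{
    $dot = strrpos($locale, '.');
    if ($dot === false) {
        return null;
    }
    // Drop an optional modifier such as "@euro".
    $codeset = explode('@', substr($locale, $dot + 1))[0];
    return $codeset === '' ? null : $codeset;
}

// Passing "0" asks setlocale() for the current value without changing it.
$current = setlocale(LC_CTYPE, '0');
$codeset = is_string($current) ? localeCodeset($current) : null;
echo 'LC_CTYPE codeset: ', $codeset ?? '(none reported)', "\n";
```

Note that on PHP 8 the locale at startup is "C" unless the script itself calls setlocale(), so the value reported here may not reflect the environment of the shell or web server.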

Under Windows, PHP operates as a "non-Unicode aware program", so file names are converted back and forth between the UTF-16 used by the file system (Windows 2000 and later) and the selected "code page". In the control panel "Regional and Language Options", the tab panel "Formats" sets the code page retrieved by the LC_CTYPE option, while "Administrative -> Language for non-Unicode Programs" sets the translation code page for file names. In western countries the LC_CTYPE parameter evaluates to something like language_country.1252, where 1252 is the code page, also known as "Windows-1252 encoding", which is similar (but not exactly equal) to ISO-8859-1. In Japan the 932 code page is usually set instead, and so on for other countries. Under PHP you may create files whose names can be expressed in the current code page. Vice versa, file names and paths retrieved from the file system are converted from UTF-16 to bytes using the "best-fit" mapping of the current code page.

This mapping is approximate, so some characters might be mangled in an unpredictable way. For example, Caffé Brillì.txt would be returned by dir() as the PHP string Caff\xE9 Brill\xEC.txt, as expected, if the current code page is 1252, while it would be returned as the approximate Caffe Brilli.txt on a Japanese system, because accented vowels are missing from the 932 code page and are replaced with their "best-fit" non-accented vowels. Characters that cannot be translated at all are retrieved as ? (question mark). In general, under Windows there is no safe way to detect such artifacts.
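
Artifacts in names read back from the file system cannot be detected, but a name can at least be checked before it is written. The sketch below assumes mbstring is loaded; `survivesCodePage` is a hypothetical helper, and the code page to test against is something you have to know or guess. mbstring does not emulate the Windows "best-fit" mapping (it substitutes unmappable characters instead), so the check is conservative:

```php
<?php
// Sketch, assuming mbstring is loaded. The encoding names are ones mbstring
// accepts; which code page Windows actually uses is an assumption.

/** True if $utf8Name survives conversion to $codePage and back unchanged. */
function survivesCodePage(string $utf8Name, string $codePage): bool
{
    $bytes = mb_convert_encoding($utf8Name, $codePage, 'UTF-8');
    return mb_convert_encoding($bytes, 'UTF-8', $codePage) === $utf8Name;
}

var_dump(survivesCodePage('Caffé Brillì.txt', 'Windows-1252')); // bool(true)
var_dump(survivesCodePage('Caffé Brillì.txt', 'ASCII'));        // bool(false)
```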

More details are available in my reply to the PHP bug no. 47096.


Setting a UTF-8 locale in PHP

Every PHP application should set its locale and encoding itself, regardless of the server configuration. This prevents display and behaviour problems when the site moves to another host, and in other such situations.

Setlocale

This is the main function. On success it returns the locale that was set, otherwise FALSE. It affects string functions, dates, and so on.

setlocale(LC_ALL, 'ru_RU.utf8');

A variant is possible:

Instead of LC_ALL you can specify a single category of functions that the locale will affect:

  • LC_COLLATE – string comparison functions,
  • LC_CTYPE – character classification and conversion functions,
  • LC_MONETARY – for the localeconv() function,
  • LC_NUMERIC – sets the decimal separator,
  • LC_TIME – date/time formatting,
  • LC_MESSAGES – for system messages.
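
Because setlocale() returns FALSE when the requested locale is not installed, it can help to try several names and check the result. A minimal sketch, in which `applyFirstAvailableLocale` is a hypothetical helper and the candidate names are assumptions (locale names differ between systems):

```php
<?php
// Sketch: the candidate names are assumptions; check which locales are
// installed on the target system (e.g. with `locale -a` on Linux).

/** Try each candidate in order; fall back to the always-available "C" locale. */
function applyFirstAvailableLocale(int $category, array $candidates): string
{
    // setlocale() accepts an array and returns the first name that worked,
    // or false if none of them is available.
    $applied = setlocale($category, $candidates);
    if ($applied === false) {
        $applied = setlocale($category, 'C');
    }
    return (string) $applied;
}

// Only string comparison is switched here; other categories stay untouched.
$locale = applyFirstAvailableLocale(LC_COLLATE, ['ru_RU.utf8', 'ru_RU.UTF-8']);
echo "LC_COLLATE is now: {$locale}\n";
```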

MB_string

Configuration of the functions that work with multibyte strings.

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
mb_http_output('UTF-8');
mb_language('uni');
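
A short sketch of what the internal encoding changes in practice, assuming the mbstring extension is loaded: byte-oriented functions such as strlen() count bytes, while the mb_* functions count characters in the configured encoding.

```php
<?php
// Assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

$word = 'Привет';

echo strlen($word), "\n";          // 12 (bytes)
echo mb_strlen($word), "\n";       // 6 (characters)
echo mb_strtoupper($word), "\n";   // ПРИВЕТ
echo mb_substr($word, 0, 3), "\n"; // При
```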

Time zone

The results of the date functions depend on it.

date_default_timezone_set('Europe/Moscow');

Content encoding

You can also state explicitly which encoding the content is sent in by sending a header:

header('Content-type: text/html; charset=utf-8');

Complete code

// Locale.
setlocale(LC_ALL, 'ru_RU.utf8');
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
mb_http_output('UTF-8');
mb_language('uni');
header('Content-type: text/html; charset=utf-8');
date_default_timezone_set('Europe/Moscow');


Preparing PHP application to use with UTF-8

UTF-8 is the de facto standard for web applications now, but it is not PHP’s default encoding (until 6.0). Most servers are set up for ISO-8859-1 by default. How do I override the default settings in .htaccess to be sure that everything works for UTF-8, locale, etc.? Are there options for the web server or the Unix OS? Is there a comprehensive list of those settings, e.g. the mbstring options, iconv settings, locale, etc., that I should set up for each multi-language project? Is there a predefined .htaccess I can use as an example? (In my particular case I need a setup for English, Dutch and Russian. The server is in Ukraine.)

Some useful options to have in .htaccess:

########################################
# Locale settings
########################################

# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"

SetEnv LC_ALL nl_NL.UTF-8

########################################
# Set up UTF-8 encoding
########################################

AddDefaultCharset UTF-8
AddCharset UTF-8 .php

php_value default_charset "UTF-8"

php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"

php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
php_value mbstring.encoding_translation On
php_value mbstring.func_overload 6

# See also php functions:
# mysql_set_charset
# mysql_client_encoding

# database settings
#CREATE DATABASE db_name
#  CHARACTER SET utf8
#  DEFAULT CHARACTER SET utf8
#  COLLATE utf8_general_ci
#  DEFAULT COLLATE utf8_general_ci
#  ;
#
#ALTER DATABASE db_name
#  CHARACTER SET utf8
#  DEFAULT CHARACTER SET utf8
#  COLLATE utf8_general_ci
#  DEFAULT COLLATE utf8_general_ci
#  ;

#ALTER TABLE tbl_name
#  DEFAULT CHARACTER SET utf8
#  COLLATE utf8_general_ci
#  ;
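
The php_value lines only take effect when PHP runs as an Apache module, so it may be worth checking from a script which values are actually in force. A minimal sketch; `reportIniSettings` is a hypothetical helper and the directive names are simply the ones listed above:

```php
<?php
// Sketch: collects the effective values of a few directives so you can see
// whether the .htaccess overrides were applied. ini_get() returns false for
// a directive that does not exist in the running PHP build.
function reportIniSettings(array $names): array
{
    $report = [];
    foreach ($names as $name) {
        $value = ini_get($name);
        $report[$name] = $value === false ? '(not available)' : $value;
    }
    return $report;
}

$names = ['default_charset', 'date.timezone', 'mbstring.http_output'];
foreach (reportIniSettings($names) as $name => $value) {
    echo $name, ' = ', var_export($value, true), "\n";
}
```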

You’re right, UTF-8 is a good choice for web applications.

Encoding is meta-information about the data that gets processed. As long as you know the encoding of the (binary) data, you know what you’re dealing with. You start to get lost if you don’t know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.

As a rule of thumb, PHP is binary; it’s the context (you) that specifies the encoding (e.g. how you save your PHP source files).

So let’s tackle a short (and incomplete) list:

The OS

Environment variables might tell you about the locale in use and the encoding. File systems, for example, have their own encoding for the names of files and directories. I’m not very well versed in this subject; normally we try to name our files in English so as to use only characters in the US-ASCII range, which is safe for Latin extended charsets like ISO-8859-1 (your case) as well as for UTF-8.

Just keep this in mind when you save files your users upload: filter filenames to basic letters, digits and punctuation (a-z, A-Z, 0-9, ., -, _) and you’ll have nearly no hassles; you can even make them all lowercase for visual purposes.

If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 does, you can fall back to simple encodings like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name to the file on disk.
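
The filtering described above might look roughly like this. `sanitizeFilename` is a hypothetical helper; replacing disallowed characters with an underscore and falling back to the name "file" are illustrative choices, not part of the advice itself:

```php
<?php
// Hypothetical helper; the replacement character and fallback name are
// illustrative choices.
function sanitizeFilename(string $name, string $fallback = 'file'): string
{
    $name = strtolower($name);
    // Collapse every run of characters outside a-z, 0-9, ".", "-", "_" into "_".
    $name = preg_replace('/[^a-z0-9._-]+/', '_', $name) ?? '';
    // Strip leading/trailing dots and underscores (this also removes "../").
    $name = trim($name, '._');
    return $name === '' ? $fallback : $name;
}

echo sanitizeFilename('My Photo (1).JPG'), "\n"; // my_photo_1_.jpg
echo sanitizeFilename('Depósito.txt'), "\n";     // dep_sito.txt
echo sanitizeFilename('../../etc/passwd'), "\n"; // etc_passwd
```

Different inputs can map to the same result, so uploads stored this way still need a uniqueness check before writing.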

Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that’s subjective, but if you need someone to configure something for you, this can make a difference.

HTML

This is mostly independent of PHP; it’s about the output your scripts produce, so it is part of your field of work.

The rule of thumb is: specify it. If you didn’t specify it (HTML files, CSS files, JavaScript files), don’t expect it to work precisely. Just do it. Encoding is a chain: if there are many components, ensure that each knows about its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care and make this precise and well defined.

PHP Settings

As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:

  • default_charset: PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty. For general information see "Setting the HTTP charset parameter" (W3C). If you want to improve your site’s output, e.g. to preserve the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well.
  • output_handler: This setting is worth a look, as it specifies the output handler (Output Buffering Control docs), and each handler (mb, iconv) can have its own encoding settings (see Strings).

Strings

  • Strings (docs): By default strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That’s for forward compatibility with the said PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
  • mb_internal_encoding() (docs): Get or set it; mbstring.internal_encoding INI. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
  • iconv_set_encoding() (docs): Comparable, for the iconv extension. See also the iconv configuration settings.
  • Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars (docs). Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you’re looking for UTF-8. Other functions like html_entity_decode (docs) use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
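
As an illustration of the last point, here is a sketch that passes the charset to htmlspecialchars explicitly instead of relying on a default (the default has changed across PHP versions: UTF-8 since PHP 5.4, the default_charset setting since 5.6). The sample strings are made up:

```php
<?php
// Sketch: always pass the charset explicitly instead of relying on defaults.
$flags = ENT_QUOTES | ENT_SUBSTITUTE;

echo htmlspecialchars('<b>Caffé & Brillì</b>', $flags, 'UTF-8'), "\n";
// &lt;b&gt;Caffé &amp; Brillì&lt;/b&gt;

// "\xE9" is "é" in ISO-8859-1 but an invalid byte sequence in UTF-8.
$latin1 = "Caff\xE9";

// Without ENT_SUBSTITUTE or ENT_IGNORE, invalid input yields an empty string.
var_dump(htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8') === ''); // bool(true)

// With ENT_SUBSTITUTE the invalid byte is replaced by U+FFFD instead.
var_dump(htmlspecialchars($latin1, $flags, 'UTF-8') === "Caff\u{FFFD}"); // bool(true)
```

An empty result from such a function is often the first visible sign that the encoding chain is broken somewhere upstream.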

To answer your question: which settings and parameters you need always depends on the components you use. For the general ones, like the browser or the web server, it’s possible to recommend settings that configure them for UTF-8. With everything else, it depends. The most important thing is to look for the encoding, to make sure you know it, and to configure or specify it. Often it’s documented. As long as you don’t need to write portable code, this is much simpler, because you either control the environment or only have to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.