PHP: length of a UTF-8 string

Contents

How do I get the number of characters in PHP?
strlen
See also
User Contributed Notes
strlen() and UTF-8 encoding [duplicate]

How do I get the number of characters in PHP?

mb_strlen only gives number of bytes, and it is not what I wanted. It should work with multibyte characters.

mb_substr gives you a substring of a given multi-byte string — this has nothing to do with the length of the string. Use mb_strlen as others have suggested.

Byte size: strlen(), e.g. strlen('a₹') returns 4. Character count: mb_strlen(), e.g. mb_strlen('a₹', 'UTF-8') returns 2. Note: mb_strlen() is provided by the mbstring extension, which is not enabled by default in PHP.


It’s correct. Perhaps PHP doesn’t recognize your string as a multibyte string? Try something like this: mb_strlen($your_multibyte_string, $encoding), where $encoding should be something like "UTF-8" or "UTF-16".
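To illustrate why the encoding argument matters, here is a minimal sketch. It assumes the mbstring extension is loaded; the variable names are only illustrative. The same bytes give different counts depending on which encoding mb_strlen() is told to assume.

```php
<?php
// "Perú" written with an explicit escape so the bytes do not depend on the file's encoding.
$utf8 = "Per\u{00FA}";

$bytes    = strlen($utf8);                  // 5 bytes
$asUtf8   = mb_strlen($utf8, 'UTF-8');      // 4 characters
$asLatin1 = mb_strlen($utf8, 'ISO-8859-1'); // 5: every byte is treated as one character

// The same text re-encoded as UTF-16 (big endian): 2 bytes per character for this string.
$utf16      = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
$utf16Bytes = strlen($utf16);               // 8
$utf16Chars = mb_strlen($utf16, 'UTF-16BE'); // 4

printf("%d %d %d %d %d\n", $bytes, $asUtf8, $asLatin1, $utf16Bytes, $utf16Chars);
// prints "5 4 5 8 4"
```

If the declared encoding does not match the actual bytes, mb_strlen() still returns a number, just not a meaningful one.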

strlen(): Returns the number of bytes rather than the number of characters in a string.

$name = "Perú"; // With accent mark
echo strlen($name); // Displays 5, because "ú" requires 2 bytes.

$name = "Peru"; // Without accent mark
echo strlen($name); // Displays 4

mb_strlen(): Returns the number of characters in a string, according to its character encoding. A multi-byte character is counted as 1.

$name = "Perú"; // With accent mark
echo mb_strlen($name); // Displays 4, because "ú" is counted as 1.

$name = "Peru"; // Without accent mark
echo mb_strlen($name); // Displays 4

iconv_strlen(): Returns the character count of a string, as an integer.

$name = "Perú"; // With accent mark
echo iconv_strlen($name); // Displays 4.

$name = "Peru"; // Without accent mark
echo iconv_strlen($name); // Displays 4
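Since mb_strlen() and iconv_strlen() both come from optional extensions, a script may want to pick whichever is available. The helper below is a hypothetical sketch (utf8_char_count is not a built-in function) that assumes the input is UTF-8. It falls back to a PCRE match as a last resort.

```php
<?php
/**
 * Hypothetical helper: count UTF-8 code points using whichever extension is available.
 */
function utf8_char_count(string $s): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    if (function_exists('iconv_strlen')) {
        $n = iconv_strlen($s, 'UTF-8');
        if ($n !== false) {
            return $n;
        }
    }
    // Last resort: the /u modifier makes PCRE match whole UTF-8 characters.
    // preg_match_all() returns false on invalid UTF-8, which the cast turns into 0.
    return (int) preg_match_all('/./su', $s);
}

$name = "Per\u{00FA}";                // "Perú"
echo strlen($name), "\n";             // 5 (bytes)
echo utf8_char_count($name), "\n";    // 4 (characters)
```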


strlen

Note:

strlen() returns the number of bytes, not the number of characters, in a string.

See also

count() - Counts all elements in an array or in a Countable object
grapheme_strlen() - Gets the string length in grapheme units
iconv_strlen() - Returns the character count of a string
mb_strlen() - Gets the string length
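The functions listed above can disagree on the same string. The sketch below uses a decomposed "é" to show bytes, code points and graphemes side by side. It assumes mbstring is loaded; grapheme_strlen() comes from the intl extension, so that call is guarded.

```php
<?php
// "é" built from "e" + U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
$decomposed = "e\u{0301}";

$byteCount      = strlen($decomposed);             // 3 (1 byte + 2 bytes)
$codePointCount = mb_strlen($decomposed, 'UTF-8'); // 2

echo $byteCount, "\n";
echo $codePointCount, "\n";

if (function_exists('grapheme_strlen')) {          // provided by the intl extension
    echo grapheme_strlen($decomposed), "\n";       // 1 grapheme
}
```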

User Contributed Notes

I want to share something seriously important for PHP beginners who work with UTF-8 encoded strings or with languages such as Arabic, Persian, Pashto, Dari, Chinese (simplified), Chinese (traditional), Japanese, Vietnamese, Urdu, Macedonian, Lithuanian, etc.
As the manual says: "strlen() returns the number of bytes rather than the number of characters in a string." So if you want the number of characters in a UTF-8 string, use mb_strlen() instead of strlen().

<?php
// the Arabic (Hello) string below is: 59 bytes and 32 characters
$utf8 = "السلام علیکم ورحمة الله وبرکاته!";

var_export(strlen($utf8)); // 59
echo "<br>";
var_export(mb_strlen($utf8, 'utf8')); // 32
?>

Since PHP 8.1, passing null to strlen() is deprecated. To check for a blank string (not including '0'):

<?php
// PHP >= 8.1
if ($text === null || $text === '') {
    echo 'empty';
}
?>

When checking for length to make sure a value will fit in a database field, be mindful of using the right function.

There are three possible situations:

1. Most likely case: the database column is UTF-8 with a length defined in Unicode code points (e.g. MySQL VARCHAR(200) in a UTF-8 database).

<?php
// ok if php.ini default_charset set to UTF-8 (= default value)
mb_strlen($value);
iconv_strlen($value);
// always ok
mb_strlen($value, "UTF-8");
iconv_strlen($value, "UTF-8");

// BAD, do not use:
strlen(utf8_decode($value)); // breaks for some multi-byte characters
grapheme_strlen($value); // counts graphemes, not code points
?>

2. The database column has a length defined in bytes (e.g. Oracle’s VARCHAR2(200 BYTE)).

<?php
// ok, but assumes mbstring.func_overload is 0 in php.ini (= default value)
strlen($value);
// ok, forces count in bytes
mb_strlen($value, "8bit");
?>

3. The database column is in another character set (UTF-16, ISO-8859-1, etc.) with a length defined in characters / code points.

Find the character set used, and pass it explicitly to the length function.
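The three situations above could be folded into one check, roughly as follows. This is only a sketch: fits_column and its parameters are hypothetical names, mbstring is assumed to be loaded, and for the third case the value is assumed to already be encoded in the column's character set.

```php
<?php
/**
 * Hypothetical helper: does $value fit a column limited to $limit units?
 * $unit is 'chars' (code points in $charset) or 'bytes'.
 */
function fits_column(string $value, int $limit, string $unit = 'chars', string $charset = 'UTF-8'): bool
{
    $length = ($unit === 'bytes')
        ? strlen($value)                // case 2: limit defined in bytes
        : mb_strlen($value, $charset);  // cases 1 and 3: limit defined in code points

    return $length <= $limit;
}

$value = "Per\u{00FA}";                     // "Perú": 4 code points, 5 bytes in UTF-8
var_dump(fits_column($value, 4));           // bool(true)
var_dump(fits_column($value, 4, 'bytes'));  // bool(false)
```

Passing the charset explicitly avoids depending on php.ini's default_charset.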

PHP’s strlen function behaves differently from the C strlen function in how it handles null bytes ('\0').

In PHP, a null byte in a string does NOT count as the end of the string, and any null bytes are included in the length of the string.

In C, by contrast, strlen stops counting at the first null byte: for the bytes "ab\0cd", C’s strlen returns 2, while PHP’s strlen returns 5.

Thus, PHP’s strlen function can be used to find the number of bytes in a binary string (for example, binary data returned by base64_decode).
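A small sketch of both points, using made-up sample data: a string with an embedded null byte, and binary data obtained from base64_decode().

```php
<?php
// Double quotes are required so that \0 becomes a real null byte.
$withNull = "ab\0cd";
echo strlen($withNull), "\n";   // 5: the null byte is counted like any other byte

// 'AAEC/w==' is the base64 encoding of the four bytes 00 01 02 FF.
$binary = base64_decode('AAEC/w==', true);
echo strlen($binary), "\n";     // 4
```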

We just ran into what we thought was a bug but turned out to be a documented difference in behavior between PHP 5.2 & 5.3. Take the following code example:

<?php
$attributes = array('one', 'two', 'three');

if (strlen($attributes) == 0 && !is_bool($attributes)) {
    echo "We are in the 'if'\n"; // PHP 5.3
} else {
    echo "We are in the 'else'\n"; // PHP 5.2
}
?>

This is because in 5.2 strlen will automatically cast anything passed to it as a string, and casting an array to a string yields the string "Array". In 5.3, this changed, as noted in the following point in the backward incompatible changes in 5.3 (http://www.php.net/manual/en/migration53.incompatible.php):

"The newer internal parameter parsing API has been applied across all the extensions bundled with PHP 5.3.x. This parameter parsing API causes functions to return NULL when passed incompatible parameters. There are some exceptions to this rule, such as the get_class() function, which will continue to return FALSE on error."

So, in PHP 5.3, strlen($attributes) returns NULL, while in PHP 5.2, strlen($attributes) returns the integer 5. This likely affects other functions, so if you are getting different behaviors or new bugs suddenly, check if you have upgraded to 5.3 (which we did recently), and then check for some warnings in your logs like this:

strlen() expects parameter 1 to be string, array given in /var/www/sis/lib/functions/advanced_search_lib.php on line 1028

If so, then you are likely experiencing this changed behavior.
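One way to avoid depending on version-specific behavior is to check the type before calling strlen(). The helper below is a hypothetical sketch, not part of the original note. On PHP 8 and later, passing an array to strlen() throws a TypeError instead of returning NULL, so a guard like this remains useful.

```php
<?php
/**
 * Hypothetical helper: return the byte length for strings, null for anything else.
 */
function string_length_or_null($value): ?int
{
    return is_string($value) ? strlen($value) : null;
}

$attributes = array('one', 'two', 'three');

var_dump(string_length_or_null('Array'));      // int(5)
var_dump(string_length_or_null($attributes));  // NULL
```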

I would like to demonstrate that you need more than just this function in order to truly test for an empty string. The reason is that strlen(null) will return 0. So how do you know if the value was null, or truly an empty string?

<?php
$foo = null;
$len = strlen(null);
$bar = '';

echo "Length: " . strlen($foo) . "<br>";
echo "Length: $len<br>";
echo "Length: " . strlen(null) . "<br>";

if (strlen($foo) === 0) echo 'Null length is Zero<br>';
if ($len === 0) echo 'Null length is still Zero<br>';
?>

// Begin Output:

Null length is Zero
Null length is still Zero

!is_null(): $foo is probably null
isset(): $foo is probably null

!is_null(): $bar is truly an empty string
isset(): $bar is truly an empty string
// End Output

So it would seem you need either is_null() or isset() in addition to strlen() if you care whether or not the original value was null.
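A sketch of that distinction without calling strlen() on null at all: describe_emptiness is a hypothetical name, and strict comparisons are used so that '0' is not mistaken for an empty value.

```php
<?php
/**
 * Hypothetical helper: tell null, the empty string and everything else apart.
 */
function describe_emptiness(?string $value): string
{
    if ($value === null) {
        return 'null';
    }
    if ($value === '') {
        return 'empty string';
    }
    return 'non-empty string';
}

echo describe_emptiness(null), "\n"; // null
echo describe_emptiness(''), "\n";   // empty string
echo describe_emptiness('0'), "\n";  // non-empty string
```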

There’s a LOT of misinformation here, which I want to correct! Many people have warned against using strlen(), because it is «super slow». Well, that was probably true in old versions of PHP. But as of PHP7 that’s definitely no longer true. It’s now SUPER fast!

I created a 20,000,000 byte string (~20 megabytes), and iterated ONE HUNDRED MILLION TIMES in a loop. Every loop iteration did a new strlen() on that very, very long string.

The result: 100 million strlen() calls on a 20 megabyte string only took a total of 488 milliseconds. And the strlen() calls didn’t get slower/faster even if I made the string smaller or bigger. The strlen() was pretty much a constant-time, super-fast operation.

The reason is that PHP stores the length of every string as a field alongside its data, so strlen() can simply look it up without having to count bytes. So you should now never, EVER worry about strlen() performance again. As of PHP 7, it is super fast!

Here is the complete benchmark code if you want to reproduce it on your machine:

<?php
$iterations = 100000000; // 100 million
$str = str_repeat('0', 20000000);

// benchmark loop and variable assignment to calculate loop overhead
$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
    $len = 0;
}
$end = microtime(true);
$loop_elapsed = 1000 * ($end - $start);

// benchmark strlen in a loop
$len = 0;
$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
    $len = strlen($str);
}
$end = microtime(true);
$strlen_elapsed = 1000 * ($end - $start);

// subtract loop overhead from strlen() speed calculation
$strlen_elapsed -= $loop_elapsed;

echo "\nstring length: {$len}\ntest took: {$strlen_elapsed} milliseconds\n";
?>


strlen() and UTF-8 encoding [duplicate]

Assuming UTF-8 encoding and strlen() in PHP, is it possible that this string has a length of 4? I’m only interested in strlen(), not other functions. This is the string: $1ï¿½2

I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6. I don’t see anything in the manual for strlen or anything I’ve read on UTF-8 that would explain why some of the characters above would count for less than one. PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.

UTF-8 characters are multibyte characters, and count as as-many-characters-as-they-are-long-in-bytes when using strlen . Use php.net/manual/en/function.mb-strlen.php for expected results.


how about using mb_strlen() ?

But if you need to use strlen, it’s possible to configure your web server by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.

Ew, I wasn’t aware of mbstring.func_overload. Enabling that would break a bunch of my code, as I always assume strlen is the length in bytes.

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one half fraction, digit two).

If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).

However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".

The replacement character often gets inserted when a UTF-8 decoder reads data that’s not valid UTF-8 data.

It appears that the original string was processed through multiple layers of misinterpretation; by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
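The byte counts in this answer can be checked with a short sketch. It assumes mbstring is loaded, and the string is built from explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// U+FFFD REPLACEMENT CHARACTER is the three bytes EF BF BD in UTF-8.
$bytes = "\$1\xEF\xBF\xBD2";

echo strlen($bytes), "\n";                 // 6 bytes
echo mb_strlen($bytes, 'UTF-8'), "\n";     // 4 characters: $, 1, U+FFFD, 2

// Misreading the same 6 bytes as ISO-8859-1 yields the six characters "$1ï¿½2";
// converting that to UTF-8 makes each of ï, ¿ and ½ take two bytes.
$misread = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($misread, 'UTF-8'), "\n";   // 6 characters
echo strlen($misread), "\n";               // 9 bytes
```

So strlen() gives 6 or 9 depending on which representation is stored, and 4 only appears as a character count, never as a byte count.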
