PHP iconv: ISO to UTF-8

Batch convert latin-1 files to utf-8 using iconv

I have a PHP project on OS X whose files are in Latin-1 encoding. Now I need to convert the files to UTF-8. I’m not much of a shell coder, and I tried something I found on the internet:

mkdir new
for a in `ls -R *`; do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

But that does not create the directory structure, and it gives me a heck of a lot of errors when run. Can anyone come up with a neat solution?

12 Answers

You shouldn’t use ls like that and a for loop is not appropriate either. Also, the destination directory should be outside the source directory.

mkdir /path/to/destination
find . -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o /path/to/destination/"{}" \;

No need for a loop. The -type f option includes files and excludes directories.

The OS X version of iconv doesn’t have the -o option. Try this:

find . -type f -exec bash -c 'iconv -f iso-8859-1 -t utf-8 "{}" > /path/to/destination/"{}"' \;

I used this script with no luck. For exact parameters and results, see pastebin.com/U2D0PpWr . There was lots of output for each file (it printed them on screen) and error messages for each file, but I guess you get the idea from that one. I’d be grateful if you’d develop this a bit further 🙂

This doesn’t work if the file is several subdirectories down: the redirect or the -o path fails with "No such file or directory" because the parent directories are not created in the output location.


If you have created a subfolder, then add find . -maxdepth 1 … to let it find only files in the present directory level, e.g.: mkdir ./utf-8 && find . -maxdepth 1 -type f -iname '*.tex' -exec iconv -f iso-8859-1 -t utf-8 "{}" -o ./utf-8/"{}" ";"

This converts all files with the .php filename extension — in the current directory and its subdirectories — preserving the directory structure:

find . -name "*.php" -exec sh -c "iconv -f ISO-8859-1 -t UTF-8 '{}' > '{}'.utf8" \; -exec sh -c "mv '{}.utf8' '{}'" \;

To get a list of the files that will be targeted beforehand, just run the command without the -exec flags (like this: find . -name "*.php" ). Making a backup is a good idea.

Using sh like this allows piping and redirecting with -exec, which is necessary because not all versions of iconv support the -o flag.

Adding .utf8 to the filename of the output and then removing it might seem strange but it is necessary. Using the same name for output and input files can cause the following problems:

  • For large files (around 30 KB in my experience) it causes a core dump (or termination by signal 7)
  • Some versions of iconv seem to create the output-file before they read the input file, which means that if the input and output files have the same name, the input file is overwritten with an empty file before it is read.

This works nicely, thanks! However, when not all files are in Latin 1, how is it possible to only convert the files which need to? IOW, how to add the check with the file command?

A file is just a series of bytes; whether they make more sense interpreted as UTF-8 symbols or Latin 1 symbols only a human can know. However, if you are using the appearance of a certain symbol (for example à) to determine whether a file must be re-encoded or not, you can filter the files with grep and re-encode them with xargs, like: grep --files-with-matches --recursive 'Ã' . | xargs -I{} sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.utf8" && mv "$1.utf8" "$1"' sh {} (note: this code is untested, and make sure your shell is using UTF-8)
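
One way to approach the question above without relying on file is to let iconv itself validate the input: decoding a file as UTF-8 fails on invalid byte sequences, and Latin-1 text containing accented characters almost never happens to be valid UTF-8. A minimal sketch, assuming every file that is not valid UTF-8 is Latin-1 (is_utf8 and convert_if_needed are hypothetical helper names, not taken from any answer here):

```shell
# is_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# convert_if_needed FILE: hypothetical helper.
# Assumption: anything that is not valid UTF-8 is ISO-8859-1.
convert_if_needed() {
    if is_utf8 "$1"; then
        echo "skipped (already valid UTF-8): $1"
    elif iconv -f ISO-8859-1 -t UTF-8 < "$1" > "$1.utf8"; then
        # Replace the original only after iconv has succeeded.
        mv "$1.utf8" "$1"
        echo "converted: $1"
    else
        rm -f "$1.utf8"
        return 1
    fi
}
```

Pure ASCII files count as valid UTF-8 and are skipped, which is harmless because ASCII is encoded identically in both. In a single directory this could be used as: for f in *.php; do convert_if_needed "$f"; done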

Some good answers, but I found this a lot easier in my case with a nested directory of hundreds of files to convert:

WARNING: This will write the files in place, so make a backup

$ vim $(find . -type f)

# in vim, go into command mode (:)
:set nomore
:bufdo set fileencoding=utf8 | w

You don’t have to enter vim to do this. The following command does the same thing: vim "+set nomore" "+bufdo set fileencoding=utf8 | w" "+q" $(find . -type f)

@user1093967 I tried to find a solution but couldn’t — tried piping this into sed and adding quotation marks, but that doesn’t do the trick. Maybe the solution lies within the documentation for find . However, this works without any problem in fish shell, but I stopped using bash a long time ago because of constant weirdness like this

None of the short solutions above worked for me for converting a complete directory tree recursively from iso-8859-1 to utf-8, because the directory structure was not created in the target. Based on Dennis Williamson’s answer I came up with the following solution, which also creates the subdirectories:

find . -type f -exec bash -c 't="/tmp/dest"; mkdir -p "$t/`dirname {}`"; iconv -f iso-8859-1 -t utf-8 "{}" > "$t/{}"' \;

It will create a clone of the current directory subtree in /tmp/dest (adjust to your needs) including all subdirectories and with all iso-8859-1 files converted to utf-8 . Tested on macosx.
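
Substituting {} directly into the bash -c string breaks on filenames that contain quotes or other shell metacharacters. A variant that hands the names to the inner shell as positional arguments is sketched below; mirror_convert is a hypothetical helper name, and the sketch assumes the destination lies outside the source tree:

```shell
# mirror_convert SRC DEST: hypothetical helper.
# Recreates the directory tree of SRC under DEST and converts every
# regular file from ISO-8859-1 to UTF-8. Assumes DEST is outside SRC.
mirror_convert() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    dest=$(cd "$dest" && pwd)   # make the destination path absolute
    (
        cd "$src" || exit 1
        # find passes the filenames as arguments, so the inner script
        # never has to parse them as shell code.
        find . -type f -exec sh -c '
            dest=$1; shift
            for f do
                mkdir -p "$dest/$(dirname "$f")"
                iconv -f iso-8859-1 -t utf-8 < "$f" > "$dest/$f"
            done
        ' sh "$dest" {} +
    )
}
```

For example, mirror_convert . /tmp/dest would correspond to the command above.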

Btw: Check your file encodings with:

file -I file.php

to get the encoding information.

I would add -iname '*.php': find . -type f -iname '*.php' -exec bash -c 't="/tmp/dest"; mkdir -p "$t/`dirname {}`"; iconv -futf8 -tl1 "{}" > "$t/{}"' \; (note that this variant converts in the opposite direction, from UTF-8 to Latin-1)

I created the following script, which (i) backs up each tex file it converts into the directory "converted", (ii) checks the encoding of every tex file, and (iii) converts to UTF-8 only the tex files that are in the ISO-8859-1 encoding.

FILES=*.tex
for f in $FILES
do
  # file name without its extension
  filename="${f%.*}"
  echo -n "$f"
  #file -I $f
  if file -I "$f" | grep -wq "iso-8859-1"
  then
    # keep a backup of the original, then convert via a temporary file
    mkdir -p converted
    cp "$f" ./converted
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "${filename}_utf8.tex"
    mv "${filename}_utf8.tex" "$f"
    echo ": CONVERTED TO UTF-8."
  else
    echo ": UTF-8 ALREADY."
  fi
done

+1, that’s the correct solution, because I remember having trouble when a file was already utf-8, and yes, it was a "mixed" project with iso-8859-1 and utf8 files. So I came up with a very similar solution. I added my answer.

If all the files you have to convert are .php you could use the following, which is recursive by default:

for a in $(find . -name "*.php"); do iconv -f iso-8859-1 -t utf-8 <"$a" >new/"$a" ; done

I believe your errors were due to the fact that ls -R also produces an output that might not be recognized by iconv as a valid filename, something like ./my/dir/structure:

Or this, which reuses the original file names:

$ for a in $(find . -name "*.java"); do iconv -f iso-8859-1 -t utf-8 <"$a" >"$a".utf8 ; done
$ for a in $(find . -name "*.java.utf8"); do mv "$a" "$(dirname "$a")/$(basename "$a" .utf8)" ; done

On unix.stackexchange.com a similar question was asked, and user manatwork suggested recode which does the trick very nicely.

I’ve been using it to convert ucs-2 to utf-8 in place

Everything’s fine with the above answers, but if this is a "mixed" project, i.e. some files are already UTF-8, we may get into trouble. Therefore, here’s my solution, which checks the file encoding first.

#!/bin/bash
# file name: to_utf8

# current encoding:
encoding=$(file -i "$1" | sed "s/.*charset=\(.*\)$/\1/")

if [ "${encoding}" = "iso-8859-1" ] || [ "${encoding}" = "iso-8859-2" ]; then
  echo "recoding from ${encoding} to UTF-8 file : $1"
  # recode from the detected charset (the original always used ISO-8859-2)
  recode "${encoding}..UTF-8" "$1"
fi

#example:
#find . -name "*.php" -exec to_utf8 {} \;

On Windows Git Bash, I got these errors with several of the proposed solutions:

  • find: Only one instance of {} is supported with -exec ... +
  • find: In ‘-exec ... {} +’ the ‘{}’ must appear by itself, but you specified ‘source={}; ... ’

But this (a mix of other proposed solutions) worked:

for fileToConvert in $(find . -type f -name \*.js); do iconv -f iso-8859-1 -t utf-8 <"$fileToConvert" >~/temp-iconv.txt ; mv -f ~/temp-iconv.txt "$fileToConvert" ; done

Use mkdir -p to create the destination directory before iconv.

Note that you are using a potentially dangerous for construct when there are spaces in filenames, see http://porkmail.org/era/unix/award.html.
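
The for a in $(find ...) loops on this page split filenames on whitespace, so a file called "my file.js" is processed as two separate words. One way around this, sketched here under the assumption that in-place conversion is wanted, is to let find pass each name to a small inline script as its own argument (convert_tree_in_place is a hypothetical helper name):

```shell
# convert_tree_in_place DIR PATTERN: hypothetical helper.
# Converts every regular file under DIR whose name matches PATTERN from
# ISO-8859-1 to UTF-8, in place. PATTERN must not match the temporary
# "*.utf8" files, or find may pick them up while they are being written.
convert_tree_in_place() {
    find "$1" -type f -name "$2" -exec sh -c '
        for f do
            # "$f" is one complete filename, even if it contains spaces.
            if iconv -f iso-8859-1 -t utf-8 < "$f" > "$f.utf8"; then
                mv "$f.utf8" "$f"
            else
                rm -f "$f.utf8"
            fi
        done
    ' sh {} +
}
```

For example: convert_tree_in_place . '*.js'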

Using the answers of Dennis Williamson and Alberto Zaccagni, I came up with the following script that converts all files of the specified file type from all subdirectories. The output is then collected in one folder that is given by /path/to/destination

mkdir /path/to/destination
for a in $(find . -name "*.php"); do
  filename=$(basename "$a")
  echo "$filename"
  iconv -f iso-8859-1 -t utf-8 <"$a" >"/path/to/destination/$filename"
done

The basename command returns the filename without the directory part of the path.
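
Because only the basename is kept, two source files with the same name in different subdirectories end up at the same destination path, and the later one overwrites the earlier. A small check that could be run first is sketched below (duplicate_basenames is a hypothetical helper name; it assumes filenames contain no newlines):

```shell
# duplicate_basenames DIR PATTERN: hypothetical helper.
# Prints every file name that occurs more than once under DIR,
# i.e. the names that would collide when flattened into one folder.
duplicate_basenames() {
    find "$1" -type f -name "$2" -exec basename {} \; | sort | uniq -d
}
```

If duplicate_basenames . '*.php' prints nothing, collecting the converted files in a single folder loses no files.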

Alternative (user interactive): Now I also created a user interactive script that lets you decide whether you want to overwrite the old files or just rename them. Additional thanks go to tbsalling

for a in $(find . -name "*.tex"); do
  iconv -f iso-8859-1 -t utf-8 <"$a" >"$a".utf8
done
echo "Should the original files be replaced (Y/N)?"
read replace
if [ "$replace" == "Y" ]; then
  echo "Original files have been replaced."
  for a in $(find . -name "*.tex.utf8"); do
    file_no_suffix=$(basename -s .tex.utf8 "$a")
    directory=$(dirname "$a")
    mv "$a" "$directory"/"$file_no_suffix".tex
  done
else
  echo "Original files have been converted and converted files were saved with suffix '.utf8'"
fi

Have fun with this and I would be grateful for any comments to improve it, thanks!


PHP function iconv character encoding from iso-8859-1 to utf-8

I’m trying to convert a string from iso-8859-1 to utf-8. But when it contains these two characters, € and •, the function returns a character that is displayed as a square with two numbers inside. How can I solve this issue?

4 Answers

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1 , for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95") -> "\xe2\x82\xac and \xe2\x80\xa2" 

Thank you bobince! Now it works. I want to ask you another question now. How can I check whether the sites that are declared as text/html;charset=iso-8859-1 are really in cp1252 (as you explained in the answer)?

If you see a byte in the range 0x80–0x9F, you are almost certainly looking at cp1252 rather than 8859-1, since the ‘C1 control codes’ are very rarely used (almost never, on the web). If the source of the “ISO-8859-1” string is web-based, it almost certainly means it’s really cp1252, since that’s what browsers use.

I’ve tried to do this -> mb_detect_encoding($string, 'cp1252'); and then with the same string mb_detect_encoding($string, 'ISO-8859-1'); the first returns false, the second tells me that it is an ISO-8859-1 string. But it isn’t. How can I make a certain charset check?

You can’t make a certain charset check at all. Absolutely any sequence of bytes is a valid ISO-8859-1 string, and most single-byte encodings also map all or most bytes to valid characters. Only multi-byte encodings like UTF-8, where there are many invalid byte sequences, offer any realistic chance of ruling them out. So really you can only go on balance of probabilities, and the balance of probabilities when pitting cp1252 against ISO-8859-1 for text that’s come from the web is always cp1252.
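
The heuristic from this thread (bytes in the range 0x80–0x9F point to cp1252 rather than ISO-8859-1) can be sketched on the command line as follows. It is only meaningful for text already known to be in a single-byte encoding, since UTF-8 continuation bytes fall in the same range; count_c1_bytes and guess_single_byte_charset are hypothetical helper names:

```shell
# count_c1_bytes FILE: prints how many bytes of FILE lie in 0x80-0x9F
# (octal 200-237), the range where cp1252 and ISO-8859-1 differ.
count_c1_bytes() {
    LC_ALL=C tr -cd '\200-\237' < "$1" | wc -c | tr -d ' '
}

# guess_single_byte_charset FILE: hypothetical helper.
# Assumption: FILE is single-byte text, so any byte in 0x80-0x9F is
# taken as evidence for cp1252. This is a guess, not a certain check.
guess_single_byte_charset() {
    if [ "$(count_c1_bytes "$1")" -gt 0 ]; then
        echo CP1252
    else
        echo ISO-8859-1
    fi
}
```

The result can then be passed to iconv, for example: iconv -f "$(guess_single_byte_charset page.html)" -t UTF-8 < page.html, where page.html is a placeholder file name.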


Convert files encoding

I have a PHP application whose files are encoded in Greek ISO (iso-8859-7). I want to convert the files to utf-8, but simply saving the files as utf-8 isn’t enough, since the Greek texts get garbled. Is there an "automatic" method to do this so that I can completely convert my app’s encoding without having to go through each file and rewrite the texts?

4 Answers

On a Linux system, if you are sure all files are currently encoded in ISO-8859-7, you can do this:

bash> find /your/path -name "*.php" -type f \
        -exec iconv "{}" -f ISO88597 -t UTF8 -o "{}.tmp" \; \
        -exec mv "{}.tmp" "{}" \;

This converts all PHP script files located in /your/path as well as all sub-directories. Remove -name "*.php" to convert all files.

Since you are under Windows, the easiest option would be a PHP script like this:

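<?php
// Assumption: $path is a placeholder; point it at the root of your application.
$path = '/path/to/your/app';

// Assumption: walk every file below $path recursively.
foreach (new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path)) as $fileName => $file) {
    if ($file->isFile()) {
        // Rewrite the file with its contents converted to UTF-8.
        file_put_contents(
            $fileName,
            iconv('ISO-8859-7', 'UTF-8', file_get_contents($fileName))
        );
    }
}

$new_string = iconv("ISO-8859-7", "UTF-8", $old_string);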

This will only convert the contents, I would like to entirely convert the files, including the contents.

Ah, I read your last sentence as asking how to automatically convert the data without having to manually retype it. You are going to have to write your own function to traverse your app and update the encoding of your files. If iconv doesn’t work for you, try mb_convert_encoding (php.net/manual/en/function.mb-convert-encoding.php). Also, when you say the texts get garbled, is that when viewing the file in a text editor, or when you output the contents of the file within PHP?

Did you send a UTF8 content type header with the output? As well as set the content type to utf8 in the html?

Yes. The problem resides in the fact that the original app encoding was iso-8859-7, not only the data from the db but the files as well.

