Reading UTF-8 — BOM marker
I’m reading a file through a FileReader — the file is UTF-8 decoded (with BOM) now my problem is: I read the file and output a string, but sadly the BOM marker is outputted too. Why this occurs?
fr = new FileReader(file); br = new BufferedReader(fr); String tmp = null; while ((tmp = br.readLine()) != null)
UTF-8 is not supposed to have a BOM! It is neither necessary nor recommended by The Unicode Standard.
@tchrist I wish things were that simple. You create an application for the users, not for yourself. And the users use (partially) Microsoft software to create their files.
9 Answers 9
Locked. There are disputes about this answer’s content being resolved at this time. It is not currently accepting new interactions.
In Java, you have to consume manually the UTF8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it will break existing tools like JavaDoc or XML parsers. The Apache IO Commons provides a BOMInputStream to handle this situation.
Very late to the game, but this seems to be very slow for large files. I tried using a buffer. If you use a buffer, it seems to leave some sort of trailing data, as well.
The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.
The bad thing about «extremely unlikely» is that it turns up extremely rarely, so that locating the bug is extremely difficult. 🙂 So be extremely wary when using this code if you believe your software will be successful and long-lived, because sooner or later any existing situation will occur.
@StevePitchers but we must match it after decoding, when it is part of a String (which is always represented as UTF-16)
String defaultEncoding = "UTF-8"; InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom); try < BOMInputStream bOMInputStream = new BOMInputStream(inputStream); ByteOrderMark bom = bOMInputStream.getBOM(); String charsetName = bom == null ? defaultEncoding : bom.getCharsetName(); InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName); //use reader >finally
This code will only work with UTF-8 BOM detection and excluding. Check the implementation of bOMInputStream: «` /** * Constructs a new BOM InputStream that detects a * a <@link ByteOrderMark#UTF_8>and optionally includes it. * @param delegate the InputStream to delegate to * @param include true to include the UTF-8 BOM or * false to exclude it */ public BOMInputStream(InputStream delegate, boolean include) < this(delegate, include, ByteOrderMark.UTF_8); >«`@link>
Here’s how I use the Apache BOMInputStream, it uses a try-with-resources block. The «false» argument tells the object to ignore the following BOMs (we use «BOM-less» text files for safety reasons, haha):
try( BufferedReader br = new BufferedReader( new InputStreamReader( new BOMInputStream( new FileInputStream( file), false, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE ) ) ) ) < // use br here >catch( Exception e) >
If you want a String then you can skip the BufferedReader and InputStreamReader and use commons.io.IOUtils instead: String xml = IOUtils.toString(bomInputStream, StandardCharsets.UTF_8)
Consider UnicodeReader from Google which does all this work for you.
Charset utf8 = StandardCharsets.UTF_8; // default if no BOM present try (Reader r = new UnicodeReader(new FileInputStream(file), utf8.name()))
For example, let’s take a look on my code (used for reading a text file with both latin and cyrillic characters) below:
String defaultEncoding = "UTF-16"; InputStream inputStream = new FileInputStream(new File("/temp/1.txt")); BOMInputStream bomInputStream = new BOMInputStream(inputStream); ByteOrderMark bom = bomInputStream.getBOM(); String charsetName = bom == null ? defaultEncoding : bom.getCharsetName(); InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName); int data = reader.read(); while (data != -1) < char theChar = (char) data; data = reader.read(); ari.add(Character.toString(theChar)); >reader.close();
As a result we have an ArrayList named «ari» with all characters from file «1.txt» excepting BOM.
Руководство Java FileReader
Следуйте за нами на нашей фан-странице, чтобы получать уведомления каждый раз, когда появляются новые статьи.
Facebook
1- FileReader
FileReader — это подкласс InputStreamReader, который используется для чтения текстовых файлов.
FileReader имеет только методы, унаследованные от InputStreamReader. На самом деле вы можете использовать InputStreamReader для чтения символов из любого источника, однако FileReader специально разработан для чтения символов из файловой системы.
FileReader(File file) FileReader(FileDescriptor fd) FileReader(File file, Charset charset) FileReader(String fileName) FileReader(String fileName, Charset charset)
Примечание: Конструкторы с параметром Charset были добавлены в FileReader, начиная с версией Java 11. Поэтому, если вы используете более раннюю версию Java и хотите прочитать файл с указанной кодировкой, вместо этого используйте класс InputStreamReader.
2- Examples
package org.o7planning.filereader.ex; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.net.MalformedURLException; public class FileReaderEx1 < public static void main(String[] args) throws MalformedURLException, IOException < File file = new File("file-test.txt"); FileReader fis = new FileReader(file); int charCode; while((charCode = fis.read()) != -1) < System.out.println((char)charCode + " " + charCode); >fis.close(); > >
F 70 i 105 l 108 e 101 32 C 67 o 111 n 110 t 116 e 101 n 110 t 116
При чтении текстового файла лучше использовать комбинацию BufferedReader и FileReader для достижения лучшей производительности:
# Students: John P Sarah M # Sarah B Charles B Mary T Sophia B
package org.o7planning.filereader.ex; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.io.Reader; import java.net.MalformedURLException; public class FileReaderEx2 < public static void main(String[] args) throws MalformedURLException, IOException < File file = new File("students.txt"); Reader reader = new FileReader(file); BufferedReader br = new BufferedReader(reader); String line; while((line = br.readLine()) != null) < System.out.println(line); >br.close(); > >
# Students: John P Sarah M # Sarah B Charles B Mary T Sophia B
Например: Чтение текстового файла и печать строк текста, которые не начинаются с символа «#» (строка комментария):
package org.o7planning.filereader.ex; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.IOException; import java.io.Reader; import java.net.MalformedURLException; public class FileReaderEx3 < public static void main(String[] args) throws MalformedURLException, IOException < File file = new File("students.txt"); Reader reader = new FileReader(file); BufferedReader br = new BufferedReader(reader); br.lines() // java.util.stream.Stream .filter(line ->!line.startsWith("#")) // Not starts with "#". .forEach(System.out::println); br.close(); > >
John P Sarah M Charles B Mary T Sophia B
3- Проблемы UTF-8 BOM!
До того, как UTF-8 стал популярным, генераторы файлов UTF-8 всегда добавляли первые 3 bytes, чтобы отметить, что этот файл был закодирован UTF-8, они часто называются BOM (Byte Order Mark). В то время как файлы UTF-8, созданные Java, не включают BOM.
FileReader не удаляет автоматически BOM при чтении файлов UTF-8. Команда разработчиков Java понимает это, однако никаких обновлений не будет, так как это нарушит предыдущие сторонние библиотеки на базе Java, такие как XML Parser и т.д.
Например, ниже приведен файл UTF-8 (BOM), созданный старым инструментом, вы можете скачать его для проверки обсуждаемой проблемы:
package org.o7planning.filereader.ex; import java.io.File; import java.io.FileInputStream; import java.io.FileReader; import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; import java.net.MalformedURLException; import java.nio.charset.StandardCharsets; public class FileReader_Utf8_BOM < public static void main(String[] args) throws MalformedURLException, IOException < File file = new File("utf8-file-with-bom-test.txt"); System.out.println("--- Read by FileReader ---"); readByFileReader(file); System.out.println("--- Read by InputStreamReader ---"); readByInputStreamReader(file); >private static void readByFileReader(File file) throws IOException < FileReader fr = new FileReader(file, StandardCharsets.UTF_8); int charCode; while ((charCode = fr.read()) != -1) < System.out.println((char) charCode + " " + charCode); >fr.close(); > private static void readByInputStreamReader(File file) throws IOException < InputStream is = new FileInputStream(file); InputStreamReader isr = new InputStreamReader(is, StandardCharsets.UTF_8); int charCode; while ((charCode = isr.read()) != -1) < System.out.println((char) charCode + " " + charCode); >isr.close(); > >
--- Read by FileReader --- 65279 H 72 e 101 l 108 l 108 o 111 --- Read by InputStreamReader --- 65279 H 72 e 101 l 108 l 108 o 111
В результате появляется символ с кодом 65279, который является нежелательным символом.
Некоторые из следующих классов поддерживают исключение BOM, которое вы можете использовать:
BOMInputStream — это класс в библиотеке Apache Commons IO, который поддерживает удаление BOM.
package org.o7planning.filereader.ex; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.nio.charset.StandardCharsets; import org.apache.commons.io.input.BOMInputStream; public class BOMInputStreamEx1 < public static void main(String[] args) throws IOException < File file = new File("utf8-file-with-bom-test.txt"); FileInputStream fis = new FileInputStream(file); BOMInputStream bis = new BOMInputStream(fis); InputStreamReader isr = new InputStreamReader(bis, StandardCharsets.UTF_8); int charCode; while ((charCode = isr.read()) != -1) < System.out.println((char) charCode + " " + charCode); >isr.close(); > >
H 72 e 101 l 108 l 108 o 111
UnicodeReader — это класс в библиотеке «Google Data Java Client Library«, который поддерживает удаление BOM.
package org.o7planning.filereader.ex; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import com.google.gdata.util.io.base.UnicodeReader; public class UnicodeReaderEx1 < public static void main(String[] args) throws IOException < File file = new File("utf8-file-with-bom-test.txt"); FileInputStream fis = new FileInputStream(file); UnicodeReader isr = new UnicodeReader(fis, "UTF-8"); int charCode; while ((charCode = isr.read()) != -1) < System.out.println((char) charCode + " " + charCode); >isr.close(); > >
H 72 e 101 l 108 l 108 o 111
View more Tutorials:
Это онлайн курс вне вебсайта o7planning, который мы представляем, он включает бесплатные курсы или курсы со скидкой.
* * Swift for Beginners, Learn Apple’s New Programming Language
* * Getting Started with Amazon Web Services
iOS 11, Swift 4, CoreML, ARKit & Kotlin — The full guide!
Design Patterns in C# and .NET
Introduction To SQL Server Reporting Services -SSRS
Java Object-Oriented Programming : Build a Quiz Application
Spring Framework Development (Java JEE) with AngularJS UI
SQL Server DBA Essentials for beginners
Advanced Java programming with JavaFx: Write an email client
Design Patterns In Ruby OOP Object Oriented/UML for Projects
Learn animation using CSS3, Javascript and HTML5
iOS Development — Create 4 Quiz Apps with Swift 3 & iOS 10
Create Complete Web Applications easily with APEX 5
Ruby On Rails: Understanding Ruby and The Rails Controller
Oracle Database 12c SQL Certified Associate 1Z0-071
Hibernate in Practice — The Complete Course
iOS программирование на Swift в Xcode — Уровень 2
UI&UX Design , Animation And Material design In Javafx
Connecting and working with Oracle Cloud DBaaS
Concepts of Object Oriented Programming with C++
* * ES6 / EcmaScript 6 for beginners — the essentials
Oracle: The Complete SQL Guide (Certification: 1Z0-061/ 071)
New to Python Automation. Try Step by Step Python 4 Testers
Изучаем Dart
* * Crash Course Into JavaFX: The Best Way to make GUI Apps
How to read a file in Java with specific character encoding?
I am trying to read a file in as either UTF-8 or Windows-1252 depending on the output of this method:
public Charset getCorrectCharsetToApply() < // Returns a Charset for either UTF-8 or Windows-1252. >
String fileName = getFileNameToReadFromUserInput(); InputStream is = new ByteArrayInputStream(fileName.getBytes()); InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply()); BufferedReader buffReader = new BufferedReader(isr);
- The name of the file itself ( fileName ) cannot be trusted to be a particular Charset ; sometime the file name will contain UTF-8 characters, and sometimes Windows-1252. Same goes for the file’s content (however if file name and file content will always have matching charsets).
- Only the logic inside getCorrectCharsetToApply() can select the charset to apply, so attempting to read a file by its name prior to calling this method could very well result with, Java trying to read the file name with the wrong encoding. which causes it to die!
3 Answers 3
So, first, as a heads up, do realize that fileName.getBytes() as you have there gets the bytes of the filename, not the file itself.
Second, reading inside the docs of FileReader:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So, sounds like FileReader actually isn’t the way to go. If we take the advice in the docs, then you should just change your code to have:
String fileName = getFileNameToReadFromUserInput(); FileInputStream is = new FileInputStream(fileName); InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply()); BufferedReader buffReader = new BufferedReader(isr);
and not try to make a FileReader at all.