How to check if a file is plain text?
In my program, the user can load a file with links (it’s a web crawler), but I need to verify whether the file the user chooses is plain text or something else (only plain text will be allowed). Is it possible to do this? If it’s useful, I’m using JFileChooser to open the file. EDIT: What is expected from the user: a text file containing URLs. What I want to avoid: the user loading an MP3 file or an MS Word document (for example).
6 Answers
A file is just a series of bytes, and without further information, you cannot tell whether these bytes are supposed to be code points in some string encoding (say, ASCII or UTF-8 or ANSI-something) or something else. You will have to resort to heuristics, such as:
- Try to parse the file in a number of known encodings and see if the parsing succeeds. If it does, chances are you have a text file.
- If you expect text files in Western languages only, you can assume that the majority of characters lies in the ASCII range (0..127), more specifically, (33..127) plus whitespace (tab, newline, carriage return, space). Count occurrences of each distinct byte value, and if the overwhelming part of your document is in the ‘typical western characters’ set, it’s usually safe to assume it’s a text file.
- Extending the previous approach: sample a sufficiently large quantity of text in the languages you expect, and build a character frequency profile. To check your file, compare the file’s character frequency profile against your reference data and see if it’s close enough.
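The byte-counting heuristic in the second bullet can be sketched in a few lines of Java. The 95% threshold and the exact “typical” byte set are arbitrary choices of mine, not fixed values:

```java
// Heuristic from the list above: count bytes that are printable ASCII
// (33..126) or common whitespace (tab, LF, CR, space), and accept the
// data as text when they dominate. The 0.95 threshold is an assumption;
// tune it to taste.
class PlainTextHeuristic {
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return true; // an empty file is trivially "text"
        }
        int typical = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if ((v >= 33 && v <= 126) || v == 9 || v == 10 || v == 13 || v == 32) {
                typical++;
            }
        }
        return (double) typical / data.length >= 0.95;
    }
}
```

After the JFileChooser selection you might feed it something like `Files.readAllBytes(chosenFile.toPath())`; for large files, reading only the first few kilobytes is usually enough for the heuristic.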
But here’s another solution: Just treat everything you receive as text, applying the necessary transformations where needed (e.g. HTML-encode when sending to a web browser). As long as you prevent the file from being interpreted as binary data (such as a user double-clicking the file), the worst you’ll produce is gibberish data.
How to identify the encoding charset of a file in Java?
I’ve already tried the juniversalchardet library and it works fine for UTF-8, UTF-16LE and UTF-16BE, but it doesn’t detect US-ASCII or ISO-8859-1. I’ve also tried jchardet, and it doesn’t achieve my goal either. InputStreamReader is also not working in my situation. So, how can I detect the character sets US-ASCII and ISO-8859-1, or all of the above character sets? Incidentally, I created these files using EditPad Lite 7.
AFAIR, US-ASCII is a subset of both UTF-8 and ISO-8859-1. Therefore, if a text contains only ASCII characters, all three encodings can be used. Hence there is no “correct” charset to be detected, as all three are correct. Maybe this is what you are facing. Try texts that use characters that only exist in ISO-8859-1 and see the results.
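The point can be demonstrated directly: decode the same pure-ASCII bytes under all three charsets and the results are identical, so no detector could name a single “correct” one. A minimal sketch (the class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

// Decodes one byte sequence under US-ASCII, ISO-8859-1 and UTF-8 and
// reports whether all three readings agree, as they must for pure ASCII.
class AsciiSubsetDemo {
    static boolean decodesIdentically(byte[] bytes) {
        String ascii  = new String(bytes, StandardCharsets.US_ASCII);
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        String utf8   = new String(bytes, StandardCharsets.UTF_8);
        return ascii.equals(latin1) && latin1.equals(utf8);
    }
}
```

A byte above 127 breaks the agreement immediately: US-ASCII decodes it to the replacement character, while ISO-8859-1 maps it to a Latin-1 letter.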
The encoding of a text file should come with the file’s bytes, in the same or a separate communication, or via convention, specification, etc. Why are you trying to guess it? You could guess it from one sample and then be wrong for the next update to it. Oh, and if you saved the files, then you are the one determining the encoding.
3 Answers
As already mentioned, there is no certain way to detect the encoding. But there are many heuristics that allow a smart guess about a file’s encoding.
If there is no way for you to know the encoding for sure, you may have a look at the Apache Tika project and its EncodingDetector.
Hi, thanks for pointing me towards Apache Tika. I didn’t use EncodingDetector, as it wasn’t sufficient for me, but ICU4J is very good.
By the sheer nature of character encodings, character encoding detectors cannot possibly be 100% reliable. They can only give a best guess.
ASCII is a subset of all other 8-bit encodings, consisting of code points in the range 0 to 127 (i.e. all values can be represented in just 7 bits). This means that if your file contains only ASCII characters, it can be read using ISO-8859-1, ISO-8859-2, etc., and UTF-8. I would expect a good charset detector to tell you if the contents are pure ASCII, so I don’t know why juniversalchardet didn’t when you tried it.
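As a sanity check before running any detector, it is trivial to test for pure ASCII yourself; this sketch just scans for any byte with the high bit set:

```java
// Returns true when every byte is in the 7-bit ASCII range (0..127),
// in which case US-ASCII, the ISO-8859 family and UTF-8 all decode the
// data identically.
class AsciiCheck {
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0xFF) > 127) {
                return false; // high bit set: not 7-bit ASCII
            }
        }
        return true;
    }
}
```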
It’s tricky to tell the various single-byte encodings apart. For example, the character £ is a valid character in ISO-8859-1 but is equally valid (but displayed differently) in ISO-8859-2 and other encodings. So it’s not easy to tell which character was actually intended.
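This ambiguity is easy to verify: the same byte 0xA3 decodes to £ under ISO-8859-1 but to Ł under ISO-8859-2, and nothing in the byte stream itself says which was meant. A small sketch (both charsets ship with the standard JDK):

```java
import java.nio.charset.Charset;

// Decodes the single byte 0xA3 under a named charset. The result differs
// between ISO-8859-1 (£, U+00A3) and ISO-8859-2 (Ł, U+0141) even though
// the byte is valid in both encodings.
class SingleByteAmbiguity {
    static String decodeA3(String charsetName) {
        return new String(new byte[]{(byte) 0xA3}, Charset.forName(charsetName));
    }
}
```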
java detect if file is UTF-8 or Ansi
In Java, is there a way to detect whether a file is ANSI or UTF-8? The problem I am having is that if someone creates a CSV file in Excel, it’s UTF-8. If they create it using Notepad, it’s ANSI. I am wondering if I can detect the type of file and then handle it accordingly. Thanks.
You may be able to check for the UTF-8 BOM, if Excel includes it (I don’t have a copy here to check). You could open the file as binary, read the first three bytes and check for 0xEF, 0xBB, 0xBF, or optimistically open it as «Cp1252» («ANSI») and, if you see ï»¿ at the start, reopen it as UTF-8.
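The three-byte binary check described above can be sketched in plain Java; reading the file’s first bytes is left to the caller, and 0xEF 0xBB 0xBF is the UTF-8 encoding of the byte order mark:

```java
// Sketch of the binary BOM check described above: compare a file's first
// bytes against the UTF-8 byte order mark 0xEF, 0xBB, 0xBF.
class BomCheck {
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }
}
```

You might pass it, say, the result of `Files.readAllBytes(path)`, or just the first three bytes read from a FileInputStream; only the leading bytes matter.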
@user1158745 Those links seem to be quite useful and provide code examples. If you want, you are allowed to post an answer to your own question.
1 Answer
You could try something like this. It relies on Excel including a Byte Order Mark (BOM), which a quick search suggests it does, although I can’t verify it, and on the fact that Java treats the BOM as a particular «character», \uFEFF.
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line = br.readLine();
if (line.startsWith("\uFEFF")) {
    // it's UTF-8, throw away the BOM character and continue
    line = line.substring(1);
} else {
    // it's not UTF-8, reopen
    br.close(); // also closes fis
    fis = new FileInputStream(file); // reopen from the start
    br = new BufferedReader(new InputStreamReader(fis, "Cp1252"));
    line = br.readLine();
}
// now line contains the first line, and br.readLine() will get the next