- Python check if text is english python regex
- Detect strings with non English characters in Python
- How to Match English and Non English Letters in Python
- Determine if text is in English?
- Pretrained Fast Text Model Worked Best For My Similar Needs
- Is it possible to check if string contains English word using python?
- Is there a way to check with python if a string from a list is a real word used in common English language? [duplicate]
- How to check if a word is English or not in Python
- 1. Using isalpha method
- 2. Using Regular Expression.
- 3. Using operator
- 4. Using lower and upper method
Python check if text is english python regex
Detect strings with non English characters in Python
You can just check whether the string can be encoded only with ASCII characters (which are Latin alphabet + some other characters). If it can not be encoded, then it has the characters from some other alphabet.
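Note the comment # -*- coding: utf-8 -*- at the top. In Python 2 it must be on the first or second line of the file, otherwise the non-ASCII string literals cause an encoding error. Python 3 assumes UTF-8 source by default, so the comment is optional there.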
    # -*- coding: utf-8 -*-
    def isEnglish(s):
        try:
            s.encode(encoding='utf-8').decode('ascii')
        except UnicodeDecodeError:
            return False
        else:
            return True

    assert not isEnglish('slabiky, ale liší se podle významu')
    assert isEnglish('English')
    assert not isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
    assert not isEnglish('how about this one : 通 asfަ')
    assert isEnglish('?fd4))45s&')
IMHO this is the simplest solution (str.isascii() requires Python 3.7 or later):

    def isEnglish(s):
        return s.isascii()

    print(isEnglish("Test"))
    print(isEnglish("_1991_اف_جي2"))

Output:

    True
    False
If you work with byte strings in Python 2 (str, not unicode objects), you can strip the punctuation with translate() and check the result with isalnum(), which is better than throwing exceptions. Note that the code below is Python 2 only, and that any string containing whitespace is also reported as non-English, because a space is not alphanumeric:

    import string

    def isEnglish(s):
        return s.translate(None, string.punctuation).isalnum()

    print isEnglish('slabiky, ale liší se podle významu')
    print isEnglish('English')
    print isEnglish('ގެ ފުރަތަމަ ދެ އަކުރު ކަ')
    print isEnglish('how about this one : 通 asfަ')
    print isEnglish('?fd4))45s&')
    print isEnglish('Текст на русском')

    > False
    > True
    > False
    > False
    > True
    > False
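You can also filter non-ASCII characters out of a string with this function:

    import string

    # All printable ASCII characters (letters, digits, punctuation, whitespace).
    ascii_chars = set(string.printable)

    def remove_non_ascii(s):
        # ''.join() keeps the result a string on Python 3, where filter() returns an iterator.
        return ''.join(filter(lambda x: x in ascii_chars, s))

    remove_non_ascii('slabiky, ale liší se podle významu')

    > slabiky, ale li se podle vznamu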
How to Match English and Non English Letters in Python
In this video, we show how to match English and non-English characters with Python regular expressions. Duration: 4:07
Determine if text is in English?
There is a library called langdetect. It is a port of Google's language-detection library.
It supports 55 languages out of the box.
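As a rough sketch of how langdetect could be used for the English check, assuming it has been installed with pip install langdetect: detect_langs and DetectorFactory come from the library, while english_probability is a hypothetical helper name made up for this example. The import is guarded so the snippet still loads without the library.

```python
try:
    from langdetect import DetectorFactory, detect_langs
    # langdetect is non-deterministic by default; a fixed seed makes results repeatable.
    DetectorFactory.seed = 0
except ImportError:
    detect_langs = None  # langdetect is not installed


def english_probability(text, detector=detect_langs):
    """Return the detector's probability that `text` is English (0.0 if absent).

    Hypothetical helper: `detector` is any callable returning objects with
    `.lang` and `.prob` attributes, as langdetect's detect_langs does.
    """
    if detector is None:
        raise RuntimeError("langdetect is not installed (pip install langdetect)")
    for candidate in detector(text):
        if candidate.lang == "en":
            return candidate.prob
    return 0.0


# Example call (needs langdetect installed):
# english_probability("this is some text written in English")
```

You would then compare the returned probability against a threshold of your own choosing. Very short strings tend to give unreliable results with any statistical detector.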
You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
- CLD-2 is pretty good and extremely fast
- lang-detect is a tiny bit better, but much slower
- langid is good, but CLD-2 and lang-detect are much better
- NLTK’s Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
    $ lidtk cld2 predict --text "this is some text written in English"
    eng
    $ lidtk cld2 predict --text "this is some more text written in English"
    eng
    $ lidtk cld2 predict --text "Ce n'est pas en anglais"
    fra
Pretrained Fast Text Model Worked Best For My Similar Needs
I arrived at your question with a very similar need. I appreciated Martin Thoma’s answer. However, I found the most help from Rabash’s answer part 7 HERE.
After experimenting to find what worked best for my needs, which was verifying that 60,000+ text files were in English, I found that fasttext was an excellent tool.
With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
    import fasttext

    class English_Check:
        def __init__(self):
            # Don't need to train a model to detect languages. A model exists
            # that is very good. Let's use it.
            pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
            self.model = fasttext.load_model(pretrained_model_path)

        def predict_languages(self, text_file):
            this_D = {}
            with open(text_file, 'r') as f:
                fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            # an array of lines from a file. The two list comprehensions
            # below, just clean up the lines in fla
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                # string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]
                # Each top language prediction for the lines in the file
                # becomes a unique key for the this_D dictionary.
                # Everytime that language is found, add the confidence
                # score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

            self.this_D = this_D

        def determine_if_file_is_english(self, text_file):
            self.predict_languages(text_file)

            # Find the max tallied confidence and the sum of all confidences.
            max_value = max(self.this_D.values())
            sum_of_values = sum(self.this_D.values())
            # calculate a relative confidence of the max confidence to all
            # confidence scores. Then find the key with the max confidence.
            confidence = max_value / sum_of_values
            max_key = [key for key in self.this_D.keys()
                       if self.this_D[key] == max_value][0]

            # Only want to know if this is english or not.
            return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
    file_list = []  # fill in with your own list of files to check for English

    en_checker = English_Check()
    for file in file_list:
        check = en_checker.determine_if_file_is_english(file)
        if not check:
            print(file)
Is it possible to check if string contains English word using python?
You’ll need a list of all English words you care to detect. There are a number of places to get these. I’d suggest looking at the dictionary files for a spellchecker, like aspell, since you don’t care about the definitions. Aspell has a command to dump wordlists.
aspell -d en dump master | aspell -l en expand > words.en.txt
Next, get an iterable of the words. You'll probably want to filter out trivially short words like a and I, and any words with special characters that can't appear in an address. Format the word list into a regex with alternations, i.e. '|'.join(wordlist).
Since Python's backtracking regex engine doesn't handle alternations efficiently, you'll want a faster engine. Try pip install rure, which uses Rust's regex engine, and use that to compile the regex instead. (See Rust's regex optimization guide.) If you care about which word it found, you can wrap the whole regex in () to make it a capturing group.
Then just run the compiled regex (maybe case-insensitive) against each address in turn. If it matches, you’ll get the word.
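The steps above can be sketched roughly as follows. This version uses the standard library's re module as a stand-in for rure so that it runs anywhere, which means it will be slow for a full dictionary. The tiny word_list, the helper names, the two-letter minimum, and the \b word boundaries are all assumptions made for illustration, not part of the original answer.

```python
import re


def build_word_regex(words, min_length=2):
    """Compile a case-insensitive alternation of the usable words."""
    # Drop trivially short words and words containing special characters.
    usable = {w.lower() for w in words if len(w) >= min_length and w.isalpha()}
    # Longest first, so longer words win over their own prefixes.
    ordered = sorted(usable, key=lambda w: (-len(w), w))
    # The capturing group tells us which word matched.
    pattern = r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


# Stand-in for the word list dumped from aspell (words.en.txt).
word_list = ["a", "I", "street", "market", "o'clock", "park"]
english_word = build_word_regex(word_list)


def find_english_word(address):
    """Return the first English word found in `address`, or None."""
    match = english_word.search(address)
    return match.group(1) if match else None


print(find_english_word("12 Market Str."))
print(find_english_word("ul. Nowa 5"))
```

With the real word list you would read words.en.txt line by line instead of hard-coding word_list.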
Is there a way to check with python if a string from a list is a real word used in common English language? [duplicate]
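    import nltk
    nltk.download('words')
    from nltk.corpus import words

    # Build a set once: membership tests on the raw word list scan it every time.
    english_words = set(words.words())

    samplewords = ['apple', 'a%32', 'j & quod', 'rectangle', 'house', 'fsdfdsoij', 'fdfd']
    [i for i in samplewords if i in english_words]

    ['apple', 'rectangle', 'house']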
How can I check if a string contains ANY letters from the alphabet? Regex should be a fast approach: re.search('[a-zA-Z]', the_string).
How to check if a word is English or not in Python
Here I introduce several ways to identify if the word consists of the English alphabet or not.
1. Using isalpha method
In Python, a string object has a method called isalpha.

    word = "Hello"
    if word.isalpha():
        print("It is an alphabet")

    word = "123"
    if word.isalpha():
        print("It is an alphabet")
    else:
        print("It is not an alphabet")
However, this approach has a minor problem: isalpha returns True for letters of any script, so a Korean word, for example, is still reported as alphabetic. (Of course, for non-Korean speakers, it wouldn't be a problem 😅)
To avoid this behavior, call the encode method before calling isalpha. The result is a bytes object, and bytes.isalpha only accepts ASCII letters.

    word = "한글"
    if word.encode().isalpha():
        print("It is an alphabet")
    else:
        print("It is not an alphabet")
2. Using Regular Expression.
I think this is a universal approach, regardless of programming language.
    import re

    # fullmatch with + requires every character to be an English letter;
    # match with a single [a-zA-Z] would only test the first character.
    reg = re.compile(r'[a-zA-Z]+')

    word = "hello"
    if reg.fullmatch(word):
        print("It is an alphabet")
    else:
        print("It is not an alphabet")

    word = "123"
    if reg.fullmatch(word):
        print("It is an alphabet")
    else:
        print("It is not an alphabet")
3. Using operator
The right check depends on the requirements; here we assume the goal is to determine whether every character belongs to the English alphabet.
Therefore, we can apply the comparison operator.
Note that we have to consider both upper and lower cases. Also, we should compare character by character rather than the entire word, because string comparison is lexicographic, so its result depends on the order of the characters and the length of the word.
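The code for this approach is not shown here, so the following is only a minimal sketch of what a comparison-operator check could look like. The function name is_english_alphabet is made up for illustration, and treating the empty string as non-alphabetic is an assumption.

```python
def is_english_alphabet(word):
    """True if every character is between a-z or A-Z (empty string is False)."""
    if not word:
        return False
    # Compare one character at a time, covering both cases.
    return all("a" <= ch <= "z" or "A" <= ch <= "Z" for ch in word)


for word in ["Hello", "123", "한글"]:
    if is_english_alphabet(word):
        print("It is an alphabet")
    else:
        print("It is not an alphabet")

# Comparing whole words is lexicographic and gives misleading answers:
print("a" <= "h1" <= "z")     # True, although "h1" contains a digit
print("a" <= "zebra" <= "z")  # False, although "zebra" is all letters
```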
We can also simplify this code using the lower or upper method in the string.
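A possible simplified version, again only a sketch with a made-up function name: lowercasing the word first means a single range comparison is enough.

```python
def is_english_alphabet(word):
    """True if every character, once lowercased, is between a and z."""
    # bool(word) makes the empty string return False (an assumption).
    return bool(word) and all("a" <= ch <= "z" for ch in word.lower())


for word in ["Hello", "HELLO", "hello1", "한글"]:
    print(word, is_english_alphabet(word))
```

One caveat: a few non-ASCII characters lowercase to an ASCII letter (the Kelvin sign U+212A becomes k), so this version is slightly more permissive than comparing both ranges explicitly.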
4. Using lower and upper method
This is my favorite approach. English letters have lower and upper cases, unlike numbers or Korean characters, so we can use this property to identify the word.
    word = "hello"
    if word.upper() != word.lower():
        print("It is an alphabet")
    else:
        print("It is not an alphabet")
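One thing to keep in mind is that this check is loose: it succeeds for any string with at least one cased letter, including other cased scripts such as Cyrillic. The sketch below shows that behavior next to a stricter variant. Both function names are hypothetical, and the stricter variant needs Python 3.7+ for isascii.

```python
def has_cased_letter(word):
    """The check from above: upper() and lower() differ."""
    return word.upper() != word.lower()


def is_ascii_letters(word):
    """Stricter variant (assumption): ASCII only and letters only."""
    return word.isascii() and word.isalpha()


for word in ["hello", "123", "한글", "Текст", "abc123"]:
    print(word, has_cased_letter(word), is_ascii_letters(word))
```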