Spelling checker in Python

Quickstart

pyspellchecker is designed to make basic spell checking easy to set up and use.

Installation

The simplest way to install is with pip:

pip install pyspellchecker 

If you work in virtual environments, pipenv combines pip and virtual environment management:

pipenv install pyspellchecker 

Basic Usage

Setting up the spell checker requires importing the class and creating an instance.

from spellchecker import SpellChecker
spell = SpellChecker()

There are several methods to determine if a word is in the word frequency list:

from spellchecker import SpellChecker
spell = SpellChecker()
spell['morning']  # the word's frequency count; non-zero (truthy) for a known word
'morning' in spell  # True
# find those words from a list of words that are found in the dictionary
spell.known(['morning', 'hapenning'])  # {'morning'}
# find those words from a list of words that are not found in the dictionary
spell.unknown(['morning', 'hapenning'])  # {'hapenning'}

Once a word is identified as misspelled, you can find the likeliest replacement:

from spellchecker import SpellChecker
spell = SpellChecker()
misspelled = spell.unknown(['morning', 'hapenning'])  # {'hapenning'}
for word in misspelled:
    spell.correction(word)  # 'happening'

If corrections take a long time to compute, for example on a set of long words, the Levenshtein distance can be set to 1. The default is 2.

from spellchecker import SpellChecker
spell = SpellChecker(distance=1)  # set the Levenshtein Distance parameter
# do additional work
# now for shorter words, we can revert to Levenshtein Distance of 2!
spell.distance = 2

Or, if the word identified as the likeliest is not correct, a set of candidates can also be retrieved:

from spellchecker import SpellChecker
spell = SpellChecker()
misspelled = spell.unknown(['morning', 'hapenning'])  # {'hapenning'}
for word in misspelled:
    spell.candidates(word)  # a set of candidate words

Changing Language

To choose which dictionary to load, set the language parameter on initialization.

from spellchecker import SpellChecker
spell = SpellChecker(language='es')  # Spanish dictionary
print(spell['mañana'])

Multiple Languages

To check against several of the default languages at once, pass a list of language identifiers to the constructor, for example SpellChecker(language=['en', 'es']), and each dictionary is loaded.

Adding and Removing Terms from a Dictionary

There are several ways to add terms to your word frequency dictionary: by file path, from a string of text, or from a list of words.

To load a pre-defined dictionary file (either a JSON file or a gzipped JSON file):

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')

To load a text document that will be parsed into individual words and each word added to the frequency list:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')

To load plain text from input or another source:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_text('Text to be parsed and added to the system')

Or update using a list of words:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_words(['Text', 'to', 'be', 'added', 'to', 'the', 'system'])
# or add a single word
spell.word_frequency.add('Text')

Removing words is as simple as adding words:

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.remove_words(['Text', 'to', 'be', 'removed', 'from', 'the', 'system'])
# or remove a single word
spell.word_frequency.remove('meh')

Iterating Over a Dictionary

Iterating over the dictionary is as easy as writing a simple for loop:

from spellchecker import SpellChecker
spell = SpellChecker()
for word in spell:
    print("{}: {}".format(word, spell[word]))

The iterator yields each word. To get the number of times a word occurs in the WordFrequency object, look it up by key, as spell[word] does in the loop.

How to Build a New Dictionary

Building a custom or new language dictionary is relatively straightforward. To begin, you will need either a word frequency list or text files that represent how the terms are used. Since pyspellchecker ranks corrections by word frequency, the most common words should have the highest counts.

Once you have the corpus, code similar to the following should build out the dictionary:

from spellchecker import SpellChecker
# turn off loading a built language dictionary, case sensitive on (if desired)
spell = SpellChecker(language=None, case_sensitive=True)
# if you have a dictionary
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')
# or, if you have text
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')
# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)

It is also possible to build a dictionary from sources outside of pyspellchecker. The data must be saved as a JSON object that maps each word to its frequency count.

Note that the data does not need to be sorted!
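As a rough illustration of producing such a file without pyspellchecker, the sketch below counts words in a small, made-up corpus using only the standard library and writes the result as gzipped JSON. The corpus, the helper name save_word_frequency, and the output file name are assumptions for the example, not part of the library.

```python
import gzip
import json
import os
import tempfile
from collections import Counter

# Hypothetical corpus; in practice this comes from your own text or data source.
corpus = "the cat sat on the mat and the dog sat too"
word_counts = Counter(corpus.lower().split())


def save_word_frequency(counts, path):
    """Write a {word: count} mapping as gzipped JSON (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(dict(counts), handle)


out_path = os.path.join(tempfile.gettempdir(), "my_custom_dictionary.json.gz")
save_word_frequency(word_counts, out_path)

# Read the file back to confirm it is a flat word-to-count JSON object.
with gzip.open(out_path, "rt", encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded)
```

A file written this way should be loadable with spell.word_frequency.load_dictionary, which accepts either a JSON file or a gzipped JSON file.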

A quick, command line spell checking program

Setting up a quick and easy command line program using pyspellchecker is straightforward:

from spellchecker import SpellChecker
# could add command line arguments to set the parameters of the spell
# check class; setup what type of information to present back, etc.
spell = SpellChecker()
print("To exit, hit return without input!")
while True:
    word = input('Input a word to spell check: ')
    if word == '':  # an empty line ends the program
        break
    word = word.lower()
    if word in spell:
        print("'{}' is spelled correctly!".format(word))
    else:
        cor = spell.correction(word)
        print("The best spelling for '{}' is '{}'".format(word, cor))
        print("If that is not enough; here are all possible candidate words:")
        print(spell.candidates(word))

Using with PyInstaller

It is possible to use pyspellchecker with tools such as PyInstaller to add spell-checking to your executable program. To do so, you will need to add the required dictionaries to the executable.

You will need to add the files to a folder in your executable called spellchecker/resources/ to match the location that pyspellchecker checks for the supported dictionaries.

pyinstaller --add-binary="spellchecker/resources/en.json.gz:spellchecker/resources" my_prog.py 

On Windows, use a semicolon instead of the colon:

pyinstaller --add-binary="spellchecker/resources/en.json.gz;spellchecker/resources" my_prog.py 

© Copyright 2018, Tyler Barrus. Revision 29c9210a.


pyspellchecker 0.7.2

Pure Python Spell Checking based on Peter Norvig’s blog post on setting up a simple spell checking algorithm.

It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Those words that are found more often in the frequency list are more likely the correct results.
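The approach described above can be sketched in a few lines of standard-library Python, following the outline of Peter Norvig's post. This is an illustrative sketch, not pyspellchecker's own code: the tiny WORD_COUNTS table and the function names are assumptions chosen for the example.

```python
from collections import Counter

# Toy frequency list (assumed data); a real list holds many thousands of words.
WORD_COUNTS = Counter({"happening": 120, "penning": 3, "morning": 400})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying a second edit to each distance-1 string."""
    return {second for first in edits1(word) for second in edits1(first)}


def known(words):
    """Keep only the strings that appear in the frequency list."""
    return {word for word in words if word in WORD_COUNTS}


def correction(word):
    """Pick the most frequent known word, preferring the smallest edit distance."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correction("hapenning"))
```

With this toy list, both "happening" and "penning" are two edits away from "hapenning", and the higher count decides between them.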

pyspellchecker supports multiple languages including English, Spanish, German, French, Portuguese, Arabic and Basque. For information on how the dictionaries were created and how they can be updated and improved, please see the Dictionary Creation and Updating section of the readme!

pyspellchecker supports Python 3

pyspellchecker allows for the setting of the Levenshtein Distance (up to two) to check. For longer words, it is highly recommended to use a distance of 1 and not the default 2. See the quickstart to find how one can change the distance parameter.
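To give a sense of why a distance of 2 is costly for long words, the sketch below counts how many strings a straightforward generator would produce at each distance, before duplicates are removed. It assumes a 26-letter alphabet and the four edit types listed above; the function names are made up for the example, and the library's real numbers depend on the language's alphabet.

```python
def raw_edits1_count(length, alphabet_size=26):
    """Strings generated one edit away from a word, before removing duplicates."""
    deletes = length
    transposes = max(length - 1, 0)
    replaces = alphabet_size * length
    inserts = alphabet_size * (length + 1)
    return deletes + transposes + replaces + inserts


def raw_edits2_count(length, alphabet_size=26):
    """Strings generated by applying one more edit to every distance-1 string."""
    shorter = length  # deletes give strings of length - 1
    same = max(length - 1, 0) + alphabet_size * length  # transposes and replaces
    longer = alphabet_size * (length + 1)  # inserts give strings of length + 1
    return (shorter * raw_edits1_count(length - 1, alphabet_size)
            + same * raw_edits1_count(length, alphabet_size)
            + longer * raw_edits1_count(length + 1, alphabet_size))


for n in (5, 12):
    print(n, raw_edits1_count(n), raw_edits2_count(n))
```

The distance-1 count grows linearly with word length, while the distance-2 count grows roughly with its square, which is why dropping to distance=1 helps most on long words.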

Installation

The easiest method to install is using pip:

pip install pyspellchecker

To build from source:

git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python -m build

For Python 2.7 support, install release 0.5.6, but note that no future updates will support Python 2.

Quickstart

After installation, using pyspellchecker is fairly straightforward: import SpellChecker, create an instance, and call unknown(), correction(), and candidates() as needed.

If the Word Frequency list is not to your liking, you can add additional text to generate a more appropriate list for your use case.

If the words that you wish to check are long, it is recommended to reduce the distance to 1. This can be accomplished either when initializing the spell check class or after the fact.

Non-English Dictionaries

pyspellchecker ships several default dictionaries as part of the package. Each is selected by passing its code as the language parameter when initializing the spell checker, for example SpellChecker(language='es').

The currently supported dictionaries are:

  • English — ‘en’
  • Spanish — ‘es’
  • French — ‘fr’
  • Portuguese — ‘pt’
  • German — ‘de’
  • Russian — ‘ru’
  • Arabic — ‘ar’
  • Basque — ‘eu’
  • Latvian — ‘lv’

Dictionary Creation and Updating

The creation of the dictionaries is, unfortunately, not an exact science. I have provided a script that, given a text file of sentences (in this case from OpenSubtitles), generates a word frequency list based on the words found within the text. The script then attempts to *clean up* the word frequency by, for example, removing words with invalid characters (usually from other languages), removing low count terms (misspellings?), and enforcing language rules where available (no more than one accent per word in Spanish). It then removes any words that appear on a list of known words to be excluded. Finally, it adds words to the dictionary that are known to be missing or were removed for being too low frequency.
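The clean-up steps described above might look roughly like the following. This is a simplified sketch under stated assumptions, not the contents of the actual build script: the function name, its parameters, the threshold, and the sample counts are all hypothetical.

```python
from collections import Counter


def clean_word_frequency(counts, alphabet, min_count, include=(), exclude=()):
    """Filter a raw word-frequency Counter with simple rules (hypothetical helper)."""
    allowed = set(alphabet)
    excluded = set(exclude)
    cleaned = Counter()
    for word, count in counts.items():
        if word in excluded:
            continue  # on the list of words to be removed
        if not set(word) <= allowed:
            continue  # contains characters outside the language's alphabet
        if count < min_count:
            continue  # low count terms are often misspellings
        cleaned[word] = count
    for word in include:
        # Re-add words known to be missing or dropped for low frequency.
        cleaned[word] = max(cleaned[word], min_count)
    return cleaned


# Made-up raw counts for illustration only.
raw = Counter({"morning": 50, "hapenning": 1, "mañana": 7, "meh": 12, "aardvark": 1})
cleaned = clean_word_frequency(
    raw,
    alphabet="abcdefghijklmnopqrstuvwxyz",
    min_count=5,
    include=["aardvark"],
    exclude=["meh"],
)
print(sorted(cleaned.items()))
```

Here the include list wins over the frequency threshold, which mirrors the final step of adding back words that were removed for being too rare.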

The script can be found here: `scripts/build_dictionary.py`. The original word frequency list parsed from OpenSubtitles can be found in the `scripts/data/` folder along with each language’s include and exclude text files.

Any help in updating and maintaining the dictionaries would be greatly appreciated. To contribute, start a discussion on GitHub or open a pull request that updates the include and exclude files.

Additional Methods

Online documentation is available; below is a condensed summary of some of the available functions:

correction(word) : Returns the most probable result for the misspelled word

candidates(word) : Returns a set of possible candidates for the misspelled word

known([words]) : Returns those words that are in the word frequency list

unknown([words]) : Returns those words that are not in the frequency list

word_probability(word) : The frequency of the given word out of all words in the frequency list
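As a small illustration of what "frequency out of all words" means, the sketch below divides a word's count by the total of all counts. It assumes that reading of the description and uses made-up counts; the standalone function here is a stand-in for the example, not the library method itself.

```python
from collections import Counter

# Made-up counts for illustration only.
word_counts = Counter({"morning": 400, "happening": 120, "penning": 3})


def word_probability(word, counts):
    """Share of all counted occurrences that belong to `word` (0.0 if unseen)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


print(word_probability("morning", word_counts))
```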

The following are less likely to be needed by the user but are available:

edit_distance_1(word) : Returns a set of all strings at a Levenshtein Distance of one based on the alphabet of the selected language

edit_distance_2(word) : Returns a set of all strings at a Levenshtein Distance of two based on the alphabet of the selected language

Credits

  • Peter Norvig blog post on setting up a simple spell checking algorithm
  • P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
