How to generate n-grams in Python without using any external libraries
There are many text analysis applications that utilize n-grams as a basis for building prediction models. The term "n-grams" refers to individual words or groups of words that appear consecutively in text documents.
In this post, I document the Python code that I typically use to generate n-grams without depending on external Python libraries.
Steps to generate n-grams from a large string of text
I usually break up the task of generating n-grams from a large string of text into the following subtasks:
- Preprocess a large string of text and break it into a list of words.
- Generate n-grams from a list of words.
Code to preprocess a large string of text and break it into a list of words
I typically use the following function to preprocess the text before the generation of n-grams:
def process_text(text):
    text = text.lower()
    text = text.replace(',', ' ')
    text = text.replace('/', ' ')
    text = text.replace('(', ' ')
    text = text.replace(')', ' ')
    text = text.replace('.', ' ')
    # Convert the text string to a list of words
    return text.split()
The process_text function accepts the text that we want to preprocess as its input parameter.
It first converts all the characters in the text to lowercase. After that, it replaces commas, forward slashes, brackets and full stops with single whitespaces. Finally, it uses the split function to split the text into words by whitespace and returns the resulting list.
I will add more character replacements depending on where I anticipate the text input will come from. For example, if I am anticipating that the text comes from a web crawler, I will perform HTML decoding on the text input as well.
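The standard-library html module can handle that decoding step; the following is a minimal sketch of how I might chain it in front of the preprocessing (the helper name process_web_text is my own, and note that a bare "&" survives as its own token):

```python
import html

def process_web_text(text):
    # Decode HTML entities (e.g. "&amp;" -> "&") before normal preprocessing
    text = html.unescape(text)
    text = text.lower()
    for ch in ',/().':
        text = text.replace(ch, ' ')
    return text.split()

print(process_web_text('Fish &amp; Chips (take-away).'))
# -> ['fish', '&', 'chips', 'take-away']
```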
Code to generate n-grams from a list of words
I typically use the following function to generate n-grams out of a list of individual words:
def generate_ngrams(words_list, n):
    ngrams_list = []
    # Stop at the last position that can yield a full n-gram
    for num in range(0, len(words_list) - n + 1):
        ngram = ' '.join(words_list[num:num + n])
        ngrams_list.append(ngram)
    return ngrams_list
The generate_ngrams function accepts two input parameters:
- A list of individual words which can come from the output of the process_text function.
- A number which indicates the number of words in a text sequence.
Upon receiving the input parameters, the generate_ngrams function declares a list to keep track of the generated n-grams. It then loops through the starting positions in words_list, constructs one n-gram at each position and appends it to ngrams_list.
When the loop completes, the generate_ngrams function returns ngrams_list to the caller.
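As a quick sanity check, here is the function again (with the loop bounded so that only full n-grams are produced) applied to a three-word list:

```python
def generate_ngrams(words_list, n):
    ngrams_list = []
    # Only use start positions that leave room for a full n-gram
    for num in range(len(words_list) - n + 1):
        ngrams_list.append(' '.join(words_list[num:num + n]))
    return ngrams_list

print(generate_ngrams(['a', 'quick', 'brown'], 2))  # -> ['a quick', 'quick brown']
print(generate_ngrams(['a', 'quick', 'brown'], 3))  # -> ['a quick brown']
```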
Putting together process_text and generate_ngrams functions to generate n-grams
The following is an example of how I would use the process_text and generate_ngrams functions in tandem to generate n-grams:
if __name__ == '__main__':
    text = 'A quick brown fox jumps over the lazy dog.'
    words_list = process_text(text)
    unigrams = generate_ngrams(words_list, 1)
    bigrams = generate_ngrams(words_list, 2)
    trigrams = generate_ngrams(words_list, 3)
    fourgrams = generate_ngrams(words_list, 4)
    fivegrams = generate_ngrams(words_list, 5)
    print(unigrams + bigrams + trigrams + fourgrams + fivegrams)
The script first declares the text with the string 'A quick brown fox jumps over the lazy dog.'. It then converts the text to a list of individual words with the process_text function. Once process_text completes, it uses the generate_ngrams function to create 1-gram, 2-gram, 3-gram, 4-gram and 5-gram sequences. Lastly, it prints the generated n-gram sequences to standard output.
Putting the code together in a Python script and running it gives the following output:
['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog',
 'a quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog',
 'a quick brown', 'quick brown fox', 'brown fox jumps', 'fox jumps over', 'jumps over the', 'over the lazy', 'the lazy dog',
 'a quick brown fox', 'quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the', 'jumps over the lazy', 'over the lazy dog',
 'a quick brown fox jumps', 'quick brown fox jumps over', 'brown fox jumps over the', 'fox jumps over the lazy', 'jumps over the lazy dog']
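Since prediction models usually need n-gram frequencies rather than a raw list, a common next step (my addition, not part of the original script) is to tally the generated n-grams with collections.Counter:

```python
from collections import Counter

# A small list of bigrams, as generate_ngrams might produce
bigrams = ['a quick', 'quick brown', 'a quick', 'brown fox']
counts = Counter(bigrams)
print(counts.most_common(1))  # -> [('a quick', 2)]
```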
About Clivant
Clivant a.k.a. Chai Heng enjoys composing software and building systems to serve people. He owns techcoil.com and hopes that whatever he has written and built so far has benefited people. All views expressed belong to him and are not representative of the company that he works/worked for.
Implement N-Grams using Python NLTK – A Step-By-Step Guide
In this tutorial, we will discuss what we mean by n-grams and how to implement n-grams in the Python programming language.
Understanding N-grams
Text n-grams are commonly utilized in natural language processing and text mining. An n-gram is essentially a sequence of n words that appear together within the same window of text.
When computing n-grams, you normally advance one word at a time (although in more complex scenarios you can move n words at a time). N-grams are used for a variety of purposes.
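The difference between advancing one word at a time and jumping n words can be sketched with a step parameter; step=1 gives the usual overlapping n-grams, while step=n gives non-overlapping chunks (this helper is illustrative, not from the tutorial):

```python
def windows(tokens, n, step=1):
    # Slide a window of n tokens, advancing `step` tokens each time
    return [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, step)]

tokens = ['one', 'two', 'three', 'four']
print(windows(tokens, 2, step=1))  # overlapping: [['one', 'two'], ['two', 'three'], ['three', 'four']]
print(windows(tokens, 2, step=2))  # non-overlapping: [['one', 'two'], ['three', 'four']]
```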
For example, while creating language models, n-grams are utilized not only to create unigram models but also bigrams and trigrams.
Google and Microsoft have created web-scale n-gram models that may be used for a variety of tasks such as spelling correction, hyphenation, and text summarization.
Implementing n-grams in Python
In order to implement n-grams, we use the ngrams function from NLTK, which performs all of the n-gram work for us.
from nltk import ngrams

sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))
n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)
Sample Output
Enter the sentence: Let's test the n-grams implementation with this sample sentence! Yay!
Enter the value of n: 3
("Let's", 'test', 'the')
('test', 'the', 'n-grams')
('the', 'n-grams', 'implementation')
('n-grams', 'implementation', 'with')
('implementation', 'with', 'this')
('with', 'this', 'sample')
('this', 'sample', 'sentence!')
('sample', 'sentence!', 'Yay!')
See how amazing the results are! You can try out the same code for a number of sentences. Happy coding! 😇
# Generating N-grams from Sentences in Python
N-grams are contiguous sequences of n items in a sentence. N can be 1, 2 or any other positive integer, although usually we do not consider very large values of N because those n-grams rarely appear in many different places.
When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents. This post describes several different ways to generate n-grams quickly from input sentences in Python.
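As a rough illustration of the feature idea, unigram and bigram features for a tiny document might be combined like this (a sketch only, not a full classification pipeline):

```python
def bigrams(tokens):
    # Pair each token with its successor and join with a space
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

tokens = ['the', 'cat', 'sat']
features = tokens + bigrams(tokens)
print(features)  # -> ['the', 'cat', 'sat', 'the cat', 'cat sat']
```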
# The Pure Python Way
In general, an input sentence is just a string of characters in Python. We can use built-in functions in Python to generate n-grams quickly. Let's take the following sentence as a sample input:
s = """
Natural-language processing (NLP) is an area of computer science
and artificial intelligence concerned with the interactions
between computers and human (natural) languages.
"""
If we want to generate a list of bi-grams from the above sentence, the expected output would be something like below (depending on how we want to treat the punctuation, the desired output can differ):
[
  "natural language",
  "language processing",
  "processing nlp",
  "nlp is",
  "is an",
  "an area",
  ...
]
The following function can be used to achieve this:
import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()
    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # Break the sentence into tokens, splitting on any whitespace
    # (including newlines) and removing empty tokens
    tokens = [token for token in s.split() if token != ""]
    # Use the zip function to help us generate n-grams
    # Concatenate the tokens into n-grams and return
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
Applying the above function to the sentence, with n=5 , gives the following output:
>>> generate_ngrams(s, n=5)
['natural language processing nlp is',
 'language processing nlp is an',
 'processing nlp is an area',
 'nlp is an area of',
 'is an area of computer',
 'an area of computer science',
 'area of computer science and',
 'of computer science and artificial',
 'computer science and artificial intelligence',
 'science and artificial intelligence concerned',
 'and artificial intelligence concerned with',
 'artificial intelligence concerned with the',
 'intelligence concerned with the interactions',
 'concerned with the interactions between',
 'with the interactions between computers',
 'the interactions between computers and',
 'interactions between computers and human',
 'between computers and human natural',
 'computers and human natural languages']
The above function makes use of the zip function, which creates a generator that aggregates elements from multiple lists (or iterables in general). The block of code and comments below offers some more explanation of the usage:
# Sample sentence
s = "one two three four five"

tokens = s.split(" ")
# tokens = ["one", "two", "three", "four", "five"]

sequences = [tokens[i:] for i in range(3)]
# The above will generate sequences of tokens starting
# from different elements of the list of tokens.
# The parameter in the range() function controls
# how many sequences to generate.
#
# sequences = [
#   ['one', 'two', 'three', 'four', 'five'],
#   ['two', 'three', 'four', 'five'],
#   ['three', 'four', 'five']]

trigrams = zip(*sequences)
# The zip function takes the sequences as a list of inputs
# (using the * operator, this is equivalent to
# zip(sequences[0], sequences[1], sequences[2])).
# Each tuple it returns will contain one element from
# each of the sequences.
#
# To inspect the content of trigrams, try:
# print(list(trigrams))
# which will give the following:
#
# [
#   ('one', 'two', 'three'),
#   ('two', 'three', 'four'),
#   ('three', 'four', 'five')
# ]
#
# Note: even though the first sequence has 5 elements,
# zip will stop after returning 3 tuples, because the
# last sequence only has 3 elements. In other words,
# the zip function automatically handles the ending of
# the n-gram generation.
# Using NLTK
Instead of using pure Python functions, we can also get help from some natural language processing libraries such as the Natural Language Toolkit (NLTK). In particular, nltk has the ngrams function, which returns a generator of n-grams given a tokenized sentence (see the documentation of the ngrams function for details).
import re
from nltk.util import ngrams

s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = [token for token in s.split() if token != ""]
output = list(ngrams(tokens, 5))
The above block of code generates the same n-grams as the generate_ngrams() function shown above, except that ngrams yields each n-gram as a tuple of tokens rather than a single joined string.
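To get identical output from both approaches, each tuple just needs to be joined with spaces. The sketch below uses a zip-based stand-in for nltk.util.ngrams (it produces the same tuples) so that the snippet runs without NLTK installed:

```python
def ngram_tuples(tokens, n):
    # Same tuples that nltk.util.ngrams(tokens, n) would yield
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = ['natural', 'language', 'processing', 'nlp', 'is']
tuples = ngram_tuples(tokens, 3)
joined = [' '.join(t) for t in tuples]
print(tuples[0])  # -> ('natural', 'language', 'processing')
print(joined[0])  # -> 'natural language processing'
```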