Count unique words in Python

Count Number of Word Occurrences in List Python

Counting how often each word occurs in a list is a relatively common task in Python, especially when creating distribution data for histograms.

Say we have a list ['b', 'b', 'a']: we have two occurrences of "b" and one of "a". This guide will show you four different ways to count the number of word occurrences in a Python list:

  • Using Pandas and NumPy
  • Using the count() Function
  • Using the collections Module's Counter
  • Using a Loop and a Counter Variable

In practice, you'll use Pandas/NumPy, the count() function or a Counter, as they're convenient to use.

Using Pandas and NumPy

The shortest and easiest way to get value counts in an easily manipulable format (a pandas Series) is via NumPy and Pandas. We can wrap the list in a NumPy array and then call the value_counts() function of the pd module (the same method is also available on Series and DataFrame instances). Note that pd.value_counts() is deprecated as of pandas 2.1; pd.Series(words).value_counts() is the equivalent call in newer versions:

    import numpy as np
    import pandas as pd

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    pd.value_counts(np.array(words))

This results in a Series that contains:

    hello      3
    goodbye    1
    bye        1
    howdy      1
    hi         1
    dtype: int64

You can access its values attribute to get the counts themselves, or its index attribute to get the words themselves:

    counts = pd.value_counts(np.array(words))

    print('Index:', counts.index)
    print('Values:', counts.values)

Output:

    Index: Index(['hello', 'goodbye', 'bye', 'howdy', 'hi'], dtype='object')
    Values: [3 1 1 1 1]

Using the count() Function

The "standard" way (no external libraries) to get the count of word occurrences in a list is by using the list object's count() method.

The count() method is built into every list. It takes an element as its only argument and returns the number of times that element appears in the list.

The complexity of count() is O(n), where n is the number of elements in the list.

The code below uses count() to get the number of occurrences for a word in a list:

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    print(f'"hello" appears {words.count("hello")} time(s)')
    print(f'"howdy" appears {words.count("howdy")} time(s)')

This should give us the following output:

    "hello" appears 3 time(s)
    "howdy" appears 1 time(s)

The count() method offers us an easy way to get the number of word occurrences in a list for each individual word.
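To get a count for every distinct word with count() alone, one option is to call it once per unique word. The following is a minimal sketch rather than a recommended pattern: the variable name counts is an illustrative choice, and set(words) is used only to avoid counting the same word twice.

```python
words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

# Call count() once per distinct word; set(words) removes the duplicates
# so that each word is only looked up a single time.
counts = {word: words.count(word) for word in set(words)}

# Print the words in alphabetical order for a stable display.
for word in sorted(counts):
    print(f'"{word}" appears {counts[word]} time(s)')
```

Each count() call still scans the whole list, so this approach is best suited to short lists or lists with few distinct words.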

Using the collections Module's Counter

The Counter class can be used to, well, count instances of other objects. By passing a list into its constructor, we instantiate a Counter, a dictionary subclass that maps each element of the list to its number of occurrences.

From there, to get a single word’s occurrence you can just use the word as a key for the dictionary:

    from collections import Counter

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    word_counts = Counter(words)

    print(f'"hello" appears {word_counts["hello"]} time(s)')
    print(f'"howdy" appears {word_counts["howdy"]} time(s)')

Output:

    "hello" appears 3 time(s)
    "howdy" appears 1 time(s)

Using a Loop and a Counter Variable

Ultimately, a brute force approach that loops through every word in the list, incrementing a counter by one when the word is found, and returning the total word count will work!

Of course, this method gets slower as the list size grows, but it is conceptually easy to understand and implement.

The code below uses this approach in the count_occurrence() function:

    def count_occurrence(words, word_to_count):
        count = 0
        for word in words:
            if word == word_to_count:
                # update counter variable
                count = count + 1
        return count

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    print(f'"hello" appears {count_occurrence(words, "hello")} time(s)')
    print(f'"howdy" appears {count_occurrence(words, "howdy")} time(s)')

If you run this code you should see this output:

    "hello" appears 3 time(s)
    "howdy" appears 1 time(s)

Most Efficient Solution?

Naturally, you'll be searching for the most efficient solution if you're dealing with a large corpus of words. Let's benchmark all of these to see how they perform.

The task can be broken down into finding occurrences for all words or a single word, and we’ll be doing benchmarks for both, starting with all words:

    import numpy as np
    import pandas as pd
    import collections

    def pdNumpy(words):
        def _pdNumpy():
            return pd.value_counts(np.array(words))
        return _pdNumpy

    def countFunction(words):
        def _countFunction():
            counts = []
            for word in words:
                counts.append(words.count(word))
            return counts
        return _countFunction

    def counterObject(words):
        def _counterObject():
            return collections.Counter(words)
        return _counterObject

    import timeit

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    print("Time to execute:\n")
    print("Pandas/NumPy: %ss" % timeit.Timer(pdNumpy(words)).timeit(1000))
    print("count(): %ss" % timeit.Timer(countFunction(words)).timeit(1000))
    print("Counter: %ss" % timeit.Timer(counterObject(words)).timeit(1000))

Output:

    Time to execute:

    Pandas/NumPy: 0.33886080000047514s
    count(): 0.0009540999999444466s
    Counter: 0.0019409999995332328s

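On this seven-word list, the count() method is extremely fast compared to the other variants. However, it doesn't give us the labels associated with the counts like the other two do, and because every count() call scans the whole list, calling it once per word does work that grows quadratically with the length of the list.

If you need the labels, the Counter outperforms the inefficient process of wrapping the list in a NumPy array and then counting.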

On the other hand, you can make use of DataFrame’s methods for sorting or other manipulation that you can’t do otherwise. Counter has some unique methods as well.
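As an illustration of the Counter-specific behaviour mentioned above, the sketch below uses most_common() and shows what happens when a missing word is looked up. The word 'farewell' and the variable names are illustrative assumptions, not part of the original example.

```python
from collections import Counter

words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']
word_counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs.
top_word, top_count = word_counts.most_common(1)[0]
print(top_word, top_count)

# Looking up a word that never appears yields 0 instead of a KeyError.
missing = word_counts['farewell']
print(missing)

# The number of keys equals the number of distinct words in the list.
distinct = len(word_counts)
print(distinct)
```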

Ultimately, you can use the Counter to create a dictionary and turn the dictionary into a DataFrame as well, to leverage the speed of Counter and the versatility of DataFrames:

    df = pd.DataFrame.from_dict([Counter(words)]).T

If you don’t need the labels — count() is the way to go.

Alternatively, if you’re looking for a single word:

    import numpy as np
    import pandas as pd
    import collections

    def countFunction(words, word_to_search):
        def _countFunction():
            return words.count(word_to_search)
        return _countFunction

    def counterObject(words, word_to_search):
        def _counterObject():
            return collections.Counter(words)[word_to_search]
        return _counterObject

    def bruteForce(words, word_to_search):
        def _bruteForce():
            counts = []
            count = 0
            for word in words:
                if word == word_to_search:
                    # update counter variable
                    count = count + 1
            counts.append(count)
            return counts
        return _bruteForce

    import timeit

    words = ['hello', 'goodbye', 'howdy', 'hello', 'hello', 'hi', 'bye']

    print("Time to execute:\n")
    print("count(): %ss" % timeit.Timer(countFunction(words, 'hello')).timeit(1000))
    print("Counter: %ss" % timeit.Timer(counterObject(words, 'hello')).timeit(1000))
    print("Brute Force: %ss" % timeit.Timer(bruteForce(words, 'hello')).timeit(1000))

Output:

    Time to execute:

    count(): 0.0001573999998072395s
    Counter: 0.0019498999999996158s
    Brute Force: 0.0005682000000888365s

The brute force search and count() methods outperform the Counter, mainly because the Counter inherently counts all words instead of one.
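The benchmarks in this section run on a seven-word list, so it may be worth repeating the all-words comparison on larger input before drawing conclusions. The sketch below is one way to do that. The synthetic vocabulary, the list size of 2,000 and the function names are assumptions chosen purely for illustration, and no timings are quoted because they depend on the machine.

```python
import random
import timeit
from collections import Counter

# Hypothetical test data: 2,000 words drawn from a 200-word vocabulary.
random.seed(0)
vocabulary = [f'word{i}' for i in range(200)]
big_list = [random.choice(vocabulary) for _ in range(2000)]

def count_per_word(words):
    # One full scan of the list for every element: O(n^2) overall.
    return [words.count(word) for word in words]

def counter_all(words):
    # A single pass over the list: O(n) overall.
    return Counter(words)

# Time five runs of each approach on the larger list.
t_count = timeit.timeit(lambda: count_per_word(big_list), number=5)
t_counter = timeit.timeit(lambda: counter_all(big_list), number=5)

print(f'count() per word: {t_count:.4f}s')
print(f'Counter:          {t_counter:.4f}s')
```

Since count() per word is quadratic and Counter is linear, one would expect the gap between them to shift in Counter's favour as the list grows, but measuring on your own data is the only way to be sure.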

Conclusion

In this guide, we explored several ways of counting the occurrences of a word in a Python list, assessing the efficiency of each solution and weighing when each is more suitable.

Count unique words in Python

The list.append() method adds an item to the end of the list.

    my_list = ['bobby', 'hadz']

    my_list.append('com')

    print(my_list)  # 👉️ ['bobby', 'hadz', 'com']

The last step is to use the len() function to get the number of unique words in the string.

Count the unique words in a text file using a for loop

This is a five-step process:

  1. Declare a new variable that stores an empty list.
  2. Read the contents of the file into a string and split it into words.
  3. Use a for loop to iterate over the list.
  4. Use the list.append() method to append all unique words to the list.
  5. Use the len() function to get the length of the list.

    unique_words = []

    with open('example.txt', 'r', encoding='utf-8') as f:
        words = f.read().split()
        print(words)  # 👉️ ['one', 'one', 'two', 'two', 'three', 'three']

        for word in words:
            if word not in unique_words:
                unique_words.append(word)

    print(len(unique_words))  # 👉️ 3
    print(unique_words)  # 👉️ ['one', 'two', 'three']

We read the contents of the file into a string and used the str.split() method to split the string into a list of words.

On each iteration, we use the not in operator to check if the word is not present in the list of unique words.

If the condition is met, we use the list.append() method to append the value to the list.

The in operator tests for membership. For example, x in l evaluates to True if x is a member of l , otherwise it evaluates to False .
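Because the in check on a list scans the list from the start, the loop above slows down as the number of unique words grows. A common alternative is to let a set or a dictionary do the de-duplication. The sketch below assumes the text is already in memory as a string instead of being read from example.txt, and its variable names are illustrative.

```python
text = 'one one two two three three'
words = text.split()

# A set keeps one copy of each word, but does not preserve order.
unique_count = len(set(words))
print(unique_count)

# dict.fromkeys() also removes duplicates and keeps first-seen order.
unique_in_order = list(dict.fromkeys(words))
print(unique_in_order)
```

The set version is enough when only the number of unique words is needed; the dict.fromkeys() version is useful when the order of first appearance matters.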

Python – Find unique words in Text File

Finding unique words in a text file requires cleaning the text, splitting it into words, and then keeping only one copy of each word.

In this tutorial, we will learn how to find unique words in a text file.

Steps to find unique words

To find unique words in a text file, follow these steps.

  1. Read text file in read mode.
  2. Convert text to lower case or upper case. We do not want ‘apple’ to be different from ‘Apple’.
  3. Split file contents into list of words.
  4. Clean the words that have punctuation marks attached, for example by stripping full stops, commas, etc. from the words.
  5. Also, remove the possessive 's.
  6. You may also add more text cleaning steps here.
  7. Now find the unique words in the list using a Python For Loop and Python Membership Operator.
  8. After finding unique words, sort them for presentation.

In the text cleaning, you can also remove helping verbs, etc.
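As a rough illustration of such an extra cleaning step, the sketch below filters a word list against a small set of helping verbs. The HELPING_VERBS set is a hypothetical, deliberately incomplete list made up for this example; a real project would likely use a fuller stop-word list.

```python
# Hypothetical, deliberately incomplete set of helping verbs to ignore.
HELPING_VERBS = {'is', 'are', 'was', 'were', 'am', 'be', 'been',
                 'has', 'have', 'had'}

words = ['apple', 'is', 'a', 'very', 'big', 'company']

# Keep only the words that are not in the set of helping verbs.
filtered = [word for word in words if word not in HELPING_VERBS]
print(filtered)
```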

Example 1: Find unique words in text file

Now, we will put all the steps mentioned above into practice in a Python program.

Consider that we are taking the following text file.

Apple is a very big company. An apple a day keeps doctor away. A big fat cat came across the road beside doctor's office. The doctor owns apple device.

Python Program

    # open the file; the with statement closes it again automatically
    with open('data.txt', 'r') as text_file:
        text = text_file.read()

    #cleaning
    text = text.lower()
    words = text.split()
    words = [word.strip('. ;()[]') for word in words]
    words = [word.replace("'s", '') for word in words]

    #finding unique
    unique = []
    for word in words:
        if word not in unique:
            unique.append(word)

    #sort
    unique.sort()

    #print
    print(unique)

Output:

    ['a', 'across', 'an', 'apple', 'away', 'beside', 'big', 'came', 'cat', 'company', 'day', 'device', 'doctor', 'fat', 'is', 'keeps', 'office', 'owns', 'road', 'the', 'very']

Translation of Steps into Python Code

Following is the list of Python concepts we used in the above program to find the unique words.

  • open() function to get a reference to the file object.
  • file.read() method to read contents of the file.
  • str.lower() method to convert text to lower case.
  • str.split() method to split the text into words separated by white space characters like single space, new line, tab, etc.
  • str.strip() method to strip the punctuation marks from the edges of words.
  • str.replace() method to replace 's with nothing, which removes the possessive ending from words.
  • for loop to iterate for each word in the words list.
  • in – membership operator to check if the word is present in unique.
  • list.append() method to append the word to unique list.
  • list.sort() method to sort unique words in lexicographic ascending order.
  • print() function to print the unique words list.
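The same steps can also be condensed by using a set for the uniqueness check and string.punctuation for the cleaning. This is a sketch under stated assumptions, not the tutorial's own program: the function name unique_words is made up for illustration, the sample text is passed in as a string instead of being read from data.txt, and only the plain ASCII apostrophe is handled.

```python
import string

def unique_words(text):
    """Return the sorted unique words of text, ignoring case and punctuation."""
    cleaned = []
    for word in text.lower().split():
        # Strip any ASCII punctuation from both ends of the word.
        word = word.strip(string.punctuation)
        # Drop a trailing possessive 's (ASCII apostrophe only).
        if word.endswith("'s"):
            word = word[:-2]
        # Skip tokens that consisted of punctuation only.
        if word:
            cleaned.append(word)
    # A set removes duplicates; sorted() returns them in ascending order.
    return sorted(set(cleaned))

sample = ("Apple is a very big company. An apple a day keeps doctor away. "
          "A big fat cat came across the road beside doctor's office. "
          "The doctor owns apple device.")

result = unique_words(sample)
print(result)
```

Using a set avoids the repeated list scans of the membership test, which matters more as the file grows.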

Summary

In this tutorial of Python Examples, we learned how to find unique words in a text file with the help of an example program.
