- How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?
- 8 Answers
- How to find the count of a word in a string
- 9 Answers
- Python: Count Words in a String or File
- Reading a Text File in Python
- Count Number of Words In Python Using split()
- Count Number of Words In Python Using Regex
- Calculating Word Frequencies in Python
- Using defaultdict To Calculate Word Frequencies in Python
- Using Counter to Create Word Frequencies in Python
- Conclusion
How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?
How would I go about counting the words in a sentence? I’m using Python. For example, I might have the string:
string = "I am having a very nice 23!@$ day. "
That would be 7 words. I’m having trouble with the random amount of spaces after/before each word as well as when numbers or symbols are involved.
To accommodate the numbers, you can change the regex: \w matches [a-zA-Z0-9_] (plus other Unicode word characters in Python 3). You also need to define your use case. What happens with "I am fine2"? Would it be 2 words or 3?
You needed to explicitly add "ignoring numbers, punctuation and whitespace" since that’s part of the task.
FYI, some punctuation symbols may merit separate consideration. Otherwise, "carry-on luggage" becomes three words, as does "U.S.A." So answers may want to parameterize which punctuation is allowed, rather than using a blanket regex like \S+.
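One way to parameterize the allowed punctuation, as the comment above suggests, is sketched below. The function name `count_words` and its `inner_punct` parameter are hypothetical and do not come from any answer here; the sketch assumes a word is a run of ASCII letters, optionally joined by the allowed characters.

```python
import re

def count_words(text, inner_punct="-'."):
    """Count runs of ASCII letters, allowing chosen punctuation inside a word.

    Hypothetical helper: `inner_punct` lists the characters that may join
    letter runs, so "carry-on" and "U.S.A" each count as one word.
    """
    if inner_punct:
        joiner = "[" + re.escape(inner_punct) + "]"
        pattern = rf"[A-Za-z]+(?:{joiner}[A-Za-z]+)*"
    else:
        # No joining characters allowed: every letter run is its own word.
        pattern = r"[A-Za-z]+"
    return len(re.findall(pattern, text))

print(count_words("carry-on luggage"))                     # 2
print(count_words("carry-on luggage", inner_punct=""))     # 3
print(count_words("I am having a very nice 23!@$ day. "))  # 7
```

Digits are ignored entirely here, so "fine2" would count as one word; whether that is right depends on the use case.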
8 Answers
str.split() without any arguments splits on runs of whitespace characters:
    >>> s = 'I am having a very nice day.'
    >>>
    >>> len(s.split())
    7
From the linked documentation:
If sep is not specified or is None , a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
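To make the quoted behaviour concrete, here is a small sketch contrasting `split()` with no argument against `split(" ")`. The sample string is made up for illustration.

```python
s = "  I am   having a nice day.  "

# No argument: runs of whitespace collapse into one separator and
# no empty strings appear at the start or end.
default_tokens = s.split()

# Explicit single-space separator: every space is a delimiter, so
# consecutive spaces produce empty strings in the result.
space_tokens = s.split(" ")

print(default_tokens)      # ['I', 'am', 'having', 'a', 'nice', 'day.']
print("" in space_tokens)  # True
```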
One (very minor) disadvantage of this is that punctuation groups can be counted as words. For example, in ‘I am having a very nice day — or at least I was.’, the — would be counted as a word. isalnum might help, I guess, depending on the OP’s definition of "word".
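A minimal sketch of the `isalnum` idea from the comment above: keep only the tokens that contain at least one letter or digit. The sample sentence uses a plain hyphen as the free-standing dash; this is one possible reading of the suggestion, not the commenter's own code.

```python
s = "I am having a very nice day - or at least I was."

tokens = s.split()
# Keep only tokens containing at least one letter or digit, so a
# free-standing dash is not counted as a word.
words = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(len(tokens))  # 13
print(len(words))   # 12
```

Note that a token such as "23!@$" would still be counted, because it contains digits; swapping `isalnum` for `isalpha` would exclude it.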
    import re

    line = " I am having a very nice day."
    count = len(re.findall(r'\w+', line))
    print(count)
    import string

    s = "I am having a very nice 23!@$ day. "
    sum([i.strip(string.punctuation).isalpha() for i in s.split()])

The statement above goes through each whitespace-separated chunk of text, strips leading and trailing punctuation, and then checks whether what remains consists only of letters.
1. Using i as a nonindex variable is really misleading; 2. you don’t need to create a list, it’s just wasting memory. Suggestion: sum(word.strip(string.punctuation).isalpha() for word in s.split())
This is a simple word counter using regex. The script includes a loop that you can terminate when you’re done.

    # Word counter using regex
    import re

    while True:
        string = input("Enter the string: ")
        if string == "Done":  # command to terminate the loop
            break
        count = len(re.findall("[a-zA-Z_]+", string))
        print(count)
    print("Terminated")
OK, here is my version of doing this. I noticed that you want your output to be 7, which means you don’t want to count special characters and numbers. So here is the regex pattern:

    [a-zA-Z_]+

Here [a-zA-Z_] matches any character between a-z (lowercase) or A-Z (uppercase), or an underscore, and + means one or more of them.
About spaces. If you want to remove all extra spaces, just do:
    string = string.strip()  # Remove all extra spaces at the start and at the end of the string
    while "  " in string:  # While there are 2 consecutive spaces between words in our string...
        string = string.replace("  ", " ")  # ...replace them with one space!
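As an alternative sketch of the same whitespace clean-up, splitting on whitespace and re-joining with single spaces achieves the result without a loop. The sample string is invented for illustration.

```python
string = "  I am   having a    very nice day.  "

# split() discards leading, trailing, and repeated whitespace;
# join() puts exactly one space back between the pieces.
normalized = " ".join(string.split())

print(normalized)  # I am having a very nice day.
```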
How to find the count of a word in a string
Depending on your use case, there’s one more thing you might need to consider: some words have their meanings change depending upon their capitalization, like Polish and polish . Probably that won’t matter for you, but it’s worth remembering.
Could you define your data set more for us? Will you worry about punctuation, such as in I’ll, don’t, etc.? Some of these were raised in the comments below. And what about differences in case?
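The effect of case on the counts can be seen in a short sketch. The sentence is a made-up example built around the Polish/polish pair mentioned above.

```python
from collections import Counter

text = "Polish workers polish the Polish silver"

# Case-sensitive: "Polish" and "polish" are counted separately.
case_sensitive = Counter(text.split())

# Case-folded: both spellings land under the same key.
case_folded = Counter(text.lower().split())

print(case_sensitive["Polish"], case_sensitive["polish"])  # 2 1
print(case_folded["polish"])                               # 3
```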
9 Answers
If you want to find the count of an individual word, just use count :
Use collections.Counter and split() to tally up all the words:
    from collections import Counter

    words = input_string.split()
    wordCount = Counter(words)
I’m copying part of a comment that @DSM left for me, since I also used str.count() as my initial solution. This has a problem: "am ham".count("am") will yield 2 rather than 1.
@Levon: You’re absolutely right. I believe using Counter, along with a regex word collector is probably the best option. Will edit answer accordingly.
Well .. credit goes to @DSM who made me aware of this in the first place (since I was using str.count() too)
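The substring problem described in these comments can be reproduced, and one possible fix sketched, as follows. The helper name `count_word` is hypothetical; it assumes that whole-word matching with the regex word boundary `\b` is what is wanted.

```python
import re

def count_word(text, word):
    """Count whole-word occurrences of `word` (hypothetical helper)."""
    # \b matches only at word boundaries, so "am" inside "ham" is skipped.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))

print("am ham".count("am"))        # 2 (substring matches)
print(count_word("am ham", "am"))  # 1 (whole-word matches)
```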
    from collections import Counter
    import re

    Counter(re.findall(r"[\w']+", text.lower()))
Using re.findall with a pattern like [\w']+ is more versatile than split(): the apostrophe in the character class keeps contractions such as "don’t" and "I’ll" intact, while punctuation attached to a word is dropped.
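A small sketch illustrating one way the two approaches can differ. The sample text is invented, and straight apostrophes are assumed.

```python
import re
from collections import Counter

text = "Don't stop. I said don't!"

# split() leaves punctuation attached, so "don't" and "don't!" differ.
by_split = Counter(text.lower().split())

# The regex keeps the apostrophe but drops the trailing "!" and ".".
by_regex = Counter(re.findall(r"[\w']+", text.lower()))

print(by_split["don't"])  # 1
print(by_regex["don't"])  # 2
```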
>>> countWords("Hello I am going to I with hello am") Counter()
If you expect to be making many of these queries, this will only do O(N) work once, rather than O(N*#queries) work.
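The "count once, query many times" point can be sketched like this. `count_words` is a hypothetical stand-in for the answer's `countWords`, assumed to wrap the `Counter` and `re.findall` call shown earlier.

```python
import re
from collections import Counter

def count_words(text):
    """Build the full word tally in a single O(N) pass (hypothetical helper)."""
    return Counter(re.findall(r"[\w']+", text.lower()))

counts = count_words("Hello I am going to I with hello am")

# Each lookup is now a dictionary access rather than a rescan of the text.
# Counter returns 0 for words that never appeared.
for word in ("hello", "am", "python"):
    print(word, counts[word])
```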
    >>> from collections import Counter
    >>> counts = Counter(sentence.lower().split())
The vector of occurrence counts of words is called bag-of-words.
Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer . Example:
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # min_df=0 is rejected by recent scikit-learn releases; 1 is the default
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 min_df=1,
                                 max_features=50)

    text = ["Hello I am going to I with hello am"]

    # Count
    train_data_features = vectorizer.fit_transform(text)
    # get_feature_names_out() replaces get_feature_names() in recent releases
    vocab = vectorizer.get_feature_names_out()

    # Sum up the counts of each vocabulary word
    dist = np.sum(train_data_features.toarray(), axis=0)

    # For each, print the vocabulary word and the number of times it
    # appears in the training set
    for tag, count in zip(vocab, dist):
        print(count, tag)

Output:

    2 am
    1 going
    2 hello
    1 to
    1 with
Part of the code was taken from this Kaggle tutorial on bag-of-words.
Python: Count Words in a String or File
In this tutorial, you’ll learn how to use Python to count the number of words and word frequencies in both a string and a text file. Being able to count words and word frequencies is a useful skill. For example, knowing how to do this can be important in text classification machine learning algorithms.
By the end of this tutorial, you’ll have learned:
- How to count the number of words in a string
- How to count the number of words in a text file
- How to calculate word frequencies using Python
Reading a Text File in Python
The processes for counting words and calculating word frequencies shown below are the same whether you’re working with a string or an entire text file. Because of this, this section briefly describes how to read a text file in Python.
If you want a more in-depth guide on how to read a text file in Python, check out this tutorial here. Here is a quick piece of code that you can use to load the contents of a text file into a Python string:
    # Reading a Text File in Python
    file_path = '/Users/datagy/Desktop/sample_text.txt'

    with open(file_path) as file:
        text = file.read()
I encourage you to check out the tutorial to learn why and how this approach works. However, if you’re in a hurry, just know that the process opens the file, reads its contents, and then closes the file again.
Count Number of Words In Python Using split()
One of the simplest ways to count the number of words in a Python string is by using the split() method. Its signature looks like this:

    # Understanding the split() method
    str.split(
        sep=None,     # The delimiter to split on
        maxsplit=-1   # The maximum number of splits (-1 means no limit)
    )
By default, Python will consider runs of consecutive whitespace to be a single separator. This means that if our string had multiple spaces, they’d only be considered a single delimiter. Let’s see what this method returns:
    # Splitting a string with .split()
    text = 'Welcome to datagy! Here you will learn Python and data science.'
    print(text.split())

    # Returns: ['Welcome', 'to', 'datagy!', 'Here', 'you', 'will', 'learn', 'Python', 'and', 'data', 'science.']
We can see that the method now returns a list of items. Because we can use the len() function to count the number of items in a list, we’re able to generate a word count. Let’s see what this looks like:
    # Counting words with .split()
    text = 'Welcome to datagy! Here you will learn Python and data science.'
    print(len(text.split()))

    # Returns: 11
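The same approach applies to a text file. The sketch below writes a small temporary file first so that it is self-contained; the file contents and the temporary path are assumptions made for illustration, not the sample file referenced earlier.

```python
import os
import tempfile

# Write a small sample file so the example can run anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Welcome to datagy!\nHere you will learn Python and data science.\n")
    file_path = tmp.name

# Read the whole file into a string, then count as before.
with open(file_path, encoding="utf-8") as file:
    word_count = len(file.read().split())

os.remove(file_path)  # Clean up the temporary file
print(word_count)  # 11
```

Because split() treats newlines as whitespace, line breaks in the file do not affect the count.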
Count Number of Words In Python Using Regex
Another simple way to count the number of words in a Python string is to use the regular expressions library, re . The library comes with a function, findall() , which lets you search for different patterns of strings.
Because regular expressions search for patterns, we must first define our pattern. In this case, we want runs of word characters.
For this, we can use the pattern \w+, where \w represents any word character (a letter, digit, or underscore) and the + denotes one or more occurrences. The match stops at the first character that is not a word character, such as a space or a punctuation mark.
Let’s see how we can use this method to generate a word count using the regular expressions library, re :
    # Counting words with regular expressions
    import re

    text = 'Welcome to datagy! Here you will learn Python and data science.'
    print(len(re.findall(r'\w+', text)))

    # Returns: 11
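Two side effects of `\w+` are worth seeing in a short sketch: it counts digits as words and it splits contractions at the apostrophe. The sample text and the alternative pattern `[A-Za-z']+` are illustrative assumptions, not part of the tutorial.

```python
import re

text = "I don't have 2 cats"

# \w+ treats the apostrophe as a separator and keeps the digit.
word_chars = re.findall(r"\w+", text)

# Letters plus apostrophes: contractions stay whole, digits are skipped.
letters_only = re.findall(r"[A-Za-z']+", text)

print(word_chars)    # ['I', 'don', 't', 'have', '2', 'cats']
print(letters_only)  # ['I', "don't", 'have', 'cats']
```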
Calculating Word Frequencies in Python
In order to calculate word frequencies, we can use either the defaultdict class or the Counter class. Word frequencies represent how often a given word appears in a piece of text.
Using defaultdict To Calculate Word Frequencies in Python
Let’s see how we can use defaultdict to calculate word frequencies in Python. The defaultdict class extends the regular Python dictionary by automatically initializing missing keys with a default value.
Because of this, we can loop over a piece of text and count the occurrences of each word. Let’s see how we can use it to create word frequencies for a given string:
    # Creating word frequencies with defaultdict
    from collections import defaultdict
    import re

    text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

    counts = defaultdict(int)
    for word in re.findall(r'\w+', text):
        counts[word] += 1

    print(counts)

    # Returns:
    # defaultdict(<class 'int'>, {'welcome': 1, 'to': 1, 'datagy': 2, 'will': 1, 'teach': 1, 'data': 5, 'is': 1, 'fun': 1})
Let’s break down what we did here:
- We imported both the defaultdict class and the re library
- We loaded some text and instantiated a defaultdict using the int factory function
- We then looped over each word in the word list and added one for each time it occurred
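To show what the int factory saves, here is a sketch of the same loop with a plain dictionary next to the defaultdict version. The variable names `plain` and `auto` are made up for this comparison.

```python
import re
from collections import defaultdict

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'
words = re.findall(r'\w+', text)

# Plain dict: a missing key must be handled explicitly, here with get().
plain = {}
for word in words:
    plain[word] = plain.get(word, 0) + 1

# defaultdict(int): a missing key is created with int() == 0 on first access.
auto = defaultdict(int)
for word in words:
    auto[word] += 1

print(plain == dict(auto))  # True
```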
Using Counter to Create Word Frequencies in Python
Another way to do this is to use the Counter class. The benefit of this approach is that we can easily identify the most frequent word. Let’s see how we can use this approach:

    # Creating word frequencies with Counter
    from collections import Counter
    import re

    text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

    counts = Counter(re.findall(r'\w+', text))
    print(counts)

    # Returns:
    # Counter({'data': 5, 'datagy': 2, 'welcome': 1, 'to': 1, 'will': 1, 'teach': 1, 'is': 1, 'fun': 1})
Let’s break down what we did here:
- We imported our required libraries and classes
- We passed the resulting list from the findall() function into the Counter class
- We printed the result of this class
One of the perks of this is that we can easily find the most common word by using the .most_common() method. The method returns a list of tuples, ordered from most common to least common. Because of this, we can simply access the 0th index to find the most common word:

    # Finding the Most Common Word
    from collections import Counter
    import re

    text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'

    counts = Counter(re.findall(r'\w+', text))
    print(counts.most_common()[0])

    # Returns:
    # ('data', 5)
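As a small extension, `most_common()` also accepts an optional count, which avoids building the full sorted list when only the top few words are needed. The sketch reuses the tutorial's sample text.

```python
import re
from collections import Counter

text = 'welcome to datagy! datagy will teach data. data is fun. data data data!'
counts = Counter(re.findall(r'\w+', text))

# Ask for only the two most frequent words.
top_two = counts.most_common(2)
print(top_two)  # [('data', 5), ('datagy', 2)]
```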
Conclusion
In this tutorial, you learned how to generate word counts and word frequencies using Python. You learned a number of different ways to count words including using the .split() method and the re library. Then, you learned different ways to generate word frequencies using defaultdict and Counter . Using the Counter method, you were able to find the most frequent word in a string.