Count words in text python

Word count from a txt file program

The funny symbols you’re encountering are a UTF-8 BOM (Byte Order Mark). To get rid of them, open the file using the correct encoding (I’m assuming you’re on Python 3):

file = open(r"D:\zzzz\names2.txt", "r", encoding="utf-8-sig") 
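To see what utf-8-sig actually does, here is a small self-contained sketch (bom_demo.txt is a made-up file name used only for this illustration):

with open("bom_demo.txt", "w", encoding="utf-8-sig") as f:
    f.write("lion goat horse")

# Reading with plain utf-8 leaves the BOM glued to the first word;
# utf-8-sig strips it.
print(repr(open("bom_demo.txt", encoding="utf-8").read().split()[0]))      # '\ufefflion'
print(repr(open("bom_demo.txt", encoding="utf-8-sig").read().split()[0]))  # 'lion'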

Furthermore, for counting, you can use collections.Counter:

from collections import Counter

wordcount = Counter(file.read().split())
>>> for item in wordcount.items():
...     print("{}\t{}".format(*item))
...
snake   1
lion    2
goat    2
horse   3
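Putting the two pieces together, a minimal sketch (reusing the asker's example path and Counter.most_common for sorted output):

from collections import Counter

with open(r"D:\zzzz\names2.txt", encoding="utf-8-sig") as f:
    wordcount = Counter(f.read().split())

# Print words from most to least frequent.
for word, count in wordcount.most_common():
    print("{}\t{}".format(word, count))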
#!/usr/bin/python
file = open("D:\\zzzz\\names2.txt", "r+")
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
FILE_NAME = 'file.txt'
wordCounter = {}

with open(FILE_NAME, 'r') as fh:
    for line in fh:
        # Replace punctuation characters, lowercase the string,
        # and split the line into a list of words.
        word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
        for word in word_list:
            # Add the word to the wordCounter dictionary.
            if word not in wordCounter:
                wordCounter[word] = 1
            else:
                # If the word is already in the dictionary, update its count.
                wordCounter[word] = wordCounter[word] + 1

# Column widths chosen to match the sample output below.
print('{:15}{:3}'.format('Word', 'Count'))
print('-' * 18)
# Print each word and its number of occurrences.
for (word, occurrence) in wordCounter.items():
    print('{:15}{:3}'.format(word, occurrence))
Word           Count
------------------
of               6
examples         2
used             2
development      2
modified         2
open-source      2


How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?

How would I go about counting the words in a sentence? I’m using Python. For example, I might have the string:

string = "I am having a very nice 23!@$ day. " 

That would be 7 words. I’m having trouble with the random amount of spaces after/before each word as well as when numbers or symbols are involved.


To accommodate the numbers, you can change the regex: \w matches [a-zA-Z0-9_]. Now you need to define what your use case is. What happens to "I am fine2"? Would it be 2 words or 3?
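A quick sketch of how the choice of pattern decides that case (patterns chosen here purely for illustration):

import re

s = "I am fine2"

# Treat 'fine2' as a word: \w+ matches letters, digits and underscores.
print(len(re.findall(r'\w+', s)))           # 3

# Only count purely alphabetic tokens: 'fine2' is dropped.
print(sum(t.isalpha() for t in s.split()))  # 2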

You needed to explicitly add "ignoring numbers, punctuation and whitespace" since that's part of the task.

FYI, some punctuation symbols may merit separate consideration. Otherwise, "carry-on luggage" becomes three words, as does "U.S.A." So answers may want to parameterize what punctuation is allowed, rather than using a blanket regex like \S+.
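One way to parameterize this (a sketch; the exact pattern depends on which punctuation you want to allow inside a word):

import re

# Letters optionally joined by internal hyphens, apostrophes or periods,
# so "carry-on" and "U.S.A" stay single tokens.
WORD = re.compile(r"[A-Za-z]+(?:[-.'][A-Za-z]+)*")

print(len(WORD.findall("carry-on luggage")))             # 2
print(len(WORD.findall("I visited the U.S.A. twice")))   # 5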

8 Answers

str.split() without any arguments splits on runs of whitespace characters:

>>> s = 'I am having a very nice day.'
>>> len(s.split())
7

From the linked documentation:

If sep is not specified or is None , a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

One (very minor) disadvantage of this would be that you could have punctuation groups counted as words. For example, in 'I am having a very nice day — or at least I was.', you'd get — counted as a word. isalnum might help, I guess, depending on the OP's definition of "word".
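To see the dash being counted, and one way to filter such tokens out with isalnum (a quick sketch using the example above):

>>> s = 'I am having a very nice day — or at least I was.'
>>> len(s.split())
13
>>> sum(any(c.isalnum() for c in tok) for tok in s.split())
12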

import re

line = " I am having a very nice day."
count = len(re.findall(r'\w+', line))
print(count)
s = "I am having a very nice 23!@$ day. " sum([i.strip(string.punctuation).isalpha() for i in s.split()]) 

The statement above goes through each whitespace-separated chunk of text, strips surrounding punctuation, and then checks whether the chunk is really a string of alphabetic characters.

1. Using i as a non-index variable is really misleading; 2. you don't need to create a list, it just wastes memory. Suggestion: sum(word.strip(string.punctuation).isalpha() for word in s.split())
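Applied to the asker's string, the suggested generator version gives the expected count (a quick check):

>>> import string
>>> s = "I am having a very nice 23!@$ day. "
>>> sum(word.strip(string.punctuation).isalpha() for word in s.split())
7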

This is a simple word counter using regex. The script runs in a loop which you can terminate when you're done.

# word counter using regex
import re

while True:
    string = input("Enter the string: ")
    count = len(re.findall("[a-zA-Z_]+", string))
    if string == "Done":  # command to terminate the loop
        break
    print(count)
print("Terminated")

OK, here is my version of doing this. I noticed that you want your output to be 7, which means you don't want to count special characters and numbers. So here is the regex pattern: [a-zA-Z_]+

Here [a-zA-Z_] means it will match any character between a-z (lowercase) and A-Z (uppercase), plus the underscore.
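Applied to the string from the question, this pattern gives the expected count:

import re

string = "I am having a very nice 23!@$ day. "
print(len(re.findall("[a-zA-Z_]+", string)))  # 7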

About spaces. If you want to remove all extra spaces, just do:

string = string.rstrip().lstrip()  # Remove all extra spaces at the start and at the end of the string
while "  " in string:              # While there are 2 spaces between words in our string...
    string = string.replace("  ", " ")  # ...replace them with one space!
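The same normalization can also be done in one line with the standard split/join idiom (an equivalent alternative, not the answerer's code):

string = " ".join(string.split())  # collapses every run of whitespace and trims the ends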


How do I count the number of sentences, words and characters in a file?

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me find and print the number of sentences, words and characters in the file? I have used NLTK in Python for this.

>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print(each_sentence)  # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)  # prints tokenized words from samp.txt

8 Answers

Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk

folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
# Character count summed over all words (the original expression was truncated;
# this is one reasonable reconstruction).
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))
)" data-controller="se-share-sheet" data-se-share-sheet-title="Share a link to this answer" data-se-share-sheet-subtitle="" data-se-share-sheet-post-type="answer" data-se-share-sheet-social="facebook twitter devto" data-se-share-sheet-location="2" data-se-share-sheet-license-url="https%3a%2f%2fcreativecommons.org%2flicenses%2fby-sa%2f3.0%2f" data-se-share-sheet-license-name="CC BY-SA 3.0" data-s-popover-placement="bottom-start">Share
)" title="">Improve this answer
)">edited Feb 24, 2016 at 14:05
hd1
33.8k 5 gold badges 80 silver badges 91 bronze badges
answered Feb 22, 2011 at 6:38
Add a comment |
3

With nltk, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).

And in your case:

import nltk

raw = open('samp.txt', encoding='utf-8').read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())
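Continuing that idea, here is a minimal sketch that prints the three counts the question actually asks for (samp.txt is the asker's file; note that word_tokenize counts punctuation marks as tokens, so "words" here really means tokens):

import nltk

raw = open('samp.txt', encoding='utf-8').read()

print("sentences:", len(nltk.sent_tokenize(raw)))
print("words (tokens):", len(nltk.word_tokenize(raw)))
print("characters:", len(raw))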

