Python count word frequencies

How to count word frequencies within a file in Python

Although obviously the real file is a lot bigger than this, this is essentially it. Basically, I'm trying to count how many times each individual string appears in the file (each letter/string is on a separate line, so technically the file is C\nV\nEH\n etc.). However, when I convert the file into a list and then use the count function on it, the strings get separated out into letters, so that a string such as 'IRQ' becomes ['\n', 'I', 'R', 'Q', '\n']. When I count, I therefore get the frequencies of each individual letter and not of the strings. Here is the code that I have written so far:

def countf():
    fh = open("C:/x.txt", "r")
    fh2 = open("C:/y.txt", "w")
    s = []
    for line in fh:
        s += line
    for x in s:
        fh2.write("{} - {}".format(x, s.count(x)))
C 10
V 32
EH 7
A 1
IRQ 9
H 8

Does it have to be done in Python? sort yourfile.txt | uniq -c will give you word counts (you mention C:\ so you seem to be on Windows; sort and uniq are standard Unix commands that you can get if you install Cygwin or unxutils.sourceforge.net).

@therefromhere — I think the OP wants word counts. The Python code as written is generating letter counts. sort and uniq will technically generate line counts; not sure if that is what's wanted here.

Word counts; it's just that some of those words happen to be composed of a single letter. It's for biological research. As for doing it in Python, that and R are the only languages I'm familiar with, and to be honest I'd like to figure this out within Python.
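A minimal sketch of one way to do this (not taken from the original thread): read the file line by line, strip the newline, and count whole lines with collections.Counter. The file paths mirror the ones in the question.

from collections import Counter

def countf():
    # Count each full line (e.g. "IRQ") instead of splitting it into characters.
    with open("C:/x.txt", "r") as fh:
        counts = Counter(line.strip() for line in fh if line.strip())
    with open("C:/y.txt", "w") as fh2:
        for word, n in counts.items():
            fh2.write("{} - {}\n".format(word, n))

countf()

Because Counter indexes by the whole string, 'IRQ' stays a single key rather than being broken into 'I', 'R', 'Q'.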


Python — Finding word frequencies of list of words in text file

I am trying to speed up my project to count word frequencies. I have 360+ text files, and I need to get the total number of words and the number of times each word from another list of words appears. I know how to do this with a single text file.

>>> import nltk
>>> import os
>>> import re
>>> os.chdir("C:\Users\Cameron\Desktop\PDF-to-txt")
>>> filename = "1976.03.txt"
>>> textfile = open(filename, "r")
>>> inputString = textfile.read()
>>> word_list = re.split('\s+', file(filename).read().lower())
>>> print 'Words in text:', len(word_list)  # number of words in the text file
>>> word_list.count('inflation')  # number of times 'inflation' occurs in the text file
>>> word_list.count('jobs')
>>> word_list.count('output')

It's too tedious to get the frequencies of 'inflation', 'jobs', and 'output' individually. Can I put these words into a list and find the frequency of all the words in the list at the same time? Basically this, with Python. Example: instead of this:

>>> word_list.count('inflation')
3
>>> word_list.count('jobs')
5
>>> word_list.count('output')
1
something like this:

>>> list1 = 'inflation', 'jobs', 'output'
>>> word_list.count(list1)
'inflation', 'jobs', 'output'
3, 5, 1

My list of words is going to have 10-20 terms, so I need to be able to just point Python toward a list of words and get the counts for each. It would also be nice if the output could be copied and pasted into an Excel spreadsheet, with the words as columns and the frequencies as rows. Example:

inflation, jobs, output
3, 5, 1
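One way to get all of the counts in a single pass (a sketch, not from the original thread) is to build a Counter once and then look up each target word; the word list and filename below are taken from the question as examples.

import re
from collections import Counter

targets = ['inflation', 'jobs', 'output']   # example word list
with open("1976.03.txt") as f:              # filename from the question
    word_list = re.split(r'\s+', f.read().lower())
counts = Counter(word_list)
print(', '.join(targets))
print(', '.join(str(counts[w]) for w in targets))

Looking up counts[w] on a Counter returns 0 for words that never appear, so missing terms don't raise an error.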

And finally, can anyone help automate this for all of the text files? I figure I just point Python toward the folder and it can do the above word counting from the new list for each of the 360+ text files. Seems easy enough, but I'm a bit stuck. Any help? An output like this would be fantastic:

Filename1
inflation, jobs, output
3, 5, 1

Filename2
inflation, jobs, output
7, 2, 4

Filename3
inflation, jobs, output
9, 3, 5
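A sketch of how this could be automated over a folder of .txt files, assuming the folder path from the question; the target word list and the comma-separated output format are illustrative, not from the original thread.

import os
import re
from collections import Counter

folder = r"C:\Users\Cameron\Desktop\PDF-to-txt"   # folder from the question
targets = ['inflation', 'jobs', 'output']          # example word list

for name in sorted(os.listdir(folder)):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(folder, name)) as f:
        counts = Counter(re.split(r'\s+', f.read().lower()))
    # Comma-separated rows paste cleanly into a spreadsheet.
    print(name)
    print(', '.join(targets))
    print(', '.join(str(counts[w]) for w in targets))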


Count word frequency efficiently in Python using a dictionary

Why is Method 2 efficient? Isn't the number of hash-function calls the same in both cases? That seems to contradict this post: http://blackecho.github.io/blog/programming/2016/03/23/python-underlying-data-structures.html

The most efficient is probably Counter from the standard library: from collections import Counter; c = Counter(words).

There are many ways to do it, some of which are more efficient than Method 2. See collections.defaultdict, or even better, collections.Counter.

@Ev.kounis My question is: how is Method 2 more efficient than Method 1 in terms of the number of hash-function calls?
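For context, the two methods under discussion are presumably along these lines (reconstructed from the answers below, not quoted from the original question): Method 1 tests membership with in, Method 2 asks forgiveness with try/except.

words = ['apple', 'banana', 'apple', 'cherry']

# Method 1: explicit membership test before updating
counts1 = {}
for word in words:
    if word in counts1:
        counts1[word] += 1
    else:
        counts1[word] = 1

# Method 2: try the update and handle the missing key
counts2 = {}
for word in words:
    try:
        counts2[word] += 1
    except KeyError:
        counts2[word] = 1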

4 Answers

It depends on the input. If on average most words are already in the dict then you will not get many exceptions. If most words are unique then the overhead of the exceptions will make the second method slower.

I had a quick look but I don’t see what sort of reaction you hope for. Many random blogs are happy to provide incomplete or dubious programming advice.

The blog says that the number of function calls in Method 1 is higher than in Method 2. I am unable to digest that; I feel that the number of function calls is the same in both cases.

It says it performs a lot of unnecessary hash function computations, but that seems like a false claim. Why is this a comment to my answer, and not e.g. a new question, or a comment to the blog’s author?

@SheikhArbaz you can check the validity of this blog post's assertion by setting up a quick and simple benchmark like this one: gist.github.com/BrunoDesthuilliers/… As you'll find out, on a long enough real-life text (with lots of different words), the containment test is significantly faster (by a factor of 2 in Python 3.6, and by a factor of 4 with Python 2.7). For a dummy text with the same three words repeated over and over.
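The linked gist isn't reproduced here, but a minimal benchmark in the same spirit could use timeit; the random word data and repetition count below are illustrative choices, not from the comment.

import random
import timeit

words = [str(random.randrange(10000)) for _ in range(100000)]

def method1(words):
    # containment test
    d = {}
    for w in words:
        if w in d:
            d[w] += 1
        else:
            d[w] = 1
    return d

def method2(words):
    # try/except
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return d

print("membership test:", timeit.timeit("method1(words)", globals=globals(), number=10))
print("try/except:", timeit.timeit("method2(words)", globals=globals(), number=10))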

In your first method, for every word it will search the dictionary 3 times, so in total it will do 3 * len(words), or 3 * 4 = 12, lookups.

In the second method, it will search only 2 times if the word is not found, and otherwise 1 time, so 2 * 4 = 8.

Theoretically, both have the same time complexity.

Thanks to Thierry Lathuille for pointing this out. Indeed, Method 1 should be more efficient than Method 2. Python dictionaries use a hash map, so accessing a key is O(n) in the worst case but O(1) on average, and the CPython implementation is quite efficient. On the other hand, try/except exception handling is slow.

You can use defaultdict in your Method 1 for cleaner code.
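A minimal sketch of the defaultdict variant that comment suggests, using an illustrative word list:

from collections import defaultdict

words = ['apple', 'banana', 'apple', 'cherry']

counts = defaultdict(int)   # missing keys default to 0
for word in words:
    counts[word] += 1

print(dict(counts))         # {'apple': 2, 'banana': 1, 'cherry': 1}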
