Java stream count words

Contents

Word counting using streams
6 Answers
Pattern#splitAsStream(CharSequence)
How to count words in a text file, java 8-style
How to count the number of occurrences of words in a text
5 Answers

Word counting using streams

But countWords("asd") returns 97. Why? I thought that chars() returns an IntStream that actually consists of chars, so I just cast each value to char. What’s wrong?

6 Answers

While your question refers to counting words your code seems to be designed to count spaces. If that’s your intention then I would suggest:

input.chars().filter(Character::isSpaceChar).count();

That avoids many of the casting complications you have in your code and you can change it to isWhitespace if that makes more sense for your domain.
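The choice between isSpaceChar and isWhitespace matters for some inputs. The following is a minimal sketch of that difference; the class and method names (SpaceCount, countSpaceChars, countWhitespace) are hypothetical and not part of the original answer.

```java
final class SpaceCount {

    // Counts characters that Unicode classifies as space, line or paragraph separators.
    static long countSpaceChars(String input) {
        return input.chars().filter(Character::isSpaceChar).count();
    }

    // Counts characters that Java treats as whitespace (includes tab and newline,
    // excludes non-breaking spaces).
    static long countWhitespace(String input) {
        return input.chars().filter(Character::isWhitespace).count();
    }

    public static void main(String[] args) {
        String tabbed = "a b\tc";
        System.out.println(countSpaceChars(tabbed)); // the tab is not a space separator
        System.out.println(countWhitespace(tabbed)); // the tab is whitespace
    }
}
```

A tab is counted only by isWhitespace, while a non-breaking space (U+00A0) is counted only by isSpaceChar, so the right predicate depends on what the input can contain.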

If, however, you wish to count words then the simplest solution is to split on whitespace then count non-empty words:

Pattern.compile("\\s+").splitAsStream(input).filter(word -> !word.isEmpty()).count();

Though the latter still counts subsequences of adjacent non-whitespace characters, which are not necessarily words.
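As a rough illustration of that caveat, the sketch below counts whitespace-separated tokens and, separately, only those tokens that contain at least one letter. The stricter rule is an assumption made for this example, as are the names SplitWordCount, countTokens and countWordsWithLetters; what counts as a "word" depends on the domain.

```java
import java.util.regex.Pattern;

final class SplitWordCount {

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Counts every non-empty run of non-whitespace characters.
    static long countTokens(String input) {
        return WHITESPACE.splitAsStream(input)
                .filter(token -> !token.isEmpty())
                .count();
    }

    // Assumed stricter rule: a token is a word only if it contains a letter.
    static long countWordsWithLetters(String input) {
        return WHITESPACE.splitAsStream(input)
                .filter(token -> token.chars().anyMatch(Character::isLetter))
                .count();
    }

    public static void main(String[] args) {
        String text = "  hello - world  ";
        System.out.println(countTokens(text));           // counts "hello", "-", "world"
        System.out.println(countWordsWithLetters(text)); // skips the lone "-"
    }
}
```

The isEmpty filter in countTokens is still needed because splitting a string that starts with whitespace yields a leading empty token.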

I would suggest a more functional approach:

import java.util.Arrays;

public static long countWords(String s) {
    return Arrays.stream(s.split(" "))
            .filter(w -> !w.isEmpty())
            .count();
}

@Aconcagua Even though the answer is verbose, it’s more informative. 🙂 I don’t know which one is more appropriate when it comes to performance.

@CrazyNinja If it comes to performance, then we’d probably rather stick with the reduce approach (just iterating, no new objects, except for the one the lambda is mapped to). Between our two variants? Well, if you split via "\\w+", too, you’d construct fewer String objects, which might give you a little performance gain. trim() vs. stream().filter()? I would tend to say trim(), but I don’t dare to give a definite answer without benchmarking.


There are different overloads of the reduce operator:

Optional<T> reduce(BinaryOperator<T> accumulator)
T reduce(T identity, BinaryOperator<T> accumulator)
<U> U reduce(U identity, BiFunction<U, ? super T, U> accumulator, BinaryOperator<U> combiner)

If you don’t specify the identity value for ‘x’, the reduce operator takes the first value from the stream. So ‘x’ ends up being the letter ‘a’, as an integer, which is 97. You probably want to change your code to this:

public static int countWords(String s) {
    return s.chars().reduce(0, (x, y) -> {
        if ((char) y == ' ')
            return x + 1;
        return x;
    });
}
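To make the effect of the identity value visible, here is a small sketch that runs the same accumulator with and without it. The class and method names (ReduceIdentityDemo, withoutIdentity, withIdentity) are made up for this illustration.

```java
import java.util.OptionalInt;

final class ReduceIdentityDemo {

    // No identity: the first character becomes the initial value of x.
    static OptionalInt withoutIdentity(String s) {
        return s.chars().reduce((x, y) -> y == ' ' ? x + 1 : x);
    }

    // Identity 0: x starts as a counter, every character arrives as y.
    static int withIdentity(String s) {
        return s.chars().reduce(0, (x, y) -> y == ' ' ? x + 1 : x);
    }

    public static void main(String[] args) {
        System.out.println(withoutIdentity("asd")); // OptionalInt[97], the code of 'a'
        System.out.println(withIdentity("asd"));    // 0
        System.out.println(withIdentity("a s d"));  // 2
    }
}
```

Note that the variant with the identity counts spaces, not words, so a single word yields 0.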

Unfortunately, your example is not quite correct. First of all, it returns 0 if there’s one word. Secondly, orElse(0) does not compile.

You’re right — removed the .orElse, OP can decide how this should handle 1 word cases, as there are plenty of alternatives suggested by others as well. This answer was intended to explain why the OP is getting the behaviour they described.

Besides, it is an int, not an Integer; I didn’t notice that it is returned as well. Still, returning the value without modifying the parameter is much cleaner. Oh, and the whole thing still is an abuse of reduce…

Yes, as I’ve already commented, this answer was intended to explain why the OP is getting the behaviour they described.

It may help to understand that CharSequence.chars() returns an IntStream instead of a Stream<Character>. That influences the available reduce methods.

While using reduce, you’re working with tuples: x is the first char or an accumulator, and y is the second char variable.

Here, x always points to 'a', whose ASCII value is 97.

Pattern#splitAsStream(CharSequence)

You may want to use this method in your case: it does the job just right, and the resulting code is easier to maintain.

public static int countWords(String s)

That’s not entirely true — the values are passed in to the accumulator function one at a time, along with the result of accumulation. It’s not a tuple of values from the stream, but a tuple of the accumulated value and the next value to be added, so ‘x’ could just as easily be a StringBuilder, a List, or anything else.

@Andrew Williamson: The documentation clearly says that the reduction operator must be associative, which implies that (a op b) op c must have the same result as a op (b op c) . This is violated if a is a counter and b and c are characters. So even if it happens to work with the current implementation, it’s not correct. And you will get surprises in parallel execution.
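The associativity violation can be checked directly. The sketch below applies the space-counting accumulator in both groupings and, as one possible associative alternative, maps each character to 0 or 1 before summing. All names here (AssociativityCheck, COUNT_SPACES, leftGrouped, rightGrouped, countSpaces) are hypothetical.

```java
import java.util.function.IntBinaryOperator;

final class AssociativityCheck {

    // The accumulator under discussion: x is treated as a counter, y as a character.
    static final IntBinaryOperator COUNT_SPACES = (x, y) -> y == ' ' ? x + 1 : x;

    // (a op b) op c
    static int leftGrouped(int a, int b, int c) {
        return COUNT_SPACES.applyAsInt(COUNT_SPACES.applyAsInt(a, b), c);
    }

    // a op (b op c)
    static int rightGrouped(int a, int b, int c) {
        return COUNT_SPACES.applyAsInt(a, COUNT_SPACES.applyAsInt(b, c));
    }

    // Associative formulation: turn every character into 0 or 1, then add them up.
    static int countSpaces(String s) {
        return s.chars().map(ch -> ch == ' ' ? 1 : 0).sum();
    }

    public static void main(String[] args) {
        System.out.println(leftGrouped(0, ' ', ' '));  // 2
        System.out.println(rightGrouped(0, ' ', ' ')); // 0
        System.out.println(countSpaces("a b c"));      // 2
    }
}
```

Because addition is associative, the map-then-sum form gives the same result however the stream is split, which is what a parallel reduction relies on.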

Counting spaces: see sprinter’s response; performance: see comments to erkfel’s response; correct application of reduce: see Andrew Williamson’s response.

Now I combined them all to the following:

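public static int countWords(String s) {
    int c = s.chars().reduce(0, (x, y) -> {
        if (x < 0) {
            // negative: currently inside a word
            if (Character.isWhitespace(y)) {
                x = -x;
            }
        } else {
            // non-negative: currently inside a whitespace sequence
            if (!Character.isWhitespace(y)) {
                x = -(x + 1);
            }
        }
        return x;
    });
    // a negative result only means the string ended inside a word
    return c < 0 ? -c : c;
}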

This counts real words, not the whitespace, in a very efficient way. There is a little trick hidden within: I am using negative values to represent the state "within a word" and positive values to represent "within a whitespace sequence". I chose this to avoid carrying an additional boolean value, which saves us from writing an explicit class implementing IntBinaryOperator. Additionally, this keeps the lambda expression stateless; still, parallelising as discussed in the reduction article would not be possible, as this operator is not associative.

Edit: As Holger pointed out (I think rightly), this usage is an abuse of how reduce is actually intended to be used (having several like values and reducing them to a single one that is still like the original ones; example: summing or multiplying a list of numerical values, where the result is still numerical, or concatenating a list of strings, where the result is still a string).
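For comparison, these are the kinds of reductions the edit describes, where the operands and the result have the same type and the operation is associative. This is only an illustrative sketch; the class and method names are invented for it.

```java
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class ReduceExamples {

    // Numbers reduced to a number: identity 0 for addition.
    static int sum(int... values) {
        return IntStream.of(values).reduce(0, Integer::sum);
    }

    // Numbers reduced to a number: identity 1 for multiplication.
    static int product(int... values) {
        return IntStream.of(values).reduce(1, (a, b) -> a * b);
    }

    // Strings reduced to a string: identity is the empty string.
    static String concat(String... parts) {
        return Stream.of(parts).reduce("", String::concat);
    }

    public static void main(String[] args) {
        System.out.println(sum(1, 2, 3, 4));       // 10
        System.out.println(product(1, 2, 3, 4));   // 24
        System.out.println(concat("a", "b", "c")); // abc
    }
}
```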

So simply iterating over the string seems to be more appropriate:

public static int countWords(String s) {
    int count = 0;
    boolean isWord = false;
    for (int i = 0; i < s.length(); i++) {
        if (isWord) {
            if (Character.isWhitespace(s.charAt(i))) {
                isWord = false;
            }
        } else {
            if (!Character.isWhitespace(s.charAt(i))) {
                ++count;
                isWord = true;
            }
        }
    }
    return count;
}

I personally like compact variants, although less understandable:

public static int countWords(String s) {
    int count = 0;
    boolean isWord = false;
    for (int i = 0; i < s.length(); i++) {
        boolean isChange = isWord == Character.isWhitespace(s.charAt(i));
        isWord ^= isChange;
        count += isWord & isChange ? 1 : 0;
    }
    return count;
}


How to count words in a text file, java 8-style

I’m trying to complete an assignment that first counts the number of files in a directory and then gives a word count for each file. I got the file count all right, but I’m having a hard time converting some code my instructor gave me from a class that does a frequency count to the simpler word count. Moreover, I can’t seem to find the proper code to look at each file to count the words (I’m trying to find something "generic" rather than specific, but I’m testing the program with a specific text file). This is the intended output:

Count 11 files:
word length: 1 ==> 80
word length: 2 ==> 321
word length: 3 ==> 643

primes.txt but are sometimes sense refrigerator make haiku dont they funny word length: 1 ==> . Count 11 files:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

/**
 *
 * @author
 */
public class WordCount {

    /**
     *
     * @param filename
     * @return
     * @throws java.io.IOException
     */
    public Map<String, Long> count(String filename) throws IOException {
        //Stream<String> lines = Files.lines(Paths.get(filename));
        Path path = Paths.get("haiku.txt");
        Map<String, Long> wordMap = Files.lines(path)
                .parallel()
                .flatMap(line -> Arrays.stream(line.trim().split(" ")))
                .map(word -> word.replaceAll("[^a-zA-Z]", "").toLowerCase().trim())
                .filter(word -> word.length() > 0)
                .map(word -> new SimpleEntry<>(word, 1))
                //.collect(Collectors.toMap(s -> s, s -> 1, Integer::sum));
                .collect(groupingBy(SimpleEntry::getKey, counting()));
        wordMap.forEach((k, v) -> System.out.println(String.format(k, v)));
        return wordMap;
    }
}

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/*
 * To change this license header, choose License Headers in Project Properties.
 * To change this template file, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author
 */
public class FileCatch8 {

    public static void main(String args[]) {
        List<String> fileNames = new ArrayList<>();
        try {
            DirectoryStream<Path> directoryStream = Files.newDirectoryStream(Paths.get("files"));
            int fileCounter = 0;
            WordCount wordCnt = new WordCount();
            for (Path path : directoryStream) {
                System.out.println(path.getFileName());
                fileCounter++;
                fileNames.add(path.getFileName().toString());
                System.out.println("word length: " + fileCounter + " ==>"
                        + wordCnt.count(path.getFileName().toString()));
            }
        } catch (IOException ex) {
        }
        System.out.println("Count: " + fileNames.size() + " files");
    }
}
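No answer to this question survives in the source, so the following is only one possible sketch of a plain per-file word count. It assumes that a word is a whitespace-separated token with at least one ASCII letter left after stripping other characters (the same cleanup the question's code applies), and every name in it (DirectoryWordCount, countWords, countWordsPerFile) is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

final class DirectoryWordCount {

    // Counts the words in a stream of lines; tokens without letters are ignored.
    static long countWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> !word.isEmpty())
                .count();
    }

    // Maps each regular file in the directory to its word count, sorted by file name.
    static Map<String, Long> countWordsPerFile(Path directory) {
        Map<String, Long> counts = new TreeMap<>();
        try (Stream<Path> files = Files.list(directory)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                // Pass the full path, not just the file name, so the file is found.
                try (Stream<String> lines = Files.lines(file)) {
                    counts.put(file.getFileName().toString(), countWords(lines));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Keeping the counting logic in a method that takes a stream of lines makes it usable without touching the file system; the directory walk then only supplies the lines of each file.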


How to count the number of occurrences of words in a text

I am working on a project to write a program that finds the 10 most used words in a text, but I got stuck and don’t know what I should do next. Can someone help me please? I came this far only:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Lab4 {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
        List<String> words = new ArrayList<String>();
        while (file.hasNext()) {
            String tx = file.next();
            // String x = file.next().toLowerCase();
            words.add(tx);
        }
        Collections.sort(words);
        // System.out.println(words);
    }
}

A List of words is not sufficient, you also need a count of each occurrence of the words. What data structures would you use for such a task? (Clearly, this is homework, which is why I am posing this question)

I think you have a bug with how you’re reading the file. file.next() will eventually be null, so you should check for that.

5 Answers

And here is how to find the words with the highest count in a Multiset: Simplest way to iterate through a Multiset in the order of element frequency?

UPDATE I wrote this answer in 2012. Since then we have Java 8, and now it is possible to find the 10 most used words in a few lines without external libraries:

import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.toList;
import static java.util.stream.Collectors.toMap;

List<String> words = ...;

// map the words to their count
Map<String, Integer> frequencyMap = words.stream()
        .collect(toMap(
                s -> s,          // key is the word
                s -> 1,          // value is 1
                Integer::sum));  // merge function counts the identical words

// find the top 10
List<String> top10 = words.stream()
        .sorted(comparing(frequencyMap::get).reversed()) // sort by descending frequency
        .distinct()         // take only unique values
        .limit(10)          // take only the first 10
        .collect(toList()); // put it in a returned list

System.out.println("top10 = " + top10);
edited May 23, 2017 at 12:00
answered Dec 20, 2012 at 19:52 by lbalazscs
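As a variation on the Java 8 approach above, one could sort the entries of the frequency map rather than the full word list, so that each distinct word is compared only once. This is a sketch under that assumption, not the answer's own code; the names TopWords and topWords are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class TopWords {

    // Returns up to `limit` words, most frequent first; ties are ordered alphabetically.
    static List<String> topWords(List<String> words, int limit) {
        Map<String, Long> frequency = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return frequency.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "a", "b", "c", "a", "b");
        System.out.println(topWords(words, 2)); // [b, a]
    }
}
```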