- Why does this Java code not utilize all CPU cores?
- Java fixed size Thread Pool and optimal usage of all CPU cores
- Approach 1, probably complete deadlock
- Approach 2, the current one, at least somebody working
- Approach 3, the only solution I see
- Other solutions
- [EDIT]
- Accepted Solution and why I try something different (based on approach 2)
- Note:
- How to make Java application use all CPU groups
- How to tell JRE to use more than one CPU node
- 1 Answer
Why does this Java code not utilize all CPU cores?
When started with a thread count of 8 and the int task (see the usage message in the code), it will start 8 threads that do nothing but loop and add 2 to an integer. This runs entirely in registers and does not even allocate new memory. The problem we are facing is that we cannot fully load a 24-core machine (AMD, 2 sockets with 12 cores each) when running this simple program (with 24 threads, of course). Similar things happen with 2 programs of 12 threads each, or on smaller machines. So our suspicion is that the JVM (Sun JDK 6u20 on Linux x64) does not scale well. Has anyone seen something similar, or could you run it and report whether or not it runs well on your machine (>= 8 cores only, please)? Ideas? I tried it on Amazon EC2 with 8 cores too, but the virtual machine seems to behave differently from a real box, so the load pattern looks totally strange.
```java
package com.test;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class VMTest {

    // Register-only work: loops forever without allocating.
    public class IntTask implements Runnable {
        @Override
        public void run() {
            int i = 0;
            while (true) {
                i = i + 2;
            }
        }
    }

    // Allocates a new String on every iteration.
    public class StringTask implements Runnable {
        @Override
        public void run() {
            int i = 0;
            String s;
            while (true) {
                i++;
                s = "s" + Integer.valueOf(i);
            }
        }
    }

    // Allocates a new String[size] on every iteration.
    public class ArrayTask implements Runnable {
        private final int size;

        public ArrayTask(int size) {
            this.size = size;
        }

        @Override
        public void run() {
            int i = 0;
            String[] s;
            while (true) {
                i++;
                s = new String[size];
            }
        }
    }

    public void doIt(String[] args) throws InterruptedException {
        final int threadCount = Integer.parseInt(args[0].trim());
        final String command = args[1].trim();
        final int size = Integer.parseInt(args[2].trim());
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            Runnable runnable;
            if (command.equalsIgnoreCase("int")) {
                runnable = new IntTask();
            } else if (command.equalsIgnoreCase("string")) {
                runnable = new StringTask();
            } else if (command.equalsIgnoreCase("array")) {
                runnable = new ArrayTask(size);
            } else {
                throw new IllegalArgumentException("Unknown taskDef: " + command);
            }
            executor.submit(runnable);
        }
        // The tasks never end: shutdown() only stops new submissions and
        // awaitTermination() keeps main waiting for up to one hour.
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 3) {
            System.err.println("Usage: VMTest threadCount taskDef size");
            System.err.println("threadCount: Number 1..n");
            System.err.println("taskDef: int string array");
            System.err.println("size: size of memory allocation for array");
            System.exit(-1);
        }
        new VMTest().doIt(args);
    }
}
```
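To check whether the threads really run in parallel, rather than only watching a load monitor, one option is to measure how much CPU time each busy thread actually receives. The sketch below assumes Java 8 or later and a JVM that supports per-thread CPU time through ThreadMXBean; the class and method names (CpuLoadCheck, measure) are made up for illustration. If every thread reports close to the full wall time, the threads ran on separate cores; values far below it suggest they were time-sliced onto fewer cores.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class CpuLoadCheck {

    // Starts `threads` busy loops for `millis` ms and returns the CPU time (ms)
    // each thread actually received, or -1 where the JVM cannot measure it.
    public static long[] measure(int threads, long millis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] cpuMillis = new long[threads];
        CountDownLatch finished = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int idx = t;
            new Thread(() -> {
                long end = System.nanoTime() + millis * 1_000_000L;
                long counter = 0;
                while (System.nanoTime() < end) {
                    counter += 2; // register-only work, like IntTask
                }
                long ns = mx.isCurrentThreadCpuTimeSupported() ? mx.getCurrentThreadCpuTime() : -1L;
                cpuMillis[idx] = ns < 0 ? -1L : ns / 1_000_000L;
                finished.countDown();
            }).start();
        }
        try {
            finished.await(); // also makes the cpuMillis writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return cpuMillis;
    }

    public static void main(String[] args) {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : Runtime.getRuntime().availableProcessors();
        long[] cpu = measure(threads, 2000);
        for (int i = 0; i < cpu.length; i++) {
            System.out.println("thread " + i + ": " + cpu[i] + " ms CPU in 2000 ms wall time");
        }
    }
}
```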
Java fixed size Thread Pool and optimal usage of all CPU cores
How can I use exactly 8 threads for the ‘expensive’ parts all the time? I have a number-crunching problem for which I created a simple framework. My problem is to find an elegant and simple way to use all CPU cores optimally. To get good performance I use a thread pool with a fixed size of 8, the idea being to use as many threads as there are hardware threads. In simplified pseudocode, the framework is used as follows:
```
interface Task {
    data[] compute(data[] input);
}

Task task = new Loop(new Chain(new DoX(), new DoY(), new Split(2, new DoZ())));
result = task.compute(data);
```
- Loop Task would loop until some termination criteria is met
- Chain Task would chain tasks (e.g. in the above r = t1.compute(r); r = t2.compute(r); r = t3.compute(r); return r;)
- Split Task would split the data and execute a task on the parts (e.g. create 2 parts and return a new data[])
The threading is implemented in the Split Task at the moment. So the Split Task would hand the computation of t1.compute(part1) and t1.compute(part2) to the thread pool.
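A minimal sketch of such a framework, assuming Java 8 or later and using a plain double[] as a hypothetical stand-in for data[]. Loop is left out, Chain follows the description above, and Split behaves like approach 1: it submits each part to a shared pool and then blocks in Future.get(). All class names here are illustrative, not the asker's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class TaskFramework {

    // Hypothetical stand-in for the question's data[] type: a plain double[].
    public interface Task {
        double[] compute(double[] data);
    }

    // Chain: r = t1.compute(r); r = t2.compute(r); ...
    public static final class Chain implements Task {
        private final Task[] tasks;

        public Chain(Task... tasks) {
            this.tasks = tasks;
        }

        @Override
        public double[] compute(double[] data) {
            double[] r = data;
            for (Task t : tasks) {
                r = t.compute(r);
            }
            return r;
        }
    }

    // Split as in approach 1: every part goes to the shared pool, then get() blocks.
    public static final class Split implements Task {
        private final int parts;
        private final Task task;
        private final ExecutorService pool;

        public Split(int parts, Task task, ExecutorService pool) {
            this.parts = parts;
            this.task = task;
            this.pool = pool;
        }

        @Override
        public double[] compute(double[] data) {
            int chunk = (data.length + parts - 1) / parts;
            List<Future<double[]>> futures = new ArrayList<>();
            for (int from = 0; from < data.length; from += chunk) {
                double[] part = Arrays.copyOfRange(data, from, Math.min(data.length, from + chunk));
                futures.add(pool.submit(() -> task.compute(part)));
            }
            List<double[]> results = new ArrayList<>();
            int total = 0;
            for (Future<double[]> f : futures) {
                double[] r = await(f); // the calling thread sits idle here
                results.add(r);
                total += r.length;
            }
            double[] out = new double[total];
            int pos = 0;
            for (double[] r : results) {
                System.arraycopy(r, 0, out, pos, r.length);
                pos += r.length;
            }
            return out;
        }
    }

    static <T> T await(Future<T> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```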
Approach 1, probably complete deadlock
My first approach was that the Split Task has an array of futures and calls get() on one after another. But that means that if a Split Task is inside another Split Task, the blocking wait in future.get() blocks the thread that the outer Split Task took from the thread pool. So fewer than 8 threads are really working. If the hierarchy is deep, I may end up with no thread working at all and wait forever.
1) I assume future.get() does not return the thread to the thread pool, right? So if done like that, I will wait in future.get() but there will be no more threads to ever start the work? [I cannot easily test that because I have already changed the approach.]
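Question 1) can be checked directly: Future.get() keeps the calling pool thread occupied; the thread is not handed back to the pool while it waits. In this sketch (Java 8 or later; class and method names are illustrative), an outer task waits on an inner task submitted to the same fixed pool. With one pool thread the inner task never gets a thread and the timed get() times out; with two threads it completes. The 500 ms timeout exists only so the demo terminates, since a plain get() would block forever. Nested Split Tasks on a pool of 8 behave the same way once 8 outer tasks are waiting.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NestedGetStarvation {

    // Submits an outer task that itself submits an inner task to the SAME pool
    // and waits for it with get(). Returns true if the inner task never got a
    // thread within the timeout.
    public static boolean starves(int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<Boolean> outer = pool.submit(() -> {
                Future<String> inner = pool.submit(() -> "inner done");
                try {
                    inner.get(500, TimeUnit.MILLISECONDS); // holds this pool thread while waiting
                    return false;
                } catch (TimeoutException e) {
                    return true; // no free thread was left to run the inner task
                }
            });
            return outer.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pool of 1: starves = " + starves(1));
        System.out.println("pool of 2: starves = " + starves(2));
    }
}
```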
Approach 2, the current one, at least somebody working
My current approach (not much better) is to do the final part (partN) of a split with the current thread. When it is finished, I check whether partN-1 has already been started: if yes, I wait for all tasks in future.get(); otherwise the current thread does partN-1 too, and partN-2 if needed, and so on. So now I should always have at least one thread of the pool working.
But since the answer to question 1) is probably that future.get() blocks my threads, with this approach I will have only a few working threads on deep hierarchies.
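Approach 2 can be expressed compactly with FutureTask, because FutureTask.run() returns immediately when the task has already been started or completed by another thread. In this sketch (Java 8 or later; CallerRunsSplit and compute are hypothetical names), the calling thread runs parts from last to first and skips any that a pool thread has already picked up. Unlike the description above, it does not stop at the first part that was already started; it keeps going through all of them. The calling thread only blocks once every remaining part is running elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;
import java.util.function.UnaryOperator;

public class CallerRunsSplit {

    // Parts 0..N-2 are offered to the pool; the calling thread runs partN-1 itself
    // and then walks backwards over the other parts.
    public static double[][] compute(double[][] parts, UnaryOperator<double[]> work, Executor pool) {
        List<FutureTask<double[]>> tasks = new ArrayList<>();
        for (double[] part : parts) {
            tasks.add(new FutureTask<double[]>(() -> work.apply(part)));
        }
        for (int i = 0; i < tasks.size() - 1; i++) {
            pool.execute(tasks.get(i));
        }
        // FutureTask.run() does nothing if a pool thread has already started (or
        // finished) that part, so each part is computed exactly once.
        for (int i = tasks.size() - 1; i >= 0; i--) {
            tasks.get(i).run();
        }
        // Only parts that a pool thread is still running are left to wait for.
        double[][] results = new double[parts.length][];
        try {
            for (int i = 0; i < tasks.size(); i++) {
                results[i] = tasks.get(i).get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return results;
    }
}
```

Because the caller runs every part that nobody has started, this split completes even with an executor that never runs anything, such as `Executor pool = r -> {};`.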
Approach 3, the only solution I see
I assume I must use 2 thread pools, one for the hard work and one for all the waiting. So I would have a fixed size thread pool for the hard work and (a dynamic?) one for the waiting.
3.a.: But that means that the Split Task must only spawn threads from the waiting pool and the Task doing real work will spawn a new thread from the work pool and wait for it to complete. Ugly, but should work. Ugly because at the moment the whole threading support is all in the Split Task, but with this solution other Tasks doing the hard work must know about threading.
3.b.: Another approach would be that Split spawns worker threads, but inside split each waiting must be done by a waiting thread while the current thread also does worker thread tasks in the meantime. With this, all threading support is in Split Task class, but I’m not sure how to implement that.
2a) How can I wait for the tasks without blocking the current thread?
2b) Can I return the current thread to the worker thread pool, let a waiter thread wait and then after waiting continue with the previous current thread or a thread from the worker pool? How?
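Approach 3.a can be sketched with two pools (Java 8 or later; TwoPools, leaf and split are hypothetical names, and Math.sqrt stands in for the real hard work). The hard work only ever runs on a fixed-size pool, while the parts of a Split wait on a cached pool that grows as needed, so a waiting part never occupies one of the fixed work threads. As the question notes, the leaf task has to know about threading in this design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class TwoPools implements AutoCloseable {
    // Fixed size: the only place where the 'hard work' runs.
    private final ExecutorService work;
    // Grows as needed: this is where waiting happens.
    private final ExecutorService waiting = Executors.newCachedThreadPool();

    public TwoPools(int workThreads) {
        this.work = Executors.newFixedThreadPool(workThreads);
    }

    // Leaf task (3.a): hands its computation to the work pool and waits for it.
    public double leaf(double x) {
        Future<Double> result = work.submit(() -> Math.sqrt(x));
        return join(result);
    }

    // Split node: each part is started on the waiting pool; only the leaf's
    // actual computation occupies a work thread.
    public double[] split(double[] xs) {
        List<Future<Double>> parts = new ArrayList<>();
        for (double x : xs) {
            Future<Double> part = waiting.submit(() -> leaf(x));
            parts.add(part);
        }
        double[] out = new double[xs.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = join(parts.get(i));
        }
        return out;
    }

    private static <T> T join(Future<T> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() {
        work.shutdown();
        waiting.shutdown();
    }
}
```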
Other solutions
Don’t use a fixed sized thread pool.
3) Is my idea to have 8 threads wrong? But how many then if the hierarchies can be deep? And isn’t there the risk that the JVM starts many tasks in parallel and switches a lot between them?
4) What do I miss or what would you do to solve that problem?
[EDIT]
Accepted Solution and why I try something different (based on approach 2)
I accepted the ForkJoinPool as the correct solution.
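For the ForkJoinPool solution, a Split node maps naturally onto RecursiveTask. This is a generic sketch (Java 8 or later), not the asker's code: it sums squares over an array, THRESHOLD is a hypothetical sequential cut-off, and the pool size is passed in. The important difference from a plain fixed thread pool is join(): a ForkJoinPool worker that joins a subtask can execute other pending tasks while it waits instead of simply blocking, which is what questions 2a) and 2b) ask for.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSplit extends RecursiveTask<Long> {

    // Hypothetical cut-off: ranges at or below this size are computed sequentially.
    static final int THRESHOLD = 1_000;

    private final long[] data;
    private final int from;
    private final int to;

    ForkJoinSplit(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i] * data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSplit left = new ForkJoinSplit(data, from, mid);
        ForkJoinSplit right = new ForkJoinSplit(data, mid, to);
        left.fork();                     // offer the left half to idle workers
        long rightSum = right.compute(); // this worker continues with the right half
        return rightSum + left.join();   // join() may run other queued tasks meanwhile
    }

    public static long sumOfSquares(long[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ForkJoinSplit(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```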
However, some of the details and possible overhead and loss of control make me want to try another approach. But the more I think about it, the more I come back to using ForkJoinPool (see Note at the end for the reason). Sorry for the amount of text.
«However, no such adjustments are guaranteed in the face of blocked IO or other unmanaged synchronization.»
«maximum number of running threads to 32767»
«The documentation for the ForkJoin framework suggests creating parallel subtasks until the number of basic computation steps is somewhere over 100 and less than 10,000.»
The ‘hard work’ Tasks read a lot of data from disk, and each one is far above 10,000 basic computation steps. Actually, I could fork/join it down to acceptable levels, but that is too much work for now because that part of the code is rather complex.
I think approach 3a is basically an implementation of ForkJoin, except that I would have more control and probably less overhead, and the problems just mentioned above should not exist (but there is no automatic adaptation to the CPU resources provided by the OS; I will force the OS to give me what I want if I have to).
I may try approach 2 with some changes: that way I can work with an exact number of threads and I don’t have any waiting threads, whereas ForkJoinPool seems to work with waiting threads, if I understand it correctly.
The current thread does jobs until all jobs in this Split instance are being run by a worker thread (so work stealing in the Split node, as before), but then it does not call future.get(); it just checks whether all futures are ready with future.isDone(). If not all are done, it steals a job from the thread pool and executes it, then checks the futures again. That way the thread never waits as long as there is a single job that is not yet running.
The Ugly: if there is no job to steal I would have to sleep for a short time and then check the futures again or steal a new job from the pool (is there a way to wait for multiple Futures to all be complete with a timeout that will not cancel the computations if it triggers?)
So I think I have to use a Completion Service for the ThreadPool in each Split Task, then I can poll using a timeout and do not need to sleep.
Assumption: the ThreadPool in the Completion Service can still be used like a normal ThreadPool (e.g. Job stealing). One ThreadPool can be in many Completion Services.
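The stealing idea combined with a CompletionService might look like the sketch below (Java 8 or later; HelpingSplit and runAll are hypothetical names). Assumptions: the pool is a ThreadPoolExecutor; "stealing" means removing a queued job with getQueue().poll() and running it in the calling thread, which is an unorthodox use because the ThreadPoolExecutor documentation says queue access is intended primarily for debugging and monitoring; and the 10 ms timeout is arbitrary. CompletionService.poll(timeout, unit) simply returns null when the timeout expires and does not cancel any computation, which answers the parenthetical question about timeouts. One ExecutorCompletionService is created per Split, and many of them share the same pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HelpingSplit {

    // Submits all parts, then keeps the calling thread busy until every part is done.
    // Results are returned in completion order, not submission order.
    public static <T> List<T> runAll(List<Callable<T>> parts, ThreadPoolExecutor pool) {
        CompletionService<T> done = new ExecutorCompletionService<>(pool);
        for (Callable<T> part : parts) {
            done.submit(part);
        }
        List<T> results = new ArrayList<>();
        try {
            while (results.size() < parts.size()) {
                Future<T> finished = done.poll(); // non-blocking check
                if (finished == null) {
                    Runnable stolen = pool.getQueue().poll(); // a job nobody has started yet
                    if (stolen != null) {
                        stolen.run(); // may belong to another Split
                        continue;
                    }
                    // Nothing to steal: wait briefly; the timeout cancels nothing.
                    finished = done.poll(10, TimeUnit.MILLISECONDS);
                }
                if (finished != null) {
                    results.add(finished.get());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return results;
    }
}
```

Because a waiting Split keeps executing queued jobs, a Split running on a pool thread does not deadlock even with a single-thread pool. The trade-off is that a stolen job may belong to another Split and may run for a long time, delaying this Split's return.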
I think this is the optimal solution for the problem detailed in the question. However, there is a small problem with that, see the following.
Note:
After looking at the ‘hard’ tasks again, I see that many of their instantiations can be parallelized. So adding threading there too is the next logical step. These are always leaf nodes, and the work they do is best done with a completion service (in some cases sub-jobs have different runtimes, but any 2 results can build a new job). To do them with the ForkJoinPool I would have to use managedBlock() and implement ForkJoinPool.ManagedBlocker, which makes the code more complex. However, using a CompletionService in these leaf nodes also means that my approach-2-based solution will probably need waiting threads too, so I had better go with ForkJoinPool.
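For the leaf nodes, the blocking take() of a CompletionService can be wrapped in a ForkJoinPool.ManagedBlocker, so that when it is called from a ForkJoinPool worker, the pool may activate a spare thread while this one waits. A sketch assuming Java 8 or later; ManagedTake, TakeBlocker and takeNext are hypothetical names. When called outside a ForkJoinPool, managedBlock() just runs the blocker in the calling thread.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class ManagedTake {

    // Wraps CompletionService.take() so a ForkJoinPool knows this worker may block.
    static final class TakeBlocker<T> implements ForkJoinPool.ManagedBlocker {
        private final CompletionService<T> service;
        private Future<T> taken;

        TakeBlocker(CompletionService<T> service) {
            this.service = service;
        }

        @Override
        public boolean block() throws InterruptedException {
            if (taken == null) {
                taken = service.take(); // the actual blocking call
            }
            return true; // no further blocking needed
        }

        @Override
        public boolean isReleasable() {
            // Lets the pool skip blocking entirely if a result is already available.
            return taken != null || (taken = service.poll()) != null;
        }
    }

    // Returns the next finished result, blocking in a ForkJoinPool-friendly way.
    public static <T> T takeNext(CompletionService<T> service) {
        TakeBlocker<T> blocker = new TakeBlocker<>(service);
        try {
            ForkJoinPool.managedBlock(blocker);
            return blocker.taken.get(); // already completed, so get() returns immediately
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```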
How to make Java application use all CPU groups
I am running a Java application on an Intel(R) Xeon(R) machine which has 72 CPUs (with Hyperthreading on). Since Windows groups the CPUs into two processor groups if there are more than 64 of them (https://msdn.microsoft.com/en-us/library/dd405503(VS.85).aspx), the Java application uses only 1 group, in other words 36 CPUs, and thus does not use the machine at capacity. See the snapshots below for the grouping and CPU usage details. As we can see, the Java application is using its 36 CPUs at capacity but is not able to use the other CPUs. I added -XX:+UseNUMA to the JVM parameters, but it did not work, and enabling node interleaving did not help either. Does anyone know of a JVM option to make it use all CPU groups, i.e. a way to span a Java application over multiple nodes?
Just curious: what does Java show for Runtime.getRuntime().availableProcessors(); on that machine?
Not sure if that helps, but maybe you could try other JVMs, like ibm.com/developerworks/java/jdk/java8 (I believe you can download IBM java for free, worst case, you might have to register an IBM id).
Michael, the number of available processors is shown as 36 when I run Runtime.getRuntime().availableProcessors().
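To compare what the JVM sees with the number of logical processors Windows reports (72 here, split into two groups), a small check like this can be run on the machine; if it prints 36, the JVM is only seeing one processor group. The Java version and OS name are printed as well for reference when comparing different JVMs. CpuInfo and visibleCpus are illustrative names; Java 8 or later is assumed.

```java
public class CpuInfo {

    // Number of processors the JVM believes it may use.
    public static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors = " + visibleCpus());
        System.out.println("java.version        = " + System.getProperty("java.version"));
        System.out.println("os.name             = " + System.getProperty("os.name"));
    }
}
```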
How to tell JRE to use more than one CPU node
I have a .jar file that I want to run on a supercomputer. There are some 40 CPU nodes available but Java uses only one of them when running my program. Is there any way to tell Java to use all the available nodes to run a given program (preferably without recompiling the program)?
Also, your software is multi-threaded, right? Upgrading to Java 8 won’t give you anything if your software runs in one thread anyway. What you would do then is start multiple instances, one per core, and ensure that they are properly orchestrated, if required.
Yes, it is. I can specify the number of threads the program uses (which I set to a pretty large number). But it still uses only one core (JRE’s fault I guess).
1 Answer
Java always uses all the available CPUs by default. When it starts, it even creates a thread for (roughly) every logical CPU just for performing GC.
There are some 40 CPU nodes available but Java uses only one of them when running my program
The problem is that your program only uses one thread because that is how you wrote it. If you want your program to use more CPUs, you will have to perform more independent tasks in your program so it can utilise more CPUs.
Is there any way to tell Java to use all the available nodes to run a given program
Yes, but you have to do this in your code, otherwise it won’t magically use more CPUs than your program tells it to.
Either the JRE which was released 20 years ago has a bug which stops multi-threading working and no one has fixed it in that time even though about 10 million developers use it, or your program you just wrote which only one person uses has a bug in it. Which one sounds like it is more likely to have a bug?
If you want to see that Java can use all the CPUs on your machine, try this:
```java
import java.util.stream.IntStream;

public class Main {
    public static void main(String[] args) {
        // 1000 independent one-second busy-wait tasks spread over the common ForkJoinPool
        IntStream.range(0, 1000).parallel().forEach(i -> {
            long end = System.currentTimeMillis() + 1000;
            while (end > System.currentTimeMillis()) {
                // busy wait
            }
            System.out.println("task " + i + " finished");
        });
    }
}
```
This will use all the logical CPUs on your machine without any special options.
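The same demonstration can be written with an explicitly sized thread pool (Java 8 or later; ExplicitPool, run and its parameters are illustrative). It records the names of the pool threads that actually did work; with enough tasks, the count should approach availableProcessors(), which again shows that the parallelism comes from the program's own structure rather than from a JVM option.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExplicitPool {

    // Runs `tasks` independent busy-wait tasks of `millis` ms each on a fixed pool
    // with one thread per logical CPU and returns the names of the threads used.
    public static Set<String> run(int tasks, long millis) {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus);
        Set<String> workers = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                long end = System.currentTimeMillis() + millis;
                while (System.currentTimeMillis() < end) {
                    // busy wait
                }
                workers.add(Thread.currentThread().getName());
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return workers;
    }

    public static void main(String[] args) {
        Set<String> used = run(100, 1000);
        System.out.println(used.size() + " pool threads did work");
    }
}
```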
I have seen Java running in production on servers with 120 CPUs and 3 TB of main memory, and it works just fine. You can run into some problems when a Java program spans multiple NUMA regions, but I doubt that is your issue here.