Xeon - not utilising more than 32 threads

Forum to discuss and compare Hardware profiles and Benchmarking
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#61 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, this is along the lines of what I assumed. Regarding single-threaded command-line applications: I assume multiple instances of these will also be assigned to NUMA nodes according to best practices for efficiency.
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#62 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

Correct. You would want to keep track of how many you have running and where. You also want to "balance" the 2 nodes so they are both using approximately the same number of threads.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#63 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I've been running my 72 thread system under Win 7 again this afternoon and can confirm again that if you run a single BOINC client and allow the OS to manage the nodes, CPU utilization varies from 50-100%. Most of the time I saw utilization between 90-96%; it occasionally peaked at 100% but just as often fell below 75%. It ran this way for a couple of hours. Then I ran two clients, specifying the NODE and AFFINITY for each client, and CPU utilization stayed pegged at 100%.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#64 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, so I now have the 44 core / 88 thread machine up and running, and it is full of lovely scrumptious goodness. I was easily able to run at full 100% utilisation on each core (for 4 hours continuously) with my own crunching project, starting 88 separate threads from the command line. It also seems appreciably quicker per core than my two 28 core E5-2658 v4 powered machines: something like 40% quicker per core, which is hard to believe. I have to look into that.

I have noticed that the generic BOINC client downloaded from the Berkeley site seems to generate fewer points when crunching for WCG than the WCG client downloaded from www.worldcommunitygrid.org. That in itself is strange, and I have to look into it too. I downloaded the WCG-only client and tried running it (single instance) on the 44 core machine. Interestingly, even though it is 32 bit, it is running 64 processes concurrently (not 32 as before) and using 92% of CPU availability as reported by Task Manager. Have they updated this client? I don't think so. So what is going on?

I imagine that by running an additional client (as per the instructions mentioned above) I will get full 100% CPU utilisation, even when using the 32 bit client. Interesting!
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#65 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Guys... I have been absent for a while, crunching my own projects. Is there still interest in this topic?

I have some weird results. Recall I have a 44 core machine running Windows 10. The machine has two sockets, each with a 22 core Xeon, for a total of 88 threads or virtual cores with HT on.

For my own crunching, I have been running 88 separate threads, each thread a separate console process. Each process is single-threaded; hence I am relying on the OS to handle the parallelization of the data processing and scheduling of multiple threads when I launch multiple copies of the same process from the command line (each one handling a different chunk of data to be crunched via command line arguments).

The Windows scheduler does some funny stuff. It seems to cycle from all vCores at 100%, to half the vCores at 100% and the other half at 0%, back to all vCores at 100%, then to the opposite half at 100% and the first half at 0%, and then it repeats. Suppose we divide the 88 threads into two groups, A and B. The Windows scheduler at first assigns equal processing to the 88 threads divided between group A and group B. Then it slowly moves to giving all priority to group A and none to group B (you can see this in Task Manager: half the vCores are at 0%, half at 100%, divided neatly in the middle of the core activity display, and also half the threads are active, half more or less inactive), then back to sharing equally between group A and group B, then transitioning to giving no priority to group A and everything to group B (and here you see the core activity display in Task Manager inverted). The CPU frequency is throttled accordingly, too; when half the vCores are active, overall turboboost is 3.7 GHz, and when all cores are active, overall turboboost drops to only 2.7 or 2.8 GHz.

Does this make any sense to anyone? The OS has Group 0 and Group 1 NUMA nodes. It doesn't seem to make a difference if I explicitly start half the processes on Group 0 and half on Group 1 (using START /NODE 0 and START /NODE 1). And you wouldn't expect it to, either: clearly the OS is capable of assigning all the processes across both Nodes evenly even without this explicit Node assignment, as evidenced by the fact that all vCores are at 100% some of the time, when turboboost is at 2.7/2.8 GHz.

It's frustrating because clearly I am not getting the full potential out of this expensive machine. I also have two 28 core machines. These run the exact same processes, 56 at a time with HT on and with 100% utilisation all the time, for days or weeks at a time, no issues whatsoever, no core dropouts below 100%, nothing. Per core processing on the 44 core machine is up to 50% slower, as evidenced by some quick and dirty benchmarking, and this despite the fact that the 44 core machine has faster cores: it turboboosts all cores simultaneously to 2.8 GHz max, while the 28 core machines turboboost all cores to 2.5 GHz.

If anyone can help me figure out what is going on I would be most grateful. I am going to try fiddling with settings in the BIOS. One thing to try will be to turn HT off. But I know that if HT is on and all cores are utilised at 100%, overall per-thread processing is 10-15% faster, so turning HT off is not really the ideal solution even if it improves things a bit.

One final detail that might offer a clue as to what is going on is that this cycling from 100% to 50% and back again correlates with the way the processing of the data chunks is lined up in batch files. Each process is launched from the command line as I mentioned. I have a script that automatically writes batch files, one batch file per virtual core. So there are 88 batch files, and each one contains a series of command line launch commands to be executed one after the other. The idea is that each core will process these one at a time. The cycling from initial 100% vCore usage to 50% correlates with the time all the cores take to process the first command line in each of their batch files. It then cycles back from 50% to 100% with the processing of the second line in the batch file (but with the 50% usage corresponding to the other half of the vCores). And so on. So somehow the scheduling of the command line processes is interacting with the way Windows 10 schedules the processes across all the vCores and Groups to produce this weird cycling behaviour and under-utilisation of the overall processing power available. So perhaps there is a way of scheduling the process tasks from the command line that will not produce this effect. But I can't really think beyond that.
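
One way to test whether the fixed one-batch-file-per-vCore layout contributes to the cycling is to feed all the jobs through a single shared queue, so that any worker that finishes early immediately picks up the next job instead of waiting at a batch boundary. What follows is only a minimal sketch in Python, not the setup used in this thread: the `run_all` helper and the stand-in job commands are hypothetical, and it addresses load balancing only. It does not set NUMA node or affinity, and nothing here confirms it would cure the behaviour described above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, workers):
    """Run every command, with at most `workers` running at once.

    `commands` is a list of argv lists. Jobs are handed out from one
    shared queue, so a free worker always takes the next pending job.
    Returns the exit codes in the same order as `commands`.
    """
    def run_one(argv):
        # Each job is a separate single-threaded OS process.
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical stand-ins for the real crunching commands.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(8)]
    print(run_all(jobs, workers=4))
```

With the real workload, `jobs` would hold the 440 command lines and `workers` would be the number of vCores to keep busy.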
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#66 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

What is the command line you're using? When you use the start command, are you setting just the node or node and affinity?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#67 Re: Xeon - not utilising more than 32 threads

Post by noetus »

START /NODE 0 command.bat

One of these for each vCore, splitting equally between NODE 0 and NODE 1. The file command.bat then contains a list of commands to run serially. I have been assuming that the commands in the batch file will inherit the NODE setting from the calling START command, but now that I think about it, perhaps that doesn't make sense.

[Edit]. I am adding screenshots of the sequence so you can see what happens. It starts out at 100% on all vCores, then gradually there is a reduction in the CPU usage for one half of all the cores, as you can see in these Task Manager grabs. After some minutes there is only CPU activity for half the vCores, then the other half gradually start up until they are all at 100%. Then the opposite half starts gradually reducing, until there are only half the number (the other half) with activity. And so on.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#68 Re: Xeon - not utilising more than 32 threads

Post by noetus »

So I did some quick and dirty benchmarking. I ran 440 identical processes on a single set of data, with each process analysing 1/440 of the data. With hyperthreading on, I had 88 processes running simultaneously, with 5 repetitions of each process (on a different chunk of data each time). The total time to complete all the processes, across all cores, was 7704.5 secs or 128.4 minutes or 2 and 1/8 hours.

Then I turned hyperthreading off, and ran the test again. This time I had 44 processes running simultaneously, half the number, with double the number of repetitions of each process (on a different chunk of data each time) per core, i.e. 10. The total time to complete all the processes, across all cores, was 4605.5 secs or 76.76 minutes or an hour and a quarter.

That's an astounding difference. Clearly all the core slowdowns and massive scheduling and rescheduling Windows does to all the processes with hyperthreading on, really kills it (despite the increased turboboost when fewer cores are utilised at the same time, as can be seen in the above grabs of Task Manager). Or is there another explanation? I'd like to run the benchmark again with hyperthreading on but with full 100% utilisation of the CPUs all the way through - but I don't know how to get it to do that on this machine with >64 cores.
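
To put a number on that difference, the two wall-clock totals can be compared directly. A small worked calculation in Python, using only the two timings quoted above and treating the first total as seconds (7704.5 only matches 128.4 minutes if it is); the variable names are just for illustration:

```python
# Total wall-clock times quoted above, taken as seconds.
ht_on_secs = 7704.5   # 88 simultaneous processes, 5 rounds each
ht_off_secs = 4605.5  # 44 simultaneous processes, 10 rounds each

ht_on_mins = ht_on_secs / 60
ht_off_mins = ht_off_secs / 60

# How many times longer the hyperthreading-on run took.
slowdown = ht_on_secs / ht_off_secs

# Fraction of the hyperthreading-on time saved by turning it off.
time_saved = 1 - ht_off_secs / ht_on_secs

print(f"HT on:  {ht_on_mins:.1f} min")
print(f"HT off: {ht_off_mins:.2f} min")
print(f"HT on took {slowdown:.2f}x as long; HT off saved {time_saved:.0%}")
```

So on this machine the same 440 jobs took roughly 1.67 times as long with hyperthreading on, or about 40% less time with it off.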

For now I'll simply turn hyperthreading off. Once my two 28 core machines have finished their current task (another week) I will run the test on them, too, to see how much of a difference hyperthreading makes on those machines for a bunch of OS-managed parallelization of single-thread processes.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#69 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

On your 88 thread system use these commands. Start half the processes with one command line and then use the other command line for the other half.

START /NODE 0 /AFFINITY 0xFFFFFFFFFFF command.bat

START /NODE 1 /AFFINITY 0xFFFFFFFFFFF command.bat
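
The mask 0xFFFFFFFFFFF has 44 bits set, one per logical processor on a node, assuming each NUMA node here exposes 22 cores x 2 hyperthreads. According to the help text for START, when /NODE and /AFFINITY are combined the mask is read relative to the node, starting at bit zero, which would explain why the same mask is used for both nodes. A minimal Python sketch for deriving such a mask for other processor counts (the `affinity_mask` helper is made up for illustration):

```python
def affinity_mask(logical_processors):
    """Return a hex mask with the lowest `logical_processors` bits set."""
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    return "0x%X" % ((1 << logical_processors) - 1)


# 22 cores x 2 hyperthreads per node, as on the 88 thread machine.
print(affinity_mask(44))
# The same node with hyperthreading switched off.
print(affinity_mask(22))
```

For 44 logical processors this gives the mask used in the commands above; with hyperthreading off, a 22-bit mask (0x3FFFFF) would be the equivalent.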
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#70 Re: Xeon - not utilising more than 32 threads

Post by noetus »

This unfortunately made no difference. You still see a cycle of both Numa nodes utilised at 100%, then gradually one of them goes to zero, then it increases back to 100%, then the other decreases to zero, then it goes back to 100%, and so on. I confirmed it this time with the Numa Node view in Task Manager.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#71 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I would guess it has something to do with your specific application and a bottleneck with the other system resources such as disk, network or memory. What do your memory, disk and network resource stats look like with no instances of your program running and what do they look like with 1, then 2, then 3 and so on running?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#72 Re: Xeon - not utilising more than 32 threads

Post by noetus »

The processes are compute-bound. There is very little system resources used while they are active. Disk I/O, Network I/O, and memory use are all low (the system has 32 GB of RAM and each process uses about 100 MB, about 9-10GB of RAM total, divided between the 88 processes). The processes run great on my other two systems, of 28 cores each, with either hyperthreading turned on or off. It must have something to do with the way scheduling is interacting with the NUMA nodes for >64 processes, don't you think?
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#73 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

I have .bat files set up to launch instances of BOINC. I'm trying to blow off the cobwebs, but I ran into a problem when putting the START command inside the same .bat file. What works on my systems without fail is to create the .bat file that sets up the program/thread and then call it from a second .bat file that contains the START command and points to the first one. What I'm suggesting is you create the file you want to execute (filename.bat) and then call it from another command file that has the START /NODE /AFFINITY line for filename.bat.

Install ProcessLasso onto your Windows system. From there you can see where each of the threads is allowed to run. It will show the affinity. If you've set them up as 1/2 on each CPU it should show the affinity on 1/2 of them as 0-23 and the other 1/2 should show 24-43. If it doesn't, then the NUMA node is not being set up correctly.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#74 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Exactly what model CPUs are these? Are they retail or ES?

And running some WCG for the next 24 hours might help us diagnose the issue :-)
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#75 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Thanks for all these comments. The affinity is correct as you described; I had already checked that. As I mentioned before, the core dropout on one NUMA node corresponds to when one batch file is finishing up and the next one is taking over from it. The explanation has to have something to do with that. However, I've now checked whether hyperthreading makes any difference on my 28 core systems, which are working perfectly (an identical hardware setup apart from the fewer cores). For these single-threaded command line processes that run in parallel in the OS, there is no measurable difference between hyperthreading on or off. I can run twice the number of processes with hyperthreading on and have a batch list that is half as long, or have fewer processes with hyperthreading off and a longer batch list. The total time to completion is within statistical error of being identical. Therefore the simplest solution for me now is to turn hyperthreading off on the 44 core system, which brings the thread count below 64, and all is well (I've already tested it).

Perhaps I will come back to this later to try to get to the root of the problem!

[EDIT] The CPU model nos are E5-2696 @ 2.8 GHz when all cores are at 100%. They are retail chips.
Dirk Broer
Corsair
Posts: 1962
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#76 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

With the new Threadripper coming out with 32 cores/64 threads and future EPYCs rumoured to get yet more cores (64c/128t), this thread about threads is getting more and more interesting... A dual EPYC2 mobo might run into yet another limit with its 256 threads, I fear.
Dirk Broer
Corsair
Posts: 1962
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#77 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

There is computing outside of x86, of course. IBM's Power9 comes with either 12 cores with 8-way SMT or 24 cores with 4-way SMT, and their systems can house sixteen (16!) sockets.
That is 16x 12x 8, or 16x 24x 4, both ending up with 1536 threads per system, max. These systems are not meant for Joe Sixpack though (but who is stopping sysadmin Sixpack from 'burning in' the new company server with a 7-day long BOINC run? :wink: ), and they won't be running Windows either.
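
The thread counts mentioned in this thread all follow from sockets x cores per socket x hardware threads per core. A quick Python check, using only the configurations named above (the `hw_threads` helper is just for illustration):

```python
def hw_threads(sockets, cores_per_socket, threads_per_core):
    """Total hardware threads the OS would see for one system."""
    return sockets * cores_per_socket * threads_per_core


print(hw_threads(2, 22, 2))    # the dual 22 core Xeon with HT on
print(hw_threads(2, 64, 2))    # the rumoured dual 64c/128t EPYC board
print(hw_threads(16, 12, 8))   # Power9, 12 cores with 8-way SMT
print(hw_threads(16, 24, 4))   # Power9, 24 cores with 4-way SMT
```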