Xeon - not utilising more than 32 threads

Forum to discuss and compare Hardware profiles and Benchmarking
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#51 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Bryan wrote: Fri Jun 02, 2017 8:38 pm
scole250 wrote: Fri Jun 02, 2017 8:26 pm I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
That was too easy, I should have asked for more :roll:
And I'll be glad to pass along my findings for just a few measly bitcoins. :lol:
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#52 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

scole250 wrote: Fri Jun 02, 2017 8:52 pm
Bryan wrote: Fri Jun 02, 2017 8:38 pm
scole250 wrote: Fri Jun 02, 2017 8:26 pm I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
That was too easy, I should have asked for more :roll:
And I'll be glad to pass along my findings for just a few measly bitcoins. :lol:
:lol: I knew there was a catch! The bitcoins are in the mail watch for them.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#53 Re: Xeon - not utilising more than 32 threads

Post by noetus »

A point of clarification: was the earlier claim of a limit of 64 threads (by Bryan) related to Boinc specifically or was it supposed to cover any set of processes one might care to run?

In my own coding I do not code multi-threaded apps. I tried that and it didn't work out well (thread management was an issue and requires more complex coding than I am really capable of right now). Instead for naturally parallelisable tasks I code for single threads and then run multiple instances from the command line, letting the OS manage the multitasking.
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#54 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

It is a basic function/limitation of any flavor of the Windows OS to include the server versions. It has nothing to do with BOINC specifically. You will run into the same problem with your program.

We did find the solution using start /node X /affinity 0xFFFFFFFFF [your_program.exe]. You can execute the command from the command window or from a batch/cmd file. When the program is launched you assign it to either node 0 or node 1, and then you set the affinity, which controls which processors in that node the program is allowed to use. The 0xFFFFFFFFF is a bit mask for the allowed threads in the node. What I showed would allow the program to use any of 36 threads. In your case you will need to add 2 more F's.

Check out the syntax/manual for the "start" command. Down in the multiprocessor section you will find the NUMA stuff.

https://ss64.com/nt/start.html

There are also NUMA commands you can use programmatically that will do the same thing, and with those you can actually go down to assigning at the thread level rather than at the program level.
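The mask arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name affinity_mask is made up for illustration, and it assumes the allowed threads in a node are counted from bit 0 upward with no gaps.

```python
def affinity_mask(threads_in_node: int) -> str:
    """Return a hex mask with one bit set per allowed thread in the node."""
    if threads_in_node < 1:
        raise ValueError("need at least one thread")
    mask = (1 << threads_in_node) - 1  # lowest N bits set
    return f"0x{mask:X}"


# 36 threads per node (72-thread machine) and 44 per node (88-thread machine).
print(affinity_mask(36))
print(affinity_mask(44))
```

Each F covers 4 threads, so 36 threads need 9 F's and 44 threads need 11, which is where the "add 2 more F's" comes from.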
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#55 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

To be clear about this, you can run your 88 threads from a program without a problem. However the efficiency will be the same as if you were running 64 threads.

There are 64 NUMA memory pipes available. If you run 64 threads or less then each thread will run at 100% load. If you enable 88 threads then Windows will start "sharing" the memory pipes. So the 24 threads (above the 64) would start sharing the memory pipes with 24 of the 1st 64 and therefore would only be running at 50% efficiency because 1/2 the time they are running and 1/2 the time they are waiting for memory.

You would wind up with 40 threads running at full load and 48 threads running at 50% load ... ie 64 threads. Since the memory channels would have to be continuously loaded/unloaded it would actually be less efficient than just running 64 threads.

So to overcome this limitation you would set your program up to launch 44 threads at Node 0 (CPU0) and then 44 threads to Node 1 (CPU1). Then all 88 threads would be running at 100% loading.

We proved the technique works as I had described in my post to Pete. We launched one BOINC client to node 0 and a 2nd BOINC client to node 1. On our 72 thread machines all threads were running at 100% load.
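The 40 + 48 arithmetic above can be written out as a small Python sketch. It only encodes the sharing model as described in this post (each thread beyond 64 pairs up with one of the first 64, and both run at half speed); it is not a measurement of what Windows actually does, it ignores the extra load/unload overhead mentioned, and the function name is made up.

```python
def effective_threads(requested: int, limit: int = 64) -> float:
    """Full-speed-thread equivalent under the sharing model described above."""
    if requested > 2 * limit:
        raise ValueError("the model in the post does not cover this case")
    if requested <= limit:
        return float(requested)
    extra = requested - limit   # threads beyond the limit
    shared = 2 * extra          # each extra thread pairs with one of the first 64
    full = requested - shared   # threads that do not share
    return full + 0.5 * shared


# 88 threads: 40 at full load plus 48 at half load.
print(effective_threads(88))
```

Under this model anything between 64 and 128 requested threads comes out as 64 full-speed equivalents, which is the point being made: going past 64 in one group buys nothing.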
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#56 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Actually, our observations have been that even if you limit the number of threads to 64 they still won't run as efficiently as they could. Another thing that happens with NUMA is that it was designed to use memory on the bus of the other CPU if needed (or not), and it will do so at a loss of performance. To get the best performance, you must use the start command directives /NODE and /AFFINITY to restrict processes to the same NUMA node wired to the processor.

Seeing is believing, and there are two things to set up so you can observe efficiency and affect it:
1. Install BoincTasks so you can see what kind of CPU utilization WUs are running at. http://efmer.com/b/?q=boinctasks_download If you are running Boinc on more than one system, it's the only way to go to manage things. It will require you to configure each Boinc client to allow remote GUI access from other systems.
2. Set up your systems to run multiple Boinc clients. See info here: https://www.tsbt.co.uk/forum/viewtopic.php?f=172&t=3140 Not sure if you have access to that area yet. If not, let us know and we'll get it moved to an open access area.

A lot of things to do and understand in those 2 items. Going to take a little effort to set it all up. I don't think we can easily give a "go do A, B, C, D" list of instructions. If you have questions though, feel free to ask.
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#57 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

There are a couple of caveats to those wanting to run more than 64 threads of BOINC.

1. There are a couple of projects, like Yoyo, that don't allow multiple BOINC clients.

2. If the project uses VBox then it doesn't work because you only have one instance of VBox installed and it therefore falls under the 64 NUMA thread rule. You can assign a boinc client to each node but when it calls VBox then VBox is limited to 64 threads.

On standard BOINC projects it is phenomenal ... beats the heck out of having to turn off HT when you have no choice but to run Windows (like Gerasim).
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#58 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#59 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

noetus wrote: Thu Jun 08, 2017 12:57 pm Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
I think you'll see a higher benchmark with 88 threads vs. 64 but you won't get anywhere near 100% utilization on all 88 threads at the same time. In order to get the most utilization of all 88 threads under Windows, you'll need to make sure 44 processes run on threads 0-43 on NUMA node 0 and the other 44 processes run on threads 44-87 on NUMA node 1. That doesn't occur automatically. To get Boinc to run under those constraints, you must setup 2 boinc clients and run each using the start command with the correct NODE and AFFINITY options.
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#60 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

noetus wrote: Thu Jun 08, 2017 12:57 pm Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
No, I think Cinebench will use all 88 threads to their fullest capability. That is a professionally written benchmarking suite. That suite is used by every Tom, Dick, and Harry who tests computer systems. They would be fools to not take NUMA into account. If I were writing code to run benchmarks, one of the 1st things I would do is check the system topology to see how many CPUs, cores/threads, and NUMA nodes were available. I have absolutely no doubt that they do this and then manage the threads of the benchmark accordingly.

Don't forget the quote from Microsoft concerning the issue:
The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application. An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups.
I really think that Microsoft knows their product and how it operates :lol: Then again something may have changed in Win10 Enterprise but I haven't found any documentation along those lines. It also isn't mentioned in the new server editions including the "data center" versions. If it were to be changed that would be the logical place for it to occur.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#61 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, this is along the lines of what I assumed. Regarding single-threaded command line applications; multiple instances of these will also be assigned NUMA nodes according to best practices for efficiency I assume.
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#62 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

Correct. You would want to keep track of how many you have running and where. You also want to "balance" the 2 nodes so they are both using approximately the same number of threads.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#63 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I've been running my 72 thread system under Win 7 again this afternoon and can confirm again that if you run a single Boinc client and allow the OS to manage the nodes you will see CPU utilization vary from 50-100%. Most of the time I saw utilization between 90-96% and it rarely but occasionally peaked at 100% but just as often fell below 75%. It ran this way for a couple hours. Then I ran two clients specifying the NODE and AFFINITY for each client and CPU utilization stayed pegged at 100%.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#64 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, so I now have the 44 core / 88 thread machine up and running, and it is full of lovely scrumptious goodness. I was easily able to run at full 100% utilisation on each core (for 4 hours continuously) with my own crunching project, starting 88 separate threads from the command line. It also seems appreciably quicker per core than my two 28 core E5-2658 v4 powered machines, unbelievably quicker, like 40% quicker (per core), I have to look into that.

I have noticed that the generic BOINC client downloaded from the Berkeley site seems to generate fewer points when crunching for WCG than the WCG client downloaded from www.worldcommunitygrid.org. Which in itself is strange, have to look into that too. I downloaded the WCG-only client and tried running it (single instance) on the 44 core machine. Interestingly, even though it is 32 bit, it is running 64 processes concurrently (not 32 as before) and using 92% of CPU availability as reported by Task Manager. Have they updated this client? I don't think so. So what is going on?

I imagine that by running an additional client (as per the instructions mentioned above) I will get full 100% CPU utilisation, even when using the 32 bit client. Interesting!
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#65 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Guys... I have been absent for a while, crunching my own projects. Is there still interest in this topic?

I have some weird results. Recall I have a 44 core machine running Windows 10. The machine has two sockets, each with a 22 core Xeon, for a total of 88 threads or virtual cores with HT on.

For my own crunching, I have been running 88 separate threads, each thread a separate console process. Each process is single-threaded; hence I am relying on the OS to handle the parallelization of the data processing and scheduling of multiple threads when I launch multiple copies of the same process from the command line (each one handling a different chunk of data to be crunched via command line arguments).

The Windows scheduler does some funny stuff. It seems to cycle from half the vCores at 100% and the other half at 0%, to all vCores at 100%, to the halves swapped (the half that was at 0% now at 100%, and the other half at 0%), then repeating. Suppose we divide the 88 threads into two groups, A and B. The Windows scheduler at first assigns equal processing to the 88 threads divided between group A and group B. Then it slowly moves to giving all priority to group A and none to group B (you can see this in Task Manager: half the vCores are at 0%, half at 100%, divided neatly in the middle of the core activity display, and also half the threads are active, half more or less inactive), then back to half to group A and half to group B, then transitioning to giving no priority to group A and everything to group B (and here you see the core activity display in Task Manager inverted). The CPU frequency is throttled accordingly, too; when half the vCores are active, overall turboboost is 3.7 GHz, and when all cores are active, overall turboboost drops to only 2.7 or 2.8 GHz.

Does this make any sense to anyone? The OS has Group 0 and Group 1 NUMA nodes. It doesn't seem to make a difference if I explicitly start half the processes on Group 0 and half on Group 1 (using START /NODE 0 and START /NODE 1). And you wouldn't expect it to, either: clearly the OS is capable of assigning all the processes across both nodes evenly even without this explicit node assignment, as evidenced by the fact that all vCores are at 100% some of the time, when turboboost is at 2.7/2.8 GHz.

It's frustrating because clearly I am not getting the full potential out of this expensive machine. I also have two 28 core machines. These run the exact same processes, 56 at a time with HT on and with 100% utilisation all the time, for days or weeks at a time, with no issues whatsoever: no core dropouts below 100%, nothing. Per core processing on the 44 core machine is up to 50% slower as evidenced by some quick and dirty benchmarking, and this despite the fact that the 44 core machine has faster cores: it turboboosts all cores simultaneously to 2.8 GHz max, while the 28 core machines turboboost all cores to 2.5 GHz.

If anyone can help me figure out what is going on I would be most grateful. I am going to try fiddling with settings in the BIOS. One thing to try will be to turn HT off. But I know that if HT is on and all cores are utilised at 100%, overall per-thread processing is 10-15% faster, so turning HT off is not really the ideal solution even if it improves things a bit.

One final detail that might offer a clue as to what is going on is that this cycling from 100% to 50% and back again correlates with the way the processing of the data chunks is lined up in batch files. Each process is launched from the command line as I mentioned. I have a script that automatically writes batch files, one batch file per virtual core. So there are 88 batch files, and each one contains a series of command line launch commands to be executed one after the other. The idea is that each core will process these one at a time. The cycling from initial 100% vCore usage to 50% correlates with the time all the cores take to process the first command line in each of their batch files. It then cycles back from 50% to 100% with the processing of the second line in the batch file (but with the 50% usage corresponding to the other half of the vCores). And so on. So somehow the scheduling of the command line processes is interacting with the way Windows 10 schedules the processes across all the vCores and Groups to produce this weird cycling behaviour and under-utilisation of the overall processing power available. So perhaps there is a way of scheduling the process tasks from the command line that will not produce this effect. But I can't really think beyond that.
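One alternative to fixed per-core batch files, sketched here in Python, is a single shared queue: each worker slot picks up the next pending command as soon as it finishes one, so no slot waits on a neighbour's batch list. This is an untested idea, not a confirmed fix for the cycling. The function name and the placeholder jobs are made up, and the sketch does not set NUMA node or affinity at all.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_queue(commands, workers):
    """Run each command as its own process, at most `workers` at a time.

    A worker starts the next pending command as soon as its current one
    exits. Returns the exit codes in the same order as `commands`.
    """
    def run_one(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder jobs; replace with the real per-chunk command lines.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(8)]
    print(run_queue(jobs, workers=4))
```

How Windows places the child processes across the two nodes is a separate question that this sketch leaves open; one launcher per node, each started with START /NODE, might be one way to combine the two, but that is a guess.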
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#66 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

What is the command line you're using? When you use the start command, are you setting just the node or node and affinity?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#67 Re: Xeon - not utilising more than 32 threads

Post by noetus »

START /NODE 0 command.bat

One of these for each vCore, splitting equally between NODE 0 and NODE 1. The file command.bat then contains a list of commands to run serially. I have been assuming that the commands in the batch file will inherit the NODE setting from the calling START command, but now that I think about it, perhaps that doesn't make sense.

[Edit] I am adding screenshots of the sequence so you can see what happens. It starts out at 100% on all vCores, then gradually there is a reduction in the CPU usage for one half of all the cores, as you can see in these Task Manager grabs. After some minutes there is only CPU activity for half the vCores, then the other half gradually start up until they are all at 100%. Then the opposite half starts gradually reducing, until there are only half the number (the other half) with activity. And so on.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#68 Re: Xeon - not utilising more than 32 threads

Post by noetus »

So I did some quick and dirty benchmarking. I ran 440 identical processes on a single set of data overall, with each process analysing 1/440 of the data. With hyperthreading on, I had 88 processes running simultaneously, with 5 repetitions of each process (on a different chunk of data each time). The total time to complete all the processes, across all cores, was 7704.5 secs or 128.4 minutes or about 2 and 1/8 hours.

Then I turned hyperthreading off, and ran the test again. This time I had 44 processes running simultaneously, half the number, with double the number of repetitions of each process (on a different chunk of data each time) per core, i.e. 10. The total time to complete all the processes, across all cores, was 4605.5 secs or 76.76 minutes or about an hour and a quarter.

That's an astounding difference. Clearly all the core slowdowns and the massive scheduling and rescheduling Windows does to all the processes with hyperthreading on really kill it (despite the increased turboboost when fewer cores are utilised at the same time, as can be seen in the above grabs of Task Manager). Or is there another explanation? I'd like to run the benchmark again with hyperthreading on but with full 100% utilisation of the CPUs all the way through, but I don't know how to get it to do that on this machine with >64 threads.

For now I'll simply turn hyperthreading off. Once my two 28 core machines have finished their current task (another week) I will run the test on them, too, to see how much of a difference hyperthreading makes on those machines for a bunch of OS-managed parallelization of single-thread processes.
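For anyone who wants to check the numbers, here is the arithmetic as a short Python sketch. It assumes both totals are in seconds (7704.5 only matches 128.4 minutes that way); the variable names are made up.

```python
ht_on_secs = 7704.5   # 88 processes at a time, 5 rounds
ht_off_secs = 4605.5  # 44 processes at a time, 10 rounds


def to_minutes(seconds: float) -> float:
    return seconds / 60.0


# Ratio of wall-clock times for the same 440 chunks of work.
speedup = ht_on_secs / ht_off_secs

print(f"HT on:  {to_minutes(ht_on_secs):.1f} min")
print(f"HT off: {to_minutes(ht_off_secs):.2f} min")
print(f"HT off finished {speedup:.2f}x faster")
```

So the same 440 chunks took roughly two thirds longer with hyperthreading on than with it off on this machine.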
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#69 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

On your 88 thread system use these commands. Start half the processes with one command line and then use the other command line for the other half.

START /NODE 0 /AFFINITY 0xFFFFFFFFFFF command.bat

START /NODE 1 /AFFINITY 0xFFFFFFFFFFF command.bat
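Since the batch files are already written by a script, the launch lines could be generated the same way. A minimal Python sketch, assuming 2 nodes of 44 threads each; the function name and the numbered batch file names are hypothetical (only command.bat appears in this thread).

```python
def start_commands(nodes, threads_per_node, batch_names):
    """Build one START line per batch file, filling node 0 first, then node 1."""
    # One bit per thread in the node, e.g. 44 threads -> 11 F's.
    mask = f"0x{(1 << threads_per_node) - 1:X}"
    lines = []
    for i, name in enumerate(batch_names):
        node = i // threads_per_node
        if node >= nodes:
            raise ValueError("more batch files than hardware threads")
        lines.append(f"START /NODE {node} /AFFINITY {mask} {name}")
    return lines


# Hypothetical file names, one per vCore.
names = [f"command{i:02d}.bat" for i in range(88)]
cmds = start_commands(nodes=2, threads_per_node=44, batch_names=names)
print(cmds[0])
print(cmds[44])
```

The first 44 batch files go to node 0 and the remaining 44 to node 1, each with the same 44-bit mask.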
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#70 Re: Xeon - not utilising more than 32 threads

Post by noetus »

This unfortunately made no difference. You still see a cycle of both Numa nodes utilised at 100%, then gradually one of them goes to zero, then it increases back to 100%, then the other decreases to zero, then it goes back to 100%, and so on. I confirmed it this time with the Numa Node view in Task Manager.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#71 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I would guess it has something to do with your specific application and a bottleneck with the other system resources such as disk, network or memory. What do your memory, disk and network resource stats look like with no instances of your program running and what do they look like with 1, then 2, then 3 and so on running?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#72 Re: Xeon - not utilising more than 32 threads

Post by noetus »

The processes are compute-bound. There is very little system resources used while they are active. Disk I/O, Network I/O, and memory use are all low (the system has 32 GB of RAM and each process uses about 100 MB, about 9-10GB of RAM total, divided between the 88 processes). The processes run great on my other two systems, of 28 cores each, with either hyperthreading turned on or off. It must have something to do with the way scheduling is interacting with the NUMA nodes for >64 processes, don't you think?
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#73 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

I have .bat files setup to launch instances of BOINC. I'm trying to blow off the cobwebs, but I ran into a problem with making the START command inclusive to the .bat file. What works on my systems without fail is to create the .bat file that sets up the program/thread and then call a .bat file that contains the START command and points to the other .bat file. What I'm suggesting is you create the file you want to execute (filename.bat) and then call it from another command file that has the START/node/affinity filename.bat.

Install ProcessLasso onto your Windows system. From there you can see where each of the threads is allowed to run. It will show the affinity. If you've set them up as 1/2 on each CPU it should show the affinity on 1/2 of them as 0-43 and the other 1/2 should show 44-87. If it doesn't then the NUMA node is not being set up correctly.
scole of TSBT
Boinc Major General
Posts: 5980
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#74 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Exactly what model CPUs are these? are they retail or ES?

And running some WCG for the next 24 hours might help us diagnose the issue :-)
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#75 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Thanks for all these comments. The affinity is correct as you described; I had already checked that. As I mentioned before, the core dropout on one NUMA node corresponds to when one batch file is finishing up and the next one is taking over from it. The explanation has to have something to do with that. However, I've now checked whether hyperthreading makes any difference on my 28 core systems (an identical hardware setup apart from the fewer cores), which are working perfectly. For these single-threaded command line processes that run in parallel in the OS, there is no measurable difference between hyperthreading on or off. I can run twice the number of processes with hyperthreading on and have a batch list that is half as long, or have fewer processes with hyperthreading off and a longer batch list. The total time to completion is within statistical error of being identical. Therefore the simplest solution for me now is to turn hyperthreading off on the 44 core system, which brings the thread count below 64, and all is well (I've already tested it).

Perhaps I will come back to this later to try to get to the root of the problem!

[EDIT] The CPU model nos are E5-2696 @ 2.8 GHz when all cores are at 100%. They are retail chips.
Dirk Broer
Corsair
Posts: 1962
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#76 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

With the new Threadripper coming out with 32 cores/64 threads and future EPYCs rumoured to get yet more cores (64c/128t), this thread about threads is getting more and more interesting...A dual EPYC2 mobo might run into yet another limit with its 256 threads, I fear.
Dirk Broer
Corsair
Posts: 1962
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#77 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

There is computing outside of x86, of course. IBM's Power9 consists of either 12 cores with 8-way SMT or 24 cores with 4-way SMT, and their systems can house sixteen (16!) sockets.
That is 16x 12x 8, or 16x 24x 4, both ending up with 1536 threads per system, max. These systems are not meant for Joe Sixpack though (but who is stopping sysadmin Sixpack from 'burning in' the new company server with a 7-day long BOINC run? :wink: ), and they won't be running Windows either.