Xeon - not utilising more than 32 threads

Forum to discuss and compare Hardware profiles and Benchmarking
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#31 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

The "Use at most ___ % of CPU time" setting is intended as a throttle, used mostly for temperature control. I've never had it below 100% myself.

I'm not in a position right now to move a machine over to WCG and check those particular WU. Depending on the particular project/executable you will see some variation in CPU utilization. Typically they will range from 93-100% loading however. The worst are some of the multi-threaded apps. On those it is very common for it to use X number of threads and at times a number of those threads will be sitting idle.

It is not uncommon for a WU to start slowly on utilization because it is doing some initialization stuff etc. As it progresses the usage will increase. Most of the projects will hammer the CPU from the beginning.

I did just put a 12 thread machine on WCG SCC. 8 threads on WCG and 4 threads reserved for feeding a GPU. When the 8 WU 1st started they were showing 82% utilization. Now at the 10 minute mark they are up to 90-92% so the behavior you are seeing appears to be normal.
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#32 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Bryan had a good idea: go ahead and install BoincTasks. It allows you to manage Boinc running on many different systems and provides extra info not available in the Boinc Manager program, such as the CPU % each task is using. You need to verify that you have 56 WUs running and what the CPU % is on each of them.
http://efmer.com/b/?q=boinctasks_download
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#33 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

noetus, I'll give you a little more information about BOINC. BOINC was designed with the intention that everyday users would donate "spare" CPU cycles to science. There are a lot of people who do this and they typically fall into the set-and-forget class. These are folks who hear about Seti (the biggest) or maybe like yourself WCG. They attach to the project, use their spare CPU cycles, and never pay any attention to the project or how many credits they earn.

The 2nd class are the folks that crunch a project because they have a strong desire to help the particular science being done. It might be a cure for a disease or finding a new prime number or ..... Some of these people will be like the set-and-forget guys and others will fully dedicate machines to running a project.

Then there is the hobbyist crowd. These are the folks who may have multiple machines that are dedicated to only crunching BOINC projects. They violate the original intent/concept of BOINC to use only "spare" cpu cycles. These people build machines and then use them for nothing except running BOINC. Many of these people are credit chasers, badge whores, or they even participate in team challenges ... ie Scole would be an example of that crowd. :lol: There might be a few more of those types running around - I certainly wouldn't know. :roll:

The point is, regardless of how you wish to proceed you can have fun with it. Having fun is what it is all about.

Some of the projects issue badges. For example, when you get 14 days worth of validated WU on WCG SCC they will award you with a badge. When you hit 45 days the badge will change color ... etc etc etc. If you run another of the WCG sub-projects then at the 14 day mark you get another (different) badge.

If you think you may do something besides being a set-and-forgetter then you should attach each machine to a project called WuProp, which stands for WU Properties. This project is NCI (non-CPU-intensive) so it runs alongside whatever you are crunching. It collects data about the projects, i.e. amount of memory required etc. Additionally, when you run a sub-project for 100 hours you will be awarded a "little" star. If you run another sub-project for 100 hours you will get a 2nd "little" star. When you have 20 of those little stars it will give you a big star added to a ring. The more hours you have, the more the color of your little star changes, etc.

Attach each machine to: http://wuprop.boinc-af.org/
Alez
[ TSBT's Pirate ]
[ TSBT's Pirate ]
Posts: 10363
Joined: Thu Oct 04, 2012 1:22 pm
Location: roaming the planet

#34 Re: Xeon - not utilising more than 32 threads

Post by Alez »

I think you've now got most of it sorted, but to run things smoothly, never set "Use at most ___ % of CPU time" to less than 100%. Set anything less and it simply ramps the CPUs up and down, either on or off. Why it's still in the manager I don't know, as it's an extremely poor way to control your systems.
Secondly, don't get greedy. Leave a core or two free for the system to run on and to handle I/O etc. For example, setting "Use at most ___ % of the CPUs" to 93.75% will use 30 of the 32 cores.
All active projects have some info about them on the forum. Try a few different ones and run those you like, discard the others that don't fit with what you want.
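As a quick check of the 93.75% figure, here is a small sketch that computes the "% of the CPUs" value needed to leave a given number of threads free. The function name is made up for illustration; it is not part of BOINC.

```python
def cpus_percent(total_threads: int, keep_free: int) -> float:
    """Return the "% of the CPUs" value that leaves `keep_free` threads idle."""
    if not 0 <= keep_free < total_threads:
        raise ValueError("keep_free must be between 0 and total_threads - 1")
    return 100.0 * (total_threads - keep_free) / total_threads


if __name__ == "__main__":
    # 32 threads with 2 left free for the system, as in the example above.
    print(cpus_percent(32, 2))
```

For 32 threads with 2 kept free this gives 93.75, matching the value quoted in the post.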
The best form of help from above is a sniper on the rooftop....
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#35 Re: Xeon - not utilising more than 32 threads

Post by noetus »

My machines aren't dedicated to the project. I only donate spare CPU cycles. The BOINC processes are all set to idle priority and don't interfere with my day-to-day work at all. Don't notice a thing, even at 100% CPU usage, so I'm fine with that. And as mentioned earlier, I solved the ramping up and down issue by also setting "Use at most ___ % of CPU time" to 100% (in addition to "Use at most ___ % of the CPUs" at 100%). It's all steady as a rock now, 100% usage all the time, system is as fast and responsive for my own tasks as before.

The primary purpose of the machines is my own crunching (fintech big data project) and for that my own processes take priority, and also keep the machines all working at 100% when they are working on my own data. In order to make sure that 100% of my CPU cycles are dedicated to my own work when the machines are occupied with that, I have simply set the "Suspend when non-BOINC CPU usage is above ___ %" to 50%. That means when my machines ramp up with the local jobs all the BOINC stuff suspends and only takes up memory (of which there is plenty).
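For anyone who prefers files to the Manager dialog, the three settings mentioned in this post can reportedly also be kept in a global_prefs_override.xml file in the BOINC data directory. The sketch below only builds the text of such a file. The tag names are an assumption based on the BOINC preferences documentation and should be checked against your client version; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(max_ncpus_pct: float,
                         cpu_usage_limit: float,
                         suspend_cpu_usage: float) -> str:
    """Build the text of a global_prefs_override.xml (tag names are assumed)."""
    root = ET.Element("global_preferences")
    # "Use at most ___ % of the CPUs"
    ET.SubElement(root, "max_ncpus_pct").text = str(max_ncpus_pct)
    # "Use at most ___ % of CPU time"
    ET.SubElement(root, "cpu_usage_limit").text = str(cpu_usage_limit)
    # "Suspend when non-BOINC CPU usage is above ___ %"
    ET.SubElement(root, "suspend_cpu_usage").text = str(suspend_cpu_usage)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The values described in this post: 100%, 100%, suspend above 50%.
    print(build_prefs_override(100.0, 100.0, 50.0))
```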

:-)
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#36 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, not all smooth. Since installing the new client each machine now has two entries in my device management page on the WCG portal. Seems like maybe the new client didn't recognise the machine and set a new machine ID or something? Bizarrely, though, it didn't do this for every machine - one of them wasn't doubled up like this (despite also getting the new client). It's rather aggravating to see double listings for (almost) all my devices, and also to see an inaccurate report of the number of devices I have on my info page.

Here's what I mean:

https://ibb.co/c9akfv

Can anything be done about this? Presumably it would require manual intervention by WCG technical staff - not a very likely prospect I'm guessing.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#37 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

I don't think there is anything you can do on WCG. On projects that run the standard BOINC supplied server code you can "merge" computers. That function checks the machines, and if it finds 2 different IDs for the same machine it will merge the stats and show only 1 machine.

WCG runs their own server code and I don't think they have that capability.
Dirk Broer
Corsair
Corsair
Posts: 1964
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands
Contact:

#38 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

With the Intel® Core™ i9-7980XE (not to mention CPUs with an even more EPYC number of cores/threads) just around the corner,
nice to know that the number of threads should not be a problem (but wasn't Windows limited to 64 threads? Well, we have Linux for that problem...)
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#39 Re: Xeon - not utilising more than 32 threads

Post by noetus »

And the upcoming Xeon Platinum 8180 with 28 cores. That's 112 threads on a dual processor motherboard.

Windows 10 supports up to 256 cores (512 threads with hyperthreading), so we don't have anything to worry about for the time being...
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#40 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

noetus wrote: Wed May 31, 2017 10:08 pm And the upcoming Xeon Platinum 8180 with 28 cores. That's 112 threads on a dual processor motherboard.

Windows 10 supports up to 256 cores (512 threads with hyperthreading), so we don't have anything to worry about for the time being...
You might want to take a look at the Windows NUMA implementation which is only 64 threads. That applies to ALL versions of Windows including the server series.
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#41 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Bryan wrote: Thu Jun 01, 2017 1:35 am
noetus wrote: Wed May 31, 2017 10:08 pm And the upcoming Xeon Platinum 8180 with 28 cores. That's 112 threads on a dual processor motherboard.

Windows 10 supports up to 256 cores (512 threads with hyperthreading), so we don't have anything to worry about for the time being...
You might want to take a look at the Windows NUMA implementation which is only 64 threads. That applies to ALL versions of Windows including the server series.
The thing to point out is Windows will see all the cores/threads but the OS architecture limits the number of concurrent pipes to memory at 64. You will see 72 threads active on a 72 thread system but those processes must share 64 pipelines to memory and you will see less than 100% utilization on those threads.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#42 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

@Pete, nope! :P

NUMA is the new memory architecture used by both Windows and Linux. Its purpose is to speed things up (which it does) by having the thread use the memory that is tied to the CPU that the thread is running on. If it were to use memory on the other CPU it would have to access that by having the memory controller contact the other CPU's memory controller over the QPI link which is much slower because the QPI link is not as fast and 2 memory controllers get involved. In normal operation this would never happen unless the process needed more memory than was available on the local CPU.

On Windows there are only 64 NUMA channels - period. It doesn't matter how many threads the CPU's have because they will be sharing the 64 NUMA pipes. If you run more than 64 threads simultaneously it will start "sharing" the memory pipes. That means that it has to load the memory pipe for 1 thread and then reload for the 2nd thread. This is inefficient and means that you will actually get less results from running 72 threads than by just leaving it at 64.

In the real world, people who buy and use machines with > 64 threads know what they are doing, you can rest assured (I don't). There are commands in NUMA that allow you to set up CPU nodes and each node would have 64 NUMA pipes. In the case of my 72 thread machines I would assign CPU0 to Node 0 and assign CPU1 to another node, Node 1. Then I would assign an instance of BOINC to run on Node 0 and another instance of BOINC to run on Node 1. In effect you have a machine with 2 CPUs but they run independent of one another. Some of the E7 Xeons allow you to run 8 CPUs at a time. This is the way it would be done on those machines ... or they would use sophisticated VM hypervisors.

If I put my 72 thread machines on Win7 and use ncpus to limit BOINC to 64 threads it still doesn't like it. On those machines I turn off HT if I need to run Windows. That is the reason the last 2 machines I built have only 64 threads. On those I can switch between Linux and Win and not have to screw with it.

BTW, I'm sure that Linux has some kind of limit on NUMA threads but I can't find what it is. I know they will handle 88 thread machines without a problem.
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#43 Re: Xeon - not utilising more than 32 threads

Post by noetus »

"On Windows there are only 64 NUMA channels - period. It doesn't matter how many threads the CPU's have because they will be sharing the 64 NUMA pipes. If you run more than 64 threads simultaneously it will start "sharing" the memory pipes. That means that it has to load the memory pipe for 1 thread and then reload for the 2nd thread. This is inefficient and means that you will actually get less results from running 72 threads than by just leaving it at 64."

I don't see how this can be right as a matter of principle. Once there are more than 64 logical CPUs, and thus multiple processor groups and sharing of NUMA channels of the same NUMA node, how much performance suffers will be a matter of empirical testing. I can't see the justification for making a blanket statement that there is no point in using more threads because the performance will be the same or worse. It is going to depend on the specifics of the architecture, how the application is coded, and just how much performance gain there is from NUMA in the first place.

I will soon be able to test this empirically on an 88 thread machine, in a couple of weeks probably, so I'll test and report back here.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#44 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

Have fun :) If you find it isn't a problem then I know a LOT of people that will be purchasing Win10 Enterprise! You can turn off the hardware NUMA in your BIOS and then you need to turn it off in Win as well.

You might want to install Boinc Tasks ... it makes it real easy to see the loading on each of the threads (or you can use Task Manager).

Here is some "light" reading for you: https://msdn.microsoft.com/en-us/librar ... s.85).aspx

In essence the problem is NOT the architecture, nor is it the executable from the project. The problem is that the BOINC client does not specifically take into account the system's NUMA nodes when it launches a thread. Ideally it would specifically assign a new thread to a NUMA node but instead it relies on the Windows scheduler to make that determination. The Win scheduler uses 64 NUMA threads. Quoting Microsoft:
"The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application. An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups."


Then of course Windows may work differently with your 88 thread machines than it does with my 72 thread Xeon V4s ... hopefully that will be the case.

Linux doesn't have the same limitation BTW. It will run your 88 threads at 100% loading /thread.
Dirk Broer
Corsair
Corsair
Posts: 1964
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands
Contact:

#45 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

Bryan wrote: Fri Jun 02, 2017 4:04 pmThe reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application.
Where have I heard this line of argument before? Sounds like "640K ought to be enough for anybody" (supposedly said by Bill Gates around 1981) or "I think there is a world market for maybe five computers" (supposedly said by Thomas J. Watson of IBM in 1943).
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#46 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

@Dirk ... I think you broke the code :lol:
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#47 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Bryan, Read this https://support.microsoft.com/en-us/hel ... -windows-7

Might be solution. Haven't tried it yet.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#48 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

scole250 wrote: Fri Jun 02, 2017 7:08 pm Bryan, Read this https://support.microsoft.com/en-us/hel ... -windows-7

Might be solution. Haven't tried it yet.


Now that is interesting - very very interesting. Obviously the "hotfix" is irrelevant since it is dated 2010 and is incorporated in later releases/updates. However, the info on how to assign a process to a node is great.

You wouldn't be able to assign boinc.exe to 2 nodes (probably). However, if you ran 2 instances of boinc, one being boinc.exe and the 2nd being boinc2.exe, and then used ncpus to limit each to (total threads/2), then it might work :D That is worth playing with.

My machines are totally tied up for the next 4-5 days on Amicable Numbers - doing run to 100M. So when will you get a chance to try it Steve? :lol:

BTW, how much do you know about Python?

@noetus - you can install and simultaneously run multiple instances of BOINC on your computer. There are times when this is beneficial ... like when running a team competition. The "ncpus" is a command that can be placed into your cc_config.xml file and it tells BOINC to simulate that many CPUs/threads. For this discussion and your upcoming machine, you would set ncpus to 44 in each of your boinc instances.
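As a rough illustration of the ncpus option described above, the sketch below builds the text of a minimal cc_config.xml that sets ncpus inside the options section. The helper name is hypothetical, and a real cc_config.xml may contain other options that you would want to preserve rather than overwrite.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ncpus: int) -> str:
    """Return the text of a minimal cc_config.xml that sets <ncpus>."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 88 threads split over two BOINC instances -> 44 per instance.
    print(build_cc_config(88 // 2))
```

Each BOINC instance would get its own copy of this file in its own data directory.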
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#49 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#50 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

scole250 wrote: Fri Jun 02, 2017 8:26 pm I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
That was too easy, I should have asked for more :roll:
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#51 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Bryan wrote: Fri Jun 02, 2017 8:38 pm
scole250 wrote: Fri Jun 02, 2017 8:26 pm I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
That was too easy, I should have asked for more :roll:
And I'll be glad to pass along my findings for just a few measly bitcoins. :lol:
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#52 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

scole250 wrote: Fri Jun 02, 2017 8:52 pm
Bryan wrote: Fri Jun 02, 2017 8:38 pm
scole250 wrote: Fri Jun 02, 2017 8:26 pm I can't recall if I reinstalled win7 on the 72 thread system last time it flaked out but I'll try to give it a whirl this weekend.
That was too easy, I should have asked for more :roll:
And I'll be glad to pass along my findings for just a few measly bitcoins. :lol:
:lol: I knew there was a catch! The bitcoins are in the mail; watch for them.
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#53 Re: Xeon - not utilising more than 32 threads

Post by noetus »

A point of clarification: was the earlier claim of a limit of 64 threads (by Bryan) related to Boinc specifically or was it supposed to cover any set of processes one might care to run?

In my own coding I do not code multi-threaded apps. I tried that and it didn't work out well (thread management was an issue and requires more complex coding than I am really capable of right now). Instead for naturally parallelisable tasks I code for single threads and then run multiple instances from the command line, letting the OS manage the multitasking.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#54 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

It is a basic function/limitation of any flavor of the Windows OS, including the server versions. It has nothing to do with BOINC specifically. You will run into the same problem with your program.

We did find the solution: start /node X /affinity 0xFFFFFFFFF [your_program.exe]. You can execute the command from the command window or from a batch/cmd file. When the program is launched you assign it to either node 0 or node 1 and then you set the affinity of which processors in that node the program is allowed to use. The 0xFFFFFFFFF is a bit mask for the allowed threads in the node (nine F's, i.e. 36 bits). What I showed would allow the program to use any of 36 threads. In your case you will need to add 2 more F's.
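Counting F's by hand is error-prone, so here is a small sketch that builds the hexadecimal mask for a given number of threads per node. The function name is made up for illustration. It assumes you want the lowest N threads of the node enabled, and it caps N at 64 on the assumption that a single affinity mask is 64 bits wide.

```python
def affinity_mask(threads: int) -> str:
    """Return a hex mask with the lowest `threads` bits set, e.g. 4 -> 0xF."""
    if not 1 <= threads <= 64:
        raise ValueError("threads must be between 1 and 64")
    return f"0x{(1 << threads) - 1:X}"


if __name__ == "__main__":
    for n in (32, 36, 44):
        mask = affinity_mask(n)
        # Each F covers 4 threads: 32 -> 8 F's, 36 -> 9 F's, 44 -> 11 F's.
        print(n, mask, len(mask) - 2)
```

Each hex digit F covers four threads, so 36 threads need nine F's and 44 threads need eleven.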

Check out the syntax/manual for the "start" command. Down in the multiprocessor section you will find the NUMA stuff.

https://ss64.com/nt/start.html

There are also NUMA commands you can use programmatically that will do the same thing, and with those you can actually go down to assigning at the thread level rather than at the program level.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#55 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

To be clear about this, you can run your 88 threads from a program without a problem. However the efficiency will be the same as if you were running 64 threads.

There are 64 NUMA memory pipes available. If you run 64 threads or less then each thread will run at 100% load. If you enable 88 threads then Windows will start "sharing" the memory pipes. So the 24 threads (above the 64) would start sharing the memory pipes with 24 of the 1st 64 and therefore would only be running at 50% efficiency because 1/2 the time they are running and 1/2 the time they are waiting for memory.

You would wind up with 40 threads running at full load and 48 threads running at 50% load ... ie 64 threads. Since the memory channels would have to be continuously loaded/unloaded it would actually be less efficient than just running 64 threads.
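The arithmetic in this post can be written out as a short sketch. It encodes only the sharing model as described here (every thread above the limit pairs up with one of the first 64, and both then run at 50%); it is not a verified description of how Windows actually schedules threads, and the function name is hypothetical.

```python
from typing import Tuple


def sharing_model(running: int, pipes: int = 64) -> Tuple[int, int, float]:
    """Return (threads at 100%, threads at 50%, equivalent full-load threads)."""
    if running < 0 or running > 2 * pipes:
        raise ValueError("model only covers 0 .. 2 * pipes running threads")
    if running <= pipes:
        return running, 0, float(running)
    extra = running - pipes      # threads above the limit
    shared = 2 * extra           # each extra thread pairs with one existing thread
    full = running - shared      # threads that keep a pipe to themselves
    return full, shared, full + 0.5 * shared


if __name__ == "__main__":
    print(sharing_model(88))
```

For 88 threads the model gives 40 threads at full load and 48 at half load, which adds up to the equivalent of 64 fully loaded threads, as stated above.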

So to overcome this limitation you would set your program up to launch 44 threads at Node 0 (CPU0) and then 44 threads to Node 1 (CPU1). Then all 88 threads would be running at 100% loading.

We proved the technique works as I had described in my post to Pete. We launched one BOINC client to node 0 and a 2nd BOINC client to node 1. On our 72 thread machines all threads were running at 100% load.
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#56 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Actually, our observations have been that even if you limit the number of threads to 64 they still won't run as efficiently as they could. NUMA was designed to use memory on the bus of the other CPU if needed (or not), and it will do so at a loss of performance. To get the best performance, you must use the start command directives /NODE and /AFFINITY to restrict processes to the NUMA node wired to the processor they run on.

Seeing is believing, and there are two things to set up so you can observe efficiency and affect it:
1. Install BoincTasks so you can see what kind of CPU utilization WUs are running at: http://efmer.com/b/?q=boinctasks_download If you are running Boinc on more than one system, it's the only way to go to manage things. It will require you to configure each Boinc client to allow remote GUI access from other systems.
2. Set up your systems to run multiple Boinc clients. See info here: https://www.tsbt.co.uk/forum/viewtopic.php?f=172&t=3140 Not sure if you have access to that area yet. If not, let us know and we'll get it moved to an open access area.

A lot of things to do and understand in those 2 items. Going to take a little effort to set it all up. I don't think we can easily give a "go do A, B, C, D" list of instructions. If you have questions though, feel free to ask.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#57 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

There are a couple of caveats to those wanting to run more than 64 threads of BOINC.

1. There are a couple of projects, like Yoyo, that don't allow multiple BOINC clients.

2. If the project uses VBox then it doesn't work because you only have one instance of VBox installed and it therefore falls under the 64 NUMA thread rule. You can assign a boinc client to each node but when it calls VBox then VBox is limited to 64 threads.

On standard BOINC projects it is phenomenal ... beats the heck out of having to turn off HT when you have no choice but to run Windows (like Gerasim).
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#58 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#59 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

noetus wrote: Thu Jun 08, 2017 12:57 pm Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
I think you'll see a higher benchmark with 88 threads vs. 64 but you won't get anywhere near 100% utilization on all 88 threads at the same time. In order to get the most utilization of all 88 threads under Windows, you'll need to make sure 44 processes run on threads 0-43 on NUMA node 0 and the other 44 processes run on threads 44-87 on NUMA node 1. That doesn't occur automatically. To get Boinc to run under those constraints, you must setup 2 boinc clients and run each using the start command with the correct NODE and AFFINITY options.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#60 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

noetus wrote: Thu Jun 08, 2017 12:57 pm Let me see if I'm reading you right. So you're saying if I take a 44 core / 88 thread machine, and limit it to 32 cores / 64 threads in BIOS, then run Cinebench on Windows, then enable all cores in BIOS, and run Cinebench again, I won't see any significant difference in the benchmark scores?
No, I think Cinebench will use all 88 threads to their fullest capability. That is a professionally written benchmarking suite. That suite is used by every Tom, Dick, and Harry who tests computer systems. They would be fools to not take NUMA into account. If I were writing code to run benchmarks one of the 1st things I would do is check the system topology to see how many CPUs, cores/threads, and NUMA nodes are available. I have absolutely no doubt that they do this and then manage the threads of the benchmark accordingly.

Don't forget the quote from Microsoft concerning the issue:
"The reason for initially limiting all threads to a single group is that 64 processors is more than adequate for the typical application. An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups."
I really think that Microsoft knows their product and how it operates :lol: Then again something may have changed in Win10 Enterprise but I haven't found any documentation along those lines. It also isn't mentioned in the new server editions including the "data center" versions. If it were to be changed that would be the logical place for it to occur.
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#61 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, this is along the lines of what I assumed. Regarding single-threaded command line applications; multiple instances of these will also be assigned NUMA nodes according to best practices for efficiency I assume.
Bryan
Boinc Brigadier
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#62 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

Correct. You would want to keep track of how many you have running and where. You also want to "balance" the 2 nodes so they are both using approximately the same number of threads.
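One way to keep the two nodes balanced when launching many single-threaded jobs is to alternate nodes as you go. The sketch below only generates the start command lines; it does not run anything. The program name crunch.exe and the trailing job-index argument are hypothetical stand-ins for your own program, and the /NODE and /AFFINITY switches are used as described in the start command reference linked earlier in the thread.

```python
from typing import List


def balanced_start_commands(program: str, jobs: int,
                            threads_per_node: int,
                            nodes: int = 2) -> List[str]:
    """Build one 'start' command per job, alternating NUMA nodes round-robin."""
    if jobs < 0 or nodes < 1 or not 1 <= threads_per_node <= 64:
        raise ValueError("invalid job, node, or thread count")
    # Mask with the lowest `threads_per_node` bits set, relative to the node.
    mask = f"0x{(1 << threads_per_node) - 1:X}"
    commands = []
    for job in range(jobs):
        node = job % nodes  # round-robin keeps the per-node counts within 1
        # The empty "" is the window title that 'start' expects first.
        commands.append(f'start "" /NODE {node} /AFFINITY {mask} {program} {job}')
    return commands


if __name__ == "__main__":
    for line in balanced_start_commands("crunch.exe", 88, 44)[:4]:
        print(line)
```

The generated lines could be written to a batch/cmd file and checked by eye before running them.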
scole of TSBT
Boinc Major General
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#63 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I've been running my 72 thread system under Win 7 again this afternoon and can confirm again that if you run a single Boinc client and allow the OS to manage the nodes you will see CPU utilization vary from 50-100%. Most of the time I saw utilization between 90-96% and it rarely but occasionally peaked at 100% but just as often fell below 75%. It ran this way for a couple hours. Then I ran two clients specifying the NODE and AFFINITY for each client and CPU utilization stayed pegged at 100%.
noetus
Boinc Corporal
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#64 Re: Xeon - not utilising more than 32 threads

Post by noetus »

OK, so I now have the 44 core / 88 thread machine up and running, and it is full of lovely scrumptious goodness. I was easily able to run at full 100% utilisation on each core (for 4 hours continuously) with my own crunching project, starting 88 separate threads from the command line. It also seems appreciably quicker per core than my two 28 core E5-2658 v4 powered machines, unbelievably quicker, like 40% quicker (per core), I have to look into that.

I have noticed that the generic BOINC client downloaded from the Berkeley site seems to generate fewer points when crunching for WCG than the WCG client downloaded from www.worldcommunitygrid.org. Which in itself is strange, have to look into that too. I downloaded the WCG-only client and tried running it (single instance) on the 44 core machine. Interestingly, even though it is 32 bit, it is running 64 processes concurrently (not 32 as before) and using 92% of CPU availability as reported by Task Manager. Have they updated this client? I don't think so. So what is going on?

I imagine that by running an additional client (as per the instructions mentioned above) I will get full 100% CPU utilisation, even when using the 32 bit client. Interesting!
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#65 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Guys... I have been absent for a while, crunching my own projects. Is there still interest in this topic?

I have some weird results. Recall I have a 44 core machine running Windows 10. The machine has two sockets, each with a 22 core Xeon, for a total of 88 threads or virtual cores with HT on.

For my own crunching, I have been running 88 separate threads, each thread a separate console process. Each process is single-threaded; hence I am relying on the OS to handle the parallelization of the data processing and scheduling of multiple threads when I launch multiple copies of the same process from the command line (each one handling a different chunk of data to be crunched via command line arguments).

The Windows scheduler does some funny stuff. It seems to cycle from running half the virtual cores at 100% and the other half at 0%, to all vCores at 100%, to the reverse split (the half that was at 0% before now at 100%, and the half that was at 100% now at 0%), then repeating. Suppose we divide the 88 threads into two groups, A and B. So the Windows scheduler at first assigns equal processing to 88 threads divided between group A and group B. Then it slowly moves to giving all priority to group A and none to group B (you can see this in task manager - half the vCores are at 0%, half at 100%, divided neatly in the middle of the core activity display, and also half the threads are active, half more or less inactive), then back to half to group A and half to group B, then transitioning to giving no priority to group A and everything to group B (and here you see the core activity display in task manager inverted). The CPU frequency is throttled accordingly, too; when half the vCores are active, overall turboboost is 3.7 GHz; when all cores are active, overall turboboost drops to only 2.7 or 2.8 GHz.

Does this make any sense to anyone? The OS has Group 0 and Group 1 NUMA nodes. It doesn't seem to make a difference if I explicitly start half the processes on Group 0 and half on Group 1 (using START /NODE 0 and START /NODE 1). And you wouldn't expect it to, either - clearly the OS is capable of assigning all the processes across both Nodes evenly even without this explicit Node assignment, as evidenced by the fact that all vCores are at 100% some of the time, when turboboost is at 2.7/2.8.

It's frustrating because clearly I am not getting the full potential out of this expensive machine. I also have two 28 core machines. These run the exact same processes, 56 at a time with HT on and with 100% utilisation all the time, for days or weeks at a time, no issues whatsoever, no core dropouts below 100%, nothing. Per core processing on the 44 core machine is up to 50% slower as evidenced by some quick and dirty benchmarking, and this despite the fact that the 44 core machine has faster cores - turboboosting all cores simultaneously to 2.8 GHz max, while the 28 core machines turboboost all cores to 2.5 GHz.

If anyone can help me figure out what is going on I would be most grateful. I am going to try fiddling with settings in the BIOS. One thing to try will be to turn HT off. But I know that if HT is on and all cores are utilised at 100%, overall per-thread processing is 10-15% faster, so turning HT off is not really the ideal solution even if it improves things a bit.

One final detail that might offer a clue as to what is going on is that this cycling from 100% to 50% and back again correlates with the way the processing of the data chunks is lined up in batch files. Each process is launched from the command line as I mentioned. I have a script that automatically writes batch files, one batch file per virtual core. So there are 88 batch files, and each one contains a series of command line launch commands to be executed one after the other. The idea is that each core will process these one at a time. The cycling from initial 100% vCore usage to 50% correlates with the time all the cores take to process the first command line in each of their batch files. It then cycles back from 50% to 100% with the processing of the second line in the batch file (but with the 50% usage corresponding to the other half of the vCores). And so on. So somehow the scheduling of the command line processes is interacting with the way Windows 10 schedules the processes across all the vCores and Groups to produce this weird cycling behaviour and under-utilisation of the overall processing power available. So perhaps there is a way of scheduling the process tasks from the command line that will not produce this effect. But I can't really think beyond that.
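For anyone following along, the setup described above (a script that writes one batch file per virtual core, each holding a list of commands to run one after the other) could look roughly like the Python sketch below. The original script was not posted, so everything here is an assumption: the function names, the `worker_NN.bat` file names, and the round-robin way of dealing the commands out are purely illustrative.

```python
from pathlib import Path


def split_round_robin(commands, n_workers):
    """Deal commands out to n_workers lists, one command at a time."""
    queues = [[] for _ in range(n_workers)]
    for i, cmd in enumerate(commands):
        queues[i % n_workers].append(cmd)
    return queues


def write_batch_files(commands, n_workers, out_dir):
    """Write one .bat file per worker; each file lists its commands serially.

    File names (worker_00.bat, ...) are hypothetical, chosen for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for worker, queue in enumerate(split_round_robin(commands, n_workers)):
        path = out / f"worker_{worker:02d}.bat"
        # newline="" keeps the explicit CRLF line endings untouched
        with open(path, "w", newline="") as f:
            f.write("\r\n".join(queue) + "\r\n")
        paths.append(path)
    return paths
```

Calling `write_batch_files(commands, 88, "batches")` with a list of 440 command lines would give 88 files of 5 commands each. One property of a static split like this is that a worker which finishes its list early has nothing left to do, whereas a shared queue would hand it the next pending command. Whether that has anything to do with the cycling described above is not established here.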
scole of TSBT
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#66 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

What is the command line you're using? When you use the start command, are you setting just the node or node and affinity?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#67 Re: Xeon - not utilising more than 32 threads

Post by noetus »

START /NODE 0 command.bat

One of these for each vCore, splitting equally between NODE 0 and NODE 1. The file command.bat then contains a list of commands to run serially. I have been assuming that the commands in the batch file will inherit the NODE setting from the calling START command, but now that I think about it, perhaps that doesn't make sense.

[Edit]. I am adding screenshots of the sequence so you can see what happens. It starts out at 100% on all vCores, then gradually there is a reduction in the CPU usage for one half of all the cores, as you can see in these Task Manager grabs. After some minutes there is only CPU activity for half the vCores, then the other half gradually start up until they are all at 100%. Then the opposite half starts gradually reducing, until there are only half the number (the other half) with activity. And so on.
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#68 Re: Xeon - not utilising more than 32 threads

Post by noetus »

So I did some quick and dirty benchmarking. I ran 440 identical processes on a single set of data overall, with each process analysing 1/440 of the data. With hyperthreading on, I had 88 processes running simultaneously, with 5 repetitions of each process (on a different chunk of data each time). The total time to complete all the processes, across all cores, was 7704.5 secs or 128.4 minutes or 2 and 1/8 hours.

Then I turned hyperthreading off, and ran the test again. This time I had 44 processes running simultaneously, half the number, with double the number of repetitions of each process (on a different chunk of data each time) per core, i.e. 10. The total time to complete all the processes, across all cores, was 4605.5 secs or 76.76 minutes or an hour and a quarter.

That's an astounding difference. Clearly all the core slowdowns and massive scheduling and rescheduling Windows does to all the processes with hyperthreading on, really kills it (despite the increased turboboost when fewer cores are utilised at the same time, as can be seen in the above grabs of Task Manager). Or is there another explanation? I'd like to run the benchmark again with hyperthreading on but with full 100% utilisation of the CPUs all the way through - but I don't know how to get it to do that on this machine with >64 cores.
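To put a number on that difference, here is a small Python sketch that redoes the arithmetic on the two totals quoted above. It assumes both figures are wall-clock seconds (7704.5 only works out to the stated 128.4 minutes if it is in seconds), and the function name `summarize` is just a label for this example.

```python
def summarize(ht_on_secs, ht_off_secs):
    """Convert two wall-clock totals to minutes and compare them."""
    return {
        "ht_on_min": round(ht_on_secs / 60, 2),
        "ht_off_min": round(ht_off_secs / 60, 2),
        # How many times longer the HT-on run took than the HT-off run
        "slowdown_factor": round(ht_on_secs / ht_off_secs, 2),
        # Share of the HT-on wall time saved by turning HT off
        "time_saved_pct": round(100 * (1 - ht_off_secs / ht_on_secs), 1),
    }


# Totals quoted in the post, assumed to be in seconds
result = summarize(7704.5, 4605.5)
print(result)
```

On these figures the HT-on run took roughly 1.67 times as long, so turning hyperthreading off cut the total time by about 40% for the same 440 chunks.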

For now I'll simply turn hyperthreading off. Once my two 28 core machines have finished their current task (another week) I will run the test on them, too, to see how much of a difference hyperthreading makes on those machines for a bunch of OS-managed parallelization of single-thread processes.
scole of TSBT
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#69 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

On your 88 thread system use these commands. Start half the processes with one command line and use the other command line for the other half.

START /NODE 0 /AFFINITY 0xFFFFFFFFFFF command.bat

START /NODE 1 /AFFINITY 0xFFFFFFFFFFF command.bat
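In case the mask looks like magic: 0xFFFFFFFFFFF is simply 44 one-bits, one per logical processor, which matches an 88 thread machine if each NUMA node holds 44 of them (that per-node count is an assumption based on the 2 x 22 core setup described earlier). As far as I recall from the START help text, when /NODE and /AFFINITY are combined the mask is counted from bit zero within that node, which would explain why both commands use the same value. The Python sketch below builds and decodes such masks; the helper names are made up for this example.

```python
def affinity_mask(n_cpus):
    """Return an integer with the lowest n_cpus bits set."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return (1 << n_cpus) - 1


def format_mask(mask):
    """Render a mask as uppercase hex with a 0x prefix."""
    return f"0x{mask:X}"


def mask_to_cpus(mask):
    """List the bit positions (logical processor indexes) set in a mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


mask = affinity_mask(44)
print(format_mask(mask))        # hex form of a 44-bit mask
print(len(mask_to_cpus(mask)))  # number of processors the mask allows
```

The same helpers work for smaller masks, e.g. `mask_to_cpus(0b1010)` gives processors 1 and 3, which is handy for checking what a tool like ProcessLasso reports against what you asked for.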
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#70 Re: Xeon - not utilising more than 32 threads

Post by noetus »

This unfortunately made no difference. You still see a cycle of both Numa nodes utilised at 100%, then gradually one of them goes to zero, then it increases back to 100%, then the other decreases to zero, then it goes back to 100%, and so on. I confirmed it this time with the Numa Node view in Task Manager.
scole of TSBT
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#71 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

I would guess it has something to do with your specific application and a bottleneck with the other system resources such as disk, network or memory. What do your memory, disk and network resource stats look like with no instances of your program running and what do they look like with 1, then 2, then 3 and so on running?
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#72 Re: Xeon - not utilising more than 32 threads

Post by noetus »

The processes are compute-bound. There is very little system resources used while they are active. Disk I/O, Network I/O, and memory use are all low (the system has 32 GB of RAM and each process uses about 100 MB, about 9-10GB of RAM total, divided between the 88 processes). The processes run great on my other two systems, of 28 cores each, with either hyperthreading turned on or off. It must have something to do with the way scheduling is interacting with the NUMA nodes for >64 processes, don't you think?
Bryan
Boinc Brigadier
Posts: 2621
Joined: Thu May 21, 2015 6:18 pm

#73 Re: Xeon - not utilising more than 32 threads

Post by Bryan »

I have .bat files setup to launch instances of BOINC. I'm trying to blow off the cobwebs, but I ran into a problem with making the START command inclusive to the .bat file. What works on my systems without fail is to create the .bat file that sets up the program/thread and then call a .bat file that contains the START command and points to the other .bat file. What I'm suggesting is you create the file you want to execute (filename.bat) and then call it from another command file that has the START/node/affinity filename.bat.

Install ProcessLasso onto your Windows system. From there you can see where each of the threads is allowed to run. It will show the affinity. If you've set them up as 1/2 on each CPU it should show the affinity on 1/2 of them as 0-23 and the other 1/2 should show 24-43. If it doesn't then the NUMA node is not being setup correctly.
scole of TSBT
Boinc Major General
Posts: 5983
Joined: Mon Feb 03, 2014 2:38 pm
Location: Goldsboro, (Eastern) North Carolina, USA

#74 Re: Xeon - not utilising more than 32 threads

Post by scole of TSBT »

Exactly what model CPUs are these? Are they retail or ES?

And running some WCG for the next 24 hours might help us diagnose the issue :-)
noetus
Boinc Corporal
Posts: 50
Joined: Tue May 30, 2017 3:15 am

#75 Re: Xeon - not utilising more than 32 threads

Post by noetus »

Thanks for all these comments. The affinity is correct as you described, I had already checked that. As I mentioned before, the core dropout on one NUMA node corresponds to when one batch file is finishing up and the next one is taking over from it. The explanation has to have something to do with that. However, I've now checked whether hyperthreading makes any difference on my 28 core systems (an identical hardware setup apart from the fewer cores), which are working perfectly. For these single-threaded command line processes that run in parallel in the OS, there is no measurable difference between hyperthreading on or off. I can run twice the number of processes with hyperthreading on, and have a batch list that is half as long, or have fewer processes with the hyperthreading off and a longer batch list. The total time to completion is within statistical error of being identical. Therefore the simplest solution for me now is to turn hyperthreading off on the 44 core system, which brings the thread count below 64, and all is well (I've already tested it).

Perhaps I will come back to this later to try to get to the root of the problem!

[EDIT] The CPU model nos are E5-2696 @ 2.8 GHz when all cores are at 100%. They are retail chips.
Dirk Broer
Corsair
Posts: 1964
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#76 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

With the new Threadripper coming out with 32 cores/64 threads and future EPYCs rumoured to get yet more cores (64c/128t), this thread about threads is getting more and more interesting... A dual EPYC2 mobo might run into yet another limit with its 256 threads, I fear.
Dirk Broer
Corsair
Posts: 1964
Joined: Thu Feb 20, 2014 11:24 pm
Location: Leiden, South Holland, Netherlands

#77 Re: Xeon - not utilising more than 32 threads

Post by Dirk Broer »

There is computing outside of x86, of course. IBM's Power9 consists of either 12 cores with 8-way SMT or 24 cores with 4-way SMT and their systems can house sixteen (16!) sockets.
That is 16x 12x 8, or 16x 24x 4, both ending up with 1536 threads per system, max. These systems are not meant for Joe Sixpack though (but who is stopping sysadmin Sixpack from 'burning in' the new company server with a 7-day long BOINC run? :wink: ), and they won't be running Windows either.