AMD Ryzen Threadripper & X399 MegaThread! FAQ & Resources


jaymc

Distinguished
Here guys, but doesn't this also mean that the benchmarks we've been getting on 32-core Epyc Windows Server are all severely hampered by the same bad scheduler?

I mean, it's not a small discrepancy; a lot of the workloads are 2.5x faster in Linux.

I can't wait to see the actual numbers for a 32-core Epyc Windows Server box against Xeon when this is fixed...
Expect the gap to increase a lot... well, 2.5x in certain workloads...

If this makes all the previous benchmark results on Threadripper and Epyc wrong... well, lol. This is huge.

I wonder what the 64 Core is going to look like, ridiculous I'd say :D

 


There are a couple of things going on here. I'd need some really low-level performance stats to know for sure, but my suspicion is simply that the Windows scheduler really isn't designed for this type of workload.

Unlike Linux, Windows is designed to let threads jump between cores in order to maximize thread uptime. On Linux, for example, if a thread gets bumped by another thread, it will be re-assigned to the same CPU core. The downside is that the thread may sit waiting for a period of time even if another CPU core is capable of running it. This costs some performance for that specific thread, but you don't have to worry about accessing CPU cache across cores or hitting main memory as often as threads get re-assigned. By contrast, Windows will schedule a thread on whatever core is capable of running it at that time. This results in threads jumping between cores, increasing uptime but also increasing the amount of cache/memory access required.
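
To make that concrete, this is roughly what pinning a thread to one core looks like on the Linux side. Just a sketch (compile with -pthread); the worker body and the core index are placeholders, not anything either OS does internally:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Worker that benefits from staying put: its working set stays warm in one
   core's L1/L2 instead of being refetched after every migration. */
static void *worker(void *arg)
{
    int core = *(int *)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* Restrict this thread to a single core; the scheduler may still preempt
       it, but it will always be resumed on that same core. */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    /* ... cache-sensitive work goes here ... */
    return NULL;
}

int main(void)
{
    int core = 2;   /* placeholder core index */
    pthread_t t;
    pthread_create(&t, NULL, worker, &core);
    pthread_join(t, NULL);
    return 0;
}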

If you have only a few threads, the Windows approach isn't going to significantly hurt performance, but as the core count on the CPU starts to increase you need to start considering the extra cache/memory access this approach results in. The problem gets significantly amplified in a NUMA environment. Essentially: the Windows thread scheduler is optimized for a single application that uses no more than a handful of threads.

As for the results on Linux, do notice scaling starts to decline noticeably beyond 16 threads, which is about what I'd expect. If I'm particularly bored (very little work for me right now) I might actually compute some hard numbers to demonstrate.
 


Games don't.

Other workloads will, but scaling declines once you get beyond 16 cores due to scheduling overhead. Coincidentally, you see that trend in the result set, where scaling drops significantly when moving beyond 16 cores.
 

jaymc

Distinguished


That's not exactly correct; a true DX12 or Vulkan (Mantle) title should be able to scale well above 8 cores. Actually, it would be capable of breaking the workload into as many threads as there are free cores, or as many as is beneficial for that matter.
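
Just to illustrate what I mean by breaking the workload up: a rough pthreads sketch that asks the OS for the logical core count and spawns that many workers. render_slice and the item count are made-up placeholders, not anything out of a real DX12/Vulkan engine:

#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define WORK_ITEMS 4096          /* placeholder workload size */

struct slice { int begin, end; };

static void render_slice(int i) { (void)i; /* placeholder per-item work */ }

static void *worker(void *arg)
{
    struct slice *s = arg;
    for (int i = s->begin; i < s->end; i++)
        render_slice(i);
    return NULL;
}

int main(void)
{
    /* Ask the OS how many logical cores are online and spawn that many workers. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1;

    pthread_t *threads = malloc(n * sizeof(*threads));
    struct slice *slices = malloc(n * sizeof(*slices));

    for (long t = 0; t < n; t++) {
        slices[t].begin = (int)(t * WORK_ITEMS / n);
        slices[t].end   = (int)((t + 1) * WORK_ITEMS / n);
        pthread_create(&threads[t], NULL, worker, &slices[t]);
    }
    for (long t = 0; t < n; t++)
        pthread_join(threads[t], NULL);

    printf("processed %d items on %ld threads\n", WORK_ITEMS, n);
    free(threads);
    free(slices);
    return 0;
}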
 


The problem is that as you scale up, especially for memory-heavy tasks like GPU rendering, you start to run into a lot of scheduling/resource overhead. Unless you have either insanely fast memory access or much larger CPU caches than we currently have, these effects will limit how well you can scale. The APIs can handle it, but the rest of the system can't.
 
Well, gamerk, in case you haven't noticed, we do have insanely fast memory that is grossly underused ATM. Maybe when developers start actually taxing the memory subsystem we'll start noticing the bottlenecks you're talking about, but if you ask me, we haven't seen any of them yet.

My take is the same as before: we haven't seen a proper paradigm shift for economic reasons, not technical ones.

Cheers!
 


Latency is more important here than bandwidth; you do NOT want a thread assigned and then waiting several hundred CPU cycles to get the data it needs.
 
Not disagreeing, but I'm pretty sure that is why NUMA has such a huge impact in Linux for AMD. Memory management for threading is really important, like you say, and AMD has it covered with NUMA (to a degree) for massively parallel workloads. Changing the topology of how you arrange the memory for the CPUs has a deep impact on latencies across the hardware. I like the trade-off, to be honest: less effective memory for better average access times is most definitely a great trade-off. This just adds to my comment of "we do have fast memory anyway".

What I'd like to see is Microsoft pushing a NUMA-enabled version of Windows 10 for AMD. I know Server has it, but you have to enable it. So I don't think (or I hope?) it would be a massive cost for Microsoft to push it into the consumer space. There's a reason to do it now, at the very least, so why not? Hell, maybe there's already a hidden option (since the kernel is largely the same anyway?) that could enable it? Maybe there's already a way?
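
For what it's worth, the NUMA topology is already visible to applications even on consumer Windows through the Win32 API, so at least the plumbing is there. Something like this (a rough sketch with most error handling trimmed; the node choice and allocation size are arbitrary):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;

    /* Report how many NUMA nodes the kernel sees (0 means a single node / UMA). */
    if (!GetNumaHighestNodeNumber(&highest)) {
        fprintf(stderr, "GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes: %lu\n", highest + 1);

    /* Allocate 1 MiB explicitly on node 0, instead of letting the first-touch
       policy decide where it lands. */
    void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 0);
    if (buf)
        VirtualFree(buf, 0, MEM_RELEASE);

    return 0;
}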

Cheers!
 


The downside to NUMA is that you need your workloads to NEVER touch the same memory data, otherwise everything grinds to a halt. That's why Windows does so poorly with NUMA, given the way it handles threading (which is optimized for non-NUMA).
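
In practice that means each thread keeps its working set on its own node, e.g. something along these lines with libnuma on Linux (a sketch; the buffer size and node choice are placeholders, and you'd do this once per worker thread):

#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;                /* placeholder node choice */
    size_t size = 64 << 20;      /* placeholder 64 MiB working set */

    /* Memory backed by node 0's local controller, so a thread running on
       that node never pays the cross-node latency penalty for it. */
    void *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Keep the calling thread on the same node as its data. */
    numa_run_on_node(node);

    /* ... per-thread work on buf ... */

    numa_free(buf, size);
    return 0;
}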
 

genz

Distinguished


Sounds like a problem for... drum roll
https://en.wikipedia.org/wiki/Windows_on_Windows

Maybe they can get a kernel scheduler on each NUMA channel. :lol:

Seriously though, with the number of 2S and 4S systems produced throughout history, are you honestly telling me that OS schedulers start to destroy performance gains at only 16 threads? There have been 8-core setups since 2007 (quad-core Xeons in 4S configs). That's very, very close to where we currently are in the consumer space already.
 


This might sound weird or even annoyingly direct, but... do you really think serious work on servers with a multitude of CPUs is done with Windows installed?

That is one of the many reasons why you never really use Windows Server for anything *remotely* serious.

All the people developing .NET applications must be delusional if they think their code is going to be running critical applications or critical infrastructure on Windows, if at all. At best, web apps or crappy balancing machines. Funny thing: did you know IIS craps out when the CPU is at 100%? It can't accept new connections, and it's still a thing they haven't fixed. Dayum!

Anyway, the point is MS hasn't really taken NUMA seriously, for some bizarre reason I don't personally know, even though they do "support" it in Windows Server.

Cheers!
 


MSFT doesn't take NUMA seriously because the scheduler they use really isn't designed for that type of workload. At minimum they'd have to re-write their thread scheduler from scratch, which is something they likely don't want to do at this juncture. And I suspect there are a lot of low-level Windows internals that don't play well with NUMA.

Windows was designed around Uniform Memory Access (UMA); it's no great surprise that it starts to break in a NUMA environment.

As for threading, the main problem as you increase thread count is that the OS scheduler/memory access starts to become a larger and larger problem, often leading to significant decreases in performance scaling. You also need to remember that other tasks are trying to run too, and having one application take all the CPU resources often leads to all applications losing performance as they constantly bump each other's threads while trying to finish their tasks. In an environment where your application is the only thing running and you have direct control of memory access, you could scale to infinity. But neither of those things is typically true, and scaling declines as thread count increases as a result.
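
As a back-of-the-envelope illustration only (the serial fraction and per-thread overhead below are invented numbers, not measurements): a fixed cost per extra thread is enough to make speedup peak and then fall off, which is roughly the shape of that 16-core knee.

#include <stdio.h>

int main(void)
{
    /* Toy model only: time(n) = serial + parallel/n + overhead*n.
       The 5% serial fraction and 0.4% per-thread overhead are made-up
       values, chosen just to show the shape of the curve. */
    const double serial = 0.05, parallel = 0.95, overhead = 0.004;

    for (int n = 1; n <= 64; n *= 2) {
        double t = serial + parallel / n + overhead * n;
        printf("%2d threads: speedup %.2fx\n", n, 1.0 / t);
    }
    return 0;
}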

As for Windows on Windows, MSFT is just converting all 32-bit memory accesses into their 64-bit equivalents in real time in order to keep the OS happy. Everything is still going through the Win32 APIs under the hood.
 
https://www.hardocp.com/news/2019/01/03/amd_ryzen_threadripper_2990wx_performance_regressions_linked_to_windows_bug/

Called it. Sounds like Windows was trying to use only one NUMA node, and spent almost all its time trying to juggle threads onto just that one node.
 


More like, who was doubting that was the case?

It's been known for ages that Windows doesn't do NUMA well. At least, in the enterprise world.

Cheers!
 

genz

Distinguished
It reads to me like a hardcoded fix for an Intel bug getting in the way. The OS shouldn't be avoiding memory controllers, period, even if the reason is that there's no path to I/O on the XCC chips.

Windows seems to fall over this regularly: issuing a bugfix for first-gen tech in a form that doesn't specifically target the hardware with the problem.
 


The problem is the scheduler has gotten so heavily optimized that anything different tends to break in some really weird ways.

But yeah, it is odd the scheduler was trying to put all the threads on just one node, while avoiding the others entirely. That's unusual.
 
Hello everyone! I purchased the 2990WX back in September, mainly for 3D rendering. It was great out of the box, running at 100%. Recently I noticed that while rendering in KeyShot the chip seems to be throttled. September 2018: 3 GHz at 100% on all cores; April 2019: 2.23 GHz at ~65% on all cores. I'm not sure where to start; can someone please point me in the right direction?

Notes:
- Wendell's Coreprio does not seem to help.
- No other programs are running.


AMD 2990WX, Enermax liquid cooling
Asus ROG Zenith Extreme
128 GB Corsair Vengeance LPX
EVGA 1080 Ti
EVGA 1600 W PSU
(attached benchmark screenshot: lee-souder-cpu-bench.jpg)
Well, I suggest you start your own Question thread, as this one is not meant for that kind of advice. Once you do, I'll put my thoughts in it :D

Cheers!
 
I will say that Linux's approach to thread scheduling (thread pools) makes more sense as core counts get higher and higher; you get more consistent performance than with Windows' approach, which has the side effect of threads jumping between cores as they get bumped/rescheduled. Windows theoretically offers higher total system throughput because the highest-priority thread(s) always run, but any performance benefit is often lost when threads get rescheduled to different CPU cores.

That being said, Linux's approach of giving every thread in a CPU core's thread pool an equal chance to run (fair scheduling) gimps performance in certain tasks (games) where one or two threads do a disproportionate amount of work.
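
The usual workaround is for the application itself to tell the scheduler that one thread matters more, e.g. by lowering that thread's nice value on Linux. A sketch only; negative values normally need CAP_SYS_NICE or root, and the -5 is an arbitrary example:

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* On Linux the nice value is a per-thread attribute, so passing the
       kernel thread ID to setpriority() bumps only this thread. */
    pid_t tid = (pid_t)syscall(SYS_gettid);

    if (setpriority(PRIO_PROCESS, tid, -5) != 0)
        perror("setpriority");
    else
        printf("thread %d nice value is now %d\n",
               (int)tid, getpriority(PRIO_PROCESS, tid));

    return 0;
}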

Regardless, it's becoming clear that Windows needs to redesign its thread scheduler going forward. That and tossing the NTFS file system are two overdue changes MSFT needs to get working on.
 