AMD CPU speculation... and expert conjecture



It's been there since the earliest computers with multiple-CPU support back in the '70s.

Now, whether the software can be written in a multithreaded, parallel way, and whether the OS is smart enough to put parallel threads on separate cores, that's another story.
 


Good to hear; still using MSVC 2008 at work.

And yeah, not shocked about the compatibility; as long as XP hangs around, you aren't going to see SSE2 as the default option (as XP supports CPUs pre-SSE2).
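(For the curious, with the 32-bit MSVC compilers of that era SSE2 is opt-in per build rather than the default; a minimal illustration, file name hypothetical:

rem Sketch: opting into SSE2 code generation with the 32-bit compiler
cl /O2 /arch:SSE2 app.c
rem vs. the default blended x86 build, which stays safe on pre-SSE2 (XP-era) CPUs
cl /O2 app.c
)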
 
Agree, except for the psychotic part. The NT scheduler is actually pretty good: the highest-priority feasible (able to run) thread ALWAYS runs. If there's a tie, one is picked at random. Threads that are waiting get a priority boost, threads that are running get a priority decrement. Threads belonging to the foreground application get a priority boost. OS kernel threads can preempt any user thread. And other odds and ends like that. All in all, it does a good job in regards to program throughput (especially when there is only one foreground application), even if latency is only so-so.

That's not why it's psychotic. It's constantly moving threads around on the cores without respect to where they were last running. This isn't good for performance, as it forces a cache flush every time the CPU task switches, which is fairly often. I know this because I do my own performance metrics and watch it happening. This was REALLY a PITA when I was using K10stat on my Llano to overclock + undervolt. K10 would keep all four cores @800MHz all the time unless I started up something heavy, then it would cycle a core up to 2.7GHz. It could only keep one or two cores at that speed, though. The NT scheduler would cycle the thread around until all four cores were @2.0GHz and kept them all there, drove me crazy. Used processor affinity to force the process onto a single core and *poof*, problem over: the other three cycled down to 800 and the 4th went to 2.7.

There was a Windows 7 patch a while back that kinda made it a bit better on BD CPUs. Mostly it was to allow the BD CPUs to do core parking properly.
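For anyone wanting to script that affinity trick rather than clicking through Task Manager, a minimal Win32 sketch (core choice and error handling are illustrative, and it assumes at most 64 logical processors):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Bit 0 set = only logical processor 0 may run this process's threads. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1))
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());

    /* ... run the workload; the scheduler can no longer bounce it across
       cores, so the other cores are free to clock back down ... */
    return 0;
}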
 


The issue is that the scheduler tries to run threads on the least-worked cores at that moment in time. For instance, if you get preempted and some other heavy-workload thread from some other application pops in, you probably don't want your game's main thread running on that core, even if moving it costs you a cache flush.

The Windows API allows for setting a "preferred" core, i.e. the core which you would like the thread to run on (SetThreadIdealProcessor), which helps mitigate this issue somewhat. Hard core-locking fails performance-wise, though, when more than one heavy-workload application runs. You also run the risk of preventing a thread from running much if a higher-priority thread is active (say, Windows is doing some task on that core). Hence why the scheduler works as it does. As you can see, TurboBoost (and equivalent tech) complicates this further, so I'd imagine MSFT would need to update the scheduler to favor putting threads on the same core if TurboBoost (or equivalent) is detected as enabled for the CPU.

For those interested, the full list of Windows threading API functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847(v=vs.85).aspx
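To make the ideal-processor hint concrete, a minimal sketch (the core index is illustrative; note it's only a preference, not a lock):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Hint that this thread should preferably run on logical processor 2.
       The scheduler may still place it elsewhere under load. */
    DWORD previous = SetThreadIdealProcessor(GetCurrentThread(), 2);
    if (previous == (DWORD)-1)
        printf("SetThreadIdealProcessor failed: %lu\n", GetLastError());
    return 0;
}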
 
Having done benchmarking on this, hard-locking a single-threaded application to a core will give you a slight performance boost. We have multiple cores; Windows will not preempt a thread off a core if there is another core already free. Hard-locking won't run into issues with NT jacking your time slice, as NT will just run on another available core instead. NT will only evict you if all other cores are @100%, in which case you really don't need to be hard-locking, as you're already maxing CPU performance. Hard-locking is just a trick to squeeze out maximum performance in poorly threaded applications.
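The hard lock itself is a single call; a minimal sketch (core index illustrative), which unlike the ideal-processor hint means the thread may ONLY run there:

#include <windows.h>

int main(void)
{
    /* Affinity mask with only bit 3 set: this thread may run only on core 3. */
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(),
                                               (DWORD_PTR)1 << 3);
    if (previous == 0) {
        /* call failed; GetLastError() has the reason */
    }
    /* ... heavy single-threaded work now stays on core 3 ... */
    return 0;
}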

The problem with it being an API call is we're now at the mercy of a programmer bothering to actually use it. I prefer having such control myself. This feature is actually incredibly important when dealing with NUMA architectures: scheduling a task to run on one CPU while its memory space is located in a different physical CPU's memory is very bad.
 
Having done benchmarking on this, hard-locking a single-threaded application to a core will give you a slight performance boost. We have multiple cores; Windows will not preempt a thread off a core if there is another core already free. Hard-locking won't run into issues with NT jacking your time slice, as NT will just run on another available core instead. NT will only evict you if all other cores are @100%, in which case you really don't need to be hard-locking, as you're already maxing CPU performance. Hard-locking is just a trick to squeeze out maximum performance in poorly threaded applications.

Not true. Understand the first rule of threading in Windows:

Windows, in all cases, without exception, will run the highest priority thread(s) that are capable of being run.

Granted, the switch may be for a few ns or so, so you don't notice. But it happens, and it happens a lot. Every time you access RAM? Your thread gets booted. Look at some variable stored in the L3? Thread gets booted. L2? Same deal. If your thread can not run, or if some higher-priority thread can, you get booted. And guess what? Every time you get booted, there's a chance it may resume on a different CPU core.

Basic example: You have a priority of 5. Some other task has a priority of 1. So your thread runs for two cycles. Due to how the scheduler works, the priorities will have changed so you have a priority of 3, and the other task has a priority of 3 (your priority gets decremented while you run, while the other thread gets a boost while waiting). Windows decides to run the other task, and your thread stops running. One cycle later, your thread has a priority of 4, and the other thread a priority of 2. So you figure your thread will resume on the same core, right? Unless some other thread on a different core has a priority of 1, in which case, guess what? He's the one who gets booted, due to having the lowest priority.

So you say "OK, I'll hard-lock the core to avoid having to dump the cache". Let's look at this example:

Let's take the case of a 4-core system; you have 4 threads; 3 do a lot of work, 1 not so much. You query the processor, see it has four cores, and hard-lock each thread to a core at runtime.
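In code, that query-and-lock scheme looks something like this sketch (the worker function and fixed thread count are illustrative, and this is the pattern being critiqued, not a recommendation):

#include <windows.h>
#include <process.h>

static unsigned __stdcall worker(void *arg)
{
    int core = (int)(INT_PTR)arg;
    /* Hard-lock this worker to one core, 1:1 with the loop in main(). */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << core);
    /* ... do this thread's share of the work ... */
    return 0;
}

int main(void)
{
    SYSTEM_INFO si;
    HANDLE threads[4];
    int i;

    GetSystemInfo(&si);   /* si.dwNumberOfProcessors = logical core count */

    /* Assumes the 4-core case described above. */
    for (i = 0; i < 4; ++i)
        threads[i] = (HANDLE)_beginthreadex(NULL, 0, worker,
                                            (void *)(INT_PTR)i, 0, NULL);
    WaitForMultipleObjects(4, threads, TRUE, INFINITE);
    return 0;
}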

At some point in time, you alt-tab out of the application, do some stuff on the internet, and so on. Meanwhile, your AV, seeing no major CPU activity, kicks off its scanner and loads its main thread on core 0, because it also hard-locked its threads to cores.

You come back and, blissfully unaware of what just happened, return to your application.

You are now officially screwed from a performance standpoint; the thread on the first core (likely the game's main thread) is now competing against the AV scan. Even if your thread has the higher priority (it gets a boost due to being in a foreground process), it is going to get booted some percentage of the time by the AV. And heaven forbid the AV is coded so its main thread gets "high" priority by default.

When you hard-lock threads, you make a major assumption: no other heavy-workload application will come around and prevent an application-critical thread from running.

The worst case, of course, would be the oft-mentioned theoretical example of two games running side by side in windowed mode. Imagine if both games were coded to lock their three or four heavy-workload threads to the first four CPU cores, blissfully unaware you are running an 8-core system and the last four cores are sitting idle. But, because you hard-locked a 1:1 core:thread ratio, the OS can't dispatch to the unused cores. Another good example: FRAPS. [And no, querying the CPU for current load is not a good idea, because the CPU core will NEVER be doing nothing at the time you query it. (sarcasm implied)]

So yes, if no other process-heavy app is working, hard-locking would likely lead to very minimal (5%) gains, simply due to not flushing the CPU cache. Run with another process-heavy app, though, and performance suffers for both applications.

Now, what you describe is common for consoles, because there is NO threat of other threads running. [As far as the PS3, which I am more familiar with, you have just over 200MB RAM and 6 SPEs to play with; the rest (~56MB RAM and the 7th SPE) is reserved for the OS.] As only one task (OS aside) can run at a time, you can hard-lock threads:cores without negative impacts, unlike on a multitasking OS.

This feature is actually incredibly important when dealing with NUMA architectures: scheduling a task to run on one CPU while its memory space is located in a different physical CPU's memory is very bad.

Windows actually has NUMA-specific API calls for just that situation. For one, you can set logical processor groups (up to 64 logical processors per group). Point being, you aren't going to see applications move onto a different processor node on their own.
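A rough sketch of those NUMA calls (the node number, buffer size, and missing cleanup are illustrative):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);   /* e.g. 1 on a two-socket box */

    /* Keep this thread on node 0's cores... */
    ULONGLONG nodeMask = 0;
    GetNumaNodeProcessorMask(0, &nodeMask);
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

    /* ...and allocate its working memory from node 0's local RAM. */
    void *buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                                      MEM_RESERVE | MEM_COMMIT,
                                      PAGE_READWRITE, 0);
    if (buffer == NULL)
        printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
    return 0;
}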
 
One thing that's true from the earlier conference call: it looks to be a 50% boost for the top mobile solution, plus others, so more mobile wins can be seen for AMD. The rest of the call, about discrete, I'm not so sure about.
 

kettu



What is your point? Are you still trying to argue that games can't be made to utilize at least six cores? The BF3 multiplayer pretty conclusively proves that it can be done. Or are you trying to reconcile these two incompatible sentences? Or something else?

"Funny, I have't seen a single example of that as far as games go..."
"I explicity states that Multiplayer lends itself to better performance with more cores, since you are adding a fairly process heavy thread into the mix."
 
Gamer, everything you just said only matters if you have a single CPU core; the moment you have multiple targets, Windows will always pile stuff onto the lowest-utilized target. Thus if you hard-lock a thread it never gets booted unless all other CPUs are already busier than it is. I've stated this multiple times now.

Everything I've stated so far has been proven and is well known. I have benchmark results from doing exactly this, namely with Cinebench running in single-threaded mode. If you search around the site you might bump into my old posted results.

Run with another process-heavy app, though, and performance suffers for both applications.

What part of having multiple CPU cores do you not understand here? Four cores: one is at 100%, the other three are somewhere from 0 to 15%. Under no circumstance will the NT scheduler put something else onto that 100%-utilized core when there are three other cores with open resources available. The priority of the thread isn't nearly as important, because you have four targets; you're not getting four priority-one tasks happening simultaneously that all consume large amounts of resources, unless you forgot to update your AV.

Windows actually has NUMA-specific API calls for just that situation. For one, you can set logical processor groups (up to 64 logical processors per group). Point being, you aren't going to see applications move onto a different processor node on their own.

And NOBODY uses them (well, mostly nobody). I know this because I do lots of performance analysis. They just write new threads and keep on trucking; it ends up messy.
 

kettu



I disagree. It's the high-margin server chips that are most important. If AMD can sort them out, the rest will fall in line more or less automatically. I mean that they would improve their financial situation with a successful server line, and those server cores can be used for consumer products with or without an iGPU. But I sort of agree that the CPU/GPU chips could be the more important of the two consumer product lines (FX/A series). I think Steamroller will be a significant step forward with more dedicated hardware per core, considering that the BD/PD cores are fine; it's the shared uncore that is bottlenecking the cores. That's my opinion at least.
 


Hmm, good to hear; was that 50% in GPU performance, CPU performance, or some aggregate of the two?
 

kettu

^ Unless it isn't praising your favorite CPU brand; then it's run by a paid shill of the rival brand.
IMO it is half decent, just really low on resources.
Their articles sometimes don't have concluding or summarizing paragraphs. :p
 

blackkstar

I wanted to add a little bit to this conversation. I've been out of it for a while.

Several pages ago, I mentioned consoles were holding threading back because game developers were optimizing around 3 cores. I was then told that the OS is in charge of scheduling threads.

It seems the point I was trying to make was missed. If you have a three core CPU, you're going to aim for program logic that uses 3 cores. The OS scheduler can handle more, but why bother? It just adds needless complexity.

The entire argument about scaling to 8 cores or even 4 reminds me of when we started heading towards dual core. There were tons, and I mean tons, of people who felt that the second core was a waste and there was no way to use it.

Now, we have quads and 8 cores all over the place, and at least two threads are common.

The point I'm trying to make is that, much like when we were transitioning from one core to two, it doesn't seem to make sense to go to extra threads until you have to. Eventually, developers were forced to adapt to a situation where they had to start optimizing things for two cores.

Imagine if we never took that step from single-threaded. How well do you think the Xbox 360, with its tri-core PowerPC @ 3.2GHz, would run if it only used one core?

Software will catch up, and if the new consoles use 8 weak cores, it's going to force developers into programming such that 8 threads are used. Yes, it's not optimal. Yes, no one does it now. But will they? They will have to if the 8-core Jaguar console rumors are true. A game simply isn't going to run well when it's using 2 threads on a 1.6GHz 8-core Jaguar system.

And as for compilers, just knowing what something was compiled with is way, way too big of a generalization. I run Gentoo on my FX, and I set flags on GCC that make my FX 8350 more than twice as fast. Hell, if I wanted to, I could compile with horrible flags and chug out code that probably runs worse than everything out there.

Simply knowing what was used to compile software, and then saying that the entire compile is at fault, is way, way too vague. In the case of people complaining about ICC, it's because it used to disable optimizations for certain products that were more than capable of using those optimizations, regardless of what the settings were.

http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations

There are a lot of options there just for ICC. For GCC, it's far more extensive, as you can flag each individual instruction set extension. I would blame the person setting the options way before I start blaming the compiler in general.
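As a rough illustration of the difference flags make (file names hypothetical; -march=bdver2 is GCC's Piledriver target, which is what the FX 8350 is):

# Same source, two very different binaries.
gcc -O2 -march=bdver2 -o app_fx app.c      # tuned for Piledriver: AVX/FMA/XOP enabled
gcc -O2 -march=x86-64 -o app_generic app.c # baseline x86-64: none of the above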
 

mayankleoboy1



Because Phoronix "articles" are usually a comment on a set of open-source code patches that have landed in Linux kernel trees, so you can't really summarize that. It's not like Tom's News pieces, which report things wrong, don't usually give sources, and are full of spelling errors.

Regarding the paid shill bit, I'd say that it's the only site on the net (apart from AMDZone) which shows AMD procs in a positive light.
 