AMD CPU speculation... and expert conjecture



Sorry I didn't elaborate there. The NT scheduler doesn't do thread homing or make any attempt to keep a thread running on the same target (register stack). This has a pretty bad impact, because each time the thread is task switched its cache and branch-prediction information is lost: Core 1 won't have the same caching and branching information as Core 2, so when the thread is resumed on Core 2 it has to start from scratch. Inside the core there are actually many registers, not just the few x86/x64 ones that are directly addressable. CPUs use sliding register windows and register files to save a thread's register state before doing a task switch, so that the state can be fast-resumed when the thread is switched back onto the processor. Modern CPUs also employ power gating and independent clocking so that unused cores can be down-clocked or turned off.

The NT scheduler ignores all of this and just throws a thread onto the lowest-utilized core it can find. You can see this by watching per-core utilization while running a heavy single-threaded task, something like SuperPi set on full blast: instead of seeing one core at 100% and three at 0~5%, you see four at ~25%. It's just an incredibly inefficient way to go about doing things. Most Unix OSes and even some Linux kernels offer processor homing and other mechanisms that try to keep a thread running on the same core whenever it's task switched back in.
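For what it's worth, here's a minimal sketch of what manual thread homing looks like from user code on Windows, using the Win32 SetThreadAffinityMask call; the heavy_work() function and the mask value are just illustrative stand-ins for whatever single-threaded load you want to pin.

```cpp
// Minimal sketch: pin the current thread to one logical core so the scheduler
// can't bounce it between cores (and their cold caches / branch predictors).
#include <windows.h>
#include <cstdio>

static void heavy_work()
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += i * 0.5;                      // busy loop standing in for SuperPi-style work
}

int main()
{
    // Bit 0 set = only logical core 0 may run this thread.
    DWORD_PTR old_mask = SetThreadAffinityMask(GetCurrentThread(), 0x1);
    if (old_mask == 0) {
        std::printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    heavy_work();                          // per-core graphs should now show one core pegged
    return 0;
}
```

With the mask in place you get the one-core-at-100% picture instead of four cores hovering around 25%.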
 


Having seen the 6800K, it is definitely faster than the 5800K; they punted 22%, and that is there or thereabouts. The L2 underwent a lot of work and is just faster, with less latency.
 


Hopefully so. For now I'm guardedly optimistic. The L2 cache and branching were the two areas that crippled BD's performance potential; if they can fix those, then they might have a pretty good CPU.
 



We did find that to be the case: Trinity lacks L3, yet in some instances it is able to beat the old X6s and even run up to the FX-8350. I personally think AMD's L2 and L3 config is way too big and way too slow. What Richland has done is tightened that up a lot, and the performance benefits are there. If this is a precursor to Steamroller, then AMD's roadmap stipulated 50% faster L2 and L3 and 37% latency reductions. That alone can affect IPC by a fair amount, on top of the other architecture overhauls.

If all goes well there is no reason why Kaveri will not be entry-level gaming. I am still yet to hear more on the replacement dual-graphics cards, but the SKUs mentioned sound like they possess a little kick, and that may be really awesome for building a low-cost gamer that plays everything well at that price point, with low-cost, low-powered dual-graphics options.
 

Cazalan




One could argue that spreading the load gives the CPU more longevity and prevents heat buildup in one quadrant of the die. Windows probably does switch far too often, though, and it fights with Turbo clocking if an application truly is limited to 1 or 2 cores.

Some games/programs are doing thread prioritization in Windows 7.

On my i5, Warcraft consistently loads Core 1 @ 25% and Core 3 @ 75%. Cores 0/2 remain below 10%. Minimizing/exiting/restarting resumes the same loading behavior.

Core 2 is servicing a 1GB Firefox session (been running a week+) and bounces between 6-8%. Even after closing the game, Core 2 stays at the same load, with Cores 0/1/3 below 2%.

If it were 100% round robin I would expect a more even load.
 

Cazalan



If you've seen it, when is the embargo up? I thought these things started shipping in January.

These leaks show fairly modest gains, ~10%.

http://wccftech.com/amd-apu-performance-numbers-revealed-details-launch-schedule-richland-kabini-apus-leaked/
 


They did say around 7-10%, give or take, on the CPU side. On the iGPU some said 40%, other places 20%; 15-20% is more the likely region. In gaming that doesn't mean 15% added on top of the frames the 5800K attained. In Sleeping Dogs (which comes bundled with some models), while the 5800K does around 34 FPS at a 16x10 resolution on medium settings, we saw the 6800K deliver a nice amount more, and that was using DDR3-1600, which no owner relying on the integrated graphics will stick with.
 
Well, we do know that AMD responds well to memory performance. Anand and Tom's have both run tests on how Intel HD and AMD iGPUs respond to memory profiles: while Intel sees very little improvement, on AMD going from DDR3-1600 to DDR3-2400 can be worth as much as 20-25 FPS depending on the game, with 10-12 normally seen. Improving the cache will help throughput, so ultimately performance improves. The issue here is that if you are on a 5800K there is no benefit in buying a 6800K; you may as well wait for Kaveri. For a new prospective buyer, if iGPU is your thing, forget Haswell and go straight for this.
 


Encoding is quite impressive, but then you expect that from Intel; it has its uses in that area, and obviously a nice little visual output with some gaming potential. Other than that, if integrated performance is the desire, then a mid-level APU is more than enough.

On mobility, I think GT3 will be very nice for a mobile user, particularly with AMD still testing the waters.
 
The great Hector Ruiz writes about the fight against Intel in a new book.
http://blogs.wsj.com/digits/2013/02/14/former-amd-chiefs-book-describes-fight-against-intel/?mod=
Seems like such an earnest guy.
I got the link from Fudzilla; they said AMD won $1.25 billion from Intel in the settlement. I skipped them for distorting facts and went over to WSJ for the real article.
 


Well, it's ironic considering his exploitation of AMD resources to line his own pockets.
 

mayankleoboy1

http://www.behardware.com/articles/847-1/the-impact-of-compilers-on-x86-x64-cpu-architectures.html

Here is the compiler test I referenced, comparing ICC, MSVC (Visual Studio) and GCC with various optimisations.



http://www.behardware.com/articles/847-15/the-impact-of-compilers-on-x86-x64-cpu-architectures.html

Taking the ICC baseline as a score of 100:
1. VS is slower in all cases.
2. The Phenom II arch gains quite a lot from using ICC.
3. BD gains a lot over VS, but less than Phenom II does.
 


Nope. Task Manager %usage basically indicates how much time a core's idle thread is not running. What that means, in layman's terms, is that if you have a core that's 75% utilized, it's spending 25% of its time... doing absolutely nothing whatsoever [which implies there is no thread on the system that is able to be run].

If you're running a game that scales like this:

Core 0: 75%
Core 1: 70%
Core 2: 45%
Core 3: 10%

then HALF of the time, the OS believes there is no thread, anywhere on the system, that is capable of being run at that point in time. Kinda puts things in perspective as far as performance goes, doesn't it? How can a program be expected to scale if, 50% of the time, it has nothing to run? [The time it takes to load from the L2, let alone the L3, is significant when looking at utilization graphs.]
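Just to make that arithmetic explicit, here's a trivial sketch (numbers copied from the example above) that turns those per-core figures into an overall busy/idle split:

```cpp
// Average per-core utilization == fraction of time the OS had something
// runnable on a core; the remainder is time spent in the idle thread.
#include <cstdio>

int main()
{
    const double util[] = { 75.0, 70.0, 45.0, 10.0 };   // Cores 0..3 from the example
    double sum = 0.0;
    for (double u : util) sum += u;

    const double avg_busy = sum / 4.0;                  // 50% for these numbers
    std::printf("busy: %.0f%%  idle (no runnable thread): %.0f%%\n",
                avg_busy, 100.0 - avg_busy);
    return 0;
}
```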
 


Agree, except for the psychotic part. The NT scheduler is actually pretty good: the highest-priority feasible (able to run) thread ALWAYS runs. If there's a tie, one is picked at random. Threads that are waiting get a priority boost, threads that are running get a priority decrement. Threads belonging to a foreground application get a priority boost. OS kernel threads can preempt any user thread. And other odds and ends like that. All in all, it does a good job in regards to program throughput (especially when there is only one foreground application), even if latency is only so-so.
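A user program can poke at the same machinery directly if it wants to; here's a rough sketch using the standard Win32 priority calls (the chosen priority levels are arbitrary, and the OS still applies its own boosts and decrements on top of them):

```cpp
// Request a higher base priority for this process/thread; the NT scheduler
// then layers its dynamic boosts/decrements on top of these base values.
#include <windows.h>
#include <cstdio>

int main()
{
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST))
        std::printf("SetThreadPriority failed: %lu\n", GetLastError());

    std::printf("effective base thread priority: %d\n",
                GetThreadPriority(GetCurrentThread()));
    return 0;
}
```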

As far as breaking things up into more threads: it does not make sense to break things up when they cannot be run in parallel. As most tasks are either sequential OR interact with so many other subsystems that locks and waits will kill performance, you see most games running on 2-3 heavy threads, with other threads working on the remaining parallel, but very processing-light, workloads. At the end of the day, the render pipeline is sequential: stage A has to be completed first, because the output of A is the input of B.
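As a rough illustration of that dependency (the stage names and "work" here are made up): you can overlap stage A and stage B across different frames, but within a single frame B cannot start until A has produced its output, so throwing more threads at one frame buys nothing.

```cpp
// Two-stage frame pipeline sketch: stage A (simulate) feeds stage B (render).
// The stages overlap across different frames, but for any one frame B must
// wait for A's output -- that ordering is inherent to the pipeline.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> ready_frames;              // output of A, input of B
std::mutex m;
std::condition_variable cv;
bool done = false;

void stage_a_simulate(int frame_count)
{
    for (int f = 0; f < frame_count; ++f) {
        // ... game-state update for frame f would happen here ...
        {
            std::lock_guard<std::mutex> lock(m);
            ready_frames.push(f);
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_one();
}

void stage_b_render()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !ready_frames.empty() || done; });
        if (ready_frames.empty()) break;   // A finished and queue drained
        int f = ready_frames.front();
        ready_frames.pop();
        lock.unlock();
        std::printf("rendering frame %d\n", f);  // stands in for draw submission
    }
}

int main()
{
    std::thread a(stage_a_simulate, 5);
    std::thread b(stage_b_render);
    a.join();
    b.join();
}
```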
 

truegenius

But very few single-player games use 4+ cores (as rare as antimatter ;) ) and fewer still use 4+ cores effectively (as rare as a terrestrial radio signal), so buying a multi-core CPU for those games, which will only emerge 5+ years from now, is not a good move.
 


Won't stop AMD fans from complaining that Intel's compiler generates the fastest code for their platform, by a significant amount, even when not fully optimized.

One thing I found interesting to note though:

If you have read this report from the top, the results given here won't surprise you. The Intel compiler does best. While gcc does okay, in practice the tuning options are often counterproductive and undermine the gains they bring. Visual Studio is significantly slower in its standard version because it still generates x87 code by default for floating point operations. Moving over to SSE2 for the maths operations improves performance, particularly on AMD processors where x87 code has been somewhat sidelined by new instruction sets such as SSE2 that have been designed to replace them.

Now, almost all devs I know enable SSE2 by default, but it kinda makes you wonder how much x87 code is still being generated by the compiler...
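For anyone who wants to check their own builds, here's a tiny float-heavy function plus the compiler switches that control x87 vs. SSE2 code generation (this matters for 32-bit builds; x64 compilers emit SSE2 by default). Inspecting the assembly for fld/fmul/faddp versus mulss/addss on xmm registers tells you which path you actually got.

```cpp
// Tiny float kernel to inspect the generated code with:
//   MSVC (32-bit): cl /O2 /arch:SSE2 /FA dot.cpp           (older MSVC defaults to x87 without /arch:SSE2)
//   GCC  (32-bit): g++ -O2 -msse2 -mfpmath=sse -S dot.cpp  (defaults to x87 without these flags)
float dot(const float* a, const float* b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```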
 

mayankleoboy1



AFAIK, as of VS2012 SP1 it defaults to targeting the SSE2 instruction set. In fact, Firefox had to specifically change this setting, as they don't even target SSE2 (which is almost ancient now), for compatibility's sake.
 