AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic; anything else will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame-baiting comments about the blue, red, and green teams and they will be deleted.

Enjoy ...
 
But again, you still need to be doing multiple things at the same time for threading to have any real impact. And very few heavy workloads can be easily broken up into smaller units that run in parallel, because most jobs end up being a series of sequential events.

For example, when we first got quads, the thought was that you could write a game with the audio, rendering, physics, and AI engines each running on a different CPU core. Sounds like a nice scheme, until you realise that the audio is almost totally dependent on the other three engines, and AI, physics, and rendering all interact with each other in some fashion, which limits how much of their workloads can be done without needing some form of synchronization [which can take a VERY long time if one of those threads is pre-empted by the OS, in which case your application is basically stuck in a waiting state. Thankfully, Windows typically doesn't pre-empt high-workload foreground application threads...].
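To make that synchronization point concrete, here's a rough C++ sketch (the names and the three-frame loop are made up purely for illustration, not taken from any real engine): the render thread can't touch a frame until the physics thread has published it, so if the physics thread gets pre-empted, the renderer just sits in cv.wait() doing nothing.

// Hypothetical sketch of the dependency problem described above.
// The render "engine" must wait for the physics "engine" every frame,
// so pre-empting the physics thread stalls both of them.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int frames_ready = 0;                  // guarded by m

void physics_engine() {
    for (int frame = 0; frame < 3; ++frame) {
        // ... simulate physics for this frame ...
        {
            std::lock_guard<std::mutex> lk(m);
            ++frames_ready;            // publish this frame's results
        }
        cv.notify_one();
    }
}

void render_engine() {
    for (int frame = 0; frame < 3; ++frame) {
        std::unique_lock<std::mutex> lk(m);
        // If the physics thread gets pre-empted (say, an AV scan lands on
        // its core), this wait is where the renderer sits burning time.
        cv.wait(lk, [frame] { return frames_ready > frame; });
        lk.unlock();
        // ... draw the frame ...
        std::printf("frame %d rendered\n", frame);
    }
}

int main() {
    std::thread physics(physics_engine);
    std::thread render(render_engine);
    physics.join();
    render.join();
}

The same pattern repeats for every cross-engine dependency, which is why the waits add up so quickly.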

Further, with more heavy-workload threads running, you usually require a LOT more memory access at any one point in time, which, in 32-bit .exes or low-memory systems, can result in a lot of paging, bringing HDD I/O waits into the equation. [Moving to native 64-bit .exes would help solve this problem, though...]. You also need to put a LOT more thought into what the CPU decides to cache in the L2, which can have a huge impact on performance [especially if there's a lot of paging going on].
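As a toy illustration of the caching point (nothing game-specific, and the matrix size is arbitrary): the two loops below do identical arithmetic, but the column-major walk keeps evicting cache lines before they get reused, so it typically runs several times slower.

// Toy illustration of how access order interacts with the cache.
// Both walks sum the same 4096x4096 matrix; only the traversal order differs.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;                                    // 64 MB of floats
    std::vector<float> a(static_cast<size_t>(N) * N, 1.0f);

    double sum = 0.0;
    auto run = [&](bool row_major) {
        sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += row_major ? a[static_cast<size_t>(i) * N + j]   // reuses each cache line
                                 : a[static_cast<size_t>(j) * N + i];  // 16 KB stride, cache-hostile
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    double ms_row = run(true);
    std::printf("row-major walk:    %.1f ms (sum=%g)\n", ms_row, sum);
    double ms_col = run(false);
    std::printf("column-major walk: %.1f ms (sum=%g)\n", ms_col, sum);
}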

Point being, the "simple" scheme of stuffing each game engine on a different core got very complex very quickly. And HEAVEN FORBID your AV decides now is a good time to start on the 4th core, in which case you've just tanked your application's performance. [Using more cores has the negative side effect of increasing the chances of a total application bottleneck if another application starts to execute on that same core, forcing your app into a waiting state. If your other threads are stuck waiting on synchronization, you are basically dead].

And this is BEFORE we even begin to consider how the OS is actually managing all these resources.

How would multithreading hurt performance in this situation? In your example, you have four game threads. Three of those threads are dependent on the first thread to execute. You have a fifth unrelated core-pegging thread start up on the same core that's running the first game thread. The 2nd-4th game threads immediately go into a wait state as the 1st thread lets the unrelated thread use the core. Your CPU utilization immediately drops to 25% as the 2nd-4th game threads are waiting for data from the first game thread. The OS sees three idle cores, one active core, and two threads wanting to run (the AV thread and the first game thread). It schedules the first game thread on one of the three other unused cores, and then the game resumes, with the three auxiliary game threads taking turns on the remaining two cores. This is how OS scheduling is accomplished by any remotely sane OS scheduler, which I believe even Windows has at the current time.

Also, in your example, making a poorly-threaded game won't result in more performance. Sure, there won't be a millisecond dip in performance due to thread rescheduling, but the game will likely be running more slowly to begin with as you are making one or two cores do four cores' worth of work. Oh, and the way to gain peak performance in your scenario would be to have a CPU with a huge amount of cores and massive memory bandwidth, such that no thread ever has to share a core with any other thread or wait on memory access due to another thread at any point in time. That's pretty much the opposite of what you were initially implying, e.g. that a lower-core-count CPU and less program threading is better.
 
I really doubt gamerk was saying "lower-core-count CPU and less program threading is better" TBH.

I'm sure you'll agree that this statement is somewhat true: "4 fast cores are better than 8 slow cores". You just need to find the balance in the mix for the 8 cores to do more than the 4 fast ones. And looking at real programs today, 4 (or even 2) very fast cores are better than 4+ slow ones at any given point of execution, no matter the scheduling approach the OS takes. It's hard to make a statement like that without thinking about re-compiling, re-tuning, or re-coding a program, but given programs as they are now, well, they're not that well threaded in most cases, except "professional" software (IMO all of it should be called that, heh).

And I like the BFS approach (now that you mentioned scheduling), even though it will never be in the main branch of the linux kernel, lol.

Anyway, great posts, MU.

Cheers!
 
The OS sees three idle cores, one active core, and two threads wanting to run (the AV thread and the first game thread). It schedules the first game thread on one of the three other unused cores, and then the game resumes, with the three auxiliary game threads taking turns on the remaining two cores. This is how OS scheduling is accomplished by any remotely sane OS scheduler, which I believe even Windows has at the current time.

As a programmer, I've learned the hard way never to assume the components some other programmer wrote were written in a safe, sane manner. In XP, the behavior I described is actually quite common. One of the big changes in Vista was giving priority to foreground applications, which you would think would have been done ages ago.

Windows uses a combination of priority and round-robin scheduling. The longer a thread waits, the higher its priority gets, and the highest-priority thread gets run. Aside from that, foreground tasks get a priority boost [as of Vista], and the OS has several high-level interrupts that pre-empt everything in the user domain. That's the Windows scheduler in a nutshell; MSFT knew a simple scheduler is the best possible implementation.
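For reference, here's a minimal Win32 sketch of the priority knobs that scheduler works with; the calls are standard Win32 APIs, but the particular levels chosen are just an example, not a recommendation.

// Minimal Win32 sketch of the scheduler's priority inputs.
// The chosen levels are purely illustrative.
#include <windows.h>
#include <cstdio>

int main() {
    // Raise the whole process one notch above NORMAL, roughly what a
    // foreground boost achieves for its threads.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    // Raise just this thread within the process's priority class.
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST))
        std::printf("SetThreadPriority failed: %lu\n", GetLastError());

    std::printf("process class: %lu, thread priority: %d\n",
                GetPriorityClass(GetCurrentProcess()),
                GetThreadPriority(GetCurrentThread()));
    return 0;
}

An AV that does something like this on its scan thread is exactly the situation described next.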

Back to the example: if the AV thread I described is launched as high priority (which I know a few are), then compared to your game (which is probably running at normal priority), even with the boost the game gets for being in the foreground, it will be roughly splitting work with the AV. That's a 50% performance penalty. And if that's the core that's running one of the main program threads, you've cut your maximum possible performance in half.

Now, if we assume the core that is running the audio engine is basically stuck at 5% usage (audio is NOT a heavy workload), then by using one fewer core and putting two engines (audio and one other) on the same core, you would have the 4th core free for the AV, which would not cause a negative performance hit for the game in question.
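A rough sketch of what that consolidation could look like with the Win32 affinity API (the engine bodies are stand-in sleeps, the core numbers are arbitrary, and this assumes MSVC's runtime, where std::thread::native_handle() is the Win32 HANDLE; real code would query the CPU topology first):

// Hypothetical sketch: pin two light "engine" threads to one logical core
// so another core stays completely free for whatever else the system runs.
#include <windows.h>
#include <chrono>
#include <thread>

int main() {
    // Stand-ins for the real engine loops, just to have threads to pin.
    auto audio_engine = [] { std::this_thread::sleep_for(std::chrono::seconds(1)); };
    auto ai_engine    = [] { std::this_thread::sleep_for(std::chrono::seconds(1)); };

    std::thread audio(audio_engine);
    std::thread ai(ai_engine);

    // Share logical core 2 between the two light engines, leaving core 3
    // entirely free for whatever else (an AV scan, say) decides to run.
    SetThreadAffinityMask(audio.native_handle(), DWORD_PTR(1) << 2);
    SetThreadAffinityMask(ai.native_handle(),    DWORD_PTR(1) << 2);

    audio.join();
    ai.join();
}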

Hence the debate between core balancing and maximizing total core usage.

Also, in your example, making a poorly-threaded game won't result in more performance. Sure, there won't be a millisecond dip in performance due to thread rescheduling, but the game will likely be running more slowly to begin with as you are making one or two cores do four cores' worth of work. Oh, and the way to gain peak performance in your scenario would be to have a CPU with a huge amount of cores and massive memory bandwidth, such that no thread ever has to share a core with any other thread or wait on memory access due to another thread at any point in time. That's pretty much the opposite of what you were initially implying, e.g. that a lower-core-count CPU and less program threading is better.

That is not what I was implying at all. I was pointing out that using more processor cores when they're not needed will do nothing to increase performance, and in some situations can actually have a negative impact. There is a rule in software design: you don't optimize when performance is already good enough, because at the end of it, you'll hurt performance more than you help.
 
And I like the BFS approach (now that you mentioned scheduling), even though it will never be in the main branch of the linux kernel, lol.

BFS is simple, but even the author admits it does not scale beyond 16 cores. For the lowest possible latency, though, it is king.

http://www.cs.unm.edu/~eschulte/data/bfs-v-cfs_groves-knockel-schulte.pdf

A good article that explains how O(1), CFS, and BFS work, and compares the results of CFS and BFS scheduling on Linux.
 
Nice read. Thanks for the link, gamerk.

And I've played a lot with kernel scheduling. Remember the famous 200-line patch? Well, with a lot of cores it does well and all, but I noticed the real impact was on 1- or 2-core CPUs. That's not even doing a re-compile of the programs (I used a game called OpenTTD to bench 😛).

Well, on the Windows side of things... optimizing for threads is a PITA. MS changes the conditions from one version to the next (not only scheduling-wise, but libs/DLL-wise), so if you want to "fine tune", you'll shoot yourself in the foot in the long run. It's sad to say, but I really hate fine-tuning for Windows. That's why I love Java XD! So yeah, I've been in your shoes as well, gamerk, and I really get what you mean.

Cheers!
 
Piledriver rumour hopes based on a 1% SemiAccurate story.
The lack of rumour opportunities has not kept MS away from being productive,
however, and there are a few treats in the pipeline for fans to look forward to.
XBOX 720 with Piledriver inside.
 
http://www.xbitlabs.com/news/cpu/display/20120117221436_AMD_Ups_Performance_Projections_for_Next_Gen_Trinity_APUs.html



AMD Ups Performance Projections for Next-Gen "Trinity" APUs.
AMD "On-Track" to Release Trinity in Mid-2012
[01/17/2012 10:14 PM]
by Anton Shilov
Advanced Micro Devices has increased performance projections for its next-generation code-named Trinity accelerated processing unit and set the launch timeframe for the chip. While AMD's Trinity will be faster than initially believed, it will become available only in the middle of the year, which is somewhat later than generally expected based on AMD's comments.

According to performance benchmarks conducted by AMD, the Trinity 35W APU with Piledriver-class x86 cores will provide 25% better x86 performance compared to Llano 35W (with K10.5+ "Husky" x86 cores) based on results obtained in PC Mark Vantage Productivity benchmark. AMD also claims that Trinity 35W will offer up to 50% better result in 3D Mark Vantage performance benchmark compared to Llano 35W.



Earlier released slides, also presumably from AMD, projected a 20% increase in x86 performance and a 30% boost in graphics performance for Trinity compared to currently available A-series "Llano" APUs, based on simulations.

Although AMD implied a number of times that it would release Trinity in the first half of the year, and sooner rather than later, the official plan now is to launch it in the middle of the year. Some unofficial sources have implied that the A-series "Trinity" will be released in June 2012.

"We are indeed on track with Trinity for mid-2012," said Chris Hook, a spokesman for AMD, without elaborating on actual months or dates.

According to documents seen by X-bit labs, starting from early to mid-March 2012, AMD intends to mass produce its desktop A-series "Trinity" accelerated processing units with a 65W thermal design power (TDP). In early May 2012, the chip designer wants to initiate mass production of desktop A-series "Trinity" APUs with a 100W TDP and higher performance.

AMD’s second-generation code-named Trinity APU for mainstream personal computers (Comal for notebooks and Virgo for desktops) will be made using 32nm SOI HKMG process technology at Globalfoundries. The APU will feature up to four x86 cores powered by enhanced Bulldozer/Piledriver architecture, AMD Radeon HD 7000-series "Southern Islands" graphics core with DirectX 11-class graphics support, DDR3 memory controller and other improvements. The chips will be compatible with new infrastructure.

 
i don't understand why there isn't any leak or rumor about trinity/piledriver's pcie support. will it support 2.0 or 3.0?
trinity will probably have the pcie controller built into the cpu like llano, so that's important information. it also matters for hybrid crossfire and vce. amd should allow lower-end 6000 series cards like the 6570-6670, or even the 6770, to cfx with trinity.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.

Trinity doesn't have l3 cache, for a fact; it's been known for quite a while now. l3 cache isn't THAT important, it doesn't affect performance that much.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.
I have to say that CES demo was quite nice to see. Seemed rather impressive.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.

L3 cache is only good when your processor's caching / prediction is really bad; it acts like a safety net before accessing system RAM. It's also the cheapest component to add to the die.
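A toy pointer-chase makes that "safety net" visible: as the working set outgrows each cache level, the time per access steps up, and on a part with no L3 the step after L2 goes straight to DRAM latency. The sizes and iteration counts below are arbitrary, just enough to show the cliffs.

// Toy pointer-chase over growing working sets to expose the cache hierarchy.
// Expect latency steps roughly at the L2 and L3 capacities of the CPU.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (size_t kb : {64, 256, 1024, 4096, 16384, 65536}) {
        size_t n = kb * 1024 / sizeof(size_t);
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), size_t(0));
        std::shuffle(next.begin(), next.end(), rng);   // random permutation to chase

        size_t idx = 0;
        const size_t iters = 10000000;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < iters; ++i)
            idx = next[idx];                           // each hop is a dependent load
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
        std::printf("%6zu KiB working set: %5.1f ns per access (idx=%zu)\n", kb, ns, idx);
    }
}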
 
L3 cache is only good when your processor's caching / prediction is really bad; it acts like a safety net before accessing system RAM. It's also the cheapest component to add to the die.

We all know BD's prediction is pretty bad. Llano used Athlon II-based cores with no L3, so I wonder if they are cutting the L3 cache to keep prices on Trinity low.

May not be a good thing. Might hurt it in the long run.
 