AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic; anything else will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame-baiting comments about the blue, red, and green teams and they will be deleted.

Enjoy ...
 
But again, you still need to be doing multiple things at the same time for threading to have any real impact. And very few heavy workloads can be easily broken up into smaller units that run in parallel, because most jobs end up being a series of sequential events.

For example, when we first got quads, the thought was that you could write a game with the audio, rendering, physics, and AI engines each running on a different CPU core. Sounds like a nice scheme, until you realise that the audio is almost totally dependent on the other three engines, and AI, physics, and rendering all interact with each other in some fashion, which limits how much of their workloads can be done without needing some form of synchronization [which can take a VERY long time if one of those threads is pre-empted by the OS, in which case your application is basically stuck in a waiting state. Thankfully, Windows typically doesn't pre-empt high-workload foreground application threads...].
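To make that synchronization point concrete, here's a rough C++ sketch (the names and the three-frame loop are made up purely for illustration, not taken from any real engine): the render thread can't touch a frame until the physics thread has published it, so if the physics thread gets pre-empted, the renderer just sits in cv.wait() doing nothing.

// Hypothetical sketch of the dependency problem described above.
// The render "engine" must wait for the physics "engine" every frame,
// so pre-empting the physics thread stalls both of them.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int frames_ready = 0;                  // guarded by m

void physics_engine() {
    for (int frame = 0; frame < 3; ++frame) {
        // ... simulate physics for this frame ...
        {
            std::lock_guard<std::mutex> lk(m);
            ++frames_ready;            // publish this frame's results
        }
        cv.notify_one();
    }
}

void render_engine() {
    for (int frame = 0; frame < 3; ++frame) {
        std::unique_lock<std::mutex> lk(m);
        // If the physics thread gets pre-empted (say, an AV scan lands on
        // its core), this wait is where the renderer sits burning time.
        cv.wait(lk, [frame] { return frames_ready > frame; });
        lk.unlock();
        // ... draw the frame ...
        std::printf("frame %d rendered\n", frame);
    }
}

int main() {
    std::thread physics(physics_engine);
    std::thread render(render_engine);
    physics.join();
    render.join();
}

The same pattern repeats for every cross-engine dependency, which is why the waits add up so quickly.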

Further, with more heavy-workload threads running, you usually require a LOT more memory access at any one point in time, which, in 32-bit .exes or low-memory systems, can result in a lot of paging, bringing HDD I/O waits into the equation. [Moving to native 64-bit .exes would help solve this problem, though...]. You also need to put a LOT more thought into what the CPU decides to cache in the L2, which can have a huge impact on performance [especially if there's a lot of paging going on].
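As a toy illustration of the caching point (nothing game-specific, and the matrix size is arbitrary): the two loops below do identical arithmetic, but the column-major walk keeps evicting cache lines before they get reused, so it typically runs several times slower.

// Toy illustration of how access order interacts with the cache.
// Both walks sum the same 4096x4096 matrix; only the traversal order differs.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;                                    // 64 MB of floats
    std::vector<float> a(static_cast<size_t>(N) * N, 1.0f);

    double sum = 0.0;
    auto run = [&](bool row_major) {
        sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += row_major ? a[static_cast<size_t>(i) * N + j]   // reuses each cache line
                                 : a[static_cast<size_t>(j) * N + i];  // 16 KB stride, cache-hostile
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    double ms_row = run(true);
    std::printf("row-major walk:    %.1f ms (sum=%g)\n", ms_row, sum);
    double ms_col = run(false);
    std::printf("column-major walk: %.1f ms (sum=%g)\n", ms_col, sum);
}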

Point being, the "simple" scheme of stuffing each game engine on a different core got very complex very quickly. And HEAVEN FORBID your AV decides now is a good time to start on the 4th core, in which case you've just tanked your application's performance. [Using more cores has the negative side effect of increasing the chances of a total application bottleneck if another application starts to execute on that same core, forcing your app into a waiting state. If your other threads are stuck waiting on synchronization, you are basically dead].

And this is BEFORE we even begin to consider how the OS is actually managing all these resources.

How would multithreading hurt performance in this situation? In your example, you have four game threads. Three of those threads are dependent on the first thread to execute. You have a fifth unrelated core-pegging thread start up on the same core that's running the first game thread. The 2nd-4th game threads immediately go into a wait state as the 1st thread lets the unrelated thread use the core. Your CPU utilization immediately drops to 25% as the 2nd-4th game threads are waiting for data from the first game thread. The OS sees three idle cores, one active core, and two threads wanting to run (the AV thread and the first game thread). It schedules the first game thread on one of the three other unused cores, and then the game resumes, with the three auxiliary game threads taking turns on the remaining two cores. This is how OS scheduling is accomplished by any remotely sane OS scheduler, which I believe even Windows has at the current time.

Also, in your example, making a poorly-threaded game won't result in more performance. Sure, there won't be a millisecond dip in performance due to thread rescheduling, but the game will likely be running more slowly to begin with as you are making one or two cores do four cores' worth of work. Oh, and the way to gain peak performance in your scenario would be to have a CPU with a huge amount of cores and massive memory bandwidth, such that no thread ever has to share a core with any other thread or wait on memory access due to another thread at any point in time. That's pretty much the opposite of what you were initially implying, e.g. that a lower-core-count CPU and less program threading is better.
 
I really doubt gamerk was saying "lower-core-count CPU and less program threading is better" TBH.

I'm sure you'll agree that this statement is somewhat true: "4 fast cores are better than 8 slow cores". You just need to find the balance in the mix for the 8 cores to do more than the 4 fast ones. And looking at real programs today, 4 (or even 2) very fast cores are better than 4+ slow ones at any given point of execution, no matter the scheduling approach the OS takes. It's hard to make a statement like that without thinking about re-compiling, re-tuning, or re-coding a program, but given programs as they are now, well, they're not that well threaded in most cases, except "professional" software (IMO all of it should be called that, heh).

And I like the BFS approach (now that you mentioned scheduling), even though it will never be in the main branch of the linux kernel, lol.

Anyway, great posts, MU.

Cheers!
 
The OS sees three idle cores, one active core, and two threads wanting to run (the AV thread and the first game thread). It schedules the first game thread on one of the three other unused cores, and then the game resumes, with the three auxiliary game threads taking turns on the remaining two cores. This is how OS scheduling is accomplished by any remotely sane OS scheduler, which I believe even Windows has at the current time.

As a programmer, I've learned the hard way never to assume the components some other programmer wrote were written in a safe, sane manner. In XP, the behavior I described is actually quite common. One of the big changes in Vista was giving priority to foreground applications, which you would think would have been done ages ago.

Windows uses a combination of priority and round-robin scheduling. The longer a thread waits, the higher its priority gets, and the highest-priority thread gets run. Aside from that, foreground tasks get a priority boost [as of Vista], and the OS has several high-level interrupts that pre-empt everything in the user domain. That's the Windows scheduler in a nutshell; MSFT knew a simple scheduler is the best possible implementation.
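For reference, here's a minimal Win32 sketch of the priority knobs that scheduler works with; the calls are standard Win32 APIs, but the particular levels chosen are just an example, not a recommendation.

// Minimal Win32 sketch of the scheduler's priority inputs.
// The chosen levels are purely illustrative.
#include <windows.h>
#include <cstdio>

int main() {
    // Raise the whole process one notch above NORMAL, roughly what a
    // foreground boost achieves for its threads.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    // Raise just this thread within the process's priority class.
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST))
        std::printf("SetThreadPriority failed: %lu\n", GetLastError());

    std::printf("process class: %lu, thread priority: %d\n",
                GetPriorityClass(GetCurrentProcess()),
                GetThreadPriority(GetCurrentThread()));
    return 0;
}

An AV that does something like this on its scan thread is exactly the situation described next.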

Back to the example: if the AV thread I described is launched as high priority (which I know a few are), then compared to your game (which is probably running at normal priority), even with the boost the game gets for being in the foreground, it will be roughly splitting work with the AV. That's a 50% performance penalty. And if that's the core that's running one of the main program threads, you've cut your maximum possible performance in half.

Now, if we assume the core that is running the audio engine is basically stuck at 5% usage (audio is NOT a heavy workload), then by using one fewer core and putting two engines (audio and one other) on the same core, you would have the 4th core free for the AV, which would not cause a negative performance hit for the game in question.
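A rough sketch of what that consolidation could look like with the Win32 affinity API (the engine bodies are stand-in sleeps, the core numbers are arbitrary, and this assumes MSVC's runtime, where std::thread::native_handle() is the Win32 HANDLE; real code would query the CPU topology first):

// Hypothetical sketch: pin two light "engine" threads to one logical core
// so another core stays completely free for whatever else the system runs.
#include <windows.h>
#include <chrono>
#include <thread>

int main() {
    // Stand-ins for the real engine loops, just to have threads to pin.
    auto audio_engine = [] { std::this_thread::sleep_for(std::chrono::seconds(1)); };
    auto ai_engine    = [] { std::this_thread::sleep_for(std::chrono::seconds(1)); };

    std::thread audio(audio_engine);
    std::thread ai(ai_engine);

    // Share logical core 2 between the two light engines, leaving core 3
    // entirely free for whatever else (an AV scan, say) decides to run.
    SetThreadAffinityMask(audio.native_handle(), DWORD_PTR(1) << 2);
    SetThreadAffinityMask(ai.native_handle(),    DWORD_PTR(1) << 2);

    audio.join();
    ai.join();
}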

Hence the debate between core balancing and maximizing total core usage.

Also, in your example, making a poorly-threaded game won't result in more performance. Sure, there won't be a millisecond dip in performance due to thread rescheduling, but the game will likely be running more slowly to begin with as you are making one or two cores do four cores' worth of work. Oh, and the way to gain peak performance in your scenario would be to have a CPU with a huge amount of cores and massive memory bandwidth, such that no thread ever has to share a core with any other thread or wait on memory access due to another thread at any point in time. That's pretty much the opposite of what you were initially implying, e.g. that a lower-core-count CPU and less program threading is better.

That is not what I was implying at all. I was pointing out that using more processor cores when they're not needed will do nothing to increase performance, and in some situations can actually have a negative impact. There is a rule in software design: you don't optimize when performance is already good enough, because at the end of it, you'll hurt performance more than you help.
 
And I like the BFS approach (now that you mentioned scheduling), even though it will never be in the main branch of the linux kernel, lol.

BFS is simple, but even the author admits it does not scale beyond 16 cores. For the lowest possible latency, though, it is king.

http://www.cs.unm.edu/~eschulte/data/bfs-v-cfs_groves-knockel-schulte.pdf

A good article that explains how O(1), CFS, and BFS work, and compares the results of CFS and BFS scheduling on Linux.
 
Nice read. Thanks for the link, gamerk.

And I've played a lot with kernel scheduling. Remember the famous 200-line patch? Well, with a lot of cores it does well and all, but I noticed the real impact was on 1- or 2-core CPUs. That's not even doing a re-compile of the programs (I used a game called OpenTTD to bench 😛).

Well, on the Windows side of things... optimizing for threads is a PITA. MS changes the conditions from one version to the next (not only scheduling-wise, but libs/DLL-wise), so if you want to "fine tune", you'll shoot yourself in the foot in the long run. It's sad to say, but I really hate fine-tuning for Windows. That's why I love Java XD! So yeah, I've been in your shoes as well, gamerk, and I really get what you mean.

Cheers!
 
Piledriver rumour hopes based on a 1% SemiAccurate story.
The lack of rumour opportunities has not kept MS away from being productive,
however, and there are a few treats in the pipeline for fans to look forward to.
XBOX 720 with Piledriver inside.
 
http://www.xbitlabs.com/news/cpu/display/20120117221436_AMD_Ups_Performance_Projections_for_Next_Gen_Trinity_APUs.html



AMD Ups Performance Projections for Next-Gen "Trinity" APUs.
AMD "On-Track" to Release Trinity in Mid-2012
[01/17/2012 10:14 PM]
by Anton Shilov
Advanced Micro Devices has increased performance projections for its next-generation code-named Trinity accelerated processing unit and set the launch timeframe for the chip. While AMD's Trinity will be faster than initially believed, it will become available only in the middle of the year, which is somewhat later than generally expected based on AMD's comments.

According to performance benchmarks conducted by AMD, the Trinity 35W APU with Piledriver-class x86 cores will provide 25% better x86 performance compared to Llano 35W (with K10.5+ "Husky" x86 cores) based on results obtained in PC Mark Vantage Productivity benchmark. AMD also claims that Trinity 35W will offer up to 50% better result in 3D Mark Vantage performance benchmark compared to Llano 35W.



Earlier released slides, also presumably from AMD, projected a 20% increase in x86 performance and a 30% boost in graphics performance for Trinity compared to currently available A-series "Llano" APUs, based on simulations.

Although AMD implied a number of times that it would release Trinity in the first half of the year, and sooner rather than later, the official plan now is to launch it in the middle of the year. Some unofficial sources have implied that the A-series "Trinity" will be released in June 2012.

"We are indeed on track with Trinity for mid-2012," said Chris Hook, a spokesman for AMD, without elaborating on actual months or dates.

According to documents seen by X-bit labs, starting from early to mid-March 2012, AMD intends to mass produce its desktop A-series "Trinity" accelerated processing units with a 65W thermal design power (TDP). In early May 2012, the chip designer wants to initiate mass production of desktop A-series "Trinity" APUs with a 100W TDP and higher performance.

AMD’s second-generation code-named Trinity APU for mainstream personal computers (Comal for notebooks and Virgo for desktops) will be made using 32nm SOI HKMG process technology at Globalfoundries. The APU will feature up to four x86 cores powered by enhanced Bulldozer/Piledriver architecture, AMD Radeon HD 7000-series "Southern Islands" graphics core with DirectX 11-class graphics support, DDR3 memory controller and other improvements. The chips will be compatible with new infrastructure.

 
i don't understand why there isn't any leak or rumor about trinity/piledriver's pcie support. will it support 2.0 or 3.0?
trinity will probably have the pcie controller built into the cpu like llano, so that's important information. it also matters for hybrid crossfire and vce. amd should allow lower-end 6000 series cards like the 6570-6670, or even the 6770, to cfx with trinity.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.

Trinity doesn't have l3 cache, for a fact; it's been known for quite a while now. l3 cache isn't THAT important, it doesn't affect performance that much.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.
I have to say that CES demo was quite nice to see. Seemed rather impressive.
 
trinity, like llano, might not have l3 cache. that could hurt its performance against a core i5.
trinity's piledriver core will have a bd-like core design - not like a traditional quad core, e.g. a ph ii x4 980 or i5 2500k.
its igp, otoh, seems quite impressive from the rumors and that ces demo.

L3 cache is only good when your processor's caching / prediction is really bad; it acts like a safety net before accessing system RAM. It's also the cheapest component to add to the die.
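A toy pointer-chase makes that "safety net" visible: as the working set outgrows each cache level, the time per access steps up, and on a part with no L3 the step after L2 goes straight to DRAM latency. The sizes and iteration counts below are arbitrary, just enough to show the cliffs.

// Toy pointer-chase over growing working sets to expose the cache hierarchy.
// Expect latency steps roughly at the L2 and L3 capacities of the CPU.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (size_t kb : {64, 256, 1024, 4096, 16384, 65536}) {
        size_t n = kb * 1024 / sizeof(size_t);
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), size_t(0));
        std::shuffle(next.begin(), next.end(), rng);   // random permutation to chase

        size_t idx = 0;
        const size_t iters = 10000000;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < iters; ++i)
            idx = next[idx];                           // each hop is a dependent load
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
        std::printf("%6zu KiB working set: %5.1f ns per access (idx=%zu)\n", kb, ns, idx);
    }
}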
 
L3 cache is only good when your processor's caching / prediction is really bad; it acts like a safety net before accessing system RAM. It's also the cheapest component to add to the die.

We all know BD's prediction is pretty bad. Llano used Athlon II-based cores with no L3, so I wonder if they are cutting the L3 cache to keep prices on Trinity low.

May not be a good thing. Might hurt it in the long run.
 