AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
http://www.xbitlabs.com/news/cpu/display/20120111193048_AMD_Demos_17W_Trinity_Accelerated_Processing_Unit.html

In case we missed it.

Cheers!



Yeah, I watched the video (which I can't find now). I'm going to buy the best APU for my new laptop, and I hope Trinity is out before Q4 2012. It should be out on time, since AMD keeps demoing it.
 
^^ Here's the problem: the thread is your lowest unit of execution, and most of the time you only have one or two threads doing any real heavy lifting at once. That limits how well you can scale your software. I've always seen more cores as a way to run more APPLICATIONS at one time, but I simply do not see software scaling much beyond four cores, except in some very special cases. Today's games may have 80+ threads running, but only a handful are doing any real work.

Hence my primary argument for why BD was destined to fail: for the kind of software you see on a PC, it's simply far too difficult to scale efficiently. Once performance is "good enough", developers stop trying to optimize further, because it's a waste of time and money. At the end of the day, very few people (if any) buy a game because it spits out 120 FPS on a 460; they buy it because they want to play it.
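As a rough back-of-the-envelope illustration (my own numbers, not a measurement of any real engine): Amdahl's law puts a hard ceiling on how much extra cores can help once the serial part dominates. If only 40% of the per-frame work can be spread across cores, even unlimited cores top out at about 1.7x.

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of
   the work that can run in parallel and n is the core count. The p = 0.4
   below is an illustrative guess, not a measured figure for any real game. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    const double p = 0.4;                    /* assumed parallel fraction */
    const int cores[] = { 2, 4, 8, 16, 64 };

    for (int i = 0; i < (int)(sizeof cores / sizeof cores[0]); i++)
        printf("%2d cores -> %.2fx speedup\n", cores[i], amdahl(p, cores[i]));
    return 0;
}

Going from 4 to 64 cores in that example buys you barely 0.2x more, which is why "good enough" kicks in so early.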

That being said, I consider the next leap forward in gaming to be implementing real-time physics effects into games, which almost requires a dedicated GPGPU. [Multiple-object dynamic physics equations get very complicated very fast]. For example, instead of set values for bullet damage, I want the effects computed in real-time based on distance, velocity, armor, etc.


This is very, very true, and the reason I hate "benchmarking" with a single app. You can only parallelize so much before you're required to radically change the architecture of your game. More cores mean you can have more things going at once without two heavy apps being forced to share CPU resources. This becomes blatantly obvious when you start engineering systems using virtualization: running 5~6 VMs on a platform with eight to twelve cores, you quickly see the benefit of having so many independent processor resources.

BD's uarch was pretty solid in concept: most operations are integer ops / logic compares or memory operations, and actual FPU ops are pretty rare as a percentage. They stood to save a lot of die real estate by sharing one big FPU and coherent MMUs between cores while keeping separate ALUs. They screwed up in the execution of this idea: the cache has way too high a latency, coupled with mispredicts that only emphasize that latency. If they could find a way to fix the latency issue and get better prediction rates and data preload rates, then we would see a significant (25~30% would be my WAG) jump in performance per cycle. After all, pretty much every ALU in the world is the same; there are only so many ways to add 0 and 1 together.
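To get a feel for how much mispredicts hurt (a toy micro-benchmark of my own, nothing AMD-specific): the same loop over the same data can run several times slower when the branch is unpredictable, because every mispredict throws away in-flight work and the refill has to come back through the cache, which is exactly where high cache latency compounds the damage.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)

/* Sum only the "large" values; the if() is the branch the predictor has to guess. */
static long sum_if_big(const int *v, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        if (v[i] >= 128)
            sum += v[i];
    return sum;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    int *v = malloc(N * sizeof *v);
    for (long i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long a = sum_if_big(v, N);           /* random order: mispredicts constantly */
    clock_t t1 = clock();

    qsort(v, N, sizeof *v, cmp_int);     /* sorted: the branch becomes almost perfectly predictable */
    clock_t t2 = clock();
    long b = sum_if_big(v, N);
    clock_t t3 = clock();

    printf("unsorted: %ld in %.2fs, sorted: %ld in %.2fs\n",
           a, (double)(t1 - t0) / CLOCKS_PER_SEC,
           b, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}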

GPGPU for physics would be easy to implement once they got their algorithms worked out. Right now they're being slow about it because the consoles don't support it.

@Viridian,

A GPU will never be as good as a CPU at generic integer ops / logic compares; they'll always be a few orders of magnitude behind. It has to do with VLIW operations in general, and no amount of secret magic sauce will fix it. Intel found this out the hard way when they tried to have Itanium compete with UltraSPARC and IBM POWER. In a VLIW CPU you execute multiple instructions simultaneously, but those instructions cannot be co-dependent. If one instruction has an operand that depends on another, it must wait until the previous one is finished before it can be executed. With VLIW setups, instructions are not evaluated for dependencies until they're loaded for processing, so dependent instructions have their results discarded and are run again. Basically, you can't do branch prediction or out-of-order execution very effectively with a VLIW uarch.

What a VLIW uarch can do very well is process large amounts of non-dependent calculations: basically vector math. That's perfectly fine for GPUs, since they're processing pixel elements and vertices, which all require independent calculations with the final result being a rendered image; they typically render one pixel per "thread". A 1920 x 1080 resolution is 2,073,600 pixels per frame; at 60 fps that's 124,416,000 pixels that need to be rendered per second, and that's before considering AA or shaders being applied. 124 million separate vector calculations per second would be nearly impossible on a generic processor but easy on a vector processor with 200~500 separate processing engines (aka cores).

This is also why memory access seems so much more important to a GPU than a CPU. CPUs have ultra-fast cache inside them that typically holds all the data they're going to need for that moment in time. A GPU's massive number of independent operations requires it to process very large amounts of data, data that won't even remotely fit into cache.
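A tiny sketch of that distinction (my own toy code, not anything resembling real shader or driver code): the per-pixel loop below has no dependency between iterations, so it maps naturally onto hundreds of GPU lanes, while the second loop carries its result forward and gains nothing from extra cores.

#include <stdio.h>

#define W 1920
#define H 1080

static unsigned char frame[W * H];

int main(void)
{
    /* Independent work: each pixel is computed from its own coordinates only,
       so all 2,073,600 iterations could in principle run at the same time. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            frame[y * W + x] = (unsigned char)((x * x + y * y) & 0xFF);

    /* Dependent work: every step needs the previous result, so it is stuck
       running one step at a time no matter how many cores you throw at it. */
    unsigned long acc = 1;
    for (int i = 0; i < W * H; i++)
        acc = acc * 31 + frame[i];

    printf("checksum %lu over %d pixels\n", acc, W * H);
    return 0;
}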
 

Agreed on all of this. I already suspect VMs are going to be the future, which, if done right, solves the issue of software incompatibility across architectures.
 

Like I said, I am not a programmer. I have it set in my mind that you can program software to be processed in a certain way. For instance, in a game you can't just tell the processor: render this polygon on this core, that polygon on the next core (if available), and repeat. There is much more than just polygons happening, which is why it takes so long for software to be completed.

One day I want to make my own indie game, so maybe I'll find out by then, but it seems fairly simple to someone who has no clue how software is made.
 
As someone who's written two separate OpenGL game engines, a lot of it is actually very linear processing. There are surprisingly few ways to use more cores effectively. Most implementations I've seen simply offload a few key threads [AI control, main rendering thread, etc.] to available cores, and that's it. This works well for 2-4 cores or so, but after that, you've run out of ways to use more cores effectively.
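In code, that "offload a few key threads" pattern looks roughly like the sketch below (hypothetical subsystem names, nothing from a real engine; compile with -pthread). It also shows why the approach stops paying off: once each subsystem has its own core, there's nothing left to hand out.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One worker thread per major subsystem; the main thread stays the render loop. */

static void *ai_update(void *arg)
{
    (void)arg;
    for (int tick = 0; tick < 5; tick++) {
        printf("AI tick %d\n", tick);        /* pathfinding, decisions, etc. */
        usleep(10 * 1000);
    }
    return NULL;
}

static void *audio_mix(void *arg)
{
    (void)arg;
    for (int block = 0; block < 5; block++) {
        printf("audio block %d\n", block);   /* mix the next audio block */
        usleep(10 * 1000);
    }
    return NULL;
}

int main(void)
{
    pthread_t ai, audio;
    pthread_create(&ai, NULL, ai_update, NULL);
    pthread_create(&audio, NULL, audio_mix, NULL);

    for (int frame = 0; frame < 5; frame++) {
        printf("render frame %d\n", frame);  /* main thread keeps rendering */
        usleep(10 * 1000);
    }

    pthread_join(ai, NULL);
    pthread_join(audio, NULL);
    return 0;
}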

Secondly, a lot of the key processing is linear in nature; there is a clear order in which things must be processed. This very much limits how much further you can parallelize. You also need to worry about inter-thread communication, which puts a screeching halt to any attempt to parallelize effectively. [For example, the AI depends to some extent on the outcome of the rasterization formula; the AI needs to know if there's a solid object in its way, after all.]

Finally, after years of coding in a few dozen languages, I've learned the hard way that the programmer will never be smarter than a decent optimizing compiler. Way back when, we had the "register" directive in languages like C and Pascal, which would force the compiler to put some variable in a register for quicker access [the idea being to avoid doing lots of memory I/O on frequently used variables]. In practice, the compiler is smarter than the programmer 99% of the time, and using this directive almost always hurt performance [because you now had one fewer register available, which often carried a greater penalty than the memory I/O it saved]. I view core usage the exact same way: the compiler and OS scheduler will be a heck of a lot smarter than the programmer when it comes to low-level resource usage.
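For anyone who hasn't seen it, the hint in question looks like this in C (where it survives as a keyword the compiler is free to ignore; C++17 dropped it outright). This is just an illustrative toy function of my own, and a modern compiler will allocate registers at least this well on its own with optimization enabled.

#include <stdio.h>

static long sum_to(long n)
{
    register long total = 0;               /* "please keep this in a register" */
    for (register long i = 1; i <= n; i++) /* the compiler decides either way */
        total += i;
    return total;
}

int main(void)
{
    printf("%ld\n", sum_to(10000));        /* prints 50005000 */
    return 0;
}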

Point being, games are never going to scale much beyond four cores. I view more cores simply as a way to run more tasks in the background, which will allow for more powerful OSes down the road. But for a single application, adding more cores is not going to affect performance whatsoever, because those cores aren't going to be used.
 

Very nice explanation, thank you.

So that begs the question: wasn't there a single manager or executive at AMD who knew this, spoke up, and said that Bulldozer wasn't going to work out?
 
It just goes back to Bulldozer being a server-oriented concept that AMD tried to wedge into an enthusiast product.

It most likely would have worked if they had been able to keep single-thread performance consistent with Phenom II's. Unfortunately it appears that they dropped the ball on cache performance and/or prediction and IPC took a sizable hit.

If they can get IPC back to Phenom II levels with PD then they may have a decently competitive CPU again.
 
Damn... so with Llano and Bulldozer, AMD took the first step in splitting up their CPU lineup into 'mainstream entry-level' and 'enthusiast' segments. The next step is with Trinity and Piledriver. Later, the 'enthusiast' segment might be phased out in favor of more powerful APUs, or maybe switch to SoC-type APUs and APU-type CPUs (CPU + IMC + PCIe controller without an iGPU). Just speculating.
 

As to why nobody at AMD spoke up: because they needed to recoup investment costs. Even if it bombs, they at least make some money back on it. And for server-based tasks, BD is a very well designed architecture [again, because multiple applications are running at once, which favors massively multi-core architectures].
 
UPU:
news.softpedia.com/news/For-the-First-Time-In-Decades-a-New-CPU-Architecture-Appears-246302.shtml
vr-zone.com/articles/icube-upu-the-next-step-in-processor-evolution-/14518.html
As you run more and more graphics, you get fewer and fewer free CPU cycles, woohoo... Interesting concept, though; I just don't see it working well beyond being very cheap to produce.

Secondly, a lot of the key processing is linear in nature; there is a clear order in which things must be processed.

One of these days programmers will discover a way to move beyond the Turing computing model. It will probably take an Albert Einstein to go and start from scratch, and it may be 20+ years from now.

Until then, we are stuck with this, primarily because it's cheaper to edit existing code than to start over. In short, to get true multi-core computing at the program level, an entirely new OS would have to be made, which would mean no backward compatibility, and Microsoft won't be doing that any time soon.
 
Now that core counts are stabilizing at quad core, it's time to bump the turbo modes. Down the road these need to be a 2x speedup or more; 400 MHz to 1 GHz bumps aren't enough.

May need some OS help to move the boosted core around periodically and even out the CPU die temperature. Gets complicated as you're bumping voltages and clocks to make the turbo work.
 
Yes:
http://blogs.amd.com/play/2011/10/13/our-take-on-amd-fx/

I wouldn't call BD a failure, but rather a disappointment for some (if not most). BD needs work, but I try not to give up ALL hope on AMD.


I'm an AMD and an Intel fan.
AMD kinda failed with Bulldozer.
The 8-core Bulldozer vs. the AMD Phenom II X4 955 (keep in mind this is a 4-core processor):
http://www.anandtech.com/bench/Product/434?vs=88
Then Bulldozer vs. the i5-2500K (keep in mind this is also a 4-core processor):
http://www.anandtech.com/bench/Product/434?vs=288
 