AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post questions or information relevant to the topic; anything else will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
The thing is, though, the article did not name any specific architecture. It just said that GF was having yield issues with their 32nm process technology. Ergo, the entire lineup of archs from AMD is subject to these yield issues. It would make sense, considering that obtaining these brand new Bulldozers is as hard as finding a dodo in the wild. Almost non-existent for my work, whereas when SB hit, we had them day one.

As for UVD, I will believe it when I see it. QuickSync is quite a feat; Intel really hit nVidia and AMD hard there. I'm not saying they won't be able to make something similar, but performance is the only thing I want to see there.

If Trinity has the simpler VLIW4, then it's just as I said: based on the HD6K series. That means the only way to improve performance at all is either higher GPU clocks or more SPUs. I still doubt Trinity will utilize the new MIMD design like the HD7970/7950 will.

And still, the Windows 7 excuse is overused. It makes no sense for AMD, who has been struggling and not competing for almost 5 years, to design an arch and not work with Microsoft to make sure it will work with 7 to give the best performance. That's just insane. I understand designing to make it last, but that's different than designing an arch that can't properly be utilized by the OS that people are currently using.


Please, Windows 7 was released before Orochi taped out. How could they possibly optimize Win 7? Trinity was never supposed to get Radeon Next. I actually wouldn't be surprised if Opteron got the first "real" GPGPU.

 
Because you stated in several posts that "AMD and GF are just fine together". As you can see from Read's statement and finger-pointing, not so..

You're kind of a trouble maker.

Well, geez - your title says "community reporter" -- not 'AMD spin control'. If expecting a modicum of credibility from a CR is making trouble, then so be it.

I don't remember word of them losing their Board seats, but I do have to work.

IOW, you have time to post, but no time to actually check facts first..

At any rate, I WANT PD/Trinity pushed back to 2H 2012.


Fixed it for you.
 
And still, the Windows 7 excuse is overused. It makes no sense for AMD, who has been struggling and not competing for almost 5 years, to design an arch and not work with Microsoft to make sure it will work with 7 to give the best performance. That's just insane. I understand designing to make it last, but that's different than designing an arch that can't properly be utilized by the OS that people are currently using.

I'll make it even simpler: If AMD is telling the OS that a 4-module BD is really 8 cores, they have no one to blame but themselves if the scheduler assumes all 8 cores are created equal.

A really quick fix could be done via a BIOS update: simply label the second core of a BD module as a logical processor. The scheduler would then treat those cores similarly to how it already handles Hyper-Threaded cores, helping to avoid the issue.
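To put that in rough pseudo-scheduler form (just an illustrative sketch of a hypothetical module-aware policy, not what AMD or Microsoft actually does): fill one core per module first, then fall back to the shared second cores, the same way an HT-aware scheduler spreads work across physical cores before using logical ones.

```python
# Minimal sketch of a module-aware placement policy (hypothetical illustration only).
MODULES = 4  # a 4-module / "8-core" Bulldozer, as discussed above

def pick_core(busy):
    """busy is a set of (module, core) pairs already running a thread."""
    # First pass: find a module with both cores idle and use its primary core.
    for m in range(MODULES):
        if (m, 0) not in busy and (m, 1) not in busy:
            return (m, 0)
    # Second pass: fall back to the shared second core of an already-busy module.
    for m in range(MODULES):
        for c in (0, 1):
            if (m, c) not in busy:
                return (m, c)
    return None  # everything is busy

busy = set()
for _ in range(6):            # schedule six ready threads
    core = pick_core(busy)
    busy.add(core)
    print("thread ->", core)  # the first four land on separate modules
```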

I'm still a little in the dark on why the eight cores cannot behave like 8 separate cores (and with a hypothetical AMD version of hyperthreading, 16 THREADS, WOW!! :lol: ); is it not really feasible (from a cost standpoint, mainly) with the current production techniques to have eight distinct cores on one CPU die, functioning as eight separate cores? Is the "module" design necessary to conserve space on the die, as well as avoid power management issues or something? Yeah, yeah, I should google "processor fabrication" or something before I start asking dumb questions, but I figured with so many diverse users right here, it is a good place to start.
 
I'm still a little in the dark on why the eight cores cannot behave like 8 separate cores (and with a hypothetical AMD version of hyperthreading, 16 THREADS, WOW!! :lol: ); is it not really feasible (from a cost standpoint, mainly) with the current production techniques to have eight distinct cores on one CPU die, functioning as eight separate cores? Is the "module" design necessary to conserve space on the die, as well as avoid power management issues or something? Yeah, yeah, I should google "processor fabrication" or something before I start asking dumb questions, but I figured with so many diverse users right here, it is a good place to start.

There are some good articles on BD's problematic design here and on the other thread http://www.tomshardware.com/forum/315700-28-bulldozer-considered-failure.

For one thing, the two cores within each module share the front end decoders, so effectively BD has only a 2-issue decoder for each core when both are active. This is a step backwards from Phenom's 3-issue decoder. Intel uses a 4-issue decoder for each core, but for some reason - possibly scheduling - doesn't seem to take as much of a hit using SMT.

There are other reasons as well, such as some huge cache latencies, that cause BD to perform below expectations. Whether these are fixable in a new stepping remains to be seen. I don't think the shared front end can be fixed without a major redesign, however.
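Rough back-of-envelope on that sharing penalty (assuming, as the "2-issue per core with both cores active" figure implies, a 4-wide front end shared per module; illustrative numbers only, not measurements):

```python
# Effective decode width per core when a front end is shared (illustrative sketch).
def per_core_decode(front_end_width, cores_sharing_it):
    return front_end_width / cores_sharing_it

print(per_core_decode(4, 2))  # BD module, both cores busy -> 2.0 wide per core
print(per_core_decode(4, 1))  # BD module, one core busy   -> 4.0 wide per core
print(per_core_decode(3, 1))  # Phenom II: 3-wide decoder, unshared
print(per_core_decode(4, 1))  # Sandy Bridge: 4-wide per core (SMT shares it differently)
```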
 
I'm still a little in the dark on why the eight cores cannot behave like 8 separate cores (and with a hypothetical AMD version of hyperthreading, 16 THREADS, WOW!! :lol: ); is it not really feasible (from a cost standpoint, mainly) with the current production techniques to have eight distinct cores on one CPU die, functioning as eight separate cores? Is the "module" design necessary to conserve space on the die, as well as avoid power management issues or something? Yeah, yeah, I should google "processor fabrication" or something before I start asking dumb questions, but I figured with so many diverse users right here, it is a good place to start.
The idea is power/perf.
The fewer transistors you have, ideally, the less power draw.
CMT brings 180% perf with 2 cores, HT 120% or so.
The die size increases show CMT to be a worthy approach to get this extra perf with minimal die space and power usage.
With 8 cores using CMT as AMD does, you'd get 6.4 cores' worth of performance, at best, possibly more, but those circumstances aren't typical.
Same with HT, as Intel gets around 120% per core, but no one has an 8-core monster out there for DT, as it's asking too much at 32nm.
You'd get great perf, but power would go through or beyond currently acceptable limits.
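For what it's worth, here's the quick arithmetic behind those scaling figures (a sketch using the thread's rough 180%/120% numbers and the 6.4-core estimate, not benchmark data):

```python
# Illustrative "full-core equivalent" arithmetic from the figures quoted in this thread.
modules = 4                 # an "8-core" Bulldozer = 4 modules x 2 CMT cores
cmt_pair_scaling = 1.8      # two CMT cores ~ 180% of one full core
ht_core_scaling = 1.2       # one core + Hyper-Threading ~ 120% of one core

print(modules * cmt_pair_scaling)  # ~7.2 full-core equivalents from the 180%-per-module figure
print(8 * 0.8)                     # 6.4, the estimate above (each core counted as ~80% of a full core)
print(4 * ht_core_scaling)         # ~4.8 for a hypothetical 4-core + HT chip
```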
 
There are some good articles on BD's problematic design here and on the other thread http://www.tomshardware.com/forum/315700-28-bulldozer-considered-failure.

For one thing, the two cores within each module share the front end decoders, so effectively BD has only a 2-issue decoder for each core when both are active. This is a step backwards from Phenom's 3-issue decoder. Intel uses a 4-issue decoder for each core, but for some reason - possibly scheduling - doesn't seem to take as much of a hit using SMT.

There are other reasons as well, such as some huge cache latencies, that cause BD to perform below expectations. Whether these are fixable in a new stepping remains to be seen. I don't think the shared front end can be fixed without a major redesign, however.

Thanks for the link, except it comes up as an "error 404" page when I click on it, strange. So, AMD was just trying to invent a clever alternative to hyperthreading with their "two cores on one module, sharing resources", except different, and supposedly better? If the result is that it behaves as 8 quasi/virtual cores at best, with the perfect OS scheduling, then isn't it kind of false advertising that they claim "AMD FX 8-Core Processor" on the box?
 
The idea is power/perf.
The fewer transistors you have, ideally, the less power draw.
CMT brings 180% perf with 2 cores, HT 120% or so.
The die size increases show CMT to be a worthy approach to get this extra perf with minimal die space and power usage.
With 8 cores using CMT as AMD does, you'd get 6.4 cores' worth of performance, at best, possibly more, but those circumstances aren't typical.
Same with HT, as Intel gets around 120% per core, but no one has an 8-core monster out there for DT, as it's asking too much at 32nm.
You'd get great perf, but power would go through or beyond currently acceptable limits.


I see, so power management and cost have a lot to do with it. Well, I suppose the onus is on AMD to figure out how to adjust the shared front-end resource management with new steppings, or Piledriver, to get back in the ring with Intel.
 
Yes, but the questions remain: how deep, or how bad, are these problems that run in the silicon?
There appear to be several, with possibly some fixes easy to do, but at what perf gains once fixed, no one knows, except AMD hopefully.
There may be other, major fixes which may not allow for a better overall fix, at least within a certain timeline.
The jury's still out, and time is a-tickin'.
 
CMT brings 180% perf with 2 cores, HT 120% or so.

But that '180%' for CMT is based on a low-IPC BD core. Didn't somebody here state recently that a BD module is about 120% of a PII core? In contrast, that '120%' for HT is based on a high-IPC core, which means the increase is more significant.

The die size increases show CMT to be a worthy approach to get this extra perf with minimal die space and power usage.

That sentence sounds really good, except for the words after "The" 😀.

BD uses 315mm^2 on 32nm vs. SB's 216mm^2 on 32nm. BD's TDP is what - 130W while SB's is 95W. Not exactly a ringing endorsement for CMT, despite AMD's wishing we all would conveniently ignore the numbers and believe otherwise.
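Putting those quoted figures side by side (just the ratios from the numbers in this post, nothing more):

```python
# Die-area and TDP ratios from the 32nm figures quoted above (illustrative only).
bd_area, sb_area = 315.0, 216.0   # mm^2
bd_tdp,  sb_tdp  = 130.0, 95.0    # watts

print(round(bd_area / sb_area, 2))  # ~1.46x the die area
print(round(bd_tdp / sb_tdp, 2))    # ~1.37x the TDP
```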

 
And you can explain this?
No one, except again, AMD can.
And yes, the scaling is incredibly good, so the starting point is increased as threads go up.
Now, since BD is whatever it is, PD will have some better IPC support, whether through fixes or clocks.
As GF's process matures, AMD will make more headway than they have for a while, again, especially with many things to correct, and the obvious scaling, which HT is poor at.
 
There are some good articles on BD's problematic design here and on the other thread http://www.tomshardware.com/forum/315700-28-bulldozer-considered-failure.

For one thing, the two cores within each module share the front end decoders, so effectively BD has only a 2-issue decoder for each core when both are active. This is a step backwards from Phenom's 3-issue decoder. Intel uses a 4-issue decoder for each core, but for some reason - possibly scheduling - doesn't seem to take as much of a hit using SMT.

There are other reasons as well, such as some huge cache latencies, that cause BD to perform below expectations. Whether these are fixable in a new stepping remains to be seen. I don't think the shared front end can be fixed without a major redesign, however.


Can you say SUPERSCALAR OPERATION? Nearly every modern CPU is designed to do more in one cycle. Some may even reach near two executions per cycle.

The FE should have two ports. If not, THAT would be the issue, as one port would have to achieve 2 executions per cycle (fetch, decode, etc.). I would think they're smarter than that. Plus, the majority of operations are memory operations, so by decoupling the AGU/ALU they don't have switching penalties and need less scratch space to maintain the last state (load/store or execute). This is one reason why different parts of the CPU run at different speeds.

Win 8 sees latency go down by A LOT. (Check out the PCStats link from Chad.) Of course, it would be better if it was ready for Win 7, or vice versa, but for heavy everyday use, FX can't be beat. There are places where it even beats 12 threads (990X).
 
Yes, but the questions remain: how deep, or how bad, are these problems that run in the silicon?
There appear to be several, with possibly some fixes easy to do, but at what perf gains once fixed, no one knows, except AMD hopefully.
There may be other, major fixes which may not allow for a better overall fix, at least within a certain timeline.
The jury's still out, and time is a-tickin'.


It's mainly from SW not understanding the module paradigm. How can program X do it if Windows can't? It's still in the Best Sellers on Newegg, though, so not everyone agrees with you.
 
OK, I might accept that, but if that were the case, from what I've seen, where optimization has occurred, its IPC is still stunted.
Which goes back to a HW problem first.
Someone needs to run enough tests to reason out the slowdowns, why they're there, and define them.
Process may play a part in this, but what of all those transistors? Just what are they doing? That's a lot of silicon, and perf per area is extremely lacking, no matter how good you make that SW, from everything I've seen.
 
And you can explain this?
No one, except again, AMD can.
And yes, the scaling is incredibly good, so the starting point is increased as threads go up.
Now, since BD is whatever it is, PD will have some better IPC support, whether through fixes or clocks.
As GF's process matures, AMD will make more headway than they have for a while, again, especially with many things to correct, and the obvious scaling, which HT is poor at.

Heh, the point I was making is that CMT may or may not be good - the jury is still out on that - but Bulldozer is far from being an exemplary showcase of CMT's advantages. I was thinking AMD's approach was very interesting and wanted to see how well BD did with threading. Guess I'll wait and see if PD proves its merits.
 
I understand, and if the scaling we see, which is excellent, costs this much in silicon, there'll be no fix for PD besides the perf side only; it still doesn't explain the area/die size.

I fully expect Intel to do this very thing, and some will say they at least did it right, but yes, CMT is the right direction.

As in per die size, this isn't your older brother's 4870.
 
So, AMD was just trying to invent a clever alternative to hyperthreading with their "two cores on one module, sharing resources", except different, and supposedly better? If the result is that it behaves as 8 quasi/virtual cores at best, with the perfect OS scheduling, then isn't it kind of false advertising that they claim "AMD FX 8-Core Processor" on the box?

I guess since AMD figures 2 partial cores under CMT is something like 80% of the throughput of two full cores, then due to rounding errors they can claim it as 2 cores per module 😀.

CMT actually looks like a promising idea. However IMO it remains merely a promising idea as BD failed to demonstrate much in the way of advantages. I'll see if PD can improve upon it.
 
Please, Windows 7 was released before Orochi taped out. How could they possibly optimize Win 7? Trinity was never supposed to get Radeon Next. I actually wouldn't be surprised if Opteron got the first "real" GPGPU.

Your negativity is so thick you can cut it. I'm bailing.

There is no negativity. I am just looking at the facts. The fact is that AMD designed a CPU that could not be optimized for, or rather they changed it too many times and pushed it back too many times to allow MS the time to optimize 7 for it.

Trinity will be HD6K based, and that means, as I said, that the performance increase will be via higher clock speeds and more SPUs. That's not negative, just factual. The only problem is people will see "Radeon HD 7000" in the specs and expect it to be based on the next-gen HD7K series, which it will not be, nor will any of the HD7X00s below the HD7950/7970.

I think it would benefit AMD to keep the naming similar to the arch it's based on rather than just going with the current gen. It confuses normal people who cannot understand the technical specifications the way we do.

And if you wish to bail instead of discuss it, by all means go for it. A discussion is just this; we present our theories/ideas and then try to back them up with factual data. If you cannot continue to do so without getting angry then it is pointless for you, no?

So they optimised for Win 8 instead? :pt1cable:

With Win 7 being Vista done right, they could have optimised for Vista and been much better placed.

I guess there has to be a reason for it. In this world we have, if it was Intel doing the same thing, I would have been saying the same thing. You create an arch and make sure it works well with the OS that's current. If MS hadn't optimized 8 for BD - and we have yet to see if the performance difference is as major as some say it is, or if it's the small amount currently showing - then BD would be stuck for another 3 years on an OS not optimized for it. But it would then be Windows 8's fault, not AMD's.

But there is the kicker. Even when Intel comes out with something new, they work very closely with software devs to make sure it works to its peak performance. AMD needs to get out there more with software devs. Yes, it's money and time, but you have to spend money to make money. Intel does it every year on R&D, and so far, for 5 years, they have done very well. AMD did it with ATI, but that hasn't been a good investment, yet. If the HD79XX series can push Kepler - I am optimistic about it, hence why I want an HD7970 - then maybe it will change.

No. MS optimized Win 8. There are a lot of differences between Vista and Win 7 kernels so there is no guarantee that things will work exactly the same.

The 7 kernel is just a highly optimized Vista kernel. That's why you can "upgrade" from Vista to 7, although I never would, nor would I recommend it. On the other hand, the XP and Vista/7 kernels are very different, and that's why a clean-install "upgrade" is required from XP -> Vista/7.

And for the most part, everything in Vista works on 7. In fact, sometimes even Vista drivers work for 7, while XP drivers won't work on anything but XP.

That said, I still love 7 and wouldn't move back to XP or Vista for anything.

I guess since AMD figures 2 partial cores under CMT is something like 80% of the throughput of two full cores, then due to rounding errors they can claim it as 2 cores per module 😀.

CMT actually looks like a promising idea. However IMO it remains merely a promising idea as BD failed to demonstrate much in the way of advantages. I'll see if PD can improve upon it.

I think CMT will become another RHT (Reverse HT). It was the original idea behind BD, something that is much like HT but has more resources. Who knows. I think PD will be Deneb for BD. It will improve it, but not enough to make a dent in Intel's current dominance unless they go back to the drawing board and redesign the arch by human hand to help cut some meat off of that 2-billion-transistor design.
 
The die size increases show CMT to be a worthy approach to get this extra perf with minimal die space and power usage.

That sentence sounds really good, except for the words after "The" 😀.

BD uses 315mm^2 on 32nm vs. SB's 216mm^2 on 32nm. BD's TDP is what - 130W while SB's is 95W. Not exactly a ringing endorsement for CMT, despite AMD's wishing we all would conveniently ignore the numbers and believe otherwise.

Your die size argument is not valid. Much of the die size for BD comes from cache; it has almost twice the cache of Thuban or Sandy Bridge.

In order to properly measure the effect of CMT on die size, we need to be able to compare the units on equal terms. Difficult to do right now because of the cache size difference.

As for performance, the difference between SB and Bulldozer has more to do with the design of the processor's execution components and not CMT. I will remind you that AMD was well behind Intel long before CMT.
 
There are some good articles on BD's problematic design here and on the other thread http://www.tomshardware.com/forum/315700-28-bulldozer-considered-failure.

For one thing, the two cores within each module share the front end decoders, so effectively BD has only a 2-issue decoder for each core when both are active. This is a step backwards from Phenom's 3-issue decoder. Intel uses a 4-issue decoder for each core, but for some reason - possibly scheduling - doesn't seem to take as much of a hit using SMT.

There are other reasons as well, such as some huge cache latencies, that cause BD to perform below expectations. Whether these are fixable in a new stepping remains to be seen. I don't think the shared front end can be fixed without a major redesign, however.


Some very good points fazers ... I hadn't read that far in the processor brief, or missed it.
 