AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
Yuka, out of curiosity, what are the specs for your i5 workstation at work?

i5 2500 (vPro), 4GB (3.25GB effective) and a Quadro NVS4200. It's an H67 chipset.

I'm more inclined to blame WinXP though, now that I think about it XD

Cheers!

EDIT:

On another note... Anyone here with a Crosshair V and some flavor of Bulldozer? There was a recent BIOS/UEFI update that claimed to have some perf improvements, as well as driver updates for the MoBo.
 
While I disagree that the FX was a total fizzer (no disrespect to Don), you have to agree it has not soundly outperformed its predecessor ... in a number of cases it even fails to meet it.

Essentially the cache latency and prefetch logic are poor / broken.

Name an Intel CPU that failed to outperform the previous generation ... I can think of the Pentium IV, and in terms of steppings the Prescott was also poor.

Did we bag those two ... yes we did.

How about SB-E, yes I went there.

http://www.techspot.com/review/492-intel-core-i7-3820/page7.html

3.6GHz i7-3820 outperformed by the 3.3GHz i5 all the way down.
 
How about SB-E, yes I went there.

http://www.techspot.com/review/492-intel-core-i7-3820/page7.html

3.6GHz i7-3820 outperformed by the 3.3GHz i5 all the way down.

Still, the quad-core 3820 beat out the 8-core FX 8150 in just about all those game benches, for $20 more in price. It also beat the 8150 in most of the other benchmarks as well, so which is the bigger failure again?? :hello:

From the conclusion:

Final Thoughts

For $285, the Core i7-3820 is a worthwhile entry as it repeatedly exceeded the Core i7-2600K's performance output. The new SB-E chip looked particularly favorable in our encoding benchmarks, while its showing was satisfactory in the application tests, in which it managed to match the 2600K. It didn't do as well in our gaming tests, as it was disappointing to see the processor fall in the vicinity of the last-generation i7-920 and i5-750.

To be fair, the i7-3960X isn't all that impressive when gaming either, though it's at least on par with the 2600K. I believe the SB-E parts struggle here because they have inferior memory performance to the 2500K and 2600K. Despite the newer chips supporting quad-channel RAM, the SB-E processors have a higher memory latency, which attributes for the loss in bandwidth.

When you consider that those investing in the 3820 or 2600K are likely building a high-end rig, they won't be after a budget motherboard and high-end Z68/X79 boards tend to carry similar price tags. For instance, the Asrock Z68 Extreme7 costs $270 while the X79 Extreme7 is slightly cheaper at $260.

In other words, regardless of whether you purchase an 3820 or 2600K, a high-end build should cost roughly the same. By opting for the 3820, you get a platform that supports more RAM and PCI Express 3.0 along with more PCIe lanes. Along with features tipping in favor of the 3820, we believe it provides the best overall performance and for that reason, we would choose it over the i7-2600K or 2700K.

The 39XX 6-cores and 38XX quad-core are more intended for workstation environments - I don't see Intel touting them as extreme 'gaming' CPUs, unlike your fave company which tarted up the 8150 as a gamer's choice.
 
Still, the quad-core 3820 beat out the 8-core FX 8150 in just about all those game benches, for $20 more in price. It also beat the 8150 in most of the other benchmarks as well, so which is the bigger failure again?? :hello:

The 39XX 6-cores and 38XX quad-core are more intended for workstation environments - I don't see Intel touting them as extreme 'gaming' CPUs, unlike your fave company which tarted up the 8150 as a gamer's choice.
It doesn't matter how you look at it, AMD can't win, that's the conclusion everyone comes up with in one way or another.

If it's AMD vs previous-gen AMD, look at X bench not Y; but that's not the point, look at Intel, they never lose to their own CPUs, and even if it's shown they do, that's not the point, Intel doesn't have slower CPUs than AMD.

If it's AMD vs Intel, look at their power draw, don't look at benchmarks that show it being ~5%.

It's all the same crap over and over, no matter what the question is, someone comes up with an entirely different answer.

Where did AMD lose on every game bench vs their previous gen CPU?

http://www.techspot.com/review/452-amd-bulldozer-fx-cpus/page10.html

Intel did, straight up all the way down. And that was the question that was asked, not whether Intel failed to surpass AMD. QUIT CHANGING THE SCENARIO just so you can have something to bitch about.

Name an Intel CPU that failed to outperform the previous generation
Where anywhere does that say AMD?

Final thought: http://www.anandtech.com/show/5537/intel-releases-core-i73820
 
Aside from the game being optimized for the Intel platform,

How do you know if the game is optimised for the Intel platform or not? :pt1cable:

Exactly. Don (Cleeve, I think, was his AMDZone username) also called them out for exactly what they are - pretending to hold everybody to their own "high standards" for proof and then insisting that the first (and only) benchmarks to agree with their own preconceived conclusions are correct, despite dozens of other reviews showing the opposite.

Don was right - they are a tiny, insignificant and uninfluential fanbois club.


Don's efforts were very impressive. :lol:
 
There has already been some rather extensive analysis done on where BD fell short - I'm too busy to look up the links again but they were mentioned in numerous threads here previously. IIRC one review site had an analysis entitled "death by a thousand paper cuts" or something similar.

The shared front-end means that effectively a BD "core" has just two integer pipelines available when both cores of the module are in use. In contrast, K10.5 had 3 pipes per core available and Core2 and later iterations have 4 pipes.

There are many other 'paper cuts' to BD's performance that ultimately cost significant performance over what we were led to believe a year ago..

Not to be disrespectful, but you might want to go research what exactly an integer pipeline is. The AMD "module" has more than two; each core has four integer pipelines, two for ALU and two for AGU functions. Thus a single AMD module actually has eight integer pipelines if you really want to get technical.

As has been stated several times, it's not the module approach that is killing BD performance, it's the cache latency and poor prediction / prefetch that is stalling the units out. There are only so many different ways to do binary math, and we figured them all out a few decades ago. The only difference between how an AMD, an Intel, a Sun or an IBM CPU does math is in the secret magic sauce of the prefetch / cache unit; otherwise they all add 1 + 1 the exact same way (bit order notwithstanding).

A typical CPU design involves the following:
1~3 Integer Units
1 Floating Point Unit (x87 math coprocessor or SIMD unit)
1 MMU (for loads / stores)
1 Cache system
1~4 Instruction Scheduler units

x86 CPUs also include the following:
1 Instruction Decoder unit
1~4 x86 Instruction Scheduler units
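
To make that layout concrete for the Bulldozer module in question, here is a rough sketch in Python of how the resources are shared between the two integer cores. It is purely my own illustrative model based on the counts discussed in this thread (4-wide shared decode, four integer pipes per core, shared FPU and L2), not anything official from AMD:

[code]
# Rough, illustrative model of a Bulldozer module's resources.
# Counts reflect the discussion above, not official AMD specifications.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IntegerCore:
    # Each integer core has its own internal scheduler and four pipelines:
    # two ALU pipes for arithmetic/logic, two AGU pipes for address generation.
    alu_pipes: int = 2
    agu_pipes: int = 2
    has_own_scheduler: bool = True

@dataclass
class Module:
    # Resources shared between the two integer cores of a module:
    decode_width: int = 4          # front end decodes up to 4 instructions per clock
    fpu_width_bits: int = 256      # one FPU, usable as two 128-bit pipes
    shared_l2_kib: int = 2048      # 2MB of L2 shared by both cores
    cores: List[IntegerCore] = field(default_factory=lambda: [IntegerCore(), IntegerCore()])

bd_module = Module()
total_int_pipes = sum(c.alu_pipes + c.agu_pipes for c in bd_module.cores)
print(f"Integer pipelines per module: {total_int_pipes}")  # 8, as stated above
[/code]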

Back when AMD announced their modular approach I called them overloaded cores, because I believed there would only be one front-end instruction scheduler that would act as a bottleneck. Information released afterward showed there were actually 4 front-end instruction schedulers, and that each integer core had its own dedicated internal instruction scheduler and four integer pipelines. That dismissed my concerns about them just being overloaded cores, as there are enough front-end schedulers to efficiently assign instructions to the individual processing units. Instruction scheduling is not the bottleneck. Thus my surprise when BD performed like a$$ during official reviewer benchmarks (not ES). Later it was demonstrated that BD has crazy high cache latencies and piss-poor cache hit rates, which meant the secret sauce that goes into the design of the instruction prefetch unit / branch predictor was screwing something up, forcing BD to eat those high latencies. I blame the high L2 latencies on BD using a shared L2 cache between cores.

So while the 8150 is indeed an 8-core CPU, it is unable to fully utilize those 8 cores. OS patches and software updates can only do so much; if the CPU's own instruction prefetch / predictor is messed up and forcing it to rely on long-latency cache, then it's a DOA chip. AMD has lots of work over the next few years to improve upon and perfect this design, and I expect they're already busting a$$ on fixing the latencies and predictors. Giving each core a dedicated 1MB of L2 cache would be a good start.
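
To put some rough numbers on why the cache behaviour hurts so much, here is a quick back-of-the-envelope average memory access time (AMAT) comparison. The latencies and hit rates are made up for illustration only, not measured BD figures:

[code]
# Back-of-the-envelope AMAT (average memory access time) comparison.
# All latencies (in cycles) and hit rates are hypothetical, purely to show
# how higher L2 latency plus worse hit rates stack up against the core.

def amat(l1_hit, l1_rate, l2_hit, l2_rate, mem):
    # Expected cycles per memory access for a two-level cache hierarchy.
    l2_miss_penalty = l2_hit + (1 - l2_rate) * mem
    return l1_hit + (1 - l1_rate) * l2_miss_penalty

# "Healthy" cache: 4-cycle L1, 15-cycle L2, 95% / 90% hit rates
healthy = amat(l1_hit=4, l1_rate=0.95, l2_hit=15, l2_rate=0.90, mem=200)

# "Sluggish" cache: same L1, slower L2, worse hit rates
sluggish = amat(l1_hit=4, l1_rate=0.93, l2_hit=22, l2_rate=0.85, mem=200)

print(f"healthy: {healthy:.2f} cycles/access, sluggish: {sluggish:.2f} cycles/access")
# Every extra cycle per access is a cycle the integer pipes sit stalled waiting on data.
[/code]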
 
How do you know if the game is optimised for the Intel platform or not? :pt1cable:

If it wasn't obvious enough

Blizzard worked with companies such as Intel to make sure that StarCraft II works efficiently on the latest PCs and supports the largest group of systems and video cards on the market today. Sigaty said his team had the foresight from the start of development to look to the future of processing power and technology.

http://software.intel.com/sites/billboard/article/blizzard-entertainment-re-imagines-starcraft-intels-help
 
I said this, well, supported this, way back near the beginning of the thread. AMD by far has more features and add-ons on their platform than Intel. Phenom II was great because CPU performance kept up with SB and it had a boatload of other features that made you want to buy it: more PCIe lanes, faster memory, etc.
If you're buying or building a computer, you need to realize that the CPU is not only the brain and main component of your computer; it also determines every other part you're going to buy in one way or another.
BD is just so far behind in performance it is hard to recommend.
Trinity, however unlikely this is to happen, could fall somewhere near SB or IB in CPU performance. (IB is not going to be that much better, whatever the case.) If it falls within 10% of SB or IB then I may buy it over the Intel platform.
This looks awfully fanboyish of me, and it does show my bias towards AMD. I want to see them succeed, and if you don't, you're a fool.

Out of everything here, only the extra PCIe lanes part is true, but to get a nice mobo like that costs about the same as a "low" end LGA1366 mobo (normally $200+). And by "low" I mean what they consider low end, which is still high end.

Most of the time the features are the same: SATA 6Gb/s, USB 3, etc. I don't see the need for more than 4 SATA 6Gb/s ports, as Blu-ray doesn't need it, HDDs don't need it, and only an SSD can benefit from the faster interface.

As for memory, it's great to support faster memory, but every system I have built with a Phenom II or FX CPU still has lower total memory bandwidth than Nehalem does, and still lower than SB. So having faster memory is not always a benefit. Most of the time the CPU only supports up to a certain memory speed, while the mobo can overclock it beyond that.

I doubt Trinity will fall within 10% of IB. Maybe SB, but I doubt IB. And I think it will be a close fight on the IGP side, as Intel is doing quite a bit of upgrading and, as we all know, Intel can do nearly anything they set out to do. We saw that with the HD3K, which did better than most expected considering Intel's previous IGPs were flat-out crap for anything more than basic use.

And as for Phenom II, I still recommend it over FX, but in terms of performance it's still about on par with first-gen Core 2 Quad, not very close to SB.

I for one am really pleased we are not stooping so low as to badmouth them in our forums ... which says a lot about our users here having a more balanced view of things.

I read the thread but declined to post over there.

I note MU and a couple of other users who are also here didn't get into it.

While I disagree that the FX was a total fizzer (no disrespect to Don), you have to agree it has not soundly outperformed its predecessor ... in a number of cases it even fails to meet it.

Essentially the cache latency and prefetch logic are poor / broken.

Name an Intel CPU that failed to outperform the previous generation ... I can think of the Pentium IV, and in terms of steppings the Prescott was also poor.

Did we bag those two ... yes we did.

There are some redeeming qualities to AMD's modular approach, but at present they are not being felt ... perhaps the next stepping might see something?

Anyway, I just wanted to give you guys a pat on the back for not buying into their dirt tactics.

Toms Hardware Rank = 1035 http://www.alexa.com/siteinfo/tomshardware.com

AMDZone Rank = 284,085 http://www.alexa.com/siteinfo/amdzone.com

Hardly worth discussing further eh?

😉

It's still pathetic to see them take low blows. To them, any site that states what it sees instead of trying to justify it is paid by Intel. So pretty much every other site out there, save a very few, is Intel-sponsored, since they all gave the same overall opinion: that BD just wasn't that great.

Its main failings were weak performance, poor scaling and, worst of all I think, power usage, something AMD used to push Intel around with back in the Athlon 64/X2 days. Per clock and per core, it's weak compared to SB and Phenom II. It uses way too much power, especially when overclocked. And it scales badly unless the program is optimized for feature sets found only in FX.

If AMD can fix these, though one may be out of their control (the power usage may be more process-related, and thus GF controls that), then Trinity and PD may be another Phenom II: not quite an Intel killer, but something they can sell at a budget price. But then again, they need money to move forward faster, and being budget-friendly is not the best way to get it.

Is FX a failure? No. But it sure is not a better option. The FX8150 is $269, the 2500K is $229 on Newegg. Hard to say it's a better option just because of the extra "cores".

I still think the one area Intel will shine is power usage. Their top-end IB CPU will only have a TDP of 77W while having a higher clock speed and a more advanced GPU core than SB.

Time will tell. For now, AMD better hope they get PD right.
 
It doesn't matter how you look at it, AMD can't win, that's the conclusion everyone comes up with in one way or another.

If it's AMD vs previous-gen AMD, look at X bench not Y; but that's not the point, look at Intel, they never lose to their own CPUs, and even if it's shown they do, that's not the point, Intel doesn't have slower CPUs than AMD.

If it's AMD vs Intel, look at their power draw, don't look at benchmarks that show it being ~5%.

It's all the same crap over and over, no matter what the question is, someone comes up with an entirely different answer.

You're missing the point - Intel doesn't advertise or promote their workstation CPUs for gaming - AMD did with BD (and still does AFAIK). If AMD says BD is great for gamers, then it is entirely fair to compare it on that basis, as a check on their credibility.

IIRC while JF-AMD never allowed himself to be pinned down as to exactly what BD was optimized for (other than stating Interlagos would offer significant improvement over the prior generation), Anandtech shows even that to be a big stretch:

The Opteron 6276 offered similar performance to the Xeon in our MySQL OLTP benchmarks. If we take into account the hard to quantify TPC-C benchmarks, the Opteron 6276 offers equal to slightly better OLTP performance. So for midrange OLTP systems, the Opteron 6276 makes sense if the higher core count does not increase your software license. The same is true for low end ERP systems.

When we look at the higher end OLTP and the non low end ERP market, the cost of buying server hardware is lost in the noise. The Westmere-EX with its higher thread count and performance will be the top choice in that case: higher thread count, better RAS, and a higher number of DIMM slots.

AMD also lost the low end OLAP market: the Xeon offers a (far) superior performance/watt ratio on mySQL. In the midrange and high end OLAP market, the software costs of for example SQL Server increase the importance of performance and performance/watt and make server hardware costs a minor issue. Especially the "performance first" OLAP market will be dominated by the Xeon, which can offer up to 3.06GHz SKUs without increasing the TDP.

The strong HPC performance and the low price continue to make the Opteron a very attractive platform for HPC applications. While we haven't tested this ourself, even Intel admits that they are "challenged in that area".

The Xeon E5, aka Sandy Bridge EP

There is little doubt that the Xeon E5 will be a serious threat for the new Opteron. The Xeon E5 offers for example twice the peak AVX throughput. Add to this the fact that the Xeon will get a quad channel DDR3-1600 memory interface and you know that the Opteron's leadership in HPC applications is going to be challenged. Luckily for AMD, the 8-core top models of the Xeon E5 will not be cheap according to leaked price tables. Much will depend on how the 6-core midrange models fare against the Opteron.

The Hardware Enthusiast Point of View

The disappointing results in the non-server applications is easy to explain as the architecture is clearly more targeted at server workloads. However, the server workloads show a very blurry picture as well. Looking at the server performance results of the new Opteron is nothing less than very confusing. It can be very capable in some applications (OLTP, ERP, HPC) but disappointing in others (OLAP, Rendering). The same is true for the performance/watt results. And of course, if you name a new architecture Bulldozer and you target it at the server space, you expect something better than "similar to a midrange Xeon".

It is clear to us that quite a few things are suboptimal in the first implementation of this new AMD architecture. For example, the second integer cluster (CMT) is doing an excellent job. If you make sure the front end is working at full speed, we measured a solid 70 to 90% increase in performance enabling CMT (we will give more detail in our next article). CMT works superbly and always gives better results than SMT... until you end up with heavy locking contention issues. That indicates that something goes wrong in the front end. The software applications that do not scale well could be served well with low core count "Valencia" Opteron 4200s, but when we write this, the best AMD could offer was a 3.3GHz 6-core. The architecture is clearly capable of reaching very high clockspeeds, but we saw very little performance increase from Turbo Core.

IOW, BD (Interlagos) comes out fairly mediocre.

While I wouldn't call either desktop or server an "epic fail", neither came out anywhere near to expectations mostly set by AMD itself.

Where did AMD lose on every game bench vs their previous gen CPU?

Intel did, straight up all the way down. And that was the question that was asked.

And that is a silly question, about on par with benching Interlagos on games. OTOH, comparing Interlagos to the previous gen AMD server CPUs is a legit question, and the answer is telling..

 
Not to be disrespectful, but you might want to go research what exactly an integer pipeline is. The AMD "module" has more than two; each core has four integer pipelines, two for ALU and two for AGU functions. Thus a single AMD module actually has eight integer pipelines if you really want to get technical.

Yes, I've noted the AMD slides showing 4 integer pipes per core since AMD previewed them a couple of years ago. However, what I said was effectively two int pipes per core, not the actual number. From http://semiaccurate.com/2011/10/17/bulldozer-doesnt-have-just-a-single-problem/:

The front end is also a big problem. Bulldozer’s front end can decode four instructions per clock vs three in Stars, and Intel’s Sandy Bridge also can do four per clock. This would be an advancement if that front end serviced one integer core, but it doesn’t, it services two. That front end can switch on a per-clock basis, feeding one core or the other, but not both in one clock.

This effectively pushes the decode rate from 3/clock to 2/clock/core average in Bulldozer. Given buffers, stalls, and memory wait times, this may not be a bottleneck in practice, but it sure seems questionable on paper. One thing to be sure is that it is a regression from Stars and certainly doesn’t live up to the promises made by insiders over the past few years.

The integer cores have two ports per core vs three on the older architectures, so theoretically this 4 instruction issue per two clocks isn’t a big problem, but since IPC does indeed go down in the real world vs Stars, it doesn’t seem to work out well in practice. That said, keeping a two pipe core close to a three pipe core in single threaded integer performance is a pretty amazing feat, but the competition has more decode, more threads, and overall better performance. Peeling out the actual effect on performance is almost impossible, but there is unquestionably one. Slice a few more cuts here.
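
Just to spell out the arithmetic in that quote, here is a toy calculation of my own using the decode widths mentioned above; it ignores buffers, stalls and memory waits entirely:

[code]
# Toy calculation of effective decode bandwidth per integer core.
# Widths are the ones discussed above (4-wide shared front end on Bulldozer,
# 3-wide dedicated front end per core on Stars/K10.5); everything else is ignored.

bd_decode_per_module = 4       # shared, feeds one core or the other each clock
bd_cores_per_module  = 2
stars_decode_per_core = 3      # dedicated decode per core on the older design

bd_effective_per_core = bd_decode_per_module / bd_cores_per_module
print(f"BD, both cores of a module busy: ~{bd_effective_per_core:.0f} instructions/clock/core")
print(f"Stars:                            {stars_decode_per_core} instructions/clock/core")
# With only one core of the module active, that core gets the full 4-wide decode,
# which is part of why lightly threaded workloads don't show the same regression.
[/code]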

As has been stated several times, it's not the module approach that is killing BD performance, it's the cache latency and poor prediction / prefetch that is stalling the units out. There are only so many different ways to do binary math, and we figured them all out a few decades ago. The only difference between how an AMD, an Intel, a Sun or an IBM CPU does math is in the secret magic sauce of the prefetch / cache unit; otherwise they all add 1 + 1 the exact same way (bit order notwithstanding).

Many reviewers disagree - there are many things suboptimal in BD, including the shared front-end design which is the basic idea behind the module approach. While it is true the cache latency and branch prediction are poorly implemented in this first iteration, the problems are more extensive than just those two areas..
 
Fermi 2.0: Nvidia's monolithic design wins again. AMD doesn't seem to be having too many issues at ~378mm2, but Nvidia apparently can't get one to work at 550mm2. "Learning from the past is not good for the future" is apparently the Nvidiots' philosophy, but aside from that, what's this have to do with PD?
 
"So, this is probably the best GPU we have ever built and the performance and power efficiency is surely the best that we have ever created,” said Mr. Huang

Wow, that's a very bold statement. Hope he doesn't fall short and we laugh at a BD from nVidia! xD

Cheers!
 
Fermi 2.0: Nvidia's monolithic design wins again. AMD doesn't seem to be having too many issues at ~378mm2, but Nvidia apparently can't get one to work at 550mm2. "Learning from the past is not good for the future" is apparently the Nvidiots' philosophy, but aside from that, what's this have to do with PD?
I'm not sure, but it had better perform twice as well as a 7970. It DOES have twice the die space 😛. *sarcasm*

Posting something about PD to stay on topic:
Does anyone think AMD can pull off the "more IPC-based cores" approach?
 
As has been stated several times, it's not the module approach that is killing BD performance, it's the cache latency and poor prediction / prefetch that is stalling the units out. There are only so many different ways to do binary math, and we figured them all out a few decades ago. The only difference between how an AMD, an Intel, a Sun or an IBM CPU does math is in the secret magic sauce of the prefetch / cache unit; otherwise they all add 1 + 1 the exact same way (bit order notwithstanding).

Many reviewers [strike]have made no comment[/strike] - there are many things suboptimal in BD, including the shared front-end design which is the basic idea behind the module approach. While it is true the cache latency and branch prediction are poorly implemented in this first iteration, the problems are more extensive than just those two areas..

You're spewing tons of BS about "effective" integer cores (WTF is "effective" supposed to mean, is that a concept you created in your own head with next to no engineering experience?). Four pipelines per core means eight per module, far more than two.

You also fail to understand how instruction queuing and decoding work. That's an entire book in and of itself and far beyond the scope of this discussion. Safe to say, you're horribly wrong if you think that four schedulers couldn't keep two integer cores and two 128-bit FPUs busy as part of a superscalar design. Eventually you'll realize there are schedulers internal to each core, and then you're going to ponder what those schedulers are for and the difference between the ones outside and the ones inside.

Pretty much you're just looking for reasons to hate on a product, and this is the kind of blind hate that I can't stand to see spread, even on enthusiast websites.
 
You're spewing tons of BS about "effective" integer cores (WTF is "effective" supposed to mean, is that a concept you created in your own head with next to no engineering experience?). Four pipelines per core means eight per module, far more than two.

You also fail to understand how instruction queuing and decoding work. That's an entire book in and of itself and far beyond the scope of this discussion. Safe to say, you're horribly wrong if you think that four schedulers couldn't keep two integer cores and two 128-bit FPUs busy as part of a superscalar design. Eventually you'll realize there are schedulers internal to each core, and then you're going to ponder what those schedulers are for and the difference between the ones outside and the ones inside.

Pretty much you're just looking for reasons to hate on a product, and this is the kind of blind hate that I can't stand to see spread, even on enthusiast websites.
You need to resize your sig as it's too high, rules are rules.

 
Fermi 2.0: Nvidia's monolithic design wins again. AMD doesn't seem to be having too many issues at ~378mm2, but Nvidia apparently can't get one to work at 550mm2. "Learning from the past is not good for the future" is apparently the Nvidiots' philosophy, but aside from that, what's this have to do with PD?

Fermi may have run hot but it sure did perform. Kepler probably will too.

As for the relevance, it has to do with TSMC's 28nm process node, which they are having trouble with according to nVidia. It's the same 28nm process AMD was planning to move their APUs to instead of using GF's 32nm (GF is also having problems), a move which they have cancelled. For now.
 