AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet-to-be-released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post questions or information relevant to the topic; anything else will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame-baiting comments about the blue, red, and green teams and they will be deleted.

Enjoy ...
 
So this is not a stepping revision but the same first model chip with higher clocks? If so, then what's the point of that?

No, it's the same stepping, lower stock clocks. That means it's just a way for AMD to push out the "not quite up to par" parts, much like the Phenom II X2/X3s.

Hmm, I used to think that you were one of the more knowledgeable and reasonable posters here, but this post in particular, and your recent responses to others in general, have pretty much destroyed that perception. You seem more interested in your own bloated ego and protecting your "Internet credibility" than in discussing the issues.

I provided links to an S/A article that countered your position that BD's problems were mostly confined to cache. Either point out how S/A was wrong or don't bother responding. Attacking a poster rather than countering his or her argument is not acceptable under the TOS here.

Back on topic, it doesn't matter how many integer or FP pipes a "core" might have - if the shared scheduler can only keep two per core filled under CMT, then the core is only going to be able to execute two integer computations simultaneously, rather than 3 or 4 as in K8 and later, and Core2 and later respectively.

Pretty much you're just looking for reasons to hate on a product. This is the kind of blind hate that I can't stand to see spread, even on enthusiast websites.

Whatever. I was considering BD for my next build, and was disappointed to see AMD release yet another mediocre product. I don't hate the product or AMD, I just think it is a disappointment. I might point out that your own posts demonstrate you think it is not a stellar product as well.

Yep. You can have 4 water pipes but if the lines can only fill 2 at a time, don't expect it to somehow fill all 4.
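To put rough numbers on that analogy, here's a toy throughput model: a shared front end that can decode a fixed number of ops per cycle, feeding two cores that each have more execution ports than they can be fed when both are busy. The widths are placeholders for illustration, not measured Bulldozer figures.

# Toy model: a shared front end decoding FRONT_END_WIDTH ops/cycle,
# split across two cores, each with PORTS_PER_CORE execution ports.
# Numbers are assumptions for illustration only, not Bulldozer specs.
FRONT_END_WIDTH = 4   # decoded ops per cycle, shared by the module
PORTS_PER_CORE = 4    # execution ports each core could theoretically use

def per_core_throughput(active_cores):
    # Each core executes at most PORTS_PER_CORE ops/cycle,
    # but it only receives its share of the decoded ops.
    fed = FRONT_END_WIDTH / active_cores
    return min(fed, PORTS_PER_CORE)

print(per_core_throughput(1))  # 4.0 ops/cycle when one core has the front end to itself
print(per_core_throughput(2))  # 2.0 ops/cycle when both cores share it (CMT)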

Originally I was interested in Barcelona as an upgrade from my P4 Northwood, but after the delays and the 'dancing in the aisles' and other BS leaks that I learned about, including here on THG articles and the THG forums, I got a Q6700 shortly after it was released and never looked back. Ever since then, I have waited for the independent reviews to appear before plunking down $$ on components. I recall your buying that AM3+ mobo however 😀, but rest assured there are plenty of other posters here who did the same thing. Maybe you guys should file for refunds from AMD or from Baron 😛..

Most of us are here to learn, and appreciate hard facts and links. Unfortunately some few already 'know it all' and are here to pontificate on their preconceived notions instead. Take the discussion about the L3 cache being next to worthless, for instance. Intel has publicly stated they don't incorporate features unless there is at least a 2x performance increase for the extra power consumption required. Since the L3 cache uses a lot of transistors, clearly its power draw is not negligible. However, its benefits must outweigh the extra power, seeing as Intel seems to make its L3 cache bigger with each generation, and I note that both AMD's and Intel's top-end CPUs have loads of L3. Plus, looking at AT's benchmarks for an Athlon X4 and a Phenom X4 at the same clock speeds, where the only difference is that the first has no L3 while the second does, some benchmarks show a pretty significant increase with the L3 cache present.

While I have learned some things from Palladin and mostly appreciate his participating on the forums, lately he seems off his meds or something. At least he is not a blind AMD fanboy too, unlike some others..

Intel also uses the L3 as storage for L2 and L1 instructions so it can access them much faster than from memory. L3 obviously has its place, or neither AMD nor Intel would have it. And I honestly trust engineers with master's degrees and years of design experience over some random net dude.

Standard monitors have higher pixel density than TVs, which have larger display panels. That is why TVs are cheap for their size, but up close you can tell: they look like crap, while a monitor doesn't. Personally I would like to see higher pixel densities on TVs for better and more consistent image quality.

Yep. Always have been. That's why 1080p on a 60" looks like crap compared to a 40". Maybe they will get it some day.
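For anyone who wants actual numbers on that, here's a quick pixel-density (PPI) calculation for standard 16:9 panels; the 24" monitor at the end is just there for comparison:

import math

# Pixels per inch for a 16:9 panel of a given diagonal size.
def ppi(h_pixels, v_pixels, diagonal_inches):
    diag_pixels = math.hypot(h_pixels, v_pixels)
    return diag_pixels / diagonal_inches

print(round(ppi(1920, 1080, 40)))  # ~55 PPI on a 40" 1080p set
print(round(ppi(1920, 1080, 60)))  # ~37 PPI on a 60" 1080p set
print(round(ppi(1920, 1080, 24)))  # ~92 PPI on a typical 24" 1080p monitor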

I will have to look over BD again, but one thing I know for sure is that there is no L0 in BD, unlike SB. I wonder how much effort they really put into designing their CPUs nowadays. I know that their main pitfall has been their almost total reliance on automated design software; one of their former engineers said a few months ago that designs produced by automated software are usually 20% larger and 20% slower than doing the work by hand the traditional way. That is one reason the L3 takes up so much space. Eventually the industry is going to have to come to the realization that every transistor added counts, and every transistor you manage to save results in savings for everyone.

It's not only the L3 that takes up so much space; in BD alone there seem to be large areas of nothing.

Not sure if Intel does the same with automation, or if they do, whether they then take the design and go over it by hand to cut the fat off, but I can see how that type of design can really end up worse than a human's.
 
I will have to look over BD again, but one thing I know for sure is that there is no L0 in BD, unlike SB. I wonder how much effort they really put into designing their CPUs nowadays. I know that their main pitfall has been their almost total reliance on automated design software; one of their former engineers said a few months ago that designs produced by automated software are usually 20% larger and 20% slower than doing the work by hand the traditional way. That is one reason the L3 takes up so much space. Eventually the industry is going to have to come to the realization that every transistor added counts, and every transistor you manage to save results in savings for everyone.
I have no idea what that guy meant by automated design software, but there's no way to design a chip without automated software. Even reworking the chip is done in software and optimized using the software. There is no way to design a 1.2-billion-transistor chip by hand. Given that each transistor has 2 modes of operation and around 10 are needed for a basic gate, you would be working with logic on the order of 2^1000000 per section of chip, which is impossible to deal with without software. Logically the chip would be broken down and each part written in some kind of hardware description language, then compiled and tested against the transistor specifications and simulated, then rewritten and compiled until they got something that worked within the specifications of the chip. All of this is done with software and has been done with software since they figured out they could simulate gate logic with software.
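As a toy illustration of what "simulate gate logic with software" means at the smallest possible scale, here's a sketch of a 1-bit full adder built only from NAND functions and checked exhaustively. A real flow obviously uses an HDL plus synthesis, place-and-route, and timing tools rather than hand-written Python:

# Minimal gate-level sketch: a 1-bit full adder built only from NAND,
# verified against the arithmetic definition for all 8 input cases.
def nand(a, b):
    return 1 - (a & b)

def xor(a, b):            # XOR built from 4 NAND gates
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a, b, cin):
    s1 = xor(a, b)
    total = xor(s1, cin)
    carry = nand(nand(a, b), nand(s1, cin))   # OR of the two partial carries
    return total, carry

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            total, carry = full_adder(a, b, cin)
            assert carry * 2 + total == a + b + cin
print("full adder checks out for all 8 input combinations")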
 
It's not only the L3 that takes up so much space; in BD alone there seem to be large areas of nothing.

Not sure if Intel does the same with automation, or if they do, whether they then take the design and go over it by hand to cut the fat off, but I can see how that type of design can really end up worse than a human's.

Intel does use such software, but to a lesser extent and only on parts that are not a priority. Plus Intel has been cutting the fat with every new generation, or we wouldn't be seeing the improvements. :)

I was looking over what is going to replace the E-350 and E-450 from AMD and I'm not thrilled. They are bringing out a new southbridge while the APU isn't changing much :s Personally I would have opted for a dual-channel memory controller, full-clock L2, and integrating the southbridge into the same die so it is an all-in-one chip. On top of that it would have allowed higher clocks on the CPU side. A single chip is a lot easier for OEMs to work with and very easy to keep cool.
 
My guess is that it is indeed GF's process problems that are mostly responsible for the low clock speeds as well as the low yields. Most of the IBM fab club - including GF - will be abandoning gate-first HKMG at 22nm and going to the Intel/TSMC gate-last approach. While GF's 32nm is apparently pretty good at making leaky transistors (e.g., the recent world-record in overclocking BD), it's not so good at making low-leakage but high-clocking ones.

The 4.2GHz on the FX-4100 isn't a low clock speed though.

If you look at Sandy Bridge 2600/2700 overclocking, they reach a hurdle around the 4.5/4.6GHz range as well. By 5GHz the power draw doubles.

So the GF process doesn't appear that much worse than Intel's at 32nm.
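The "power doubles by 5GHz" observation is roughly what the first-order dynamic-power relation P ~ C*V^2*f predicts once you factor in the voltage bump those last few hundred MHz need. A quick sketch, with made-up but plausible voltage and frequency numbers (not measured values):

# First-order dynamic power: P ~ C * V^2 * f (leakage ignored).
# The voltages below are illustrative guesses, not measured values.
def relative_power(freq_ghz, vcore, base_freq=3.4, base_v=1.20):
    return (vcore / base_v) ** 2 * (freq_ghz / base_freq)

print(round(relative_power(4.5, 1.35), 2))  # ~1.7x the stock power
print(round(relative_power(5.0, 1.50), 2))  # ~2.3x -- roughly the "doubling" people see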
 
I have no idea what that guy meant by automated design software, but there's no way to design a chip without automated software. Even reworking the chip is done in software and optimized using the software. There is no way to design a 1.2-billion-transistor chip by hand. Given that each transistor has 2 modes of operation and around 10 are needed for a basic gate, you would be working with logic on the order of 2^1000000 per section of chip, which is impossible to deal with without software. Logically the chip would be broken down and each part written in some kind of hardware description language, then compiled and tested against the transistor specifications and simulated, then rewritten and compiled until they got something that worked within the specifications of the chip. All of this is done with software and has been done with software since they figured out they could simulate gate logic with software.

It can be done, but that requires a larger and very highly skilled team. I have no problem with the simulation, as that isn't the problem. The problem is when the software uses more transistors than were actually needed to build the circuit: 20% larger and 20% slower vs the old way of designing the chip. That is why in the past such projects often took several years before there was even a chance to test the final silicon. With software it is very easy to just plug in a set of specifications, let the program do its job, then take the results into a simulator. Eventually there will come a point where every transistor added comes at a premium, once manufacturing process and wafer size have maxed out for such methods, and so cost will skyrocket. Intel is playing its cards smart ahead of time by trying to integrate useful features while keeping its die sizes down and designs lean. AMD can get away with it for two or three generations before hitting the wall and having no way out other than a crash-course project to continue introducing new generations. GPUs will likely hit this wall long before CPUs do, but that is likely to be only a few years apart.
 
The 4.2GHz on the FX-4100 isn't a low clock speed though.

If you look at Sandy Bridge 2600/2700 overclocking, they reach a hurdle around the 4.5/4.6GHz range as well. By 5GHz the power draw doubles.

So the GF process doesn't appear that much worse than Intel's at 32nm.

The GF process allows for high clocks, and AMD's designs favor high clocks, while Intel's favor performance per clock, where efficiency wins. Intel doesn't need to run at very high clocks to achieve high performance. The moment AMD switches from chasing GHz alone to something that squeezes the most out of more manageable clocks, power consumption will drop while giving Intel something to think about.
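Put another way, performance is roughly IPC times clock. A quick sketch with made-up IPC figures (placeholders, not benchmark results) shows why a lower-clocked, higher-IPC design can still come out ahead:

# performance ~ IPC * frequency; the IPC values here are illustrative
# placeholders, not benchmark results.
def relative_perf(ipc, freq_ghz):
    return ipc * freq_ghz

high_ipc = relative_perf(ipc=1.3, freq_ghz=3.4)    # efficiency-focused design
high_clock = relative_perf(ipc=1.0, freq_ghz=4.2)  # frequency-focused design
print(round(high_ipc, 2), round(high_clock, 2))    # 4.42 vs 4.2 -- the slower-clocked chip still edges ahead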
 
Intel does use such software, but to a lesser extent and only on parts that are not a priority. Plus Intel has been cutting the fat with every new generation, or we wouldn't be seeing the improvements. :)

I was looking over what is going to replace the E-350 and E-450 from AMD and I'm not thrilled. They are bringing out a new southbridge while the APU isn't changing much :s Personally I would have opted for a dual-channel memory controller, full-clock L2, and integrating the southbridge into the same die so it is an all-in-one chip. On top of that it would have allowed higher clocks on the CPU side. A single chip is a lot easier for OEMs to work with and very easy to keep cool.

True. And if Intel does go the stacked-RAM route, it would be a game changer in a lot of ways. I guess we shall see.
 
True. And if Intel does go the stacked-RAM route, it would be a game changer in a lot of ways. I guess we shall see.

Stacking is the future, and the first one to get there at a good price has a lot to gain in market share. Once process and die size max out, stacking is the only way forward while keeping conventional tech going. Plus at that point TDP and yield become the main limiting factors.
 
Stacking is the future, and the first one to get there at a good price has a lot to gain in market share. Once process and die size max out, stacking is the only way forward while keeping conventional tech going. Plus at that point TDP and yield become the main limiting factors.


I would expect to see this on the Atom first. Perhaps that's why they're delaying the 22nm die shrink on Atom (Silvermont): to line it up with stacked memory in 2013. A potential knockout punch for mobile ARM processors.

 
It can be done, but that requires a larger and very highly skilled team. I have no problem with the simulation, as that isn't the problem. The problem is when the software uses more transistors than were actually needed to build the circuit: 20% larger and 20% slower vs the old way of designing the chip. That is why in the past such projects often took several years before there was even a chance to test the final silicon. With software it is very easy to just plug in a set of specifications, let the program do its job, then take the results into a simulator. Eventually there will come a point where every transistor added comes at a premium, once manufacturing process and wafer size have maxed out for such methods, and so cost will skyrocket. Intel is playing its cards smart ahead of time by trying to integrate useful features while keeping its die sizes down and designs lean. AMD can get away with it for two or three generations before hitting the wall and having no way out other than a crash-course project to continue introducing new generations. GPUs will likely hit this wall long before CPUs do, but that is likely to be only a few years apart.
Not sure if you realize, but doing the grunt work by hand is pointless and is all done by software no matter the chip. Do you realize it would take about 38 years just to count to 1.2 billion if you counted one number a second?
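(The 38-year figure checks out, for what it's worth:)

# 1.2 billion transistors counted at one per second, non-stop:
transistors = 1_200_000_000
seconds_per_year = 60 * 60 * 24 * 365
print(transistors / seconds_per_year)  # ~38.05 years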
 
I would expect to see this on the Atom first. Perhaps that's why they're delaying the 22nm die shrink on Atom (Silvermont): to line it up with stacked memory in 2013. A potential knockout punch for mobile ARM processors.

I am sure too. Stacked LPDDR2/3 would be insanely fast compared to memory on a board. But I think Haswell might get it first for the GPU.

Not sure if you realize, but doing the grunt work by hand is pointless and is all done by software no matter the chip. Do you realize it would take about 38 years just to count to 1.2 billion if you counted one number a second?

That's not design. You can design an IC without having to count, and that's what matters. Machines cannot think outside their programming, yet (Skynet). Only humans can look and see where they can shave something for better performance. So AMD can use automation; it's just better to have a human look it over first.
 
No, it's the same stepping, lower stock clocks. That means it's just a way for AMD to push out the "not quite up to par" parts, much like the Phenom II X2/X3s.

Yep. You can have 4 water pipes but if the lines can only fill 2 at a time, don't expect it to somehow fill all 4.

Intel also uses the L3 as storage for L2 and L1 instructions so it can access them much faster than from memory. L3 obviously has its place, or neither AMD nor Intel would have it. And I honestly trust engineers with master's degrees and years of design experience over some random net dude.

The more I hear about Bulldozer's performance in certain applications, the more I think the caches are the issue. The L1 is 33% slower in Bulldozer than in 10h (4 cycles load to use vs 3), and the L2 is a *lot* slower- a minimum of 12 cycles in 10h and a minimum of 20 cycles in Bulldozer. Bulldozer's shared L1 instruction cache requires two cycles to fetch a cache line, whereas if I read correctly, 10h can fetch a 64-byte-long line in one cycle. (References are the Family 10h/12h and Family 15h software optimization guides from AMD's website.) And that L1I is feeding two separate sets of integer pipes in Bulldozer. That sounds like it could be a pretty big bottleneck to me and why Bulldozer may not scale quite as well as first predicted. Perhaps somebody with a little more experience can weigh in on that one.
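To put those latency numbers in perspective, here's a back-of-envelope average-memory-access-time (AMAT) comparison using the cycle counts above; the hit rates and memory penalty are assumptions for illustration, not measurements:

# Average memory access time: AMAT = L1 + miss_L1 * (L2 + miss_L2 * mem).
# The L1/L2 latencies are the cycle counts quoted above; the hit rates and
# memory penalty are assumed values for illustration only.
def amat(l1_lat, l2_lat, mem_lat=150, l1_hit=0.95, l2_hit=0.90):
    return l1_lat + (1 - l1_hit) * (l2_lat + (1 - l2_hit) * mem_lat)

print(round(amat(l1_lat=3, l2_lat=12), 2))   # Family 10h-ish: ~4.35 cycles
print(round(amat(l1_lat=4, l2_lat=20), 2))   # Bulldozer-ish:  ~5.75 cycles
# Roughly a third higher average load latency under these assumptions.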

Yep. Always have been. That's why 1080p on a 60" looks like crap compared to a 40". Maybe they will get it some day.

Some digital cinema is done in 4K (roughly 4000x3000) and there is Quad Full HDTV (3840x2160). Those would be the next logical candidates for increased TV resolution. However, I wouldn't expect much soon, since most TV material is still standard-def, let alone 1080p.
 
The module design would work pretty well if it was used correctly, yes. You're basically shoving two cores right next to each other and sharing some of the things each core would have (most of them, I assume, are pieces a core doesn't often use fully anyway, only under specific conditions).

However, they share L2 cache, which I feel they really shouldn't. With the amount of L2 there is, if two cores sharing L2 have to work on separate processes, it doesn't end well.

To be completely honest, if done correctly, it would have been seen as an incredible idea. Too many small details were left out, it seems, and AMD is paying for it.
 
The module design is creating performance issues with multiple threads.
So if we use only 1 thread per module (on the first core), then we can avoid L2 cache sharing and should see a huge performance increase.

But BD is still not capable of beating a Deneb/Thuban core at the same clock speed in single-threaded tasks, so why is that?
 
The module design is creating performance issues with multiple threads.
So if we use only 1 thread per module (on the first core), then we can avoid L2 cache sharing and should see a huge performance increase.

But BD is still not capable of beating a Deneb/Thuban core at the same clock speed, so why is that?

Not huge; it's around 10% when the core isn't having to share resources. Xbit did a review with an 8120/8150 vs the lower-end models, artificially disabling one core per module as they went down instead of disabling whole modules the way AMD does, and they found that performance was greater by about 10% vs what AMD had done.
 
I have no idea what that guy meant by automated design software, but there's no way to design a chip without automated software. Even reworking the chip is done in software and optimized using the software. There is no way to design a 1.2-billion-transistor chip by hand. Given that each transistor has 2 modes of operation and around 10 are needed for a basic gate, you would be working with logic on the order of 2^1000000 per section of chip, which is impossible to deal with without software. Logically the chip would be broken down and each part written in some kind of hardware description language, then compiled and tested against the transistor specifications and simulated, then rewritten and compiled until they got something that worked within the specifications of the chip. All of this is done with software and has been done with software since they figured out they could simulate gate logic with software.

What we were discussing is 100% automated layout - no hand-tuning at all. Of course a billion+ transistor VLSI design will have to use automated layout tools during its design, but some studies show that you can increase performance by as much as 20% if you have an experienced engineering team go over it and 'hand-tune' the critical speed paths - those parts that have to operate at the highest speeds almost all the time as the overall performance is heavily weighted or influenced by them.

TBH, I have also seen posts by some presumably experienced design engineers stating that often such hand-tuning results in worse performance, since there is no way a team of humans is going to be able to check all the permutations that a computer tool running on a supercomputer, and using the design experience of thousands of experts, can.

Personally, I am no design engineer, although I did take a VLSI design course in graduate school a long time ago. So both positions make sense to me :). Perhaps somebody who is a design engineer can comment..
 
The 4.2GHz on the FX-4100 isn't a low clock speed though.

But that has half the cores/modules that the 8-core parts have. For the same TDP (or ACP, as AMD refers to average power draw), it follows that the clocks should be significantly higher.

If you look at Sandy Bridge 2600/2700 overclocking, they reach a hurdle around the 4.5/4.6GHz range as well. By 5GHz the power draw doubles.

So the GF process doesn't appear that much worse than Intel's at 32nm.

From Chris Angelini's article on BD's efficiency http://www.tomshardware.com/reviews/bulldozer-efficiency-overclock-undervolt,3083-19.html:

As we've seen, the efficiency scores of Intel's processors begin at 320 and top out at more than 635 points, achieved by Core i5-2500K and Core i7-2600K. Thus, Sandy Bridge and Sandy Bridge-E offer significantly more performance per watt of electrical power than Bulldozer, which is caused primarily by the fact that the Intel processors do not require such extreme voltage increases to realize higher clocks.

During its press briefing, AMD's architects claimed that the decisions they made in putting Bulldozer together were driven by a desire to achieve better efficiency than its aging Stars design. Given the numbers presented today, we're pretty sure something went wrong along the way. At its stock settings, Bulldozer comes close to being competitive, but it's not clear how AMD plans to scale the performance of this design over the next year (before Piledriver is ready) without increasing voltage and actually hurting efficiency in the process.
 
Netburst sold very well but afterward stained Intel's reputation, even now amongst some. It all comes down to the brand. People will buy thinking that all is good if the previous product met their needs or expectations. We see examples of this in our everyday lives in just about everything one can imagine, from cars and shoes to fast food and shopping outlets. Brands are what people remember, so any company that knows how to exploit its brand sells well, and any good salesman can sell sh^t, but never again to the same people as before.


That's certainly all true; however, I have found that those who are inclined to shop that way will get taken to the cleaners on price, and oftentimes on quality as well.
 
What we were discussing is 100% automated layout - no hand-tuning at all. Of course a billion+ transistor VLSI design will have to use automated layout tools during its design, but some studies show that you can increase performance by as much as 20% if you have an experienced engineering team go over it and 'hand-tune' the critical speed paths - those parts that have to operate at the highest speeds almost all the time as the overall performance is heavily weighted or influenced by them.

TBH, I have also seen posts by some presumably experienced design engineers stating that often such hand-tuning results in worse performance, since there is no way a team of humans is going to be able to check all the permutations that a computer tool running on a supercomputer, and using the design experience of thousands of experts, can.

Personally, I am no design engineer, although I did take a VLSI design course in graduate school a long time ago. So both positions make sense to me :). Perhaps somebody who is a design engineer can comment..


Once you hand-tune to get the parameters, I don't see why you can't automate the process using the hand-tuned parameters.
 
I have been following this thread for a few pages while keeping my mouth shut, since I like listening to people more knowledgeable than me. I find the discussion about CPU design fascinating, though I don't understand all of it.

The only point I wanted to add is that it seems like AMD and Intel have switched philosophies. Looking back at the Netburst/P4 days, Intel was going with longer pipelines and focusing on GHz increases, and they had problems with the L1/L2 cache at the time too. Intel was favoring clock speed over IPC with the P4, and if you remember back then, AMD would beat Intel at lower clock speeds. Hence the old Athlon XP naming scheme, with the 3200+ representing that a 2.2GHz Athlon was equal to a 3.2GHz Pentium. Now it is like the roles are reversed, and AMD didn't learn from Intel's mistakes.

It doesn't matter how many GHz a CPU runs at; if a 1GHz CPU is so efficient per clock that it equals a 3GHz CPU, then obviously it is better. The only place GHz helps is in marketing. The common computer user was fooled by Intel back in the day because they would just see that Intel had 3GHz while AMD had 2.2 and think the Intel was better, kind of like horsepower in cars. That is why, even though the Pentium 4 was a failure overall as a design, it sold extremely well. To this day in my computer shop I see so many Pentium 4 computers and very few Athlon XP towers come in for repair. Intel was almost a genius in marketing the Pentium 4 and was smart enough to switch gears with the Core 2 design.

This may be a simplistic viewpoint on my part, since I am nowhere near an expert in CPU design and just barely understand the basics of CPU architecture, but I can appreciate how Intel (BTW, I very happily own a PhII X4 925 Deneb @ 3.4 :) ) was able to position itself over the past 10 years or so. Got to give Intel credit: their design team is matched by their marketing team LOL
 