AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet-to-be-released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post questions or information relevant to the topic, or your post will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
Chad and N00b, let's keep it civil guys, and leave Charlie D out of this discussion unless his site posts a story on the topic.

See my previous post regarding knocking other sites.

Rey, you might want to take a look at this AMDZone thread :

Re: According to Tom's Hardware FX is a failure
by abinstein

They are just an Intel sponsored fan site.
I don't think they even deny it when spreading misinformation across the web.
 
Heh, well the fanbois were insisting that BD's CMT actually delivered 90% of full-core performance - check out the huge 200+ page thread on BD, where supposedly the first core operates at 100% and the second at 80% (not particularly symmetrical, seeing as how both cores are identical...).

A "bulldozer" core is exactly 100% of a real core, it has all the components required to be considered such.

This isn't Intel HT where they just virtualize two register stacks for one processing unit, BD actually put two sets of hardware into each of these "modules". Even the shared FPU is capable of two separate 128 bit operations and thus can act like two separate FPU's.

There are only two truly shared components, the scheduler / prediction unit and the L2 cache unit. BD has more L2 cache per core then SB thus we can assume it's not the size of that cache that is a performance issue.

What we're left with, and seeing how it behaves in various benchmarks, is that there is something going wrong inside either the predictor / scheduler unit or the L2 cache unit, or both. This would result in the two cores not running at capacity and stalling out, which would result in the bad single thread performance as well as the general bad state of it's performance.

Also please realize that SB kinda "cheats" on benchmarks. It's design is such that although it only has 256kb of L2 cache, the large fast L3 cache can be used to supplement that single heavy thread. We're talking 6MB+ of L3 going to a single core, with that much you just copy all the code segments into cache and be done with it. Also Intel's loop detection is able to detect when looped tests are being done and start prefetching much deeper then AMD is capable of. Running "multi threaded" benchmarks is bull ***, all they do is spawn multiple threads but use the same code segments / data pool for all of them. Again Intel's genius engineers have it so that once that happens, all the code / data for those threads is kept in one place, no duplicates are made, thus maximizing cache hit rates.

These are the reasons I want to see real multi-process benchmarks done, not a single benchmark at a time that just copies it's treads. With completely different code segments and data sets you would get a real feel for how a CPU's actual processing components are instead of how good it's loop detection and cache integration are.
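To make the difference concrete, here's a rough sketch in Python (mine, with toy stand-in workloads, not anything from a real benchmark suite) of the two approaches: the usual "multithreaded" run that just spawns four copies of one function over one data pool, versus four separate processes each chewing on a different kind of work.

```python
# Rough sketch (toy workloads, not real benchmarks): contrast the usual
# "multithreaded" benchmark style -- N threads running the same code on the
# same data -- with separate processes each running a *different* workload.
import time
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

DATA = b"x" * 4_000_000  # one shared data pool, like most canned benchmarks use

def hash_work():
    for _ in range(50):
        hashlib.sha256(DATA).digest()

def compress_work():
    for _ in range(20):
        zlib.compress(DATA, 6)

def float_work():
    s = 0.0
    for i in range(1, 2_000_000):
        s += 1.0 / i

def int_work():
    s = 0
    for i in range(2_000_000):
        s += i * i

def run(label, pool_cls, funcs):
    # time how long the batch takes with the given executor type
    t0 = time.perf_counter()
    with pool_cls(max_workers=len(funcs)) as pool:
        futures = [pool.submit(f) for f in funcs]
        for fut in futures:
            fut.result()
    print(f"{label}: {time.perf_counter() - t0:.2f} s")

if __name__ == "__main__":
    # typical "multithreaded" benchmark: four copies of the same thing
    run("4 threads, same code/data  ", ThreadPoolExecutor, [hash_work] * 4)
    # what I'm asking for: four unrelated jobs in four separate processes
    run("4 processes, mixed workloads", ProcessPoolExecutor,
        [hash_work, compress_work, float_work, int_work])
```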
 
And ~most~ of the heat is from the processing units doing the work you're issuing to them.

I'm really laughing that you actually thought the voltage to "drive the memory bus" creates any significant amount of heat. I call upon the resident overclockers, the kings of thermal load management, as evidence that the CPU core clock (aka processor speed) is the primary control over heat load. Take the 970BE, default 3.5 GHz with a dual-channel DDR3-1600 memory bus. I can clock it at 2.0 GHz and it will be significantly cooler, yet still have the exact same 128-bit DDR3-1600 dual-channel memory bus. I can clock it at 4.2 GHz and it will be significantly hotter, too hot for air coolers in general, and still have the exact same 128-bit DDR3-1600 dual-channel memory bus.

Removing the memory bus to put a small amount of memory on top will do nothing for the processing components' heat load, yet it will increase the heat load of the entire chip, resulting in lower max clocks.

I never suggested the memory interface was the majority of heat from a CPU today, but it does contribute. At some point the on-board memory will have to get wider or significantly faster to keep up with the demands of faster/larger CPUs.

Tom's Hardware did some testing of 1.35 V vs 1.5 V memory and found a 5 W difference at peak. That may be insignificant for a 140 W CPU, but what about a 35 W Ivy Bridge? Suddenly that 5 W is more significant. What if you want to double the bandwidth for Haswell yet keep bringing total system power down?
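For a rough sense of scale: dynamic power goes roughly with voltage squared at a given clock, so the 1.5 V to 1.35 V drop by itself should shave close to 20% off that slice of the power budget. Quick back-of-the-envelope (my arithmetic; only the 5 W figure comes from the Tom's test):

```python
# Back-of-the-envelope: dynamic power ~ C * V^2 * f, so at the same clock
# the voltage drop alone cuts that portion of power by roughly 19%.
v_old, v_new = 1.5, 1.35
print(f"P(1.35 V) / P(1.5 V) ~= {(v_new / v_old) ** 2:.2f}")   # ~0.81

# The ~5 W peak delta from the Tom's test, as a share of CPU TDP:
delta_w = 5.0
for tdp in (140.0, 35.0):
    print(f"{delta_w:.0f} W is {delta_w / tdp:.1%} of a {tdp:.0f} W part")
```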

I suggest you look a bit deeper into research in this area. It was brought to my attention by reading papers from Micron/IBM/Intel engineers about 2 years ago. The way computers get more powerful/smaller/cheaper with each generation is by integrating components. There was a time when 2D and 3D video were separate chips. There was a time when L2/L3 cache was off-die. Now we have entire computers on a chip (SoC).

Not really sure what your test case is proving, other than that memory does contribute to the power load of a system. Of course the clock speed also factors into the equation. That's why every generation of memory has to go to lower voltages to accommodate the higher clock speed without also doubling the power consumption.

As IBM mentioned, you could go as low as 0.2 V with on-die memory and FinFET transistors. That would be a significant power saving. Micron mentions a 70% power saving.

http://chipdesignmag.com/lpd/blog/tag/3d-stacking/

 
Valentine's Day: Intel sends gifts to AMD - first desktop Core i3 Ivy Bridge processors leaked.

We know who will win dancing with the APU Trinity A10-5800K:
the new Core i3-3225, 3.3 GHz, Intel HD 4000, 55 W TDP.
 
Found a decent demonstration of why single-threading vs multi-threading (no cheating possible) benches should be run more.

http://www.sweclockers.com/artikel/14650-prestandaanalys-battlefield-3/5#pagehead

BF3, but on large multiplayer maps. It's been shown that games on complex multiplayer maps require significantly more processing power than they do in single-player "timed demo" mode. You can see BD barely keeping up with a 2500K and surpassing both the 1060T / 980.

I really think it's time we stop benchmarking everything with single-player timed demos, at least for competitive PvP games.
 
It's hard to get consistent data in multiplayer because of too many variables.
 
And yet we have evidence that multiplayer and single player behave differently.

If I were a competitive FPS player (I'm not), then I'd be more concerned about how my rig performed in FPS matches than how it runs a looping timed demo.
True, which is why big games should come with a standard multiplayer simulation benchmark to make it more consistent.
 
Well, it's not that hard to record MP plays into scripts, but even like that it would not be the same... Maybe real MP benching sessions with community folks 😛

Cheers!

That's what I'm thinking, or some external simulator.

In any case, it cannot be a script or a pre-recorded timed demo. That's just single-player timed-demo loop benchmarking, which has been shown not to be a reliable predictor of MP performance. The basic idea is to have different code segments and data sets being processed instead of four copies of the same segments / data sets.
 
According to my calculations,
the first cores of every module of an FX-8 BD, when clocked at 4.2 GHz, will perform about equal to a 3.3-3.4 GHz Deneb, considering that the efficiency of using multiple cores is 90% in BD.

That means that after BD's scheduling is corrected, it will perform about equal to a Phenom II 955 in multithreaded tasks when every CPU is at stock.
And if they can use the full potential of BD's first cores in every module, then they can match the performance of a Phenom II 980.
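If I'm reading that arithmetic right (my interpretation; the per-clock ratio and the scaling figure are assumptions, not measurements), it works out to something like this:

```python
# My reading of the arithmetic above; the ratios are assumed, not measured.
bd_clock = 4.2          # GHz, first core of each FX-8 module
ipc_vs_deneb = 0.80     # assumed per-clock throughput of BD vs. Deneb (K10.5)
mt_scaling = 0.90       # assumed multi-core efficiency of BD

deneb_equiv = bd_clock * ipc_vs_deneb
print(f"one BD core ~= {deneb_equiv:.2f} GHz of Deneb")            # ~3.4 GHz

# four first-cores with 90% scaling vs. a stock Phenom II 955 (4 x 3.2 GHz)
print(f"4 modules: ~{4 * deneb_equiv * mt_scaling:.1f} GHz-equivalent")
print(f"Phenom II 955: {4 * 3.2:.1f} GHz-equivalent")
```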
 
First, I am sorry for the long post; I was away for a bit. :)
If Trinity/Piledriver doesn't do well, AMD may just cancel future generations. They've done it before (the QuadFX version of Barcelona for instance, which left all the purchasers high and dry for upgrades).
AMD recently cancelled 28 nm APUs and server CPUs (Sepang and Terramar). Is it true that AMD doesn't change sockets unless a new memory standard is introduced?
Valentine's Day: Intel sends gifts to AMD - first desktop Core i3 Ivy Bridge processors leaked.

We know who will win dancing with the APU Trinity A10-5800K:
the new Core i3-3225, 3.3 GHz, Intel HD 4000, 55 W TDP.
I hope that the Core i3 can at least compete against a 65 W Trinity. I think the GPU performance will be on Trinity's side.
Found a decent demonstration of why single-threading vs multi-threading (no cheating possible) benches should be run more.

http://www.sweclockers.com/artikel/14650-prestandaanalys-battlefield-3/5#pagehead

BF3, but on large multiplayer maps. It's been shown that games on complex multiplayer maps require significantly more processing power than they do in single-player "timed demo" mode. You can see BD barely keeping up with a 2500K and surpassing both the 1060T / 980.

I really think it's time we stop benchmarking everything with single-player timed demos, at least for competitive PvP games.
Do 7-Zip and TrueCrypt benchmarks count as real-world multithreaded tests? I frequently use 7-Zip; sometimes I use Blender, HandBrake... GIMP too.
 
Unless you're archiving multiple GBs of data every day, you're not going to notice 7-Zip.

TrueCrypt is kinda crappy; the lead devs are arseholes. That's why I switched to DiskCryptor. (TC is still better than BitLocker.)

http://diskcryptor.net/wiki/Main_Page/en

It's a branch of TC due to several developers disagreeing with the lead dev over some issues. DC supports AES-NI along with Padlock and will support whatever future scheme AMD derives. TC supports AES-NI and only AES-NI.

As long as you can achieve full-speed encrypted reads / writes, anything past that no longer matters. You can do that on any HDD or HDD array; heck, my VIA Nano gets full-speed disk reads / writes on a 4x 7200 RPM array. It takes an SSD to be limited by your CPU. Plus that's a cheap shot, as Intel has AES-NI while AMD doesn't. Many years ago, when VIA had Padlock, Tom's and other sites absolutely refused to include encryption benchmarks in their comparisons. Intel copies VIA's idea, puts AES-NI on its chips, and suddenly everyone's singing praises. Talk about hypocrisy.
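If anyone wants to sanity-check the "full-speed encrypted I/O" point on their own box, a quick way is to time raw AES throughput in userspace and compare it to the drive's sequential speed. A minimal sketch, assuming the third-party cryptography package is installed (it sits on OpenSSL, which picks up AES-NI or Padlock when the hardware and build support it); the buffer size is arbitrary:

```python
# Rough AES throughput check -- compare the MB/s printed here against your
# disk's sequential read/write speed. Requires the third-party "cryptography"
# package; it uses OpenSSL underneath, which uses AES-NI when available.
import os
import time
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)                  # AES-256 key
nonce = os.urandom(16)                # CTR nonce (one full block)
buf = os.urandom(64 * 1024 * 1024)    # 64 MB test buffer

enc = Cipher(algorithms.AES(key), modes.CTR(nonce),
             backend=default_backend()).encryptor()
t0 = time.perf_counter()
enc.update(buf)
enc.finalize()
elapsed = time.perf_counter() - t0
print(f"AES-256-CTR: {len(buf) / elapsed / 1e6:.0f} MB/s")
```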

All of the tests you outlined utilize the same code segments with the same data sets; all they do is launch more threads. Intel's loop detection and cache management picks up on what you're doing and fits the whole thing into cache, which is why you're getting such large scaling. Run HandBrake WHILE running Blender, two different programs, and things will start to even out quickly.

Anyway, the whole point is to demonstrate that timed demo loops and single applications running alone are very poor indicators of processor capacity scaling. I've said this for over 10 years, long before BD was announced. If you really want to know the scaling capabilities of a chip, run multiple unrelated programs at once.
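A crude way to run that kind of test at home: launch the two unrelated jobs at once and compare the wall-clock time against the sum of their solo runs. Rough sketch only; the command lines below are placeholders, so substitute whatever HandBrake / Blender jobs and input files you actually use:

```python
# Crude multi-process scaling test: time two *unrelated* jobs run together
# vs. the sum of their solo times. The commands are placeholders -- swap in
# the actual HandBrake / Blender invocations and input files you use.
import subprocess
import time

JOBS = [
    ["HandBrakeCLI", "-i", "input.mp4", "-o", "out.mp4"],   # placeholder job 1
    ["blender", "-b", "scene.blend", "-f", "1"],            # placeholder job 2
]

def solo(cmd):
    # run one job by itself and return its wall-clock time
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

def together(cmds):
    # launch all jobs at once and wait for the last one to finish
    t0 = time.perf_counter()
    procs = [subprocess.Popen(c) for c in cmds]
    for p in procs:
        p.wait()
    return time.perf_counter() - t0

if __name__ == "__main__":
    solo_total = sum(solo(c) for c in JOBS)
    both = together(JOBS)
    print(f"solo runs total: {solo_total:.0f} s, concurrent: {both:.0f} s")
    print(f"scaling factor: {solo_total / both:.2f}x")
```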
 
Huh, I didn't realize that. So, for example, running HandBrake and Blender, or maybe 7-Zip and HandBrake, would be more in line with a proper multithreaded test. I have never tried that; I usually run one at a time. Your idea is good, I am certainly gonna test that.
AMD included AES support with the Zambezi CPUs, and they perform quite well in TrueCrypt benches; they outperform Core i5 CPUs in TrueCrypt benchmarks. AFAIK, the VIA Nano also has AES support.
 
As I don't have a BD CPU, I didn't know whether AMD licensed AES-NI from Intel or created their own format. If they licensed AES-NI then it should work the same between the two; if they created their own standard (like VIA Padlock) then programs would need to be written specifically for it. On TC's website all they do is talk about Intel CPU performance; no mention was given to AMD / BD, thus I assumed (yeah, we all know what happens then) that they didn't natively support it. What I do know is that many years ago, when I asked the TC devs to support VIA Padlock, I was flatly told "f*ck no", with the reason being that current CPUs were fast enough to support full-disk encryption without needing HW support and that it wasn't a big enough performance boost for them to bother with. Later, when Intel made AES-NI, they were jumping for joy over how awesome they said it was. The fact that I had been doing accelerated encryption for years with Padlock (network packets for an OpenVPN network and some volume encryption) didn't even faze them.

Funny part is, everyone's talking about accelerated HDD encryption when the real benefit of HW-accelerated encryption is in small devices for routing / NAS / application support. I had a VIA C3 (originally) @ 1 GHz managing an OpenVPN circuit of 5+ simultaneous LAN-to-LAN connections with 256-bit encryption at full speed. These aren't host-to-LAN setups where you have one device; these were full networks connected to each other over the internet and accessing resources across the VPNs. It was a big project between coworkers of mine; we wanted to see how big a private, decentralized L3 VPN we could create. We got a total of five hosts, with three in Korea, one in Japan and one in the USA, all connected to each other, playing LAN-hosted video games and chatting on Vent / TeamSpeak. SOHO routers can't possibly keep up with that kind of encryption, so we had to build our own boxes: Active Directory DNS with multiple domains and DCs in different geographic regions. It was an incredibly fun project.
 
Found a decent demonstration of why single-threading vs multi-threading (no cheating possible) benches should be run more.

http://www.sweclockers.com/artikel/14650-prestandaanalys-battlefield-3/5#pagehead

BF3, but on large multiplayer maps. It's been shown that games on complex multiplayer maps require significantly more processing power than they do in single-player "timed demo" mode. You can see BD barely keeping up with a 2500K and surpassing both the 1060T / 980.

I really think it's time we stop benchmarking everything with single-player timed demos, at least for competitive PvP games.
I have been saying this for a while (that BF3 runs fine on the i3 2100, but good luck getting onto a 64-player map with it), mostly being ignored. But I have to agree with you: games need to be tested in ways that show anomalies such as this. I'd rather know whether, when I get into a multiplayer game, it is going to blow chunks once I'm in a group of 8-32 people doing different things at the same time.

There is another example I found recently.

http://images.anandtech.com/graphs/graph4955/41703.png


Aside from the game being optimized for the Intel platform, there is the oddity of the 2600K: the faster chip, with HT, yet slower here. The conclusion was that HT slows it down ... well, yes and no ... the game is, for lack of a better term, piss-poor programming.

http://www.bit-tech.net/hardware/cpus/2010/08/18/how-many-cpu-cores-does-starcraft-2-use/2

It seems to maximize FPS at 3 cores; OK, easy enough.

3-cores.jpg


82% CPU usage: not bad, a fair split between the other 2 cores.

4-cores.jpg


92% CPU usage ... WTF?! It went up ...

6-cores.jpg


97% CPU usage ..... piss-poor programming.

At 8 threads, obviously from the hit the 2600K took, the game thread is pushed over 100% CPU usage. Way to go, Blizzard: learn to cap the game at 3 or 4 cores, since it strangles itself when attempting to spread across cores.

What I want to see more of is what TechSpot has been doing with CPU usage charts recently. A "CPU bottleneck" may not be an actual CPU bottleneck; Metro 2033 on my system caps at 60% CPU usage on 2 cores when running the built-in bench.
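For anyone who wants to collect that kind of usage data themselves, a minimal sketch using the third-party psutil package: start it, run the game or the built-in bench, and Ctrl+C when done; it prints per-core load once a second.

```python
# Minimal per-core CPU usage logger (third-party psutil package required).
# Start it, then run the game / built-in benchmark, and Ctrl+C to stop.
import time
import psutil

try:
    while True:
        per_core = psutil.cpu_percent(interval=1.0, percpu=True)
        total = sum(per_core) / len(per_core)
        cores = " ".join(f"{c:5.1f}" for c in per_core)
        print(f"{time.strftime('%H:%M:%S')}  total {total:5.1f}%  cores: {cores}")
except KeyboardInterrupt:
    pass
```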

Last thing I will add:
You are never testing one CPU against another when using different motherboards; you're testing the entire platform. I know I will probably get flamed for this, but if the system supports faster memory, use it. Don't drop to the lowest supported speed just because the competitor can't make it work. Most of these games are faster because Intel put their PCI-E controller on the chip, and if you recall, game FPS went up considerably at the same clock speed.

http://www.anandtech.com/show/2839/7

Until AMD gets there, they are using memory speeds to their advantage, but testing is not being done that way.
 
I said this, well, supported this, way back near the beginning of the thread. AMD by far has more features and add-ons on their platform than Intel. Phenom II was great because CPU performance kept up with SB and it had a boatload of other features that made you want to buy it: more PCIe lanes, faster memory, etc.
If you're buying a computer or building one, you need to realize that the CPU is not only the brain and main component of your computer; it determines every other part you're going to buy in one way or another.
BD is just so far behind in performance that it is hard to recommend.
Trinity, however unlikely this is to happen, could fall somewhere near SB or IB in CPU performance. (IB is not going to be that much better, whatever the case.) If it falls within 10% of SB or IB then I may buy it over the Intel platform.
This looks awfully fanboyish of me, and it does show off my bias towards AMD. I want to see them succeed, and if you don't, you're a fool.
 
I read that Sunday. So much disrespect to Don Woligroski, with no actual facts to back it up.

Shameful. :pfff:

I for one am really pleased we are not stooping so low as to badmouth them in our forums ... which says a lot about our users here having a more balanced view of things.

I read the thread but declined to post over there.

I note MU and a couple of other users who are also here didn't get into it.

While I disagree that the FX was a total fizzer (no disrespect to Don), you have to agree it has not soundly outperformed its predecessor ... in a number of cases it even fails to match it.

Essentially the cache latency and prefetch logic is poor / broken.

Name an Intel CPU that failed to outperform the previous generation ... I can think of the Pentium IV, and in terms of steppings the Prescott was also poor.

Did we bag those two ... yes we did.

There are some redeeming qualities to AMD's modular approach, but at present they are not being felt ... perhaps the next stepping might see something?

Anyway, I just wanted to give you guys a pat on the back for not buying into their dirt tactics.

Toms Hardware Rank = 1035 http://www.alexa.com/siteinfo/tomshardware.com

AMDZone Rank = 284,085 http://www.alexa.com/siteinfo/amdzone.com

Hardly worth discussing further eh?

😉
 
Heh, don't worry about that, reynod.

I'd rather, a thousand times over, have Intel/AMD fanbois bashing the other side in the same place than drink the Intel/AMD fanboi kool-aid about how every product they make comes out great and turns to gold like magic.

----

And regarding the benchies... Well, I just realized that Intel sucks at "real" threading, using my new i5 workstation at work. Weird things happen (halts and freezes) when I jump to 100% usage, and on my AMD machine they don't happen under the same conditions, even more saturated. That's an interesting point to take a little deeper look at. If it's intended, well, it shouldn't be. If it ain't, well, bloody fix it, Intel!

Cheers!
 
I read that Sunday. So much disrespect to Don Woligroski, with no actual facts to back it up.

Shameful. :pfff:

Exactly. Don (Cleeve, I think, was his AMDZone username) also called them out for exactly what they are: pretending to hold everybody to their own "high standards" of proof and then insisting that the first (and only) benchmarks to agree with their own preconceived conclusions are correct, despite dozens of other reviews showing the opposite.

Don was right - they are a tiny, insignificant, uninfluential fanboi club.
 
A "bulldozer" core is exactly 100% of a real core, it has all the components required to be considered such.

This isn't Intel HT where they just virtualize two register stacks for one processing unit, BD actually put two sets of hardware into each of these "modules". Even the shared FPU is capable of two separate 128 bit operations and thus can act like two separate FPU's.

There are only two truly shared components, the scheduler / prediction unit and the L2 cache unit. BD has more L2 cache per core then SB thus we can assume it's not the size of that cache that is a performance issue.

What we're left with, and seeing how it behaves in various benchmarks, is that there is something going wrong inside either the predictor / scheduler unit or the L2 cache unit, or both. This would result in the two cores not running at capacity and stalling out, which would result in the bad single thread performance as well as the general bad state of it's performance.

Also please realize that SB kinda "cheats" on benchmarks. It's design is such that although it only has 256kb of L2 cache, the large fast L3 cache can be used to supplement that single heavy thread. We're talking 6MB+ of L3 going to a single core, with that much you just copy all the code segments into cache and be done with it. Also Intel's loop detection is able to detect when looped tests are being done and start prefetching much deeper then AMD is capable of. Running "multi threaded" benchmarks is bull ***, all they do is spawn multiple threads but use the same code segments / data pool for all of them. Again Intel's genius engineers have it so that once that happens, all the code / data for those threads is kept in one place, no duplicates are made, thus maximizing cache hit rates.

These are the reasons I want to see real multi-process benchmarks done, not a single benchmark at a time that just copies it's treads. With completely different code segments and data sets you would get a real feel for how a CPU's actual processing components are instead of how good it's loop detection and cache integration are.

There has already been some rather extensive analysis done on where BD fell short - I'm too busy to look up the links again but they were mentioned in numerous threads here previously. IIRC one review site had an analysis entitled "death by a thousand paper cuts" or something similar.

The shared front-end means that, effectively, a BD "core" has just two integer pipelines available when both cores of the module are in use. In contrast, K10.5 had 3 pipes per core available, and Core 2 and later iterations have 4 pipes.

There are many other 'paper cuts' to BD's performance that ultimately cost significant performance compared to what we were led to believe a year ago.
 
Yuka, out of curiosity, what are the specs of your i5 workstation at work?
 