Just for fun, let me try to restate some of the issues with Bulldozer in regard to performance, from most important to least important, based on client/workstation performance rather than server.
Longer pipeline = higher clock speed, which is something AMD failed to deliver
Branch prediction/Prefetcher
L2 Cache
L1 Cache (WTH did they share it?)
2x ALUs and 2x AGUs per core for Bulldozer vs 3x ALUs/AGUs for the Phenom. Plus a peak of 2 instructions per clock per core for BD vs 3 for the Phenom, BUT Bulldozer can handle these operations more efficiently than the Phenom.
CMT only has 80% scaling per core, when the Phenom had 93% scaling (so good!)
The longer pipeline itself, which is not that big of a deal since the branch prediction was supposed to overcome this, but it does hint at what AMD was trying to do, and it does cause some small latency issues.
How Windows handles CMT (based on Windows 8 vs 7 benchmarks, this is only a 5-10% increase in performance, and usually less than 5%)
You can easily see this in the articles below.
This article suggests that the L2/L1 cache is part of the problem.
http://www.extremetech.com/computing/100583-analyzing-bulldozers-scaling-single-thread-performance/2
Then if we take a look at this article, we can see CMT only scales about 80%, so its multithreaded performance is also lower than a TRUE 8 core Phenom's would be.
http://www.legitreviews.com/article/1741/11/
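To put rough numbers on that scaling argument, here is a back-of-the-envelope sketch in Python. It assumes a deliberately crude model (my assumption, not something taken from the articles) where effective throughput is just core count times scaling efficiency, using the 80% and 93% figures quoted above. The function name is made up for illustration, and per-clock differences between the two chips are ignored.

```python
def effective_cores(cores, scaling):
    """Crude model (an assumption): full-core equivalents = cores * scaling."""
    return cores * scaling

bulldozer_8 = effective_cores(8, 0.80)  # 8 CMT cores at 80% scaling
phenom_6 = effective_cores(6, 0.93)     # 6 core Phenom II at 93% scaling
phenom_8 = effective_cores(8, 0.93)     # hypothetical "true" 8 core Phenom

print(f"Bulldozer 8 core: {bulldozer_8:.2f} core equivalents")
print(f"Phenom 6 core:    {phenom_6:.2f} core equivalents")
print(f"Phenom 8 core:    {phenom_8:.2f} core equivalents (hypothetical)")
```

Under this model the 8 core Bulldozer lands at about 6.4 core equivalents, against about 7.4 for a hypothetical 8 core Phenom, and that is before any per-clock difference is counted.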
Then the Bulldozer is only clocked at 3.6GHz, which is no higher than the Phenom II X4 975/980 (3.6/3.7GHz) and only about 9% higher than the 1100T (3.3GHz).
So what did AMD improve in Piledriver?
The clock mesh should improve clock speeds/power consumption. GlobalFoundries most likely improved their 32nm process as well, meaning Piledriver might be made on a newer stepping, which might make Piledriver a little smaller than Bulldozer was. GlobalFoundries should also have more Piledriver processors available at launch than it had Bulldozers at launch, which I hope means AMD will have lower prices.
L1 cache has also been tweaked
The Prefetching and branch prediction have been improved as well
Scheduler has also been improved
This most likely leads to a modest 15-20% boost in performance compared to Bulldozer at stock, with IPC being around 7-10% better
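As a quick sanity check on those numbers, here is a small Python sketch. It assumes performance is simply IPC times clock speed (a simplification, and my assumption), so whatever part of the 15-20% total is not IPC has to come from clock speed. The function name is hypothetical.

```python
def implied_clock_gain(total_gain, ipc_gain):
    """Assumes perf = IPC * clock, so clock ratio = total ratio / IPC ratio."""
    return (1 + total_gain) / (1 + ipc_gain) - 1

# Smallest implied clock gain: low total estimate, high IPC estimate
low = implied_clock_gain(0.15, 0.10)
# Largest implied clock gain: high total estimate, low IPC estimate
high = implied_clock_gain(0.20, 0.07)

print(f"Implied clock speed gain: {low:.1%} to {high:.1%}")
```

So under that assumption, the estimate only holds together if Piledriver also clocks somewhere around 4.5% to 12% higher than Bulldozer at stock.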
What AMD left out, and what we hope they improve with "Steamroller"
Die shrink; even 28nm would be nice
L2 Cache/L1 cache
2x ALUs and 2x AGUs per core for Bulldozer vs 3x ALUs/AGUs for the Phenom. Plus a peak of 2 instructions per clock per core for BD vs 3 for the Phenom
Branch prediction/Prefetcher
CMT only has 80% scaling, so I would like to see a 10 core Steamroller
So this concludes what I personally think is wrong with Bulldozer, some of the improvements they made with Piledriver, and the dreams I have for Steamroller.
Now, what did AMD improve on with Bulldozer vs the Phenom?
Number one is support for newer instruction sets (this is easily a performance boost in some areas). The memory controller is on par with the original i7 series. The FPU is better, since there are only 4 of them in BD vs 6 in the Phenoms and Bulldozer still beats the Phenom FPU in some cases. L3 cache speed/size. Turbo Core. (Scratching my head right now.) I guess higher clock speeds with the FX-4170.
Since the Phenom had higher scaling and Bulldozer had lower performance per cycle, Bulldozer is usually only 10% faster in multithreaded tasks while being 10-15% slower in single-threaded tasks, despite having a 9% higher clock speed with turbo on.
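For what that single-threaded gap means per clock, here is a rough Python sketch. It again assumes performance = IPC times clock speed (my simplification) and plugs in the 10-15% deficit and the 9% clock advantage from above. The function name is made up for illustration.

```python
def per_clock_ratio(perf_ratio, clock_ratio):
    """Assumes perf = IPC * clock, so IPC ratio = perf ratio / clock ratio."""
    return perf_ratio / clock_ratio

clock_ratio = 1.09  # Bulldozer clock relative to Phenom, turbo on

best = per_clock_ratio(0.90, clock_ratio)   # 10% slower single-threaded
worst = per_clock_ratio(0.85, clock_ratio)  # 15% slower single-threaded

print(f"Bulldozer per-clock deficit: {1 - best:.0%} to {1 - worst:.0%}")
```

In other words, if those figures are right, Bulldozer is doing roughly 17% to 22% less work per clock than the Phenom in single-threaded tasks, and the clock speed advantage only hides part of that.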