This sort of forum worship is usually reserved for the mighty MU ... who has not been around much lately.
I will light a small candle for him.
Sorry guys, I have been really busy recently with work and just got around to reading some of the epic CPU threads. I guess you guys know how that one works: you finally get done with all of the school and training, get a real job, and watch your free time evaporate. At least they just installed unlocked WiFi at work and THG isn't blocked by the nanny filter, so I can read when I am on call and it's not overly busy (it usually is overly busy, though).
😀
MU_Engineer..
he's a server freak and knows AMD Opterons like they were his children...
I am a devoted SMP fan, that's for sure
😀 There's just something about using big-iron parts that interests me, the same way some of you get interested in a super-high-end GPU or a massively overclocked system.
Oh, and back on topic. Bulldozer as a design theory appears to be very solid. AMD knows it isn't going to be able to outrun Intel in per-core performance, and possibly not in total per-CPU performance either. Intel has too many dollars and people in its R&D department working on eking out a percent here and a percent there for AMD to seriously want to play that game. Intel would start an arms race of sorts and bankrupt AMD, since AMD's market cap is a single-digit percentage of Intel's. AMD seems to have been aiming for a space-efficient, dollar-efficient design that could scale well enough from notebook to server to handle the 99+% of the market that doesn't insist on the absolute highest performance at any price (e.g. Intel's top-bin CPUs like the ~$5000 Westmere-EX E7-88xx series).
Increasing throughput is much easier to do by adding threads than by making each set of thread execution hardware stronger. AMD has been adding threads to increase performance for some time, as Thuban has shown. Software keeps getting more and more multithreaded, so this is not an unreasonable approach. AMD is thus looking to make a chip that handles as many threads as possible but doesn't have an unmanufacturably large die. AMD's options were to simply add more Stars-type cores, come up with "slimmer" cores and add yet more of them, or adopt SMT.
AMD apparently did a lot of research and found that they needed to keep discrete integer cores, but that some of the rest of the HW could be shared without a big performance deficit when fully loaded, which is not the case with SMT. So they ended up with a shared frontend, discrete integer cores reminiscent of Bobcat (which was another "good enough performance but cheap to make" chip), shared L2 cache, and a semi-shared FPU. My guess is that the FPU won't be too hard to replace with a bunch of Radeon GPU HW in the near future, as it isn't nearly as directly tied in with the cores as it is in Stars.
That would fit in with AMD's ultimate goal for Fusion: cheap, easy-to-make, space- and dollar-efficient integer core modules paired up with GPU HW acting as a super-strong FPU as well as a GPU.
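To put the sharing in concrete terms, here is a rough Python sketch that just counts the hardware blocks on a chip built from modules laid out as described above. The class, field, and function names are made up for illustration, and the 4-module example is only assumed from the 8-thread desktop parts; this is a bookkeeping toy, not a description of AMD's actual floorplan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """One Bulldozer-style module as described in this post (illustrative only)."""
    integer_cores: int = 2  # discrete, one per thread
    frontends: int = 1      # shared by both threads
    l2_caches: int = 1      # shared by both threads
    fpus: int = 1           # semi-shared between the two threads


def chip_resources(modules: int) -> dict:
    """Count the blocks on a chip made of `modules` identical modules."""
    m = Module()
    return {
        "threads": modules * m.integer_cores,
        "integer_cores": modules * m.integer_cores,
        "frontends": modules * m.frontends,
        "l2_caches": modules * m.l2_caches,
        "fpus": modules * m.fpus,
    }


# Assumed example: a 4-module part.
print(chip_resources(4))
```

The point of the exercise is that thread count grows twice as fast as the count of frontends, L2 caches, and FPUs, which is where the die-area savings are supposed to come from.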
The initial Bulldozer parts failed to meet their performance expectations for non-obvious and still not completely known (or at least not publicly known) reasons. The same thing happened to AMD the last time they made substantial changes to their uarch, with Barcelona. They essentially ran into teething problems with a significantly new design, and I feel this is what happened here. Best guesses as to Bulldozer's lower-than-anticipated performance usually involve the caches, namely that the L1 is too small and the L2/L3 are too slow. I wonder if the FPU scheduler isn't also a bit immature and isn't properly "splitting" the FPU for simultaneous 2x128-bit operations from the two cores in a module. 128-bit non-AVX FPU-heavy tasks run about as fast on 4 threads as they do on 8, whereas you would think they would run considerably faster on 8. The integer schedulers seem to be fine, as Bulldozer does scale well with non-FPU-heavy tasks out to its full 8 threads. So, my guess as to what is going to be in Piledriver is essentially "fixing" the implementation issues in Bulldozer:
1. Increase in L1 size.
2. Better FPU scheduler.
3. Possibly an even beefier FPU, maybe a 2x256-bit unit?
4. L2 and L3 latency reduction.
5. Increased uncore speeds.
6. Other subtle tweaks that increase performance by addressing internally-identified issues that we are not aware of. This could be anything from increased internal cache/core bandwidth to scheduler tweaks.
And then the typical refresh tweaks:
7. Another module. We have heard this before.
8. Support for even faster DDR3 memory, probably DDR3-2133 or maybe even DDR3-2400 in the desktop parts.
9. A third IMC channel for GPU-equipped parts. We have also heard this before.
10. Higher clock speeds at identical thread counts and TDPs due to a maturing 32 nm process.
11. Software tweaks, like the Linux kernel fix for the Barcelona TLB bug (which did NOT need to shut off the L3 like the Windows BIOS fix did) and the Cool 'n Quiet individual-core-scaling tweak between Barcelona and Shanghai.
All in all, I fully expect Bulldozer --> Piledriver to be like Barcelona --> Shanghai. Major redesigns have teething problems, period. Even Intel has them; anybody who has tried to run a Sandy Bridge with its new, majorly-redesigned on-die IGP on Linux can easily testify to that.
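For what it's worth, the FPU scaling observation from earlier in this post (128-bit FPU-heavy work running about as fast on 4 threads as on 8, while integer work scales out to all 8) is what you would get from a very simple toy model. The Python below is only a back-of-the-envelope sketch: the function name and parameters are made up, it assumes perfect scaling wherever hardware is not shared, and it assumes threads get spread across modules before doubling up. It illustrates the guess about the FPU scheduler; it does not measure anything.

```python
def relative_throughput(threads, modules=4, fpu_heavy=False, fpu_splits=True):
    """Toy estimate of throughput relative to a single thread.

    Assumptions (mine, not measured): perfect scaling wherever hardware is
    not shared, threads are spread across modules before doubling up, and
    each module has 2 integer cores plus 1 shared FPU.
    """
    threads = min(threads, 2 * modules)  # no more threads than integer cores
    if fpu_heavy and not fpu_splits:
        # Hypothesis: the shared FPU effectively serves one thread at a time,
        # so FPU-bound work tops out at one thread's worth per module.
        return min(threads, modules)
    # Integer work, or an FPU that really runs 2x128-bit side by side.
    return threads


# Columns: thread count, integer-bound estimate, FPU-bound estimate (no split).
for n in (1, 2, 4, 8):
    print(n,
          relative_throughput(n),
          relative_throughput(n, fpu_heavy=True, fpu_splits=False))
```

If the FPU were splitting properly, the FPU-bound column would track the integer one all the way out to 8 threads, which is the behavior I would expect a better FPU scheduler in Piledriver to get closer to.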