About
http://i.imgur.com/b33tNAp.jpg
http://forum.beyond3d.com/showthread.php?t=63622
I strongly suspect it's Steamroller, not Excavator... and the die shot seems legit and in line with others: no visible manipulation quirks, none of the "mistakes" that always accompany fakes of complex invented structures (but yes, it could be a good fake).
It has double the L1 size, both data (D$) and instruction (I$).
Along with the module approach of Jaguar, it has double fetch, or dedicated fetch with its own branch prediction, which is now clearly multilevel or multi-engine (as said before elsewhere: one local dedicated branch engine per thread, one global for all).
Execution is augmented with what seems to be 2 more ALUs per core, NOT AGUs (though the latter must have been upgraded to handle 256-bit AVX loads per cycle).
Yes, I suspect, since it has had OoO load/store (since BD), which K10 didn't, that there is no micro-op fusion or packing of ALU+AGU, but that ST has some form of eager execution or run-ahead with speculative address prediction, which doesn't require more AGUs but does require more ALUs for this speculative data execution.
The FlexFPU is also doubled, that is, 2 FlexFPUs, probably (as rumored) with 2 FMACs + 1 MMX each, the 2 FMACs now being fully bridged and able to execute one 256-bit AVX op per cycle without splitting it into halves. The FPU scheduler is most probably "singular" (only one) and "decoupled" from the FP pipes, as it has been since BD; that is, one thread/core has seamless access to both FPUs.
What is a big *LOL* about some analyses and pertinent "expert" opinions on the shortcomings of BD is that Steamy only has 4 decode pipes... as before, and as mentioned in a RWT thread. But as I posted here before, that doesn't mean the decoders are not doubled. In "Vertical Multithreading" it means doubled, dedicated input and output buffers for these "decode stages" or <thread domain>, with the decode engine being SMT (simultaneous multithreading, 2 threads at the same time), whereas in BD/PD the vertical MT for decode, as for the rest of the front-end, was *only* 1 thread at a time.
For this, those same 4 decode pipes must have been substantially revamped... but there are still only 4 pipes nonetheless... and there are ways to mitigate this: not only the double fetch, but also a lot of *new* CAM (content-addressable memory) structures around each core/cluster, which can function as (or be) a "decoded cache" of sorts, meaning that on repetitive loops execution proceeds from there, tremendously alleviating decode requirements.
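A rough sketch of why a "decoded cache" takes pressure off the decoders on loops. This is only an illustration: the LRU policy, the capacity of 32 entries and the 16-instruction loop are assumptions of mine, not anything visible in the die shot.

```python
from collections import OrderedDict


def decode_work(trace, cache_entries):
    """Count how many instructions must pass through the decode pipes.

    trace: instruction addresses in execution order.
    cache_entries: capacity of a hypothetical decoded-op cache (0 disables it).
    """
    cache = OrderedDict()  # address -> True, kept in LRU order
    decoded = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)  # hit: reuse the already decoded op
            continue
        decoded += 1  # miss: the decode pipes have to do the work
        if cache_entries > 0:
            cache[addr] = True
            if len(cache) > cache_entries:
                cache.popitem(last=False)  # evict the least recently used


    return decoded


# Hypothetical 16-instruction loop body executed 100 times.
loop_trace = list(range(16)) * 100
no_cache = decode_work(loop_trace, 0)
with_cache = decode_work(loop_trace, 32)
print(no_cache, with_cache)
```

When the loop fits, only the first iteration is decoded; if the loop is larger than the structure (try `cache_entries=8` here), LRU thrashes and the benefit vanishes, so the sizing of those structures matters a lot.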
Taking the pictures in good faith, the ST "module" is ~10% smaller than PD's, which might mean a process shrink of ~20%, which should then be 28nm... I expect Excavator to be on 22 or 20nm FD-SOI, that is, considerably smaller, and so to have 3 integer cores/clusters per module and another *heterogeneous core* besides the FlexFPUs, namely a dedicated crypto/compression engine (like IBM Z chips).
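The back-of-envelope arithmetic behind that guess can be checked like this. It assumes PD is on 32nm and that area scales ideally with the square of the node name, which real processes rarely do (node names are closer to labels than to measurements); the ~10% figure is just the eyeballed estimate from the pictures.

```python
def ideal_area_scale(old_nm, new_nm):
    """Ideal area ratio when shrinking from old_nm to new_nm (assumption)."""
    return (new_nm / old_nm) ** 2


scale = ideal_area_scale(32, 28)   # same logic, ideally shrunk to 28nm
observed = 0.90                    # ST module at ~90% of the PD module area
logic_growth = observed / scale    # implied growth at an equal node

print(f"ideal 32nm -> 28nm area ratio: {scale:.3f}")
print(f"implied logic growth at equal node: {logic_growth:.3f}")
```

Under those assumptions a straight shrink would give roughly 23% less area, so a module that is only ~10% smaller would be carrying on the order of 18% more "stuff", which fits with the doubled L1, fetch and FPU.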
With all this, the 30% improvement in *single-thread* integer IPC (instructions per clock ≃ 1.1 or 1.2 or 1.3, the norm for x86) that is so hyped now (a truly *obsolete* performance metric, believe me! This discussion is ages old) is perfectly believable.
The BIG SURPRISE, since it was not revealed at the last Hot Chips presentation, is the double FlexFPU engine. If, as rumored, "2 FMAC + 1 MMX" is practically identical performance-wise to the 2 FMAC + 2 MMX of BD/PD, then those 2 FPUs, if the implementation is good, could mean almost double the peak performance of BD/PD (80% on average, I guess), and in some cases even more than 100%...
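For the peak numbers, a quick sketch of the arithmetic, assuming 128-bit FMAC units as in BD/PD, single-precision (32-bit) elements, and counting a fused multiply-add as 2 flops. The 2-FPU configuration is the rumor discussed above, not a confirmed spec, and `peak_flops_per_cycle` is just a helper name I made up.

```python
def peak_flops_per_cycle(fpus, fmacs_per_fpu, fmac_bits=128, elem_bits=32):
    """Peak flops per cycle per module under the stated assumptions."""
    lanes = fmac_bits // elem_bits            # SIMD elements per FMAC
    return fpus * fmacs_per_fpu * lanes * 2   # one FMA = multiply + add


bd_pd = peak_flops_per_cycle(fpus=1, fmacs_per_fpu=2)       # one FlexFPU
st_rumored = peak_flops_per_cycle(fpus=2, fmacs_per_fpu=2)  # rumored two

print(bd_pd, st_rumored, st_rumored / bd_pd)
```

That is the paper peak only; the 80% average guess reflects that the scheduler, load/store bandwidth and the code itself have to keep four FMACs fed.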
... I suspect things like Cinebench, which rely heavily on SSE instructions and MT, could lose much of their "shoving around" as a valuable metric... lol...