AMD CPU speculation... and expert conjecture



I agree on this; recursive languages are "better" at being inherently parallel, but let's face it, they are VERY niche. You'd basically have to throw out every mainstream programming language if you want parallelism handled at the code level.

In theory, a VERY tightly coupled compiler could handle threading automatically during compilation, similar to how OpenMP decides how to thread based on the results of compilation (will the thread be blocked? etc.). But again, you need a compiler VERY tightly coupled with the OS, which limits you to first-party compilers. Still, given how 90% of everything is built with MSVC, this might eventually happen...
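For reference, OpenMP today sits at roughly that level, except the programmer still has to mark the loops. A minimal sketch, with made-up data:

[code]
// Minimal OpenMP sketch: the compiler/runtime split the marked loop across cores.
// Build with /openmp (MSVC) or -fopenmp (GCC/Clang). The array here is made up.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   // each core gets a chunk of i
    for (long i = 0; i < static_cast<long>(data.size()); ++i)
        sum += data[i] * data[i];

    std::printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
}
[/code]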

The only thing about HSA is that the OS needs to provide some mechanism for the code to be scheduled onto a target device. HSA is not x86 and won't be treated like it; the OS needs to be aware of the new opcodes and how to schedule them. This is similar to how the x87/SIMD FPU is a completely separate co-processor from the OS's point of view, register stack and all.

It's possible AMD could handle it internally, but that does add latency. Ideally, you could schedule to any device you wished, but OSes don't have that level of visibility yet. As GPUs get more powerful, though, I expect this will gradually change.
 


It's actually worse than that: I've seen enough signaled mutexes over the years suck performance out of applications, because one thread is constantly going "Are you done yet? Are you done yet? Are you done yet?", waiting for a condition that cannot possibly become true while that thread is running!
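In code, the difference between that polling pattern and actually blocking looks roughly like this (a minimal sketch; the flag and timings are made up):

[code]
// Polling a shared flag vs. blocking on a condition variable.
#include <condition_variable>
#include <mutex>
#include <thread>
#include <chrono>

std::mutex m;
std::condition_variable cv;
bool done = false;

void poller() {                           // "Are you done yet?" -- burns a core
    for (;;) {
        std::lock_guard<std::mutex> lk(m);
        if (done) break;                  // keeps asking until the producer runs
    }
}

void waiter() {                           // sleeps until actually signaled
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return done; });
}

void producer() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));   // the "work"
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
}

int main() {
    std::thread a(poller), b(waiter), c(producer);
    a.join(); b.join(); c.join();
}
[/code]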

Then you run into priority inversion (roughly sketched in the code below):
Low-priority thread "A" locks a resource.
High-priority thread "B" needs the resource and must wait for thread "A" to finish.
Medium-priority thread "C" is ready to run and preempts thread "A". Thread "B" is stuck waiting for thread "A" to finish, which will not happen because "A" keeps getting bumped by thread "C".
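The skeleton of that scenario (setting real scheduler priorities is platform-specific, e.g. pthread_setschedparam or SetThreadPriority, and is left out here; actually provoking the hang also needs everything pinned to one core, so treat this purely as an illustration):

[code]
// Three threads, one lock: the structure of the priority-inversion scenario.
// Priorities are only indicated in comments; see the note above.
#include <mutex>
#include <thread>
#include <chrono>
#include <cstdio>

std::mutex resource;

int main() {
    std::thread low([] {                  // thread "A" (LOW priority)
        std::lock_guard<std::mutex> lk(resource);
        std::this_thread::sleep_for(std::chrono::seconds(1));  // holds the lock
    });
    std::thread high([] {                 // thread "B" (HIGH priority)
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // let "A" lock first
        std::lock_guard<std::mutex> lk(resource);   // blocks behind "A"
        std::puts("B finally got the resource");
    });
    std::thread medium([] {               // thread "C" (MEDIUM priority)
        // CPU-bound busywork: on one core with strict priorities this starves
        // "A", so "B" sits on a lock that "A" never gets to release.
        for (volatile long i = 0; i < 500'000'000L; ++i) {}
    });
    low.join(); high.join(); medium.join();
}
[/code]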

Now start sharing resources across multiple threads. You. Are. Screwed. In theory, you can copy everything across thread boundaries and work on local data in some instances, but your memory requirements go through the roof. [Maybe that explains why Frostbite 3 is rumored to require a 64-bit OS? I suspect a giant memory hog of an engine...]

FYI, this is basically how Intel's Transactional Memory scheme works: do the execution without locks, and if the memory contents haven't changed by the time you finish, it's OK to commit the results. If someone else has modified the resource, however, you have to junk the results, take a traditional lock, and start over. Performance tanks in the worst case, so in an industry so focused on benchmarks, I really don't see this gaining much of a foothold.
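A rough sketch of that pattern using Intel's RTM intrinsics (needs a TSX-capable CPU and -mrtm; the counter and the spinlock fallback are just for illustration, not anyone's real code):

[code]
// Try the update transactionally; on abort, fall back to a conventional lock.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_locked{false};
long shared_counter = 0;

void increment() {
    unsigned status = _xbegin();                 // start transactional region
    if (status == _XBEGIN_STARTED) {
        // Reading the fallback lock puts it in our read set, so anyone taking
        // the slow path aborts us instead of racing with us.
        if (fallback_locked.load(std::memory_order_relaxed)) _xabort(0xff);
        ++shared_counter;                        // speculative, no lock held
        _xend();                                 // commits if nothing conflicted
    } else {
        // Conflict/capacity abort: junk the speculation, take the real lock,
        // and redo the work. This is the slow path where performance tanks.
        while (fallback_locked.exchange(true, std::memory_order_acquire)) {}
        ++shared_counter;
        fallback_locked.store(false, std::memory_order_release);
    }
}
[/code]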

The program could be using more and more cores, but to make all the threads play well with each other, the synchronisation needed is enough to give poor scaling.

It's actually worse than that: there is a point where simply adding more cores reduces performance, simply due to the amount of time the OS spends doing context switches. The number most cited in the studies I've seen is that about 20 cores is the magic number, though the marginal gain shrinks from the moment you add a second core, and keeps declining with every core after that. [This is why you don't see anything close to perfect performance scaling even when the CPUs are doing the maximum amount of work.]

As said, in games the rendering and physics are the truly parallel workloads, and those should be done on the GPU anyway.

Which they are. Finally. The stuff that's left is easy. Managing a couple dozen AI objects? Simple. UI? Simple. 3D audio? Simple. Granted, it all adds up, but in a multicore system? 50% usage on one core is a trivial use of resources.

Though the newer game engines like UE4 and others should be more parallelised, because they were written from scratch. But then again, since they were started 4-5 years back, tech has changed a lot since then.

The big change I expect they will have is support for multithreaded rendering (DX11 deferred contexts), which breaks the main render thread up into smaller threads. That should help scalability at the expense of some per-thread performance. Better physics support should also increase CPU load somewhat, though that won't be due to any changes in making code more parallel, just the result of doing more work. [This can be confirmed by using something akin to GPUView to see which cores non-trivial threads are being assigned to.]
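For anyone curious, the DX11 mechanism looks roughly like this: worker threads record into deferred contexts and the main thread replays the resulting command lists on the immediate context. Error handling and the actual draw calls are omitted; this is only a sketch of the API flow:

[code]
// DX11 multithreaded rendering skeleton (Windows, links d3d11.lib).
#include <d3d11.h>
#include <thread>
#include <vector>
#pragma comment(lib, "d3d11.lib")

int main() {
    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* immediate = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &immediate);

    const int kWorkers = 4;                       // one recording thread per core
    std::vector<ID3D11CommandList*> lists(kWorkers, nullptr);
    std::vector<std::thread> workers;
    for (int i = 0; i < kWorkers; ++i) {
        workers.emplace_back([&, i] {
            ID3D11DeviceContext* deferred = nullptr;
            device->CreateDeferredContext(0, &deferred);
            // ... record this thread's share of the draw calls here ...
            deferred->FinishCommandList(FALSE, &lists[i]);
            deferred->Release();
        });
    }
    for (auto& t : workers) t.join();
    for (auto* cl : lists) {                      // replay in order on the main thread
        immediate->ExecuteCommandList(cl, FALSE);
        cl->Release();
    }
    immediate->Release();
    device->Release();
}
[/code]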
 

 

mayankleoboy1

Distinguished


This begs the question: what is the use of scalability if there is no perf improvement? Using moar cores just for the sake of using moar cores?

Also, I heard Chrome developers (who are presumably brilliant) saying that they would rather try to extract more performance from a single core (possibly with SSE2/3/4 code) than go SMP.
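For context, "SSE code" just means doing several operations per instruction on that single core; a minimal sketch (the function and sizes are made up, not anything from Chrome):

[code]
// Sum an array 4 floats at a time with SSE intrinsics.
#include <immintrin.h>
#include <cstddef>

float sum(const float* data, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));   // 4 adds per instruction
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) total += data[i];                 // scalar tail
    return total;
}
[/code]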
 


Each core is doing less overall work, lowering power draw and heat. Also, Task Manager looks pretty.

The easiest example is Crysis 3. Yes, it scales a LOT better on an 8350 than on a 3770K. Oh, by the way, the FPS numbers are identical. Why? Because neither CPU is bottlenecked.

EDIT

Now, take a quad-core Ivy Bridge at, say, 2.4 GHz and an octo-core Piledriver at 3 GHz, compare results, and things would likely get very interesting. I expect the PD would fare better (the 2.4 GHz IB would probably bottleneck), but it would still be slower than the 3770K at stock due to poor per-core performance. [If anyone wants to underclock to test this theory, by all means.]

In short, I suspect the old rule still holds true: a fast quad will likely beat a slow octo in most cases. But since review sites never bother to address the subject of scaling in any of their CPU reviews, this is little more than a guess based on observation of the data available. Until you get to the point where doubling the number of cores while cutting the clock speed by 33% or more gives the same performance, you can NOT make the argument that programs are scaling well.
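The textbook way to put a number on that is Amdahl's law; the fraction p and the core counts below are made-up figures, just to show the shape of the curve:

[code]
% Amdahl's law: speedup on n cores when a fraction p of the work is parallel
S(n) = \frac{1}{(1 - p) + p/n}

% e.g. p = 0.8:  S(4) = 1/(0.2 + 0.2) = 2.5x,  S(8) = 1/(0.2 + 0.1) = 3.33x
% Doubling the cores from 4 to 8 only buys ~33% more throughput, so an octo
% at 2/3 the clock lands roughly where the quad started.
[/code]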

Also, I heard Chrome developers (who are presumably brilliant) saying that they would rather try to extract more performance from a single core (possibly with SSE2/3/4 code) than go SMP.

Ironically, browsers are good threading targets. Ideally, each tab would be its own thread, since every tab is fully independent of the others [sharing of computer resources aside]. Then again, these are the developers who think spawning multiple PROCESSES to gain access to more RAM is a good idea. As I write this, with a total of three tabs open, I have FIVE chrome.exe processes running. That's just poor design, period.
 

8350rocks

Distinguished


Mark Cerny from Sony, in an interview with Gamasutra, said the ARM CPU was to run the OS, and any additional functions that were to be offloaded would be run on 1 CU in the GPU (since there are 64 CUs in the GPU, he said they weren't concerned with potential performance loss there).

No CPU cores will be tied up in the OS. He was specifically adamant about that...game developers will have 8 cores to work with.
 
Mark Cerny from Sony, in an interview with Gamasutra, said the ARM CPU was to run the OS, and any additional functions that were to be offloaded would be run on 1 CU in the GPU (since there are 64 CUs in the GPU, he said they weren't concerned with potential performance loss there).

No CPU cores will be tied up in the OS. He was specifically adamant about that...game developers will have 8 cores to work with.

And since then, the ARM CPU has seemingly vanished from the spec sheet. We'll see, but until it shows up in the HW...



Harder and less secure. IPC is very expensive and hard to manage, and isn't nearly as secure as keeping all the data within a single address space. The reason this has been done is to get around the 2GB memory barrier for 32-bit applications; each process gets its own 2GB address space, no matter how many threads that process creates.

All I know is that modern web browsers (Mozilla especially) are freakishly complex.

They shouldn't be; it's data input/output/display. But then we added hundreds of middleware APIs on top of the internet, and things grew out of control.
 

8350rocks

Distinguished


If they scrap the ARM core, then the gist of what I got from the interview would indicate they would just tie up 1 or 2 more CUs on the GPU... (leaving developers with 61 CUs instead of 64). He told Gamasutra flatly that under no circumstances would CPU cores be tied up with the OS; they wanted to make certain it would be that way.

I suppose I could see them running it from CUs on the GPU...it wouldn't be difficult to do with a simple console OS. Though, I would find it strange that they would drop the ARM core after they announced it would be in there for this purpose.

 
On the ARM debate for the PS4 and the OS. I really wonder if they can do it... Can 2 ISAs live in the same Kernel code? o_O

In theory it sounds very good, since the PS3 OS is very Android-like: light and fast. You don't need much horsepower to run a GUI with whistles and eye candy, but running it on a CPU with a different ISA? That's weird.

Cheers!
 


Sure, if you double the size of the Kernel. Triple, if you count the GPU.

Hence my skepticism. There's a reason no one has ever bothered with a multiple-ISA architecture.
 

Were you referring to this? http://www.gamasutra.com/view/feature/191007/
I [strike]read[/strike] skimmed through it just now, and there's no mention of an ARM chip running the OS. Actually, the intro section says that 'OS and UI related discussion were off the table'. Cerny says nothing about any ARM chip, let alone it running the OS. He does talk about a 'custom chip' (no mention of what kind) for background tasks when the main system is in standby mode (a lot like Connected Standby tasks). In the comments section, some other guy mentioned that the ARM chip would perform 'download assist', which is in line with earlier rumors of an ARM chip handling minor tasks, not the main OS, and not while the console is actively running. Nothing about PS4 games using all 8 CPU cores either.

 

montosaurous

Honorable
Truth is, we don't know what new PC games will use, and information on console games is still quite limited. PC games could be heavily threaded, or they could still be best played on quad cores with high IPC. Right now, in most games there isn't an extremely huge difference between i5s and FX 8xxx chips. Sure, they may still favor fewer cores and higher IPC, but Piledriver isn't light years behind Ivy Bridge. It's just personal preference and price for now.
 

8350rocks

Distinguished


"The reason so many sources of compute work are needed is that it isn’t just game systems that will be using compute -- middleware will have a need for compute as well. And the middleware requests for work on the GPU will need to be properly blended with game requests, and then finally properly prioritized relative to the graphics on a moment-by-moment basis."

-Mark Cerny

Reread it just now; he's talking about middleware there, which is not technically the OS... but they're going to run all the middleware on the CUs in the GPU... so the next logical leap is that if you're running all the things that help the OS and the program along on the CUs in the GPU, you could also run the OS on the GPU. The tone of his comments about not limiting the hardware leads me to believe this.

Additionally... he mentioned "background" processes in the comment about the ARM chip, in addition to the download feature. I may be reading too deeply into that, but it leads me to believe it will be doing quite a bit more than just downloads. Else why have it, if not to take load off the hardware...?

He is dancing all around outright saying it, but he alludes to the fact that the CUs in the GPU will be used for a lot of things... though without going into detail as to what exactly the CUs and the ARM chip will be doing...

Here's discussion about the translation of an interview with Mark Cerny from a Japanese media source:

http://www.neogaf.com/forum/showthread.php?t=532077

There he discusses a "secondary OS cpu"

 

hcl123

Honorable


NO pun intended, and no intention to insult anyone, but the real "boyism" part of the story is the generalized stupidity, to the point of aggressive passion, about pieces of software used for tests that are NOT even representative of the applications they are derived from (mostly ICC and tweaks)... meaning a CPU can be more than 80% faster, yet at the discretion of those "test" software coders show 20% less performance than before (quite possible). As an example, the same kind of multimedia application, but on OpenCL/compute on the same exact hardware, can show not 80% or 100% improvement but 800% to 1000% or more (no, not a typo: above 10x improvement).

There is a good solution for this: sign the benchmarks with MD5 signatures, ensuring they don't change over time. Otherwise people never know what they are bickering about, the software or the hardware; if both the hardware and the software change, you can't tell which change caused what.
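Something like this would do it (a minimal sketch using OpenSSL's MD5, linked with -lcrypto; the file name is just an example):

[code]
// Print an MD5 fingerprint of a benchmark binary so reviewers can prove
// they all ran the exact same build.
#include <openssl/md5.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream f("benchmark.exe", std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(f)),
                                     std::istreambuf_iterator<char>());
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(bytes.data(), bytes.size(), digest);
    for (unsigned char b : digest) std::printf("%02x", b);
    std::printf("  benchmark.exe\n");
}
[/code]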

Yet some elaborate, passionate conclusions and review bait, all numbers and charts, seem to be stated with total certainty... of their immense ignorance (I say, lol)...

Me, as a "fanboy" of performance, I would be bickering at software developers, not hardware.

BTW, that blog post must be fake. Did they have access to an "FX" ES?... when not even Kaveri ESs are circulating yet? They are only applying the rumored 30% gain to previous results... isn't that the logical explanation?

 

Cazalan

Distinguished


There is always going to be a trade-off between more cores and faster cores, a balance of power and efficiency, simply due to the power equation: dynamic power scales roughly with voltage squared times frequency, and higher frequency usually demands higher voltage. At some point the power starts running away and is no longer manageable without extreme and expensive measures (LN2 cooling). This could change overnight if a room-temperature superconductor were discovered, but that's probably decades or centuries away.
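For reference, roughly (the 10% voltage bump below is just an illustrative assumption):

[code]
% Dynamic CPU power
P_{dyn} \approx C \cdot V^2 \cdot f

% e.g. +30% frequency typically needs ~+10% voltage:
% 1.3 \times 1.1^2 \approx 1.57, i.e. roughly 57% more power for 30% more clock
[/code]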

In order for computers to evolve much further, they'll need to incorporate AI in hardware. You can see this now with modern FPGAs, which have both hard and soft processing cores. The soft cores can be reconfigured, but it is not entirely seamless, and the hard cores are very weak compared to an i7. That's not just a hardware limitation, though, as the tools for automating it are incredibly complex. Intel has started to fab FPGAs for some customers, and at some point they could offer an x86 FPGA.

In the meantime, we'll continue to brute-force the compute jobs with a mix of parallel and serial cores. Manufacturing capabilities already far exceed our knowledge of how to architect the billions of transistors at our fingertips. In a few years, when 450mm wafer tech is available, there will be another explosion of cheap transistors.
 

8350rocks

Distinguished


On the contrary... many developers see what AMD is doing, especially game developers. I know I sure see the efficiency in hUMA and HSA, and all the guys I work with are excited about it.

It doesn't at all surprise me that Sony asked developers what they wanted, and they all said unified memory.
 

hcl123

Honorable
About this:

http://i.imgur.com/b33tNAp.jpg

http://forum.beyond3d.com/showthread.php?t=63622

I strongly suspect it's Steamroller, not Excavator... and the die shot seems legit, in line with others: no visible manipulation quirks, none of the "mistakes" that always accompany fakes of complex structures (but yes, it could be a good fake).

It has double the L1 size, both D$ and I$.

Accompanying the module approach, as in Jaguar, it has double fetch, or dedicated fetch with its own branch prediction, which is now clearly multilevel or multi-engine (as seen elsewhere before: one local dedicated branch engine per thread, one global for all).

The execution stage is augmented with what seem to be 2 more ALUs per core, NOT AGUs (though the AGUs must have been upgraded to handle 256-bit AVX loads per cycle).

Yes, I suspect, since it has had OoO load/store since BD, which K10 didn't, that there isn't micro-op fusion or packing of ALU+AGU, but that ST has some form of eager execution or run-ahead with speculative address prediction, which doesn't require more AGUs but does require more ALUs for that data-speculation execution.

The FlexFPU is also doubled, that is, 2 FlexFPUs, probably (as rumored) with 2 FMACs + 1 MMX each, the 2 FMACs now fully bridged and able to execute one 256-bit AVX op per cycle without splitting it into halves. The FPU scheduler is most probably singular (only one) and decoupled from the FP pipes, as it has been since BD; that is, each thread/core has seamless access to both FPUs.

The big *LOL* about some analyses and pertinent "expert" opinions on the shortcomings of BD is that Steamroller only has 4 decode pipes... as before, and as noted in an RWT thread. But as I posted here before, that doesn't mean the decoders aren't doubled. With "vertical multithreading" it means double the dedicated input and output buffers for the decode stages (the <thread domain>), with the decode engine being SMT (simultaneous multithreading: 2 threads at the same time), whereas in BD/PD that vertical MT for decode, like the rest of the front end, handled *only* 1 thread at a time.

For this, those same 4 decode pipes must have been substantially revamped... but they are still the same 4 pipes nonetheless... and there are ways to mitigate that: not only the double fetch, but also a lot of *new* CAM (content-addressable memory) structures around each core/cluster, which can function like a "decoded cache" of sorts, meaning that on repetitive loops execution proceeds from there, tremendously alleviating the decode requirements.

Taking the pictures in good faith, the ST "module" is ~10% smaller than PD, which might mean a process shrink of ~20%, which should then be 28nm... I expect Excavator to be on 22 or 20nm FD-SOI, that is, considerably smaller, and so to have 3 integer cores/clusters per module plus another *heterogeneous core* besides the FlexFPUs, namely a dedicated crypto/compression engine (like IBM's Z chips).

With all this, the 30% improvement in *single-thread* integer IPC (instructions per clock ≃ 1.1, 1.2, or 1.3, the norm for x86) performance that is so hyped now (a truly *obsolete* metric for performance, believe me; this discussion is ages old) is perfectly believable.

The BIG SURPRISE, since it was not hinted at in the last Hot Chips presentation, is the double FlexFPU engine. As rumored, if "2 FMAC + 1 MMX" is practically identical, performance-wise, to the 2 FMAC + 2 MMX of BD/PD, then those 2 FPUs, if the implementation is good, could mean almost double the peak performance of BD/PD (80% on average, I guess), and in some cases even more than 100%...

... I suspect things like Cinebench, which rely heavily on SSE instructions and MT, could lose much of their "shoving around" as a valuable metric... lol...



 

8350rocks

Distinguished


I don't know... they have found YBCO to be a type-2 superconductor at 92K... which is a far higher temperature than previously thought possible. That discovery was sort of "happenstance" as it were, too. When the lanthanum cuprate compound was announced with superconductivity at 35K, they substituted yttrium into the compound and got a superconductivity temperature increase of 57K. I would imagine, based on that series of events, that it might be within the next decade or so that they discover something that superconducts at something closer to 180-200K. Which, while not room temperature, would be pretty easily managed with a sufficient cooling system.



Are these really becoming relevant? Last I read, ASICs were far more efficient than FPGAs and typically had a much smaller footprint.



450 picometers is probably still 10 years out, I would wager... especially considering they're likely to hit a "wall" around 10nm... they're going to have to develop a radically new process to get much smaller than that. FinFET likely won't get them much further... it will be interesting to see how the limitations are overcome, though. We are reaching a point, in terms of hardware, where engineers will have to uncover some radical innovations on several key fronts in the next decade to advance the current processes/trends beyond their limitations. I read about a development in France that used "nanowires", or whatever they called them... IGAs... laid out in a "grid" sitting vertically inside the chip, which would allow transistor size to shrink further... but that technology was radically experimental at best.
 

8350rocks

Distinguished


That could actually be the real deal... it's just a module, not a view of the whole die... but it looks pretty good. (It could be fake... but it's very well done if it is...)
 
Looking at the larger picture, what's there to go after if 10nm is the proposed long cutoff till something else comes about?
Intel's market share.
Without their process advantage, looking at ARM and the up-and-coming AMD products as well as others, they're not far off, and the sooner we get to 10nm, the more the playing field evens out, or at least more so than it has been.
My 2c
 

Cazalan

Distinguished


I'd say they're very relevant. An ASIC will be faster, but it takes 18 months to develop. You can have a new FPGA design in a day. That's why many ASICs are first modeled on arrays of FPGAs.



That was 450mm wafers, as opposed to the current 300mm wafers. Larger wafers reduce the price of transistors over time; you get more dies per wafer due to the much larger area. Intel is looking at 2016/2017 to bring those online.
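Back-of-the-envelope for the area point (ignoring edge losses, which actually favour the bigger wafer slightly):

[code]
% Area ratio of a 450mm wafer vs. a 300mm wafer
\left(\frac{450}{300}\right)^2 = 2.25
% so roughly 2.25x the candidate dies per wafer on the same process
[/code]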

As far as the nm tech goes, I believe Intel has it covered down to 5nm, which is still several years out, and IBM says they can likely get to 2nm before more radical changes are necessary. 3D will continue to be key, but it sounds like they are having major growing pains with the reliability of die stacking.

I suppose we could start building the "cloud" in outer space. Then you'd have unlimited free cooling. :)
 

griptwister

Distinguished

I SAW THAT PICTURE! I almost posted it too! But I would have felt like an idiot because I had no idea what I was looking at exactly. Glad to know it's possibly real. Sounds powerful lol.
 
28nm Steamroller is so late. It's been 1.5 years since the 7970 came out on 28nm at TSMC, and we still aren't even close to Steamroller. Hopefully GloFo can keep production up and give decent yields.
 