AMD CPU speculation... and expert conjecture


hcl123

Honorable
Mar 18, 2013
425
0
10,780


SB/IB makes heavier use of the L3 because of its scheme: the whole cache hierarchy is "inclusive" (mostly, in any case) up to the L3. It's a good scheme for some workloads, since only the L3 (really) has to be snooped for coherency, and that kind of traffic can really spoil the fun for fast processors with big caches. OTOH it is quite a bit less flexible when it comes to variations of the scheme.

AMD's L2 can be just as good for speed, yet keep the bigger size and keep the flexibility. Make the L2 adaptive (meaning certain blocks can be turned off under low pressure and/or with only 1 thread per module) and way-predicted (the L1 data cache has been way-predicted since BD v1). Make the L2 1.5MB and the L3 2.5MB per module; with adaptivity and way prediction the L2 could be as fast as SB/IB's yet be much bigger (cherry on top of the cake: as I understand the layout, it could even make those dies smaller)...

To imitate the "snoop avoidance" feature of SB/IB, and with more space available for the L3, put a good "coherency directory" inside each L3 block, even for internal on-die traffic.
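
Just to illustrate the idea (a toy sketch of my own, with hypothetical names, not any actual AMD or Intel design): a per-L3-block coherency directory remembers which module may hold a line, so a write only probes the recorded sharers instead of broadcasting snoops to every module.

# Toy sketch of a per-L3-block coherency directory (snoop filter).
# Purely illustrative; structure and names are hypothetical.

class CoherencyDirectory:
    """Tracks, per cache-line address, which modules may hold a copy."""

    def __init__(self, num_modules):
        self.num_modules = num_modules
        self.sharers = {}  # line address -> set of module ids that may cache it

    def record_fill(self, addr, module):
        """A module brought this line into its L1/L2: remember it as a sharer."""
        self.sharers.setdefault(addr, set()).add(module)

    def record_evict(self, addr, module):
        """A module dropped the line: stop probing it for this address."""
        if addr in self.sharers:
            self.sharers[addr].discard(module)

    def snoop_targets(self, addr, requester):
        """Return only the modules that actually need to be probed.
        Without a directory, every other module would have to be snooped."""
        return self.sharers.get(addr, set()) - {requester}

# Example: module 0 writes a line that only module 2 ever touched.
directory = CoherencyDirectory(num_modules=4)
directory.record_fill(0x1000, module=2)
print(directory.snoop_targets(0x1000, requester=0))  # {2} instead of {1, 2, 3}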

So for dies with more than 1 CPU module (4+ cores), this would make keeping the L3 a good decision too (it's not only there for those "oh! s**t" moments).
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


A 25% reduction for "branch code" is quite extraordinary if true, more so because it might include "transactional memory" processing like Haswell. It says in the fine print <1> "... number of tests, including those of 'transaction' processing ...". A 30% reduction in I-cache misses is also extraordinary; the L1-I is already quite big, so perhaps the "loop awareness" is being felt here.

Yes, not conclusive, but it might indicate that Steamroller has HTM and/or Hardware Lock Elision of sorts.

And the 30% is "general computing across the board"; it's not IPC in the sense of counting only the "integer" part of processing (the fine print of the first slide states it clearly <2> "... number of tests, including those of digital media, productivity and gaming applications").

These indications bear a striking similarity to the presentations of APUs, not CPUs. In any case, with all that it might get "north" of the much-publicized 10-15% per generation in terms of pure "integer" IPC (which is more and more a "borged" metric... tending toward irrelevancy).



Make the CPU a better helper for those tasks it really is good at... break it into "pieces", a decoupled control+access/execute paradigm (DAE)... then those modules could even have CU cores and an ICE engine in them.

http://www.xbitlabs.com/news/cpu/display/20120207135832_Engineers_Show_Way_to_Improve_Performance_of_AMD_Fusion_Chips_by_20.html
Up to 113% better for a DAE scheme, >20% on average (yet it could be much better).

As long as results like this keep coming, showing Gustafson's Law (a puzzle why Intel let him go, when AMD made him a VP right away) invalidating Amdahl's Law... no more words needed, this is the future...

http://blogs.amd.com/fusion/2013/03/18/new-aviary-sdk-and-windows-8-app-optimized-for-amd/
(11x = 1100% *MOAR*) average

https://amdfusion.activeevents.com/scheduler/catalog/catalog.jsp?sy=798
Optimizing VLC Using OpenCL and Other Fusion Capabilities

"" VLC is a free and open source cross-platform multimedia player and framework that plays back most types of multimedia files. In our optimization work, we first integrated an AMD-exclusive algorithm, Steady Video™, into VLC, to stabilize shaky video played with VLC. Then we used OpenCL to optimize VLC’s scaling filter, which is used to enlarge or shrink the video on the fly during playback. This OpenCL optimization has achieved speedups of up to 10x on Llano and 18x on Trinity compared to a competing CPU.( :lol: ) A de-noise filter in VLC is used to reduce the noise in video. Due to the algorithm’s data dependency nature between pixels, it is hard to parallelize. Through our optimization efforts, we have implemented an OpenCL version which is already faster on Trinity APU than on a competing CPU. Together these features/optimizations will enable a much better video playback experience on HSA platforms. ""
(>10x = >1000% *MOAR*) average
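
For a feel of why the scaling filter was an easy OpenCL win while the de-noise filter wasn't, here is a rough NumPy sketch (my own hypothetical code, not AMD's actual VLC work): nearest-neighbour scaling computes every output pixel independently, so it maps straight onto thousands of GPU work-items, while a running-average style filter carries a dependency from one pixel to the next.

# Rough illustration of why scaling parallelizes trivially and a
# pixel-dependent filter does not. Hypothetical code, not AMD's VLC patches.
import numpy as np

def scale_nearest(frame, out_h, out_w):
    """Nearest-neighbour scaling: each output pixel depends only on the input,
    so all of them could be computed by independent GPU work-items."""
    in_h, in_w = frame.shape
    ys = np.arange(out_h) * in_h // out_h
    xs = np.arange(out_w) * in_w // out_w
    return frame[ys[:, None], xs[None, :]]

def denoise_row_recursive(row, alpha=0.5):
    """A toy recursive smoother: each pixel depends on the previous result,
    which is exactly the kind of data dependency that resists parallelization."""
    out = np.empty_like(row, dtype=float)
    out[0] = row[0]
    for i in range(1, len(row)):
        out[i] = alpha * row[i] + (1.0 - alpha) * out[i - 1]
    return out

frame = np.arange(16, dtype=float).reshape(4, 4)
print(scale_nearest(frame, 8, 8).shape)   # (8, 8) - embarrassingly parallel
print(denoise_row_recursive(frame[0]))    # serial chain along the row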
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Yes, at present they don't even get to 8 threads yet, though in the SIMT (single instruction, multiple threads) scheme of things it's already many, many tens of threads. But can't this change? ...

Tiling/supertiling comes to mind as a neat trick that could be improved by more CPU cores... along with ray casting/tracing (many variants possible). Most of this, like "graphics compute tricks", along with the "interactivity" processing you mention, doesn't require much more single-thread power, if any (more so because it could reduce the burden of CPU drawing)... but it could put more multithreading to use quite nicely...
 


Existing Trinity and Richland APUs will be compatible with new boards carrying GDDR5 packages, and unlike the SB-to-IB chipset changes where you lost functions, all of Kaveri's benefits will be there for its predecessors, obviously at a lower level, but you will nevertheless have the benefits of GDDR5 support.



 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


5.2GHz commercial speed, 24/7, for a highly complex OoO and very big 6-core server CPU die with tons of cache (meaning no turbo for this >500mm² part)... has been here since 2010 at 45nm *PD-SOI*, no HKMG, no FinFETs...

Recently they moved to 32nm HKMG *PD-SOI*, for almost half the power and almost half the size (~300mm²), and set the speed at 5.5GHz commercial 24/7 (no turbo, and no water cooling really).

I'm talking about the IBM z196. Even here it seems the PC world will be very late to catch up... damn! Lately it seems the PC world is very late to catch up on anything relevant in innovation...

Smashed at low power by the ARM paradigm for portables... smashed at the other end for high-performance, high-clock solutions...

 

mayankleoboy1

Distinguished
Aug 11, 2010
2,497
0
19,810


500mm² :O on a 45nm process :O :O
What was the power draw? 500W? 1kW?

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


You don't seem to fully appreciate the context. It's CERTIFIED FOR 24/7 operation, and the chip has tons of cache, 24MB IIRC (triple that of SB/IB) for 6 threads/6 cores, plus impressive RAS features including "redundant/reliable execution" (meaning each instruction is processed twice in parallel)... yes, by my calculation (never exactly revealed by the vendor) it might have drawn around 220W, nothing that scares a gamer with a good GPU...
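
A crude software analogy for that "redundant execution" RAS feature (my own toy sketch, nothing like how the actual z hardware implements it): run the same operation on two "lanes" and only commit the result if both copies agree, otherwise recover.

# Software analogy of redundant execution: run each operation twice and
# only accept the result if both copies agree. Purely illustrative.

def execute_redundant(op, *args):
    result_a = op(*args)   # "lane A"
    result_b = op(*args)   # "lane B"
    if result_a != result_b:
        # In hardware this would trigger recovery from a checkpoint.
        raise RuntimeError("lane mismatch detected, retry from checkpoint")
    return result_a

print(execute_redundant(lambda x, y: x + y, 2, 3))  # 5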

The 32nm version reduced power by 35% IIRC (to ~140W) and is quite a bit smaller.

But appreciate this fully: replace the redundant execution with an SMT hyperthreading scheme, shed almost all the RAS features that aren't necessary, give it only 12MB of cache, drop the interfaces for the off-die L4, which also aren't needed... and you could have a "desktop" chip of ~250mm² with 12 threads and perhaps 5.8GHz at 125W on 32nm PD-SOI... and it is a monster cruncher with "eager execution" and fully checkpointed ops, and one of the most performant bypass networks ever for OoO exec; I'm sure its single-thread *integer* would beat any Intel or AMD CPU hands down...

The main trick for speed is a long pipeline with 26 stages... yes, a speed demon like the P4, but even more of a crunching beast than any other Intel CPU (OK, vector is kind of weak, but this is a CPU that isn't trying to turn a CPU into a GPU (AVX1/2/LNI) or a GPU into a CPU (x86 cores, for crying out loud) for marketing/market's sake)... there is no law that says one must invalidate the other, and IBM just proved it....

Worth emphasizing: it is also heterogeneous. Each 6-core die has 2 very good compress/encrypt co-processors that have been beating some records in the market... so if vector were important they could include Cell-style SPE co-processors, which would be orders of magnitude better than trying to unify everything in the same pipeline... that is, 4 advanced SPE co-processors for AltiVec 2 could mean ~300mm² for that, and less than 5.5GHz, yet most probably it could win at vector too, hands down, compared with any current PC-world CPU.

Face it, there are reasons for this: IBM is still the best at CPUs, and designing chips every 2-3 years is very different from attempting designs that must change every year or less for marketing/market reasons...
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780
It is... complicated contracts with guarantees for both sides, and a minimal entry-spec big vertical machine with those CPU clusters would cost you almost half a million dollars... top spec can go beyond 1 million...

No, this is not for "current markets", much less for the "PC world"...

And honestly I think this style of setup is "démodé"; the fastest-growing server market of the future will be the "cloud", and at least 8 ARM vendors with server solutions think the same... let's see if IBM can adapt...

[ EDIT: oh! sorry... about the 5.8GHz 125W exercise... No, AFAIK IBM doesn't design anything like that for itself (the Wii U chip is for "others"), only big server chips with their own interconnects, no fewer than 4 levels of cache and so on... their main market is those big machines, including the top supercomputers... no "desktop" chips (and I appreciate why, lol), AFAIK... ]
 


IBM POWER8 hasn't been released yet, so like all things, take a wait-and-see approach. Going over 5GHz isn't impossible; it just requires trade-offs to keep power draw to a minimum. Now, I'll admit I don't keep up with PPC nearly as much as I do with SPARC or x86, so I'd have to ask some folks I know about it. I can say that PPC has a dramatically different performance profile than its competitor SPARC. Both of those are in a completely different world than x86 though; there should never be direct comparisons between them.
 
Ah, thanks for the explanation, guys. I was thinking that since Apple used to use PowerPC CPUs, at least some Mac Pros might switch to those CPUs from x86, since both AMD and Intel seem to hit some kind of GHz limit. I know it's kind of a long shot... As for AMD making a 5GHz CPU - I thought they're the only one who would release a high-clocked CPU if they needed to... at least an x86 one.
 


PPC used to be a very efficient architecture, but now, IPC-wise, it doesn't fare particularly well. Also remember that the POWER arch (as opposed to the PPCxxx series) consists of essentially server-grade chips, not really meant for general users.



The issue is mainly power draw and, by extension, heat. Hence why most chips, almost 7 years later, are still hovering in the mid-2GHz to mid-3GHz range. We've seen a slow trend upward, but not much else. At least IPC has improved a ton since the P4 days, though...
 


The Xbit article is basically explaining how they use the CPU as (essentially) a pre-fetcher for the GPU. No huge shock that getting the GPU its data faster speeds up the system as a whole.

DAE has significant downsides: specifically, the cost of any branch misprediction is large (it requires a buffer flush). Also, control-intensive code (OS kernels) is not well suited to any DAE-based architecture, hence why it's not used in general computing machines (PCs, etc.).

http://en.wikipedia.org/wiki/Decoupled_architecture

As long as results like this keep coming, showing Gustafson's Law (a puzzle why Intel let him go, when AMD made him a VP right away) invalidating Amdahl's Law... no more words needed, this is the future...

Amdahl's Law will always be true: you cannot speed up code via parallelization beyond the limit set by the serial fraction of the code. If 90% of the code is serial in nature, the best you can ever get is roughly a 1.1x speedup no matter how many cores you add; if 90% is parallel, the ceiling is 10x. Not that hard a concept to grasp.

Gustafson's Law is also true: it's cheap to parallelize large datasets; that's what GPUs do during rasterization. The larger the dataset, the greater the speedup from processing it in parallel. I have never argued against that point [quite the opposite, I've long called for moving physics onto the GPU, since multiple-object interactions bring current CPUs to their knees].
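
To put rough numbers on the two laws (a minimal sketch with made-up parameters, just to show the shapes of the curves):

# Minimal numeric illustration of Amdahl's vs Gustafson's law (made-up numbers).

def amdahl_speedup(parallel_fraction, workers):
    """Fixed problem size: speedup is capped by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Scaled problem size: the parallel part grows with the worker count."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * workers

for n in (4, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2), round(gustafson_speedup(0.9, n), 1))
# With 90% parallel code, Amdahl tops out near 10x no matter how many workers,
# while Gustafson keeps climbing because the dataset grows with the machine.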

http://blogs.amd.com/fusion/2013/03/18/new-aviary-sdk-and-windows-8-app-optimized-for-amd/
(11x = 1100% *MOAR*) average

Again: Moving execution of code to the GPU speeds up processing. Nothing new here.

https://amdfusion.activeevents.com/scheduler/catalog/catalog.jsp?sy=798
Optimizing VLC Using OpenCL and Other Fusion Capabilities

See above. Moving large datasets to a massively parallel arch speeds up execution. Which kinda proves the point I have been trying to make: That all the stuff that is ALREADY parallel is being moved to the GPU, leaving the CPU with the stuff that is mostly serial in nature.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Ummm... I think what you describe is more like the "execution-based prediction schemes" the Sun Rock chip was supposed to implement, a form of "pre-computation" or internal scout threads. That is worse, since in most schemes (including Rock) those computations are discarded, which is terrible for power.

The "decoupled" factor is also kind of blurred... many academic uarch exercises can be decoupled yet not be truly DAE. Matter of fact the BD family uarch is decoupled yet not DAE.

The DAE I was talking about envisions a real separation of the control+access processing elements from the execution processing elements, a lot like the CMT (clustered multi-threading) of the BD families if you wish, but more separated; nothing is discarded, and one notable difference is that the C+A part always runs ahead, that is, it is intentionally made to run ahead, sometimes by quite a large distance, and so it is able to catch many cache misses, prefetch more precisely and resolve many branches ahead of time.

The worst problem with this is exactly "loss of decoupling", or of run-ahead distance in the code (not in the uarch, lol), which requires complicated control and especially very good "data and branch speculative" execution schemes. This has been the real crux; whoever gets this part right hits the jackpot, IMO. Perhaps if branch prediction keeps evolving with forms of eager execution with checkpointing, data speculation moves to the next level with address/alias prediction and value reuse, and HTM gets off the ground, then DAE in a CPU can become possible.

OTOH the real execution pipes can be simplified and could be clock-gated more often. The rest is identical, but OoO with SMT is quite natural and is guaranteed by the C+A part. The speculative execution part is necessary to avoid loss-of-decoupling events, but the real advantage is in OoO, whose visible execution window, with a good checkpointing scheme, can be on the order of tens of thousands of instructions (not kidding) without the controlling structures having to grow in proportion. DAE can be like OoO on super steroids.
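
A very rough toy model of that run-ahead idea (my own sketch with invented names, not any real BD or academic DAE design): an access stream walks the address list ahead of the execute stream and drops fetched operands into a decoupling queue, so the execute side only stalls when decoupling is lost (the queue runs dry).

# Toy model of decoupled access/execute: the access "processor" runs ahead,
# resolving addresses and queueing operands, while the execute "processor"
# consumes them. My own sketch, not any actual DAE or BD microarchitecture.
from collections import deque

def access_stream(addresses, memory, queue, run_ahead=4):
    """Walk the address stream ahead of execution, prefetching operands."""
    for addr in addresses:
        while len(queue) >= run_ahead:   # bounded decoupling distance
            yield                        # queue full: hand control back
        queue.append(memory[addr])       # "prefetch" the operand
    yield

def execute_stream(queue, count, results):
    """Consume queued operands; stall only when the queue runs dry."""
    done = 0
    while done < count:
        if queue:
            results.append(queue.popleft() * 2)   # some ALU work
            done += 1
        else:
            yield                        # stall: loss of decoupling

memory = {a: a * 10 for a in range(16)}
queue, results = deque(), []
producer = access_stream(range(16), memory, queue)
consumer = execute_stream(queue, 16, results)

# Round-robin the two "processors" until execution finishes.
while len(results) < 16:
    next(producer, None)
    next(consumer, None)
print(results[:4])  # [0, 20, 40, 60]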

That Xbitlabs article is only a small approximation of the potential.



Yes, I think both are good friends, lol.



Yes, but when the speedup involves 3 digits it's kind of ecstatic... when it involves 4 it's "nirvana", lol (CPU or GPU, we fail to think of them in those terms)...

--------------

Shame on me, I've been passing on wrong info...

That IBM z chip I was talking about, with a commercial speed of 5.5GHz, is not 300mm² and doesn't have 24MB of cache... it's 32nm PD-SOI all right, but it has 6x2 = 12MB of L2 (1MB I + 1MB D per core) + 48MB of L3 (the L1 is comparatively very small), so 60MB of cache, and it's ~600mm²... lol... and it's no longer only 2 heterogeneous co-processors per die, now it's 1 per core, or 6 in total, which makes those cores some kind of "modules"... pardon my French...

Cherry on top of the cake, besides much better branch prediction and checkpointing... it now has Hardware Transactional Memory with a dedicated store cache, plus hardware profiling/instrumentation of code (the first, Intel has with Haswell but in a more primitive form, without a dedicated store cache... the second, AMD has with BD but in a more primitive form too).

Who wants this for the desktop?
http://www.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-9-Big-Iron/HC24.29.930-zNext-Shum-IBM.pdf

 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460
5GHz, because 4GHz obviously isn't enough... lol.

*edit* When talking about the 5GHz wall, I was talking OC, not a part being marketed as a 5GHz option. People are getting very close to 5GHz with the FX-8350 from what I've heard...
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


That part, while very technologically impressive, is like watching an FX-8350 hit 8GHz on liquid nitrogen. It CAN reach those speeds, but at a cost no average human can afford. Last I heard the CPU itself was drawing over 1200W; no telling what the full system was using. Buying a z196 is even trickier: you essentially have to rent them and pay by how many cores are activated and how much memory is enabled.

Guess that's where Intel got the idea of paying to unlock features of the CPUs.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


They (IBM) are selling it.

And that 1200W is for a whole "big" socket package that has 6 chips in it, 4 CPUs + 2 high-speed interconnect switches carrying all the L4, each of which is capable of drawing more power than the CPU chips.

It's different markets all right, different propositions... but in the end the exercise is meant to show that there aren't "artificially imposed barriers or walls", and that one fab process can't be the best at everything. In the end the old knowledge still prevails -> "you can't have REAL low power and REAL high performance/high clocks at the same time".

The parallel with GPUs here is striking: those cards can be over 300W (4 of them makes 1200W), yet people don't really complain about power, they complain about FPS, and I doubt they would trade half the FPS for half the power or less (in the end they can't have both at the same time).
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860


Over 5 with an H100 or H70 closed-loop cooler.

http://www.tomshardware.com/reviews/fx-8350-vishera-review,3328-2.html
http://www.overclockersclub.com/reviews/amd_fx8350/3.htm

5.125GHz on Tom's, 5.2GHz on OC Club.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Cost isn't an artificial barrier. Those chips just show that for an extra 10k per chip you can get an extra 20% clock speed. Not really efficient for anything close to a personal computer.
 