AMD CPU speculation... and expert conjecture

btw, does anyone think that, according to amd's own uarch roadmaps, amd won't have an fx-type cpu for steamroller and/or after steamroller, and that some kind of high performance apu will replace the flagship cpus? i'm disregarding prices for now.

I don't have any information on how far along they are with integrating the iGPU as a "super FPU" via SIMD. APUs have no real place in full-featured desktop systems; mid-range dGPUs are far too cheap and come with dedicated high-speed memory. AMD's best approach would be to create a desktop SR part on FM2+ and just not put an iGPU inside it. Socket FM2+ can provide everything that AM3+ does while also supporting the video outputs of APUs. Just because something doesn't have an "FX" label doesn't mean we won't see a high-clocked 8-core CPU released at a market price of ~$200 USD.
 
Thuban's six cores at 3.3GHz are not surpassed in performance by Vishera's eight cores at 4GHz. A 32nm Thuban would still have consumed less die space than an eight-core or even six-core Vishera and performed faster in single-threaded workloads. All they needed was to redesign the IMC and front end a bit, and I am sure an 1100T in 32nm form with more transistors would have outperformed an 8350 in gaming by around 20-30% at lower clocks, that is, in the games that stress single threads, e.g. Civ 5, SC2, Shogun etc., where AMD CPUs are far behind Intel's.

This makes no sense at all. I'm really trying to understand where you're getting this from. You can already get a four-core 32nm K10 CPU and it doesn't perform anywhere near what you just stated. Since we're talking "old single threaded software", it being four cores vs six is irrelevant. The Thuban was never a great CPU; the four-core was vastly superior since you could overclock it much further (I still have mine). My FX-8350 is still better than my 970BE, especially after I started jacking up the speeds. I've even run tests on my 3550MX, though the laptop's cooling system prevents me from going any faster than 3.0GHz (single core) or 2.4GHz (all four).
 
let's take a look at this for a second.

only two sentences (barely), totally unrelated to the majority of the post, are borderline on-topic. the rest is simply "my intel is better than amd". wth.
this reminds me of those spam posts that copy part of a forum post and then slip some kind of url into the middle or the end.
aw crap, my post is off topic as well... looks like i am getting pulled into this vortex... must... get... out... DNFTT!! *swims away*
 

speculating here: is it possible for amd to use compute units for beefed-up co-processing/accelerating (edit: instead of replacing discrete gfx) inside a high performance 'flagship' 'apu', instead of using them for display and gfx purposes as in regular apus? for example, kaveri and kabini take the mainstream markets, and for high end desktop (hedt) and servers, amd makes that flagship, the one with a big die with 2 CUs and 4 modules along with an integrated pcie controller.
 

Krnt


But we have already seen that the Phenom was very dependent on its L3 cache; good examples are the first generation (Agena, Toliman), and even the Thuban took a small performance hit because its L3 cache wasn't big enough. Also, Llano was very limited in clock rate because of the way the GPU was integrated with the CPU, so it doesn't work at all as a good representation of the Phenom II at 32nm, maybe a limited representation of the Athlon II.

I am quite convinced that SR will fix the modular architecture's main problem, which was its decode stage, as you can see here:
http://www.realworldtech.com/bulldozer/5/

The decode stage shown there for Bulldozer feeds a complete module, and we are going to have twice as much in Steamroller, but I think 3 decoders per core would have been enough and more efficient; let's hope the fetch stage doesn't become a bottleneck.
 
@hafijur

"Gaming", "high end" and "mobile" do not belong in the same paragraph much less sentence. By definition there is no such thing as "high end" mobile as anything mobile would require sacrifices in performance. We can talk "desktop replacement" which is a notebook so obscene that it's no longer mobile and just services as a movable desktop unit (Sager comes to mind). I had one of these for awhile, my back hated me from lugging it through international airports. I eventually replaced it with a HP DV6z with an A8-3550MX and 6670M dGPU. Much better experience overall, cheaper too.

APUs make the most sense in the sub-$900 market, which is most definitely not the "low end". It's the mainstream mobile segment. When you get into the sub-$600 area they become the only real option to play games while also being light and mobile. I really can't stress how important these two market segments are; they represent the vast majority of sales.

This is when we start discussing real economics: whoever makes the "fastest" CPU at X is irrelevant. Consumers buy computers to do things, and they want to purchase the system that will do those things at the lowest cost.
 
But we have already seen that the Phenom was very dependent on its L3 cache; good examples are the first generation (Agena, Toliman), and even the Thuban took a small performance hit because its L3 cache wasn't big enough. Also, Llano was very limited in clock rate because of the way the GPU was integrated with the CPU, so it doesn't work at all as a good representation of the Phenom II at 32nm, maybe a limited representation of the Athlon II.

That's people being confused. Caching is a tiered system: first L1 is checked; if it misses, L2 is checked; if that misses too, L3 is the final safety net before having to resort to main memory. So it becomes a question of how much time you must wait between each check. The size determines how much old data you can keep around, and thus the probability of having that old data on hand. Predictors and prefetch should ensure everything is inside the L1/L2 cache, so if you EVER have to access L3, something very bad has just happened.

L1/L2 are dedicated (well, supposed to be anyway) and have a relatively low access time compared to L3, which is shared across the entire chip. The Phenom II had 512KB of L2 cache per core with 6MB shared across 4~6 cores; Llano has 1MB of L2 cache per core. That's 2x the amount of fast L2 cache, more than enough to offset any performance penalty from missed L2 lookups (the performance gained from fewer L3 lookups would offset the performance lost from not having L3 at all). I can't express how slow typical L3 cache is; it's damn near worthless unless you build a benchmark specifically to test for it (extremely large working data sets that can't fit inside L2).
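If anyone wants to see those tiers for themselves, here's a minimal pointer-chasing sketch in C (GCC/Clang on Linux assumed; the buffer sizes, iteration count and the Sattolo shuffle are illustrative choices on my part, not from any particular benchmark suite):

```c
/* Pointer-chasing latency sketch: each load depends on the previous one,
 * so average time per load approximates the latency of whichever cache
 * level the working set fits in. Sizes/iterations are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 20000000UL

static double chase(size_t *ring, size_t len)
{
    /* Sattolo's algorithm: one random cycle through the buffer, so the
     * hardware prefetcher cannot predict the next address. */
    for (size_t i = 0; i < len; i++)
        ring[i] = i;
    for (size_t i = len - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i) */
        size_t tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (unsigned long i = 0; i < ITERS; i++)
        p = ring[p];                        /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    fprintf(stderr, "(%zu)", p);            /* keep the loop from being optimized out */
    return ns / ITERS;
}

int main(void)
{
    /* ~16KB, ~1MB and ~16MB working sets: L1, L2 and past-L3 territory
     * on the chips discussed above. */
    size_t sizes[] = { 2048, 131072, 2097152 };
    for (int i = 0; i < 3; i++) {
        size_t *ring = malloc(sizes[i] * sizeof *ring);
        if (!ring) return 1;
        printf("%6zu KB working set: %.1f ns per load\n",
               sizes[i] * sizeof *ring / 1024, chase(ring, sizes[i]));
        free(ring);
    }
    return 0;
}
```

The jump in ns/load as the working set outgrows each level is exactly the "how much time you must wait between each check" being described.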

Also, Llano's iGPU has ABSOLUTELY NOTHING to do with its CPU overclocking. I know this because I happen to have one sitting on my living room table that I've tweaked the hell out of. You can disable the iGPU entirely if you have a dGPU present; that's how I got my Llano to 3GHz. It's no different than any other APU.
 

Krnt


Predictors and prefetch have never been strong on AMD, so it's normal for bad things to happen, and accessing L3 or even RAM is needed in many cases; and since L2 is not big enough to replace L3, when you ditch L3 the performance penalties are very noticeable.

I know L3 cache is slow, but it normally should not be slower than accessing RAM, so it's always an improvement.
Here are some images that could help us understand:
[Images: SiSoft Sandra memory/cache latency charts]

BTW, Trinity works pretty well without its L3 cache.

I think you are right about the Llano overclock; I think that issue only existed with the locked-multiplier versions or when the iGPU was enabled. Too bad we can't have it with L3 cache, it would have been an awesome CPU.

 


Nobody gives a flying **** about how cheap you got your laptop. The mobile APU is designed for the general laptop market, not some desktop replacement with an oversized screen. Sure, you can compare a $400 mobile processor to a Limited Edition Phenom II TWKR redux, but you can do the same thing with an 8320/8350 and a $70 cooler, or an AMD FX-9370... Quite ironic how you shout about how "amazing" Iris is, and then get a laptop with a GPU it was supposed to "kill off".
 
@Krnt

You pretty much just proved my point, though you may need to understand ASM to interpret that information.

L1 tends to be a 32~64KB data cache (instruction and data caches are separate at the L1 level). When the working set is larger than that, part of it will reside in L2, which is 128KB~2MB in size.

Here is the point I've been trying to make: data working sets are rarely bigger than 128KB. CPUs don't work on 1MB worth of data in any one instant; instead they work on 64 bits worth of data at a time. Your program gets a slice of CPU time to execute a bunch of work, then it's switched out and the cache is reloaded; this happens hundreds to thousands of times per second. Really large datasets are exceedingly rare for consumer programs. Thus the probability of a program working on more than ~256KB worth of data before the next context switch is incredibly low; you practically have to force the issue. That is why Intel has small amounts of cache on their CPUs.

Now as to why L3 would "seem" to give a performance increase: it's the prefetcher not reading far enough ahead, or the code not being organized such that it states what data address it's going to need. This gets pretty deep into machine language, but just know that your CPU sees instructions as an infinite stream of serial code and must look ahead to know what data to preload into cache. Larger L2 is always preferred over L3 because of this.
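As a rough illustration of code "stating what data address it's going to need", here's a sketch using GCC/Clang's __builtin_prefetch hint. The gather pattern and the 16-element lookahead are assumptions for demonstration, not tuned values:

```c
/* Indirect gather: the hardware prefetcher cannot follow idx[], so the
 * code itself announces the upcoming address. The lookahead distance (16)
 * is a guess that would need tuning per microarchitecture. */
#include <stddef.h>

float gather_sum(const float *table, const int *idx, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&table[idx[i + 16]], 0, 1); /* read, low temporal locality */
        sum += table[idx[i]];
    }
    return sum;
}
```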

Now, people really need to stop comparing AMD to Intel on cache; they use radically different caching logic/mechanisms. Intel's "L3" acts more like a shared L2 cache: AMD uses an exclusive model whereas Intel uses an inclusive model.

EX:
FX-8350 (8 cores / 4 modules)
16KB L1d per core and 64KB L1i per module, for 128KB of L1d and 256KB of L1i cache
2MB of L2 per module and 8MB of shared L3
Total cache: 16MB + 128KB + 256KB

HW (Haswell, 4 cores)
32KB L1d and 32KB L1i per core, for a total of 128KB L1d and 128KB L1i
256KB of L2 per core
6MB (8MB for the expensive ones) of shared L3 cache
So that would be 6MB + 1MB + 128KB + 128KB?
No it wouldn't: in an inclusive design, each level of cache contains a copy of all the data in the level below it. L3 contains all the data from L2, and L2 contains all the data from L1. This reduces the effective cache amount significantly. That 256KB of L2 is actually only ~192KB of exclusive data once the L1 copies are subtracted, and the 6MB of L3 is only ~5MB of exclusive data.

Now take something like the Phenom II

64KB of L1d and 64KB of L1i per core.
512KB of L2 per core
6MB shared L3.

Vs the Llano
64KB of L1d and 64KB of L1i per core
1MB of L2 per core
no L3.

That should hopefully help people understand why directly comparing cache sizes between these systems is a VERY bad idea. They're so radically different that direct comparisons are useless. Instead, actual benchmarks need to be used, and the result is that cache latency is far more important than cache size. This is where BD failed hard: its small L1d cache gives it a high likelihood of having to go to L2 for data, and its L2 has a high latency, which forces the core to stall out. Essentially the BD uArch is prone to excessive stalling. Increasing the size of the L1d and lowering L2 latency, or making a more accurate/deeper predictor, is what's needed to fix that issue. It's also why L3 is not an issue at all: by the time you get there you've already stalled out and things are already bleak.
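To put numbers on the inclusive-vs-exclusive point, here's the arithmetic from the two lists above spelled out as a trivial C program (figures are the ones quoted in this post; treating BD's victim-cache L3 as fully exclusive is a simplification):

```c
/* Effective unique capacity under inclusive vs. exclusive caching,
 * using the per-chip figures quoted above (all values in KB). */
#include <stdio.h>

int main(void)
{
    /* Haswell-style inclusive hierarchy: L3 duplicates L2 and L1,
     * so unique capacity is roughly just the L3 size. */
    int hw_l1 = 128 + 128, hw_l2 = 4 * 256, hw_l3 = 6 * 1024;
    printf("inclusive : %5d KB of SRAM ~ %5d KB of unique data\n",
           hw_l1 + hw_l2 + hw_l3, hw_l3);

    /* FX-8350-style exclusive hierarchy: levels hold distinct lines,
     * so the capacities simply add up. */
    int fx_l1 = 128 + 256, fx_l2 = 4 * 2048, fx_l3 = 8 * 1024;
    int fx_total = fx_l1 + fx_l2 + fx_l3;
    printf("exclusive : %5d KB of SRAM ~ %5d KB of unique data\n",
           fx_total, fx_total);
    return 0;
}
```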
 


The modular design has its issues, but L3 cache isn't that big of a deal for the arch, due to the issues stated above, a la Trinity faring quite well.
 


That was one of the goals of the Fusion design. Configuring the entire CPU to be modular makes it easier to remove the SIMD (FPU) and replace it with something else. GPUs are just vector processors, the same as the SIMD units inside our modern FPUs. They both do the exact same thing but use different languages to accomplish it. As long as everything can communicate in the same memory range *hint hint* and there is some sort of common language, it's possible to turn a GPU into a super FPU. There is a ton of prework that needs to be done for this to happen, and we might actually see it in Excavator. What happens is the stream of code contains typical SIMD (SSE/AVX/etc.) instructions, and the scheduler directs them to some unit sitting between the CPU and the iGPU that is responsible for scheduling and controlling the massive vector processor. It would be very similar to how the old 80387 worked.
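To make the "same math, different language" point concrete, here's a multiply-add written both ways; a sketch using plain SSE intrinsics (FMA4/XOP is skipped since, as noted further down, almost nothing ships with it). A GPU compute unit runs the identical operation, just across far wider vectors:

```c
/* The same multiply-add in two "languages": one element at a time,
 * and four at a time through the SIMD unit. */
#include <xmmintrin.h>   /* SSE intrinsics */

void madd_scalar(float *d, const float *a, const float *b,
                 const float *c, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = a[i] * b[i] + c[i];            /* one lane per iteration */
}

void madd_sse(float *d, const float *a, const float *b,
              const float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {         /* four lanes per iteration */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(d + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
    for (; i < n; i++)                        /* scalar tail */
        d[i] = a[i] * b[i] + c[i];
}
```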
 

hcl123


You are a confused and confusing guy lol

What is "DP/BD"... double precision and what?

"" The 8 FLOPs claimed by AMD correspond to a single (shared) FlexFPU per module. "" ... yes i doubted but not contested. But only if is 4ops per 128bit pipe, and only for 32bit FP ops. What you don't understand is that a "vector" which is like an agglomeration of simpler operations, and in this case multiply+add has to be present with up to 4 operands, and since the "issue port" is only 128bit large, for being per cycle it must correspond to 128bit FMA4 instructions, or vectors of 4x 32bit ops because if 64bit ops, then it will not fit "per cycle" trough 128bit ports(must take 2 or more cycles, or be split in halfs for 2 pipes)

Those (128-bit FMA4) instructions exist as part of the XOP package. But since Intel, with its clout, imposed an embargo, nobody really uses XOP outside of very, very few specialized apps, so that claim is not about pure, pervasive FLOP capability but only a special case. Besides, vectors of 32-bit FP ops are not what is most useful; vectors of 64-bit ops would be better.

So your 128 GFLOPS can only be for a 4-module CPU, because 8 FLOPs per FlexFPU is 32 GFLOPS, and only a 4-module chip could have 128 GFLOPS from the CPU side... if that is what you are stating. FLOPS = number of ops x frequency; in that case, per FlexFPU it's 8 x 4GHz = 32 GFLOPS, not 128... and still not very useful because it only holds for 128-bit FMA4 instructions. 256-bit FMA4 XOP instructions use both FMAC pipes, and since they are vectors of 64-bit ops, the rate is halved. Got it now?
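For anyone following along, that peak-rate arithmetic as a trivial C program (the 4GHz clock and 4 modules are the assumptions from the paragraph above, not announced specs):

```c
/* Peak-FLOPS arithmetic from the paragraph above: FLOPS = ops/cycle x clock.
 * 8 FLOPs/cycle assumes a 128-bit FMA4 op: 4 SP lanes x (mul + add). */
#include <stdio.h>

int main(void)
{
    double clock_ghz       = 4.0;  /* assumed FX-class clock     */
    int    flops_per_cycle = 8;    /* per FlexFPU, 128-bit FMA4  */
    int    modules         = 4;    /* one FlexFPU per module     */

    double per_fpu  = flops_per_cycle * clock_ghz;   /* 32 GFLOPS  */
    double per_chip = per_fpu * modules;             /* 128 GFLOPS */

    printf("per FlexFPU  : %.0f GFLOPS\n", per_fpu);
    printf("4-module CPU : %.0f GFLOPS\n", per_chip);
    return 0;
}
```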

This is what led me to the speculation about 256-bit FMACs; that is, one single FMAC is 256 bits wide internally and *must* have a 256-bit port... and with 2 FlexFPUs per module that leads to up to 32 64-bit ops per 4-module CPU. Now that could be very useful for all those scientific applications.

"" The diagram of the Steamroller module has been posted here numerous times. I posted again a pair of post above: there is no "each FAMC pipe is now 256bit large ""... have you seen an actual, after all this delays, diagram of an actual Steamroller chip ?... NO!

Yes, it's quite possible that the APU modules only have 1 FlexFPU; after all, they have a GPU, which has much more FLOP capability, and compute programming is here to stay. But a server/FX chip could very well have 2 FlexFPUs per module; the "kind of doubled" decode, double dispatch and other improvements could be sufficient to sustain the needed rates even with 256-bit FMACs on 2 FlexFPUs per module.

Yes, those were not revealed in any "slide presentation", but they were disclosed in a "programmer guide"; just google AMD FP256, you'll probably land on planet3dnow... a help...

http://www.planet3dnow.de/photoplog/file.php?n=24314&w=o

256-bit AVX instructions executed with full-width internal operations and "pipeline", rather than decomposed into 128-bit sub-operations, can only mean the pipeline executing them has 256-bit "issue ports" and so is 256 bits wide as well, even if internally it's composed of bridged 128-bit sub-pipelines also suitable for other operations. Going with 2 128-bit pipes working together, it would have to be done in halves, or the arbitration for register file access will be hell and the chip will clock slower.

Now that seems to me like they need 256-bit FMAC pipes... doesn't it to you?



You are indeed confusing and confused... you must tell from those numbers what is due to the GPU and what is due to the CPU cores, what corresponds to "single precision" and what to "double precision", and what kind of vectors use them... 244 GFLOPS seems a little low even for a GT2, but it must be single precision, since the GPU barely moves on "double precision"... but if there is magic and it is CPU-side only, just "present the math please" that leads to those numbers... FLOPS = number of ops x frequency.

(No way in hell could haisfail have 244 GFLOPS from the CPU-core side at only 3.6GHz... gzz... what propaganda does to ppl's heads lol (edit))
 

hcl123


Predictors and prefetch can lead to cache pollution and alignment issues, since different cache levels can have different organization. That has been the issue.

Faster than all of that is load-to-store forwarding and/or memory disambiguation, and that has been Intel's sole advantage... now practically gone.

Llano already has 8T-cell SRAM with 1MB of capacity for each core, exactly to obviate the need for an L3. I think it would have been more awesome to also have the predictors, prefetch and the very good load-to-store scheme that Steamroller is getting (in Llano that is very weak)... then it would have been grand. But that already supposes a greatly different design. Then there is the issue of the FlexFPU and all that XOP and AVX support; if AMD didn't have any of those instructions, as in Llano, every benchmark would pervasively revolve around what AMD doesn't have, even if "current software" doesn't have it either (which has always been the case).

But then, giving Llano FlexFPU capabilities, with the only possibility being shared dedicated L1/L2 levels, we invariably fall into the "module" approach, especially if we want the vector/FP code out of the integer code path, which is much better for clocks... and that is the main reason the FO4 tweak was possible. This leads to the other side of the coin: a FlexFPU (very large and bulky) on a shared exec topology like K10 (or Intel for that matter) would have the chips clocking lower. What it could possibly gain in ILP it would lose in clock, so a top Llano SKU would have had the same performance as a top BD SKU [IPC (instructions per clock) measures ILP, not total performance; less ILP but more clock can lead to the same performance]... only it would OC much worse LOL...

I don't think BD happened because AMD engineers are stupid, or for "lack of trying".
 


He has been on about that for a long time, yet Iris proves fallacy over rationality. OEMs ship Iris-based parts with mid-to-high-end mobile discrete GPUs for the very reason that the HD5200 Pro is only good up to 1366x768; beyond that, despite twice the bandwidth of a common HD8660/6800K, by 1600x900 it's about on par and by 1920x1080 it's starting to lag behind. Iris was supposed to knock the HD6670 and GT 640M/650M out of play, and yet those discrete GPUs offer considerably more performance.



 

noob2222


That was actually the point I was trying to get across: this 8x FPU performance only appears in very limited, highly threaded HSA programs.

On the GPU side, Kaveri APU would make use of upto 512 stream processors or (MADs) which amount to 8 compute units.

Read more: http://wccftech.com/amd-kaveri-apu-architecture-detailed-generation-apu-featuring-steamroller-gcn-cores/#ixzz2cMTBQTFf

Ya, I know, quoting from WCCF, but the point is that Kaveri has 8 compute units, amounting to 8 extra floating-point pipelines. In order to achieve this theoretical 8x GFLOP performance, the program has to be able to utilize 12 simultaneous FP threads (4 FlexFP pipes in the SR CPU cores, and 8 CUs through the slower-clocked IGP).
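Back-of-the-envelope on those numbers (a sketch: the iGPU and CPU clocks below are placeholders, since neither was announced at the time; 2 FLOPs per GCN stream processor per cycle is the standard FMA counting):

```c
/* Rough peak numbers behind the "8x" talk. GCN does one single-precision
 * FMA (2 FLOPs) per stream processor per cycle; clocks are placeholders. */
#include <stdio.h>

int main(void)
{
    int    sps     = 512;     /* 8 CUs x 64 stream processors (per the article) */
    double gpu_ghz = 0.72;    /* placeholder iGPU clock                         */
    double cpu_ghz = 3.7;     /* placeholder CPU clock                          */

    double gpu_gflops = sps * 2 * gpu_ghz;   /* ~737 GFLOPS                     */
    double cpu_gflops = 2 * 8 * cpu_ghz;     /* 2 FlexFPUs x 8 FLOPs: ~59 GFLOPS */

    printf("iGPU peak: ~%.0f GFLOPS\n", gpu_gflops);
    printf("CPU  peak: ~%.0f GFLOPS\n", cpu_gflops);
    return 0;
}
```

Peak numbers only; as said above, nothing sees that without software actually spanning all those FP threads.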

My "single core" statement was in refrence to having some miracle happen so that the 8 CU are seen as one pipeline that can be executed through a single thread. Im sure most of us realize thats likely impossible at this time.

Even Intel shoves 2 threads through a single core, not the other way around (shoving 1 thread through 2 cores, much less 8).

In short, all this talk of 8x GFLOPS is only going to be valid in highly threaded HSA programs. Current programs won't see much benefit, other than those that already support OpenCL.

Can Kaveri's 4 cores with HSA support replace the 8350 in today's programs? I don't see it happening, but obviously someone else does.
 

juanrga


Do you mean the performance bug affecting the B0/B1 series of Bulldozer and corrected in the B2 revision?



The hypothetical possibility of using the iGPU for compute and a dGPU for graphics was discussed before. I considered some advantages/disadvantages of this dual approach.
 
Guest
Does anyone know what the likely gain in performance will be with steamroller?

I waited three years for Bulldozer and was disappointed, so I am very cautious about new AMD stuff. My Llano laptop beats Trinity hands down clock for clock.
 
^^ unless amd royally screws up, there will be very noticeable performance gains on both the cpu and igpu sides with kaveri apus over llano. llano was more of a first try, whereas kaveri will be better integrated, delivering higher performance.
edit: especially in mobile (laptops).
 

your laptop gets slapped by trinity laptops. period.
 

noob2222


yep, maybe this will put the nail in the "omg we need phenom V" coffin.

[Image: AMD average gaming performance chart]


Only in Skyrim is the 4.0GHz Phenom II able to keep up with the stock FX-6350. In Crysis 3, the FX-6350 completely obliterates the Phenom II.

[Image: Crysis 3 lowest frame rate chart]


Looking at the Athlon 750K (Trinity without the GPU), Kaveri has to be a major improvement to catch the 6350 and 8350.

Who knows what AMD is hiding, they are being too quiet.
 

hcl123



More so with Richland laptops... I'm quite convinced these (Richland) even beat Iris at higher resolutions (FullHD) and higher settings (a beating)... and I'm much surer on image quality (which has always been a very strong point of ATi, even before AMD came along). Performance is not always FPS; the problem is that there isn't a coherent way to benchmark image quality.
 
If you look at the newer games and newer programs, the Piledriver 2-module parts perform better than the 4-core Phenom IIs at the same clocks. The modules do have problems with minimum frame rates, which additional cores alleviate. The FX-6xxx parts are a bargain for budget gaming these days.
 