AMD CPU speculation... and expert conjecture

Status
Not open for further replies.
No, this isn't the direction they're heading; that's the business coming their way.
In other words, traditional won't shrink any more than the market allows for, and that market is shrinking, while consoles and SFF are growing, and the pie chart shows that relationship.
There will be less Intel product as well, since the traditional market is shrinking.
If there are 3 consoles, and you made 2 gfx solutions for the three, and now have all of it sewn up, it's not 50% more, it's 3x more, including CPU, and then look at the chart again.
 
sorry for the brainfart in the earlier post. gave up editing it.
i meant to write that they expect a lot of dough from the console deals while x86 desktop accounts for less and less of overall revenue. who woulda thought one of the possessive conditions of intel's x86 agreement would allow amd to make moar money (selling cpus instead of licensing the arch, because of the non-transferability :))
you're right about the shrinkage, jdj.
 

mayankleoboy1

Distinguished
Aug 11, 2010
2,497
0
19,810
The xbitlabs article says that Steamroller's single-core perf improvement is expected to be 5-10%, which is not bad. The only problem I see is that AMD can't increase the default clock speeds across the whole product line, which they did with Piledriver.

So Steamroller could be 'only' 10% faster on average than Piledriver, compared to Piledriver's 15% improvement over Bulldozer.
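
For a sense of scale, here is a minimal sketch of how those two generational steps would stack up if they simply multiplied. The 15% and 10% inputs are the rough figures quoted above, not measurements, and the function name is made up for illustration.

```python
def compound_gain(*gains_percent):
    """Combine successive per-generation gains (in percent) multiplicatively."""
    total = 1.0
    for gain in gains_percent:
        total *= 1.0 + gain / 100.0
    return (total - 1.0) * 100.0

# Figures quoted in the post (assumptions, not benchmark results).
pd_over_bd = 15.0  # Piledriver over Bulldozer
sr_over_pd = 10.0  # Steamroller over Piledriver, top of the 5-10% estimate

sr_over_bd = compound_gain(pd_over_bd, sr_over_pd)
print(f"Steamroller over Bulldozer: about {sr_over_bd:.1f}%")
```

Under those assumptions the two steps together come to roughly 26.5% over Bulldozer, since the gains multiply rather than add.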
 

only 10% max from all that 'claimed' internal tweaking, branch prediction and bandwidth improvement? they coulda gotten that much from just die-shrinking pd. i am not talking about kaveri here.
please correct me where i'm wrong - going from 32nm to 28nm should bring at least 15% less power use unless amd pulls another bulldozer. the extra space (opened up by the shrink) went to ... an extra decode unit...? i read somewhere that the decoding stage uses a big chunk of power. so is amd's reworking of the architecture mainly trying to maximize cpu performance while keeping the said 15% perf/watt and offering nothing further? shouldn't the cpu performance gain be higher than 10% if it (15% efficiency) was really achieved?
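
A rough way to put numbers on that question, assuming performance per watt is just relative performance divided by relative power. The 15% power reduction and the 10% performance gain are the figures from the post; the function name is hypothetical and nothing here is measured data.

```python
def perf_per_watt_gain(perf_gain_percent, power_change_percent):
    """Percent change in perf/watt for a given change in performance and power."""
    perf = 1.0 + perf_gain_percent / 100.0
    power = 1.0 + power_change_percent / 100.0
    return (perf / power - 1.0) * 100.0

# Shrink alone: same performance, 15% less power (assumed from the post).
shrink_only = perf_per_watt_gain(0.0, -15.0)

# Shrink plus the rumored 10% performance gain at that lower power.
shrink_plus_ipc = perf_per_watt_gain(10.0, -15.0)

print(f"shrink only:       {shrink_only:.1f}% better perf/watt")
print(f"shrink + 10% perf: {shrink_plus_ipc:.1f}% better perf/watt")
```

With these inputs the shrink alone gives about 17.6% better perf/watt, and adding a 10% performance gain on top gives about 29.4%, which is the gap the post is asking about.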
 

jdwii

Splendid


I'm pretty sure it's going to be a bigger improvement than that. I mean, AMD even said so. :D

Look at Piledriver: there is a 3-7 percent improvement in IPC compared to Bulldozer, and that was a touch-up. Now look at all the changes with Steamroller. BUT if this holds true and there is a 5% improvement, I think AMD needs to step up and say their design just isn't as efficient, regardless of the engineering changes.
 
resonant clock mesh technology is another one of my concerns. rcm, theoretically, should further improve performance per watt before die shrinking, at least at stock settings. what if rcm is not as compatible with amd's designs as amd previously claimed? isn't that what happened with piledriver (arch)? new architecture, re-designs and tweaks, rcm and a new process together should enable amd to achieve significantly higher cpu performance while maintaining a 15% perf/watt improvement with each new architecture. that's why the 10% number looks so inadequate. what is missing here?
 

mayankleoboy1

IC Insights 2012 top 25 semiconductor companies ranked by sales:

[image: 156v287.jpg]

and annual growth:

[image: 2lcl0jo.jpg]
 

truegenius

Distinguished
BANNED
BUT if this holds true and there is a 5% improvement, I think AMD needs to step up and say their design just isn't as efficient, regardless of the engineering changes.
and if this happens then i will say "bring out the k10 at 28/32nm"

http://www.anandtech.com/bench/Product/700?vs=362
x4 980 vs 4300
if we keep clock speeds and core counts the same then k10 is still good, and better in many cases (the fx-4300 has the advantage of turbo up to 4ghz). after all that R&D they delivered a product (pd) that is almost equal to an ages-old arch (k10)

and a die shrink with reduced l3 cache would be able to keep a quad-core k10 chip @3.2ghz under a 65w tdp, and the die size would be smaller than bd/pd

so clearly a better value-for-money product, with large profit margins and a low tdp to catch the attention of oems

llano could have been good but it does not clock well; indeed llano parts are not able to match phenom ii's clock/overclock speeds (they can't surpass the stock clock of the x4 980)

as amd is going to use the gpu to do fp work in their apus, using bd/pd for that time frame (until the arrival/implementation of true hsa) is not that good imo

if steamroller does not widen the performance gap vs k10 in the positive direction, per core and per ghz, then i will say again "why you AMD no 28/32nm K10" :/
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


At this point they'd probably go with a souped-up Jaguar over going back to K10. If you look back at how Brazos started, it was basically a K10 sliced down for power efficiency. They've since added modern instructions and built it back up. The biggest downside of Jaguar is the clock speeds: 2.0-2.4GHz isn't much compared to where K10 was at.
 

Blandge

Distinguished
Aug 25, 2011
316
0
18,810


AMD is using GF's 32nm SOI for Piledriver. Kaveri will be on TSMC's 28nm bulk and Steamroller will be on either TSMC's 28nm bulk or GF's 28nm bulk. The power characteristics may see little or no improvement from 32nm to 28nm because of going from SOI to bulk.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


ummm... because K10 is not optimized to have the "huge" FlexFPUs with XOP/FMA4, which never materialized thanks to intel(for obvious reasons)... and which BTW can be more important than single thread "integer" IPC...

BD was a trade-off according to the available space (yet already making a big chip). Besides, K10 is not in the format for "heterogeneous" cores, like FlexFPUs and, in the future, compress/encrypt cores a la IBM Z (i'm convinced), and with them "integer" cores with SMT/hyperthreading (like the FlexFPUs), so as to have up to 4 very well accelerated threads per module (at least 3 threads per module). I think "excavator" will be that: 16 threads in comparatively a little more space than where the current 8 reside (smaller nodes, and the cache space in BD is already huge).

Very different uarchs with very different targets in mind.

 

my speculations and questions are based on steamroller and kaveri being made on glofo's 28nm.... SOI...? i was pretty sure that glofo will make the high performance apus(steamroller-based, kaveri) while tsmc makes the low power ones (jaguar-based, kabini).
i don't understand why amd would switch to bulk after reading good things about SOI in earlier posts. ofc i don't have enough knowledge on fabrication processes.
 

hcl123



Just an added note... AFAIK and understand, Kaveri/Steamroller will have the same number of decode pipes (4) as BDv1/2, only organized in a totally different way, it seems. So there will not be any relevant extra power in this. Just a reminder that the AMD BD family is "vertical multithreading"; it functions by clever dedicated "buffers between stages or <thread domains>". And so, like the FlexFPU that is SMT (simultaneous multithreading -> 2 threads at the same time), the <decode domain> can also be SMT (with 2 dedicated "domain" front-end buffers)...

To add also that the previous family, like K10, functions very well with only 3 decode pipes, and also the "complex pipe" doesn't have to block the other pipes upon microcode or FP/vector code as it now does, and it can be Out-of-Order too, especially if there are 2 threads that don't share any dependency between instructions (the reason why the complex decode blocks the others in x86 -> a fault of crap x86, not the uarch). So in practice it could be said it's 2 decode engines (double)... sharing pipes is a detail...

As stated, AFAIK Kaveri will be 28nm bulk at "GF", not "TSMC" (which will have the Jaguar styles)... but since there seems to be a clear division between APUs and DT, and even within the APUs between what is for "mobile / low power" and what is DT, 28nm bulk could be only Kaveri "mobile", where bulk could be much more comfortable... the rest i think is premature to speculate on, since it is only by around mid-2014 that GF stated 28nm FD-SOI will ramp to full production readiness (as shown by STMicro this could be a very good advantage, even price/performance good for GPUs)... also not to discard the possibility of 28nm PD-SOI (the SHP) for the FX/server brands...

 

8350rocks

Distinguished


One of the reasons why dual-core Bulldozer modules [the same may be said about Piledriver] are not completely efficient is because they have only one instruction decoder for two ALUs and one FPU. With steamroller, AMD not only incorporated two decoders per module, but also increased instruction cache size (to lower i-cache misses by 30%), enhanced instruction pre-fetch (the number of mis-predicted branches is down by 20% compared to Bulldozer ) as well as improved max-width dispatches per thread by 25%. AMD believes that Steamroller will provide 30% improvement in ops per cycle.

AMD also advanced single-core execution by implementing 5%-10% more efficient scheduling, incorporated higher-capacity register files and performed some other tweaks. It should be noted that while integer pipes of Steamroller will not be too different from existing ones, the floating point pipe will be a bit redesigned. In general, AMD promises that both integer and floating point per-core performance of Steamroller will be higher than they are today with Bulldozer micro-architecture.

Where are you coming up with 5-10%?

They expect a 30% improvement in IPC...that's 3x more improvement than what PD over BD was...? (In terms of IPC...that's HUGE)
 

hcl123



I think the previous CEO is the culprit... that "economist" (aargh!), not an engineer, Seifert or something... he committed the large part of production to GF and to "bulk", to standardize with TSMC and so cut costs deeply (whether it would be good or sell well never crossed his mind, it seems).

AMD has since paid a fortune to get out of GF and out of any "exclusive" deals... i think at the least this might mean an intention to diversify the "fab process" possibilities; if not, they would have stayed with GF and made everything on GF bulk with a few GPU exceptions (i think the previous contract even mandated Jaguar on GF)...
[EDIT: "diversification" is a very good thing for "competition"; just think how fast GF improved the good ole 32nm PD-SOI for Richland... when previously it seemed to be getting nowhere]

Also, in the meantime the announcement of STMicro FD-SOI fabbed at GF appeared... and STMicro and ST-Ericsson entered the HSA Foundation... a lot of food for speculation lol...

Yet in the midst of all the turmoil, there is little time to diverge many designs from previous intentions and still get things out on time (yet already here are some reasons for delays.. i think...).
 

hcl123



I don't think that "IPC" is integer IPC in the traditional sense; it includes a lot of improvements in dealing with FP/vector code for multimedia etc. At least that is what the "30% more Ops" on that slide presented in previous posts seems to indicate (fine print). Caches can play a huge role in this, store-to-load mem ops too, latency of the FlexFPU also; the 2 decoders will be mainly about 2 threads per module... all bodes well for finally getting the Integer Core AGU pipes to crunch ALU ADDs + MOVs... if that is the case then 15% more IPC is not far-fetched... or even more... putting Steamroller right on top of Haswell IPC-wise (differences will be mostly dictated by SOFTWARE favoring one uarch more than another, even for benchmark code, and it has been so for a long time).

30% more IPC in the "traditional" sense would mean an SR clearly more performant than Haswell... that is, even for single-thread...

 

8350rocks

Distinguished


IPC = Instructions Per Cycle

All that would really mean is that the architecture's capability is going to be utilized efficiently. It does technically possess 150-175% of the horsepower of the intel chip...if it was optimized efficiently internally, it would be a more effective architecture than what intel currently offers.

Why is it so astonishing that they might actually capitalize on it?

Further, they are streamlining the FPU instruction pipeline, reducing latency to the FPU and allowing the FPU to run 3 IPC on top of each core's separate IPC...so technically each module could execute 11 IPC: 4 for each core and 3 for the FPU, with a 25% reduction in FPU latency from the streamlined pipeline.

So an 8 core SR chip can execute 44 IPC...intel can't touch that.
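
The module arithmetic above can be spelled out as a small sketch. The per-core and per-FPU figures are the ones claimed in this post, not confirmed specs, and the result is a theoretical peak per cycle, which is an upper bound and not the IPC that real code sustains. All names are made up for illustration.

```python
# Figures claimed in the post (assumptions, not confirmed specifications).
OPS_PER_INTEGER_CORE = 4  # peak ops per cycle per integer core
OPS_PER_SHARED_FPU = 3    # peak ops per cycle for the module's shared FPU
CORES_PER_MODULE = 2      # two integer cores share one FPU per module

def peak_ops_per_module():
    """Theoretical peak ops per cycle for one two-core module."""
    return CORES_PER_MODULE * OPS_PER_INTEGER_CORE + OPS_PER_SHARED_FPU

def peak_ops_per_chip(total_cores):
    """Theoretical peak ops per cycle for a chip with the given core count."""
    modules = total_cores // CORES_PER_MODULE
    return modules * peak_ops_per_module()

per_module = peak_ops_per_module()
per_chip = peak_ops_per_chip(8)
print(per_module, per_chip)
```

So the 44 figure comes from 4 modules of 11 each: an "8 core" chip has only 4 shared FPUs, which is why it is not 8 x 7 = 56.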

Here is a more in-depth look at the improvements coming in Steamroller:

http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-and-steamroller-core-architectural-enhancements-unveiled.aspx
 

hcl123



Intel doesn't use GDDR4 for their IB-E HEDT platforms !?...

It could compete with Intel on this and win (at least multithreading-wise, by a large margin), even against the top 3.6/4GHz 4960X SKU, if AMD used a G34 socket platform with an MCM of 6 modules, the same 12 threads, and 24MB to 28MB of cache...

I think they will not go there; there is no more market for these exercises with very expensive SKUs where intel dominates and where the propaganda barrier is enormous, even if, with a good 28nm PD-SOI process and a little better power management, they could put that MCM chip at the same 3.6/4GHz or better.

will the more down-to-earth 4 modules / 8 threads single-die SKU suffice?... don't think so!.. at least multithreading-wise no, but for clocks, i think even the top Vishera 2 this year could be way above that 3.6/4GHz on the good ole improved 32nm SOI.

... unless the FX SR version for 2014 is on IBM-style SOI lol... and then those chips could pass 5GHz at "commercial" speed, not OC (which could be much more for either side), on 28nm SOI (not likely)... as an alternative, staying with only 8 cores for the top end, add 2 FlexFPUs per module, or 1 per core, and it might be that it could win across the whole line for FP/vector code even against top intel HEDT (even integer code if the difference in clock is large enough). For multithreading in that case, the top intel HEDT with 2 more cores and 4 more threads will win, i think.

 

hcl123



yes yes... but IPC for x86 is 1 to 1.1, or up to 1.2 "instructions per clock" for slightly more optimized code. "full fury" optimized code could be much more, yet staying clearly below 2 instructions per clock per thread [makes one wonder why have CPU cores with more than 2 ALU pipes ... right ?]. And NO ONE in current software offerings that everybody uses has "full fury"... not even benchmarks, which can be more optimized for one particular uarch than another...

[EDIT: of course here we are talking about NON-VECTOR integer code, which is the traditional and logical way of counting IPC, since architecturally x86 is NOT a "vector" architecture... though good vectorized code could in many cases pass the 2 IPC barrier... ]

Since IPC is talked about so much these days, it would be nice to have a "benchmark" with code written specifically to indicate those numbers (1 to 1.2 or so IPC)... so the results would be 0.x to 1.z, or something like that.

Performance depends on so many factors that it is hard to begin with, but with those numbers in mind it is easy to understand that on x86, no CPU today is "technically or potentially" anything like 150 to 170% more performant IPC-wise than another. Not even Haswell compared with Jaguar, that is, 0.9 compared with 1.2, which is around ~35% in "potential"... the real gap is much more than that, due to many other factors, including instructions in the super bloated pointless x86 logical extensions (like in vector, from SSE to AVX) that one chip has natively while another must execute them by "microcode" (painfully slow)... which might indicate that the "spree to bloat" with newer instruction extensions was/is only to gain "competitive advantage", counting on this factor (seems obvious, no?)...
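
As a quick check of that ratio, here is a minimal sketch using the 0.9 and 1.2 IPC figures from the post. These are the poster's ballpark numbers, not measured values, and the function name is hypothetical.

```python
def relative_advantage(ipc_high, ipc_low):
    """Percent by which the higher IPC exceeds the lower one."""
    return (ipc_high / ipc_low - 1.0) * 100.0

# Ballpark per-thread IPC figures from the post (assumptions, not measurements).
ipc_low = 0.9
ipc_high = 1.2

advantage = relative_advantage(ipc_high, ipc_low)
print(f"{ipc_high} vs {ipc_low} IPC: about {advantage:.0f}% higher")
```

That works out to about 33%, in the same neighborhood as the ~35% quoted and nowhere near a 150% gap.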

High degrees of advantage/speedup only come with highly parallel data code and highly parallel execution models (cache and memory and clocks and even logical instructions could never reach those >100% speedups)... but here we are NOT talking about CPUs but more like GPGPU models...

That is where AMD has an advantage IMO, and it spells APU, not CPU. But the apps for this are still too few and far between to make a decisive advantage... "never settle" programs for multimedia apps could be the salvation of AMD... but i doubt those kinds will appear as "common benchmark" apps anytime soon...


 