AMD CPU speculation... and expert conjecture


hcl123

Honorable
Mar 18, 2013
425
0
10,780
@8350rocks http://www.tomshardware.co.uk/forum/352312-28-steamroller-speculation-expert-conjecture/page-77#10881979
(this thread seems to move faster than I can post... I might be getting obsolete at typing, lol)

Yes, going below 10nm there is clearly a barrier for current lithography and other techs.

But 10nm would already be some 4x denser than current 28nm (conservatively, since some structures don't scale well)... a 500mm² Nvidia Titan in only ~125mm²... it could easily mean chips at least 3x smaller than now, and more than 20 threads per chip, even for desktop/client offerings, is at hand.

The big problem is exactly IPC and the instruction-level parallelism of current ISA implementations. You could have a single core 3x as big, yet it wouldn't be 3x as performant; most probably not even 2x. That is why power is such a big concern even with advanced power management, and why Moore's Law is said to be slowing down: it just doesn't make sense to make things as big or as feature-rich as possible, all the more because the interconnect would grow tremendously, making performance even worse (hence the photonics push)... new software or a new ISA must be invented...
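
(A rough cross-check via Pollack's rule, the old heuristic that single-thread performance grows with about the square root of core complexity: a core 3x as big buys only around sqrt(3) ≈ 1.7x the performance, well short of 2x.)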

So at least *multithreading*, even for client applications, must be on the menu for the future, and that is why in the long term ARM has a good chance of being where x86 is now. ARM has both power and multithreading capabilities above x86. I don't think FPGAs will take the spotlight, since heterogeneous designs with a good dose of superior fixed-function hardware can take the wind out of FPGA sails. A truly specialized solution will always have a market, but not one that ends up turning FPGAs into general purpose.

So ARM with its "relaxed" memory model is simply better for performance with multiple threads, which might seem a contradiction since it is used mostly for low-power, lower-thread-count portable devices, but it's true. There is a reason why many implementers are already pushing ARM on servers; it's not only power or "micro-servers", it is better. The 64-bit version especially is very clean, and a relaxed memory model means a transactional memory model is almost half done. I see it going from microservers to big iron if there are enough pushers.
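
To make "relaxed" concrete, a minimal C++11 sketch (my example, not from the thread): the hardware is free to reorder the relaxed stores, and ordering is paid for only at the explicitly marked points. A strongly ordered ISA like x86 makes every store effectively a release, while a weakly ordered ISA like ARM emits barriers only where the program asks.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);     // a weak ISA may freely reorder this
    ready.store(true, std::memory_order_release);  // publishes every write before it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}    // pairs with the release above
    assert(data.load(std::memory_order_relaxed) == 42);  // guaranteed by acquire/release
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}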

What is clearly a surprise is the recent times push for the single thread mantra... its *contra-natura*... and will not stick for long. Oh! well its clearly the market is driven by a lot of propaganda, propaganda is a considerable force even in IT, not only politics, but doubt it could arrest progress indefinitely. IT has been able to evolve faster than marketing campaigns, that is why traditional marketing with pervasive TV and newspaper ads, in IT, simply just doesn't make sense...
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Would be nice to know what's going on, but we probably won't hear much officially until the Q2 earnings call on July 19th. Right now it's all about Jaguar.

Global Foundries has a booth at DAC next week, with one of the headlines:
"Advanced Technology: 28nm ready and ramping, and next is 20LPM and 14XM."

http://www.semiwiki.com/forum/content/2342-global-foundries-does-dac.html
 
And that is my explanation for what we see in our consoles.
Today we can (and do) max out our HW; back then we didn't. The top GPUs and near-top CPUs were used then, and not so today, simply because what we have today is midstream HW in our consoles, though it matches the top of yesteryear in TDP.

So for those dismayed by the HW in our consoles: all we are seeing is the power/perf wall we are hitting. It's not Sony's or M$'s or AMD's fault, it's simply the laws of physics.
The litho standards we have now are nowhere near good enough for 10nm; it can be done, but if we don't get the new ultra litho going, we will never get lower.
Not using what we've been using, anyway.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


That is a consequence of all the turmoil with GloFo and AMD getting out of that partnership. In Dirk Meyer's time I suspect GloFo, under AMD's direction, was pushing SOI tech hard for 32 and 28nm. Then a new management was imposed by the principal owners of GloFo and everything went into reverse. The epilogue of the drama came last year with AMD getting out of GloFo entirely. But that last GloFo management also went bust; it was all bulk "mantra" and SoC in everything down to the underwear, with the superior SOI tech even regarded as inferior... yet it didn't manage to land one single top IDM implementer of the ARM SoC armada for its *supposedly* superior HKMG bulk high-performance 28nm process (or the low-power one, for that matter).

Qualcomm was there, tested, tasted, saw it, saw what they did to AMD... and said goodbye...

I think GloFo regained a lot more respect for AMD once the notice was clear that they could even lose the only top IDM customer they had left, with no prospects of gaining another one in short order. That is why the old 32nm PD-SOI was revamped by almost 15% and Richland happened... that is why the FD-SOI partnership with STMicro was possible... with all that, it is even possible that the FX/server variants of Steamroller are 28nm PD-SOI.

With all this "bulk" decision dancing (even at AMD) and the SOI improvements, it's no wonder things got delayed.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Yes, I saw other opinions about the possibility of it being real. If it's not, it's a damn good fake.

And it's powerful, alright. Data-speculation execution was being speculated about even before BDver1 was out; the rest seems pretty identical, encompassing evolutions even from the Cat (Jaguar) variants, which don't have a shared front-end. The double FlexFPU I floated myself in some discussions about the superiority of this modular design, back when everybody else was crying bloody failure and that it should die... along with the possibilities in pieces like Cinebench.

What I'm curious about is what exactly all those new CAM structures around each core/cluster are.

Also, it seems AMD did a terrific job of shrinking some structures independent of the process shrink, notably CAM structures, which squares with some rumors. That makes it about more than just design libraries and the 30% smaller FlexFPU at the same node that was officially revealed.

 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Apparently it's to satisfy some new EU requirements for traditional low-power modes, as the consoles are entertainment hubs and not just game machines. There are also some Energy Star compliance guidelines that are voluntary, but they want the seal for marketing reasons.

http://ec.europa.eu/enterprise/policies/sustainable-business/ecodesign/product-groups/sound-imaging/files/console_maker_proposal_en.pdf

Game consoles released from January 2013 consume at a maximum:
1 watt in standby mode
20 watts in video playing/streaming mode


Jaguar is low power, but it's not THAT low power even when everything is turned down. So they need an "always on" processor to do that.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Hmm, how is Samsung already making 10nm NAND memory?
 


Overall performance will increase, provided you have the hardware. Scalability just means the ability of a piece of software to attain higher performance as hardware resources increase. There are real physics limitations on how deep ("fast") you can make a single processing unit, whereas no such limitation exists on how wide ("many") you can make that same unit. The entire SPARC ISA was designed with this very principle in mind: that it's easier to go many for total system performance than to go deep.

Of course, the benchmarks used on this site inevitably revolve around "gaming", which has historically been difficult to code for the "many". It's too interactive, and every step requires the solution of several other problems before the next decisions can be made; this creates large, complex dependency chains and results in code chokepoints and bottlenecks. Compare this to industrial applications like rendering, where all the work is known ahead of time and can be evenly and exactly distributed amongst all processing resources. Or database/web apps, where processing requirements scale with the number of simultaneous connections/users, so all the extra hardware can be put to use servicing the higher demand. As gaming software evolves, it too will start to make use of wider processors, especially as it takes on problems that resemble the previous two industrial examples.
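
A minimal sketch of why rendering-style work divides so cleanly (my example, assuming C++11; render_tile is a hypothetical stand-in for the real per-tile work):

#include <future>
#include <vector>

// Hypothetical stand-in for rendering one tile of a frame.
// Every tile is independent, so the whole frame is known work
// that can be divided evenly across however many cores exist.
long render_tile(int tile_id) {
    long acc = 0;
    for (int i = 0; i < 1000000; ++i)
        acc += (tile_id * 31 + i) % 7;
    return acc;
}

int main() {
    const int tiles = 64;
    std::vector<std::future<long>> jobs;
    for (int t = 0; t < tiles; ++t)   // hand out all tiles up front
        jobs.push_back(std::async(std::launch::async, render_tile, t));
    long frame = 0;
    for (auto& j : jobs)              // no tile ever waits on another tile
        frame += j.get();
    return 0;
}

A game frame, by contrast, is a chain: input, then AI, then physics, then render, with each stage needing the previous stage's answers, which is exactly the dependency problem described above.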
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


?! ^ ... It probably is testing, which is perfectly alright. There have been test runs even for 5nm transistors, which doesn't mean the process and its tools and techs are anywhere near ready for production. One thing is making a few test runs; another is making whole wafers with good enough yields and performance consistency to be in production.

 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


It's been in production since last year. They have some new 3-bit-per-cell parts in production as well. Granted, memory is different from logic, but it's still a lithography process.

http://www.engadget.com/2013/04/11/samsung-nand-flash-production/

"Samsung started production of 10nm-class 64Gb MLC NAND flash memory in November last year, and in less than five months, has added the new 128Gb NAND flash to its wide range of high-density memory storage offerings. "


 

mayankleoboy1

Distinguished
Aug 11, 2010
2,497
0
19,810


And this is exactly what I have been saying. It is not easy to parallelize the common, day-to-day programs and get performance out of it. Even games are not inherently breakable into smaller pieces. The tasks that are trivial to break into smaller pieces are already being broken up, using a GPU.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780




No, that 10nm indication, I strongly suspect (almost 100% sure), is not the process node but the size of some structure in the cell or flash transistor (probably the channel)... or they are counting the "nm" level by factoring in the number of bits per cell, which would make it a 25 to 20nm process...

Even much bigger process nodes have structures 10nm in length or width, or even quite smaller... STI (shallow trench isolation) comes to mind; even at 45nm those structures are pretty small. OTOH, some CAM structures or cascaded transistors can have feature sizes above 200 or 300nm in length, even on 32 or 28nm processes.

Usually the node size correlates directly with gate length and the masks' average feature size, though I don't know what the gate-length norm for flash techs is. Nevertheless, if it's a 20nm "node", that is already very good for flash.
 




Define "day to day programs". Web-browsing can be done in parallel , Office apps can easily be done in parallel, media encoding / playing is the easiest thing to encode in parallel. Out of all the "day to day programs" only gaming is resistance to programming parallel and only due to the myriad of dependencies created from established methods that were developed in a serial fashion. Converting code that was designed and mapped out serially (flowchart style process designs) into parallel code is extremely hard, you can implement this method or that method parallel but then you'll smack into a code choke-point and be stuck in idle. You need to design the whole damn thing to be parallel form scratch, the entire process flow needs to be focused on not creating code choke-points. You change from looking for things to write in parallel into ensuring your design doesn't create situations that would prevent you from processing in parallel. Of course doing this will mean that you sometimes have to use a different method that would be less efficient if only to ensure you don't choke yourself later. So you will lose some performance but also gain in scalability that would then overcome that initial performance penalty.
 

mayankleoboy1

Distinguished
Aug 11, 2010
2,497
0
19,810
Everything can be done in parallel, in theory. It is only when you actually try to redesign the functionality that you realize how difficult it is, and many times you just can't come up with an algorithm that keeps the same functionality, stays backward compatible, and yet can be broken down into smaller, fairly independent tasks.
If AMD/Intel could come up with an SMT web browser, that company would have a high chance of totally dominating the chip market.

Web browsing can be done in parallel, office apps can easily be done in parallel, and media encoding/playing is the easiest thing to do in parallel.

Web browsers are notoriously difficult to make parallel. On top of that, C/C++ are not inherently thread-safe languages, and the languages developed to be inherently thread/memory safe are too limited and slow.
That is the reason Mozilla is developing a language (Rust) designed to be fast, memory safe, and parallel. Reading their GitHub posts, the major problem they face is that they are parallel but don't perform too well; the reason is locking, message passing, synchronization between the main thread and side threads, etc.

Media transcoding is trivial to parallelize, so I do not see your point.
The much bigger challenge is to parallelize the hard-to-parallelize data, which usually means small data streams. This is particularly important for AMD, as they are betting their future on HSA, which is basically a "parallelize everything" approach, for now.

Regarding office apps, let's talk about basic things like search-and-replace: that is still single-threaded. Saving a DOCX to a PDF: still single-threaded. Mostly the graphics and data-crunching parts are multithreaded, which again is easy, no, trivial to do.

PS: Compiler tricks like OpenMP work only on the appropriate data types, such as HPC, graphics, or big data, and even then only when you write the loop and its data dependencies in such a way that each iteration is independent. A strict compiler won't auto-parallelize even an iota of unsafe code. This is easy when the data to be processed is independent, as in HPC, but for complex things like browsers it is useless, and it oftentimes leads to a perf regression: the code is threaded, but due to excessive mutexes, perf slows down.
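
A minimal OpenMP sketch of that dividing line (my example; compile with -fopenmp, and note the pragma itself trusts the programmer, so correctness is on you):

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    // Safe to parallelize: each iteration touches only its own index.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // Loop-carried dependency: iteration i needs the result of i-1.
    // Slapping "parallel for" on this would silently give wrong answers.
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] * 0.5 + b[i];

    std::printf("%f %f\n", c[0], a[n - 1]);
    return 0;
}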

Edit: Could anyone direct me to a multithreaded calculator or text viewer for desktops?
 

That makes a lot of sense, but I don't expect SOI to be viable going forward, especially heading into FinFETs and trying to reduce power consumption. I thought both Steamroller and Excavator would be on bulk. I do wonder if they can even get much performance gain going from 32nm SOI to 28nm bulk.
 

Let's not take any 'logical leaps' or read too much into the 'tone of comments'. Since he's being so vague about the OS and its related stuff, I'll just leave it at that. From the second link, I understand that the secondary CPU is part of the southbridge, one of the many dedicated hardware blocks that run specific tasks (reminds me of the 'ninja core'), in this case system monitoring and background downloading, for energy-efficiency purposes. Seems like a 'connected standby' chip with a bit more muscle.
That term, 'secondary OS CPU', is never mentioned by Cerny; it's from some other guy's post in the comments section. Again, nothing about running the OS. However, it could easily be that Cerny and others are keeping it under wraps, or that it is still under development.
At this point, I'm gonna agree with esever's hypothesis and not read too much into things until more credible information emerges.

If Cinebench is so biased, AMD should really stop using Cinebench. They should officially and explicitly denounce Cinebench for CPU benchmarks instead of putting it in their own promotions. They should also ask hardware reviewers, and put it in the review guides, not to use Cinebench. IIRC they denounced some other benchmark, claiming it was partial towards Intel or something; that was good. They should do the same with Cinebench and start using software like HandBrake and 7-Zip for CPU benching (I use both, so they'd be far more useful for me :)).
I think the guy in the blog was quoting promo slides from the Steamroller unveiling event (as if those were the truth, nothing but the...). The raving M.I.L.F.'s (mindless Intel-loving fanboy) comments just made things more amusing. :lol: The writer likely added ~30% to every BD/PD benchmark and assumed CPU model numbers with assumed clock rates. IMO, it shouldn't be that hard to gain a 30%-45% performance improvement over an 'unmitigated failure' (not my words :D). Even easier if you delay the product for over a year, then launch it and claim 'we prefer quality over haste(!)'.

HSA Foundation Announces Its First Specification.
HSA Foundation Delivers Programmer's Reference Manual
http://www.xbitlabs.com/news/other/display/20130529180931_HSA_Foundation_Announces_Its_First_Specification.html
AMD Will Receive $60 - $100 for Every SoC for Next-Gen Game Consoles - Financial Analysts.
AMD to Benefit from PlayStation 4, Xbox One, Even If They Are Unsuccessful
http://www.xbitlabs.com/news/cpu/display/20130528222526_AMD_Will_Receive_60_100_for_Every_SoC_for_Next_Gen_Game_Consoles.
A little Jaguar redux (and why Intel and MS are/will be at each other's throats soon, giving an advantage to AMD):
http://www.anandtech.com/show/6992/amd-opteron-x1150-x2150-kyoto-kabini-heads-to-servers
 
First, to clarify: I said 10nm can be done using current methods, but cost-effectiveness is yet to be seen.

Going by Anand's link, it appears Intel may have truly missed the mark.
While shrinking its CPUs on the world's best process, it supplies its strongest iGPUs only in its top-of-the-line HSW to compete with AMD, even forfeiting some power and heat in doing so.
Meanwhile, AMD went smaller with their iGPUs and opened new capabilities in this low-power, low-end, many-core solution.
Sounds like everything LRB was supposed to be.
 
Do we really have to wait till 10nm? I think the Common Platform alliance can effectively put pressure on Intel at 14nm mobility nodes, even if GloFo fails to deliver (typical) in time. Intel is already behind; their new products seem to be SKU-ed in a way that protects their high-end lineup instead of aiming for higher revenue. IMO the whole tech world is at a major transition stage right now, and pretty much everything we've taken for granted over the past few years is up for grabs.
Qualcomm and Samsung have a huge opportunity here. AMD might have a chance as well... but they suffer from chronic poorus executionitis and a recurring 'loser mentality' despite having the most robust and 'agile' IP portfolio. Unless RR gets rid of those problems, AMD will be the new second/third best behind Qualcomm and/or Samsung.
 

jdwii

Splendid
Cinebench is not biased; it correlates quite well with real rendering work. Plus, the module design does share an FPU, which is one of the main issues. We can tell AMD barely touched it with Piledriver, and performance per clock, or "IPC", still went up.
 
Needless to say, two FPUs per module should see results improve dramatically; by how much is what we will speculate on until the cows come home. All I want to say is that people got on Rory Read's case, but ultimately he was the man charged with putting the pieces back together after Ruiz left them in a mess, and since the new regime took over, not only has the bleeding stopped, analysts are predicting AMD to be heading north from here on. Tough decisions are the reason you get the top job, and while it hasn't been easy, and it's not over yet, at least AMD is far better off now than it was 18 months ago.

As for the diagram, I hope it's Steamroller, but then again we do know that Jim Keller will influence Excavator along with the others. So it's just a waiting game now.
 


From a design standpoint, a unified architecture makes a lot more sense than what we currently have, at least when using a 64-bit OS [otherwise you run into significant memory problems; hence why I hope Win 9 officially kills 32-bit support]. It saves needing a middleware API to get access to the GPU. That being said, let's see the most powerful CPU + GPU combo one can reliably put on a single die at a reasonable cost.
 


Databases/web connections scale naturally: each connection is independent of the rest, and each database row is fully independent. Simply make each instance its own thread and process away (sketched below). That's why SPARC is still used in database computing, and it also explains why SPARC failed miserably on the desktop.
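
A minimal sketch of that thread-per-connection pattern (my example, assuming C++11, with simulated connections instead of real sockets to keep it self-contained):

#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for servicing one client connection.
// Each connection is fully independent of every other one.
void handle_connection(int conn_id) {
    long work = 0;
    for (int i = 0; i < 100000; ++i)
        work += i % (conn_id + 2);
    std::printf("connection %d done (%ld)\n", conn_id, work);
}

int main() {
    std::vector<std::thread> workers;
    for (int conn = 0; conn < 8; ++conn)        // one thread per "connection"
        workers.emplace_back(handle_connection, conn);
    for (auto& w : workers)
        w.join();                               // no cross-connection locking needed
    return 0;
}

More connections just mean more threads, which is exactly the "wide" workload a many-core box eats up.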

The stuff that does scale is being moved to the GPU, and that doesn't leave much left for the CPU to use. The reason the CPU isn't maxed anymore is simply that GPUs have gotten so powerful so fast [remember when they were only 2x the processing power of the CPU, rather than 200x?] that APIs have been moving more and more of the processing to them. The stuff that's left is almost all serial code that can't be broken up [at least not in a way that increases performance].

And I fully expect that trend to continue: as GPUs gain 20-30% performance per generation, more and more processing will move to them. Physics is next, and I wouldn't be surprised if AI follows in the next two decades or so. The only reason we still NEED a CPU is that a single execution unit on a GPU sucks, performance-wise, in comparison; if GPUs had maybe half the IPC of a CPU per "core", the need for CPUs would vanish. Frankly, in 20-30 years I don't expect CPUs to still be around in the traditional sense; I expect a single execution unit that looks a lot more like a GPU than a CPU.
 


Web browsing can be done in parallel

Scales naturally per connection.

Office apps can easily be done in parallel

Scales naturally per workspace.

media encoding/playing is the easiest thing to do in parallel.

You have a media format that looks something like this:

[Header Data]
[Data Chunk]
[Data Chunk]
...
[Footer Data]

Converting the individual data chunks scales naturally; one thread per chunk. Easy.
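
A minimal one-task-per-chunk sketch (my example, assuming C++11; encode_chunk is a hypothetical stand-in for a real codec):

#include <future>
#include <vector>

// Hypothetical stand-in for encoding one independent [Data Chunk].
std::vector<unsigned char> encode_chunk(const std::vector<unsigned char>& raw) {
    std::vector<unsigned char> out(raw.size());
    for (size_t i = 0; i < raw.size(); ++i)
        out[i] = raw[i] ^ 0x5A;                 // fake "compression"
    return out;
}

int main() {
    // 16 chunks of 4 KiB each, as if split out of the container above.
    std::vector<std::vector<unsigned char>> chunks(
        16, std::vector<unsigned char>(4096, 1));

    std::vector<std::future<std::vector<unsigned char>>> jobs;
    for (auto& chunk : chunks)                  // one task per chunk, order-free
        jobs.push_back(std::async(std::launch::async, encode_chunk, std::cref(chunk)));

    for (auto& job : jobs)                      // reassemble in header order at the end
        job.get();
    return 0;
}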


Hence my point: there is stuff that is trivial to make parallel, and we did it, a long time ago. And I'm telling you: games do not scale well. The parts that DO scale we made parallel about 20 years ago; that's why we have GPUs and don't do everything on the CPU anymore.
 
On top of that, C/C++ are not inherently thread-safe languages. The languages developed to be inherently thread/memory safe are too limited and slow.

C/C++ is PERFECTLY thread-safe; the problem is that the pthread library SUCKS (sorry, the inability to suspend a thread makes pthreads functionally broken. I'm sorry if I pissed off every Linux/BSD/Unix/OSX fanboy out there, but your threading library is horrid) and that C++ lacked a native threading API until C++11 arrived, forcing you to use OS-dependent threading mechanisms.

Secondly, thread safety is up to the developer, not the API. It's up to the developer to make sure resources are maintained in the proper state. If I'm writing to a resource from two different threads that can potentially be running at the same time, then you'd better damn well be sure both threads are using the proper synchronization methods. Problem is, when you have 100 devs working on a program, if you forget ONCE, you are screwed.
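
A minimal C++11 sketch of that discipline (my example, using the standard <thread>/<mutex> API rather than raw pthreads):

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> shared_log;   // the shared resource
std::mutex log_mutex;          // every writer must take this, every time

void writer(int id) {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> guard(log_mutex); // forget this once, and a
        shared_log.push_back(id);                     // concurrent push_back can corrupt the vector
    }
}

int main() {
    std::thread a(writer, 1), b(writer, 2);
    a.join();
    b.join();
    std::printf("%zu entries\n", shared_log.size()); // always 2000 with the lock held
    return 0;
}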
 