The discussion is about the benefits of moving the GPU onto the CPU die. Until the x86 cores are less than ~20% of the total die area I think performance will be too constrained and that enthusiasts will prefer discrete solutions.
The 14nm node opens the door to the design of an 8-core APU, where 8 big cores would represent about 25% of the total die area. Once one has 8 big cores inside an APU, the need for a separate dCPU vanishes.
The real game changer is the 10nm node.
jimmysmitty :
Will we eventually see a powerful GPU on die? Absolutely. I think it is further off than people think though, as it has been said the biggest issue is moving all that extra heat that a dGPU with even GTX680 performance would create.
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
Memory bottleneck is solved by HBM/HMC memory. AMD is already integrating HBM on its next APUs. Nvidia is doing the same (APUs with up to 1.6TB/s BW). Intel will start selling a 'CPU' with packaged MCDRAM (500GB/s BW) next year.
For the sake of comparison, a GTX680 has 192 GB/s BW available.
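For anyone who wants to check these figures: peak memory bandwidth is just the effective transfer rate times the bus width. A quick sketch in Python; the GTX680 numbers (6008MT/s effective GDDR5 on a 256-bit bus) are its public board specs, not something stated in this thread:

```python
# Peak memory bandwidth = effective transfer rate (MT/s) x bus width (bytes).
def peak_bw_gbs(mts, bus_bits):
    """Peak bandwidth in GB/s from transfer rate in MT/s and bus width in bits."""
    return mts * (bus_bits / 8) / 1000

# GTX680: 6008 MT/s effective GDDR5 on a 256-bit bus (public board spec)
print(peak_bw_gbs(6008, 256))  # ~192.3 GB/s, the figure quoted above
```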
what would you like to say to those who use features like quad sli or crossfire?
and in 2020, if someone tries to play a game on his 4-monitor setup, each at 8K resolution, then how would these APUs be able to store the game data needed for processing by the iGPU? would AMD use the crystals of some broken crystal ball?
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
so, uhm.. what type of memory does PS4 use? i guess slow ddr is ... slow.
1.) K12 is not ARM only, that was my point. Yes, they refer to the entire subsystem as K12 for BOTH uarches, but the new x86 cores are K12. FYI: x86 is still not going anywhere...
2.) HEDT is growing: http://www.rockpapershotgun.com/2014/02/04/the-pc-gaming-market-is-flipping-enormous/
At the end of the day, I'd say that Intel's chances for long term success in the tablet space are pretty good - at least architecturally. Intel still needs a Nexus, iPad or other similarly important design win, but it should have the right technology to get there by 2014. It's up to Paul or his replacement to ensure that everything works on the business side.
4.) I said specifically, "AMD will not use HTT". What part of that did you not understand? You said that they would. I can promise you they will not license HTT from Intel, in case your confusion is confounding you, let me link you to SMT: http://en.wikipedia.org/wiki/Simultaneous_multithreading. There are many types of SMT. DO NOT confuse HTT as being the only version of SMT, and CMT is a version of SMT.
5.) LOL...right...10 TFLOP APU...ok...I will believe that when I see it...maybe when they get to picometer nodes...(you still provided no source). Besides, when there is a 10 TFLOP APU, there will likely be a 40-50 TFLOP dGPU that uses only 2x power of the APU.
6.) No source from you...and my SDK says ALL 8 CORES are available for game loads...
7.) I said the patents registered mentioned the highest frequency on the APU to be 2.75 GHz. Additionally...the PS4 is clocked about 2.0 GHz...so your prediction was wrong. Where is your crystal ball now?
This time I include both your first try at an answer and your second try. It is interesting to see how your answers evolve over time, yet continue showing the same disparity with what AMD says.
AMD: K12 is a new high-performance ARM-based core.
8350rocks (1st try): No. K12 != ARM.
8350rocks (2nd try): K12 is not ARM only, that was my point.

AMD: The TAM for x86 is decreasing, while it's increasing for ARM.
8350rocks (1st try): No. The TAM for x86 is increasing and it is gaining ground against ARM.
8350rocks (2nd try): HEDT is growing.

AMD: We will abandon CMT and will return to a classic SMT design.
8350rocks (1st try): No. AMD's new cores will be based on a redesigned CMT architecture (note: CMT is a version of SMT).
8350rocks (2nd try): I said specifically, "AMD will not use HTT".

AMD: We promise a 10TFLOP APU by 2020.
8350rocks (1st try): No. AMD cannot produce a 10 TFLOP APU for doing anything right now.
8350rocks (2nd try): LOL...right...10 TFLOP APU...ok...I will believe that when I see it...

Sony: Only six cores are available to games; two cores are reserved by the OS and for background/dev tasks.
8350rocks (1st try): No. Eight cores are fully available to games because the OS is run on a separate chip.
8350rocks (2nd try): No source from you...and my SDK says ALL 8 CORES are available for game loads...

Sony: Jaguar cores are clocked at 1.6GHz.
8350rocks (1st try): No. Jaguar cores are clocked at 2.75GHz.
8350rocks (2nd try): I said the patents registered mentioned the highest frequency on the APU to be 2.75 GHz. Additionally...the PS4 is clocked about 2.0 GHz...so your prediction was wrong. Where is your crystal ball now?
juanrga :
Some benchmarks of mobile Kaveri against Haswell ULV
Nice to see benchmarks of the FX APUs that I mentioned are coming
I expect to see a huge performance jump going from a Trinity (Richland) APU on a laptop to a Kaveri, since Kaveri has around 15% more performance per clock compared to the Phenom in my testing, and it's clocked higher as well: 2.7GHz vs 2.5GHz for the A10-5750M. Probably a good 20% boost on average in CPU performance, with a nice boost in GPU performance since it's going to use GCN. Honestly, I might upgrade my old Llano laptop to this if I can find a good laptop with the best APU for $650 or less.
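As a rough check of that estimate (the ~15% per-clock figure is the poster's own testing claim, not an official number), the IPC gain and the clock bump compound to about 24%, consistent with the "good 20% boost" guess:

```python
# Compounding the poster's own estimates (not official figures).
ipc_gain = 1.15          # ~15% more performance per clock (poster's testing)
clock_ratio = 2.7 / 2.5  # Kaveri at 2.7GHz vs A10-5750M at 2.5GHz
print(f"estimated CPU uplift: {(ipc_gain * clock_ratio - 1) * 100:.0f}%")  # ~24%
```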
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
so, uhm.. what type of memory does PS4 use? i guess slow ddr is ... slow.
i was expecting a reply from juan...
it is very unusual that he'd take this long to answer such a simple query. maybe real life business or something...
what would you like to say to those who use features like quad sli or crossfire?
I would say the same thing that I said the first time, and the second time, and the third time... that I was asked about that.
truegenius :
and in 2020, if someone tries to play a game on his 4-monitor setup, each at 8K resolution, then how would these APUs be able to store the game data needed for processing by the iGPU? would AMD use the crystals of some broken crystal ball?
Since you seem so confident, just let me know: how much DRAM memory do these APUs have?
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
Memory is nowhere near the main bottleneck in a system right now, and while faster RAM would be nice in servers, it is not what is needed for an APU to shine, as current APUs do taper off.
The PS4 APU is actually a bit underwhelming if you look at it from a GPU standpoint. It has the same number of shaders as an HD7870, yet even counting the CPU portion it is slower in TFLOPS (1.84 vs 2.54) than an HD7870.
And pulling 200-300W of heat off a CPU is fine. It would be the 250-300W of heat from the GPU plus the 80-150W+ from the CPU that would be hard for air coolers to handle as easily.
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
so, uhm.. what type of memory does PS4 use? i guess slow ddr is ... slow.
The PS4 uses GDDR5 @ 2750MHz as system memory. DDR3 @ 3000MHz would offer only about one half of the bandwidth.
I guess your computer's DDR3 system memory peaks at something like 34GB/s. The PS4 system memory peaks at 176GB/s.
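For the curious, here is where both of those numbers come from; the dual-channel DDR3-2133 configuration on the PC side is an assumption for illustration:

```python
# Same formula as before: effective transfer rate (MT/s) times bus width in bytes.
def peak_bw_gbs(mts, bus_bits):
    return mts * (bus_bits / 8) / 1000

# PS4: 2750MHz GDDR5 moves data on both edges of its write clock -> 5500 MT/s, 256-bit bus
print(peak_bw_gbs(5500, 256))  # 176.0 GB/s
# PC (assumed): dual-channel DDR3-2133 on 2 x 64-bit channels
print(peak_bw_gbs(2133, 128))  # ~34.1 GB/s
```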
Today we can extract heat from 220-300W CPUs using air cooling. The main problem for high-performance APUs has been the memory bottleneck of PCs. AMD can push a 2TFLOP APU for the PS4 because the console is using a fast system memory beyond the slow DDR memory used in PCs.
so, uhm.. what type of memory does PS4 use? i guess slow ddr is ... slow.
The PS4 uses GDDR5 @ 2750MHz as system memory. DDR3 @ 3000MHz would offer only about one half of the bandwidth.
I guess your computer's DDR3 system memory peaks at something like 34GB/s. The PS4 system memory peaks at 176GB/s.
how is gddr5 "beyond" the "slow" ddr memory in pcs?
The PS4 uses GDDR5 @ 2750MHz as system memory. DDR3 @ 3000MHz would offer only about one half of the bandwidth.
I guess your computer's DDR3 system memory peaks at something like 34GB/s. The PS4 system memory peaks at 176GB/s.
And yet CPUs work better with lower latency and GPUs work better with more bandwidth.
And yet CPUs work better with lower latency and GPUs work better with more bandwidth.
APUs are a compromise, plain and simple. Kaveri had to compromise CPU speed for GPU speed. There are some advantages though, like cost.
In maybe 5-6 years you'll be able to get consumer APUs with 3D memory at a reasonable price. It all depends on how quickly Micron/SKHynix/Samsung can ramp their production and get the costs down.
For reference, right now the only board you can get with 3D memory on it is $25k. I should be getting mine (for work) sometime this summer.
Memory is nowhere near the main bottleneck in a system right now, and while faster RAM would be nice in servers, it is not what is needed for an APU to shine, as current APUs do taper off.
The PS4 APU is actually a bit underwhelming if you look at it from a GPU standpoint. It has the same number of shaders as an HD7870, yet even counting the CPU portion it is slower in TFLOPS (1.84 vs 2.54) than an HD7870.
Memory bandwidth is the main bottleneck for current APUs. It is the reason why Richland increased native DRAM frequency from 1866MHz to 2133MHz. It is the reason why top APUs from Intel introduce L4 cache. It is the reason why the Xbox1 APU uses ESRAM and it is the reason why the PS4 APU uses fast GDDR5 instead of slow DDR3.
AMD had originally planned a faster top Kaveri APU with 6 CPU cores and a more powerful GPU (one that would hit 1TFLOP). That original APU concept used fast GDDR5 memory. However, there were problems with one of the memory suppliers, so AMD canceled its original plan, reduced the CPU cores to 4, lowered the GPU clocks, fused off the (now unused) on-die GDDR5 memory controller, and released the final top Kaveri APU that you can purchase in stores, the one that relies on slow DDR3 memory.
The PS4 APU has been designed to fit inside a console. The HD7870 has been designed to fit inside a desktop. Consoles are limited at about 100W TDP. Desktops can have up to 10x more TDP than consoles.
If you were to use an HD7870 for a console, you would cut its TDP to about one third to fit the 100W thermal limit, which implies lowering the clocks if you want to maintain the same number of execution units.
Assuming cubic scaling, cutting dissipation to one third corresponds to lowering the clocks by a factor of ~1.4, which reduces TFLOPS from 2.54 to ~1.81.
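A quick sanity check of that arithmetic in Python (assuming dynamic power scales roughly with the cube of clock frequency, via frequency times voltage squared, and TFLOPS scale linearly with clock at a fixed shader count):

```python
# Assumed model: dynamic power ~ f^3, TFLOPS ~ f at a fixed shader count.
hd7870_tflops = 2.54
clock_factor = 1.4               # clocks lowered by a factor of ~1.4
power_factor = clock_factor**3   # ~2.74x less power under cubic scaling

print(f"TDP falls to ~{100 / power_factor:.0f}% of the original")        # ~36%, about one third
print(f"TFLOPS at the lower clock: {hd7870_tflops / clock_factor:.2f}")  # ~1.81
```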
Then you would have a CPU+dGPU configuration that would be more expensive, less reliable, and clearly slower. Both Sony and Microsoft rejected a CPU+dGPU configuration for their consoles' hardware from the first minute they started thinking about the design.
It’s becoming clearer that memory bandwidth rather than memory latency is the next bottleneck that needs to be addressed in future chips. Consequently we have seen a dramatic increase in the size of on-chip caches, a trend that is likely to continue as the amount of logical gates on a chip will keep increasing for some time to come, still following Moore’s law. There are however several other directions based on novel technologies that are currently pursued in the quest of improving memory bandwidth:
I bolded the relevant part. He then reports some of the research directions, including stacked RAM, which I mentioned above in one of my former posts. From the HMC consortium FAQ:
What is the problem that HMC solves?
Over time, memory bandwidth has become a bottleneck to system performance in high-performance computing, high-end servers, graphics, and (very soon) mid-level servers. Conventional memory technologies are not scaling with Moore's Law; therefore, they are not keeping pace with the increasing performance demands of the latest microprocessor roadmaps. Microprocessor enablers are doubling cores and threads-per-core to greatly increase performance and workload capabilities by distributing work sets into smaller blocks and distributing them among an increasing number of work elements, i.e. cores. Having multiple compute elements per processor requires an increasing amount of memory per element. This results in a greater need for both memory bandwidth and memory density to be tightly coupled to a processor to address these challenges. The term "memory wall" has been used to describe this dilemma.
Why is the current DRAM technology unable to fully solve this problem?
Current memory technology roadmaps do not provide sufficient performance to meet the CPU and GPU memory bandwidth requirements.
What are the measurable benefits of HMC?
HMC is a revolutionary innovation in DRAM memory architecture that sets a new standard for memory performance, power, reliability, and cost. This major technology leap breaks through the memory wall, unlocking previously unthinkable processing power and ushering in a new generation of computing.
+ Increased Bandwidth — A single HMC unit can provide more than 15X the bandwidth of a DDR3 module.
+ Reduced Latency — With vastly more responders built into HMC, we expect lower queue delays and higher bank availability, which will provide a substantial system latency reduction.
+ Power Efficiency — The revolutionary architecture of HMC allows for greater power efficiency and energy savings, utilizing 70% less energy per bit than DDR3 DRAM technologies.
+ Smaller Physical Footprint — The stacked architecture uses nearly 90% less physical space than today's RDIMMs.
+ Pliable to Multiple Platforms — Logic layer flexibility allows HMC to be tailored to multiple platforms and applications.
To throw some numbers into the discussion: the HMC 1.0 specification already considers sustained bandwidths on par with a modern L3 cache from Intel. It is as if your CPU had access to a 100x bigger L3 cache.
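To put the FAQ's "15X the bandwidth of a DDR3 module" claim in perspective (assuming a DDR3-1600 module as the baseline, since the FAQ doesn't say which module it means):

```python
# The FAQ's "15X a DDR3 module" claim, with an assumed DDR3-1600 baseline.
ddr3_module_gbs = 1600 * 8 / 1000   # 64-bit module at 1600 MT/s -> 12.8 GB/s
print(ddr3_module_gbs * 15)         # ~192 GB/s from a single HMC unit, GTX680 territory
```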
As I mentioned above, AMD plans to offer HBM as an option for the K12 APU. I can also confirm that the ultra-high-performance design by Nvidia doesn't use any L3 cache, but an ordinary L1/L2 hierarchy and then stacked DRAM with a total throughput of about 1.6TB/s.
The PS4 uses GDDR5 @ 2750MHz as system memory. DDR3 @ 3000MHz would offer only about one half of the bandwidth.
I guess your computer's DDR3 system memory peaks at something like 34GB/s. The PS4 system memory peaks at 176GB/s.
how is gddr5 "beyond" the "slow" ddr memory in pcs?
I have mentioned how a GDDR5 module provides twice the bandwidth of a similarly clocked DDR3 module.
can you repost it, please? i follow this thread. i am absolutely certain that you have never mentioned how gddr5 is "beyond" the "slow" ddr memory in pcs. by that i mean that you have never provided any technical explanation on that matter. my query is sorta two-parter. part 1 - how is gddr5 beyond ddr3 i.e. what makes gddr5 so much more advanced than ddr3 in terms of technology and performance - assuming that's what you mean by "beyond". part 2 - how is ddr (i guess you mean ddr3) slower than gddr5?
and in 2020, if someone tries to play a game on his 4-monitor setup, each at 8K resolution, then how would these APUs be able to store the game data needed for processing by the iGPU? would AMD use the crystals of some broken crystal ball?
Since you seem so confident, just let me know: how much DRAM memory do these APUs have?
what ? i didn't got your question
i didn't born in pondicherry, didn't studied in Britain thus a little bit slow in English
so me want you question explain ( cavemen english )
juanrga :
Then you would have a CPU+dGPU configuration that would be more expensive, less reliable, and clearly slower. Both Sony and Microsoft rejected a CPU+dGPU configuration for their consoles' hardware from the first minute they started thinking about the design.
because they don't need swappable GPUs, thus a single-chip solution was better.
juanrga :
In fact their site correctly claims that this new memory technology is needed for exascale supercomputers:
the average joe doesn't use exascale; indeed, even an extreme joe's extreme desktop CPU is not an exascale CPU
Not a lot of details about the new micro-architecture are known at present. What is recognized for sure is that it will drop CMT in favour of some kind of SMT (something akin to Intel’s HyperThreading) technology to improve performance in both single-threaded and multi-threaded cases.
About the unfounded rumor that AMD would return to SOI, Keller admits that he is "happy" with the 14/16 FinFET process and that it "looks pretty good". He also admits he is targeting frequencies close to 4GHz for the K12 core.
16-core Steamroller? Not according to your crystal ball!
Alright, I am not dignifying your lunacy with another response.
If you do not post a link to a verifiable source documenting exactly any claims you make from here on out, I am just ignoring you...
Additionally, not one of your claims about ARM have panned out, and AMD themselves have not stated anything confirming any of your claims.
I predicted dGPUs will be replaced by APUs, because dGPUs don't scale up well (APUs will be faster). Some people here disagreed, and strongly so (including personal insults). I reproduced a quote from the Nvidia Research Team agreeing with me that discrete GPUs will be replaced by GPUs on the same die as the CPU. The quote was ignored and/or deleted in replies.
I found another well-known HPC expert, Jack Dongarra, who agrees with me that traditional dCPU (socket) + dGPU (PCIe), or even dCPU (socket) + dGPU (socket), doesn't scale up (I guess he has done the math, like I did):
Another problem that GPUs present pertains to the movement of data. Any machine that requires a lot of data movement will never come close to achieving its peak performance. The CPU-GPU link is a thin pipe, and that becomes the strangle-point for the effective use of GPUs. In the future this problem will be addressed by having the CPU and GPU integrated in a single socket
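To put rough numbers on that "thin pipe", here is a back-of-the-envelope sketch. The figures are assumptions taken from public GTX680 specs (~3.09 TFLOPS single precision, 192GB/s local bandwidth) and PCIe 3.0 x16 (~15.75GB/s in one direction):

```python
# Assumed figures: GTX680 peak compute and memory bandwidth (public specs),
# PCIe 3.0 x16 usable bandwidth in one direction.
tflops = 3.09e12    # single-precision FLOP/s
local_bw = 192e9    # on-board GDDR5, bytes/s
pcie_bw = 15.75e9   # PCIe 3.0 x16, bytes/s

print(f"local bytes per FLOP: {local_bw / tflops:.4f}")  # ~0.0621
print(f"PCIe bytes per FLOP:  {pcie_bw / tflops:.4f}")   # ~0.0051, ~12x thinner
```

Any workload that has to stream its data over the bus rather than keep it resident on the card sees that ~12x gap directly.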
I think the reason you are getting so frustrated while preaching your APU salvation to the heathens is that you are forgetting your audience. All of your arguments for the superiority of an APU over a dGPU are based off the comments of experts focused on HPC.
The average Tom's reader/forum lurker cares nothing about HPC workloads. They are concerned with whether or not their R9 780X will run BF7: Lunar Warfare. The Tom's editors know this which is why the System Builder Marathon every quarter devotes so much time to gaming benchmarks and analysis.
Gaming and HPC have different requirements, and PCIe bandwidth is not a bottleneck in typical games. Until this becomes a significant issue for games, it won't make sense for an enthusiast to move to an APU. Yeah, you crammed 3000 shaders with an HBM cache into an APU, but if you hadn't reserved 1/3 of the die for the CPU you could have 4500 shaders, which will perform better.
Putting 250+ watts into a MB socket will present its own problems in the consumer segment and may require a break with the ATX form factor to adequately cool with acceptable acoustics. AMD does not have the resources required to create a new platform in a market they control less than ten percent of.
APUs are excellent for certain use cases; if you are concerned about total system power you probably should use an APU. Enthusiast class desktops are not one of those cases.
(i)
I am working on an article about what the chips of the year 2020 will look like. The stuff is very interesting, because it includes a paradigm change in computer architecture, and I decided to share my knowledge about it in this thread, also adding some details about AMD's future plans and related plans from Nvidia and Intel.
(ii)
I provided some relevant math/physics details, such as the ratio of local compute to data movement for current silicon and the ratio for future silicon, showing why a dGPU doesn't scale up and has to be rejected.
This drew posts ranging from the educated "I don't think so" to the usual ad hominem from the same guys as always.
(iii)
Then I decided to share some quotes from real experts disagreeing with the laughable comments from those guys. Dongarra is a famous HPC expert, but the Nvidia Research Team is also behind the gaming GeForce cards.
(iv)
The same arguments apply to gaming. The only difference from HPC is the order of magnitude of the problem associated with the nonlinear scaling of future silicon.
I explained that gaming is evolving towards using the GPU for compute as well (e.g. physics or AI). Once you are using the GPU for compute, you have to confront problems that are very similar to those mentioned for HPC.
(v)
I explained that GPUs for gaming are not designed in a vacuum. The disappearance of the top-end discrete GPUs used for compute will affect the development of the cheaper gaming GPUs. The explanation is the same one I used here to predict that AMD wouldn't release Steamroller FX CPUs once the server roadmap was made public and showed the cancellation of the Steamroller Opteron CPUs.
The same people who didn't understand my argument then and told me "wait for the desktop roadmap" are the people who don't understand my argument now and believe that discrete GPUs for gaming will be released even after the Sun disappears.
(vi)
There is no serious problem with 250W sockets. AMD is already selling 220W CPUs and you can find 300W coolers on the market. Moreover, the expensive 200-300W APUs being designed for exascale supercomputers don't need to be reused for gaming, in the same way that the expensive 150W Xeons used in the fastest supercomputers are not found in gaming PCs.
(vii)
Beyond my HPC sources, I also have relevant sources from the gaming and rendering communities. E.g. I have material from a very well-known guy, and he agrees with me that GPUs for gaming will disappear. Not only do we agree on this, but we also agree on what the gaming hardware that replaces the GPUs will look like. He is already thinking about new advanced algorithms/code that will be used for future games. But this material is reserved for my future article; for now it is time to watch the possibly funny reactions of the many 'engineers', 'game-developers', and Mr-I-have-a-friend-at-AMD in this thread.
HPC != HEDT
If that were the case, then you would see 256-node home PCs with coprocessors for FP ops, and games would be designed to run on 100+ cores. Graphics would look like DreamWorks-quality CGI from Hollywood movies, and your average power bill in the States would run about $300-400 monthly.
Additionally...what are your qualifications? How is it you feel so much more qualified to speculate (being generous with that word, as it is mostly garbage you spout) about future PC technologies? There are many here, with MANY more years of experience in RELEVANT fields, who disagree with you, not just me. So...feel free to speculate away; however, I have a feeling your magic 8 ball is going to break soon...
Not a lot of details about the new micro-architecture are known at present. What is recognized for sure is that it will drop CMT in favour of some kind of SMT (something akin to Intel’s HyperThreading) technology to improve performance in both single-threaded and multi-threaded cases.
About the unfounded rumor that AMD would return to SOI, Keller admits that he is "happy" with the 14/16 FinFET process and that it "looks pretty good". He also admits he is targeting frequencies close to 4GHz for the K12 core.
From your same article:
What is, perhaps, more important is that one of AMD's official documents detailed a sixteen-core AMD Bulldozer-derived processor. If the company pursues this opportunity and goes for a 16-core chip featuring Steamroller or Excavator cores, its new chips based on the new micro-architecture will only be available in 2016 or even 2017.
I know nothing...and I know no one at AMD, yep...that is certainly provable with that statement above.
According to your magic 8 ball, there would be no more HEDT dedicated dCPUs. But what do we find? PLANS FOR A NEW HEDT dCPU!?! Holy cow...what's next chicken little? The sky is falling?
Memory bandwidth is the main bottleneck for current APUs. It is the reason why Richland increased native DRAM frequency from 1866MHz to 2133MHz. It is the reason why top APUs from Intel introduce L4 cache. It is the reason why the Xbox1 APU uses ESRAM and it is the reason why the PS4 APU uses fast GDDR5 instead of slow DDR3.
Ironically, the developers working on both PS4 and XB1 will all tell you that the decision to use ESRAM on the XB1 is its Achilles heel and is actually a hindrance rather than a help. The ability on the PS4 to address all memory as one coherent block is far more useful, and the ESRAM will only hinder XB1 performance moving forward (as we are currently seeing with lower FPS on cross-platform games versus what the PS4 can provide).
The PS4 uses GDDR5 @ 2750MHz as system memory. DDR3 @ 3000MHz would offer only about one half of the bandwidth.
I guess your computer's DDR3 system memory peaks at something like 34GB/s. The PS4 system memory peaks at 176GB/s.
how is gddr5 "beyond" the "slow" ddr memory in pcs?
I have mentioned how a GDDR5 module provides twice the bandwidth of a similarly clocked DDR3 module.
can you repost it, please? i follow this thread. i am absolutely certain that you have never mentioned how gddr5 is "beyond" the "slow" ddr memory in pcs. by that i mean that you have never provided any technical explanation on that matter. my query is sorta two-parter. part 1 - how is gddr5 beyond ddr3 i.e. what makes gddr5 so much more advanced than ddr3 in terms of technology and performance - assuming that's what you mean by "beyond". part 2 - how is ddr (i guess you mean ddr3) slower than gddr5?
My original words are quoted above in this same message. I mentioned how GDDR5 is beyond the "slow" DDR3. I didn't mention "why", but I can do it now that you ask me. GDDR5 runs its data clock (WCK) at twice the command clock, so it transfers twice as many bits per pin per cycle as DDR3; this is why a single 3000MHz GDDR5 module provides double the peak throughput of a single 3000MHz DDR3 module: 48GB/s vs 24GB/s, respectively.
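And to close the loop on the arithmetic (reading "3000MHz" the way DDR3 modules are marketed, i.e. 3000MT/s, with a 64-bit module assumed):

```python
# "3000MHz" read as 3000 MT/s on an assumed 64-bit module; GDDR5 moves
# twice as many bits per pin at the same command clock.
def module_bw_gbs(mts, bus_bits=64):
    return mts * (bus_bits / 8) / 1000

print(module_bw_gbs(3000))      # DDR3-3000: 24.0 GB/s
print(module_bw_gbs(3000 * 2))  # GDDR5:     48.0 GB/s
```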