AMD CPU speculation... and expert conjecture



Technically, dGPUs can access system memory, just like any external device. There's no architectural reason why they can't. The main reason to send a copy of the data over to the GPU is to avoid the really slow process of pushing a ton of data across the PCIe bus. It's the same reason CPUs have three levels of cache instead of reading everything out of main memory every time they need it.

VRAM, like CPU Cache, is an attempt to hide really slow memory transfers.
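
To make the two models concrete, here is a minimal sketch against the CUDA runtime API (CUDA only because it is the best-known example of the model; the kernel and buffer size are made up for illustration):

[code]
// Sketch: the two ways a dGPU can get at data, via the CUDA runtime API.
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                 // trivial stand-in workload
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);      // allow mapped host memory
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *host = (float *)malloc(bytes);

    // Model 1: copy into VRAM up front. The PCIe transfer is paid once,
    // and every access the kernel makes afterwards hits fast local GDDR.
    float *dev;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);

    // Model 2: zero-copy. The GPU reads system RAM directly, so there is
    // no bulk transfer, but every single access crosses the PCIe bus.
    float *pinned, *mapped;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&mapped, pinned, 0);
    scale<<<(n + 255) / 256, 256>>>(mapped, n);

    cudaDeviceSynchronize();
    cudaFree(dev);
    cudaFreeHost(pinned);
    free(host);
    return 0;
}
[/code]

The copy model is the default precisely because the zero-copy path pays the PCIe toll on every single access, which is the "really slow process" described above.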

You know, latency. There's a huge concern about tasks being overly affected by latency, even though the AMD HSA guide itself says HSA isn't meant for latency-sensitive workloads, and I don't really understand it.

If the added latency plus the reduced time to perform the task still beats lower latency plus a longer time to perform the task, it's a win.

Would you rather have 100ms of latency and a second to finish a task or 1ms of latency and 4 seconds to finish a task?

Depends on the workload in question. There are times when you accept a longer execution time for lower latency.

Simple example: FPS versus frame times. The i3 can still spit out 60 FPS in games, so you can obviously game with it, right? Oh right, the frame spikes. That's your latency. You still hit your 60 FPS average, even if the individual frame-time graphs look like crap. So in this case, I'd easily take 45 FPS with low latency over 60 FPS with latency that looks like a roller coaster.
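
To put numbers on that exchange, here is the arithmetic as a tiny sketch (the 100ms/1s and 1ms/4s figures are the hypothetical ones from the question above; the 16.7ms frame budget is an assumed example):

[code]
#include <cstdio>

int main() {
    // Total wall time = latency + execution time.
    double offload = 0.100 + 1.0;   // 100 ms latency + 1 s of work
    double local   = 0.001 + 4.0;   // 1 ms latency + 4 s of work
    printf("offload: %.3f s, local: %.3f s\n", offload, local);

    // For throughput, offloading wins easily: 1.100 s vs 4.001 s. But for
    // a latency-bound job -- say a result needed inside a 16.7 ms frame
    // budget (60 FPS) -- the 100 ms latency alone already misses the deadline.
    return 0;
}
[/code]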
 


I remember 5 years ago, back in the BD rumor thread, when I said that I would be reminding everyone in 5 years "I TOLD YOU SO".

In 5 years, I will be reminding everyone, yet again, "I TOLD YOU SO".
 

jdwii

Splendid


At 12%, the A10-7850K lands around an HD 7750:
http://www.techpowerup.com/reviews/AMD/R9_295_X2/24.html
 

juanrga

Distinguished
BANNED


HSA offers a huge performance benefit. This has been shown.

That fallacy exists only in your imagination. AMD will be using HSA. Nvidia and Intel will not. I already mentioned the alternative approaches used by Nvidia and Intel. I repeat: CUDA UVM and neo-heterogeneity, respectively.



You missed the earlier slide:

HSA-HUMA_05.jpg


Can you spot the APU?

You also missed the link given on the previous page.



HSA over PCIe 3 is nonsense, because HSA is about treating the CPU and GPU as first-class peers, whereas a dGPU in a PCIe slot is an accelerator/coprocessor.
 

juanrga

Distinguished
BANNED


You can maintain high profits when there is no competition. Now that Intel has announced its new 'CPU', Nvidia will lose most of its traditional profits in HPC, which means less money to spend on developing the next generation of hardware, which implies a poorer product... which will have to be sold cheaper to remain competitive, which means another cut in profits, and the cycle repeats until Nvidia goes bankrupt. The only way to avoid bankruptcy is for Nvidia to abandon GPUs and develop APUs/SoCs to compete against Intel and others, which is what Nvidia is doing.

But even ignoring the laws of economics, the fallacy of your argument is that you ignore the laws of physics. You can do that; Nvidia's engineers and scientists cannot.

Programmers care where the GPU is for two reasons: performance and difficulty of programming. As mentioned before, HSA brings simplicity of programming compared to the traditional CPU+GPU model.

In fact, thanks to the GPU being on die and accessing unified memory, game developers can do things that cannot be done on a traditional PC.

You can have the CPU and GPU cooperatively manipulating the same data at once. It is one of the advantages of HSA. This is why AMD claims that an HSA APU is more than a CPU plus a GPU.
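
(For concreteness, here is a minimal sketch of the single-pointer idea using CUDA's unified memory, the Nvidia analogue mentioned earlier; HSA's hUMA exposes the same concept on APUs. Note the sketch alternates CPU and GPU phases; truly simultaneous access to the same buffer additionally requires the platform's coherence support.)

[code]
// Sketch: one buffer, one pointer, visible to both CPU and GPU.
#include <cuda_runtime.h>

__global__ void gpu_step(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;                    // GPU updates the buffer...
}

int main() {
    const int n = 1024;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));  // no separate host/device copies
    for (int i = 0; i < n; i++) data[i] = i;    // ...that the CPU filled in place
    gpu_step<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // hand the buffer back to the CPU
    long sum = 0;
    for (int i = 0; i < n; i++) sum += data[i]; // read results, again no copy
    cudaFree(data);
    return sum == 0;
}
[/code]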

As everyone knows, throughput computing requires lots of memory bandwidth. This is why Nvidia and AMD sell their GPGPUs with 320GB/s GDDR5 memory instead of slow 29GB/s DDR3. It is also why Intel has replaced the slow GDDR5 in its compute cards with fast MCDRAM (>500GB/s). The APU/SoC designed by Nvidia uses stacked memory with >1.5TB/s.

By pretending to compare the performance of a top GDDR5 GPU to a weak APU constrained by slow DDR3, you are not fooling any Nvidia engineer.

I ask you again, what company will do your imagined GPUs?
 

Underlining mine. Interesting. Can you show how this is done? And where has AMD claimed this?
 
> You can maintain high profits when there is no competition. Now that Intel has announced its new 'CPU', Nvidia will lose most of its traditional profits in HPC, which means less money to spend on developing the next generation of hardware, which implies a poorer product... which will have to be sold cheaper to remain competitive, which means another cut in profits, and the cycle repeats until Nvidia goes bankrupt. The only way to avoid bankruptcy is for Nvidia to abandon GPUs and develop APUs/SoCs to compete against Intel and others, which is what Nvidia is doing.

Note again the assumption that APUs will somehow magically gain over 100% performance per year, every year, going forward, and thus drive NVIDIA bankrupt. You start with a fallacy to reach your desired conclusion to prove your fallacious argument.

> But even ignoring the laws of economics, the fallacy of your argument is that you ignore the laws of physics. You can do that; Nvidia's engineers and scientists cannot.

Nowhere have I stated that dGPUs will continue to scale; you can already see the generational gains starting to slow. My argument is that APUs will scale even WORSE, and never reach the same computing potential as dGPUs.

> Programmers care where the GPU is for two reasons: performance and difficulty of programming. As mentioned before, HSA brings simplicity of programming compared to the traditional CPU+GPU model.

> In fact, thanks to the GPU being on die and accessing unified memory, game developers can do things that cannot be done on a traditional PC.

> You can have the CPU and GPU cooperatively manipulating the same data at once. It is one of the advantages of HSA. This is why AMD claims that an HSA APU is more than a CPU plus a GPU.

You cannot have multiple cores manipulating the same memory address at the same time. Period. That's the fastest possible way to crash your entire system. One device must wait for the other to finish, unless you use a memory scheme similar to Transactional Memory (and as I noted the last time this came up, the cost of Transactional Memory can be very high in the worst case).

> As everyone knows, throughput computing requires lots of memory bandwidth. This is why Nvidia and AMD sell their GPGPUs with 320GB/s GDDR5 memory instead of slow 29GB/s DDR3. It is also why Intel has replaced the slow GDDR5 in its compute cards with fast MCDRAM (>500GB/s). The APU/SoC designed by Nvidia uses stacked memory with >1.5TB/s.

And why CPUs have multiple layers of high speed cache. And why dGPUs from the very beginning have been off die from the CPU.

> By pretending to compare the performance of a top GDDR5 GPU to a weak APU constrained by slow DDR3, you are not fooling any Nvidia engineer.

There is a difference between computational performance and memory performance. We can measure the maximum theoretical throughput of an APU even if you don't have the memory bandwidth to drive it. And the numbers I gave were exactly that: maximum theoretical performance at peak processing. Which means that even with infinitely fast memory, current APUs are over 500% slower than current dGPUs.
 


You can do it if you use something like Transactional Memory, which checks to see whether the memory has changed before saving the result. We had this discussion last year, remember? The downside to such a scheme is that if the memory HAS changed, you have to dump the result, put a software lock in place, and do the computation all over again. That's why you don't see it used often, since the worst case is worse than traditional threading models.

Of course, traditional threading models don't thread if the threads in question need access to the same memory, to avoid such a situation in the first place.
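
A minimal sketch of that check-and-retry idea, with a C++ atomic compare-exchange standing in for a real transactional-memory implementation (the shared variable and the "computation" are made up):

[code]
#include <atomic>

std::atomic<int> shared_value{0};

void add_ten() {
    int seen = shared_value.load();
    for (;;) {
        int result = seen + 10;   // do the computation on the value we read
        // Publish only if nobody changed the value in the meantime. On
        // failure, `seen` is reloaded with the current value and the work
        // is redone -- the bad worst case: conflicts mean wasted computation.
        if (shared_value.compare_exchange_weak(seen, result)) break;
    }
}
[/code]

The retry loop is exactly the worst case described above: under contention the computation is thrown away and redone, which is why a plain lock can win.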

But yeah, glad I wasn't the only one who noticed the violation of basic memory management.
 

colinp

Honorable
Easy question, Juan. When do you expect the top APU from AMD or Intel to have in excess of 25% of the performance of a top CPU + dGPU?
 

con635

Honorable
Looking at the PS4 APU alongside the speculation about the next-gen SkyBridge APU: fast memory, 8 small efficient x86 cores, HSA, an R7 265-ish GPU, and all for around 100W. Suppose it really is *truly* next gen in a way!
In 2-3 years, with around Haswell IPC, more GHz, and even less power consumption, those new 8 cores coupled with future-gen GPU cores and HBM will kill off any dGPU/dCPU below a current 290 (non-X) and i7, at a quarter of the power consumption. Maybe Juan has something.
 

yeah, there was a discussion about transactional memory last year. homework time for me.
 

szatkus

Honorable
Hey, let's do the math!

The R9 290X is a 438 mm^2 chip.
The A10-7850K is 245 mm^2 (47% of that is the iGPU, so the CPU part needs only around 130).

Assume that we have some new socket which supports a ~300W APU and super cool memory (GDDR5 or something).

(438 - 130)/438 ~ 70%

Ok, so today we would lose at least 30% of performance.

Let's go to 2020. Kaveri isn't a high-end CPU even today, so I want 8 future Intel cores (yeah, consoles, we need a lot of threads!) with L3 and an AMD iGPU (the Ultimate APU).

An 8-core Sandy Bridge is 435 mm^2 today.

10 nm:
Let's assume it will be 8 times smaller (because these cores will have slightly more transistors than SB).
435/8 ~ 54
(438 - 54)/438 ~ 87%
Much better.

7 nm:
435/16 ~ 27
(438 - 27)/438 ~ 93%
And we lose just 7% of performance by integrating with a CPU.

Still, there are a few other non-technical problems...
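
The same arithmetic as a small program, for anyone who wants to plug in their own numbers (die sizes are the ones quoted above; the 8x and 16x shrink factors are the assumptions stated for 10 nm and 7 nm):

[code]
#include <cstdio>

int main() {
    const double die_budget = 438.0;   // R9 290X die, mm^2
    // CPU-core area carved out of the budget: Kaveri's CPU portion today,
    // then an 8-core assumed to shrink 8x at 10 nm and 16x at 7 nm.
    const double cpu_area[] = { 130.0, 435.0 / 8, 435.0 / 16 };
    const char  *node[]     = { "today", "10 nm", "7 nm" };
    for (int i = 0; i < 3; i++)
        printf("%s: %.1f%% of the die left for the iGPU\n",
               node[i], 100.0 * (die_budget - cpu_area[i]) / die_budget);
    return 0;   // prints ~70.3%, ~87.6%, ~93.8%
}
[/code]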
 

jdwii

Splendid


Well, take into account that AMD will probably make a design that is a bit more efficient per unit of die space compared to a 295X, for example GCN 2.0. Remember, Juan is probably talking about ARM (like we'll be using that for our high-end rigs). ARM cores are tiny, wimpy cores with IPC around an Atom's, not actual CPUs that can do hard work like an i5/i7/FX. If you only allow 10% of the space for the CPU, then yes, Juan would be right.
 

szatkus

Honorable


I assumed that the rest of the die is occupied by some GPU from the future (GCN 5.0 or something :) ).
 

juanrga

Distinguished
BANNED


http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/

627x1000px-LL-760de547_AMD-ARM-TrustZone-portada.jpeg
 

juanrga

Distinguished
BANNED


You are twisting the original claim and then relabeling as "magic" what the rest of us call physics.

APUs scale better. Your beloved Nvidia engineers did the math.

Don't twist my words; I wrote "same data". The performance benefit of HSA has been shown with actual cores.

CPUs have multiple layers of high-speed cache on die to reduce the latency penalty of slow main memory. GPGPUs have multiple layers of high-speed cache plus a fast memory subsystem. A typical GPGPU from Nvidia or AMD has GDDR5 at >300GB/s.

When you mention "current APUs" you pick a slow APU bottlenecked by DDR3 memory and a top GPU that uses GDDR5. You also omit that we are making claims about future APUs, not about "current APUs". What part of "year 2020" is still not understood? You should read the "25x20" talk given by AMD (link given before).
 

juanrga

Distinguished
BANNED


It is not easy to predict, because it depends on lots of factors: market evolution, foundry roadmaps, memory makers' roadmaps...

What we know for sure is that next year the top GPGPUs from NVIDIA and AMD will start losing value:

# The fastest card from Nvidia costs $5300, has 12GB of GDDR5, is rated at 235W, and peaks at 1.43 TFLOPS (DP).
# The fastest 140W CPU is well below 1 TFLOP.
# Intel KL has 16GB of ultrafast MCDRAM (>500GB/s), is rated at 200W, and peaks at >3 TFLOPS (DP).
 

juanrga

Distinguished
BANNED


http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6031577

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7 to 6.0-fold improvement in the data-transfer time, when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance by 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more powerful GPU cores.


See? Current APUs are already faster in some compute tasks.
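
A toy model of why that reduction result is possible, with made-up numbers rather than the paper's: when the PCIe copy dominates total time, extra dGPU cores buy almost nothing.

[code]
#include <cstdio>

int main() {
    // Hypothetical figures for illustration only (not from the paper).
    double bytes       = 512e6;    // 512 MB input to reduce
    double pcie_bw     = 6e9;      // ~6 GB/s effective PCIe bandwidth
    double dgpu_kernel = 0.010;    // assumed dGPU kernel time, s
    double apu_kernel  = 0.050;    // assumed APU kernel time: 5x slower

    double dgpu_total = bytes / pcie_bw + dgpu_kernel;  // must copy first
    double apu_total  = apu_kernel;                     // shared memory: no copy
    printf("dGPU: %.3f s   APU: %.3f s\n", dgpu_total, apu_total);
    // ~0.095 s vs 0.050 s: the 5x slower kernel still finishes first.
    return 0;
}
[/code]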
 

juanrga

Distinguished
BANNED


As I mentioned many pages ago, the Nvidia design (10nm) devotes about 87.5% of the die space to the iGPU and only 12.5% to the 8 CPU cores.

A priori we might think that Nvidia would gain 12.5% more performance with a discrete GPU, because it would have 12.5% more die space, but engineers know that part of that extra die space would be spent on stuff such as a new memory controller and PCIe-like logic for the interconnect; as a consequence, the extra 12.5% performance is ultimately lost.

Note: we will not be using GDDR5 memory by then, because it is too slow and too power-hungry (it doesn't scale up). As mentioned before, those APUs will use stacked memory with terabyte-level bandwidth.
 

juanrga

Distinguished
BANNED


Wait, he used an 8-core i7 Extreme as the baseline. That is an x86 core. And later he considered a 25% bigger core.

You must be living under a rock; in recent weeks several companies have presented their HPC/server plans. They are designing (some are already shipping samples to customers) ARM cores with Haswell-Xeon IPC, but you can continue believing that ARM is only something used in phones. :lol:
 

juanrga

Distinguished
BANNED


I don't doubt that some people would prefer a 5x more expensive system that is 10x slower and requires a 4000W PSU plus a bank loan to pay the electricity bill. But they are a minority.

The history of computers is characterized by integration. If we eliminated the memory controller and the FPU from the die and went back to the past, we could upgrade the memory or the FPU without trashing the whole chip, but we would be purchasing something slower, more expensive, and more power-hungry.

Some people would even want to go further back in time. Why integrate ALUs, AGUs, caches, the branch unit... on the same die? Those people would want to maintain each piece separately and use truly upgradeable computers like this:

the-worlds-first-computer+invented.jpg


<end sarcasm>
 

colinp

Honorable


You didn't answer my question at all. I don't care about theoretical maximum FLOPS, I don't care about compute, I definitely don't care about KL. I only care about frame rates. You've said that by 2020 dGPUs will be obsolete, but you didn't give a roadmap to that goal - that's what I'm looking for.

Let me make it even easier. Suppose that by 2016 the best consumer-grade CPU + dGPU setup (let's call them the Intel i7-6790K and AMD R9-490X) can produce 60 fps in Crysis 5 at ultra detail on a 4K display. Will the best APU be able to manage 15+ fps?
 