News AMD Puts Hopes on Packaging, Memory on Logic, Optical Comms for Decade Ahead


bit_user

Champion
Ambassador
The only reason HBM costs more is because it is still a niche product.
Even if you're 100% correct about that, it still doesn't address my latter point (see updated post). I don't see consumer dGPUs going away, as long as there continues to be demand for that amount of compute/rendering power. It's too impractical to put all of that in a single package.

Apple tried, with the M1 Ultra, and the benchmarks (see Geekbench) show their GPU performance claims are wildly inflated. It's nowhere near a top-end dGPU, in spite of amassing a respectable aggregate memory bandwidth.
 

LawlessQuill

Prominent
Apr 22, 2021
9
2
515
I think GDDR is its own thing and the numbering has no direct correspondence with regular DDR memory standards.


Did you mean discrete GPUs getting phased out? Won't happen. The high-end GPUs will remain distinct from CPUs, for the foreseeable future. The main reason is GDDR memory, which has far higher bandwidth than is available to a CPU. It also turns out to be useful to upgrade your GPU without having to toss out your old CPU, as well.

At the low and even mid-range, we could see iGPUs with in-package memory eroding the market segment of dGPUs, but Apple's M1 Ultra shows that even packing like 8 channels of LPDDR5 in-package isn't enough to compete with high-end dGPUs. Don't believe me? Check its Geekbench scores.
https://www.notebookcheck.net/Preli...t-5-times-that-of-standard-DDR5.580580.0.html

No, I mean phasing out discrete CPUs. With the new advancements in bridge technology, direct storage, SAM, and AI optimizations, I personally believe CPUs could be relegated to a few chiplets attached to a GPU die, within the GPU itself. At a certain point, DDR RAM on the GPU could be possible as well (separate from VRAM).

Perhaps the new chiplet wouldn't be referred to as a GPU, but it makes more sense as one you could plug and play in any high-end PC, with its own cooling and I/O attached. A singular unit with very little space between every essential component, bridged like Infinity Fabric between all components for maximum bandwidth.

Not only is it possible, it's been proposed since 2017 by Jensen himself.

CPUs are slow, inefficient, and barely used to capacity. Even years into multithreading, a plethora of apps still use single cores to power them. The state of CPUs is quite honestly pathetic compared to the advancements in all other computer technologies.

 

bit_user

Champion
Ambassador
I blame the article's author for confusing DDR and GDDR. The only actual reference in the article is to GDDR7. Looking at her other articles, it seems this sort of technology development isn't her main beat, so we can't necessarily presume she knows what she's writing about.

No, I mean phasing out discrete CPUs.
That's a distinction without a difference.

A singular unit with very little space between every essential component, bridged like infinity fabric between all components for maximum bandwidth.
People seem to grossly overestimate the drawbacks of having a GPU sit over a PCIe (or similar) bus connection. The main win for integration isn't getting rid of PCIe, but rather that you can avoid duplicating assets in both GPU and CPU memory. So, it's mainly a cost savings.

The exception is in AI training & some HPC compute loads, which is why AMD is making the MI300 and Nvidia has their Grace + Hopper superchip idea.

Not only is it possible, its been proposed since 2017 by Jensen himself.
He's always going to spin things in a GPU-centric direction. Anyway, you can look at their Tegra line of SoCs, their DPUs, and their Grace + Hopper "superchip" to see where they're going. These are all for specialized applications and markets.

CPUs are slow, inefficient, and barely used to capacity.
Only if you put the wrong kind of workload on them. If you ran a CPU-oriented workload on a GPU, you'd have to draw exactly the same conclusion (except worse).

Even years into multithreading a plethora of apps still use single cores to power them.
Granted, programming models, languages, APIs, and even ISAs need to improve to support better utilization of multiple cores. That's still very different from trying to treat a GPU like a CPU; the two just aren't interchangeable.
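To put a rough number on it, here's a quick Amdahl's-law sketch (the 90%-parallel figure is purely an illustrative assumption, not a measurement of any real app):

```python
# Rough Amdahl's-law sketch: speedup when a fraction p of the work
# parallelizes perfectly and the remaining (1 - p) stays serial.
def amdahl_speedup(p: float, n_cores: int) -> float:
    """Ideal speedup on n_cores when fraction p of the runtime is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n_cores)

if __name__ == "__main__":
    for cores in (4, 16, 64, 1024):
        # Even a generously "90% parallel" app tops out below 10x,
        # no matter how many cores (or GPU-style lanes) you add.
        print(f"{cores:5d} cores -> {amdahl_speedup(0.90, cores):5.2f}x speedup")
```

Even at 1024 cores, the serial 10% caps the speedup just under 10x, which is why piling on cores doesn't rescue serial-heavy desktop code.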
 
  • Like
Reactions: TJ Hooker


InvalidError

Titan
Moderator
Even if you're 100% correct about that, it still doesn't address my latter point (see updated post). I don't see consumer dGPUs going away, as long as there continues to be demand for that amount of compute/rendering power. It's too impractical to put all of that in a single package.
I'd wager that, with incremental compute cost being what it has become lately, we'll probably hit a brick wall within the next seven years where the only way to increase performance is to throw more silicon (and money) at it, just as we've apparently already hit a scaling brick wall on SRAM. Once we get there, it will make more sense to buy all of the compute you need in the smallest package possible, since nothing meaningfully faster will be coming out short of paying N-times as much for approximately N-times the performance.
 

bit_user

Champion
Ambassador
Once we get there, it will make more sense to buy all of the compute you need in the smallest package possible since nothing meaningfully faster will be coming out short of paying N-times as much for approximately N-times the performance.
I still don't agree that special-purpose accelerators (which mostly fall on the GPU side of the dividing line) won't continue to change and evolve. Things like new video codecs, AI hardware, rendering techniques, and optimizations to the hard-wired blocks which do these things will still motivate GPU upgrades at a higher pace than CPUs.

Furthermore, when scaling only happens by adding silicon or burning more power, then it makes even more sense to have them decoupled so that:
  1. You can scale performance by adding more GPUs.
  2. You can more effectively cool multiple packages than one (unless using a highly-customized form factor).

If CPUs had stayed at like 4 cores, then we could see things play out as @LawlessQuill said and have the GPU subsume the CPU as just a minor component. But CPUs appear to be adding cores too fast for that to happen. CPUs and GPUs each demand to be a component in their own right, except at the low end and the extreme high end.
 

InvalidError

Titan
Moderator
I still don't agree that special-purpose accelerators (which mostly fall on the GPU side of the dividing line) won't continue to change and evolve. Things like new video codecs, AI hardware, rendering techniques, and optimizations to the hard-wired blocks which do these things will still motivate GPU upgrades at a higher pace than CPUs.
Not if it comes with a linear cost scaling with performance. Once we get there, $500 will perpetually buy you about the same overall performance. If your new features add 20% die space, you will need to pay 20+% more to get those features. We'll be solidly in a replacement economy where most people keep their PC for 10-15 years.
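A rough sketch of why it's "20+%" rather than a flat 20%: die cost scales worse than linearly with area once yield is factored in. This uses the textbook negative-binomial yield model with made-up wafer price and defect density (illustrative assumptions, not real foundry numbers):

```python
import math

# Illustrative, made-up inputs: not real foundry pricing or defect data.
WAFER_COST = 17000.0    # $ per 300 mm wafer (assumed)
WAFER_DIAM = 300.0      # mm
DEFECT_DENSITY = 0.1    # defects per cm^2 (assumed)
ALPHA = 3.0             # clustering parameter of the negative-binomial yield model

def dies_per_wafer(die_area_mm2: float) -> float:
    r = WAFER_DIAM / 2
    return math.pi * r**2 / die_area_mm2 - math.pi * WAFER_DIAM / math.sqrt(2 * die_area_mm2)

def die_yield(die_area_mm2: float) -> float:
    area_cm2 = die_area_mm2 / 100.0
    return (1 + DEFECT_DENSITY * area_cm2 / ALPHA) ** -ALPHA

def cost_per_good_die(die_area_mm2: float) -> float:
    return WAFER_COST / (dies_per_wafer(die_area_mm2) * die_yield(die_area_mm2))

base, bigger = 300.0, 360.0  # mm^2: a die, and the same die grown by 20%
increase = cost_per_good_die(bigger) / cost_per_good_die(base) - 1
print(f"+20% die area -> +{increase * 100:.0f}% cost per good die")  # ~+29% with these inputs
```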

As for stuffing more GPUs in there, I doubt AMD and Nvidia have any interest in going back to SLI/CF with their grossly inconsistent results and endless support woes.
 

LawlessQuill

Prominent
Apr 22, 2021
9
2
515
As far as specialization goes, it would be interesting to see the PC space slowly begin to focus on driver and software optimization as standards converge across the board and everything starts to plateau. A GPU with the raw performance of a 4090 plus Apple-level optimization would be a force to be reckoned with. I also reckon we'd eventually get charged monthly fees for those updates.
 

bit_user

Champion
Ambassador
Not if it comes with a linear cost scaling with performance.
Why do you assume it will? Nvidia claimed the Tensor cores in their RTX 3000 line offered 4x the performance of the previous generation, so they cut the number in half. I recall seeing some corresponding claim about less area being devoted to them, as well, but I don't recall where.

AMD touted a number of improvements and optimizations in the ray tracing cores of its RX 7000-series, which again seemed to offer disproportionately greater performance increases relative to the additional transistors required.

That's the beauty of fixed-function accelerators for complex tasks - you can often find such disproportionate optimizations. I think this will keep us in new GPUs for some time yet to come.

Once we get there, $500 will perpetually buy you about the same overall performance.
Once we get there, companies will be focusing on all sorts of ISA tweaks to squeeze out more performance per mm^2 and performance per W.

If your new features add 20% die space, you will need to pay 20+% more to get those features.
We could see FPGA-like techniques come to the fore, where dies contain a significant pool of reconfigurable logic, so it can be shared between different fixed-function tasks.

We'll be solidly in a replacement economy where most people keep their PC for 10-15 years.
Or quantum computing unlocks new materials that keep the upgrade treadmill going for at least a decade longer than you think. Even once density has ceased to increase, there's probably still room for efficiency improvements and cost reductions, which could then unlock greater area.

As for stuffing more GPUs in there, I doubt AMD and Nvidia have any interest in going back to SLI/CF with their grossly inconsistent results and endless support woes.
That's mainly because GPUs have been improving at such a pace that new models present an easy path to adding performance. When new models cease to be much faster than older ones, adding more will be a tempting way to increase performance. And with CXL, the logistics and programming model might be simpler, as well.
 

InvalidError

Titan
Moderator
Once we get there, companies will be focusing on all sorts of ISA tweaks to squeeze out more performance per mm^2 and performance per W.
Maybe, maybe not. Much of the "architectural tweaking" lately has been around enlarged caches, usually to the tune of doubling the amount of area dedicated to a given cache tier to get a 5-10% performance improvement. With almost every "tweak" requiring more silicon, the performance per dollar will remain flat at best.
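As a back-of-the-envelope check on that 5-10% figure, here's the standard CPI-plus-miss-penalty calculation combined with the old rule of thumb that miss rate scales roughly with 1/sqrt(capacity). Every number below is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope: why doubling a cache buys only single-digit percentages.
# Assumes the rough "miss rate ~ 1/sqrt(capacity)" rule of thumb; all inputs are
# illustrative assumptions, not measured values.
CPI_CORE = 1.0            # cycles per instruction ignoring cache misses (assumed)
MEM_REFS_PER_INSTR = 0.3  # memory references per instruction (assumed)
MISS_RATE = 0.005         # baseline last-level miss rate per access (assumed)
MISS_PENALTY = 200.0      # cycles to DRAM (assumed)

def cpi(miss_rate: float) -> float:
    return CPI_CORE + MEM_REFS_PER_INSTR * miss_rate * MISS_PENALTY

base = cpi(MISS_RATE)
doubled = cpi(MISS_RATE / 2 ** 0.5)  # 2x capacity -> ~1/sqrt(2) of the misses
print(f"Speedup from doubling the last-level cache: {(base / doubled - 1) * 100:.1f}%")
# With these inputs it lands around 7%, i.e. right in that 5-10% band.
```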

We could see FPGA-like techniques come to the fore, where dies contain a significant pool of reconfigurable logic, so it can be shared between different fixed-function tasks.
What clock frequency would the stuff on an FPGA run at? Likely less than 500 MHz, and you need 5-10X as much silicon space to implement a given function as you would in an ASIC, since most space in an FPGA is routing and general-purpose stuff you rarely get to use more than half of at any given time. I cannot imagine any performance-critical stuff where FPGAs would make sense; it would need to be for stuff that is too slow to do in software but not common or important enough to bother implementing as a fixed function.

That's mainly because GPUs have been improving at such a pace that new models present an easy path to adding performance. When new models cease to be much faster than older ones, adding more will be a tempting way to increase performance. And with CXL, the logistics and programming model might be simpler, as well.
Adding more will only be "tempting" if you can afford to do so and GPU manufacturers manage to make it work properly this time around.

Before wishing for multi-GPU to come back, you'd need AMD and Nvidia to sort out their multi-GCD packages. If they cannot make multi-chip graphics work well as a single-package design, you can kiss goodbye to it having any chance in hell of working as separate packages or boards.

CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right. I suspect the amount of GPU-to-GPU chatter has extremely steep scaling if you want to achieve good performance scaling without programmers having to go 100 miles out of their way to optimize their games for your specific multi-GPU hardware.

Also, I'd expect a very large number of people to simply give up on pushing graphics any further: how many people would actually give a damn about anything faster than a 4090 when that can already do 4k RT Ultra at 100+fps in nearly everything? For most normal people, the incremental visual improvements beyond that would be far beyond the point of diminishing return.
 
  • Like
Reactions: usertests

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
Samsung and SK Hynix both have compute-in-memory solutions. I think at least Samsung's puts the compute in the bottom die of the stack.
I checked the HBM-PIM slide decks; the PIM units aren't on the very bottom die, they're about two silicon layers above it.
Close to the bottom, but not quite.
[Image: HBM-PIM stack slide]



Actually, their CPUs had it first. If you remember back in Ryzen 3000-series, the I/O die had the memory controller + I/O.
True, they did segment it from earlier designs, but I'm thinking of segmenting it off into its own thing AGAIN.
Like into its own die, using OMI for a serial connection to link back to the cIOD.
You know how GPUs moved the memory controller into its own die; the same could happen on the CPU side, where the memory controller is its own die, sitting physically closer to the DIMMs and feeding the data back via a serial connection similar to OMI.

Yeah, like maybe they could've used a different I/O die to effectively back-port Zen 4 to AM4, so that people could use it on cheaper motherboards and with DDR4.
AMD did the calculus, and they didn't see the financial value in doing so.
It also simplifies product support:
AM4 = DDR4
AM5 = DDR5
Less confusion for their customer base, fewer SKUs to worry about, etc.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
And why is it that CPUs have lower memory bandwidth in the first place? The need for customizable memory size using DIMMs. Once CPUs have on-package memory as mainstream, the memory can be whatever the manufacturer wants it to be. Had Apple wanted to, it could have gone 2-4xHBM3E.
The DIMM aspect is one part of it.
The other aspect is that regular DDR has lower latency than GDDR.
GDDR has much higher bandwidth, but worse latency.

It's deliberate trade-offs made for the target use case.
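To put numbers on the bandwidth half of that trade-off: peak bandwidth is just bus width times transfer rate. A quick sketch using typical published figures (dual-channel DDR5-6000 on a desktop CPU vs. a 384-bit GDDR6X card in the RTX 4090 class):

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mtps: float) -> float:
    """Peak theoretical bandwidth in GB/s = bus width in bytes x transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mtps / 1000

# Dual-channel DDR5-6000 on a desktop CPU: 128-bit bus at 6000 MT/s
print(f"DDR5-6000 dual channel:   {peak_bandwidth_gbs(128, 6000):.0f} GB/s")   # ~96 GB/s
# 384-bit GDDR6X at 21 Gb/s per pin (RTX 4090-class card)
print(f"384-bit GDDR6X @ 21 Gbps: {peak_bandwidth_gbs(384, 21000):.0f} GB/s")  # ~1008 GB/s
```

That order-of-magnitude gap is the bandwidth side; the latency side goes the other way, since GDDR accepts looser access timings in exchange for those data rates.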
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right. I suspect the amount of GPU-to-GPU chatter has extremely steep scaling if you want to achieve good performance scaling without programmers having to go 100 miles out of their way to optimize their games for your specific multi-GPU hardware.
I think the only realistic way to do multi-GPU is to copy what Apple & TSMC did and have the two dies connected together via some form of parallel bus.
That lowers the latency enough and gives enough bandwidth to make it work.
After that, you need to make sure the drivers handle things properly.
That's the harder part: making sure your drivers work correctly without relying on the SLI/CrossFire-style "alternate frame rendering" of the past, but instead properly distributing the work so that every frame is rendered together, communicating across the interface as necessary.

That's an area where I see AMD going; they already have this implemented on the CDNA side.
[Image: AMD CDNA slide]

Now it's a matter of time before AMD gets the drivers working.
 

bit_user

Champion
Ambassador
What clock frequency would the stuff on an FPGA run at? Likely less than 500 MHz, and you need 5-10X as much silicon space to implement a given function as you would in an ASIC
I think a growing trend among FPGAs is to provide higher-level building blocks, which could go some way towards addressing the density, frequency, and power-efficiency issues.

I cannot imagine any performance-critical stuff where FPGAs would make sense; it would need to be for stuff that is too slow to do in software but not common or important enough to bother implementing as a fixed function.
What about as a higher-performance alternative to microcode?

Before wishing for multi-GPU to come back, you'd need AMD and Nvidia to sort out their multi-GCD packages. If they cannot make multi-chip graphics work well as a single-package design, you can kiss goodbye to it having any chance in hell of working as separate packages or boards.
Multi-board would seem to hold good potential for multi-display rendering, at least. I recall a workstation board with dual-Fury GPUs that AMD was pitching towards VR, with the suggestion that each GPU could render the output for one eye.

CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right.
SLI became increasingly niche, which reduced incentives for developers to support it. With CXL, you'd have a cache-coherent foundation for multi-GPU that's ubiquitous.

Also, I'd expect a very large number of people to simply give up on pushing graphics any further: how many people would actually give a damn about anything faster than a 4090 when that can already do 4k RT Ultra at 100+fps in nearly everything? For most normal people, the incremental visual improvements beyond that would be far beyond the point of diminishing return.
In your scenario of stagnant performance per $, multi-GPU would give people an upgrade path that didn't require discarding their initial investment.
 

bit_user

Champion
Ambassador
You know how GPUs moved the memory controller into its own die; the same could happen on the CPU side, where the memory controller is its own die, sitting physically closer to the DIMMs and feeding the data back via a serial connection similar to OMI.
There's no point in using a serial connection for in-package communication. I do wonder if you might be onto something, except maybe the idea would be to put the memory controller in the base of an LPDDR5 stack. I wonder if that would offer any advantages.
 

bit_user

Champion
Ambassador
https://en.wikipedia.org/wiki/Apple_M1
M1 series uses LPDDR, not HBM.

Sorry, Apple didn't try using HBM at all on the M1 line.
I know that, as you can clearly see from post #20.
"Apple's M1 Ultra shows that even packing like 8 channels of LPDDR5 in-package isn't enough to compete with high-end dGPUs."​

We were talking about the generic idea of a high-end iGPU. Regardless of the memory technology, Apple managed to aggregate 800 GB/s of memory bandwidth, which is about 80% of an RTX 4090. And yet, their M1 Ultra iGPU struggles to compete with an RTX 3070.
 

bit_user

Champion
Ambassador
I think the only realistic way to do multi-GPU is to copy what Apple & TSMC did and have the two dies connected together via some form of parallel bus.
That's if you want to make it transparent to software. On the other hand, if software is aware that there are two GPUs, it could probably partition the work and data more intelligently.
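As a purely hypothetical illustration of what software-aware partitioning could look like (the Gpu class and its upload/render methods are invented for this sketch, not any real API): split each frame into screen regions, keep only the assets that overlap a region resident on that GPU, and composite the parts at the end.

```python
# Purely hypothetical sketch of application-aware split-frame rendering.
# The Gpu class and its methods are invented for illustration; no real API is implied.
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str

    def upload(self, asset: str) -> None:
        print(f"[{self.name}] keeping a resident copy of '{asset}'")

    def render(self, rows: tuple) -> str:
        print(f"[{self.name}] rendering rows {rows[0]}..{rows[1]}")
        return f"{self.name}-region"

def overlaps(asset_rows: tuple, region_rows: tuple) -> bool:
    return asset_rows[0] < region_rows[1] and region_rows[0] < asset_rows[1]

def render_frame(gpus: list, height: int, scene: list) -> list:
    # scene: list of (asset_name, (first_row, last_row)) screen-space bounds.
    # Each GPU owns a fixed screen region and only uploads assets overlapping it,
    # so VRAM isn't fully mirrored and cross-GPU traffic stays low.
    regions = [(0, height // 2), (height // 2, height)]
    parts = []
    for gpu, region in zip(gpus, regions):
        for name, rows in scene:
            if overlaps(rows, region):
                gpu.upload(name)
        parts.append(gpu.render(region))
    return parts  # the parts get composited into the final frame afterwards

render_frame(
    [Gpu("gpu0"), Gpu("gpu1")],
    height=2160,
    scene=[("sky", (0, 900)), ("terrain", (800, 2160)), ("hud", (0, 2160))],
)
```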

That's an area where I see AMD going, they already have this implemented on the CDNA side.
RDNA3 was an explicit rejection of that idea, or did you not read the article? They said they couldn't make the chiplet approach work by splitting the GCD, which is why they instead decided to move the memory controllers and L3 cache onto chiplets.

GPU Compute is a different problem domain with different data movement patterns. That's why they could use multi-die in that case. And, in point of fact, the MI200 is visible to software as two separate GPUs. They didn't attempt Apple's conceit, but also they don't have an interconnect anywhere near as fast as Apple's.
 
https://www.notebookcheck.net/Preli...t-5-times-that-of-standard-DDR5.580580.0.html

No, I mean phasing out discrete CPUs. With the new advancements in bridge technology, direct storage, SAM, and AI optimizations, I personally believe CPUs could be relegated to a few chiplets attached to a GPU die, within the GPU itself. At a certain point, DDR RAM on the GPU could be possible as well (separate from VRAM).
Yeah, that's basically what Intel did with Nervana/Habana.
The iGPU has all of the stuff they are selling, and the CPU cores on it are just incidental.
Available as PCIe cards and even in the M.2 form factor.
But these are business products; for the desktop, people will always want the option of a much faster dGPU.
[Image: Intel NNP-I 1000 at OCP Summit 2019]


[Image: Intel NNP-I summary slide]
 

InvalidError

Titan
Moderator
SLI became increasingly niche, which reduced incentives for developers to support it. With CXL, you'd have a cache-coherent foundation for multi-GPU that's ubiquitous.

In your scenario of stagnant performance per $, multi-GPU would give people an upgrade path that didn't require discarding their initial investment.
SLI/CF was niche because most developers didn't want to have to do model-specific optimizations to make it work properly, which is support hell. I bet most developers will choose to work with whatever the mainstream level of performance is instead of having to massively increase the already excessive amount of effort that goes into the graphics budget to push things any further.

Also, mostly stagnant performance per dollar doesn't mean there won't be architectural changes breaking interoperability along the way. By the time you want to boost your 5+ year old GPU's performance by adding another, you may need to get two new ones instead.
 

bit_user

Champion
Ambassador
Yeah that's basically what intel did with nervana/habana
Nervana is dead. Intel killed it when they bought Habana. Anyway, that's a fair point, but it's more like embedding a CPU into a peripheral GPU than the sort of fusion where the system's central processing unit is a CPU/GPU hybrid.

I think Falcon Shores is probably even more analogous to Nvidia Grace+Hopper and AMD MI300.


Hey, at least the industry seems to be converging around this idea. How well it ages will be interesting to see.

I just think the economics of HPC is very different than what people want from a gaming PC. People seem pretty outraged at GPU prices now, but if you look at Apple's M1 Ultra solution, the very cheapest configuration you can buy with the fully-enabled GPU costs $5000.



Of course, that includes the Apple Tax, but it's a lot more than their other M1 Macs. I just don't see either that pricing or the lack of upgradability sitting well with a lot of high-end gamers. Or, in an era where upgrades don't buy you as much, you still want modularity so you can replace individual components that break, which is going to be even more important if people are keeping their GPUs longer.
 

InvalidError

Titan
Moderator
Or, in an era where upgrades don't buy you as much, you still want modularity so you can replace individual components that break, which is going to be even more important if people are keeping their GPUs longer.
How often do CPUs fail without something else (usually high-side VRM FET failing shorted and sending 12V to Vcore) failing first? Very rarely. High-graphics SoCs would likely have to meet the same expectations before AMD and Intel can deploy them at scale. We'll see how likely that is to happen when AMD, Nvidia and Intel launch their first generations of extensively 3D-stacked chips. I doubt their datacenter clients would take kindly to a noticeable increase in overall failure rates.
 

bit_user

Champion
Ambassador
How often do CPUs fail without something else (usually high-side VRM FET failing shorted and sending 12V to Vcore) failing first? Very rarely.
Talking about consumer CPUs? That's largely because they don't run flat-out as much or as hard as people run their GPUs. You're focused on the wrong problem.

And RAM definitely develops errors at a higher rate than CPUs fail.

I doubt their datacenter clients would take kindly to a noticeable increase in overall failure rates.
They get a longer warranty and will likely replace their hardware within that time. Also, server CPUs and GPUs are de-rated to run at lower speeds for better longevity and stability. With consumer CPUs and GPUs, the temptation is always to juice them harder to get the performance win, on the assumption that people won't push them hard enough to cause failures before the warranty expires.