News AMD Puts Hopes on Packaging, Memory on Logic, Optical Comms for Decade Ahead


bit_user

Champion
Ambassador
The only reason HBM costs more is because it is still a niche product.
Even if you're 100% correct about that, it still doesn't address my latter point (see updated post). I don't see consumer dGPUs going away, as long as there continues to be demand for that amount of compute/rendering power. It's too impractical to put all of that in a single package.

Apple tried, with the M1 Ultra, and the benchmarks (see Geekbench) show their GPU performance claims are wildly inflated. It's nowhere near a top-end dGPU, in spite of amassing a respectable aggregate memory bandwidth.
 

LawlessQuill

Prominent
Apr 22, 2021
9
2
515
I think GDDR is its own thing and the numbering has no direct correspondence with regular DDR memory standards.


Did you mean discrete GPUs getting phased out? Won't happen. The high-end GPUs will remain distinct from CPUs, for the foreseeable future. The main reason is GDDR memory, which has far higher bandwidth than is available to a CPU. It also turns out to be useful to upgrade your GPU without having to toss out your old CPU, as well.

At the low and even mid-range, we could see iGPUs with in-package memory eroding the market segment of dGPUs, but Apple's M1 Ultra shows that even packing like 8 channels of LPDDR5 in-package isn't enough to compete with high-end dGPUs. Don't believe me? Check its Geekbench scores.
https://www.notebookcheck.net/Preli...t-5-times-that-of-standard-DDR5.580580.0.html

No, I mean phasing out discrete CPUs. With the new advancements in bridge technology, direct storage, SAM, and AI optimizations, I personally believe CPUs could be relegated to a few chiplets attached to a GPU die, within the GPU itself. At a certain point, DDR RAM on the GPU could be possible as well (separate from VRAM).

Perhaps the new chiplet wouldn't be referred to as a GPU, but it makes more sense as one you could plug and play in any high-end PC, with its own cooling and I/O attached. A singular unit with very little space between every essential component, bridged like Infinity Fabric between all components for maximum bandwidth.

Not only is it possible, it's been proposed since 2017 by Jensen himself.

CPUs are slow, inefficient, and barely used to capacity. Even years into multithreading, a plethora of apps still use single cores to power them. The state of CPUs is quite honestly pathetic compared to the advancements in all other computer technologies.

 

bit_user

Champion
Ambassador
I blame the article's author for confusing DDR and GDDR. The only actual reference in the article is to GDDR7. Looking at her other articles, it seems this sort of technology development isn't her main beat, so we can't necessarily presume she knows what she's writing about.

No, I mean phasing out discrete CPUs.
That's a distinction without a difference.

A singular unit with very little space between every essential component, bridged like infinity fabric between all components for maximum bandwidth.
People seem to grossly overestimate the drawbacks of having a GPU sit over a PCIe (or similar) bus connection. The main win for integration isn't getting rid of PCIe, but rather that you can avoid duplicating assets in both GPU and CPU memory. So, it's mainly a cost savings.

The exception is in AI training & some HPC compute loads, which is why AMD is making the MI300 and Nvidia has their Grace + Hopper superchip idea.

Not only is it possible, its been proposed since 2017 by Jensen himself.
He's always going to spin things in a GPU-centric direction. Anyway, you can look at their Tegra line of SoCs, their DPUs, and their Grace + Hopper "superchip" to see where they're going. These are all for specialized applications and markets.

CPUs are slow, inefficient, and barely used to capacity.
Only if you put the wrong kind of workload on them. If you ran a CPU-oriented workload on a GPU, you'd have to draw exactly the same conclusion (except worse).

Even years into multithreading a plethora of apps still use single cores to power them.
Granted, programming models, languages, APIs, and even ISAs need to improve to support better utilization of multiple cores. That's still very different from trying to treat a GPU like a CPU; the two just aren't interchangeable.
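To put a rough number on it, here's a quick Amdahl's-law sketch (the 90%-parallel figure is purely an illustrative assumption, not a measurement of any real app):

```python
# Rough Amdahl's-law sketch: speedup when a fraction p of the work
# parallelizes perfectly and the remaining (1 - p) stays serial.
def amdahl_speedup(p: float, n_cores: int) -> float:
    """Ideal speedup on n_cores when fraction p of the runtime is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n_cores)

if __name__ == "__main__":
    for cores in (4, 16, 64, 1024):
        # Even a generously "90% parallel" app tops out below 10x,
        # no matter how many cores (or GPU-style lanes) you add.
        print(f"{cores:5d} cores -> {amdahl_speedup(0.90, cores):5.2f}x speedup")
```

Even at 1024 cores, the serial 10% caps the speedup just under 10x, which is why piling on cores doesn't rescue serial-heavy desktop code.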
 
  • Like
Reactions: TJ Hooker


InvalidError

Titan
Moderator
Even if you're 100% correct about that, it still doesn't address my latter point (see updated post). I don't see consumer dGPUs going away, as long as there continues to be demand for that amount of compute/rendering power. It's too impractical to put all of that in a single package.
I'd wager that, with incremental compute cost being what it has become lately, we'll probably hit a brick wall within the next seven years where the only way to increase performance is to throw more silicon (and money) at it, just as we've apparently already hit a scaling brick wall on SRAM. Once we get there, it will make more sense to buy all of the compute you need in the smallest package possible, since nothing meaningfully faster will be coming out short of paying N-times as much for approximately N-times the performance.
 

bit_user

Champion
Ambassador
Once we get there, it will make more sense to buy all of the compute you need in the smallest package possible since nothing meaningfully faster will be coming out short of paying N-times as much for approximately N-times the performance.
I still don't agree that special-purpose accelerators (which mostly fall on the GPU side of the dividing line) won't continue to change and evolve. Things like new video codecs, AI hardware, rendering techniques, and optimizations to the hard-wired blocks which do these things will still motivate GPU upgrades at a higher pace than CPUs.

Furthermore, when scaling only happens by adding silicon or burning more power, then it makes even more sense to have them decoupled so that:
  1. You can scale performance by adding more GPUs.
  2. You can more effectively cool multiple packages than one (unless using a highly-customized form factor).

If CPUs had stayed at like 4 cores, then we could see things play out as @LawlessQuill said and have the GPU subsume the CPU as just a minor component. But CPUs appear to be adding cores too fast for that to happen. CPUs and GPUs each demand to be a component in their own right, except at the low end and the extreme high end.
 

InvalidError

Titan
Moderator
I still don't agree that special-purpose accelerators (which mostly fall on the GPU side of the dividing line) won't continue to change and evolve. Things like new video codecs, AI hardware, rendering techniques, and optimizations to the hard-wired blocks which do these things will still motivate GPU upgrades at a higher pace than CPUs.
Not if it comes with a linear cost scaling with performance. Once we get there, $500 will perpetually buy you about the same overall performance. If your new features add 20% die space, you will need to pay 20+% more to get those features. We'll be solidly in a replacement economy where most people keep their PC for 10-15 years.
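A rough sketch of why it's "20+%" rather than a flat 20%: die cost scales worse than linearly with area once yield is factored in. This uses the textbook negative-binomial yield model with made-up wafer price and defect density (illustrative assumptions, not real foundry numbers):

```python
import math

# Illustrative, made-up inputs: not real foundry pricing or defect data.
WAFER_COST = 17000.0    # $ per 300 mm wafer (assumed)
WAFER_DIAM = 300.0      # mm
DEFECT_DENSITY = 0.1    # defects per cm^2 (assumed)
ALPHA = 3.0             # clustering parameter of the negative-binomial yield model

def dies_per_wafer(die_area_mm2: float) -> float:
    r = WAFER_DIAM / 2
    return math.pi * r**2 / die_area_mm2 - math.pi * WAFER_DIAM / math.sqrt(2 * die_area_mm2)

def die_yield(die_area_mm2: float) -> float:
    area_cm2 = die_area_mm2 / 100.0
    return (1 + DEFECT_DENSITY * area_cm2 / ALPHA) ** -ALPHA

def cost_per_good_die(die_area_mm2: float) -> float:
    return WAFER_COST / (dies_per_wafer(die_area_mm2) * die_yield(die_area_mm2))

base, bigger = 300.0, 360.0  # mm^2: a die, and the same die grown by 20%
increase = cost_per_good_die(bigger) / cost_per_good_die(base) - 1
print(f"+20% die area -> +{increase * 100:.0f}% cost per good die")  # ~+29% with these inputs
```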

As for stuffing more GPUs in there, I doubt AMD and Nvidia have any interest in going back to SLI/CF with their grossly inconsistent results and endless support woes.
 

LawlessQuill

Prominent
Apr 22, 2021
9
2
515
As far as specialization goes, it would be interesting to see the PC space slowly begin to focus on driver and software optimization as standards converge across the board and everything starts to plateau. A GPU with the raw performance of a 4090 plus Apple-level optimization would be a force to be reckoned with. I also reckon we'd eventually get charged monthly fees for those updates.
 

bit_user

Champion
Ambassador
Not if it comes with a linear cost scaling with performance.
Why do you assume it will? Nvidia claimed the Tensor cores in their RTX 3000 line offered 4x the performance of the previous generation, so they cut the number in half. I recall seeing some corresponding claim about less area being devoted to them, as well, but I don't recall where.

AMD touted a number of improvements and optimizations in the ray tracing cores of its RX 7000-series, which again seemed to offer disproportionately greater performance increases relative to the additional transistors required.

That's the beauty of fixed-function accelerators for complex tasks - you can often find such disproportionate optimizations. I think this will keep us in new GPUs for some time yet to come.

Once we get there, $500 will perpetually buy you about the same overall performance.
Once we get there, companies will be focusing on all sorts of ISA tweaks to squeeze out more performance per mm^2 and performance per W.

If your new features add 20% die space, you will need to pay 20+% more to get those features.
We could see FPGA-like techniques come to the fore, where dies contain a significant pool of reconfigurable logic, so it can be shared between different fixed-function tasks.

We'll be solidly in a replacement economy where most people keep their PC for 10-15 years.
Or quantum computing unlocks new materials that keep the upgrade treadmill going for at least a decade longer than you think. Even once density has ceased to increase, there's probably still room for efficiency improvements and cost reductions, which could then unlock greater area.

As for stuffing more GPUs in there, I doubt AMD and Nvidia have any interest in going back to SLI/CF with their grossly inconsistent results and endless support woes.
That's mainly because GPUs have been improving at such a pace that new models present an easy path to adding performance. When new models cease to be much faster than older ones, adding more will be a tempting way to increase performance. And with CXL, the logistics and programming model might be simpler, as well.
 

InvalidError

Titan
Moderator
Once we get there, companies will be focusing on all sorts of ISA tweaks to squeeze out more performance per mm^2 and performance per W.
Maybe, maybe not. Much of the "architectural tweaking" lately has been around enlarged caches, usually to the tune of doubling the amount of area dedicated to a given cache tier to get a 5-10% performance improvement. With almost every "tweak" requiring more silicon, the performance per dollar will remain flat at best.
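As a back-of-the-envelope check on that 5-10% figure, here's the standard CPI-plus-miss-penalty calculation combined with the old rule of thumb that miss rate scales roughly with 1/sqrt(capacity). Every number below is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope: why doubling a cache buys only single-digit percentages.
# Assumes the rough "miss rate ~ 1/sqrt(capacity)" rule of thumb; all inputs are
# illustrative assumptions, not measured values.
CPI_CORE = 1.0            # cycles per instruction ignoring cache misses (assumed)
MEM_REFS_PER_INSTR = 0.3  # memory references per instruction (assumed)
MISS_RATE = 0.005         # baseline last-level miss rate per access (assumed)
MISS_PENALTY = 200.0      # cycles to DRAM (assumed)

def cpi(miss_rate: float) -> float:
    return CPI_CORE + MEM_REFS_PER_INSTR * miss_rate * MISS_PENALTY

base = cpi(MISS_RATE)
doubled = cpi(MISS_RATE / 2 ** 0.5)  # 2x capacity -> ~1/sqrt(2) of the misses
print(f"Speedup from doubling the last-level cache: {(base / doubled - 1) * 100:.1f}%")
# With these inputs it lands around 7%, i.e. right in that 5-10% band.
```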

We could see FPGA-like techniques come to the fore, where dies contain a significant pool of reconfigurable logic, so it can be shared between different fixed-function tasks.
What clock frequency would the stuff on an FPGA run at? Likely less than 500 MHz, and you need 5-10X as much silicon space to implement a given function as you would in an ASIC, since most space in an FPGA is routing and general-purpose stuff you rarely get to use more than half of at any given time. I cannot imagine any performance-critical stuff where FPGAs would make sense; it would need to be for stuff that is too slow to do in software but not common or important enough to bother implementing as a fixed function.

That's mainly because GPUs have been improving at such a pace that new models present an easy path to adding performance. When new models cease to be much faster than older ones, adding more will be a tempting way to increase performance. And with CXL, the logistics and programming model might be simpler, as well.
Adding more will only be "tempting" if you can afford to do so and GPU manufacturers manage to make it work properly this time around.

Before wishing for multi-GPU to come back, you'd need AMD and Nvidia to sort out their multi-GCD packages. If they cannot make multi-chip graphics work well as a single-package design, you can kiss goodbye to it having any chance in hell of working as separate packages or boards.

CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right. I suspect the amount of GPU-to-GPU chatter has extremely steep scaling if you want to achieve good performance scaling without programmers having to go 100 miles out of their way to optimize their games for your specific multi-GPU hardware.

Also, I'd expect a very large number of people to simply give up on pushing graphics any further: how many people would actually give a damn about anything faster than a 4090 when that can already do 4k RT Ultra at 100+fps in nearly everything? For most normal people, the incremental visual improvements beyond that would be far beyond the point of diminishing return.
 
  • Like
Reactions: usertests

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
Samsung and SK Hynix both have compute-in-memory solutions. I think at least Samsung's puts the compute in the bottom die of the stack.
I checked the HBM-PIM slide decks; the PIM units aren't on the very bottom die, they're about two silicon layers above it.
Close to the bottom, but not quite.
[Image: HBM-PIM stack slide]



Actually, their CPUs had it first. If you remember back in Ryzen 3000-series, the I/O die had the memory controller + I/O.
True, they did segment it from earlier designs, but I'm thinking of segmenting it off into its own thing AGAIN.
Like into its own die, using OMI for a serial connection to link back to the cIOD.
You know how GPUs moved the memory controller into its own die; the same could happen on the CPU side, where the memory controller is its own die, sitting physically closer to the DIMMs and feeding the data back via a serial connection similar to OMI.

Yeah, like maybe they could've used a different I/O die to effectively back-port Zen 4 to AM4, so that people could use it on cheaper motherboards and with DDR4.
AMD did the calculus, and they didn't see the financial value in doing so.
It also simplifies product support:
AM4 = DDR4
AM5 = DDR5
Less confusion for their customer base, fewer SKUs to worry about, etc.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
And why is it that CPUs have lower memory bandwidth in the first place? The need for customizable memory size using DIMMs. Once CPUs have on-package memory as mainstream, the memory can be whatever the manufacturer wants it to be. Had Apple wanted to, it could have gone 2-4xHBM3E.
The DIMM aspect is one part of it.
The other aspect is that regular DDR has lower latency than GDDR.
GDDR has much higher bandwidth, but worse latency.

It's deliberate trade-offs made for the target use case.
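To put numbers on the bandwidth half of that trade-off: peak bandwidth is just bus width times transfer rate. A quick sketch using typical published figures (dual-channel DDR5-6000 on a desktop CPU vs. a 384-bit GDDR6X card in the RTX 4090 class):

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mtps: float) -> float:
    """Peak theoretical bandwidth in GB/s = bus width in bytes x transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mtps / 1000

# Dual-channel DDR5-6000 on a desktop CPU: 128-bit bus at 6000 MT/s
print(f"DDR5-6000 dual channel:   {peak_bandwidth_gbs(128, 6000):.0f} GB/s")   # ~96 GB/s
# 384-bit GDDR6X at 21 Gb/s per pin (RTX 4090-class card)
print(f"384-bit GDDR6X @ 21 Gbps: {peak_bandwidth_gbs(384, 21000):.0f} GB/s")  # ~1008 GB/s
```

That order-of-magnitude gap is the bandwidth side; the latency side goes the other way, since GDDR accepts looser access timings in exchange for those data rates.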
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,227
729
20,060
CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right. I suspect the amount of GPU-to-GPU chatter has extremely steep scaling if you want to achieve good performance scaling without programmers having to go 100 miles out of their way to optimize their games for your specific multi-GPU hardware.
I think the only realistic way to do multi-GPU is to copy what Apple & TSMC did and have the two dies connected together via some form of parallel bus.
That lowers the latency enough and gives enough bandwidth to make it work.
After that, you need to make sure the drivers handle things properly.
That's the harder part: making sure your drivers work correctly without relying on the SLI/CrossFire-style "alternate frame rendering" of the past, but instead properly distributing the work so that every frame is rendered together, communicating across the interface as necessary.

That's an area where I see AMD going; they already have this implemented on the CDNA side.
[Image: AMD CDNA slide]

Now it's a matter of time before AMD gets the drivers working.
 

bit_user

Champion
Ambassador
What clock frequency would the stuff on an FPGA run at? Likely less than 500 MHz, and you need 5-10X as much silicon space to implement a given function as you would in an ASIC
I think a growing trend among FPGAs is to provide higher-level building blocks, which could go some way towards addressing the density, frequency, and power-efficiency issues.

I cannot imagine any performance-critical stuff where FPGAs would make sense; it would need to be for stuff that is too slow to do in software but not common or important enough to bother implementing as a fixed function.
What about as a higher-performance alternative to microcode?

Before wishing for multi-GPU to come back, you'd need AMD and Nvidia to sort out their multi-GCD packages. If they cannot make multi-chip graphics work well as a single-package design, you can kiss goodbye to it having any chance in hell of working as separate packages or boards.
Multi-board would seem to hold good potential for multi-display rendering, at least. I recall a workstation board with dual-Fury GPUs that AMD was pitching towards VR, with the suggestion that each GPU could render the output for one eye.

CXL won't save multi-GPU: Pascal's NVLink had 80+GB/s as a dedicated chip-to-chip link and that still wasn't enough to make SLI work right.
SLI became increasingly niche, which reduced incentives for developers to support it. With CXL, you'd have a cache-coherent foundation for multi-GPU that's ubiquitous.

Also, I'd expect a very large number of people to simply give up on pushing graphics any further: how many people would actually give a damn about anything faster than a 4090 when that can already do 4k RT Ultra at 100+fps in nearly everything? For most normal people, the incremental visual improvements beyond that would be far beyond the point of diminishing return.
In your scenario of stagnant performance per $, multi-GPU would give people an upgrade path that didn't require discarding their initial investment.
 

bit_user

Champion
Ambassador
You know how GPUs moved the memory controller into its own die; the same could happen on the CPU side, where the memory controller is its own die, sitting physically closer to the DIMMs and feeding the data back via a serial connection similar to OMI.
There's no point in using a serial connection for in-package communication. I do wonder if you might be onto something, except maybe the idea would be to put the memory controller in the base of an LPDDR5 stack. I wonder if that would offer any advantages.
 

bit_user

Champion
Ambassador
https://en.wikipedia.org/wiki/Apple_M1
M1 series uses LPDDR, not HBM.

Sorry, Apple didn't try using HBM at all on the M1 line.
I know that, as you can clearly see from post #20.
"Apple's M1 Ultra shows that even packing like 8 channels of LPDDR5 in-package isn't enough to compete with high-end dGPUs."​

We were talking about the generic idea of a high-end iGPU. Regardless of the memory technology, Apple managed to aggregate 800 GB/s of memory bandwidth, which is about 80% of an RTX 4090. And yet, their M1 Ultra iGPU struggles to compete with an RTX 3070.
 

bit_user

Champion
Ambassador
I think the only realistic way to do multi-GPU is to copy what Apple & TSMC did and have the two dies connected together via some form of parallel bus.
That's if you want to make it transparent to software. On the other hand, if software is aware that there are two GPUs, it could probably partition the work and data more intelligently.
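As a purely hypothetical illustration of what software-aware partitioning could look like (the Gpu class and its upload/render methods are invented for this sketch, not any real API): split each frame into screen regions, keep only the assets that overlap a region resident on that GPU, and composite the parts at the end.

```python
# Purely hypothetical sketch of application-aware split-frame rendering.
# The Gpu class and its methods are invented for illustration; no real API is implied.
from dataclasses import dataclass

@dataclass
class Gpu:
    name: str

    def upload(self, asset: str) -> None:
        print(f"[{self.name}] keeping a resident copy of '{asset}'")

    def render(self, rows: tuple) -> str:
        print(f"[{self.name}] rendering rows {rows[0]}..{rows[1]}")
        return f"{self.name}-region"

def overlaps(asset_rows: tuple, region_rows: tuple) -> bool:
    return asset_rows[0] < region_rows[1] and region_rows[0] < asset_rows[1]

def render_frame(gpus: list, height: int, scene: list) -> list:
    # scene: list of (asset_name, (first_row, last_row)) screen-space bounds.
    # Each GPU owns a fixed screen region and only uploads assets overlapping it,
    # so VRAM isn't fully mirrored and cross-GPU traffic stays low.
    regions = [(0, height // 2), (height // 2, height)]
    parts = []
    for gpu, region in zip(gpus, regions):
        for name, rows in scene:
            if overlaps(rows, region):
                gpu.upload(name)
        parts.append(gpu.render(region))
    return parts  # the parts get composited into the final frame afterwards

render_frame(
    [Gpu("gpu0"), Gpu("gpu1")],
    height=2160,
    scene=[("sky", (0, 900)), ("terrain", (800, 2160)), ("hud", (0, 2160))],
)
```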

That's an area where I see AMD going, they already have this implemented on the CDNA side.
RDNA3 was an explicit rejection of that idea, or did you not read the article? They said they couldn't make the chiplet approach work by splitting the GCD, which is why they instead decided to move the memory controllers and L3 cache onto chiplets.

GPU Compute is a different problem domain with different data movement patterns. That's why they could use multi-die in that case. And, in point of fact, the MI200 is visible to software as two separate GPUs. They didn't attempt Apple's conceit, but also they don't have an interconnect anywhere near as fast as Apple's.
 
https://www.notebookcheck.net/Preli...t-5-times-that-of-standard-DDR5.580580.0.html

No, I mean phasing out discrete CPUs. With the new advancements in bridge technology, direct storage, SAM, and AI optimizations, I personally believe CPUs could be relegated to a few chiplets attached to a GPU die, within the GPU itself. At a certain point, DDR RAM on the GPU could be possible as well (separate from VRAM).
Yeah, that's basically what Intel did with Nervana/Habana.
The iGPU has all of the stuff they are selling, and the CPU cores on it are just incidental.
Available as PCIe cards and even in the M.2 form factor.
But these are business products; for the desktop, people will always want the option of a much faster dGPU.
[Image: Intel NNP-I 1000 at OCP Summit 2019]


[Image: Intel NNP-I summary slide]
 

InvalidError

Titan
Moderator
SLI became increasingly niche, which reduced incentives for developers to support it. With CXL, you'd have a cache-coherent foundation for multi-GPU that's ubiquitous.

In your scenario of stagnant performance per $, multi-GPU would give people an upgrade path that didn't require discarding their initial investment.
SLI/CF was niche because most developers didn't want to have to do model-specific optimizations to make it work properly, which is support hell. I bet most developers will choose to work with whatever the mainstream level of performance is instead of having to massively increase the already excessive amount of effort that goes into the graphics budget to push things any further.

Also, mostly stagnant performance per dollar doesn't mean there won't be architectural changes breaking interoperability along the way. By the time you want to boost your 5+ year old GPU's performance by adding another, you may need to get two new ones instead.
 

bit_user

Champion
Ambassador
Yeah that's basically what intel did with nervana/habana
Nervana is dead. Intel killed it when they bought Habana. Anyway, that's a fair point, but it's more like embedding a CPU into a peripheral GPU than the sort of fusion where the system's central processing unit is a CPU/GPU hybrid.

I think Falcon Shores is probably even more analogous to Nvidia Grace+Hopper and AMD MI300.


Hey, at least the industry seems to be converging around this idea. How well it ages will be interesting to see.

I just think the economics of HPC is very different than what people want from a gaming PC. People seem pretty outraged at GPU prices now, but if you look at Apple's M1 Ultra solution, the very cheapest configuration you can buy with the fully-enabled GPU costs $5000.



Of course, that includes the Apple Tax, but it's a lot more than their other M1 Macs. I just don't see either that pricing or the lack of upgradability sitting well with a lot of high-end gamers. Or, in an era where upgrades don't buy you as much, you still want modularity so you can replace individual components that break, which is going to be even more important if people are keeping their GPUs longer.
 

InvalidError

Titan
Moderator
Or, in an era where upgrades don't buy you as much, you still want modularity so you can replace individual components that break, which is going to be even more important if people are keeping their GPUs longer.
How often do CPUs fail without something else (usually high-side VRM FET failing shorted and sending 12V to Vcore) failing first? Very rarely. High-graphics SoCs would likely have to meet the same expectations before AMD and Intel can deploy them at scale. We'll see how likely that is to happen when AMD, Nvidia and Intel launch their first generations of extensively 3D-stacked chips. I doubt their datacenter clients would take kindly to a noticeable increase in overall failure rates.
 

bit_user

Champion
Ambassador
How often do CPUs fail without something else (usually high-side VRM FET failing shorted and sending 12V to Vcore) failing first? Very rarely.
Talking about consumer CPUs? That's largely because they don't run flat-out as much or as hard as people run their GPUs. You're focused on the wrong problem.

And RAM definitely develops errors at a higher rate than CPUs fail.

I doubt their datacenter clients would take kindly to a noticeable increase in overall failure rates.
They get a longer warranty and will likely replace their hardware within that time. Also, server CPUs and GPUs are de-rated to run at lower speeds for better longevity and stability. With consumer CPUs and GPUs, the temptation is always to juice them harder to get the performance win, on the assumption that people won't push them hard enough to cause failures before the warranty expires.