News AMD Puts Hopes on Packaging, Memory on Logic, Optical Comms for Decade Ahead

usertests

Distinguished
Mar 8, 2013
A further 10x performance/watt (efficiency) should be easily achievable. They want upwards of 200-1,000x within the next 15 years or so to enable zettascale supercomputers, and it might be possible to go further than that with 3D packaging.

Computing has already come so far, but adding a few more zeroes to the end could make things comical.
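To put those multipliers in perspective, a quick back-of-the-envelope (illustrative power and FLOPS figures of my own, not from the article):

```python
# Rough, illustrative numbers: current exascale systems deliver about 10^18 FLOPS
# within a power budget on the order of 20-30 MW.
exaflops = 1e18      # FLOPS, order of magnitude for today's top systems
zettaflops = 1e21    # FLOPS, the zettascale target

# At a roughly flat power budget, zettascale needs ~1000x better perf/watt.
required_gain = zettaflops / exaflops
print(f"Perf/W gain needed at flat power: {required_gain:.0f}x")

# If "a further 10x" comes easily from process and packaging, the rest has
# to come from architecture, data locality, optical I/O, etc.
print(f"Remaining after an easy 10x: {required_gain / 10:.0f}x")
```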
 

InvalidError

Titan
Moderator
The CPUs with on-package if not direct-stacked memory that I predicted about two years ago are one more step closer.

I wouldn't be surprised if DDR6 ends up being the last external memory standard we get before main memory moves on-package, with external memory expansion going over PCIe/CXL whenever you need more than whatever on-package memory your CPU/GPU has.

An IGP chiplet/tile with 4-8GB of stacked HBM-like memory and almost direct access to the system memory controller should be interesting.
 

bit_user

Polypheme
Ambassador
Weird that they wouldn't plot Genoa on here:

[slide image]


Maybe it blows up their nice trend, with that massive 1.5x core-count increase and DDR5 memory?

I also wonder what SPECint they used... did they go back and re-test old Opteron servers with SPEC2017?

AMD will also target processing in memory.
This should set the stage for an interesting battle with Samsung and SK Hynix, both of which have been very active in this space. If it's truly core to AMD's strategy, I doubt they'll be content to source PIM solutions from partners.

Can someone explain the difference between "2.5D Si INT, EFB" and "3D Chiplets", in the first slide of the second set?

[slide image]

And, on the next slide in that set, how much should we read into the subtly different wording: "DRAM layers" vs. "Memory layers"?

[slide image]

Another big target for efficiency savings, and thus potential performance boosts, is chip I/O and communications. Specifically, using optical communications
How much latency would these optical transceivers tend to add?

AMD also took some time to boast about the AI performance gains which have been delivered by its processor portfolio over the last decade.
Nvidia still owns this market. If AMD really wants to play ball, they need to do a better job of hardware support across their entire stack, making it easy for even novice users to use any AMD GPU for AI acceleration.

And then they need to design hardware that's two generations ahead of where they think Nvidia will be, because that's where Nvidia will actually be when AMD launches its product. For too long, AMD's AI performance has been a generation behind Nvidia's. That's because Nvidia understands it's truly strategic, whereas AMD treats it as a "nice-to-have". However, the hardware doesn't even matter if the software support & mindshare aren't there.
 

bit_user

Polypheme
Ambassador
This slide places too little emphasis on software, IMO.

[slide image]

In particular, programming models will need to shift. Caches are a great way to speed up software without breaking backward compatibility, but the lookups burn a lot of power. I think we will need to start reducing dependence on hardware-managed caches. Hardware prefetchers are also nice, but have their own overheads and limitations.

Finally, I keep expecting the industry to start looking beyond conventional approaches to out-of-order execution. We can't go back to strictly in-order, but I think there's a better compromise than the current conceit of a serial ISA that forces the hardware to do all the work needed to find concurrency.
 

InvalidError

Titan
Moderator
Can someone explain the difference between "2.5D Si INT, EFB" and "3D Chiplets", in the first slide of the second set?
2.5D is when you use interposers to tie different dies together a bit like current-day HBM, while 3D is when stuff gets stacked directly on top of, or tucked under, some other major functional silicon instead of dedicated interconnect silicon.

So AMD's 3D-Vcache products are basically a mix of 2.5D to connect chiplets and 3D for the extra cache directly on CPUs.

Since AMD spun off the cache and memory controllers into chiplets for its higher-end GPUs, the logical evolution would be for the cache-memory controllers to become the base die for some sort of HBM-like memory.

Going "pure 3D" may be problematic since heat produced closer to the BGA/LGA substrate has to travel through everything stacked on top to reach the IHS or heatsink. Anything besides low-power stuff where this isn't an issue will likely remain hybrid 2.5-3D for thermal management reasons.
 

bit_user

Polypheme
Ambassador
2.5D is when you use interposers to tie different dies together a bit like current-day HBM,
Well, if you look at the slide, the starting point seems to be chiplets, since they show a picture of a de-lidded EPYC. So, I assume "2.5D Si INT, EFB" means more than that.

Going "pure 3D" may be problematic since heat produced closer to the BGA/LGA substrate has to travel through everything stacked on top to reach the IHS or heatsink. Anything besides low-power stuff where this isn't an issue will likely remain hybrid 2.5-3D for thermal management reasons.
I wonder if they could integrate graphene to wick away heat from the compute layer.


"Compared with metals or semiconductors, graphene has demonstrated extremely high intrinsic thermal conductivity, in the range from 2000 W/mK to 5000 W/mK at RT. This value is among the highest of known materials. Moreover, few-layer graphene (FLG) films with the thickness of a few nanometers also maintain rather high thermal conductivity unlike semiconductor or metals. Therefore, graphene and FLG are promising materials for micro or even nanometer scale heat spreader applications"​

You might use some TSV approach to send the heat up to the top of the stack, or maybe the entire package is a vapor chamber and you'd just have to draw the heat out to the edges of the chip.
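Quick sanity check on how much heat a thin film could actually move sideways, using the conductivity range from that quote and Q = k·A·ΔT/L (the film geometry is entirely made up for illustration):

```python
# Lateral conduction through a thin spreading film: Q = k * A_cross * dT / L.
# The film geometry below is made up for illustration only.
def lateral_heat_flow_w(k, thickness_m, width_m, length_m, delta_t_k):
    cross_section = thickness_m * width_m            # m^2, the film's edge area
    return k * cross_section * delta_t_k / length_m  # watts

for thickness in (1e-6, 50e-6):         # a 1 um film vs a 50 um film
    for k in (2000, 5000):              # W/(m*K), the range quoted above
        q = lateral_heat_flow_w(k, thickness, 20e-3, 10e-3, 20)
        print(f"{thickness * 1e6:>4.0f} um thick, k = {k}: ~{q:.2f} W moved sideways")
```

So, at least in this sketch, a few-nanometer FLG layer mostly helps even out hot spots; moving tens of watts laterally would take a much thicker spreader or something like the vapor-chamber idea.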

Or, why would the compute die need to be at the base of the stack? Why couldn't it sit on top? Especially if it were restricted to computing on just the memory within the stack, then you might not need so many I/Os between it and the rest of the package.
 
  • Like
Reactions: pointa2b

InvalidError

Titan
Moderator
Or, why would the compute die need to be at the base of the stack? Why couldn't it sit on top?
I'd imagine that sending 100-250A through the bottom dies with a helluva bunch of TSVs would make burying DRAM, SRAM, NAND and other dense structures under a CPU/GPU die kind of problematic. Thermal vias through silicon would be a no-go for the same reason. Much simpler to limit TDP to what the stack can pass.

Using a graphene heat-spreading layer between 3D-stack layers might help alleviate hot spots and improve heat propagation through the stack, though I bet we are 10+ years away from economically viable ways of doing that for consumer electronics. Since graphene is an extremely good electrical conductor, the heat-spreading graphene layers would need to have thousands of holes precision-cut out of them to avoid interfering with the copper pillars between dies; I can't imagine that getting cheap any time soon.
 

JamesJones44

Reputable
Jan 22, 2021
The CPUs with on-package if not direct-stacked memory that I predicted about two years ago are one more step closer.

I wouldn't be surprised if DDR6 ends up being the last external memory standard we get before main memory moves on-package, with external memory expansion going over PCIe/CXL whenever you need more than whatever on-package memory your CPU/GPU has.

An IGP chiplet/tile with 4-8GB of stacked HBM-like memory and almost direct access to the system memory controller should be interesting.

Yep, I figured that once Apple pushed putting memory on-package in the name of efficiency and it proved successful, the general PC community would eventually follow for the same reasons.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
Yep, I figured that once Apple pushed putting memory on-package in the name of efficiency and it proved successful, the general PC community would eventually follow for the same reasons.
It's the eventual push toward more tiers of memory.

If you thought what we have now is crazy, there will be more layers of memory added in the future.

L4$/L5$/L6$/L7$ will all have their place.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
Mounting DRAM on top seems like a problem not easily solved with the extra thermal mass above the compute dies.

Mounting it next door seems more practical; Apple proved it works.

Add in HBM and modular chiplet-based memory controllers that you can mass-produce cheaply, and it becomes a more practical manufacturing problem.
AMD already separated the memory controller into its own die on the GPU side; it's only a matter of time before I see them doing it on the CPU side so that it becomes easy/cheap to mix and match memory types as needed, allowing easy creation of a variety of product SKUs from common parts, like Legos.

I can easily envision AMD making separate memory controllers for each RAM type:
1x for regular DRAM DIMMs using an OMI-type interface
1x for HBM#
1x for regular GDDR#
All using tiny dies, and maybe adding SRAM on top of the memory controller to help lower latency and improve bandwidth.
 
  • Like
Reactions: JamesJones44

TJ Hooker

Titan
Ambassador
Well, if you look at the slide, the starting point seems to be chiplets, since they show a picture of a de-lidded EPYC. So, I assume "2.5D Si INT, EFB" means more than that.
I believe they're using "2D" to refer to MCM packaging that has the chiplet interconnects going through the (organic) package substrate (what they used for their CPUs starting with Zen 2). Whereas they're using "2.5D" to refer to products like their HBM GPUs, where the chiplet interconnects go through a silicon interposer (which in turn sits on top of the substrate). You can see a picture illustrating what (I believe) the difference is in slide 11 here: https://nepp.nasa.gov/workshops/etw...ues/1500_Ramamurthy-Chiplet-Technology-v3.pdf

I don't know if the way AMD is using the terms exactly lines up with industry-standard definitions of 2D/2.5D (the presentation I linked above seems to consider both interposer- and substrate-based interconnects to be examples of "2.xD packaging"). I also can't figure out what "EFB" stands for in AMD's slide.
 
  • Like
Reactions: bit_user

LawlessQuill

Prominent
Apr 22, 2021
The CPUs with on-package if not direct-stacked memory that I predicted about two years ago are one more step closer.

I wouldn't be surprised if DDR6 ends up being the last external memory standard we get before main memory moves on-package, with external memory expansion going over PCIe/CXL whenever you need more than whatever on-package memory your CPU/GPU has.

An IGP chiplet/tile with 4-8GB of stacked HBM-like memory and almost direct access to the system memory controller should be interesting.
DDR7 already exists for vram, and is in development for ram
 

InvalidError

Titan
Moderator
DDR7 already exists for vram, and is in development for ram
GDDR7 is not DDR7. DDR6 is 4-5 years away, and DDR7 would be another 5-7 years beyond that. That is well past the point where I expect most CPUs and GPUs to have on-package, likely 3D-stacked, DRAM, which is exactly when I expect memory expansion to move to PCIe/CXL.

Once external memory goes PCIe/CXL, it won't matter what the underlying memory is, all you need is an appropriate memory controller bridge for whatever memory you want to use.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
Mounting DRAM on top seems like a problem not easily solved with the extra thermal mass above the compute dies.

Mounting it next door seems more practical; Apple proved it works.
Samsung and SK Hynix both have compute-in-memory solutions. I think at least Samsung's puts the compute in the bottom die of the stack.

AMD already separated the memory controller into its own die on the GPU side; it's only a matter of time before I see them doing it on the CPU side
Actually, their CPUs had it first. If you remember, back in the Ryzen 3000 series the I/O die had the memory controller + I/O.

so that it becomes easy/cheap to mix and match memory types as needed, allowing easy creation of a variety of product SKUs from common parts, like Legos.
Yeah, like maybe they could've used a different I/O die to effectively back-port Zen 4 to AM4, so that people could use it on cheaper motherboards and with DDR4.
 

bit_user

Polypheme
Ambassador
DDR7 already exists for vram, and is in development for ram
I think GDDR is its own thing, and the numbering has no direct correspondence with regular DDR memory standards.

I would like to see a plan for phasing out discrete cpus, and having a singular integrated chip system
Did you mean discrete GPUs getting phased out? Won't happen. The high-end GPUs will remain distinct from CPUs, for the foreseeable future. The main reason is GDDR memory, which has far higher bandwidth than is available to a CPU. It also turns out to be useful to be able to upgrade your GPU without having to toss out your old CPU.

At the low and even mid-range, we could see iGPUs with in-package memory eroding the market segment of dGPUs, but Apple's M1 Ultra shows that even packing like 8 channels of LPDDR5 in-package isn't enough to compete with high-end dGPUs. Don't believe me? Check its Geekbench scores.
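For a sense of the gap, peak DRAM bandwidth is just bus width times data rate. A rough comparison, using approximate configurations of my own for illustration (not figures from the article):

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_mtps):
    """Peak theoretical bandwidth in GB/s: bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * data_rate_mtps / 1000

# Approximate, illustrative configurations.
configs = {
    "desktop CPU, 128-bit DDR5-6000":          (128, 6000),
    "Apple M1 Ultra, 1024-bit LPDDR5-6400":    (1024, 6400),
    "high-end dGPU, 384-bit GDDR6X @ 21 Gbps": (384, 21000),
}
for name, (width, rate) in configs.items():
    print(f"{name}: ~{peak_bandwidth_gbs(width, rate):.0f} GB/s")
```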
 

InvalidError

Titan
Moderator
Did you mean discrete GPUs getting phased out? Won't happen. The high-end GPUs will remain distinct from CPUs, for the foreseeable future. The main reason is GDDR memory, which has far higher bandwidth than is available to a CPU.
And why is it that CPUs have lower memory bandwidth in the first place? The need for customizable memory size using DIMMs. Once CPUs have on-package memory as mainstream, the memory can be whatever the manufacturer wants it to be. Had Apple wanted to, it could have gone 2-4xHBM3E.
 

rluker5

Distinguished
Jun 23, 2014
I think Intel will try to keep this tech in the server market for as long as possible for profit reasons.
They could put it in consumer, but I think they will wait for AMD to catch up and do that first.

As far as optical goes, I also have concerns about latency, longevity and price.
 

InvalidError

Titan
Moderator
I think Intel will try to keep this tech in the server market for as long as possible for profit reasons.
They could put it in consumer, but I think they will wait for AMD to catch up and do that first.
There are reasons why new chip-making tricks get used on high-value, high-margin stuff first. Bonding a bunch of chips together with 3D-stacking TSVs isn't cheap and incurs a significant amount of chip design overhead. It'll be a while before the whole process gets refined, more cost-efficient, more reliable and more readily accessible.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
And why is it that CPUs have lower memory bandwidth in the first place?
The main reason is that they simply don't need more, at the consumer tier. The number of memory channels starts to get a little silly with the bigger server CPUs, but the main issue for servers is capacity.

Once CPUs have on-package memory as mainstream, the memory can be whatever the manufacturer wants it to be. Had Apple wanted to, it could have gone 2-4xHBM3E.
Cost. There's a reason consumer GPUs use GDDR memory and not HBM. It's also the main reason Nvidia used LPDDR5X in Grace, rather than the HBM we'd have expected.
"Power efficiency and memory bandwidth are both critical components of data center CPUs. The NVIDIA Grace CPU Superchip uses up to 960 GB of server-class low-power DDR5X (LPDDR5X) memory with ECC. This design strikes the optimal balance of bandwidth, energy efficiency, capacity, and cost for large-scale AI and HPC workloads.​
Compared to an eight-channel DDR5 design, the NVIDIA Grace CPU LPDDR5X memory subsystem provides up to 53% more bandwidth at one-eighth the power per gigabyte per second while being similar in cost. An HBM2e memory subsystem would have provided substantial memory bandwidth and good energy efficiency but at more than 3x the cost-per-gigabyte and only one-eighth the maximum capacity available with LPDDR5X.
The lower power consumption of LPDDR5X reduces the overall system power requirements and enables more resources to be put towards CPU cores. The compact form factor enables 2x the density of a typical DIMM-based design."​

The bandwidth they get to their directly-connected LPDDR5X is only about 546 GB/s (@ 32-channel -> 512-bit ?). So, the bandwidth tradeoff vs. HBM is real, and yet for reasons of capacity and cost they went with LPDDR5X.
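A quick consistency check on that figure, assuming the 512-bit aggregate width guess is right (my own arithmetic, not from Nvidia's post):

```python
# If ~546 GB/s really comes from a 512-bit (64-byte) aggregate interface,
# the implied per-pin data rate works out to roughly LPDDR5X-8533.
bus_bytes = 512 // 8
print(f"Implied data rate: ~{546 / bus_bytes * 1000:.0f} MT/s")
```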


Getting back to the premise of replacing dGPUs, I find it a little hard to swallow that we're going to put a 500+ W, $1.5k monster GPU + a 320 W, $800 gaming CPU + probably like $500 of HBM in a single package that can only be cooled with chilled water and that you have to completely toss out if any part of it breaks or you want to upgrade your memory, CPU, or GPU. That's why I think iGPUs will be limited to laptops and low-to-mid-range desktops. Or exotic server chips like AMD's MI300.

Like the dinosaurs that ruled the earth for millions of years, dGPUs are very good at what they do. It will similarly take an industry-smashing asteroid to make them go extinct.
 