News Intel Demos Meteor Lake CPU with On-Package LPDDR5X

Status
Not open for further replies.

ezst036

Honorable
Oct 5, 2018
653
561
12,420
It looks like Intel has taken a page from Apple's book as the company has been installing LPDDR memory on its M1 and M2 packages for a while now.

The product demonstrated by Intel is its quad-tile Meteor Lake CPU, which uses Foveros packaging for its chiplets and carries 16GB of Samsung's LPDDR5X-7500 memory. The actual memory configuration of the CPU is unknown, but the 16GB of memory can provide a peak bandwidth of 120 GB/s.

Looks like this sits firmly in between M1 and M1 Pro.


M1: 68.25 GB/s | M1 Pro: 200 GB/s

Just a little bit above M2, which is 100 GB/s. And what level is this anyway, like a level 9 cache?
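
For anyone who wants to sanity-check those figures, here is a minimal back-of-the-envelope sketch. The 128-bit bus width for Meteor Lake is an assumption (the article says the actual configuration is unknown); the Apple widths and speeds are the commonly published ones.

```c
#include <stdio.h>

/* Peak DRAM bandwidth in GB/s = transfer rate (MT/s) * bus width (bits) / 8 / 1000 */
static double peak_gbs(double mt_per_s, int bus_bits)
{
    return mt_per_s * bus_bits / 8.0 / 1000.0;
}

int main(void)
{
    printf("Meteor Lake, LPDDR5X-7500, 128-bit (assumed): %.1f GB/s\n", peak_gbs(7500, 128)); /* 120.0 */
    printf("Apple M1,     LPDDR4X-4266, 128-bit:          %.2f GB/s\n", peak_gbs(4266, 128)); /* ~68.3, quoted as 68.25 */
    printf("Apple M2,     LPDDR5-6400,  128-bit:          %.1f GB/s\n", peak_gbs(6400, 128)); /* ~102, quoted as 100 */
    printf("Apple M1 Pro, LPDDR5-6400,  256-bit:          %.1f GB/s\n", peak_gbs(6400, 256)); /* ~205, quoted as 200 */
    return 0;
}
```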
 

rluker5

Distinguished
Jun 23, 2014
721
439
19,260
I like the on-package memory. Even if it only helps performance and efficiency a bit, that is better than nothing.

But where is this interposer with the cache that was rumored? I had my hopes up.

Also, Apple wasn't first with the on-package DRAM; that was Crystalwell, an Intel product, back in 2014.
 
  • Like
Reactions: artk2219

JamesJones44

Reputable
Jan 22, 2021
754
687
5,760
The big question I have is: does the iGPU have access to this memory in the same way Apple Silicon does? If it does, there could be a nice boost for the iGPU in these processors. If not, it may not be as dramatic a bump as Apple Silicon gets from its Unified Memory structure.
 
  • Like
Reactions: artk2219

jkflipflop98

Distinguished
I like the on-package memory. Even if it only helps performance and efficiency a bit, that is better than nothing.

But where is this interposer with the cache that was rumored? I had my hopes up.

Also, Apple wasn't first with the on-package DRAM; that was Crystalwell, an Intel product, back in 2014.

Wut? Exactly what advantages do you suppose putting cache on the interposer will provide? How would you even use that? The interposer is a fancy smart breadboard and nothing more. It's just wires. There's no need to shove transistors in there.
 

kwohlt

Commendable
Oct 7, 2021
35
37
1,560
Wut? Exactly what advantages do you suppose putting cache on the interposer will provide? How would you even use that? The interposer is a fancy smart breadboard and nothing more. It's just wires. There's no need to shove transistors in there.
The advantage of 3D stacking is die space. The die doesn't have room for the rumored 128MB of L4 cache. The rumor was 3D stacked cache by placing the L4 in the active interposer, which would also allow better cooling as the compute tile would be in direct contact with the IHS, rather than being beneath the cache as X3D does.
 
Sep 6, 2023
1
4
10
This would be ideal for appliance-type devices for which people never upgrade the RAM anyway. Examples:
1. Chromebooks
2. tablets
3. ultrabooks
4. gaming consoles (Steam Deck competitors)

And the idea that the integrated RAM will make heat problems significantly worse is strange. First off, this isn't going to be used on workstation-class notebooks with 32 or 64 GB of RAM. It is going to be used for consumer and prosumer devices like the Dell XPS 13. Even if it did, the node shrink, going from the 10nm-class Intel 7 for Raptor Lake to the 7nm-class Intel 4 for Meteor Lake, would more than compensate for it anyway.
 

jkflipflop98

Distinguished
The advantage of 3D stacking is die space. The die doesn't have room for the rumored 128MB of L4 cache. The rumor was 3D stacked cache by placing the L4 in the active interposer, which would also allow better cooling as the compute tile would be in direct contact with the IHS, rather than being beneath the cache as X3D does.

Your OP didn't say anything about die stacking. You said "where is this interposer with the cache that was rumored?" Foveros doesn't need cache itself. You add cache by stacking it on top. There's no need for it to be IN the breadboard.
 
  • Like
Reactions: bit_user

abufrejoval

Reputable
Jun 19, 2020
480
323
5,060
I like the on-package memory. Even if it only helps performance and efficiency a bit, that is better than nothing.

But where is this interposer with the cache that was rumored? I had my hopes up.

Also, Apple wasn't first with the on-package DRAM; that was Crystalwell, an Intel product, back in 2014.
I guess you refer to the 64/128MB of eDRAM for GPU use, and that is what I immediately had to think of, too. I have several systems with eDRAM and the bigger 48EU iGPUs, but they yield very poor performance increases and feel more like a 24->32EU upgrade rather than the doubling they should have delivered. And then Xe with 96EUs does offer a linear 4x upgrade over the previous 24EU iGPUs without any of the eDRAM overhead!

In terms of Apple pushing and Intel implementing, I'd say you'd call that "co-creation", and outsiders might never know who led with the idea.

The current Apple design eliminates all external DRAM interfaces from the die carrier, and that is obviously a big saving in terms of power and cost, but it means a complete loss of RAM expandability.

So for me that is the biggest question, and so far there is no indication which way Intel wants to go here: do they cut external RAM, or do they allow external expansion?

And how would that be implemented and work?

I own an Alder Lake laptop that has 8GB of soldered DRAM and a SO-DIMM slot that was originally filled with a matching 8GB DDR4 SO-DIMM for a fully matched dual-channel setup. But since I often need to operate with VMs and RAM is so extraordinarily cheap these days (unless you buy from Apple), I replaced the 8GB stick with a 32GB variant, fully aware that this would reduce the upper 24GB of RAM to single-channel operation and bandwidth.

My rationale is that it's still way faster than paging, and that the most bandwidth-critical part, the iGPU frame buffer, would be allocated from the bottom physical RAM, which remains dual-channel.
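
To put rough numbers on that rationale, here is a quick sketch; the DDR4-3200 speed is an assumption (typical for an Alder Lake laptop, but not stated above), and each DDR4 channel is 64 bits wide.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed: DDR4-3200, 64-bit channels */
    const double chan_gbs = 3200.0 * 64 / 8 / 1000;            /* 25.6 GB/s per channel */

    printf("Lower 16 GB (interleaved, dual channel): %.1f GB/s\n", 2 * chan_gbs); /* 51.2 */
    printf("Upper 24 GB (single channel):            %.1f GB/s\n", chan_gbs);     /* 25.6 */
    /* Even the single-channel region is far faster than paging to an NVMe SSD,
     * which tops out at a few GB/s. */
    return 0;
}
```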

But how would this work out with a chip like this? Dual-channel operation across the on-package and external DRAM seems pretty much out of reach when latencies and general characteristics are as far out of step as they would be here. Running two distinct single-channel memory pools doesn't seem very attractive either, which provides a strong hint that this type of chip would cut memory expansion entirely, like Apple's Mx designs.

Except that those have a trick up their sleeves: they combine multiples of the base chip, each with its own DRAM packages, into x2 and x4 designs that effectively double or quadruple the memory channels and bandwidth accordingly.

There is no obvious way this could happen with the Intel variant, judging from the provided picture. Yet without such a scaling escape hatch, their approach no longer looks terribly attractive.

I do think that having both on-die (or on-substrate) extra DRAM together with more traditional external DRAM is attractive, but only if that on-die DRAM is wide and fast enough to carry all iGPU burdens with ease.

But that would likely require a completely distinct memory controller with another set of channels, and therefore a chip redesign that's far from minor... and thus hard to imagine, even if Intel has far fewer qualms about designing fully distinct die variants than AMD.

In other words: this article unfortunately creates more questions than I feel it answers.
 
Sep 5, 2023
10
3
15
I would hope that with on-package memory you would be able to do better than 7500 MT/s / 120 GB/s of bandwidth, but I guess not.
Latency and power usage will be better. But at some point the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed. You'd have to move to some kind of exotic memory to get both low latency and high bandwidth. And HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth because of the increased latency on small accesses common to most consumer workloads.
 
  • Like
Reactions: KyaraM

bit_user

Titan
Ambassador
Also, Apple wasn't first with the on-package DRAM; that was Crystalwell, an Intel product, back in 2014.
No - very different things. Intel used a tiny (~128 MB) chunk of on-die eDRAM for a graphics cache. It was essentially like L4 cache, IIUC.

What Apple and this latest example are doing is putting the main memory DRAM on package (not on die), next to the CPU. It's not a cache, either. It's actually the main memory of the CPU.

BTW, Intel wasn't exactly the first to do the eDRAM thing, either. Back in 2013, the Xbox One launched with a 32 MB chunk of eSRAM that it used for graphics. Before that, the Xbox 360 also featured an eDRAM framebuffer in its dGPU. It wasn't until 2017 that Microsoft finally threw in the towel on this approach and went with a simple, wide GDDR memory setup.
 

bit_user

Titan
Ambassador
The big question I have is: does the iGPU have access to this memory in the same way Apple Silicon does? If it does, there could be a nice boost for the iGPU in these processors. If not, it may not be as dramatic a bump as Apple Silicon gets from its Unified Memory structure.
Yes, of course. This is simply the main memory of the CPU, migrated on package instead of in DIMM slots.

Meteor Lake stands to benefit from the additional memory bandwidth, as it's rumored to feature a much larger tGPU.
 

bit_user

Titan
Ambassador
The advantage of 3D stacking is die space. The die doesn't have room for the rumored 128MB of L4 cache. The rumor was 3D stacked cache by placing the L4 in the active interposer, which would also allow better cooling as the compute tile would be in direct contact with the IHS, rather than being beneath the cache as X3D does.
I think the reason not to put cache in the interposer is that it's typically made on a much coarser process node. The interposer is also physically larger than any of the compute tiles, so if you stuffed a bunch of SRAM in there, it would be much more expensive than a typical interposer. I'm pretty sure it would be more cost-effective to put L4 in its own tile than in the interposer.

That's just my uninformed speculation. Listen to @jkflipflop98 - he actually knows stuff.
 

bit_user

Titan
Ambassador
I would hope that with on-package memory you would be able to do better than 7500 MT/s / 120 GB/s of bandwidth, but I guess not.
That will come, in time. JEDEC already specifies at least LPDDR5X-8533.

Samsung and SK Hynix have both announced products implementing it.

Latency and power usage will be better.
No, not latency. This is a common misconception propagated by people who can't do math.

Simply look at the speed of signal propagation in copper and you'll see that it makes no difference whether the DRAM is on package or in a DIMM slot next to the CPU. DRAM latency is not due to the fact that it's external. At least, if we're not talking about big server boards that use RDIMMs.

The real reason it's done is to enable higher frequencies and provide power savings. Maybe also cost savings, if you're Apple.
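
To put rough numbers on the propagation argument, here is a small sketch; the trace lengths are illustrative assumptions, not measurements of any real board.

```c
#include <stdio.h>

int main(void)
{
    const double cm_per_ns = 15.0;  /* signal speed in PCB copper, roughly 0.5c */
    const double dimm_cm   = 10.0;  /* assumed trace length to a DIMM slot      */
    const double on_pkg_cm = 2.0;   /* assumed trace length to on-package DRAM  */
    const double cas_ns    = 40.0 / 2400.0 * 1000.0;  /* DDR5-4800 CL40, ~16.7 ns */

    double delta_ns = 2.0 * (dimm_cm - on_pkg_cm) / cm_per_ns;  /* round trip */
    printf("Round-trip wire-delay difference:   %.2f ns\n", delta_ns);  /* ~1.1 ns */
    printf("CAS latency alone (DDR5-4800 CL40): %.1f ns\n", cas_ns);
    return 0;
}
```

Roughly a nanosecond of wire delay against tens of nanoseconds of total DRAM latency is lost in the noise.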

But at some point the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed.
It's interface-limited, not DRAM-limited. HBM clearly shows that. On-package memory enables much wider, HBM-like interfaces. Thus, it's somewhat of a foregone conclusion that on-package memory is how bandwidth scaling will continue, certainly if we're at all concerned about power efficiency.
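
A quick sketch of the width argument; the 128-bit LPDDR5X width is the same assumption as earlier in the thread, and the HBM3 figure is the JEDEC base rate for a 1024-bit stack.

```c
#include <stdio.h>

/* GB/s = per-pin rate (GT/s) * interface width (bits) / 8 */
static double peak_gbs(double gt_s, int bus_bits)
{
    return gt_s * bus_bits / 8.0;
}

int main(void)
{
    printf("LPDDR5X-7500, 128-bit (assumed):    %6.1f GB/s\n", peak_gbs(7.5, 128));  /* 120.0 */
    printf("HBM3, 6.4 GT/s, 1024-bit per stack: %6.1f GB/s\n", peak_gbs(6.4, 1024)); /* 819.2 */
    return 0;
}
```

Similar per-pin rates, very different widths.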

HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth because of the increased latency on small accesses common to most consumer workloads.
For memory-light workloads, you should get good cache hit rates. For memory-heavy workloads, your worst-case bandwidth is much more determinative of performance, and that's where increasing bandwidth will have a much greater impact than reducing intrinsic latency.

We already have a real-world example in the Sapphire Rapids Xeon Max. Show me a single case where HBM-only performs worse than DDR5-only:
 

rluker5

Distinguished
Jun 23, 2014
721
439
19,260
No - very different things. Intel used a tiny (~128 MB) chunk of on-die eDRAM for a graphics cache. It was essentially like L4 cache, IIUC.

What Apple and this latest example are doing is putting the main memory DRAM on package (not on die), next to the CPU. It's not a cache, either. It's actually the main memory of the CPU.

BTW, Intel wasn't exactly the first to do the eDRAM thing, either. Back in 2013, the Xbox One launched with a 32 MB chunk of eSRAM that it used for graphics. Before that, the Xbox 360 also featured an eDRAM framebuffer in its dGPU. It wasn't until 2017 that Microsoft finally threw in the towel on this approach and went with a simple, wide GDDR memory setup.
I was responding to where the article stated that Intel was copying Apple by having on-package DRAM. eDRAM is on-package DRAM: a separate, distinct DRAM chip on the same package as the APU. Yes, it is less memory; it was 9 years ago, and they couldn't pack the transistors as densely back then.
Also, the eSRAM wasn't a separate DRAM chip on the same package as the APU. And the X360 GPU isn't an APU.
 
  • Like
Reactions: TJ Hooker

bit_user

Titan
Ambassador
I was responding to where the article stated that Intel was copying Apple by having on-package DRAM. eDRAM is on-package DRAM: a separate, distinct DRAM chip on the same package as the APU. Yes, it is less memory; it was 9 years ago, and they couldn't pack the transistors as densely back then.
Okay, I missed that it was a separate chip, but that doesn't change the fact that it was merely an L4 cache and not main memory. Just because DRAM density might not have existed to enable full, on-package main memory doesn't give Intel credit for doing it back then. They did not.

I think what enabled Lakefield and the Apple M-series to pack enough DRAM density is chip-stacked DRAM, though I don't know when that technique really got started. It could be they also needed the power-saving techniques in LPDDR4X in order to pack enough density.

Also, the eSRAM wasn't a separate DRAM chip on the same package as the APU. And the X360 GPU isn't an APU.
I know that, but I was pointing out how Microsoft had been doing things similar to Crystalwell for years.

In fact, I had a suspicion that Crystalwell started as a bid by Intel to win the Xbox One design. Intel was rumored to be interested in the console market (remember, they had their fingers in a lot more markets back then). Given Microsoft's proclivity for fast, on-die graphics memory, this is precisely the kind of bid that would've attracted their interest.
 
  • Like
Reactions: rluker5

rluker5

Distinguished
Jun 23, 2014
721
439
19,260
Okay, I missed that it was a separate chip, but that doesn't change the fact that it was merely an L4 cache and not main memory. Just because DRAM density might not have existed to enable full, on-package main memory doesn't give Intel credit for doing it back then. They did not.

I think what enabled Lakefield and the Apple M-series to pack enough DRAM density is chip-stacked DRAM, though I don't know when that technique really got started. It could be they also needed the power-saving techniques in LPDDR4X in order to pack enough density.


I know that, but I was pointing out how Microsoft had been doing things similar to Crystalwell for years.

In fact, I had a suspicion that Crystalwell started as a bid by Intel to win the Xbox One design. Intel was rumored to be interested in the console market (remember, they had their fingers in a lot more markets back then). Given Microsoft's proclivity for fast, on-die graphics memory, this is precisely the kind of bid that would've attracted their interest.
Not the first time I've been misunderstood because I made a few statements that were related in some way, but not to the main theme I was commenting on. And it won't be the last. eDRAM is not system memory; it's just in the category of DRAM, so it's not equivalent to Apple's approach or this MTL example in terms of use, just in terms of packaging.

I felt compelled to make the correction quickly because I run an old Crystalwell in my garage music streamer. And it isn't nearly as fast as an Xbox One in graphics. Except maybe with MilkDrop.
 

JayNor

Honorable
May 31, 2019
438
93
10,760
The PVC GPU has 288MB of SRAM total on the two base tiles, so it certainly is possible.

The Kaby Lake G consumer chips had 4GB of HBM for the embedded GPU.

Intel supported LPDDR4X for the DG1 GPU. Would it be a surprise if this LPDDR5X is solely for the tGPU? It wouldn't be surprising to me, since the presentations show the GPU not connected to the CPU L3 on the ring bus.
 

bit_user

Titan
Ambassador
The PVC GPU has 288MB of SRAM total on the two base tiles, so it certainly is possible.
Is it cache or directly-addressable? Is it connected through an interposer or directly by TSVs to the compute dies?

The Kaby Lake G consumer chips had 4GB of HBM for the embedded GPU.
Yes, although that was private memory. It was practically a mini-Vega (although, we later learned that it was derived from Polaris) that happened to share a package with the CPU. The main benefit it got from that arrangement was coordinated power management.

Would it be a surprise if this LPDDR5X is solely for the tGPU?
Yes, that would be very surprising!

It wouldn't be surprising to me, since the presentations show the GPU not connected to the CPU L3 on the ring bus.
There's the whole L4 cache rumor, though. If the SoC has a L4 cache, then it wouldn't be at all surprising for the tGPU not to share the CPU's L3.
 

abufrejoval

Reputable
Jun 19, 2020
480
323
5,060
Latency and power usage will be better. But at some point the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed. You'd have to move to some kind of exotic memory to get both low latency and high bandwidth. And HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth because of the increased latency on small accesses common to most consumer workloads.
HBM as a consumer option is further away than ever, as long as there is no way to dramatically reduce the packaging cost, which puts HBM at 10:1 or worse over DRAM.

Access size for normal DRAM is nearly always at the granularity of cache lines, 64 bytes or 512 bits mostly. And for GPU workloads it's likely to be wider, so I can't see HBM at 1024 or even 2048 bits becoming a performance regression any time soon, just because it's "too wide" as you seem to imply.

But because of the price gap, it's very hard to prove that point.

I've heard a similar argument against GDDR as a DRAM replacement, but I doubt that's ever been proven, either, because at 3-4:1 over DRAM or worse in terms of cost, there is no hardware out there to test this practically.

And then you'd probably have to resort to very special workloads to obtain noticeable differences: megabytes of on-chip caches were invested to make that as hard as possible.

You give me GDDR or HBM at DRAM prices, and I'd say I'm ready to 'suffer' the consequences in terms of performance!

But then I run 3D variants of Ryzens, too.
 

bit_user

Titan
Ambassador
Access size for normal DRAM is nearly always at the granularity of cache lines, 64 bytes or 512 bits mostly. And for GPU workloads it's likely to be wider, so I can't see HBM at 1024 or even 2048 bits becoming a performance regression any time soon, just because it's "too wide" as you seem to imply.
First, desktop GPUs are now mostly using 32-way SIMD, while I think server GPUs use 64-way (AMD, at least - Nvidia might be using 128-way). Multiplying by 32 bits per lane, it translates to 1024b or 2048b (or 4096b). Thus, a SIMD load of interleaved data would mostly work in such transaction sizes. However, graphics involves a fair amount of scatter/gather type of accesses, in which case having to use such large chunks would be rather sub-optimal.

Next, modern DRAM access works in bursts. In DDR5, you must read or write typically 8 or 16 cycles worth of data per channel. In DDR4, it was 4 or 8, but DDR4 was twice the width (64b; DDR5 cut it to 32b). So, if the interface were single-channel, then you might be stuck working in 4096 or 8192 bit chunks (up through HBM3).

However, according to this:

"As per JEDEC, in HBM3, each DRAM stack can support up to 16 channels compared to 8 channels in HBM."

1024b / 16 = 64b - the same as a single DDR4 channel!

Furthermore:

"HBM3 has a pseudo channel mode architecture which was introduced in HBM2 standards, which divides a channel into two individual sub-channels of 32-bit I/O each as compared to 64-bit I/O in HBM2. On each segment, a read or write transaction transfers 256 bits in a burst that consists of 8 cycles of 32 bits each."

Source: https://www.lumenci.com/post/high-bandwidth-memory

Presumably, the burst length for 64-bit mode is 512 bits. I guess the use case for 256-bit bursts is probably scatter/gather accesses, like I mentioned above.
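
Putting those transaction sizes side by side (a sketch; the widths and burst lengths are the JEDEC figures discussed above):

```c
#include <stdio.h>

/* Minimum read/write transaction = channel width (bits) * burst length / 8 bytes */
static int txn_bytes(int width_bits, int burst_len)
{
    return width_bits * burst_len / 8;
}

int main(void)
{
    printf("DDR4 channel,        64-bit x BL8:  %3d bytes\n", txn_bytes(64, 8));  /*  64 */
    printf("DDR5 sub-channel,    32-bit x BL16: %3d bytes\n", txn_bytes(32, 16)); /*  64 */
    printf("HBM3 64-bit channel, 64-bit x BL8:  %3d bytes\n", txn_bytes(64, 8));  /*  64 */
    printf("HBM3 pseudo-channel, 32-bit x BL8:  %3d bytes\n", txn_bytes(32, 8));  /*  32 */
    /* A 32-lane x 32-bit SIMD load of contiguous data spans two 64-byte cache lines. */
    printf("32-lane x 32-bit SIMD load:         %3d bytes\n", 32 * 32 / 8);       /* 128 */
    return 0;
}
```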

And then you'd probably have to resort to very special workloads to obtain noticeable differences: megabytes of on-chip caches were invested to make that as hard as possible.
Any cache can be thrashed. If you know how set-associative caches work and what the associativity is, a first-year student could write a simple program that causes cache thrashing.
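
For the curious, here is a sketch of the kind of toy program meant here, assuming a hypothetical 1 MiB, 16-way cache with 64-byte lines; the exact geometry doesn't matter, the point is that a stride of sets * line size makes every access land in the same set.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cache geometry (assumed for the example):
 * 1 MiB, 16-way set-associative, 64-byte lines -> 1024 sets. */
#define LINE       64
#define WAYS       16
#define CACHE_SIZE (1 << 20)
#define SETS       (CACHE_SIZE / (WAYS * LINE))
#define STRIDE     ((size_t)SETS * LINE)   /* 64 KiB: aliases into the same set  */
#define NLINES     (2 * WAYS)              /* more conflicting lines than ways   */

int main(void)
{
    /* All NLINES addresses map to one cache set. Because NLINES > WAYS,
     * each pass evicts lines it will need again, so nearly every access
     * misses: classic set-conflict thrashing. */
    volatile unsigned char *buf = calloc(NLINES, STRIDE);
    if (!buf)
        return 1;

    unsigned long sum = 0;
    for (int pass = 0; pass < 100000; pass++)
        for (int i = 0; i < NLINES; i++)
            sum += buf[i * STRIDE];

    printf("sum = %lu\n", sum);   /* keeps the loops from being optimized away */
    free((void *)buf);
    return 0;
}
```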

You give me GDDR or HBM at DRAM prices, and I'd say I'm ready to 'suffer' the consequences in terms of performance!
Then, I suppose you probably didn't see this:

 
  • Like
Reactions: abufrejoval