Intel takes a page from Apple's book, shows off Meteor Lake with 16 GB of on-package LPDDR5X-7500 memory.
Intel Demos Meteor Lake CPU with On-Package LPDDR5X : Read more
It looks like Intel has taken a page from Apple's book, as Apple has been installing LPDDR memory on its M1 and M2 packages for a while now.
The product demonstrated by Intel is its quad-tile Meteor Lake CPU, which uses Foveros packaging for its chiplets and carries 16GB of Samsung's LPDDR5X-7500 memory. The actual configuration of the CPU is unknown, but its LPDDR5X-7500 memory can provide a peak bandwidth of 120 GB/s.
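For reference, that 120 GB/s figure follows directly from the data rate if you assume a 128-bit bus, which Intel has not confirmed; a quick sketch of the math, including the LPDDR5X-8533 grade that JEDEC also specifies:

```c
#include <stdio.h>

/* Peak bandwidth (GB/s) = data rate (MT/s) x bus width (bits) / 8 / 1000.
 * The 128-bit width is an assumption inferred from the quoted 120 GB/s;
 * Intel has not confirmed the interface configuration. */
static double peak_gbps(double mt_per_s, int bus_bits)
{
    return mt_per_s * bus_bits / 8.0 / 1000.0;
}

int main(void)
{
    printf("LPDDR5X-7500 x 128-bit: %.1f GB/s\n", peak_gbps(7500.0, 128)); /* 120.0 */
    printf("LPDDR5X-8533 x 128-bit: %.1f GB/s\n", peak_gbps(8533.0, 128)); /* 136.5 */
    return 0;
}
```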
I like the on-package memory. Even if it only helps performance and efficiency a bit, that is better than nothing.
But where is this interposer with the cache that was rumored? I had my hopes up.
Also, Apple wasn't first with on-package DRAM; that was Crystalwell, an Intel product, back in 2014.
> Wut? Exactly what advantages do you suppose putting cache on the interposer will provide? How would you even use that? The interposer is a fancy smart breadboard and nothing more. It's just wires. There's no need to shove transistors in there.

The advantage of 3D stacking is die space. The die doesn't have room for the rumored 128MB of L4 cache. The rumor was 3D-stacked cache done by placing the L4 in an active interposer, which would also allow better cooling: the compute tile would be in direct contact with the IHS, rather than sitting beneath the cache as X3D does.
> I like the on-package memory. Even if it only helps performance and efficiency a bit, that is better than nothing.
> But where is this interposer with the cache that was rumored? I had my hopes up.
> Also, Apple wasn't first with on-package DRAM; that was Crystalwell, an Intel product, back in 2014.

I guess you refer to the 64/128MB of eDRAM for GPU use, and that is what I immediately thought of, too. I have several systems with eDRAM and the bigger 48EU iGPUs, but they yield very poor performance increases and feel more like a 24->32EU upgrade rather than the doubling they should have delivered. And then Xe with 96EUs does offer a linear 4x upgrade over the previous 24EU iGPUs without any of the eDRAM overhead!
> I would hope with on-package memory you would be able to do better than 7500 MT/s / 120 GBps bandwidth, but I guess not.

Latency and power usage will be better. But at a certain point, the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed. You'd have to move to some kind of exotic memory to get both low latency and high bandwidth. And HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth, because of the increased latency on small accesses common to most consumer workloads.
> Also, Apple wasn't first with on-package DRAM; that was Crystalwell, an Intel product, back in 2014.

No - very different things. Intel used a tiny (~128 MB) chunk of on-die eDRAM for a graphics cache. It was essentially like an L4 cache, IIUC.
> The big question I have is: does the iGPU have access to this memory in the same way Apple Silicon does? If it does, there could be a nice boost for the iGPU in these processors. If not, it may not be as dramatic a bump as Apple Silicon gets from its Unified Memory structure.

Yes, of course. This is simply the main memory of the CPU, migrated on-package instead of sitting in DIMM slots.
> The advantage of 3D stacking is die space. The die doesn't have room for the rumored 128MB of L4 cache. The rumor was 3D-stacked cache done by placing the L4 in an active interposer, which would also allow better cooling: the compute tile would be in direct contact with the IHS, rather than sitting beneath the cache as X3D does.

I think the reason not to put cache in the interposer is that it's typically made on a much coarser process node. Being physically larger than any of the compute tiles, if you now stuffed a bunch of SRAM in there, it would be much more expensive than a typical interposer. I'm pretty sure it would be more cost-effective to put L4 in its own tile than in the interposer.
> I would hope with on-package memory you would be able to do better than 7500 MT/s / 120 GBps bandwidth, but I guess not.

That will come, in time. JEDEC already specifies at least LPDDR5X-8533.
> Latency and power usage will be better.

No, not latency. This is a common misconception propagated by people who can't do math.
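For anyone who wants to check that math: the only latency that on-package placement removes is trace propagation delay, which is tiny next to the DRAM array's own access time. A rough sketch, with all figures being ballpark assumptions:

```c
#include <stdio.h>

/* Signal propagation on FR4 PCB is roughly 15 cm/ns (about half the
 * speed of light). Shortening the memory trace by a few centimeters
 * saves a fraction of a nanosecond against a total DRAM load latency
 * on the order of 90 ns. All figures here are ballpark assumptions. */
int main(void)
{
    const double cm_per_ns = 15.0;  /* propagation speed on PCB        */
    const double saved_cm  = 4.0;   /* one-way trace length reduction  */
    const double dram_ns   = 90.0;  /* typical end-to-end load latency */

    double saved_ns = 2.0 * saved_cm / cm_per_ns;  /* round trip */
    printf("saved: %.2f ns, i.e. %.1f%% of a %.0f ns DRAM access\n",
           saved_ns, 100.0 * saved_ns / dram_ns, dram_ns);
    return 0;
}
```

That works out to roughly half a nanosecond saved, well under 1% of the access time.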
> But at a certain point, the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed.

It's interface-limited, not DRAM-limited. HBM clearly shows that. On-package memory enables much wider, HBM-like interfaces. Thus, it's somewhat a foregone conclusion that on-package memory is how bandwidth scaling will continue. Certainly, if we're at all concerned about power efficiency.
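To illustrate with nominal JEDEC-class numbers (a single HBM3 stack versus the 128-bit LPDDR5X setup assumed above):

```c
#include <stdio.h>

/* Peak bandwidth = per-pin data rate x interface width. The point:
 * a wide-but-slower on-package interface beats a narrow-but-faster
 * one. Nominal spec-sheet figures, not measured values. */
int main(void)
{
    printf("LPDDR5X-7500, 128-bit: %6.1f GB/s\n", 7.5 * 128 / 8);  /* 120.0 */
    printf("HBM3 stack,  1024-bit: %6.1f GB/s\n", 6.4 * 1024 / 8); /* 819.2 */
    return 0;
}
```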
> And HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth, because of the increased latency on small accesses common to most consumer workloads.

For memory-light workloads, you should get good cache hit rates. For memory-heavy workloads, your worst-case bandwidth is much more performance-determinative, and that's where increasing bandwidth will have a much greater impact than reducing intrinsic latency.
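The usual way to quantify that tradeoff is average memory access time, AMAT = hit time + miss rate x miss penalty. A quick sketch with assumed latencies shows how high-miss-rate (memory-heavy) workloads become dominated by DRAM behavior:

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty. The 4 ns cache hit and
 * 90 ns DRAM miss penalty are assumed, illustrative values. */
static double amat_ns(double hit_ns, double miss_rate, double miss_ns)
{
    return hit_ns + miss_rate * miss_ns;
}

int main(void)
{
    const double hit_ns = 4.0, miss_ns = 90.0;
    const double rates[] = { 0.01, 0.05, 0.10, 0.25, 0.50 };

    for (int i = 0; i < 5; i++)
        printf("miss rate %4.1f%% -> AMAT %5.1f ns\n",
               rates[i] * 100.0, amat_ns(hit_ns, rates[i], miss_ns));
    return 0;
}
```

At a 1% miss rate the cache hides almost everything (~4.9 ns); at 50% the DRAM's behavior is nearly all that matters (~49 ns).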
> No - very different things. Intel used a tiny (~128 MB) chunk of on-die eDRAM for a graphics cache. It was essentially like an L4 cache, IIUC.

I was responding to where the article stated that Intel was copying Apple by having on-package DRAM. eDRAM is on-package DRAM: a separate but distinct DRAM chip on the same package as the APU. Yes, it is less memory, but it was 9 years ago; they couldn't pack the transistors as densely back then.
What Apple and this latest example are doing is putting the main-memory DRAM on the package (not on the die), next to the CPU. It's not a cache, either; it's the actual main memory of the CPU.
BTW, Intel wasn't exactly the first to do the eDRAM thing, either. Back in 2013, the XBox One launched with a 32 MB chunk of eSRAM that it used for graphics. Before that, the XBox 360 also featured an eDRAM framebuffer, in its dGPU. It wasn't until 2017 that Microsoft finally threw in the towel on this approach and went with a simple, wide GDDR memory setup.
> I was responding to where the article stated that Intel was copying Apple by having on-package DRAM. eDRAM is on-package DRAM: a separate but distinct DRAM chip on the same package as the APU. Yes, it is less memory, but it was 9 years ago; they couldn't pack the transistors as densely back then.

Okay, I missed that it was a separate chip, but that doesn't change the fact that it was merely an L4 cache and not main memory. Just because the DRAM density might not have existed to enable full, on-package main memory doesn't give Intel credit for doing it back then. They did not.
> Also, the eSRAM wasn't a separate DRAM chip on the same package as the APU. And the X360 GPU isn't an APU.

I know that, but I was pointing out how Microsoft had been doing things similar to Crystalwell for years.
> Okay, I missed that it was a separate chip, but that doesn't change the fact that it was merely an L4 cache and not main memory. Just because the DRAM density might not have existed to enable full, on-package main memory doesn't give Intel credit for doing it back then. They did not.

Not the first time I've been misunderstood because I made a few statements that were related in some way, but not by the main theme I was commenting on. And it won't be the last. eDRAM is not system memory; it's just in the category of DRAM. So it's not equivalent to Apple's approach or this MTL example in terms of use, just in terms of packaging.
I think what enabled Lakefield and the Apple M-series to pack enough DRAM density is chip-stacked DRAM, though I don't know when that technique really got started. It could be they also needed the power-saving techniques of LPDDR4X in order to reach that density.
In fact, I had a suspicion that Crystalwell started as a bid by Intel to win the XBox One design. Intel was rumored to be interested in the console market (remember, they had their fingers in a lot more markets, back then). Given Microsoft's proclivity to fast, on-die graphics memory, this is precisely the kind of bid that would've attracted their interest.
> The PVC GPU has 288MB of SRAM total on the two base tiles, so it certainly is possible.

Is it cache or directly addressable? Is it connected through an interposer, or directly by TSVs to the compute dies?
> The Kaby Lake G consumer chips had 4GB of HBM for the embedded GPU.

Yes, although that was private memory. It was practically a mini-Vega (although we later learned it was derived from Polaris) that happened to share a package with the CPU. The main benefit it got from that arrangement was coordinated power management.
> Would it be a surprise if this LPDDR5X is solely for the tGPU?

Yes, that would be very surprising!
> It wouldn't be surprising to me, since the presentations show the GPU not connected to the CPU L3 on the ring bus.

There's the whole L4 cache rumor, though. If the SoC has an L4 cache, then it wouldn't be at all surprising for the tGPU not to share the CPU's L3.
> Latency and power usage will be better. But at a certain point, the DRAM chips themselves can only go so fast, and their bandwidth is rather fixed. You'd have to move to some kind of exotic memory to get both low latency and high bandwidth. And HBM as primary memory for consumer products is likely going to be a performance regression even with the increased bandwidth, because of the increased latency on small accesses common to most consumer workloads.

HBM as a consumer option is further away than ever, as long as there is no way to dramatically reduce the packaging cost, which puts HBM at 10:1 or worse over DRAM.
> Access size for normal DRAM is nearly always at the granularity of cache lines: 64 bytes, or 512 bits, mostly. And for GPU workloads it's likely to be wider, so I can't see HBM at 1024 or even 2048 bits becoming a performance regression any time soon, just because it's "too wide" as you seem to imply.

First, desktop GPUs are now mostly using 32-way SIMD, while I think server GPUs use 64-way (AMD, at least; Nvidia might be using 128-way). Multiplying by 32 bits per lane, that translates to 1024 or 2048 bits (or 4096). Thus, a SIMD load of interleaved data would mostly work in such transaction sizes. However, graphics involves a fair amount of scatter/gather-type accesses, in which case having to use such large chunks would be rather sub-optimal.
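Putting numbers on those transaction sizes (the 128-lane entry is speculation, per the above):

```c
#include <stdio.h>

/* A full-width SIMD load of 32-bit lanes moves (lanes x 32) bits;
 * compare against a 64-byte (512-bit) CPU cache line. The 128-lane
 * case is hypothetical, per the speculation above. */
int main(void)
{
    const int lanes[] = { 32, 64, 128 };
    for (int i = 0; i < 3; i++) {
        int bits  = lanes[i] * 32;
        int bytes = bits / 8;
        printf("%3d lanes x 32b = %4d bits = %3d bytes = %d cache lines\n",
               lanes[i], bits, bytes, bytes / 64);
    }
    return 0;
}
```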
> And then you'd probably have to resort to very special workloads to obtain noticeable differences: megabytes of on-chip caches were invested to make that as hard as possible.

Any cache can be thrashed. If you know how set-associative caches work and what the associativity is, a first-year student could write a simple program that causes cache thrashing.
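For the curious, here's roughly what that first-year exercise looks like: a sketch assuming an 8-way, 64-set cache with 64-byte lines and an LRU-like policy (real CPUs vary, and pseudo-LRU will blur the effect):

```c
#include <stdio.h>
#include <stdlib.h>

/* Classic set-associative thrashing: addresses that are multiples of
 * (sets x line size) apart all map to the same cache set. Touching
 * ways+1 such addresses round-robin evicts on every access under an
 * LRU-like policy. Geometry below is an assumed 32 KB, 8-way L1. */
#define WAYS   8
#define NSETS  64
#define LINE   64
#define STRIDE (NSETS * LINE)   /* 4 KB: same-set stride               */
#define N      (WAYS + 1)       /* one more line than the set can hold */

int main(void)
{
    char *base = calloc(N, STRIDE);
    if (!base)
        return 1;
    volatile char *buf = base;  /* volatile so loads aren't optimized out */

    long sum = 0;
    for (long i = 0; i < 10000000L; i++)
        sum += buf[(i % N) * STRIDE];  /* cycles through one set, always missing */

    printf("%ld\n", sum);
    free(base);
    return 0;
}
```

Time it once with N = WAYS (the working set fits in one set) and once with N = WAYS + 1; on a cache matching the assumed geometry, the second run should be dramatically slower.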
> You give me GDDR or HBM at DRAM prices, and I'd say I'm ready to 'suffer' the consequences in terms of performance!

Then, I suppose you probably didn't see this: