News Intel's Patent Details Meteor Lake's 'Adamantine' L4 Cache

The new cache seems to be targeted primarily at the Arc Xe-LPG GPU featured on the graphics tile, one of the many tiles that make up Meteor Lake CPUs.

It will be utilized by the Meteor Lake-M and Meteor Lake-P chips, which target the mobility platform.
The Intel Arc Xe-LPG GPU will be the main beneficiary of the Adamantine L4 cache, IMO.

Actually, in some of the recent patches, as discovered by Phoronix (can't find the proper link now), it was revealed that, unlike previous designs, the Meteor Lake GPU cannot utilize the on-chip LLC that was previously shared by both the CPU and GPU.

As such, the Adamantine L4 cache will play a huge role in boosting the performance of Meteor Lake chips in graphics workloads.

But of course, the Adamantine L4 cache can also be used by the Compute Tile (CPU cores), which is made up of Redwood Cove (P-core) and Crestmont (E-core) cores in a hybrid configuration, leading to faster boot times and overall lower latencies compared to fetching data from the primary DRAM.
 
  • Like
Reactions: KyaraM

Geef

Distinguished
A lot of people here might not have been around back in the day when boot times were measured in minutes, not seconds, but booting is already massively fast now with just about any SSD. There's not really a need for faster booting.

As long as that isn't all it helps with, an L4 cache can be good.
 

InvalidError

Titan
Moderator
A lot of people here might not have been around back in the day when boot times were measured in minutes, not seconds, but booting is already massively fast now with just about any SSD. There's not really a need for faster booting.
Depends on the application. With the IoT thing still in close to full swing, there will be a growing number of toasters, ceiling fans, lights and other stuff running full-blown Linux, BSD and other OS derivatives for which you may not want to wait for a 5s boot time every time you turn them on before you can set them to do whatever it is you want them to do.

I never cared about my PC's boot time since I usually only reboot my PC once every 2-3 months.
 
  • Like
Reactions: Geef

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,280
812
20,060
The days of SRAM-based L4 cache are here.

With Intel's implementation on the way, you can bet that AMD will have their own.

It's going to be a great day when everybody starts using L4$ SRAM to benefit their CPUs =D
 

InvalidError

Titan
Moderator
The days of SRAM-based L4 cache are here.

With Intel's implementation on the way, you can bet that AMD will have their own.

It's going to be a great day when everybody starts using L4$ SRAM to benefit their CPUs =D
Almost every mention of L4 in the patent is specific to boot-time initialization, the management engine, secure firmware, secure engines and related stuff, while a couple more claims focus on IGP/UMA graphics. It doesn't look like it is intended for CPU usage while running software, at least not in its first iteration. They even mention a flag to disable or "lock down" the "L4" before the BIOS passes control over to the OS, in claim #93.
 
  • Like
Reactions: rluker5

rluker5

Distinguished
Jun 23, 2014
692
416
19,260
I hate to cast doubt on something that looks good, but there does seem to be a possible use for some hybrid, aggressive sleep function to save power at idle and low use, maybe even shutting down extensive parts of the system if they can be woken quickly enough. That, and the security stuff to make a cold boot feel like waking from sleep. It could lead to some nice, yet boring, benefits in the power-savings area.

On low-power mobile, the benefit the iGPU would get is just not worth all of this silicon (for example, Xe > Broadwell's iGPU, and DDR5 has enough bandwidth for mobile stuff; you can see that with AMD's bigger iGPUs). Also, without a dGPU the extra cache doesn't help that much.

For a low-powered mobile part, the extra cache for extra performance in things the chip won't be doing doesn't make much sense. Power savings does.

Some Arrow Lake or Lunar Lake desktop chip with faster P-cores and a 256MB L4 does sound pretty appealing, though. And if they have it in MTL, even if it isn't used the way I want, the manufacturing experience from making and using it will help reach the goal of using a large L4 properly.
 

InvalidError

Titan
Moderator
On low-power mobile, the benefit the iGPU would get is just not worth all of this silicon (for example, Xe > Broadwell's iGPU, and DDR5 has enough bandwidth for mobile stuff; you can see that with AMD's bigger iGPUs). Also, without a dGPU the extra cache doesn't help that much.
The Crystalwell eDRAM on Broadwell enabled massively improved integrated graphics performance (3-3.5X the performance for 2.4X the IGP size), and if Intel scales Meteor Lake's IGP to 128 EUs just like the A380, it'll certainly benefit from having access to a large scratchpad to offset having half as much memory bandwidth to share with the CPU.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
OMG, that diagram is weird.
  • The SoC die has 2 additional Crestmont cores, even though it's meant to be made on a larger process node!
  • The L4 cache die is off to the side of the SoC die, not directly connected to either the CPU or GPU tiles. I wonder if they at least put the tag RAM in the SoC tile.

This sentence strikes me as very odd:

"Value is added for high end silicon with higher pre-initialized memory at reset, potentially leading to increased revenue."

Do they literally mean "revenue" in the sense of better sales? Um, how exactly does "higher pre-initialized memory at reset" translate into that, and what does it have to do with the L4 cache?

Is this something to do with the boot phase before the DRAM controller is initialized? So, they want to rely on the L4 cache tile to hold the entire pre-boot firmware image, and that lets them offer better hardware security?
 

bit_user

Polypheme
Ambassador
Actually, in some of the recent patches, as discovered by Phoronix (can't find the proper link now), it was revealed that, unlike previous designs, the Meteor Lake GPU cannot utilize the on-chip LLC that was previously shared by both the CPU and GPU.

I assume they're talking about the LLC of the CPU die.

With Intel's implementation on the way, you can bet that AMD will have their own.

It's going to be a great day when everybody starts using L4$ SRAM to benefit their CPUs =D
I think there's a good chance that AMD will keep the GPU on its own Infinity Cache slice.

I hate to cast doubt on something that looks good, but there does seem to be a possible use for some hybrid, aggressive sleep function to save power at idle and low use, maybe even shutting down extensive parts of the system if they can be woken quickly enough. That, and the security stuff to make a cold boot feel like waking from sleep. It could lead to some nice, yet boring, benefits in the power-savings area.
That's an interesting point - they could power down the CPU tile and just run the Crestmont cores on the SoC tile. Likewise, they could power down the L4 tile - I know some phone SoCs power down parts of their cache hierarchy to save energy.

On low-power mobile, the benefit the iGPU would get is just not worth all of this silicon (for example, Xe > Broadwell's iGPU, and DDR5 has enough bandwidth for mobile stuff; you can see that with AMD's bigger iGPUs). Also, without a dGPU the extra cache doesn't help that much.
Disagree. A big part of Apple's power savings has been through more aggressive use of cache. When the system is running under high load, it's much more efficient to go to cache than DRAM.

Also, I disagree on the bandwidth aspect, as iGPUs are famously bandwidth-limited. That's why AMD went from 11 CUs to 8, in more recent SoCs with Vega. If Intel is scaling up its iGPU past 96 EUs, then they'll need the additional bandwidth.
 
Last edited:
  • Like
Reactions: rluker5

bit_user

Polypheme
Ambassador
With the IoT thing still in close to full swing, there will be a growing number of toasters, ceiling fans, lights and other stuff running full-blown Linux, BSD and other OS derivatives for which you may not want to wait for a 5s boot time every time you turn them on before you can set them to do whatever it is you want them to do.
It's funny to me just how many devices run a Linux kernel. Such overkill, but often the easiest path when you have the power budget and a capable core.

Notably, the Raspberry Pi Pico does not run Linux. Nor does Arduino, of course. Microcontrollers are really past the limit of where Linux will fit.
 

InvalidError

Titan
Moderator
This sentence strikes me as very odd:
"Value is added for high end silicon with higher pre-initialized memory at reset, potentially leading to increased revenue."​
I'm not seeing how more "pre-boot memory" is supposed to translate into increased profit either. I can imagine it making it easier for motherboard manufacturers to throw together fancy UIs without worrying about DRAM controller initialization, and maybe running UEFI apps without requiring external DRAM, which could be significant for things like rear-view cameras and similar embedded applications, and is one of the things mentioned in there.

It's funny to me just how many devices run a Linux kernel. Such overkill, but often the easiest path when you have the power budget and a capable core.
Once your toaster requires support for cameras so you can watch your toast, WiFi and BT networking so you can monitor it remotely from your phone, PC, tablet or whatever else, USB storage to record your toast, and audio and video output for videoconferencing over toaster or just putting your toaster cams on external displays, you may as well get a full OS :)
 
  • Like
Reactions: bit_user

rluker5

Distinguished
Jun 23, 2014
692
416
19,260
The Crystalwell eDRAM on Broadwell enabled massively improved integrated graphics performance (3-3.5X the performance for 2.4X the IGP size), and if Intel scales Meteor Lake's IGP to 128 EUs just like the A380, it'll certainly benefit from having access to a large scratchpad to offset having half as much memory bandwidth to share with the CPU.
I think you might be romanticizing. Broadwell's iGPU is better than Haswell's Crystalwell parts, but it wasn't proportionally (per TFLOP) that much faster, even though it had a significantly improved arch. Here's how it holds up vs other arches: https://www.anandtech.com/show/1665...ew-is-rocket-lake-core-11th-gen-competitive/2
The L4 will help, but it isn't going to change a low-powered iGPU into some powerhouse and dethrone the small-cache 780M. I'm just saying it isn't really worth it for this case. L4 helps more with repetitive CPU tasks on desktop CPUs with dGPUs. And apparently other things not typically tested in CPU reviews.
 

bit_user

Polypheme
Ambassador
The L4 will help, but it isn't going to change a low-powered iGPU into some powerhouse and dethrone the small-cache 780M.
Apple has shown that an iGPU can be very competitive against all but the biggest mobile dGPUs, if supported with enough memory bandwidth. And, in their case, that bandwidth reaches up to 400 GB/s, which is a lot more than you can get with external DRAM.
 

InvalidError

Titan
Moderator
I think you might be romanticizing. Broadwell's iGPU is better than Haswell's Crystalwell parts, but it wasn't proportionally (per TFLOP) that much faster, even though it had a significantly improved arch. Here's how it holds up vs other arches: https://www.anandtech.com/show/1665...ew-is-rocket-lake-core-11th-gen-competitive/2
Benchmarking an almost 10-year-old IGP in modern-day titles it likely has no support for, on a chart that doesn't show any Haswell or Skylake numbers to compare to, isn't particularly useful. Here are some more contemporary benchmarks that show how much of a leap forward Broadwell was from Haswell:

On the productivity side, Broadwell's IGP beats Haswell's by as much as 6X in Cinebench OpenGL and Maya.

On workloads that played well with the eDRAM, Broadwell's IGP was a screecher.
 
  • Like
Reactions: bit_user
Almost every mention of L4 in the patent is specific to boot-time initialization, the management engine, secure firmware, secure engines and related stuff, while a couple more claims focus on IGP/UMA graphics. It doesn't look like it is intended for CPU usage while running software, at least not in its first iteration. They even mention a flag to disable or "lock down" the "L4" before the BIOS passes control over to the OS, in claim #93.
Isn't the point of a patent to cover the NEW stuff a thing can do?!
If an L4 that is used by the cores is already a thing used everywhere, does it need to be in the patent?
I'm not seeing how more "pre-boot memory" is supposed to translate into increased profit either.
More expensive CPU = more money for Intel??? ¯\_(ツ)_/¯
 

abufrejoval

Reputable
Jun 19, 2020
441
301
5,060
The Crystalwell eDRAM on Broadwell enabled massively improved integrated graphics performance (3-3.5X the performance for 2.4X the IGP size), and if Intel scales Meteor Lake's IGP to 128 EUs just like the A380, it'll certainly benefit from having access to a large scratchpad to offset having half as much memory bandwidth to share with the CPU.
Actually, the eDRAM failed to produce significant gains as far as I could tell. I started with a Skylake i5-6267U that had 64MB of eDRAM and an Iris 550 48EU iGPU, not the maximum configuration.

Then I got a Coffee Lake 8th-gen NUC (i7-8559U), which also had a 48EU Iris Plus 655 iGPU, but with 128MB of eDRAM. Next came a 10th-gen NUC with an ordinary 24EU UHD iGPU, and finally I have an 11th-gen NUC with the 96EU Xe iGPU, which has no eDRAM at all.

The Skylake Iris 550 part was mostly interesting because Intel designed this chip specifically for Apple, who wanted a more powerful iGPU for their slim notebooks, and Intel was forced to come up with something. It's a chip where the iGPU portion is bigger than the dual-core CPU section, and it packs the eDRAM on the same package. It must have been pretty expensive to make, but Apple evidently got it offered at a very good price. The official list price was the same as an i7-6600U/i5-6300U, which were much smaller and cheaper to fit on a wafer and didn't cost the extra effort for the eDRAM.

However, Apple didn't stay with that for long, and a lot of these chips then got dumped cheaply, which is how I ended up with one in a low-cost notebook that still feels snappy.

Anyhow, the NUC8 (48EU + 128MB eDRAM), NUC10 (24EU) and NUC11 (96EU, no eDRAM) show rather well that the pre-Xe iGPUs could not put their power on the road, not even with eDRAM, because twice the EUs on the NUC8 only resulted in a meagre 50% gain vs the NUC10 on nearly every graphics benchmark I tried. However, the 96EU Xe iGPU did scale linearly vs. the 24EU UHD iGPU, really putting the Iris Plus/550 48EU iGPU with eDRAM to shame.
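To put that scaling into rough numbers, here is a small, illustrative sketch using only the EU counts and gains described above (the 4.0x entry simply encodes the "scaled linearly" observation; these are ballpark values, not fresh measurements):

```python
# Rough iGPU scaling check using only the figures quoted above:
# speedup relative to the 24EU UHD baseline, divided by the increase in EU count.

baseline_eus = 24  # NUC10's 24EU UHD iGPU as the 1.0x reference

parts = {
    # name: (EU count, approximate speedup vs. the 24EU baseline)
    "Iris Plus 655, 48EU + 128MB eDRAM (NUC8)": (48, 1.5),  # "a meagre 50% gain"
    "Xe, 96EU, no eDRAM (NUC11)":               (96, 4.0),  # "did scale linearly"
}

for name, (eus, speedup) in parts.items():
    eu_ratio = eus / baseline_eus
    efficiency = speedup / eu_ratio
    print(f"{name}: {eu_ratio:.0f}x EUs -> {speedup:.1f}x perf "
          f"({efficiency:.0%} scaling efficiency)")
```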

All these NUCs offer relatively similar DRAM bandwidths: while they range from DDR4-2666 to DDR4-3200, that works out to around 38-40GB/s in practice. The eDRAM was measured at 50GB/s somewhere in Anandtech. So its main advantage might have been latency: this ain't no HBM!

How does the Xe achieve the incredible speed gains without additional DRAM bandwidth?

I'd love to know the details, but it must be a combination of excellent use of larger scratchpad cache areas, which means that most of the GPU operations aren't actually done against the slow DRAM until the final framebuffer render, and general GPU advances. Because just adding EUs obviously didn't work, even with eDRAM. A 50% improvement for 100% extra effort is a disaster, and that isn't even the biggest variant Intel made.

There was, at least in theory, a 96EU Broadwell Iris 580 iGPU that I never saw in action, and that might have just delivered a 75% improvement over the 24EU UHD, while costing the equivalent of 4 CPU cores in die area.

For me this is exactly the advantage that Zen delivered: do away with the iGPU, double the CPU cores and sell at a quad-core price!
 
  • Like
Reactions: -Fran-

bit_user

Polypheme
Ambassador
However, the 96EU Xe iGPU did scale linearly vs. the 24EU UHD iGPU, really putting the Iris Plus/550 48EU iGPU with eDRAM to shame.
No, it didn't. Those Xe EUs have a lot of other improvements. So, ideally, it should've scaled more than linearly with the EU increase from a Gen 9.x iGPU.

The eDRAM was measured at 50GB/s somewhere in Anandtech. So its main advantage might have been latency: this ain't no HBM!
It should've been additive with external DRAM bandwidth, in which case your best-case would've been 90 GB/s.

How does the Xe achieve the incredible speed gains without additional DRAM bandwidth?

Intel made lots of changes between Gen9 and Gen12, all the way from the ISA and microarchitecture up to the macroarchitecture. I remember reading about Gen11 that they removed some scalability bottlenecks.
 
Until boot time is reduced to sub-1s, where it should be, there is a lot of space for improvement.
Well, ideally boot time would be one cycle after power on.

With Xeon Max you can get 64GB of HBM that could be made to work as a drive... I wonder how long it would take to boot Windows from that HBM2e stack.
 

abufrejoval

Reputable
Jun 19, 2020
441
301
5,060
No, it didn't. Those Xe EUs have a lot of other improvements. So, ideally, it should've scaled more than linearly with the EU increase from a Gen 9.x iGPU.

https://www.anandtech.com/show/15993/hot-chips-2020-live-blog-intels-xe-gpu-architecture-530pm-pt

Intel made lots of changes between Gen9 and Gen12, all the way from the ISA and microarchitecture up to the macroarchitecture. I remember reading about Gen11 that they removed some scalability bottlenecks.
I am reporting on what I measured using the various 3DMark benchmarks, PerformanceTest 10 and Unigine on hardware I still own (it runs a production Linux cluster though, so I can't easily do live tests now).

I also compared the Intel iGPUs against a Kaveri A10-7850K (near identical to the i5-6267U) and a Ryzen 5800U (around 25% slower than the Xe), and I have a huge library of 3DMark results from all my GTX/RTX cards.

The ISA changes are a separate thing, but they rarely affect iGPU benchmark results, which are intentionally low on CPU dependencies. The NUC8 and NUC10 were near identical in scalar CPU performance (the i7-10710U has a slight clock lead over the i7-8559U), but it was 4 vs 6 cores on multi-core benchmarks, at least until they ran out of thermal headroom.

And the Tiger Lake NUC11's i7-1165G7 was around 25% faster on scalar, and it caught up to the 6-core 10710U on multi-core using only 4 cores.

The Kaveri A10-7850K, with its full complement of 512 GPU cores, could hardly improve over its 384-core brethren even with the optimal DDR3-2400 of the time, because it was clearly bandwidth-limited.

And so it remained for all iGPUs over the next generations: with only two channels of DRAM and less than 40GB/s, there was only so much iGPU cores could do... until the Xe and Ryzen APUs. I don't have any data points for AMD between the 100 Watt Kaveri and the 15 Watt Ryzen 5800U, but the generational gain is of a similar order of magnitude.

That leap must have been based on intelligent use of large caches (which the Kaveri famously didn't have), because the bandwidth didn't improve by nearly as much and nobody would have left 2x performance improvement on the table just by being lazy about the GPU design.

But to my understanding any improvement since is largely due to being able to use additional bandwidth via LPDDR4, DDR5 or LPDDR5, which seem able to reach 70GB/s.
It should've been additive with external DRAM bandwidth, in which case your best-case would've been 90 GB/s.
I am pretty sure that Intel didn't double the memory channels for the eDRAM on those (mostly) mobile chips, because it would require a vastly increased pad area and 128 extra pins just for the data bus: my Haswell Xeons do 70GB/s with DDR4-2133 on four channels, so additive bandwidth is out of the question. I just had another look: the i5-6267U with DDR3-1600 manages 25.6GB/s, so the 50GB/s of alternate bandwidth on the eDRAM would have doubled that, albeit with diminishing returns, with perhaps 64MB of eDRAM being too small. Neither the i5-6267U nor the i7-8559U comes close to the potential that a double-sized iGPU seems to promise.
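For reference, those bandwidth figures line up with a simple back-of-the-envelope calculation assuming the standard 64-bit (8-byte) bus per DDR channel; this is just an illustrative sketch, and the measured numbers quoted in this thread land close to these theoretical peaks:

```python
# Peak theoretical DRAM bandwidth: transfer rate (MT/s) x 8 bytes per 64-bit channel x channels.

def peak_gbs(mt_per_s: int, channels: int = 2) -> float:
    """Peak bandwidth in GB/s for a given transfer rate and channel count."""
    return mt_per_s * 8 * channels / 1000

configs = {
    "DDR3-1600, 2 channels (i5-6267U)":     peak_gbs(1600),
    "DDR4-2666, 2 channels":                peak_gbs(2666),
    "DDR4-3200, 2 channels":                peak_gbs(3200),
    "DDR4-2133, 4 channels (Haswell Xeon)": peak_gbs(2133, channels=4),
}

for name, bw in configs.items():
    print(f"{name}: {bw:.1f} GB/s peak")

# The eDRAM was measured at roughly 50 GB/s; if it had been additive with
# dual-channel DDR3-1600 rather than an alternate path, the ceiling would be:
print(f"DDR3-1600 2ch + eDRAM (if additive): {peak_gbs(1600) + 50:.1f} GB/s")
```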

Xe and RDNA2/3 iGPUs pulled off a one-time leap, but it doesn't seem repeatable, because otherwise there simply wouldn't be a market for dGPUs with their bandwidth-optimized GDDR/HBM memory. Alder Lake and Raptor Lake iGPUs also haven't really moved the bar, which Intel surely would have done if it were possible without pulling crazy stunts.

That is pretty much what Apple did on their M1/M2, which doubled and quadrupled the memory channels (and bandwidth), but offset the system cost at least to a certain degree by effectively turning all of the RAM into on-package "eDRAM".
 

abufrejoval

Reputable
Jun 19, 2020
441
301
5,060
Until boot time is reduced to sub-1s, where it should be, there is a lot of space for improvement.
I can't say that I care about boot time much.

Actually we used to be proud that our machines had hundreds of days of uptime... until PCI-DSS auditors hit us over the head for not patching. But with live patching we're actually regaining some of that old swagger.

Even on a laptop, booting is something I annoyingly have to do far too often on Windows, but at once a month it's nothing I'd spend extra money on improving beyond its current state (which is mostly BIOS self-test and initialization, not OS boot time).
 

abufrejoval

Reputable
Jun 19, 2020
441
301
5,060
It's funny to me just how many devices run a Linux kernel. Such overkill, but often the easiest path when you have the power budget and a capable core.

Notably, the Raspberry Pi Pico does not run Linux. Nor does Arduino, of course. Microcontrollers are really past the limit of where Linux will fit.
I ran Linux (also Unix System V Rel. 3 and FreeBSD) on an 80486 with 16MB of RAM. I ran Linux in a VM on NT 3.51 with 32MB of RAM. My Microport System V Rel. 2 ran on an 80286 with 1.5MB of RAM, while QNX made do on a 128KB PC-XT.

Even a modern Linux can be stripped down to the point where it remains comfortable with these RAM sizes.

Although it doesn't necessarily make sense to run a VAX-like minicomputer OS on a microcontroller appliance.
 

rluker5

Distinguished
Jun 23, 2014
692
416
19,260
Benchmarking an almost 10-year-old IGP in modern-day titles it likely has no support for, on a chart that doesn't show any Haswell or Skylake numbers to compare to, isn't particularly useful. Here are some more contemporary benchmarks that show how much of a leap forward Broadwell was from Haswell:

On the productivity side, Broadwell's IGP beats Haswell's by as much as 6X in Cinebench OpenGL and Maya.

On workloads that played well with the eDRAM, Broadwell's IGP was a screecher.
I have a better comparison of with and without eDRAM. Same CPU and GPU arch; the Iris Pro 5200 has twice the shaders of the HD 4600:
[attached: benchmark charts comparing the Iris Pro 5200 and HD 4600]