News: Intel's Patent Details Meteor Lake's 'Adamantine' L4 Cache


bit_user

Polypheme
Ambassador
I am reporting on what I measured using the various 3DMark benchmarks, PerformanceTest 10 and Unigine on hardware I still own (it runs a productive Linux cluster though, so I can't easily do live tests now).
But your comparison involved apples and oranges. If I understand it correctly, you compared a Gen 9.x iGPU with 24 EUs against a Gen 12 (Xe) iGPU with 96 EUs and concluded that, because the latter was 4x as fast, it scaled linearly. Except the difference isn't only in the number of EUs.

The ISA changes are a separate thing, but rarely affect iGPU benchmark results, which are intentionally low on CPU dependencies.
Who said anything about CPU? I was talking about the removal of register scoreboarding + other changes made to the shader ISA.

That leap must have been based on intelligent use of large caches (which the Kaveri famously didn't have),
Sounds plausible, but you should probably read up on the changes made to Gen11, before going too far down a speculative rabbit hole.

Both the i5-6267 and the i7-8559U don't come close to the potential that a double-sized iGPU seems to promise.
How do you know they're not power-throttling? Doubling the shader count doesn't mean they'll continue to run at the same clockspeed as the baseline, especially under high load.

Hint: look for a tool called intel_gpu_top
 

bit_user

Polypheme
Ambassador
Well, ideally boot time would be one cycle after power on.
If you're talking about booting all the way to a login prompt, that's unrealistic.

  1. Before anything else, the CPU has to bootstrap, which means initializing its internal devices, loading microcode, etc.
  2. Next, any power-on self-test (POST) runs.
  3. Initialize external devices.
  4. Once the storage controllers are online, you can load the OS kernel.
  5. Kernel initializes.
  6. Kernel loads device drivers and they initialize devices.
  7. Network starts.
  8. Kernel mounts filesystems.
  9. Kernel starts services.

That is never going to happen "one cycle after power on". The best you can do is restore from hibernation, which cuts out most of the latter stages and replaces them with loading a memory image.
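To put a rough number on why "one cycle" is off the table, here's a back-of-envelope sketch in Python; the kernel size, NVMe throughput and clock speed are illustrative assumptions, not measurements:

# Lower bound on stage 4 alone: copying the kernel image from storage.
# All figures are assumptions for illustration.
kernel_bytes = 12 * 1024**2      # ~12 MB compressed kernel image (assumed)
nvme_bytes_per_s = 3.5e9         # ~3.5 GB/s sequential read (assumed)
cpu_hz = 4.0e9                   # ~4 GHz core clock (assumed)

load_seconds = kernel_bytes / nvme_bytes_per_s
load_cycles = load_seconds * cpu_hz
print(f"Kernel load alone: ~{load_seconds * 1e3:.1f} ms, ~{load_cycles:,.0f} cycles")
# => roughly 3-4 ms, i.e. on the order of ten million cycles, and that
# ignores firmware init, POST, driver probing and service startup.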

Speaking of non-general-purpose operating systems, there are embedded devices which can boot in less than a millisecond.
 

JamesJones44

Reputable
Jan 22, 2021
662
593
5,760
Once your toaster requires support for cameras so you can watch your toasts, WiFi and BT networking so you can remotely monitor it from your phone, PC, tablet or whatever else, USB storage to record your toasts, and audio and video output for videoconferencing over toaster or just putting your toaster cams on external displays, you may as well get a full OS :)

It's so sad that this is probably going to be reality. The minute I saw the camera enabled pet treat dispenser I knew there was no hope for simple appliances to remain sensible.
 
  • Like
Reactions: bit_user

bit_user

Polypheme
Ambassador
I ran Linux (also Unix System V.Rel3 and FreeBSD) on an 80486 with 16MB of RAM. I ran Linux in a VM on NT 3.51 with 32MB of RAM. My Microport System V Rel 2 ran on an 80286 with 1.5MB of RAM while QNX made do on a 128KB PC-XT.

Even a modern Linux can be stripped down to the point where it remains comfortable with these RAM sizes.
I recall someone recently ported the MIPS build to run on a Nintendo 64. That console ships with 4 MB, but I seem to recall reading their kernel took closer to 8 MB, which probably means they were using an emulator.

Linux's bloat has definitely raised the minimum bar for how much RAM you need to run it. In light of the fact that servers routinely host TBs of RAM, I don't mind if Linux now requires several MB. It has to move with the times, and it really doesn't need to fit well on something like a microcontroller.
 
  • Like
Reactions: JamesJones44

InvalidError

Titan
Moderator
If you're talking about booting all the way to a login prompt, that's unrealistic.
One of the claims regards setting up multi-threading early. Since the CPU would no longer need to do memory training before going multi-threaded, the CPU could start spitting out threads as soon as core0's BIST and initialization are completed to let other cores do their own BIST before grabbing tasks from the thread pool. Most POST tasks can be parallelized from there, which could almost eliminate BIOS time. Then again, many motherboards already have nearly nonexistent (faster than the monitor can sync) BIOS time when you enable the FastBoot option (skip most POST) and the BIOS isn't getting hung on re-training stuff between boots.
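As a toy illustration of that idea only (a Python sketch; real firmware obviously isn't written like this, and the task list and timings are made up):

# Toy model: once core0 finishes its own init, remaining POST tasks get
# thrown at a worker pool standing in for the other cores.
from concurrent.futures import ThreadPoolExecutor
import time

def post_task(name, seconds):
    time.sleep(seconds)   # stand-in for a hardware init / self-test step
    return name

tasks = [("memory BIST", 0.05), ("PCIe enumeration", 0.03),
         ("USB controller init", 0.02), ("SATA/NVMe init", 0.02)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # the "other cores"
    list(pool.map(lambda t: post_task(*t), tasks))
elapsed = time.perf_counter() - start
print(f"parallel: {elapsed:.3f}s vs serialized: {sum(s for _, s in tasks):.3f}s")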

Getting into a full-blown OS will still take a little while though. Can't imagine it getting much better than what Windows does with saving a boot image and loading that as fast as the SSD can copy it to RAM, to bypass boot-time device detection when the hardware hasn't changed between boots.
 

JamesJones44

Reputable
Jan 22, 2021
662
593
5,760
I recall someone recently ported the MIPS build to run on a Nintendo 64. That console ships with 4 MB, but I seem to recall reading their kernel took closer to 8 MB, which probably means they were using an emulator.

Linux's bloat has definitely raised the minimum bar for how much RAM you need to run it. In light of the fact that servers routinely host TBs of RAM, I don't mind if Linux now requires several MB. It has to move with the times, and it really doesn't need to fit well on something like a microcontroller.

There are a couple, like SliTaz, that are still in the single digits RAM-wise (4 or 8 MB, I can't remember which), but most aren't well maintained.

Even Alpine, which is considered a slim Linux, requires 128 MB of RAM, which, to be fair, is small by today's standards but not small enough for many microcontrollers.
 
If you're talking about booting all the way to a login prompt, that's unrealistic.

  1. Before anything else, the CPU has to bootstrap, which means initializing its internal devices, loading microcode, etc.
  2. Next, any power-on self-test (POST) runs.
  3. Initialize external devices.
  4. Once the storage controllers are online, you can load the OS kernel.
  5. Kernel initializes.
  6. Kernel loads device drivers and they initialize devices.
  7. Network starts.
  8. Kernel mounts filesystems.
  9. Kernel starts services.

That is never going to happen "one cycle after power on". The best you can do is restore from hibernation, which cuts out most of the latter stages and replaces them with loading a memory image.

Speaking of non-general-purpose operating systems, there are embedded devices which can boot in less than a millisecond.
It seems you are unfamiliar with the concept of ideals.
It might be unrealistic now, but that doesn't mean that it's not the target.
I recall someone recently ported the MIPS build to run on a Nintendo 64. That console ships with 4 MB, but I seem to recall reading their kernel took closer to 8 MB, which probably means they were using an emulator.

Linux's bloat has definitely raised the minimum bar for how much RAM you need to run it. In light of the fact that servers routinely host TBs of RAM, I don't mind if Linux now requires several MB. It has to move with the times, and it really doesn't need to fit well on something like a microcontroller.
The N64 needed a 4 MB expansion even for many of its games.
So it has a total of 8 MB if it has such an expansion.
 

bit_user

Polypheme
Ambassador
It seems you are unfamiliar with the concept of ideals.
It might be unrealistic now, but that doesn't mean that it's not the target.
It was a pointless statement, since it's so far removed from reality. Rather than attack you for it personally, I thought I'd take the opportunity to show why it's unrealistic and turn it into a learning opportunity. Next time, I guess we can do it your way.
 
It was a pointless statement, since it's so far removed from reality. Rather than attack you for it personally, I thought I'd take the opportunity to show why it's unrealistic and turn it into a learning opportunity. Next time, I guess we can do it your way.
It wasn't even a statement (a report of facts or opinions); it was a declaration of an ideal.
But I guess for you even this is a pointless statement, since it's so far removed from reality:
"The Ideals are equality, right to life, liberty, and the pursuit of happiness, consent of the Governed and the right to alter or abolish the government."
 
Which was utterly pointless, because every single person here knows and agrees that we want boot times to be fast.

Not sure why you're now trying to drag this onto some political tangent. That can only go poorly.
Just because politicians came up with it doesn't make everything that talks about it political.
...
...
...
According to you this is probably a political drama, yes?!
 

InvalidError

Titan
Moderator
Which was utterly pointless, because every single person here knows and agrees that we want boot times to be fast.
Back in the HDD days, I simply got coffee or breakfast while my PC booted for the once-every-few-months times I needed to reboot it.

Today, even rebooting to apply major Windows updates like 22H2 takes less than a minute from the moment I click "Update and restart" with normal boot taking under 10s. Not much time to walk away anymore.

My main reason for not rebooting often is having to re-open the dozens of things I normally have open all of the time, and no amount of boot time reduction is going to solve that.
 
  • Like
Reactions: bit_user

abufrejoval

Reputable
Jun 19, 2020
333
231
5,060
But your comparison involved apples and oranges. If I understand it correctly, you compared a Gen 9.x iGPU with 24 EUs against a Gen 12 (Xe) iGPU with 96 EUs and concluded that, because the latter was 4x as fast, it scaled linearly. Except the difference isn't only in the number of EUs.


Who said anything about CPU? I was talking about the removal of register scoreboarding + other changes made to the shader ISA.


Sounds plausible, but you should probably read up on the changes made to Gen11, before going too far down a speculative rabbit hole.


How do you know they're not power-throttling? Doubling the shader count doesn't mean they'll continue to run at the same clockspeed as the baseline, especially under high load.

Hint: look for a tool called intel_gpu_top
To make things clearer (less anecdotal stuff):

I compared three NUCs (NUC8, NUC10 and NUC11) I operate in a RHV/oVirt cluster today, before putting them into "production", using the very same Windows 10 image, updated with the latest drivers for each at the time.

They were purchased within six months of each other, each with the top i7 CPU and 64GB of DDR4-3200 memory (Kingston modules, which also support the 2400 and 2933 timings of the NUC8/10).

I got the NUC8 (i7-8559U/48EU) first, in Summer 2019 I think, because it had fallen below €400 even with the Iris Plus 655 iGPU, then got the NUC10 (i7-10710U/24EU) perhaps a month later at pretty much the same price (but with 2 extra cores), and finally, needing a third for a proper HCI cluster, got lucky six months later in early 2020 with a NUC11 (i7-1165G7/96EU), because those remained almost impossible to buy for a long time afterwards.

Both the CPU and the GPU portions of the NUC8 and NUC10 are largely unchanged in terms of architecture and fab process (14nm); they mostly trade silicon die area between the GPU (48 vs 24 EU) and the CPU (4 vs 6 cores) parts of the chip, making them as comparable as it gets.

The NUC11 changes everything: 10nm process, redesigned CPU cores and GPU.

The NUCs have fully adjustable PL1/PL2 and TAU settings and a fan that is certainly capable of cooling 15 Watts, probably a bit more--if you can tolerate the noise... which I did for testing, but not for production. All NUCs (and most notebooks) will peak to a much higher PL2 for TAU seconds or until thermals kick in; HWiNFO and its graphs showed all the details on a remote observation system.

And to test peak power consumption and throttling I use Prime95 and FurMark, each alone and in combination, like many others do.

Of course I used the "maximum performance" settings first for the benchmarks and then tested various ways of dialing the NUCs and their fans down to the point where they still gave me short-term peak performance as well as acceptable noise levels for sustained loads, to ready them for production use under Linux.

All those tests showed that the iGPU portion of those SoCs is the last thing that ever throttles; the CPU cores will always clock down first when PL2 runs beyond TAU or thermals kick in--unless the iGPU has no load.

So the iGPU generally runs privileged, and graphics benchmarks are not much affected by TDP settings until you go really low. AnandTech has tested passive Tiger Lake NUC-alikes, which suffer in graphics as the system heats up, but that's below 10 Watts.

On the high end the NUCs can go to 50 or even 64 Watts for the NUC11, but it's only the CPU cores that will really use that wattage; the iGPU never goes near that, and I believe I've never seen them use more than 10 Watts. FurMark never sees them throttle or clock down even at only 15 Watts PL1/PL2/TAU=0; for that you need to add Prime95 or go below 10 Watts of permissible TDP or cooling capacity (see the AnandTech tests).

But with all that, the NUC8, which should have twice the graphics performance of the NUC10 (48 vs 24 EUs), only got a 50% uplift from 100% extra EUs and 128MB of eDRAM, while the NUC11 got 400% of the NUC10's graphics performance without eDRAM, a pretty much linear scale for 96 vs 24 EUs.
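For what it's worth, those figures work out to the following scaling efficiencies (a quick sanity check in Python on the ratios as stated above, not new measurements):

# Scaling efficiency = (relative performance) / (relative EU count),
# using the NUC10 (24 EU) as the baseline and the ratios quoted above.
configs = {
    "NUC8  (48 EU, +eDRAM)": (48 / 24, 1.5),   # "only got a 50% uplift"
    "NUC11 (96 EU, Xe)":     (96 / 24, 4.0),   # "400% of the NUC10 performance"
}
for name, (eu_ratio, perf_ratio) in configs.items():
    print(f"{name}: {perf_ratio / eu_ratio:.0%} of linear scaling")
# NUC8: 75% of linear, NUC11: 100% -- though the NUC11 is also a different
# architecture (Gen 9.5 vs Xe), which is the objection raised further down.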

I hope that's enough data to put your doubts about the quality of my measurements to rest.

To me it showed that both the 24 extra EUs on the NUC8 and the 2 extra cores on the NUC10 weren't really worth having because of diminishing returns. In the first case, the graphics performance increase just wasn't worth the technical effort that went into making it happen, and in the second case the 2 extra cores simply didn't pay off with only a 15 Watt TDP budget, because they needed to clock below the silicon knee even on truly parallel loads and wound up with performance pretty similar to the NUC8.

Only when I unleashed PL2 and TAU did they reach their potential, but the NUC fans become intolerable above 2000 rpm. The Tiger Lake managed the same CPU performance using only 4 cores, thanks to the improved IPC, with much less noise.
 

bit_user

Polypheme
Ambassador
the NUC11 got 400% of the NUC10's graphics performance without eDRAM, a pretty much linear scale for 96 vs 24 EUs.

I hope that's enough data to put your doubts about the quality of my measurements to rest.
Lots of details, only to get back to my main objection, which is that you really can't make scalability claims by comparing a Gen 9.5 iGPU vs a Gen 12 one. You need to compare Gen 12 vs Gen 12 in order to appreciate how well it scales.

Tests comparing the 32 EU Gen 12 in desktop CPUs have shown a greater-than-linear increase over 24 EU Gen 9.x iGPUs, so that really invalidates your scaling experiment.

To me it showed that both the 24 extra EUs on the NUC8 and the 2 extra cores on the NUC10 weren't really worth having because of diminishing returns.
That's a fair judgment.
 

SiliconFly

Prominent
Jun 13, 2022
99
37
560
There are a couple of red flags in the article no one seems to have noticed. Typically, the base tile is a passive interposer with just power vias & data links. AMD uses it. I believe many of the MTL parts will also use passive interposers rather than the ADM L4 base tile (due to cost & power issues).

(1) First off, having L4 SRAM in an active interposer (base tile) which also has power vias & data links is a nightmare scenario. Not sure how practical it is. Can anyone shed some light?

(2) Second, no one has addressed the power issue. A large L4 is going to use a lot of power. It's basically a tradeoff between performance & efficiency.

(3) The patent doesn't mention anything about MTL using ADM, which is a bit worrisome.

(4) Even if ADM exists in MTL, it may be restricted to only a few special parts (like AMD's X3D) due to power & more importantly... cost!

Any thoughts?
 

InvalidError

Titan
Moderator
(2) Second, no one has addressed the power issue. A large L4 is going to use a *LOT* of power. It's basically a tradeoff between performance & efficiency.
SRAM doesn't use that much power, since 99.99% of it is static on any given clock cycle: only about 1W per GB from leakage. And you can only cram about 200MB of it per 100sqmm, so layering SRAM under other tiles adds ~0.2W/sqcm, which is trivial on chips with a thermal density around 100W/sqcm.

What does use power is the tag RAM that keeps tabs on which SRAM rows contain what data and how long it has been since the last access, to decide which row to evict next, since every incoming read/write has to be checked against every row in the set an address belongs to.
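For reference, the arithmetic behind that density figure (a quick check in Python using the ballpark numbers above, which are order-of-magnitude estimates, not Meteor Lake specs):

# Ballpark figures from the post above.
leakage_w_per_gb = 1.0        # ~1 W per GB of SRAM from leakage
sram_mb_per_100mm2 = 200      # ~200 MB of SRAM per 100 mm^2 (= 1 cm^2)

gb_per_cm2 = sram_mb_per_100mm2 / 1024
leakage_w_per_cm2 = gb_per_cm2 * leakage_w_per_gb
print(f"~{leakage_w_per_cm2:.2f} W/cm^2 of leakage from a stacked SRAM layer")
# ~0.2 W/cm^2 -- negligible next to the ~100 W/cm^2 thermal density of the
# logic tiles stacked on top.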
 
Also, the power needs of the L4 don't matter if it makes the CPU run so much faster that it still comes out more efficient.
Less copying of data back and forth might make up for the extra L4 power on its own.
 

SiliconFly

Prominent
Jun 13, 2022
99
37
560
Also, the power needs of the L4 don't matter if it makes the CPU run so much faster that it still comes out more efficient.
Less copying of data back and forth might make up for the extra L4 power on its own.
Ok. Assuming MTL has ADM, ADM power draw is negligible, and ADM cost is not a major issue, that still leaves us with the most important question of all:

(1) Having L4 in an active interposer which also has power vias & data links is a nightmare scenario. Not sure how practical it is. Can anyone shed some light?

It's like trying to build a huge palace in an extremely dense forest without cutting down any plants or trees! :)

The thing is, if L4 can somehow sit comfortably in the base tile, it means it directly links to both the tCPU & the tGPU simultaneously, which is just amazing! It not only speeds up the tGPU, but the tCPU will also get a significant boost in IPC, which many articles have missed. A 512MB L4 cache will bring in a hefty double-digit IPC boost like 20% or 30% very easily. It actually sounds too good to be true! I wish it happens somehow, but I'm very skeptical at the moment.
 

InvalidError

Titan
Moderator
It's like trying to build a huge palace in an extremely dense forest without cutting down any plants or trees! :)
The cache doesn't have to be in one single contiguous array. It can be chopped up into islands and surrounded with TSVs to get power from the LGA substrate to their destination tiles.

In fact, caches are usually divided into sets, and each set corresponds to some subset of address bits or a function thereof. Your hypothetical 512MB cache could be chopped up into 256kB chunks and made into a 2048-way cache; then you only need enough contiguous space to squeeze a 256kB SRAM into, at the expense of routing logic and propagation delays to get data to/from whichever set its address belongs to.
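A quick way to see where the 2048 figure comes from, and one possible (purely illustrative) way an address could pick its island -- the line size and mapping here are assumptions, not anything from the patent:

# Hypothetical parameters from the discussion above.
cache_bytes = 512 * 1024**2    # 512 MB total L4
chunk_bytes = 256 * 1024       # 256 kB SRAM islands
line_bytes  = 64               # typical cache line size (assumed)

num_chunks = cache_bytes // chunk_bytes
print(num_chunks)              # 2048 islands

def chunk_for(addr):
    # Simplistic mapping: the address bits just above the line offset
    # select which island a cache line lands in.
    return (addr // line_bytes) % num_chunks

print(chunk_for(0x12345678))   # island index for this example address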

As for getting 20-30% IPC from this, hold your horses. AMD's Zen 5 L2$ experiments say it only gets ~4% better IPC from doubling L2 from 1MB to 2MB and ~7% from going to 3MB, which is with L2$ that has only ~14 cycles of latency. L3$ has 40+ cycles of latency, which will make it far less effective at reducing the cost of L1/L2 misses and L4$ would be even slower, especially if it gets chopped up into bits scattered across the interposer. The cost of misses increases with each additional tier you add and for L4$ to provide any benefit, its hits have to save more time on average than the flat additional latency added to all L3 misses.
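That break-even condition can be written out explicitly (a simplified average-cost model in Python; the latencies and hit rates are illustrative assumptions, not Meteor Lake figures):

# With an L4 tier, every L3 miss first pays the L4 lookup before (on an
# L4 miss) going to DRAM. The L4 only helps if its hits save more than
# that flat penalty costs across all L3 misses.
l4_latency   = 50     # extra cycles added to every L3 miss (assumed)
dram_latency = 200    # cycles for an L3 miss serviced by DRAM (assumed)

def l3_miss_cost_with_l4(l4_hit_rate):
    return l4_latency + (1 - l4_hit_rate) * dram_latency

for hit_rate in (0.1, 0.3, 0.5, 0.8):
    cost = l3_miss_cost_with_l4(hit_rate)
    verdict = "wins" if cost < dram_latency else "loses"
    print(f"L4 hit rate {hit_rate:.0%}: {cost:.0f} vs {dram_latency} cycles -> L4 {verdict}")
# Break-even here is a hit rate of l4_latency/dram_latency = 25%; below
# that, the extra tier makes L3 misses slower on average.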

Performance benefits from adding an extra cache tier will be circumstantial: some workloads may benefit significantly from having a working dataset that sits between L3$ and L4$ sizes while others that don't have much data locality within L4$ size will get hammered by the added latency.
 
  • Like
Reactions: SiliconFly
As for getting 20-30% IPC from this, hold your horses. AMD's Zen 5 L2$ experiments say it only gets ~4% better IPC from doubling L2 from 1MB to 2MB and ~7% from going to 3MB, which is with L2$ that has only ~14 cycles of latency. L3$ has 40+ cycles of latency, which will make it far less effective at reducing the cost of L1/L2 misses and L4$ would be even slower, especially if it gets chopped up into bits scattered across the interposer. The cost of misses increases with each additional tier you add and for L4$ to provide any benefit, its hits have to save more time on average than the flat additional latency added to all L3 misses.
I very much doubt that the L4 cache will have anything to do with cache misses.
It's all about having massive datasets available for the CPU in a more convenient place than main RAM.
That's what benchmarks utilize, and it's also what real-life usage does.
Look at 7-Zip: it has a 32 MB dataset in the built-in benchmark by default, and Zen was much better in that benchmark until Intel got a big enough cache as well.
And that goes for most things in general: the more of the data you have close to the CPU, the faster it will go.
Look at X3D game benchmarks compared to non-X3D.
 

SiliconFly

Prominent
Jun 13, 2022
99
37
560
I very much doubt that the L4 cache will have anything to do with cache misses.
It's all about having massive datasets available for the CPU in a more convenient place than main RAM.
That's what benchmarks utilize, and it's also what real-life usage does.
Look at 7-Zip: it has a 32 MB dataset in the built-in benchmark by default, and Zen was much better in that benchmark until Intel got a big enough cache as well.
And that goes for most things in general: the more of the data you have close to the CPU, the faster it will go.
Look at X3D game benchmarks compared to non-X3D.
I think what InvalidError is trying to say is that the performance boost from a large L4 might not be as big in all workloads.

That said, a large L4 will be fantastic to have if Intel can do it (which is very doubtful at the moment with MTL).
 
  • Like
Reactions: bit_user

SiliconFly

Prominent
Jun 13, 2022
99
37
560
The cache doesn't have to be in one single contiguous array. It can be chopped up into islands and surrounded with TSVs to get power from the LGA substrate to their destination tiles.

In fact, caches are usually divided into sets, and each set corresponds to some subset of address bits or a function thereof. Your hypothetical 512MB cache could be chopped up into 256kB chunks and made into a 2048-way cache; then you only need enough contiguous space to squeeze a 256kB SRAM into, at the expense of routing logic and propagation delays to get data to/from whichever set its address belongs to.

As for getting 20-30% IPC from this, hold your horses. AMD's Zen 5 L2$ experiments say it only gets ~4% better IPC from doubling L2 from 1MB to 2MB and ~7% from going to 3MB, which is with L2$ that has only ~14 cycles of latency. L3$ has 40+ cycles of latency, which will make it far less effective at reducing the cost of L1/L2 misses and L4$ would be even slower, especially if it gets chopped up into bits scattered across the interposer. The cost of misses increases with each additional tier you add and for L4$ to provide any benefit, its hits have to save more time on average than the flat additional latency added to all L3 misses.

Performance benefits from adding an extra cache tier will be circumstantial: some workloads may benefit significantly from having a working dataset that sits between L3$ and L4$ sizes while others that don't have much data locality within L4$ size will get hammered by the added latency.
One more thing. I think ADM L4 cache latency will actually be very close to L3. In essence, it'll work more like a massive L3 cache, with latency close to that of the tCPU's built-in L3.
 

InvalidError

Titan
Moderator
One more thing. I think ADM L4 cache latency will actually be very close to L3. In essence, it'll work more like a massive L3 cache, with latency close to that of the tCPU's built-in L3.
Before the L4$ can do its thing, the L3$ has to conclude that it missed. The read/write address also has to get to it over an off-die interface and routing fabric that connects L4$ to everything else. The L3$ adds 36 cycles on top of L2$ latency on Zen 4, a bigger L4$ made on a cheaper process (interposers are made on 12-16nm class processes if that is really where you want to put your L4$) would almost certainly add another 40+ cycles due to the much longer physical roundtrip and extra clocked hops to help cover the distance.
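Stacking those numbers up shows why "almost the same latency as L3" is a stretch. In the sketch below, the ~14-cycle L2 and the +36-cycle L3 adder come from the figures quoted above; the L1 number and the +40-cycle L4 adder are assumptions for illustration:

# Approximate load-to-use latency at each level, in cycles.
l1 = 4                 # assumed
l2 = 14                # ~14-cycle L2 (from the discussion above)
l3 = l2 + 36           # Zen 4's L3 adds ~36 cycles on top of L2 -> ~50
l4 = l3 + 40           # assumed off-die adder -> ~90, before extra fabric hops

for name, cycles in (("L1", l1), ("L2", l2), ("L3", l3), ("L4 (off-die)", l4)):
    print(f"{name:12s} hit: ~{cycles} cycles")
# Even with these optimistic numbers, an L4 hit costs nearly twice an L3 hit,
# so matching L3 latency would require a far smaller adder than 40 cycles.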