News Jim Keller Shares Zen 5 Performance Projections

The only person you need to negotiate with is yourself. You're a CPU architect in your own mind, and nowhere else.
I can see value in both sides of the equation of where to put the iGPU.

SMT8 won't happen. SMT4 is a stretch. Currently, the benefits shown by SMT2 are limited enough that I think even 4-way isn't a foregone conclusion.
It depends on the workload in question. Some benefit more from SMT(>2) than others.

Heavy SMT testing on AnandTech.

That's why I like a more dynamic & flexible SMT1-8 model.
You adjust as needed to get the best performance for the workload you're doing.


Keep in mind that ARM Neoverse cores are competitive, even without SMT, and Intel seems to be moving in that direction with Sierra Forest.
Intel also seems to be a slowly sinking ship, so they aren't the best judge of which direction is best.

That's easier to do when you're limited to a lower clockspeed. The latency hit from increasing the size is less, because each clock cycle takes longer.
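To put rough, made-up numbers on that (the 0.3 ns penalty below is purely illustrative, not a measured figure):

```python
# Rough illustration: the same absolute latency penalty from a bigger L1 costs
# more cycles at a higher clock, because each cycle is shorter.
def extra_cycles(extra_latency_ns: float, clock_ghz: float) -> float:
    cycle_time_ns = 1.0 / clock_ghz
    return extra_latency_ns / cycle_time_ns

# Suppose growing the L1 adds ~0.3 ns of access time (illustrative figure):
print(extra_cycles(0.3, 3.2))  # ~1 extra cycle at an Apple-like ~3.2 GHz
print(extra_cycles(0.3, 5.5))  # ~1.7 extra cycles at a ~5.5 GHz x86 core
```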
And both Intel & AMD are slowly increasing their cache sizes over time.
I just didn't expect Apple to get to 192 KiB L1(I+D)$ first.
Both have done interesting things to mitigate latency.

Apple can also more readily afford to spend more $ on die area, if it saves them some power consumption.
I'm sure that same logic applies to AMD & Intel as well.

Due to a lot more than just cache sizes.
True, that's just one portion of the equation.
 
Once layering DRAM and SRAM together gets figured out, you should have everything necessary to layer them on top of logic as well.
At the end of that Jim Keller video link I sent you, he says layering DRAM atop logic is insane (not in an impressed way, but rather to say it's a deeply problematic concept).

Then chip design will be primarily limited by thermal density - can't pack things tighter than you can cool them.
He also complained that die stacking came too late, given the thermal densities of logic on today's process nodes.
 
It depends on the workload in question. Some benefit more from SMT(>2) than others.

Heavy SMT testing on AnandTech.
If you pair a CPU with a shallow OoO buffer and weak branch-prediction & prefetching, then of course it's going to do much better with heavy SMT. That's why GPUs rely so heavily on SMT, in fact.

There are so many factors at play other than the workload, but confirmation bias means you're just looking for an example where wide SMT delivers a good benefit and never bothering to ask all the other questions that matter.

That's why I like a more dynamic & flexible SMT1-8 model.
You adjust as needed to get the best performance for the workload you're doing.
We've been through this. It won't work well with OS schedulers if the CPU just decides on its own that it's going to starve out some of the threads. An OS scheduler assigns threads to cores and assumes they'll make consistent progress.

In the worst case, the CPU could decide to keep executing a thread stuck in a spinlock, while starving out the thread the first one is blocked on.

And both Intel & AMD are slowly increasing their cache sizes over time.
L3, mostly. They kept L1 & L2 sizes static for a remarkably long time.

Zen 1, 2, and 3 all had the same L1D and L2 sizes (Zen 2 did halve L1I to 32 kB, though). They only bumped up L2 by a simple 2x in Zen 4, leaving L1D still at the same 32 kB as Zen 1.

Intel remarkably cut per-core L2 down to 256 kB, going from Penryn -> Nehalem. From then (2008) through Comet Lake (2019), they kept the combo of 64 kB L1 and 256 kB L2. It was only Ice Lake (2019) and Rocket Lake (2021) that bumped L2 up to 512 kB. Tiger Lake & Golden Cove then bumped it up to 1.25 MB.

I'm sure that same logic applies to AMD & Intel as well.
No, because Apple is a systems company. If they make more efficient CPUs, they can save money on batteries. A CPU company has to be very sensitive to the price of their silicon, which pushes them in the direction of optimizing for PPA at the expense of PPW. That's also why they depend on clocking higher to achieve performance similar to Apple's.
 
At the end of that Jim Keller video link I sent you, he says layering DRAM atop logic is insane (not in an impressed way, but rather to say it's a deeply problematic concept).
It is only problematic because current chip fabrication processes are optimized to form fast transistors and DRAM cells in the base silicon layer, and logic and DRAM have fundamentally incompatible requirements there: fast CMOS silicon has too much leakage for DRAM, while DRAM silicon is too slow for high-speed logic.

With ribbon FETs, where channels are layered on top of each other instead of etched into the base silicon, and GAA, where the channel is detached from the base silicon, it looks like chip manufacturing won't be limited to base-layer high-speed transistors for much longer. That should make it possible to, at the very least, slap high-speed logic on top of DRAM by inserting a silicon oxide layer and power plane between the two to shield DRAM from CMOS leakage and noise.

If you can slap simple logic (like SRAM) on DRAM, then you can do fun things like having 128kB of SRAM per memory bank (64x2kB rows) to hold whole DRAM rows. Then you can have 1000+ open rows per DRAM chip and the ~17ns precharge latency can become a background garbage collection style process where you open all the rows you need, tell the DRAM chip when you are done with them, then let it evict them in the background while you work on all the new stuff. Moving the row address decoders to logic layers could trim RAS latency by a couple of ns. Moving the CAS muxes to logic layers should significantly reduce CAS latency too. The remaining usable floor space could be spent on basic compute-in-memory functions.
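Just to make the accounting concrete, here's a toy model of that per-bank row cache. The 64-row (128 kB) SRAM per bank and the ~17 ns precharge figure come from the description above; the ~14 ns activate/CAS timings are illustrative DDR-ish values, and the class, LRU policy, and function names are purely my own sketch:

```python
from collections import OrderedDict

# Toy model of per-bank SRAM row caching with background precharge.
# Latencies are illustrative: row activate (tRCD) ~14 ns, column access
# (tCL) ~14 ns, precharge (tRP) ~17 ns.
T_RCD, T_CL, T_RP = 14, 14, 17
SRAM_ROWS_PER_BANK = 64          # 64 x 2 kB rows = 128 kB of SRAM per bank

class Bank:
    def __init__(self):
        self.open_rows = OrderedDict()   # rows currently held in the SRAM cache

    def access(self, row: int) -> int:
        """Return the latency (ns) for one access to `row` in this bank."""
        if row in self.open_rows:
            self.open_rows.move_to_end(row)
            return T_CL                  # row already cached: CAS-only access
        latency = T_RCD + T_CL           # must activate the row first
        if len(self.open_rows) >= SRAM_ROWS_PER_BANK:
            # Evict the least-recently-used row; with background precharge the
            # tRP cost is hidden instead of stalling this access.
            self.open_rows.popitem(last=False)
        self.open_rows[row] = True
        return latency

# Conventional DRAM with a single open row per bank pays tRP + tRCD + tCL
# (~45 ns with these numbers) on every row conflict; the cached-row model
# above pays at most tRCD + tCL (~28 ns), and only tCL on the 64 hottest rows.
bank = Bank()
print(bank.access(7))   # 28 ns: first touch of row 7
print(bank.access(7))   # 14 ns: row 7 is now held in SRAM
```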

Likely fun stuff ahead.
 
It depends on the workload in question. Some benefit more from SMT(>2) than others.

Heavy SMT testing on AnandTech.

That's why I like a more dynamic & flexible SMT1-8 model.
You adjust as needed to get the best performance for the workload you're doing.
If you looked further into the article, you would see POWER8's shortcomings, as @bit_user mentioned. If anything, having that wide of an SMT implementation was likely meant as a stall-hiding mechanism, not necessarily a throughput mechanism (though that certainly helps).

In any case, the problem with SMT is you have to duplicate certain resources in order to maintain the state of those threads. Considering Intel and AMD also target the consumer market, where most apps don't really benefit from lots of cores/hardware threads, there really isn't much of a benefit to going up to, say, SMT8 if what most people run isn't even going to benefit from it. That's silicon you could use elsewhere, or not have at all.
 
the problem with SMT is you have to duplicate certain resources in order to maintain the state of those threads.
Since we live in an era where ISA registers are a mere fiction, I'm pretty sure the silicon overhead of SMT is quite small. I think the main penalty is cache contention & thrashing. So, it would push you towards bigger caches, and that gets expensive.

Also, there would be more contention for physical registers, which might push you in the direction of expanding the physical register file, if it became the main point of contention.
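To illustrate why the duplicated state is so small, here's a minimal rename-table sketch (not any real core's microarchitecture; the 16/256 sizes are made up): each SMT thread only carries a small map from architectural to physical registers, while the big physical register file and its free list are shared, which is also where the contention shows up.

```python
# Minimal register-renaming sketch: per-thread architectural state is just a
# small rename map; the expensive structure (the physical register file) is
# shared by all SMT threads. Sizes are illustrative, not any real core's.
NUM_ARCH_REGS = 16        # e.g. x86-64 integer GPRs
NUM_PHYS_REGS = 256       # shared physical register file

free_list = list(range(NUM_PHYS_REGS))

class ThreadContext:
    """All that has to be duplicated per SMT thread for integer registers."""
    def __init__(self):
        # Map each architectural register to some physical register.
        self.rename_map = [free_list.pop() for _ in range(NUM_ARCH_REGS)]

    def write_dest(self, arch_reg: int) -> int:
        """Allocate a fresh physical register for a new value of arch_reg."""
        new_phys = free_list.pop()       # contention: all threads share this pool
        self.rename_map[arch_reg] = new_phys
        return new_phys

t0, t1 = ThreadContext(), ThreadContext()   # adding a thread adds ~16 map entries
print(len(free_list))                       # 224 physical regs left, shared by both
```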

The main reason I'm skeptical about wider SMT is that it doesn't do a great job at addressing what's becoming the dominant issue, and that's energy-efficiency. For that, using smaller cores is probably the way to go. It'll be interesting to see how AMD's dual-threaded Zen 4C cores compare with Intel's single-threaded Crestmont cores.

The secondary reason is probably more to do with how many workloads really scale to so many threads, even in the server world.
 
Since we live in an era where ISA registers are a mere fiction, I'm pretty sure the silicon overhead of SMT is quite small. I think the main penalty is cache contention & thrashing. So, it would push you towards bigger caches, and that gets expensive.

Also, there would be more contention for physical registers, which might push you in the direction of expanding the physical register file, if it became the main point of contention.

The main reason I'm skeptical about wider SMT is that it doesn't do a great job at addressing what's becoming the dominant issue, and that's energy-efficiency. For that, using smaller cores is probably the way to go. It'll be interesting to see how AMD's dual-threaded Zen 4C cores compare with Intel's single-threaded Crestmont cores.

The secondary reason is probably more to do with how many workloads really scale to so many threads, even in the server world.
Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000
Single-Threaded Apps & Gaming don't really seem to benefit from SMT.

It's professional apps that seem to benefit the most.


Results Overview
In our testing, we covered three areas: Single Thread, Multi-Thread, and Gaming Performance.

In single threaded workloads, where each thread has access to all of the resources in a single core, we saw no change in performance when SMT is enabled – all of our workloads were within 1% either side.

In multi-threaded workloads, we saw an average uplift in performance of +22% when SMT was enabled. Most of our tests scored a +5% to a +35% gain in performance. A couple of workloads scored worse, mostly due to resource contention having so many threads in play – the limit here is memory bandwidth per thread. One workload scored +60%, a computational workload with little-to-no memory requirements; this workload scored even better in AVX2 mode, showing that there is still some bottleneck that gets alleviated with fewer instructions.

On gaming, overall there was no difference between SMT On and SMT Off, however some games may show differences in CPU limited scenarios. Deus Ex was down almost 10% when CPU limited, however Borderlands 3 was up almost 10%. As we moved to a more GPU limited scenario, those discrepancies were neutralized, with a few games still gaining single-digit percentage points improvement with SMT enabled.

For power and performance, we tested two examples where running two threads per core either saw no improvement (Agisoft) or a significant improvement (3DPMavx). In both cases, SMT Off mode (1 thread/core) ran at higher temperatures and higher frequencies. For the benchmark where performance was about equal, the power consumed was a couple of percentage points lower when running one thread per core. For the benchmark where running two threads per core gave a big performance increase, the power in that mode was also lower, and there was a significant +91% performance per watt improvement by enabling SMT.
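As a sanity check on how a +91% perf/W figure can fall out of those two effects: the ~1.6x uplift is the figure quoted above, while the ~16% power reduction is my own illustrative assumption, not AnandTech's measurement.

```python
# Performance-per-watt compounds a performance gain with a power reduction.
perf_gain = 1.60      # ~+60% throughput with SMT on (figure quoted above)
power_ratio = 0.84    # assume SMT-on power is ~16% lower (illustrative)

perf_per_watt_gain = perf_gain / power_ratio
print(f"{(perf_per_watt_gain - 1) * 100:.0f}% better perf/W")  # ~90%
```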
 
Since we live in an era where ISA registers are a mere fiction, I'm pretty sure the silicon overhead of SMT is quite small. I think the main penalty is cache contention & thrashing. So, it would push you towards bigger caches, and that gets expensive.

Also, there would be more contention for physical registers, which might push you in the direction of expanding the physical register file, if it became the main point of contention.
I would argue there are a lot of other things that the CPU may need to keep track of outside of the ISA registers in order to ensure more or less smooth performance. For instance, on MIPS CPUs with VMT, there are a lot of CP0 registers that get duplicated because they're per-thread. I'd imagine there would also have to be enough resources for, say, branch history, TLBs, etc., so that when it switches to another thread, there's not much of a penalty for doing so (after all, that was a major point of MT)

Although I shouldn't have said "the" problem, but rather "a" problem.
 
If you looked further into the article, you would see POWER8's shortcomings, as @bit_user mentioned. If anything, having that wide of an SMT implementation was likely meant as a stall-hiding mechanism, not necessarily a throughput mechanism (though that certainly helps).

In any case, the problem with SMT is you have to duplicate certain resources in order to maintain the state of those threads. Considering Intel and AMD also target the consumer market, where most apps don't really benefit from lots of cores/hardware threads, there really isn't much of a benefit to going up to, say, SMT8 if what most people run isn't even going to benefit from it. That's silicon you could use elsewhere, or not have at all.
But not all customer code is going to have very deep branching or high IPC; some may benefit more from having more threads.

Given how wide & varied client code is across apps, some benefit more from higher or lower thread counts per core. The only way to really find out is through testing, benchmarking, & validation.

That's kind of the point of a flexible SMT system: it allows you to partition the existing resources in each core to best suit the code at hand.

You saw the results from the SMT test on Zen 3: some code REALLY benefits from extra threads, while other code doesn't at all.

And you can't predict what kind of code they'll write, so having a flexible way of doing SMT would be more beneficial than one size fits all.

Obviously your core designs would need to be adjusted for that, but I don't see SMT(1-8) as a bad thing, especially if you can configure which SMT mode you're using on the fly on each core.
 
That's probably what the SPEC2017 subscores would show, if the breakdowns had been published (I posted the aggregates, above).

3D Particle Movement is probably an outlier, in that it's a floating-point app that benefits greatly from SMT. However, the AVX-512 version is massively faster, and I'm sure a good deal of the optimizations that went into it were around data movement and probably avoiding some cache thrashing. That's normally the only way you get order-of-magnitude improvements on compute-bound tasks, without changing the fundamental algorithm.

It's too bad we can only speculate, because (AFAIK) 3D PM was never open-sourced.

BTW, note that these tests were run on a 16-core 5950X, whereas the ones I posted above were on a 64-core 7763 (1P and 2P). I guess the ratio of cores to memory channels is still the same, but the 5950X cores are clocked higher.
 
But not all customer code is going to have very deep branching or high IPC; some may benefit more from having more threads.
Right, but how often is that actually done? Unless they're a content creator or are archiving their media collection, there aren't a whole lot of use cases where a consumer would necessarily benefit from having more hardware threads.

Given how wide & varied client code is across apps, some benefit more from higher or lower thread counts per core. The only way to really find out is through testing, benchmarking, & validation.
Or one can look at Task Manager and see if their CPU utilization is close to 100% on something like a 12+ thread CPU. If it is, then sure, they could probably stand to have more cores or threads. But even as I'm typing this with a YouTube video playing, Discord, and various other apps running, I'm sitting at like <10% CPU utilization.

That's kind of the point of a flexible SMT system: it allows you to partition the existing resources in each core to best suit the code at hand.

You saw the results from the SMT test on Zen 3: some code REALLY benefits from extra threads, while other code doesn't at all.

And you can't predict what kind of code they'll write, so having a flexible way of doing SMT would be more beneficial than one size fits all.

Obviously your core designs would need to be adjusted for that, but I don't see SMT(1-8) as a bad thing, especially if you can configure which SMT mode you're using on the fly on each core.
The core issue I see with your argument is that you feel like this is a free lunch. A flexible design would require something in the system to keep tabs on thread residency in order to predict how many threads to schedule on a core. I feel this is almost getting into Halting Problem territory.

A major hurdle with this is that code that runs on the CPU tends to be unpredictable. It's not that the running software doesn't know what it's going to do next, but on a typical consumer system, the human also plays a role in this. Every time I type a key, I'm firing off an interrupt, which ruins the determinism of how to schedule things. Or if I have things coming in on the network (and things are always coming in on the network), that has to be serviced as well.

The only reason why GPUs can get away with a deep SMT design is because the workloads GPUs do are highly predictable and deterministic. And it has to be, otherwise the GPU wouldn't work as well as it does.
 
I would argue there are a lot of other things that the CPU may need to keep track of outside of the ISA registers in order to ensure more or less smooth performance.
Well, I guess in the age of Spectre and Meltdown, you need a separate branch predictor per SMT thread and maybe even separate cache prefetcher state. Perhaps duplicate TLBs, also.

For instance, on MIPS CPUs with VMT, there are a lot of CP0 registers that get duplicated because they're per-thread.
I'm sure status registers are now renamed, as well. I wouldn't be surprised if they did dataflow analysis down to the level of individual flag bits.

I'd imagine there would also have to be enough resources for, say, branch history, TLBs, etc., so that when it switches to another thread, there's not much of a penalty for doing so (after all, that was a major point of MT)
Across a context switch, only the ISA registers are retained.
 
That's probably what the SPEC2017 subscores would show, if the breakdowns had been published (I posted the aggregates, above).

3D Particle Movement is probably an outlier, in that it's a floating-point app that benefits greatly from SMT. However, the AVX-512 version is massively faster, and I'm sure a good deal of the optimizations that went into it were around data movement and probably avoiding some cache thrashing. That's normally the only way you get order-of-magnitude improvements on compute-bound tasks, without changing the fundamental algorithm.

It's too bad we can only speculate, because (AFAIK) 3D PM was never open-sourced.

BTW, note that these tests were run on a 16-core 5950X, whereas the ones I posted above were on a 64-core 7763 (1P and 2P). I guess the ratio of cores to memory channels is still the same, but the 5950X cores are clocked higher.
Given that 3D PM wasn't open-sourced, I'd rather not guess as to why it has such huge gains.
But the fact remains that it does have huge gains with SMT2.

That's why I think most apps should be tested/benchmarked & validated with various SMT types.
With bigger caches and bigger cores that have more resources to sub-divide, especially given the upcoming era of process node shrinkage and the ability to add more resources in every category for a given fixed die area.

Then you can probably set a preference for which SMT you want per core and leave it at that to get the most performance out of your processor.
 
That's kind of the point of a flexible SMT system: it allows you to partition the existing resources in each core to best suit the code at hand.
We actually have a "flexible" solution, today. You can run either 1 or 2 threads on a core. If you run only 1 thread, then it gets exclusive use of the cache, pipelines, and physical register file.

What you could do to make it "dynamic" is to have the OS thread scheduler track cache & pipeline utilization, on a per-thread basis, by polling the performance registers. It could then pair up low-utilization threads on a P-Core (or demote them to a single-threaded E-core, on a hybrid CPU). Conversely, threads experiencing a lot of contention could be prioritized for running solo, if sufficient resources are available.

This is a purely software solution that could be implemented today.
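A rough sketch of what that software policy could look like; read_thread_counters() is a hypothetical placeholder for however the OS samples the performance registers, and the IPC/miss thresholds are made-up numbers, not tuned values:

```python
# Sketch of a software SMT-pairing policy driven by per-thread perf counters.
from dataclasses import dataclass

@dataclass
class ThreadStats:
    tid: int
    ipc: float         # retired instructions per cycle while running
    miss_rate: float   # cache miss pressure, normalized to 0..1 however the OS likes

def read_thread_counters(tid: int) -> ThreadStats:
    raise NotImplementedError("placeholder for an OS/perf-counter interface")

def plan_placement(stats: list[ThreadStats]):
    """Pair low-utilization threads on one SMT core; give hungry threads a core alone."""
    hungry = [s.tid for s in stats if s.ipc > 1.5 or s.miss_rate > 0.5]
    light = [s.tid for s in stats if s.tid not in hungry]
    solo_cores = [[tid] for tid in hungry]                            # 1 thread per core
    paired_cores = [light[i:i + 2] for i in range(0, len(light), 2)]  # SMT2 pairs
    return solo_cores + paired_cores

# Example with canned stats instead of real counters:
demo = [ThreadStats(1, 2.1, 0.1), ThreadStats(2, 0.4, 0.2), ThreadStats(3, 0.5, 0.1)]
print(plan_placement(demo))   # [[1], [2, 3]]
```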

I don't see SMT(1-8) as a bad thing, especially if you can configure which SMT mode you're using on the fly on each core.
It adds nonzero cost, so there needs to be enough cases that benefit from it. I'm going to speculate that most of those cases which gained from SMT2 would show little or no further benefit from SMT4, and we'd probably see even more regression on the cases which don't benefit.
 
Well, I guess in the age of Spectre and Meltdown, you need a separate branch predictor per SMT thread and maybe even separate cache prefetcher state. Perhaps duplicate TLBs, also.


I'm sure status registers are now renamed, as well. I wouldn't be surprised if they did dataflow analysis down to the level of individual flag bits.


Across a context switch, only the ISA registers are retained.
At least not separate items per thread, but you still need to build enough capacity into them to support more threads, especially if you're going to assume the threads are going to put the same amount of pressure on them.

Going back to MIPS, they tripled the branch history table capacity between two generations, presumably to help support a 4 thread-per-core processor.
 
Right, but how often is that actually done? Unless they're a content creator or are archiving their media collection, there aren't a whole lot of use cases where a consumer would necessarily benefit from having more hardware threads.
Depends on what they're doing.

Are they editing video and encoding at the same time?
Are they rendering and doing other tasks in the background?
What is their system doing, and how many different things are going on at the same time?

The current workload can vary drastically.

Or one can look at Task Manager and see if their CPU utilization is close to 100% on something like a 12+ thread CPU. If it is, then sure, they could probably stand to have more cores or threads. But even as I'm typing this with a YouTube video playing, Discord, and various other apps running, I'm sitting at like <10% CPU utilization.
You might also be running loads that aren't very demanding of CPU resources. Even if you run 100 different apps, if all of them are lightly threaded and aren't demanding work-loads, then it really doesn't matter.

You're not taxing the CPU with work-loads that are demanding enough.

The core issue with your argument that I see is you feel like this is a free lunch. A flexible design would require somewhere in the system to keep tabs on thread residency in order to predict how many threads to schedule in a core. I feel this is almost getting into the Halting Problem territory.
It can be determined by the CPU/OS/End User.

Which one you choose is up to you; I'd prefer to manually benchmark and set limiters on each program for how many threads per core it should generate before moving onto another core.

Given that we're coming into the era of many cores per consumer CPU, I don't see that as an issue.
Assuming you know what you're doing, you can schedule or limit how many threads per core to run simultaneously, then scale up in core count as appropriate for your work load.

A major hurdle with this is that code that runs on the CPU tends to be unpredictable. It's not that the running software doesn't know what it's going to do next, but on a typical consumer system, the human also plays a role in this. Every time I type a key, I'm firing off an interrupt, which ruins the determinism of how to schedule things. Or if I have things coming in on the network (and things are always coming in on the network), that has to be serviced as well.
That's why we have the many-core paradigm in this day and age.
You can be doing whatever you're doing, network traffic can run in the background for whatever apps you have open, and you can even have certain work-loads doing heavy stuff on other cores.

All auto-managed for you. You can even put limiters on each program.

The only reason why GPUs can get away with a deep SMT design is because the workloads GPUs do are highly predictable and deterministic. And it has to be, otherwise the GPU wouldn't work as well as it does.
Their workloads also don't branch very much, or at all. The fact that they're largely linear and computation-heavy with little to no branching makes them perfect for GPUs.

If your code branches like crazy and jumps around, GPUs might not be the right solution for you. Everything depends on what you're trying to do. That's the kicker.
 
The only reason why GPUs can get away with a deep SMT design is because the workloads GPUs do are highly predictable and deterministic. And it has to be, otherwise the GPU wouldn't work as well as it does.
I don't agree with that. GPUs rely on SMT for latency-hiding and have very little reliance on temporal locality. Because their memory latencies can be huge, they need lots of SMT threads, but it's okay because cache-thrashing is virtually a non-issue (in part, thanks to on-die scratchpad memory).

For GPUs, the main downside of SMT is that they actually do need to duplicate the entire ISA state, since they don't do register renaming.
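For a back-of-the-envelope sense of the scale (all numbers invented for illustration): if a warp only has a handful of cycles of independent work between long memory accesses, the scheduler needs enough resident warps to cover that gap.

```python
# Little's-law-style estimate of how many resident warps are needed to hide
# memory latency. Both numbers are illustrative, not any specific GPU's specs.
memory_latency_cycles = 400      # time a warp waits on a DRAM access
work_per_warp_cycles = 10        # independent ALU work a warp has between loads

warps_needed = memory_latency_cycles / work_per_warp_cycles
print(warps_needed)              # ~40 warps per scheduler just to stay busy
```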

Their workloads also don't branch very much, or at all. The fact that they're largely linear and computation-heavy with little to no branching makes them perfect for GPUs.

If your code branches like crazy and jumps around, GPUs might not be the right solution for you. Everything depends on what you're trying to do. That's the kicker.
GPUs branch quite well, actually. The only thing bad about branches on a GPU is when they break SIMD and the compiler has to resort to predication.
 
We actually have a "flexible" solution, today. You can run either 1 or 2 threads on a core. If you run only 1 thread, then it gets exclusive use of the cache, pipelines, and physical register file.
That's only because we live in an SMT2 world on the x86 side.
That's the only options we have.

What you could do to make it "dynamic" is to have the OS thread scheduler track cache & pipeline utilization, on a per-thread basis, by polling the performance registers. It could then pair up low-utilization threads on a P-Core (or demote them to a single-threaded E-core, on a hybrid CPU). Conversely, threads experiencing a lot of contention could be prioritized for running solo, if sufficient resources are available.

This is a purely software solution that could be implemented today.
If you want the OS thread scheduler to do it, that's fine.
A dedicated Hardware thread scheduler might be better depending on implementation.
As long as they can track resource usage within each core and determine if it's truly beneficial to spawn more threads.
Of course, a manual solution where you can set maximum thread limits per core would be nice.

It adds nonzero cost, so there needs to be enough cases that benefit from it. I'm going to speculate that most of those cases which gained from SMT2 would show little or no further benefit from SMT4, and we'd probably see even more regression on the cases which don't benefit.
I'd rather Benchmark, Test, & Validate. Then set up some rules in the OS for how many "Maximum Threads" to run simultaneously per core for each app.
 
That's only because we live in an SMT2 world on the x86 side.
Yes, but what I said generalizes to SMT-n.

A dedicated Hardware thread scheduler might be better depending on implementation.
The scheduling criteria used by the OS are far too complicated to bake into hardware. That's why Intel just implemented a Thread Director, which merely spies on the threads and tries to collect performance data as hints for the OS' scheduler.

I'd rather Benchmark, Test, & Validate. Then set up some rules in the OS for how many "Maximum Threads" to run simultaneously per core for each app.
It's going to vary as a function of what else is running at the time. That's why a dynamic solution beats a static one.
 
Yes, but what I said generalizes to SMT-n.
So then there will be more choices of SMT-# as to what is optimal at the time.

The scheduling criteria used by the OS are far too complicated to bake into hardware. That's why Intel just implemented a Thread Director, which merely spies on the threads and tries to collect performance data as hints for the OS' scheduler.
They should've called it 'Thread Observer' instead of 'Thread Director' given the way you describe it.

It's going to vary as a function of what else is running at the time. That's why a dynamic solution beats a static one.
So let the OS Scheduler handle things is what it comes down to.
 
Right, but that sounds very passive and might (legitimately) raise concerns over side-channel attacks.

I'm pretty sure the Thread Director has been a target of hackers, but Intel assures us it's secure (decide for yourself what to make of that).
How many times has Intel said "We're secure", only for us to discover a new Spectre/Meltdown-class vulnerability?

Their solution (instead of issuing a new patch) is to buy their new CPU ___ ?

Instead of issuing a proper microcode update to fix their older CPUs.
 

What I'm more interested in is the potential for a "Big APU" from AMD.
Imagine what could be possible with a large monolithic die APU for mobile/desktop.
I know the Die cost would be quite high.

On a 400 mm² APU die, the BoM cost from TSMC alone would be at least $135, and that's not factoring in AMD's R&D & other fixed costs, their ~50% profit margins, or the Testing, Binning, Validation, Packaging, & Shipping costs.
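For what it's worth, here's the kind of back-of-the-envelope die-cost math that gets you into that ballpark; the wafer price and defect density below are my own assumptions (TSMC doesn't publish them), so treat the result as an order-of-magnitude sanity check, not a quote:

```python
import math

# Rough cost estimate for a 400 mm^2 monolithic APU on a 300 mm wafer.
# Wafer price and defect density are assumed values, not published figures.
WAFER_PRICE_USD = 17000      # assumed N5-class wafer price
DEFECT_DENSITY = 0.07        # assumed defects per cm^2
DIE_AREA_MM2 = 400.0
WAFER_DIAMETER_MM = 300.0

def dies_per_wafer(area_mm2: float) -> float:
    r = WAFER_DIAMETER_MM / 2
    return (math.pi * r**2) / area_mm2 - (math.pi * WAFER_DIAMETER_MM) / math.sqrt(2 * area_mm2)

def yield_poisson(area_mm2: float) -> float:
    return math.exp(-DEFECT_DENSITY * area_mm2 / 100.0)   # area converted to cm^2

gross = dies_per_wafer(DIE_AREA_MM2)          # ~143 candidate dies per wafer
good = gross * yield_poisson(DIE_AREA_MM2)    # ~108 good dies at this defect density
print(f"~${WAFER_PRICE_USD / good:.0f} per good die")      # roughly $150-160
```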

With the upcoming DX12 update:
DX12 Update allows CPU to access VRAM simultaneously
And a Monolithic APU that has a decent number of CU's, like 64 CU's of RDNA3

Everything @InvalidError has been asking for, or the closest thing to it, would be optimally used in a Mobile APU form factor.

HBM3 for Mobile since it's the best type of VRAM to be used in a small Mobile Form Factor.

At least 12 GiB of SIP (System-In-Package) for LPDDR6 for Adjacent Packaging of regular System Memory.

Optional Main Memory expansion board using Dell's CAMM that is certified by JEDEC to replace SO-DIMM's.

The cooling would be crazy if you moved it to an SFF (Small Form Factor) or Mini-Tower case and you were allowed to increase the TDP of the CPU package.
 
What I'm more interested in is the potential for a "Big APU" from AMD.
Imagine what could be possible with a large monolithic die APU for mobile/desktop.
I know the Die cost would be quite high.
Well, Apple has its M-series. Those are monolithic, except for the Ultra.

HBM3 for Mobile since it's the best type of VRAM to be used in a small Mobile Form Factor.
Apple is doing fine with LPDDR5X. Nvidia claims Grace would use the same power with either that or HBM (I forget if it was 2e or 3), but that LPDDR5X is currently a lot cheaper.