News AMD’s beastly ‘Strix Halo’ Ryzen AI Max+ debuts with radical new memory tech to feed RDNA 3.5 graphics and Zen 5 CPU cores

I don't think it's math but grammar or semantics.

What they call "2x faster" is in fact 2x as fast, while 200% faster is 3x as fast.

And it is quite intentionally misleading by everyone who pairs a comparative like faster/better/less expensive with a factor instead of a percentage: shame on them!
No, this is the shame of AMD and its marketing department, whose managers simply do not know how to correctly, mathematically, convert multiples into percentages and back. The AMD team's shame is right there on the slide.

For the slide to be mathematically consistent: if the percentages on it are right, the average at the bottom should be about 4.6x. If the 2.6x average at the bottom is right, the percentages at the top of the columns should be around 140-160%, with the last column at 202% on the histogram.
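To make the conversion concrete, here is a small Python sketch with hypothetical per-benchmark figures (not the actual slide values): "X% faster" converts to a factor of 1 + X/100, and averaging those factors shows why ~140-160% faster lines up with a ~2.6x mean, while 2.6x as fast is only 160% faster.

```python
# Hypothetical "+X% faster" claims, NOT the real slide numbers.
pct_faster = [140, 150, 160, 202]
factors = [1 + p / 100 for p in pct_faster]   # "X% faster" == (1 + X/100) times as fast

geomean = 1.0
for f in factors:
    geomean *= f
geomean **= 1 / len(factors)

print(factors)              # [2.4, 2.5, 2.6, 3.02]
print(round(geomean, 2))    # ~2.62 -> a "2.6x" average implies ~160% faster, not 260%
```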

It is simply incredible that not a single journalist at the show called them out on it. We now live in a world of lies, absurdity and ignorance. This will definitely end badly for the entire civilization.
 
This is shared memory. Therefore, when the iGPU is not loaded, the entire bandwidth is available to the CPU cores.
That's not accurate. As thoroughly investigated & documented here, Apple didn't make the interface of the CPU cluster to the interconnect fabric wide enough that the CPU cores could saturate the memory bandwidth:

and that is why the M4 Max, and especially the M3, has a memory controller with very low efficiency in reality.
It's not the memory controller(s). Using OpenCL, it was possible for the GPU to achieve 390 GB/s (out of a theoretical 400 GB/s) on the M1 Max:

If everything were as you claim, why do the M3 Pro/M3 Max cores have almost half the real bandwidth available in real tests, if the buses are the same width?
I don't follow. The M3 Pro has a 192-bit memory data width, which is all I really know about that SoC, specifically.
 
How do you explain, then, the real 120-130 GB/s for the 16-core version of the M3 Max, with its 400 GB/s of bandwidth from LPDDR5-6400, and 220-230 GB/s for the M4 Max, with 546 GB/s from LPDDR5X-8500?

In essence, marketing is lying to buyers that the memory controller is fully accessible to the CPU cores, because I have not seen a footnote anywhere in advertising or review materials saying that part of the bandwidth is reserved and not accessible to the CPU cores.
 
How do you explain, then, the real 120-130 GB/s for the 16-core version of the M3 Max, with its 400 GB/s of bandwidth from LPDDR5-6400, and 220-230 GB/s for the M4 Max, with 546 GB/s from LPDDR5X-8500?
If I understand your question correctly, you're saying the 16-core M3 Max can only access ~130 GB/s out of 400 GB/s, whereas the M4 Max can access ~230 GB/s out of 546 GB/s? I have no specific knowledge of either SoC, but I'd speculate that it comes down to how many CPU core clusters they each have, how those are connected to the data fabric of the SoC, and what frequency that interconnect runs at.

I expect you can probably find more insights into the matter, if you do a bit of digging. One guy I'd follow is ex-Apple engineer Maynard Handley, who goes by the alias name99 and has done a lot of reverse-engineering of their M-series SoCs. Here's his github repo, but he also posts on some social media and sometimes over on the RealWorldTech forums.

In essence, marketing is lying to buyers that the memory controller is fully accessible to the CPU cores, because I have not seen a footnote anywhere in advertising or review materials saying that part of the bandwidth is reserved and not accessible to the CPU cores.
I'm not here to defend Apple, but if the SoC is capable of using that much memory bandwidth, then they didn't actually lie. You just assumed it was all made available to the CPU cores, but I'm sure they never said so. Also, pretty much every datasheet or specs summary I've seen published by a manufacturer has a "get out of jail free" clause, where they say something like: "all specifications subject to change".

After yesterday's exchange, I had wanted to add that this sort of thing isn't uncommon. In the PS4 and PS5, the CPU cores were also restricted from eating the whole pie. In the PS4's case, the CPU cores could only use a total of about 20 GB/s out of the 176 GB/s max [1]. In the PS5, the CPU cores are limited to about 97 GB/s out of the 440 GB/s total [2].

Sources:
  1. https://forum.beyond3d.com/threads/is-ps4-hampered-by-its-memory-system.54916/
  2. https://chipsandcheese.com/p/the-nerfed-fpu-in-ps5s-zen-2-cores
 
I'm not here to defend Apple, but if the SoC is capable of using that much memory bandwidth, then they didn't actually lie. You just assumed it was all made available to the CPU cores, but I'm sure they never said so.
Given the lack of source for the testing perhaps it's not even limited, but rather what the CPU designs are capable of saturating. As you've already said the bus width isn't to feed the CPU, but rather the GPU.
 
Given the lack of source for the testing perhaps it's not even limited, but rather what the CPU designs are capable of saturating.
In the first link I posted about this (the Anandtech article), Dr. Ian Cutress used multithreaded scaling analysis to show that the M1's bottleneck was somewhere between the CPU cluster and the memory controller. He showed you could reach the saturation point with only 4 active cores, all simultaneously hammering memory.
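Not the article's actual methodology, but a rough Python sketch of that kind of scaling test: each worker streams through its own large buffer, and if the aggregate GB/s stops growing well before the core count does, the bottleneck is something shared upstream of the cores (fabric port, memory controller), not the cores themselves.

```python
# Rough bandwidth-scaling sketch; buffer sizes and iteration counts are arbitrary.
import time
import numpy as np
from multiprocessing import Process, Queue

BUF_MB = 512   # per-worker buffer, large enough to blow past the caches
ITERS = 8      # passes over the buffer per worker

def worker(q):
    a = np.ones(BUF_MB * 1024 * 1024 // 8, dtype=np.float64)
    t0 = time.perf_counter()
    for _ in range(ITERS):
        a.sum()                          # read-streams the whole buffer
    dt = time.perf_counter() - t0
    q.put(BUF_MB * ITERS / dt / 1024)    # this worker's read bandwidth in GB/s

if __name__ == "__main__":
    for nworkers in (1, 2, 4, 8, 16):
        q = Queue()
        procs = [Process(target=worker, args=(q,)) for _ in range(nworkers)]
        for p in procs: p.start()
        for p in procs: p.join()
        total = sum(q.get() for _ in range(nworkers))
        print(f"{nworkers:2d} workers: {total:6.1f} GB/s aggregate")
```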
 
In the first link I posted about this (the Anandtech article), Dr. Ian Cutress used multithreaded scaling analysis to show that the M1's bottleneck was somewhere between the CPU cluster and the memory controller. He showed you could reach the saturation point with only 4 active cores, all simultaneously hammering memory.
Yeah, but those could easily be attributed to inherent design limitations given that it didn't scale much beyond the Pro. Now I'm bitter about nobody picking up the mantle on Apple SoC testing after Ian left 🤣
 
Yeah, but those could easily be attributed to inherent design limitations given that it didn't scale much beyond the Pro.
I thought you were saying the CPU cores just weren't fast enough to demand any more than that, which clearly isn't the case. Presumably, Apple figured that's all the CPU cluster would need, in most real-world scenarios.

Now I'm bitter about nobody picking up the mantle on Apple SoC testing after Ian left 🤣
I'm not aware of it, but that doesn't mean that nobody is doing it. Chips & Cheese has yet to touch Apple silicon, but their tools are open source and it's possible others might've already run some of their tests on newer M-series SoCs.
 
I thought you were saying the CPU cores just weren't fast enough to demand any more than that, which clearly isn't the case.
Ah, yeah, I didn't make that clear. Even x86 CPU cores have been fast enough to saturate dual-channel bandwidth; it just didn't matter much until 16-core CPUs arrived. I also have a hard time believing that the M3 Max wouldn't be able to at least match the bandwidth of the M1 Max.
I'm not aware of it, but that doesn't mean that nobody is doing it. Chips & Cheese has yet to touch Apple silicon, but their tools are open source and it's possible others might've already run some of their tests on newer M-series SoCs.
Any time I've gone looking for technical insights into the later M-series I've hit a brick wall. It doesn't mean that they're not out there (I don't check youtube for example), but if they are they aren't easy to find.
 
Any time I've gone looking for technical insights into the later M-series I've hit a brick wall. It doesn't mean that they're not out there (I don't check youtube for example), but if they are they aren't easy to find.
Beyond the github link I posted above, you can also find a smattering of interesting papers on Google Scholar. Because research takes a while and then there's the delays of the publication pipeline, you're not going to find much on the latest and greatest products, but you'll do better if you focus on a couple generations prior.


I know Handley (name99) gleaned many details by reading through quite a few of Apple's patents. He's also collected data others have gathered via microbenchmarking. Unlike research, patents have the advantage of being more forward-looking, sometimes getting filed years before a product implementing them reaches the market.

BTW, Github's online PDF viewer seems to choke on some of the larger PDFs, but they work just fine if I download and view the raw file locally.
 
I'm not here to defend Apple, but if the SoC is capable of using that much memory bandwidth, then they didn't actually lie. You just assumed it was all made available to the CPU cores, but I'm sure they never said so. Also, pretty much every datasheet or specs summary I've seen published by a manufacturer has a "get out of jail free" clause, where they say something like: "all specifications subject to change".
This is exactly what is meant: the memory is shared, and everyone expects, especially from experience with x86, that the entire bandwidth (at least 80% of it) will be available to the processor (even to a single core, as on x86) when the iGPU is idle or only outputting a 2D image. Instead, a number of tests show that Apple is being disingenuous.

After yesterday's exchange, I had wanted to add that this sort of thing isn't uncommon. In the PS4 and PS5, the CPU cores were also restricted from eating the whole pie. In the PS4's case, the CPU cores could only use a total of about 20 GB/s out of the 176 GB/s max [1]. In the PS5, the CPU cores are limited to about 97 GB/s out of the 440 GB/s total [2].
You see, here's the thing: Apple has no one to compete with except the x86 camp, and the expectations of most ordinary people looking at the marketing materials are shaped by x86, where almost the entire bandwidth is available to the CPU cores. Apple is careful not to advertise to the general public that it has a completely different memory subsystem architecture: a single core, judging by what has been written, cannot access the full bandwidth (or even, as I understand it, the full address bus), and even all cores together fall far short of 80%+ of the theoretical memory bandwidth. Anyone who switches from x86 is immediately and deeply disappointed by this.

After all, as the articles you linked correctly noted, people were stunned by that bandwidth precisely in comparison with current x86 chips, and thought they would get server-class x86 bandwidth, but got a "pumpkin" instead of a "melon".
 
Any time I've gone looking for technical insights into the later M-series I've hit a brick wall. It doesn't mean that they're not out there (I don't check youtube for example), but if they are they aren't easy to find.
That's right! There are plenty of professional RAM tests on x86 - take the well-known AIDA64 package alone, not to mention dozens of others - but with Apple, for all the hype around it, it's practically a blank reinforced-concrete wall. I relied on practical tests in a handful of reviews around the web, and it was genuinely very hard for me to find the Anandtech links above in Google (I didn't find them, so I never read them). And I searched more than a year ago, quite persistently. How did it happen that the key articles from that site didn't show up in Google's results, even in the first 10 pages? Has Google's search engine become completely useless, or is this some kind of conspiracy with Apple?

---
I hope that with Zen 5 Halo everything will be quick and open, as soon as real laptops reach testers and store shelves. And it will be another rout of Intel by AMD if its memory controller delivers at least 200 GB/s, like the M4 Pro, crushing the 128-bit Arrow Lake for laptops. This is a new breakthrough in x86, and all the old Intel/AMD chips will immediately fade in comparison...
---
So that everyone understands what a real 200 GB/s+ buys: the ability to genuinely drive 3-4 8k monitors, i.e. to finally add DP2.0+ (UHBR20) and TB5/USB4v2 support to the iGPU, which we have been waiting for for 5 years since DP2.0 was announced in 2019. Never before has the IT industry taken so long to introduce standards that were needed 10 years ago. Although even DP2.0+ (UHBR20) / HDMI 2.2 are not enough for 8k monitors with 120 Hz 36-bit panels. For that, optics are already needed; the age of copper has come to an end.

8k monitors will finally settle the question of rendering perfectly sharp fonts and an almost "analog" image, because on diagonals up to 32" they will reliably deliver a ppi above 230.

Everyone probably knows that Chrome under Windows applies a non-disableable, muddy (incorrect) grayscale anti-aliasing to fonts - clearly visible as vertical shadows around letters (zoom the text to 400%), which do not appear with correct grayscale anti-aliasing such as the Windows XP default without ClearType. The muddy-font problem in Chrome solves itself at a ppi above roughly 220-230, as it long has on smartphones.
You could enable correct anti-aliasing in Chrome only up to version 50, or in Firefox up to version 69. In Firefox, at least, this muddy (incorrect) anti-aliasing can be completely disabled under Windows from version 69 onwards, but in Chrome it cannot. And that is automatic damage to your eyesight in Chrome.
 
I hope that with Zen 5 Halo everything will be quick and open, as soon as real laptops reach testers and store shelves. And it will be another rout of Intel by AMD if its memory controller delivers at least 200 GB/s, like the M4 Pro, crushing the 128-bit Arrow Lake for laptops. This is a new breakthrough in x86, and all the old Intel/AMD chips will immediately fade in comparison...
It's already been shown that AMD gives more bandwidth to its iGPU than to its CCDs. Here, you can see that the Ryzen AI 9 HX gives its iGPU a 4x advantage over its CCDs. Somewhere, I'm sure I read that even the desktop's tiny iGPU has 2x the link bandwidth of a single CCD.

[Image: https://substack-post-media.s3.amazonaws.com/public/images/e245b90c-e9e3-4180-a4c2-61bd9ea1bee7_829x660.png]


Perhaps because of this, a single CCD can't max the memory bandwidth of the 9950X. You need to have both CCDs going, for that.

[Image: https://substack-post-media.s3.amazonaws.com/public/images/c332a1ed-0cdf-4cbb-bf4f-6bf58a3c6411_1409x756.png]


Source: https://chipsandcheese.com/p/amds-ryzen-9950x-zen-5-on-desktop

If Ryzen AI Max is using the same CCDs as the desktop Ryzen 9000's, then you're probably not going to saturate its memory bandwidth, even with both CCDs.

So that everyone understands what a real 200 GB/s+ buys: the ability to genuinely drive 3-4 8k monitors, i.e. to finally add DP2.0+ (UHBR20) and TB5/USB4v2 support to the iGPU, which we have been waiting for for 5 years since DP2.0 was announced in 2019. Never before has the IT industry taken so long to introduce standards that were needed 10 years ago. Although even DP2.0+ (UHBR20) / HDMI 2.2 are not enough for 8k monitors with 120 Hz 36-bit panels. For that, optics are already needed; the age of copper has come to an end.
In a home computing context, I cannot imagine why anyone needs 8k, much less at 120 Hz with 36-bit color. However, the industry is dutifully advancing on that front, whether we need it or not. HDMI's Ultra96 spec will get you there (and beyond).

8k monitors will finally settle the question of rendering perfectly sharp fonts and an almost "analog" image, because on diagonals up to 32" they will reliably deliver a ppi above 230.
You must have some ridiculous vision, because I can barely see a pixel on a 32" 4k monitor. I have one of those at work and it convinced me that 32" is too small for 4k to be worthwhile. I'd have to go up to at least 36" if not 40". However, such a big monitor would have me swiveling my head and I'm sure my neck would start to hurt before long.
 
Perhaps because of this, a single CCD can't max the memory bandwidth of the 9950X. You need to have both CCDs going, for that.
Each CCD has two IF ports. The throughput of each port is 64 GB/s: 32 B/clock × 2 GHz. In consumer processors, only one is enabled; both are enabled only in server processors.
If both IF ports were enabled on each CCD, the theoretical throughput of the CCD-to-SoC link would be exactly 128 GB/s (256 for 2x CCD) at an IF frequency of 2 GHz.
For the current single-port connections it is 64 GB/s (128 for 2x CCD), but this is further limited by the current memory controller.
We can expect changes to this link in the new SoC.
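A quick back-of-the-envelope check of those figures, taking the 32 B per fabric clock per port and 2 GHz FCLK quoted above at face value:

```python
# Per-port Infinity Fabric throughput from the numbers quoted above.
bytes_per_clk = 32
fclk_hz = 2.0e9
per_port = bytes_per_clk * fclk_hz / 1e9   # GB/s per IF port
print(per_port)        # 64.0 GB/s -> one port per CCD on consumer parts
print(2 * per_port)    # 128.0 GB/s -> both ports enabled (some server SKUs)
```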
 
Both are enabled only in some server processors.
Fixed that, for you. The EPYC IO Die doesn't have enough links for a fully-populated server CPU to have dual-connected CCDs, so it's something they reserve for their frequency-optimized models.

There's another detail you're glossing over, which is that the Infinity Fabric links are bidir (full-duplex), whereas memory is half-duplex. So, the real bandwidth limit of a Zen 5 CCD, with the AM5 IO Die, is actually 96 GB/s (assuming you're doing some kind of signal processing workload that's neatly balanced at a 2:1 read-to-write ratio). Because real world workloads do have a mix of reads and writes, I expect a single CCD can saturate most of a 9950X's memory bandwidth.
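For what it's worth, here is where that 96 GB/s figure lands if you assume the commonly cited desktop link widths of 32 B/clock read and 16 B/clock write at a 2 GHz FCLK (assumed round numbers, not AMD documentation):

```python
# Read and write legs of one CCD's IF link, under the assumptions stated above.
fclk_hz = 2.0e9
read_gbs  = 32 * fclk_hz / 1e9   # 64 GB/s into the CCD
write_gbs = 16 * fclk_hz / 1e9   # 32 GB/s out of the CCD
print(read_gbs + write_gbs)      # 96 GB/s combined, only reachable near a 2:1 read:write mix
```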

Now, the key question for Ryzen AI Max is whether the I/O Die has double the IF connectivity for the CCDs. I wouldn't bet on it, being a mostly laptop-oriented product, but maybe.
 
If Ryzen AI Max is using the same CCDs as the desktop Ryzen 9000's, then you're probably not going to saturate its memory bandwidth, even with both CCDs.
AIDA64 tests of real laptops with the AI 370 (almost all of them, regardless of the FPPT value, which is roughly equivalent to Intel's PL2) show an average bandwidth (read+write+copy divided by 3) of about 95-96 GB/s, which is about 80% efficiency with LPDDR5X-7500. That is, the iGPU takes practically no bandwidth away from the cores when it is not seriously loaded - which, in practice, is much better than in the M4 architecture and especially the old M3 from Apple, since the cores there cannot get 80% of the memory controller bandwidth (especially in the Max versions) - which sharply reduces the performance of Apple's processors on memory-intensive computations, relative to what they could do with 80%+, like x86 cores.
The Strix Point architecture uses its 128-bit bus quite effectively. The bandwidth is slightly worse than Lunar Lake's, where the memory is on the package and slightly faster - I was surprised that Intel did not use a 512-bit HBM3 controller to immediately bypass AMD at this crucial turn for it (and also a shameful one, since it is the TSMC factory).
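A quick check of that ~80% figure, assuming a 128-bit LPDDR5X interface at 7500 MT/s against the ~95-96 GB/s AIDA64 average reported above:

```python
# Theoretical peak for 128-bit LPDDR5X-7500 vs. the reported AIDA64 average.
bus_bits = 128
mt_per_s = 7500e6
theoretical_gbs = bus_bits / 8 * mt_per_s / 1e9
print(theoretical_gbs)          # 120.0 GB/s
print(96 / theoretical_gbs)     # ~0.80 -> the ~80% efficiency quoted above
```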

Zen5 Halo is a completely different matter - it has a 256-bit controller, which sharply elevates it above the 128-bit Arrow Lake HX.

I haven't seen any Aida64 benchmarks of caches and RAM for the 256-bit Zen5 Halo memory controller yet, but I'd like to...

I cannot imagine why anyone needs 8k
I have clearly described the reasons above, and they are completely objective even for 24". The minimum ppi should be above 220-230 to eliminate the erroneous anti-aliasing in Chromium-based browsers.
You can simply take a screenshot of this forum thread in Chrome under Windows 7/8/10/11 and post it in this thread - the screenshot will immediately prove my statement.

much less at 120 Hz with 36-bit color.
Are you currently happy with the mess of scrolling text on a 60Hz screen? And how many Hz do your smartphone and your current screen run at?
HDMI's Ultra96 spec will get you there (and beyond).
No, it will not solve this problem in lossless transfer mode (I do not want lossy DSC compression). To drive 8k at 120 Hz while scrolling text, a lossless 2 m link of about 160 Gbps (including overhead) is needed. Copper is already a problem for that kind of bandwidth in a mass-market cable. And 36 bits (really a minimum of 39, i.e. 2^13 x 3) are needed to eliminate banding on 8k gradient fills in motion - the eye sees it perfectly well.
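For reference, the raw pixel payload alone at those settings, before blanking, FEC and encoding overhead, works out to:

```python
# Uncompressed 8k @ 120 Hz @ 36 bpp pixel payload, no link overhead included.
w, h, hz, bpp = 7680, 4320, 120, 36
payload_gbps = w * h * hz * bpp / 1e9
print(payload_gbps)   # ~143 Gbps of pixel data; overhead pushes the link requirement higher
```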

I'm waiting for Zen5 Halo HX tests with 256 bit controller in AIDA64...
 
Given the lack of source for the testing perhaps it's not even limited, but rather what the CPU designs are capable of saturating. As you've already said the bus width isn't to feed the CPU, but rather the GPU.
I can easily see why that would be so.

For the longest time CPU design concentrated on raising scalar performance, making sure that the main thread of logic, the parts that Amdahl's law excluded from parallelisation--the really critical path, would be shortened as much as possible.

That mostly meant caches and lots of speculation eating the vast majority of all silicon real estate. Going to RAM for every variable not held in a register, the way the original microprocessors did (some, like the TI TMS9900, even kept their registers in RAM), used to be nearly invisible in terms of clock cycles, because the processors themselves often needed multiple cycles per operation; by now it had become an overhead of hundreds of cycles.

Once caches had exploited all the extractable locality in traditional CPU logic code, what was left for external memory access was typically scattered chaos; you had to wait (or try scheduling another thread), and typically only for small bits (locality was still strong, just elsewhere at that moment), so latency was more critical than bandwidth in this case.

The "GPUization" of CPUs, the introduction of instructions that operate on huge vectors in wavefronts have changed significantly what their memory controllers are tasked to handle. Bandwidth and perhaps even the ability to control the "cachyness" of data you're operating on became much more important for performance. Note the use of HMC on Knights Landing as one solution to that issue (Intel Xeon Phi).

But on something like x86 or even ARM that transition has architectural limits which in this context mostly means that there aren't significant returns for giving CPU cores extra bandwidth: they quite simply can't turn that into enough value to pay for the effort. E.g. no game would run faster unless the underlying ISA and technical architecture was redone from scratch and the game rewritten.

Interestingly enough AMD has tried something like that in the past, with the first wave of APUs which were designed to have GPU and CPU code coexist and capable of switching ISAs at call/jump granularity via their Heterogeneous Systems Architecture. But somehow nobody jumped at rewriting Office or SAP or even games to take advantage of that.

I have no idea if that facility still survives in current designs; chances are that the synchronization effort between the GPU and CPU blocks would eat all the benefits today. RISC-V is the only ISA I can think of that aims at making such call/jump granularity between CPU and GPU code possible via extensions, but you'd still have to design such a CGPU from the ground up, with far too many assumptions and compromises baked in to make it general enough for scale and economy.

But the general preference for GPUs taking precedence over CPUs on any memory bus also extends into shared power policies. Every APU I've seen prefers the iGPU's power budget over the CPU's, even though its maximum was generally also smaller.

It's somewhat easy to see why you'd do that on a desktop, where a CPU entering commands into a drawing pipeline overtaking the GPU busy rendering it would be even less pleasant to watch than dropping frame rates.

Now imagine having to write a software scheduler trying to make smart decision about that instead...

Bandwidth starvation in caches and RAM is already a huge issue on virtualized server workloads; with heterogeneous consumers on a single bus, this doesn't get easier.
 
Ah, yeah, I didn't make that clear. Even x86 CPU cores have been fast enough to saturate dual-channel bandwidth; it just didn't matter much until 16-core CPUs arrived. I also have a hard time believing that the M3 Max wouldn't be able to at least match the bandwidth of the M1 Max.

Any time I've gone looking for technical insights into the later M-series I've hit a brick wall. It doesn't mean that they're not out there (I don't check youtube for example), but if they are they aren't easy to find.
I am afraid you're looking for capabilities nobody wants to spend money on.

Every metal wire inside a chip, every transistor put in and connected, costs money. Tiny amounts for sure, but they are only put there if you can prove their value.

Chip designers use extracts from commercial workloads to test and hone their designs, and synthetic tests of the memory bandwidth accessible to the CPU aren't very high on that list, while some bandwidth-intensive genomic sequencing code or LLMs are now moving up the priority scale, as are the bits of gaming code that can still be managed with larger caches (X3D).

So if the majority of typical commercial workload extracts on the CPU side of the SoC are happy receiving only 25% of the available bandwidth, those extra wires and transistors for the remaining 75% won't get put in for lack of a sponsor, while the GPU portion, which pushed for the extra width, gets it all.

I understand your frustration rather well. I've had plenty of similar ones, where I felt like I met a wall of silence for questions I didn't think totally stupid. And I was puzzled that nobody else seemed to be asking them.

But that also reminds me of questions my kids asked growing up, which were totally off the beaten track, somehow obviously wrong, but not that easy to explain without going beyond things they could know or without having to reflect on things I hadn't really thought about, either.

Which reminded me of myself being in the same place with my parents or other adults, which had me question if they were as all-knowing as I had assumed so far...

For the last couple of years I've had the luxury of being able to think through a lot of those questions and in most cases the answer was eventually quite simple: economy driving evolution.

Finding that economic angle isn't always as straightforward as you'd think, but so far it's given the best returns.
 
So if the majority of typical commercial workload extracts on the CPU side of the SoC are happy receiving only 25% of the available bandwidth, those extra wires and transistors for the remaining 75% won't get put in for lack of a sponsor, while the GPU portion, which pushed for the extra width, gets it all.
In fact, the pursuit of profit kills versatility. Versatility is exactly what x86 was famous for, and exactly what Apple ultimately sacrificed.

Processor cores have long been suffocating on very slow memory (look at the L1 cache: even a 17-year-old Core 2 Duo easily manages a throughput of 100 GB/s and higher). I understand that everything has an economic rationale, but it is very sad, especially when the RAM can no longer be replaced or upgraded, to see wretched soldered LPDDR5(X) instead of 1024-bit HBM3, which for the same reasons of "economic feasibility" is now soldered only onto server chips. And I am absolutely sure this is not a question of excessive power consumption, even for a notebook x86 SoC: integrate an HBM3 controller into the chiplet with the processor, with at least 32 GB of memory and 500 GB/s+ of throughput, as a mass-market solution. It just requires the courage to set up mass production, and then the price will fall rapidly, as with all technologies of the past...
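For scale, the standard HBM3 stack figures (1024-bit interface, 6.4 Gb/s per pin at the baseline speed grade) already land well past that 500 GB/s ask:

```python
# Per-stack HBM3 bandwidth at the baseline 6.4 Gb/s pin speed.
pins = 1024
gbps_per_pin = 6.4
print(pins * gbps_per_pin / 8)   # 819.2 GB/s per stack
```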
 
AIDA64 tests of real laptops with the AI 370 (almost all of them, regardless of the FPPT value, which is roughly equivalent to Intel's PL2) show an average bandwidth (read+write+copy divided by 3) of about 95-96 GB/s, which is about 80% efficiency with LPDDR5X-7500. That is, the iGPU takes practically no bandwidth away from the cores when it is not seriously loaded
That's a monolithic SoC, and therefore not directly relevant to my point about Ryzen AI Max, which you quoted.

which, in practice, is much better than in the M4 architecture and especially the old M3 from Apple, since the cores there cannot get 80% of the memory controller bandwidth (especially in the Max versions) - which sharply reduces the performance of Apple's processors on memory-intensive computations, relative to what they could do with 80%+, like x86 cores.
What you should pay attention to is the amount of GB/s per core, not the % of total bandwidth.

What matters even more than that is real-world performance, because there's a lot that sits between the cores' pipelines and DRAM. Not only are there multiple levels of caches, but also memory prefetchers. Plus, as core count increases, the all-core clockspeed drops, which lessens the per-core demand on memory bandwidth.

The memory model of ARM also differs from x86 in a way that can work to reduce write-miss penalties on ARM CPUs. I've seen this characterized as having a net 8.9% impact on real world performance. It also means that an ARM CPU needs less DRAM bandwidth than an otherwise equivalent x86 CPU, in order to deliver equal performance. Up to 25% less.

I was surprised that Intel did not use a 512-bit HBM3 controller to immediately bypass AMD at this crucial turn
To me, the choice seemed obvious.
  1. HBM is significantly more expensive than LPDDR5X.
  2. HBM uses more power than LPDDR5X, which is a big deal since Lunar Lake is a laptop chip.
  3. HBM is in critically short supply, currently being the limiting factor in production of AI accelerators. It's backordered by like a year or so.
  4. Lunar Lake is an 8-core, 8-thread CPU made for thin & light. It doesn't need a ton of bandwidth.
  5. Intel incorporated an 8 MB "side cache" (system-level cache) to improve efficiency and lessen bandwidth demands further.

So, it's not necessary and has lots of downsides.

I think it's telling that gaming GPUs stopped using it, since AMD discontinued the Radeon VII and Nvidia's Titan V. Arguably, those weren't even very mainstream and were both derivatives of datacenter/HPC-focused products.

for it (and also a shameful one, since it is the TSMC factory).
Not sure what you mean by this, since TSMC doesn't even make HBM.

Intel incorporated HBM into their Xeon Max, which they make and package themselves. So, the use of HBM (or not) has nothing to do with where the chip is fabbed.

I haven't seen any Aida64 benchmarks of caches and RAM for the 256-bit Zen5 Halo memory controller yet, but I'd like to...
They're not benchmarks of the memory controller. Just the cores' access to memory. It's looking at an end-to-end view of the system, rather than unit-level testing.

I have clearly described the reasons above, and they are completely objective even for 24". The minimum ppi should be above 220-230 to eliminate the erroneous anti-aliasing in Chromium-based browsers.
Okay, you do you. For me, I'm good with 1440p at 27". That's enough DPI for my eyes.

Are you currently happy with the mess of scrolling text on a 60Hz screen?
Not really, but then the monitor I bought last year will go up to 280 Hz, if I wanted to. IMO, 144 Hz would've been enough. My decision was influenced by other factors, not the 280 Hz thing.

And how many Hz do your smartphone and your current screen run at?
I hate reading on my phone. I don't do it, unless I have to. It's not small, either. I'd just much rather be seated at a big monitor.
 
Processor cores have long been suffocating on very slow memory (look at the L1 cache: even a 17-year-old Core 2 Duo easily manages a throughput of 100 GB/s and higher).
Assuming that's true, modern cores can do >= 6 times that much. I doubt a modern core is more than 6 times as fast, on most tasks.

[Image: https://substack-post-media.s3.amazonaws.com/public/images/3334bd57-4777-48fb-b997-5fb0f3f64b9f_2386x991.png]


L1 cache is that fast because it handles most memory accesses, not because that's how fast DRAM should be. The cache hierarchy is very effective at reducing the amount of DRAM bandwidth required.

If you think memory bandwidth is a bottleneck, then a better place to look is memory scaling data, where people have actually benchmarked apps across different memory speeds to see how much impact it has. The answer is that you hit diminishing returns very quickly, even on heavily-threaded tasks. Also, it turns out, latency is a bigger issue than bandwidth. And latency on HBM is worse than DDR5, BTW.

it is very sad, especially when the RAM can no longer be replaced or upgraded,
That's a different matter. Some memory can't be on modules. Putting LPDDR5 on SO-DIMMs hit a pretty big brick wall at 6400 (I think), which meant anything faster had to be soldered until LPCAMMs came along. LPCAMM will have its own limits, and then we'll be back to soldering for top performance. Or, maybe you want a laptop with 512-bit memory interface but not one that's big enough to accommodate four LPCAMMs.

integrate an HBM3 controller into the chiplet with the processor, with at least 32 GB of memory and 500 GB/s+ of throughput, as a mass-market solution. It just requires the courage to set up mass production, and then the price will fall rapidly, as with all technologies of the past...
The price difference between HBM and LPDDR is fundamental. It will always be at a cost and power disadvantage, due to the fundamental differences between the technologies. Plus, the AI bros have already created more demand than the industry can cope with, for HBM. Whatever economies of scale there are to be had, well... we have enough scale to reach them.

Let's say someone designs a CPU with HBM. They won't get supply for so long that it'll be obsolete, by the time they finally do. That's how far backordered it is. And we have no evidence that it'd be solving a real problem, either.
 
That's a different matter. Some memory can't be on modules. Putting LPDDR5 on SO-DIMMs hit a pretty big brick wall at 6400 (I think), which meant anything faster had to be soldered until LPCAMMs came along.
I don't think LPDDR5 was ever put on SODIMMs. JEDEC spec DDR5 SODIMMs capped at 5600 (XMP 6400 is highest I'm aware of), but CSODIMMs start at 6400 and should scale similarly to CUDIMMs (G.Skill has already shown XMP 8133).
 
L1 cache is that fast because it handles most memory accesses, not because that's how fast DRAM should be. The cache hierarchy is very effective at reducing the amount of DRAM bandwidth required.
Not at all - caches at every level are the same kind of crutch as DLSS on GPUs, designed to hide a weakness of the architecture: on GPUs, too little compute power for a given resolution and minimum frame rate; here, a memory bus that is too narrow for all the cores.

With a bus whose bandwidth is at the level of the L1 cache, caches become unnecessary. And when that memory is also shared by a bunch of devices on the bus, it obviously needs more bandwidth than the total simultaneous requirements of all the possible devices on the bus.

Take even the processor's connection to the southbridge via the DMI bus: the speed is simply shameful, and so even M.2 drives connected through that bridge, instead of to CPU lanes, perform much worse. And there are still plenty of other ports behind it.

The modern x86 architecture is at least 10x behind in memory bus development, relative to the real requirements of all the consumers on the shared data bus...

But I also don't like the way Apple solves the problem. The architecture should allow giving the entire bus to the device if this device is capable of processing such a data flow. It is obvious that processor cores have long been capable of such speeds.

That is why I am sure that 8k is being held back for these very reasons - the load on RAM is too great for such devices, especially if there are several of them. And this is a shame, because 8k screens were ready for mass production 10 years ago, but we still have not received a beautiful super-sharp (almost analog) picture even in 2025 on large diagonals. So far, such a picture is available only on rare 6k workstation monitors and on a number of laptops, where an ordinary 4k panel immediately yields 220+ ppi. And here's what's strange: even in gaming machines, where battery life is optional, mass-market 4k screens are not fitted, although next to GPU prices they cost several times less ($150-180 retail for 4k@120Hz), and 4k is the only panel resolution that serves both 4k and fhd cleanly. 2.5k screens are garbage: they are not compatible with either 4k or fhd content and do not provide crystal-clear fonts in modern browsers, for the reasons I mentioned above.

We only need 4k@120Hz screens in laptops and only 8k@120Hz screens in monitors. We have been waiting for them for many years...
Here's another opinion on why 220+ ppi is so important on monitors.
 
With a bus whose bandwidth is at the level of the L1 cache, caches become unnecessary.
This is not true. In CPUs, latency is a performance killer. If you built a cacheless CPU with a 4096-bit HBM3 memory subsystem, it would perform like absolute garbage!

When CPU designers choose the parameters of a cache hierarchy, they carefully balance latency against many other factors. A larger cache means higher latency, because it takes longer to do lookups and fetches. This is probably the main reason why Lion Cove introduced a new tier in the cache hierarchy.
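Just to put rough numbers on that latency point (100 ns DRAM latency and a 5 GHz clock are assumed round figures, not measurements of any particular CPU):

```python
# Illustrative stall-cost arithmetic under the assumptions stated above.
dram_latency_ns = 100
clock_ghz = 5.0
l1_latency_cycles = 4
print(dram_latency_ns * clock_ghz)   # ~500 cycles per dependent load with no caches
print(l1_latency_cycles)             # vs ~4 cycles when the load hits in L1
```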

Here's how it performs, in Arrow Lake:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/510a752a-3417-4335-a9a6-c2b52929bf5b_1165x566.png]


[Image: https://substack-post-media.s3.amazonaws.com/public/images/acb029e9-e059-4cda-b857-0fe544a5b24c_1085x500.png]


Sources:

And when that memory is also shared by a bunch of devices on the bus, it obviously needs more bandwidth than the total simultaneous requirements of all the possible devices on the bus.
When multiple clients are sharing the same memory, there's usually some statistical multiplexing, which means that the total required bandwidth is much less than the sum of peak bandwidth needed by each client.

Take even the processor's connection to the southbridge via the DMI bus: the speed is simply shameful, and so even M.2 drives connected through that bridge, instead of to CPU lanes, perform much worse. And there are still plenty of other ports behind it.
Ever since Rocket Lake, it's been PCIe 4.0 x8. In LGA 1700, the CPU-direct x4 port was only PCIe 4.0. So, the DMI connection had 2x the bandwidth of the CPU-direct link and more than enough to support a PCIe 4.0 NVMe drive at peak transfer rate.
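For reference, the approximate numbers behind that comparison (PCIe 4.0 moves about 1.97 GB/s per lane after 128b/130b encoding):

```python
# DMI 4.0 x8 chipset link vs. a CPU-direct PCIe 4.0 x4 NVMe slot.
gbs_per_lane = 1.97
print(8 * gbs_per_lane)   # ~15.8 GB/s for the x8 DMI link
print(4 * gbs_per_lane)   # ~7.9 GB/s for an x4 NVMe drive
```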

Can you cite any data showing that a PCIe 4.0 NVMe SSD performs worse, when connected to LGA 1700 chipset than CPU-direct?

The modern x86 architecture is at least 10x behind in memory bus development, relative to the real requirements of all the consumers on the shared data bus...
Can you cite any references supporting this claim?

But I also don't like the way Apple solves the problem. The architecture should allow giving the entire bus to the device if this device is capable of processing such a data flow.
Engineering products for the real world involves lots of tradeoffs. Unless we know why they did it that way, we can't know what tradeoffs they were making. It could be related to the cache coherency model, where they wanted to route all CPU cores through a single port on the SoC's fabric, because splitting them across multiple ports would mean longer-latency and lower bandwidth between the cores.

GPUs are more resilient to such problems, because they tend to do most of their core-to-core communication via memory, not caches. Furthermore, they have such a weak memory consistency model that it's no extra effort to satisfy their memory ordering constraints in a distributed architecture. So, it'd be quite natural to distribute the GPU cores across multiple fabric ports, and with little downside.

It is obvious that processor cores have long been capable of such speeds.
They don't all sit around doing memcpy(), all day! The amount of bandwidth needed by a core tends to be quite a lot less than its peak.

That is why I am sure that 8k is being held back for these very reasons - the load on RAM is too great for such devices, especially if there are several of them.
At 32 bits per pixel, an 8k frame is 132.7 MB. At 60 Hz, that's 7.96 GB/s. That's something a modern APU can certainly manage and not even 1% of the memory bandwidth in dGPUs like a RTX 4090 or RX 7900 XTX.
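The arithmetic behind those figures:

```python
# One 8k frame at 32 bits per pixel, scanned out 60 times per second.
w, h, bytes_per_px, hz = 7680, 4320, 4, 60
frame_mb = w * h * bytes_per_px / 1e6
print(frame_mb)             # ~132.7 MB per frame
print(frame_mb * hz / 1e3)  # ~7.96 GB/s at 60 Hz
```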

And this is a shame, because 8k screens were ready for mass production 10 years ago, but we still have not received a beautiful super-sharp (almost analog) picture even in 2025 on large diagonals.
High cost + low demand would be my guess as to why 8k hasn't gone mainstream. For most people, 4k is plenty. For gamers, their GPUs long struggled to reach decent framerates even at 4k, so trying to do 8k would sound insane to them.

2.5k screens are garbage: they are not compatible with either 4k or fhd content and do not provide crystal-clear fonts in modern browsers, for the reasons I mentioned above.
I've used 2560x1440 since about 2012, when I first got one of these monitors at work. At work, I now use a 4k screen (too high DPI for the fonts I like to use), but I have bought a 1440p for use at home, as recently as a year ago. So, I clearly don't agree that they're garbage. I've even watched some streaming content on them, and find that 4k content downsampled to 1440p looks great!
 
When multiple clients are sharing the same memory, there's usually some statistical multiplexing, which means that the total required bandwidth is much less than the sum of peak bandwidth needed by each client.
I have personally seen, many times, how SATA ports are throttled by a slow DMI bus. It is quite obvious that even the latest version of DMI is several times slower than the combined bandwidth of the lanes available directly from the processor.

The Zen 4 HX series has 28 PCIe 5.0 lanes, 24 of them usable. Compare that with the width of the southbridge link...

The funniest thing is that none of the laptop manufacturers have used all these Zen 4 HX processor lanes; they are literally left hanging in the air, doing nothing. For a simple reason: the Zen 4 HX memory bus is extremely weak, only 60-65 GB/s, while all those devices together would need at least 2x more, and with headroom for the system and software 3x, if not 4x - and here we arrive smoothly at the 256-bit Zen 5 Halo controller with its probable 200 GB/s. Bingo! Eureka!
The question is: why did AMD put those 28 lanes into Zen 4 HX at all? Obviously just to show what it can do, since in real implementations nobody needed them, given the extremely slow memory bus and the lack of PCIe 5.0 devices in laptops. They simply created an artificial aura of "coolness" around this series that nobody can use in practice - and, apparently, to one-up Intel's Raptor Lake HX...

Here's how it performs, in Arrow Lake:
There is no point in me citing those graphs - they are obvious and trivial. My point was that the memory bus should be as fast as the L1 cache, since it sits right next to the processor, like soldered memory (remember the context of our conversation). Therefore any cache is a crutch, and today's hierarchy of crutches only proves the problems of the x86 architecture, which has become wildly unbalanced relative to the performance of the cores and peripherals. It is obvious that Intel will also be forced to switch to a 256-bit (or 512-bit) controller in its HX series for the HEDT market, a year later than AMD, and then it will spread to the regular series. For now, if Halo really delivers 200 GB/s+, it will take an absolute lead over the HX series in intensive processing of large data arrays in memory. And naturally that will show up in the performance of games, which such series usually target.

Can you cite any references supporting this claim?
A purely empirical assessment, based on my understanding of the problems of the x86 architecture - especially around the iGPU blocks and output to high-resolution, high-refresh screens.

. So, it'd be quite natural to distribute the GPU cores across multiple fabric ports, and with little downside.
I can't add anything except to repeat: such a scheme deprives the architecture of a universal memory bus equally available to all devices, and creates bottlenecks for certain classes of computation - in this case heavy computation on general-purpose CPU cores, since GPU cores cannot execute arbitrary code efficiently; they are tuned first and foremost for vector operations and pay large penalties for complex instructions and branchy code.
I hope the 256-bit Zen 5 Halo controller will, as before, give the CPU cores at least 80%+ efficiency, while dynamically and efficiently distributing the bandwidth among all devices according to their needs, unlike the limitations of Apple's architecture.
At 32 bits per pixel, an 8k frame is 132.7 MB. At 60 Hz, that's 7.96 GB/s. That's something a modern APU can certainly manage and not even 1% of the memory bandwidth in dGPUs like a RTX 4090 or RX 7900 XTX.
It is empirically clear that when the devices playing back a video signal occupy more than about 20% of the bandwidth, freezes begin. Intel directly recommends in its datasheets to use only dual-channel memory for the video decoders and for 4k output. Why, if 22 GB/s+ (DDR4-3200+) is supposedly more than enough even for a pair of 8k@60 monitors? In reality, their iGPUs already start freezing the screen with 4k monitors on single-channel memory; these are proven facts, especially with old DDR4-3200.

It is extremely undesirable for image output to take up more than 15% of the bus, for a number of reasons. And you forget that fast VRAM is useless when the data actually has to come from the CPU - and it always does. Only in games is this less significant; in other scenarios the chain of processor - memory bus - PCIe bus with its brakes - VRAM is a serious limitation.

It is much better if system memory = VRAM and the processor cores access it directly, without the restrictions of the PCIe bus.

And with soldered memory, nothing prevents such multiplexing: integrate a 1024-bit HBM3+ controller and 32-64 GB of RAM into the processor. A super-chiplet. We get very fast cores with fast system memory and direct hand-off to the iGPU cores when necessary. The frame buffer (which is small even with triple buffering) can live separately on the iGPU, so as not to interfere with the shared memory bus and the shared data processing by CPU and GPU cores.

High cost + low demand would be my guess as to why 8k hasn't gone mainstream. For most people, 4k is plenty. For gamers, their GPUs long struggled to reach decent framerates even at 4k, so trying to do 8k would sound insane to them.
Most buyers on the planet are ignorant and do not understand what 8k (more precisely, 280+ ppi) on a screen of up to 32" would mean for them - and yet they can easily compare their smartphone's screen with their monitor's. The most amazing thing is that even you (an extremely experienced member of this forum, with many years of experience) do not see this, judging by your statements. To be convinced that I'm right, you only need to compare the text on your smartphone's screen with the text on a monitor below 150 ppi. I previously suggested that you take a screenshot of this forum thread in Chrome and post it here; I will then demonstrate clearly what the problem is, if you can't see it yourself. I am waiting for your screenshot from Chrome.

I remember how the same people on various forums literally laughed at me, claiming that fhd on a 24" screen was enough for them - but imagine their shock when they got the chance to work on a 24" 4k screen. Even that, though, does not completely solve the low-ppi problem.

The main problem for the eye is that, looking at a low-ppi screen, it constantly refocuses between the pixels and the objects themselves. This leads to problems with lens accommodation and increased fatigue.

Starting at about 400 ppi this effect disappears - the human eye no longer distinguishes individual pixels. The picture becomes like the real world - almost analog.
I have a smartphone with 400+ ppi, and from 25 cm I can see the difference versus a 300 ppi smartphone screen, but beyond that there is none. And a person can easily end up 35-40 cm from a screen, especially a laptop's. Therefore the minimum ppi should be 300, though 400+ is better - in fact that is the final ppi for the eye, beyond which the ppi race becomes meaningless. Only in VR, where micropanels are magnified through lenses, do you need ppi in the thousands.
So, I clearly don't agree that they're garbage. I've even watched some streaming content on them, and find that 4k content downsampled to 1440p looks great!
You confirm what I said. Yes, 4k downscaled to 2.5k with a bicubic algorithm will naturally look more or less fine, because 4k oversamples 2.5k just as it does fhd. But not perfectly, because it is not an integer division.

But you cannot watch either 4k or fhd content at ideal quality on your monitor - only 2.5k content, which practically does not exist in nature.

Only 8k, 4k and fhd are universal - each is obtained from the others either by multiplying a lower resolution by an integer or dividing a higher one by an integer. By the way, I was always surprised that manufacturers of 4k monitors use bilinear interpolation, deliberately spoiling the picture, even though the conversion from 4k to fhd is primitive - literally a couple of lines of code. Apparently it was some kind of market collusion I don't understand. Only in 2019 did first Intel, and then Nvidia, with great fanfare, roll out integer scaling of fhd on 4k panels as some kind of "know-how". That was very funny, because this isn't the GPU's job but the monitor controller's: what difference does it make to it whether it drives a 2x2 block as four 4k pixels or as one fhd pixel - while also quadrupling the frame rate, if the LCD crystals allow it...

In addition, with the 8k-4k-fhd scheme we get a clear advantage on 4k and fhd screens compared to the original (especially when making rips from a master): commercial video uses only 4:2:0 chroma subsampling (i.e. the color resolution is actually halved), but when resizing 4k down to fhd, or 8k down to 4k, we get full color resolution for every pixel. That is why it is extremely advantageous to shoot a film (or home footage) at twice the resolution, if the optics and sensor allow it, and then convert it to 4:4:4 by halving the resolution.
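A tiny illustration of that 4:2:0-to-effective-4:4:4 point, using bare numpy planes rather than a real decoder (plane shapes only, no actual video data):

```python
import numpy as np

w, h = 3840, 2160                                    # a 4k frame
luma = np.zeros((h, w), dtype=np.uint8)              # Y plane, full resolution
chroma = np.zeros((h // 2, w // 2), dtype=np.uint8)  # Cb or Cr plane in 4:2:0: half res each axis

# 2x2 box-average the luma down to fhd; the chroma plane is already fhd-sized.
luma_fhd = luma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)).astype(np.uint8)

print(luma_fhd.shape, chroma.shape)   # (1080, 1920) (1080, 1920): one chroma sample per output pixel
```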

That's why I don't understand why you keep a thoroughly inferior 2.5k monitor at home. Only 8k, 4k or fhd make sense. And high ppi is always a pure benefit for everyone, without exception, if you set aside the problems of broken scaling code in some applications and OSes, which are the developers' fault.

Nobody complains about 400+ ppi on smartphones, right? Everyone is delighted as soon as they see the picture quality. But for some reason they don't want the same analog-like picture quality on their desk - apparently out of stupidity or ignorance, not understanding that low ppi worsens the refocusing problem between pixels and objects (which are also less sharp along their contours) and increases eye fatigue through the lens constantly re-accommodating back and forth. This is especially terrible on old fhd monitors and on 17.3" laptop screens, because people sit closer to them than to monitors. Yes, people use them - but why don't they want something better, when everyone has a direct comparison in their pocket? I don't understand this social (or mental) problem with the population. Or with the majority of it...
 