I would not consider faulty RAM that's catching errors with ECC as properly functional in any way.
How else, then? In modern computing hardware, I've seen only two methods: ECC and full mirroring. Mirroring is obviously terrible for both performance and capacity, whereas OOB ECC has virtually no impact on either and just requires that you buy more expensive DIMMs. As I mentioned, in-band ECC has only a modest impact on both, but (at least as Intel has implemented it) offers less protection than regular ECC DIMMs.
Here's an old study of DRAM reliability from Google. If anyone knows of anything more recent, please share.
"In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.
The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf
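To put those numbers in perspective, here's a rough back-of-the-envelope calculation. The DIMM size and uptime are my assumptions, and keep in mind the paper's rate is a fleet-wide average that (if memory serves) is dominated by a minority of DIMMs with hard faults, not a uniform background rate:

```python
# Back-of-the-envelope: expected correctable errors per year for one DIMM,
# using the Google paper's fleet-average rate of 25,000-70,000 errors
# per billion device-hours per Mbit. The 16 GiB DIMM size is my assumption.
GIB_TO_MBIT = 8 * 1024           # 1 GiB = 8192 Mbit
dimm_mbit = 16 * GIB_TO_MBIT     # hypothetical 16 GiB DIMM
hours_per_year = 24 * 365        # assume it runs 24/7

for rate in (25_000, 70_000):    # errors per 1e9 device-hours per Mbit
    errors_per_year = rate * dimm_mbit * hours_per_year / 1e9
    print(f"{rate:>6}/1e9 dev-hrs/Mbit -> ~{errors_per_year:,.0f} errors/year")
```

Even the low end works out to tens of thousands of correctable errors per year for a single large DIMM, averaged across the fleet - which is exactly why the paper's finding that hard errors dominate matters so much.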
ECC can also be faulty. Is it less probable? Of course, but if we go down this route, why ignore that?
ECC circuitry is incredibly simple. If the memory controller's ECC logic failed, you'd probably know it from a raft of false ECC errors being reported. But it's something like 0.001% of the logic in a CPU, and it's pretty much impossible that it alone fails while the rest of the CPU continues to work flawlessly.
The most probable failure of ECC - and the only one seriously worth considering - is that there's a software or configuration issue in ECC error reporting. That's why it's important to use validated solutions. When I bought an AM4 board for my home fileserver, I bought a server-grade board from ASRock Rack - not a more mainstream-style board where ECC support is a minor bullet point not important for most of its users.
And yes, ECC has limitations: standard SECDED can correct single-bit errors and reliably detect only up to 2-bit errors. However, I would again say that it's not worth seriously considering a scenario where your RAM is working flawlessly and the first/only errors you ever see are 3+ simultaneous bit flips in a single word, which ECC can fail to detect.
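To make that 2-bit limitation concrete, here's a toy SECDED (single-error-correct, double-error-detect) code in Python - a Hamming(7,4) code plus an overall parity bit. Real memory ECC uses the same construction at 64/72-bit scale; this is an illustrative sketch, not how any particular memory controller implements it:

```python
from functools import reduce

def encode(d1, d2, d3, d4):
    """Hamming(7,4) plus overall parity = an (8,4) SECDED codeword."""
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    c = [p1, p2, d1, p4, d2, d3, d4]
    return c + [reduce(lambda a, b: a ^ b, c)]   # append overall parity

def decode(cw):
    """Returns (data, status). Status is what the decoder *believes* happened."""
    c, s = list(cw[:7]), 0
    for pos in range(1, 8):          # syndrome = XOR of positions of 1-bits
        if c[pos - 1]:
            s ^= pos
    parity_ok = reduce(lambda a, b: a ^ b, cw) == 0
    if s and not parity_ok:          # single-bit error: flip it back
        c[s - 1] ^= 1
        return [c[2], c[4], c[5], c[6]], "corrected"
    if s and parity_ok:              # two flips: detectable, not correctable
        return None, "uncorrectable"
    if not s and not parity_ok:      # believes only the parity bit flipped
        return [c[2], c[4], c[5], c[6]], "corrected"
    return [c[2], c[4], c[5], c[6]], "ok"

data = [1, 0, 1, 1]
cw = encode(*data)

one = cw.copy();   one[2] ^= 1                                   # 1 flip
two = cw.copy();   two[1] ^= 1; two[5] ^= 1                      # 2 flips
three = cw.copy(); three[0] ^= 1; three[1] ^= 1; three[2] ^= 1   # 3 flips
print(decode(one))    # ([1, 0, 1, 1], 'corrected')
print(decode(two))    # (None, 'uncorrectable')
print(decode(three))  # wrong data, yet the decoder reports success!
```

The third case is the one worth noticing: three simultaneous flips can alias to a valid-looking syndrome, so the decoder "corrects" the wrong thing and hands back bad data without complaint. That failure mode, scaled up, is the theoretical blind spot of 72-bit SECDED.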
In the maybe half dozen failed ECC DIMMs I've seen in the wild, only one of them ever even had any 2-bit errors. However, most of my experience with bad RAM is of the non-ECC variety.
What I am saying is that in almost all arenas of digital data, we implement overhead in abstraction layers that are invisible to the user, in any capacity - to make sure small errors get fixed.
When it comes to DRAM, that's called ECC memory. As I mentioned, DDR5 now has on-die ECC, which has jealously been hidden from us, so we cannot know how many problems it's papering over, of what sort, and just how close a given DDR5 DIMM or memory chip is to generating uncorrectable errors.
Hard drives especially deploy massive overhead to keep your data safe. I don't remember the exact numbers, but I remember from school my jaw dropping a bit when I realised how much space is "wasted" in a hard drive for pure data-integrity reasons.
They also have tracking information, which imposes its own overhead. Be careful the figures you're looking at don't include that.
Anyway, yeah, I remember back when HDDs started embedding DSPs and doing on-the-fly error checking & correction. I think that was way back in the 1990s.
Frankly, implementing some simple error-catching code is trivial, and you get a lot of integrity for the first 5-10% of overhead.
Yes, that's ECC. Before DDR5, ECC DIMMs added 12.5% overhead (8 bits per 64). With DDR5, it's now 8 bits per 32, because DDR5 split each 64-bit DIMM into two logically independent 32-bit channels.
With the "in-band" ECC solution I mentioned, AnandTech found the overhead of Intel's implementation is just 8 bits per 256 (i.e. 3.1%), showing just how much weaker it is. I wish Intel would let us use in-band ECC on all of their CPUs & platforms that don't support full OOB ECC.
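Working out the overhead figures from the bit ratios above (pure arithmetic; the labels are mine):

```python
# ECC storage overhead = check bits / data bits, for the schemes mentioned.
schemes = {
    "Pre-DDR5 ECC DIMM (8 check bits per 64 data bits)":  (8, 64),
    "DDR5 ECC DIMM (8 check bits per 32-bit channel)":    (8, 32),
    "Intel in-band ECC (8 check bits per 256 data bits)": (8, 256),
}
for name, (check, data) in schemes.items():
    print(f"{name}: {100 * check / data:.1f}% overhead")
```

Note that by this arithmetic, DDR5's 8-per-32 layout actually raises the storage overhead to 25% (an 80-bit ECC DIMM versus the old 72-bit one), while in-band ECC trades protection strength for a much smaller 3.1% footprint.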