News Raspberry Pi 5 patch boosts performance up to 18% via NUMA emulation — Geekbench tests reveal gains in both single and multi-threaded performance

abufrejoval

At first glance I want to decry this as so evidently bogus that on second thought I keep thinking "clickbait!"

Third thought: "perhaps I should feed this to some AI to demonstrate their inability to reason?"

In case you're not firmly grounded in NUMA, it mostly exists to fix a problem that a uniform-memory SoC like the RP5 shouldn't have: on multi-CPU (or multi-CCD) systems, the memory bottleneck tends to be relieved by giving each cluster of cores its own DRAM bus, while implementing a mechanism through which DRAM physically attached to another CPU can still be used in a logically transparent manner, with that CPU acting as a proxy.

Those proxy services come at a cost, because now two CPUs are kept busy for every memory access, but that can be judged better than having a task run out of memory altogether, or having to share data via even less efficient means like fabrics, networks or even files.

So NUMA libraries let applications exploit locality, keeping cores, code and data on locally attached memory and caches as much as possible to avoid the proxy overhead.
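As a rough illustration of what that locality control looks like in practice, here is a minimal sketch using Linux's libnuma (assuming a system with libnuma installed; the node number and buffer size are purely illustrative):

```c
/* Minimal sketch of NUMA-aware allocation with libnuma on Linux.
 * Build with: gcc numa_demo.c -o numa_demo -lnuma
 * Node 0 is an illustrative choice, not a recommendation. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    int node = 0;                    /* keep code and data on node 0 */
    size_t size = 64 * 1024 * 1024;  /* 64 MiB working set */

    /* Pin the calling thread to CPUs on the chosen node... */
    numa_run_on_node(node);

    /* ...and allocate memory physically backed by that node's DRAM,
     * so accesses never pay the remote-proxy penalty. */
    void *buf = numa_alloc_onnode(size, node);
    if (!buf) return 1;

    memset(buf, 0, size);  /* touch the pages so they fault in locally */
    numa_free(buf, size);
    return 0;
}
```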

And on small systems with only a single memory bus, that situation should never occur: all RAM is local to the CPU core cluster the code is running on.

But non-locality has crept into our SoCs in another way: avoiding the terrible bottleneck of a single memory bus has created such immense pressure that the vast majority of the surface area of even the smallest chips is now covered with caches.

And keeping your caches uncontested by any other core or thread is critical to avoiding cache-line flushes and reloading data from lower-level caches or the downright terrible DRAM. DRAM, by the way, can itself be split into banks and open page sets, which again were created to lessen the terrible overhead of going to fully unprepared raw RAM (some of it might even be sleeping!).

That's why high-performance computing applications need such careful tuning, while lots of them still only scratch single-digit percentages out of the maximum theoretical computing capability of the tens of thousands of CPUs they run on.

And evidently that work hasn't been done with the standard variant of Geekbench.

And that mostly exposes the dilemma that the classic abstraction, or separation of work, between operating systems, their libraries and applications faces: they simply don't know enough about each other to deliver an optimal RAM allocation strategy.

Typically you want both: locality for your working thread, keeping things as close together as possible, yet also aligned to data-type and cache-line boundaries (which wastes space but runs faster), while also spreading threads out across distinct portions of the cache, so they won't trample across each other's cache lines. A sketch of that trade-off follows below.
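To make the second half of that concrete, here is a small sketch of cache-line padding to avoid false sharing (the 64-byte line size is an assumption; it's typical for x86 and the Pi 5's Cortex-A76, but it is hardware dependent):

```c
/* Sketch of avoiding false sharing: pad per-thread counters to
 * cache-line boundaries so threads don't trample each other's lines.
 * The 64-byte line size is an assumption, not a universal constant. */
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Bad: both counters share one cache line, so two threads
 * incrementing them ping-pong the line between cores. */
struct counters_bad {
    uint64_t a;
    uint64_t b;
};

/* Better: each counter owns a full line; wastes space, runs faster. */
struct counters_good {
    alignas(CACHE_LINE) uint64_t a;
    alignas(CACHE_LINE) uint64_t b;
};
```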

And there's an additional problem: caches aren't fully associative, so cache lines that don't actually point to the very same logical address can still evict each other. HPC-style profiling and tuning may be required, but it is also hardware dependent, e.g. between different implementations or even generations of x86; the sketch below shows the mechanism.
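As a hedged illustration of that eviction effect: in a set-associative cache, addresses that differ by a multiple of (cache size / ways) land in the same set, so a power-of-two stride can thrash a single set even though only a few KiB are touched. The 32 KiB / 8-way L1 parameters below are an assumption, typical but machine specific:

```c
/* Sketch: distinct addresses evicting each other in a set-associative
 * cache. The L1 geometry is an illustrative assumption. */
#include <stdint.h>
#include <stddef.h>

#define L1_SIZE  (32 * 1024)
#define L1_WAYS  8
#define SET_STRIDE (L1_SIZE / L1_WAYS)  /* 4 KiB: same-set stride */

/* Touch 'lines' addresses that all map to one cache set; once
 * lines > L1_WAYS, every access evicts an earlier line even though
 * the total data touched is tiny. Caller must supply a buffer of at
 * least lines * SET_STRIDE bytes. */
uint64_t thrash_one_set(const uint8_t *buf, int lines) {
    uint64_t sum = 0;
    for (int i = 0; i < lines; i++)
        sum += buf[(size_t)i * SET_STRIDE];
    return sum;
}
```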

It still doesn't quite explain why the single-threaded parts of the benchmark should gain so significantly, but that may just be because Geekbench, too, needs to straddle completely opposing demands: a reasonable runtime and results consistent enough to allow comparison.

And the only way to dig deeper is to read and observe the running code via profiling, which is why any benchmark without source code can't really be any good.
 

bit_user

And keeping your caches uncontested by any other core or thread is critical to avoiding cache-line flushes and reloading data from lower-level caches or the downright terrible DRAM. DRAM, by the way, can itself be split into banks and open page sets, which again were created to lessen the terrible overhead of going to fully unprepared raw RAM (some of it might even be sleeping!).
I haven't seen a good explanation of why it should matter for caches, but I suspect it's really just about pipelining memory accesses across different banks of DRAM. It would be interesting to see this benchmark repeated across different memory capacities of the Pi hardware, if there are any two that use the same density DRAM chips.

It still doesn't quite explain why the single-threaded parts of the benchmark should gain so significantly,
Probably because it (more often than not) moves the GPU into a separate DRAM bank from the CPU cores running the thread.
 
Probably because it (more often than not) moves the GPU into a separate DRAM bank from the CPU cores running the thread.
Or the OS itself: since this patch is at the kernel level, it lets the scheduler make sure a process runs wholly from a single memory area, separate from the kernel and system services. Stuff like execution prediction and prefetchers may just work better as a side effect of "fake" NUMA (which is really software-implemented NUMA, since the memory actually is non-uniform at the hardware level, it simply isn't presented as such). A sketch of confining a process this way follows below.
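As a hedged sketch of what "running wholly from a single memory area" could look like from user space, again with libnuma (node "0" is illustrative; NUMA nodes, emulated or not, show up under /sys/devices/system/node/ on Linux):

```c
/* Sketch: confining a process to one NUMA node, so its pages and
 * its threads stay within a single memory area. Node "0" is an
 * illustrative choice. Link with -lnuma. */
#include <numa.h>

void confine_to_node0(void) {
    struct bitmask *nodes = numa_parse_nodestring("0");
    if (nodes) {
        numa_bind(nodes);  /* binds both CPU affinity and memory policy */
        numa_free_nodemask(nodes);
    }
}
```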
 