Looks like you completely missed my point and the basics behind it.
Yes, I didn't find it very clear, but thanks for the additional clarifications.
Point #1 is that neither Intel nor AMD, the main server CPU producers, makes purely CPU-based motherboards with more than 2 CPUs on them, even though they could put 4, 6, or 8 on them.
Like Stryker said, Intel still maintains up to 8-socket scalability, but at a considerable price premium. I think you'll find that it's generally not cost-effective to scale that high, which is why > 2-socket servers aren't popular.
It's getting seriously difficult to fit more than 2 CPUs and all their RAM on a single board. You can find some pictures of AMD server boards with 2 sockets and 48 DIMM slots. There's definitely not room for another two CPUs and their memory. Just eyeballing it, I couldn't say if there'd be enough room to fit 4 of them with only one DIMM per channel, so you might have to look at some kind of blade-based setup.
At SC22, we saw a Gigabyte GPU server with 48x DDR5 DIMM slots. We also saw an STH video on display and an Ampere Altra Arm server with NVIDIA GPUs.
www.servethehome.com
Keep in mind that Intel's top end CPUs have the same number of UPI channels, regardless of whether they support only dual-socket or 8-socket configurations. Even in the latest generation, it's still not enough for point-to-point connectivity for 8 CPUs. Also, if your data isn't equally distributed across the other CPUs' memory, you'll hit more inter-processor communication bottlenecks than you would in a 2-CPU configuration, where all of the UPI channels are used to link the two CPUs.
Neither Intel nor AMD promotes fast ways for end users to connect existing motherboards into a single parallel system.
Intel had been pushing OmniPath, with versions of Skylake SP that even had it wired into the CPU package itself. However, the market rejected OmniPath, as it was a proprietary format and required OmniPath networking gear.
As for other high-speed networking options, why do AMD and Intel need to be the ones pushing them? They're both members of the CXL consortium, and have both implemented CXL 1.x support in Sapphire Rapids and Genoa, respectively.
Seems they gave in to the notion, often a myth, that GPUs always beat CPUs in compute power and wall-plug efficiency.
They both have AVX-512. Does that count for nothing? Zen 5 even beefed it up quite considerably, not only implementing it at native 512-bit width, but also adding more dispatch ports.
Point #2 is that the typical 8 GPUs on the supercomputer baseboards we tested do not give speedups of more than 5-8x on our tasks, compared to the CPUs on the same board. These tasks are the usual heavy FP64 vector and scalar simulations, not the recent AI fashion.
Well, GPUs don't only offer more raw compute - they also offer roughly an order of magnitude more memory bandwidth. So, I'd say you're in an interesting position to have a workload that's not only so heavily dependent on FP64, but also has such a high compute/memory_bandwidth ratio.
Plus, it matters specifically which GPUs you benchmarked. Compared to the A100, Nvidia's H100 made a huge leap in fp64 performance, in order to better compete with AMD, which went to full 64-bit registers and 1:1 performance vs. fp32 in CDNA.
If I had an app with poor scaling on GPUs, I'd definitely want to understand exactly why. It could just be as you say - that the task didn't use the GPUs very efficiently (branchy code? limited concurrency?), but it could also be that there's over-synchronization, excessive contention, etc. CPUs are designed to optimize those things a lot better than GPUs, with their weak memory models.
And it seems you also missed my joke in point #4: with 4 being an unlucky number in some countries, that might be why we don't see 4-processor motherboards.
Intel made them years ago, but not anymore.
I got the cultural reference about 4 being unlucky, but it wasn't clear to me that you meant it in reference to quad-CPU configurations, rather than counting generations of EPYC CPUs, or something like that.