Speaking of Intel's hybrid approach to AVX-512 on consumer CPUs: the AVX-512 instructions were never added to the efficiency cores because they would take up far too much die space and defeat the whole purpose of efficiency cores.
Yes, for area and power reasons, I think. It's worth remembering that Gracemont is the first generation of E-cores to have (256-bit) AVX extensions at all. Before that, they were limited to just the 128-bit SSE family.
Catching the illegal-instruction fault on the E-cores and only scheduling the thread on the P-cores would be something that could be added at least to Linux (some third-party patches were in the works), if Intel had not fused the feature off entirely.
Could work, but not well. The problem is that an app sees there are N hardware threads available (or vCPUs, if you prefer). It naively spins up N application threads, to take advantage of them all. If they all end up scrunched on the P-cores, then those get oversubscribed, while the E-cores sit around mostly unused.
And all it would take to trigger a fault is just one AVX-512 instruction. The thread executing it might not even get a measurable advantage from using it, but would still suffer the consequences of getting faulted-over and locked to the P-cores. So, it could actually hurt your overall throughput more than it helps.
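To put rough numbers on the oversubscription problem, here's a back-of-the-envelope sketch. The function name and the 8P+8E example (an i9-12900K-style layout, with 2-way SMT on the P-cores and none on the E-cores) are just my illustration, and it assumes the worst case where every application thread ends up touching AVX-512:

```python
def p_core_load(p_cores, e_cores, smt=2):
    """Worst case: every app thread gets locked to the P-cores.

    Returns (app_threads, p_core_hw_threads, oversubscription_ratio).
    Hypothetical helper for illustration only.
    """
    p_hw_threads = p_cores * smt             # P-cores expose `smt` hardware threads each
    hw_threads = p_hw_threads + e_cores      # E-cores expose one hardware thread each
    app_threads = hw_threads                 # app naively spawns one thread per hardware thread
    return app_threads, p_hw_threads, app_threads / p_hw_threads


print(p_core_load(8, 8))  # (24, 16, 1.5)
```

So in that case you'd have 24 runnable threads fighting over 16 hardware threads, with all 8 E-cores idle.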
I'm sure Intel analyzed this option. I think it's why the initial Alder Lake CPUs didn't have the feature disabled in hardware - because the decision not to go heterogeneous ISA was probably finalized at a fairly late date.
Developers were also reluctant to use AVX-512 because the CPU takes a heavy frequency hit when this mode is engaged.
The massive frequency hit was mostly an issue on server CPUs, because they have very strict requirements to stay within their power envelope. On consumer CPUs (e.g. Rocket Lake, Tiger Lake), what you see is the CPU holding up frequencies fairly well, but consuming gobs of power instead.
Power-efficiency of AVX-512 seems to have improved significantly, on Intel 7.
Then, there's also a small additional penalty of about three percent when switching into and out of 512-bit execution mode.
The extensions aren't explicitly modal, but you're right that it's bad to leave a block of AVX-512 code without using vzeroupper.
www.realworldtech.com
I wonder how necessary that is on Golden Cove or Zen 4...
The real solution would be for Intel to detect the presence of AVX-512 instructions then automatically and unconditionally pin the thread to the big cores. It wouldn't be THAT hard either, just catch the unknown instruction exception and see if it is AVX-512, and then move the thread.
That's basically what I was talking about, above - have the kernel trap the invalid-opcode exception (the one that would normally reach the process as a SIGILL), check if it's an AVX-512 instruction, then set the thread's CPU mask and put it back into the run queue. But, that has all the downsides I mentioned.
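For what it's worth, the "set the thread's CPU mask" step has a user-space analogue on Linux that you can play with today. This is only a sketch of the pinning part, not the kernel-side trap handling. The helper name is made up, and which CPU numbers correspond to P-cores is something you'd have to find out for your own machine (I'm not assuming any particular numbering):

```python
import os


def pin_current_thread(preferred_cpus):
    """Restrict the caller to `preferred_cpus` (Linux only).

    Hypothetical helper: only CPUs that are already allowed are kept, and
    if none of the preferred CPUs are allowed the mask is left untouched.
    Returns the affinity mask in effect afterwards.
    """
    allowed = os.sched_getaffinity(0)        # 0 means "the caller"
    target = allowed & set(preferred_cpus)
    if not target:
        return allowed                       # nothing usable, leave the mask alone
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

Usage would be something like `pin_current_thread(range(0, 16))` if CPUs 0-15 happen to be your P-core hardware threads. Once you do that for every worker thread, you've recreated exactly the oversubscription scenario described above.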
As far as I'm aware, nobody has gone the hybrid-ISA route, and I think for good reasons. Intel will probably enable AVX-512 on their E-cores, before long. AMD's adoption of AVX-512 probably means it's inevitable that Intel brings it back to their consumer CPUs.
The cases where AVX-512 provides any benefit are still somewhat niche, but more people have been dabbling with using it to accelerate things like text-processing. I'm skeptical it's energy-efficient to do so, but it's at least a performance win.
lemire.me