Does it even make sense to use the RISC/CISC terms these days for mainstream processors?
Even the more "RISC-y" CPUs include SIMD extensions, AES acceleration and whatnot. They don't necessarily have 1-cycle-per-instruction timings... so what's left that's traditionally associated with RISC - exclusively load/store, fixed-length instructions? I'm not sure if e.g. ARM's LDM/STM instructions fit a dogmatic load/store world-view?
IMO, there are several key features of an ISA that make it RISCy:
- dedicated load/store instructions
- register-only operands
- fixed-size instruction words
Those set x86 apart from pretty much every other modern ISA out there. I know some ISAs mix 16-bit and 32-bit instruction words (RISC-V with the compressed extension, ARM's Thumb-2), but I think that's still nowhere near as bad as the variable-length situation we have with x86, where an instruction can be anywhere from 1 to 15 bytes.
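To put the first two bullets in concrete terms, here's a minimal sketch in C; the instruction sequences in the comments are what compilers typically emit for this pattern, not exact output from any particular compiler:

```c
/* Read-modify-write of a memory location. */
void bump(long *counter, long delta) {
    *counter += delta;
}

/*
 * x86-64 can fold the load, add, and store into a single memory-operand
 * instruction (with a variable-length encoding):
 *
 *     add    qword ptr [rdi], rsi
 *
 * A load/store ISA such as AArch64 or RISC-V has to split it into a
 * dedicated load, a register-only ALU op, and a dedicated store, each
 * a fixed-size 4-byte instruction:
 *
 *     ldr    x2, [x0]        // load
 *     add    x2, x2, x1      // register-only operands
 *     str    x2, [x0]        // store
 */
```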
As for LDM/STM, those didn't get carried forward to AArch64. ARM replaced them with the LDP/STP instructions, which only operate on two registers at a time.
Speaking of loads and stores, x86 also has the strongest memory ordering guarantees of the ISAs currently in common use. This can incur additional overhead. Interestingly, Apple's M-series CPUs have a special TSO mode they use for x86 emulation, so someone decided to enable it on AArch64 code and see just how much performance impact it had.
In the most extreme result of the benchmark suite they ran, weak ordering provided a 24.3% speedup over total store ordering! Overall, the benefit was just shy of 9%.
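To make the ordering difference concrete, here's a minimal message-passing sketch using C11 atomics; the codegen notes in the comment are the usual mappings for these memory orders, not anything specific to the benchmark above:

```c
#include <stdatomic.h>

long data;          /* payload, written before the flag is raised */
atomic_int ready;

/* Producer: publish the payload, then raise the flag with release semantics. */
void publish(long value) {
    data = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: spin on the flag with acquire semantics, then read the payload. */
long consume(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* spin */
    return data;
}

/*
 * On x86, the release store and acquire load compile down to plain movs,
 * because TSO already forbids the reorderings that would break this pattern.
 * On AArch64's weaker model the compiler has to emit ordering instructions
 * (stlr/ldar). Apple's TSO mode instead makes ordinary AArch64 loads and
 * stores obey TSO in hardware, which is the switch the benchmark above
 * was toggling.
 */
```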
And sure, x86 has decades of bloat, but how much does *that* matter, really, if we ignore aesthetics and look purely at things like transistor count or the performance cost of leaving the old stuff in?
It matters when trying to build wider decoders. Decoder width becomes crucial when you start executing outside of the uop cache, such as in highly branchy code. AMD has never gone above a 4-wide decoder; in Zen 5 they added a second 4-wide decode cluster, but each cluster is tied to one SMT thread. Intel has gone up to 6-wide in its P-cores, but that pales in comparison to the 10-wide decoders in ARM's biggest cores, and I don't even know what Apple is up to by now.
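A toy model of why variable-length decode is hard to widen (the "encodings" here are made up, nothing to do with real x86): with fixed-width instructions every decode slot knows its start offset up front, while with variable lengths each slot's start depends on having determined the previous instruction's length, unless the decoder speculates on boundaries:

```c
#include <stdio.h>
#include <stddef.h>

/* Fixed 4-byte instructions: each slot's start offset is independent of the
 * instruction bytes, so N decoders can all begin working at once. */
void fixed_boundaries(size_t n_slots, size_t starts[]) {
    for (size_t i = 0; i < n_slots; i++)
        starts[i] = i * 4;
}

/* Toy variable-length scheme: the low bits of the first byte give the length. */
size_t insn_length(const unsigned char *p) {
    return 1 + (p[0] & 0x7);            /* pretend lengths of 1..8 bytes */
}

/* Variable-length instructions: slot N's start isn't known until slot N-1 has
 * been (at least partially) decoded, so finding the boundaries is a serial
 * dependency chain. */
void variable_boundaries(const unsigned char *code, size_t n_slots,
                         size_t starts[]) {
    size_t offset = 0;
    for (size_t i = 0; i < n_slots; i++) {
        starts[i] = offset;
        offset += insn_length(code + offset);
    }
}

int main(void) {
    unsigned char code[32] = {3, 0, 0, 0, 1, 0, 5};   /* rest zero-filled */
    size_t fixed[4], variable[4];
    fixed_boundaries(4, fixed);
    variable_boundaries(code, 4, variable);
    for (int i = 0; i < 4; i++)
        printf("slot %d: fixed @ %zu, variable @ %zu\n", i, fixed[i], variable[i]);
    return 0;
}
```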
Also, by restricting its newest cores to 64-bit only, ARM found decoding became cheap enough that it could get rid of its mop caches entirely, which provides additional transistor and power savings.
As for just "leaving the bloat in": I think dropping the legacy features Intel wanted to remove could provide some simplifications in key critical paths. Not sure about that, but it's conceivable when you're talking about things like how addressing works.