News AMD Talks Hybrid Ryzen CPU Concepts, Avoiding Intel's AVX-512 Problem

Speaking of Intel's take on AVX-512 hybrid approach on consumer CPUs, the AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the whole purpose of efficiency cores.

And, this is why scalable vector ISAs like the RISC-V vector extensions are superior to fixed-size SIMD. You can support both kinds of microarchitecture while running the exact same code.
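
To make the "exact same code" point concrete, here is a rough sketch of the strip-mining loop that vector-length-agnostic code boils down to. It is plain Python standing in for vector hardware, and `vla_add` and `vlmax` are made-up names for illustration, not part of any real RISC-V toolchain.

```python
def vla_add(a, b, vlmax):
    """Element-wise add, processed in hardware-sized chunks.

    vlmax is a stand-in for how many elements one vector register holds
    on a given core (an assumption for illustration, not a real API).
    """
    assert len(a) == len(b)
    out = []
    i = 0
    while i < len(a):
        # Similar in spirit to RVV's vsetvli: request the remaining count
        # and get back however many elements this core handles at once.
        vl = min(len(a) - i, vlmax)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


a = list(range(10))
b = [10 * x for x in a]
small_core = vla_add(a, b, vlmax=4)   # e.g. a narrow efficiency core
big_core = vla_add(a, b, vlmax=16)    # e.g. a wide performance core
```

The loop never hard-codes a width, so the narrow and the wide core produce the same result and only the number of iterations differs.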

Though, catching the bad instruction fault on the E-cores and only scheduling the thread on the P-cores would be something that could be added at least to Linux (some third-party patches were in the works), if Intel had not fused the feature off entirely.

Developers were also reluctant to use AVX-512 because the CPU takes a heavy frequency hit when this mode is engaged. I don't see how Intel didn't notice this before. Then, there's also a small additional penalty of about three percent when switching into and out of 512-bit execution mode.

Since AVX-512 couldn't be enabled at all unless you disabled the efficiency cores completely, it wouldn't benefit your workload unless the gain from AVX-512 on the remaining cores outweighed the loss of those extra cores.

If your code benefits from AVX-512, it'll probably benefit from turning off the efficiency cores too. Sixteen 256-bit AVX channels are the same as eight 512-bit channels in theoretical throughput. Because there are fewer load/store commands and fewer bunches of setup code to run, overall theoretical efficiency should be higher.
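
The arithmetic behind that, as a quick sketch. The unit counts are the ones from the sentence above; the helper names are just for illustration.

```python
def bits_per_pass(units, width_bits):
    """Total vector bits processed when every unit handles one vector."""
    return units * width_bits


def instructions_needed(total_bits, width_bits):
    """Vector instructions needed to cover total_bits (rounded up)."""
    return -(-total_bits // width_bits)


avx2_total = bits_per_pass(16, 256)     # sixteen 256-bit channels
avx512_total = bits_per_pass(8, 512)    # eight 512-bit channels

# Same amount of data, but half as many loads/stores/ops to issue.
avx2_instr = instructions_needed(avx2_total, 256)
avx512_instr = instructions_needed(avx512_total, 512)
```

Both come out to 4096 bits, and the 512-bit version gets there with half the instruction count, which is where the "fewer load/store commands" part comes from.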

The real solution would be for Intel to detect the presence of AVX-512 instructions then automatically and unconditionally pin the thread to the big cores. It wouldn't be THAT hard either, just catch the unknown instruction exception and see if it is AVX-512, and then move the thread.
 
I asked McAfee how AMD felt about that approach (Intel's) to hybrid designs.
"What I will say is this, I think the way that we think about it, the approach of two very different performance and efficiency cores with very different ISA support and IPC and capability is not necessarily the right approach," McAfee responded. "I think it invites far more complexity around what can execute where, and as we've looked at different options for core design, that's not the approach that we're taking."


I know I keep repeating myself and a lot of people disagree. Regardless, IMO, putting low performance cores on a CPU designed for a HEDT is a dumb idea.

Even more... putting a GPU on a CPU designed for HEDTs is an even dumber idea.
 
That's created interesting problems: Intel's performance cores support AVX-512, but the smaller efficiency cores do not. That led Intel to disable AVX-512 support entirely (forcibly in the end), thus de-featuring its own chip and wasting precious die area.
I would stress the part where the article says "forcibly in the end": lack of AVX-512 isn't an issue, it's a feature. Intel loves its segmentation and assumed that blocking/removing AVX-512 in its mainstream Alder Lake and Raptor Lake CPUs would boost (somewhat) the sales of its HEDT Sapphire Rapids. AMD is looking at this issue from a completely different angle, since they (still) haven't become so lunatic as to fuse off features like AVX-512 that are physically present on their chips.
 
Intel joined something named "RISE" with Samsung and QCOM, to work on RISC-V designs. Perhaps they can make use of it in a SIMD accelerator. Slap a CXL/PCIe 5.0 interface on a RISC-V AVX-512 accelerator chip and get all the AVX-512 heat you want to pay for.

There may be another option in the works. One of the PVC GPU presentations mentions a common SIMD architecture with the CPU. How far are we from getting standard C++ inclusion for something like SYCL that would detect accelerators with AVX-512 capabilities at runtime and JIT-compile the kernels to take advantage? You can have this now if you use DPC++.
 
The key to AMD's smooth roadmap, & much else, is Infinity Fabric IMO.

AMD figured they couldn't beat Intel at big fancy chips, so they focused on a fancy bus to team simpler resource modules into consumer & large scale modules.

This often meant that old, prevalidated subsystems (e.g. I/O) could be left intact, & performance objectives are pursued in focused/isolated areas.

Less to re-design, test & troubleshoot. Their product roadmap timelines have been achieved far better.

AMD's focus on IF was a big part of its amazing resurgence from the brink. They made inferior early Zen parts initially, but won via superior architecture (including costs).
 
AMD's current Ryzen 7040 laptop chips already feature a hybrid design, but not with two different types of CPU cores. Instead, the Ryzen 7040 has just one type of CPU core paired with an in-built AI accelerator engine that operates independently of the CPU and GPU cores.

An incidental accelerator shouldn't make the CPU/APU be considered a hybrid design. Maybe mismatched cache per chiplet (7900X3D, 7950X3D) could be thought of as hybrid, since that sounds like what Zen 4 + Zen 4C and Zen 5 + Zen 5C could mostly be.

I asked McAfee if, conceptually, it would be feasible that efficiency cores would be better for AI than a dedicated piece of silicon (the AI engine). McAfee explained that the AI engines' strict focus on AI-specific operations would give it an efficiency advantage over any general-purpose CPU compute -- even an efficiency core.

That answer should have been obvious, come on now.

I know I keep repeating myself and a lot of people disagree. Regardless, IMO, putting low performance cores on a CPU designed for a HEDT is a dumb idea.

Even more... putting a GPU on a CPU designed for HEDTs is an even dumber idea.

Not if it's done right.

I'm going to assume you are counting Ryzen 7000 as HEDT. That iGPU costs almost no die area, is in the TSMC N6 I/O chiplet separate from CPU cores, and can be more than enough for users who just want some display outs, video decode/encode, and even light gaming. Yeah, AMD should add a similar iGPU to Threadripper CPUs.

Ryzen (or any) CPUs can already have certain cores that consistently boost a little higher than others, so those cores would be favored. We've also seen 7900X3D and 7950X3D with the differing levels of cache per CCX. AMD's hybrid/heterogeneous implementation could be a more extreme version of these two things.

At this point I just want to know the specifics of how they would reduce the die area of a low performance core, other than removing half the L3 cache or using a smaller process node. That's one of the last pieces of the puzzle.
 
That answer should have been obvious, come on now.



<snip>
You're right -- it is completely obvious, and I knew the answer before I asked -- it's an interviewing tactic. The intention of this question was to open up the topic of efficiency cores paired with performance cores in a hybrid config, which AMD hasn't spoken about publicly. The only reason I left mention of that question, which did work to get him to speak about a topic they aren't really talking about yet, was because I used the word "conceptually" in the question. It is important that I disclose that, as much of his answer came as a follow-on to that question, and I don't want to misconstrue his statements.

I think that an accelerator does make the chip a hybrid architecture, as it is connected to the same memory space as the CPU and GPU cores.
 
lack of AVX-512 isn't an issue, it's a feature. Intel loves its segmentation and assumed that blocking/removing AVX-512 in its mainstream Alder Lake and Raptor Lake CPUs would boost (somewhat) the sales of its HEDT Sapphire Rapids.

The most inefficient core is one that the customer is not paying for! Giving away free capability is not efficient.

Chiplets, tiles and design tools that allow greater flexibility are going to lead to ever more market segmentation to reduce what economists would term "consumer surplus" (the consumer mirror image to profit).

AMD is looking at this issue from a completely different angle, since they (still) haven't become so lunatic as to fuse off features like AVX-512 that are physically present on their chips.

AMD doesn't market segment as aggressively because their smaller volumes (and perhaps less control of the production side) don't allow them to do so as efficiently. This can result in more "consumer surplus" (the feeling of getting more than you paid for) if you are in the right part of the SKU stack. It also means that finding a perfect match to your requirements/budget may be harder.
 
The key to AMD's smooth roadmap, & much else, is Infinity Fabric IMO.

AMD figured they couldn't beat Intel at big fancy chips, so they focused on a fancy bus to team simpler resource modules into consumer & large scale modules.

This often meant that old, prevalidated subsystems (e.g. I/O) could be left intact, & performance objectives are pursued in focused/isolated areas.

Less to re-design, test & troubleshoot. Their product roadmap timelines have been achieved far better.

AMD's focus on IF was a big part of its amazing resurgence from the brink. They made inferior early Zen parts initially, but won via superior architecture (including costs).
I agree with your assertions.

I think Intel has superior resources and some superior technologies, but Intel overthinks, over-engineers and puts too much effort into trying to differentiate its CPUs.

The AMD approach of KISS has a lot of short-term benefit.
 
Actually, it's worth noting that the first gossip and leak about AMD's hybrid future surfaced in a June 2021 patent application.

This patent indicates, "A method for relocating a computer-implemented task from a relatively less-powerful processor to a relatively more-powerful processor"

This diagram shows the arrangement of Big/Little processors. There has been no update on this though. The hesitancy was due to Windows not offering a proper task scheduler in Windows 10, according to AMD, and it wanted those cores to be useful instead of just a marketing gimmick to let it say it has more cores.

AMD had applied for this patent describing methods by which one type of CPU would move work over to another type of CPU:

[Image: diagram from the patent application showing the Big/Little processor arrangement]


As per this patent, CPUs would rely on core utilization metrics to determine when it was appropriate to move a workload from one type of CPU to the other.

Metrics include the amount of time the CPU has been working at maximum speed, the amount of time the CPU has been using maximum memory, average utilization over a period of time, and a more general category in which a workload is moved from one CPU to the other based on unspecified metrics related to the execution of the task.

So if I understand this correctly, when the CPU determines that a workload should move from CPU A to CPU B, the core currently performing the work (CPU A, in this case), is put into an idle or stalled state.

The architecture state of CPU A is saved to memory and loaded by CPU B, which continues the process. AMD's patent describes these shifts as bi-directional -- the small core can shift work to the large, or vice-versa.
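
Here is how I picture that flow, as a toy Python model. To be clear, this is my own sketch of the description above and not anything from the patent itself: the class names, the thresholds and the `saved_state` key are all invented for illustration, and the patent text quoted here gives no numbers.

```python
from dataclasses import dataclass, field


@dataclass
class CoreMetrics:
    time_at_max_speed_ms: float
    time_at_max_memory_ms: float
    avg_utilization: float  # 0.0 to 1.0 over some window


@dataclass
class Core:
    name: str
    state: dict = field(default_factory=dict)  # stand-in for architectural state
    idle: bool = True


# Hypothetical thresholds, chosen only so the example runs.
MAX_SPEED_LIMIT_MS = 50.0
MAX_MEMORY_LIMIT_MS = 50.0
UTILIZATION_LIMIT = 0.8


def should_move_to_big(m: CoreMetrics) -> bool:
    """Any one of the listed metrics crossing its limit triggers a move."""
    return (m.time_at_max_speed_ms > MAX_SPEED_LIMIT_MS
            or m.time_at_max_memory_ms > MAX_MEMORY_LIMIT_MS
            or m.avg_utilization > UTILIZATION_LIMIT)


def migrate(src: Core, dst: Core, memory: dict) -> None:
    src.idle = True                            # 1. stall the source core
    memory["saved_state"] = dict(src.state)    # 2. save its state to memory
    src.state.clear()
    dst.state = dict(memory["saved_state"])    # 3. destination loads the state
    dst.idle = False                           # 4. destination resumes the task


little = Core("little", state={"pc": 0x1000, "r0": 42}, idle=False)
big = Core("big")
ram = {}
if should_move_to_big(CoreMetrics(80.0, 0.0, 0.95)):
    migrate(little, big, ram)
```

Since the shifts are described as bi-directional, the same `migrate` call works with the arguments swapped; only the decision function would need a mirror image for moving work back to the small core.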

 
Speaking of Intel's take on AVX-512 hybrid approach on consumer CPUs, the AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the whole purpose of efficiency cores.
Yes, for area and power reasons, I think. It's worth remembering that Gracemont is the first generation of E-cores even to have their (256-bit) AVX extensions. Before that, they were limited to just the 128-bit SSE family.

catching the bad instruction fault on the E-cores and only scheduling the thread on the P-cores would be something that could be added at least to Linux (some third-party patches were in the works), if Intel had not fused the feature off entirely.
Could work, but not well. The problem is that an app sees there are N hardware threads available (or vCPUs, if you prefer). It naively spins up N application threads, to take advantage of them all. If they all end up scrunched on the P-cores, then those get oversubscribed, while the E-cores sit around mostly unused.

And all it would take to trigger a fault is just one AVX-512 instruction. The thread executing it might not even get a measurable advantage from using it, but would still suffer the consequences of getting faulted-over and locked to the P-cores. So, it could actually hurt your overall throughput more than it helps.
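
To put rough numbers on the oversubscription problem, here is a back-of-the-envelope sketch. The 8 P-core (with HT) plus 8 E-core layout is an assumption picked for illustration, and the helper name is made up.

```python
def runnable_per_hw_thread(app_threads, usable_hw_threads):
    """How many application threads compete for each hardware thread."""
    return app_threads / usable_hw_threads


# Assumed layout for illustration: 8 P-cores with 2-way SMT, 8 E-cores.
P_CORES, E_CORES, SMT = 8, 8, 2
hw_threads = P_CORES * SMT + E_CORES   # what the app sees as "N"
app_threads = hw_threads               # app naively spins up N threads

# Normal case: threads spread across every hardware thread.
spread = runnable_per_hw_thread(app_threads, hw_threads)

# Fault-and-pin case: every thread touched AVX-512 once and is now
# locked to the P-cores, while the E-cores sit idle.
pinned = runnable_per_hw_thread(app_threads, P_CORES * SMT)
```

With those assumptions, 24 threads end up sharing 16 hardware threads (1.5 per slot instead of 1.0) and a third of the chip's hardware threads do nothing.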

I'm sure Intel analyzed this option. I think it's why the initial Alder Lake CPUs didn't have the feature disabled in hardware - because the decision not to go heterogeneous ISA was probably finalized at a fairly late date.

Developers were also reluctant to use AVX-512 because the CPU takes a heavy frequency hit when this mode is engaged.
The massive frequency hit was mostly an issue on server CPUs, because they have very strict requirements to stay within their power envelope. On consumer CPUs (e.g. Rocket Lake, Tiger Lake), what you see is the CPU holding up frequencies fairly well, but consuming gobs of power instead.

Power-efficiency of AVX-512 seems to have improved significantly, on Intel 7.

Then, there's also a small additional penalty of about three percent when switching into and out of 512-bit execution mode.
The extensions aren't explicitly modal, but you're right that it's bad to leave a block of AVX-512 code without using vzeroupper.

I wonder how necessary that is on Golden Cove or Zen 4...

The real solution would be for Intel to detect the presence of AVX-512 instructions then automatically and unconditionally pin the thread to the big cores. It wouldn't be THAT hard either, just catch the unknown instruction exception and see if it is AVX-512, and then move the thread.
That's basically what I was talking about, above - have the kernel trap the illegal-instruction fault (the one that would otherwise reach the process as a SIGILL), check if it's an AVX-512 instruction, then set the thread's CPU mask and put it back into the run queue. But, that has all the downsides I mentioned.
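
For anyone who wants to see the "set the thread's CPU mask" step in isolation, here is a user-space approximation in Python. It is not the kernel fault handler described above, just the affinity part of it. `os.sched_setaffinity` and `os.sched_getaffinity` are real but Linux-only, and the core ID ranges are assumptions, since which logical CPUs are P-cores depends on the machine.

```python
import os


def pick_cpu_mask(hit_avx512_fault, p_core_ids, all_core_ids):
    """Restrict to P-cores once a thread has faulted on AVX-512."""
    return set(p_core_ids) if hit_avx512_fault else set(all_core_ids)


def apply_mask(mask):
    """Pin the calling thread to mask, if the platform allows it.

    Returns True if the affinity was changed, False otherwise.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # not Linux, nothing to do
    usable = set(mask) & os.sched_getaffinity(0)
    if not usable:
        return False  # none of the requested CPUs are available to us
    os.sched_setaffinity(0, usable)
    return True


# Hypothetical numbering: logical CPUs 0-15 are P-core threads, 16-23 E-cores.
P_CORE_IDS = range(16)
ALL_CORE_IDS = range(24)

mask_after_fault = pick_cpu_mask(True, P_CORE_IDS, ALL_CORE_IDS)
mask_no_fault = pick_cpu_mask(False, P_CORE_IDS, ALL_CORE_IDS)
# apply_mask(mask_after_fault) would then do the actual pinning.
```

Note the mask is sticky in this sketch, same as in the scheme above: one fault and the thread never goes back to the E-cores, which is exactly the downside.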

As far as I'm aware nobody has gone the hybrid-ISA route, and I think for good reasons. Intel will probably enable AVX-512 on their E-cores, before long. AMD's adoption of AVX-512 probably means it's inevitable that Intel brings it back to their consumer CPUs.

The cases where AVX-512 provides any benefit are still somewhat niche, but more people have been dabbling with using it to accelerate things like text-processing. I'm skeptical it's energy-efficient to do so, but it's at least a performance win.
 
I know I keep repeating myself and a lot of people disagree. Regardless, IMO, putting low performance cores on a CPU designed for a HEDT is a dumb idea.
So, in other words, you'd prefer to have a CPU with just one super-core that's maybe 10x the area of a regular P-core, but runs about 1.5x or 2x as fast? Because, that's basically what you're saying. Essentially, you care more about single-threaded performance than parallelism, so the logical conclusion is that you want to go back to a single-core CPU that puts all its power and die area into making one thread run as fast as possible.

As soon as we start talking about lots of threads, then you cannot ignore concerns about area-efficiency or power-efficiency. With a highly-scalable workload, it pretty quickly becomes a better option just to use lots of efficient cores. This is the basic idea behind GPUs.
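
Using the ballpark figures from above (a super-core at 10x the area and, generously, 2x the speed), the tradeoff works out like this. The numbers are the rough guesses from this post, not measurements, and the workload is assumed to scale perfectly across cores.

```python
def total_throughput(area_budget, core_area, core_speed):
    """Aggregate throughput for a perfectly scalable workload."""
    cores = area_budget // core_area
    return cores * core_speed


AREA_BUDGET = 10  # in units of "one regular core"

# One giant core: 10x the area, 2x the speed (best case from above).
super_core = total_throughput(AREA_BUDGET, core_area=10, core_speed=2.0)

# Ten regular cores in the same area.
regular_cores = total_throughput(AREA_BUDGET, core_area=1, core_speed=1.0)

advantage = regular_cores / super_core
```

So for the same silicon you'd get 5x the throughput from the many-core option on a scalable workload, while the super-core only wins when there is a single thread to run.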

Even more... putting a GPU on a CPU designed for HEDTs is an even dumber idea.
I've said this before, and I guess I have to say it again: Intel has not put E-cores or iGPU in their HEDT models!!!

You might think an i9 qualifies as HEDT, but not as the term is classically defined. HEDT has a bigger socket, more memory channels, more PCIe lanes, generally more TDP, more cores (at the upper limit), etc.

BTW, HyperThreading (or SMT, as more generally known) essentially makes every core that has it into a P or E core, depending on its occupancy! That not only means you've probably been using the equivalent of a hybrid CPU for years, but also that OS thread schedulers should be pretty good at managing them. If you really do not want any form of Hybrid CPU, then you should disable HyperThreading!
 
I would stress the part where the article says "forcibly in the end": lack of AVX-512 isn't an issue, it's a feature. Intel loves its segmentation and assumed that blocking/removing AVX-512 in its mainstream Alder Lake and Raptor Lake CPUs would boost (somewhat) the sales of its HEDT Sapphire Rapids.
This only makes sense if you ignore that Intel shipped 2 generations of mobile CPUs and one generation of mainstream desktop CPUs with AVX-512 fully enabled.

The reason they disabled it on Alder Lake is that they faced a problem of how to compete with AMD on highly-threaded workloads, and the only solution they could find was to go hybrid. And software ecosystem support to properly utilize heterogeneous ISA CPUs is a long way off.

That was the decision chain. Intel was in a tricky situation, and they picked the lesser of two evils.
 
Intel joined something named "RISE" with Samsung and QCOM, to work on RISC-V designs. Perhaps they can make use of it in a SIMD accelerator. Slap a CXL/PCIe 5.0 interface on a RISC-V AVX-512 accelerator chip and get all the AVX-512 heat you want to pay for.

There may be another option in the works. One of the PVC GPU presentations mentions a common SIMD architecture with the CPU.
Sounds like the 2nd gen Xeon Phi (Knights Landing). Didn't work out too well.

How far are we from getting standard C++ inclusion for something like SYCL that would detect accelerators with AVX-512 capabilities at runtime and JIT-compile the kernels to take advantage? You can have this now if you use DPC++.
SYCL might be easier than straight OpenCL, but it's still some work to use it effectively. I don't think it's going to see wide adoption, outside of areas where people are already using things like OpenMP, OpenCL, or CUDA. Don't get me wrong - I'd rather use SYCL than those other options, but it's not a magic bullet for parallelizing everything (nor should it be).
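
For what it's worth, the "detect at runtime, then pick a code path" part is the easy bit. Here is a minimal sketch of that step alone, assuming x86 Linux where `/proc/cpuinfo` has a `flags` line containing names like `avx2` and `avx512f`. It has nothing to do with SYCL or DPC++ themselves, and the kernel names it returns are placeholders.

```python
def cpu_flags(cpuinfo_text):
    """Extract the feature flags from /proc/cpuinfo-style text (x86 Linux)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_kernel(flags):
    """Choose the widest code path the CPU advertises (placeholder names)."""
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "scalar"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the file contents, or an empty string where it doesn't exist."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


# Canned input so the example behaves the same on any machine.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2\n"
choice = pick_kernel(cpu_flags(sample))

# On a real x86 Linux box: pick_kernel(cpu_flags(read_cpuinfo()))
```

The hard part is everything after this: writing (or generating) good kernels for each path and deciding when the wide one is actually worth it.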

I think WebGPU is a more interesting prospect. Combined with Web Assembly, we should start to see web apps that have performance and capabilities approaching that of natively-compiled ones.
 
At this point I just want to know the specifics of how they would reduce the die area of a low performance core, other than removing half the L3 cache or using a smaller process node. That's one of the last pieces of the puzzle.
Case studies exist in the form of ARM cores. Seriously, ever since they introduced their X-series and Neoverse cores, those are just scaled up versions of their A7x-series cores. Because ARM sells these to SoC makers, they're fairly transparent about the details.

Here's some reading which shows the microarchitectural parameters they tweaked, as well as the impacts on IPC and area:

As I mentioned, you can probably also find some discussion of their N-series and V-series cores, which are derived from the mobile A7x-series and X-series cores.
 
You're right -- it is completely obvious, and I knew the answer before I asked -- it's an interviewing tactic.
I had the same thought as @usertests , but I quickly realized that just because I know doesn't mean all your readers will know.

The intention of this question was to open up the topic of efficiency cores paired with performance cores in a hybrid config, which AMD hasn't spoken about publicly.
Good point. I guess it's a little like trolling, but you don't want it to be obvious. Make a statement that they'll want to clarify or qualify, thereby revealing more information.
 
AMD had applied for this patent describing methods by which one type of CPU would move work over to another type of CPU:

[Image: diagram from the patent application showing the Big/Little processor arrangement]
Multi-path connectivity scares me, unless it's like a proper ring or mesh. But, it would be interesting if they paired little cores with big ones, so they shared the same L2 or L3 cache slice.

As per this patent, CPUs would rely on core utilization metrics to determine when it was appropriate to move a workload from one type of CPU to the other.

Metrics include the amount of time the CPU has been working at maximum speed, the amount of time the CPU has been using maximum memory, average utilization over a period of time, and a more general category in which a workload is moved from one CPU to the other based on unspecified metrics related to the execution of the task.
Sounds a lot like Intel's Thread Director.

So if I understand this correctly, when the CPU determines that a workload should move from CPU A to CPU B, the core currently performing the work (CPU A, in this case), is put into an idle or stalled state.

The architecture state of CPU A is saved to memory and loaded by CPU B, which continues the process. AMD's patent describes these shifts as bi-directional -- the small core can shift work to the large, or vice-versa.
Sounds like a general description of a context switch.
 
BTW, HyperThreading (or SMT, as more generally known) essentially makes every core that has it into a P or E core, depending on its occupancy! That not only means you've probably been using the equivalent of a hybrid CPU for years, but also that OS thread schedulers should be pretty good at managing them. If you really do not want any form of Hybrid CPU, then you should disable HyperThreading!
Thanks for bringing up that obvious but often ignored point.

I've stated this before: in order of per-core processing power, P-cores > Zen cores > E-cores > Zen SMT > P-core HT. There is more than one way for a task to go on modern CPUs. If you are worried about weak threads slowing you down, it is better to turn off HT and leave E-cores enabled than the other way around.

I used to have one of those older 12700Ks with AVX-512 enabled. I tried to find a practical use for AVX-512 for a typical consumer like myself. Since I wasn't interested in emulating PS3, I could find none. And even though the E-core implementation in Alder Lake was flawed, because it dropped your ring clock to around 3600 MHz and the P-core to E-core latency was bad, E-cores were still far more useful than AVX-512 in my 12700K. With RPL it is no contest.
In typical consumer chips there is no need for AVX-512, and having low-cost E-cores is far more beneficial, especially in a CPU with 4 or 6 P-cores.

It is a shame AMD can't do any more than strip cache, and I think the AVX-512 point is a bad straw-man defense of this. Maybe they have less to gain. Their Zen cores already sit between P-cores and E-cores in die area and power consumption, and they are already having problems with heat density, so hybrid does seem to offer them less benefit than it does Intel.
 
This only makes sense if you ignore that Intel shipped 2 generations of mobile CPUs and one generation of mainstream desktop CPUs with AVX-512 fully enabled.

The reason they disabled it on Alder Lake is that they faced a problem of how to compete with AMD on highly-threaded workloads, and the only solution they could find was to go hybrid. And software ecosystem support to properly utilize heterogeneous ISA CPUs is a long way off.

That was the decision chain. Intel was in a tricky situation, and they picked the lesser of two evils.
There's a difference between disabling something, like Intel did with the first Alder Lake parts, and using a f**ing laser in order to physically prevent its usage under any circumstance whatsoever. Even if it was officially unsupported and it required some support from the motherboard vendor, having AVX-512 capabilities for the big cores would have been a nice feature to have for some users.
 
Even if it was officially unsupported and it required some support from the motherboard vendor, having AVX-512 capabilities for the big cores would have been a nice feature to have for some users.
I agree, but Intel said it wasn't validated and therefore you couldn't rely on it actually working correctly. I think they just wanted to limit their potential support liability and any PR issues that could arise from a vocal minority complaining about either implementation bugs or manufacturing defects in the AVX-512 functionality of their Alder Lakes.

Personally, I wish they'd have released a Xeon E-series, based on the small Alder Lake die (i.e. the 6 P + 0 E configuration) that enabled it. That would provide it on an entry-level platform, even if they restricted it for use only on a W680 board.
 
So, in other words, you'd prefer to have a CPU with just one super-core that's maybe 10x the area of a regular P-core, but runs about 1.5x or 2x as fast? Because, that's basically what you're saying. Essentially, you care more about single-threaded performance than parallelism, so the logical conclusion is that you want to go back to a single-core CPU that puts all its power and die area into making one thread run as fast as possible.

As soon as we start talking about lots of threads, then you cannot ignore concerns about area-efficiency or power-efficiency. With a highly-scalable workload, it pretty quickly becomes a better option just to use lots of efficient cores. This is the basic idea behind GPUs.


I've said this before, and I guess I have to say it again: Intel has not put E-cores or iGPU in their HEDT models!!!

You might think an i9 qualifies as HEDT, but not as the term is classically defined. HEDT has a bigger socket, more memory channels, more PCIe lanes, generally more TDP, more cores (at the upper limit), etc.

BTW, HyperThreading (or SMT, as more generally known) essentially makes every core that has it into a P or E core, depending on its occupancy! That not only means you've probably been using the equivalent of a hybrid CPU for years, but also that OS thread schedulers should be pretty good at managing them. If you really do not want any form of Hybrid CPU, then you should disable HyperThreading!
I agree - Intel i9 CPUs should not be regarded as HEDT products. They are products for notebook computers where battery life is important.
 