That explanation might make sense if the E-cores were only more efficient on floating-point workloads. However, they're also very efficient on integer workloads.
The E-cores are efficient because they were designed for efficiency (without sacrificing too much performance). The P-cores are designed for outright performance, even at the expense of efficiency. There's a very thorough analysis of Gracemont here:
If it were really as simple as the presence or absence of AVX-512, then Intel could've saved a lot of time, trouble, and money by completely removing it from the client version of Golden Cove, and not even bothering with the whole hybrid CPU thing.
Zen 4 manages to be efficient in spite of AVX-512. I think the main problem with Intel's rollout of AVX-512 is simply one of timing and technology: they pushed it out the door before the underlying process node was ready. AMD did very well by holding back, and then still implementing it with an eye towards efficiency.
It is implied, yes, so I agree. Gracemont, from what I remember reading in the early papers on it, is a "beefed-up Atom" rather than a "stripped-down Skylake". That's also why it clocks way lower: they put the efficiency point of the whole core really low.
My point is that one of the "efficiencies" Intel touted is not only power consumption but also die area, which is the biggest reason they love saying 4 E-cores fit in the space of 1 P-core. Yeah, that's thanks to stripping out AVX-512. If the P-cores didn't have all that stupid AVX-512 pipeline, they would have been WAY more space efficient and, I'm sure, more power efficient as well.
The reason AMD's AVX-512 implementation is fine power-wise is that they're dual-pumping and, more importantly, re-using a lot of the AVX2 pipes for it. Plus, IIRC, they didn't implement the whole AVX-512 ISA? In any case, I'm sure I'm not far off the mark.
The information to support or refute your ideas is out there, if you're willing to look for it. Locuza did a detailed die-shot analysis of Alder Lake, here:
First, the Golden Cove core is estimated at 7.53 mm^2 (or 5.88 mm^2 without L2 cache). Then the article goes on to speculate:
"On the other hand, we could also remove a bit of area for Golden Cove, since Intel really doesn’t want the AVX512 capabilities to be exposed, which increase the size of the register files and execution units. Without considering core floorplan challenges, you may in theory gain about 1 to 2 mm²."
I believe that 1-2 mm^2 figure refers to the cumulative area across the whole die - i.e. all 8 P-cores, or only about 0.125 to 0.25 mm^2 per core - because the analysis later estimates Golden Cove's entire VFP unit at just 0.986 mm^2! So, by Locuza's estimate, dropping AVX-512 would only slim the VFP down to 0.736 mm^2 at best!
The article then goes on to compare Golden Cove to Gracemont, estimating the latter's core at 1.63 mm^2 (including everything but L2). That puts the entire Golden Cove VFP unit at roughly 60% of a Gracemont core. In other words, deleting AVX-512 from Golden Cove would shrink it by only a fraction of the area of one Gracemont core - further confirmation that whatever AVX-512 adds to Golden Cove's area makes only a minor dent in the differential between the two cores.
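To make that arithmetic explicit, here's a quick back-of-the-envelope sketch using the figures above. My assumption is that the "1 to 2 mm²" saving is spread evenly across the desktop die's 8 P-cores; the numbers are the ones quoted from Locuza's analysis, nothing more.

```cpp
#include <cstdio>

int main() {
    // Figures quoted from Locuza's Alder Lake die-shot analysis (see above).
    const double vfp_area       = 0.986; // Golden Cove VFP unit, mm^2
    const double gracemont_core = 1.63;  // Gracemont core without L2, mm^2
    const double die_saving_max = 2.0;   // upper bound of the "1 to 2 mm^2" estimate
    const int    p_cores        = 8;     // P-cores on the desktop Alder Lake die (assumed even spread)

    const double saving_per_core = die_saving_max / p_cores;   // 0.25 mm^2
    const double slimmed_vfp     = vfp_area - saving_per_core; // 0.736 mm^2

    std::printf("Best-case AVX-512 saving per P-core:  %.3f mm^2\n", saving_per_core);
    std::printf("VFP unit without AVX-512 (best case): %.3f mm^2\n", slimmed_vfp);
    std::printf("Whole VFP vs. one Gracemont core:     %.0f%%\n",
                100.0 * vfp_area / gracemont_core);
    return 0;
}
```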
So, the simple answer is: no. AVX-512 isn't keeping Golden Cove from being "WAY more space efficient". We can't really say much about power efficiency, but I believe the upper 256 bits of the AVX-512 registers remain inactive as long as a program never touches them.
A lot has been published by Intel and others about Gracemont. If you truly want to understand how it's so much more efficient than Golden Cove, the answers are out there and not hard to find. I already posted a link to Chips & Cheese and here's what Anandtech published from Intel's Architecture Day:
Zen 4's VFP implementation is the same width as Golden Cove's; it's just sub-divided into more, narrower ports. That doesn't help AMD with power efficiency. The main efficiency benefit for AMD is that they waited to implement it on TSMC N5.
The main difference is that Intel has double the FMA ports, whereas Zen 4 can only schedule adds in those slots. I'm reasonably certain both AMD and Intel reuse the same datapaths and ALUs for AVX and AVX-512. If not, then that's even more reason to think it's not hurting Golden Cove's efficiency.
AMD implemented all of the same AVX-512 ISA extensions as Ice Lake. The only additional instructions implemented by Golden Cove are for FP16 support.
Now, I've just spent about an hour of my time explaining what you could've looked up yourself. In the future, please check your own facts instead of making poorly-informed statements and relying on others to catch you when you're demonstrably wrong.
Good question. I would start by observing that Gracemont was their first E-core even to feature AVX (256-bit). Before that, they only went as far as SSE4 (128-bit).
Furthermore, it doesn't appear that Crestmont - to feature in Meteor Lake & Sierra Forest - will add AVX-512. The soonest we might see E-cores with AVX-512 is probably 2025.
We can speculate that Intel thinks the main markets for its E-cores won't use AVX-512 enough to justify the die area. Intel still controls enough of the client CPU market that dropping AVX-512 from that segment is likely to discourage most app developers from including AVX-512 support, making it a somewhat self-fulfilling prophecy.
With Sierra Forest, it's another matter - but server users will have the option of performance-oriented CPUs that do have AVX-512. Many server apps are predominantly scalar/integer-based and would benefit very little from AVX-512 (despite some cases where it's been shown to improve things like string processing).
Using the numbers from above, if AVX-512 added 0.125 mm^2 per E-core (which I get from the lower-end estimate of 1 mm^2 spread across 8 P-cores), then it could increase their size by perhaps 7.7%. When trying to optimize performance per mm^2, that's probably enough to put it near the top of the list of variables under consideration.
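For anyone who wants to check where the 7.7% comes from, here's a tiny sketch. It assumes the lower-end 1 mm^2 estimate spread across 8 P-cores, and the 1.63 mm^2 Gracemont core figure from the die-shot analysis above.

```cpp
#include <cstdio>

int main() {
    // Lower-end estimate: 1 mm^2 of AVX-512 area spread across 8 P-cores.
    const double avx512_per_core = 1.0 / 8;  // 0.125 mm^2
    const double gracemont_core  = 1.63;     // mm^2, excluding L2 (from Locuza's analysis)

    std::printf("AVX-512 area added per E-core: %.3f mm^2 (+%.1f%% of core area)\n",
                avx512_per_core, 100.0 * avx512_per_core / gracemont_core);
    return 0;
}
```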
But they still added all the AVX-512 pipeline to the P-cores, so it still doesn't make sense. Especially when they fuse it off after the fact; that's a huge facepalm.
The saved die space could still be used for more cache or registers, or a better branch predictor (if possible). Even the "smart threader" (telemetry) would probably save some die space, since you could probably make do with a simpler telemetry design once the ISA is homogeneous.
Anyway, I guess they were too desperate to come up with an answer to AMD?
It's hard not to think that way, especially considering all the weird rumblings about AVX-512 when Alder Lake launched. Remember, some motherboards allowed you to enable AVX-512 as long as you disabled the E-cores, until Intel took that away. Then they flat-out started fusing off the capability.
It was rushed out for some reason. The only reason I can think of is AMD posing too much of a threat, with Zen 4 on the horizon back then.
Or they knew that they would compete well enough without it on desktop and didn't want to cut into sales of their more expensive Sapphire Rapids platform, which still has AVX-512.
If they use the same cores for both, they probably just cheaped out on fusing it off because they didn't think the OEMs would be able to enable it.
I have a few theories on why client Golden Cove included AVX-512:
1. Intel hadn't yet reached a final decision about the hybrid strategy. By the time they settled on the hybrid CPU approach, with a symmetrical ISA, it was too late to remove it.
2. Similar to #1, but they thought they might still enable AVX-512 in some SKUs using the client version of the core.
3. It was done to accelerate time-to-market for Sapphire Rapids, because having it in the client core would enable Intel to validate it earlier.
Yes ...or don't reuse the die area and just pocket the cost savings. One thing I've wondered is whether Raptor Lake (or the Raptor Refresh) completely removed it. I've utterly failed to find confirmation on whether it's still in there. If I had more time, I'd be tempted to look for the best die shots I could find and do my own A/B comparison.
Perhaps it's telling that none of the serious die analysts even investigated the matter. Maybe such a big change would be unexpected, given the kind of refresh Raptor Cove is.
I did run across this die analysis of Meteor Lake:
(SemiAnalysis / Locuza: die-shot analysis of Intel's Meteor Lake compute tile on the Intel 4 process node - www.semianalysis.com)
On the one hand, the FPU shrank by 37.8% - well above the average 25.2% shrinkage. That would be consistent with no AVX-512. Then again, the article includes only one terse sentence about the matter:
"The FPU design appears pretty much the same, and AVX512 seems to be relatively unchanged based on various software indicators for the instruction."
Given all indications that AVX-512 has not been added to Crestmont, that would seem to rule out theory #1, above.
It's been discussed many times: why didn't Intel simply enable AVX-512 on the P-cores and have the OS fault any threads using it over to them? The major pitfall is that 99.9% of software assumes a symmetric ISA across cores. Even if yours didn't, the OS and threading APIs don't provide enough visibility or control to adequately support CPUs with an asymmetric ISA. So, I think it wouldn't have been quite the win that some people imagine. In some cases, it could've hurt a lot more than it would've helped in others. Perhaps they even tried it, and found exactly what I'm predicting.
To my knowledge, there's no example of a CPU or SoC with hybrid support of ISA extensions. Intel was taking a big enough risk by going hybrid, but going hybrid with an asymmetric ISA would've crossed the line into "reckless" territory.
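As an illustration of that symmetric-ISA assumption, here's a minimal, hypothetical sketch of the usual "detect once, dispatch globally" pattern (the kernel names are made up; the CPU-feature builtins are the real GCC/Clang ones). The check happens once, on whichever core the thread starts on, and the result is treated as valid process-wide - so on a chip where only the P-cores had AVX-512, a thread that picked the AVX-512 path on a P-core and later got migrated to an E-core would die with an illegal-instruction fault.

```cpp
// Minimal sketch (GCC/Clang on x86) of the common "detect once, dispatch
// globally" pattern. The kernel names are hypothetical stand-ins.
#include <cstdio>

using kernel_fn = void (*)(const float*, const float*, float*, int);

// Plain fallback path.
static void sum_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Stand-in for an AVX-512 intrinsics version; real code would use _mm512_* here.
static void sum_avx512(const float* a, const float* b, float* out, int n) {
    sum_scalar(a, b, out, n);
}

static kernel_fn pick_kernel() {
    __builtin_cpu_init();
    // Checked once, on whichever core this thread happens to be running on.
    // Nothing here (or in the threading APIs) accounts for per-core ISA differences.
    return __builtin_cpu_supports("avx512f") ? sum_avx512 : sum_scalar;
}

int main() {
    kernel_fn k = pick_kernel();  // the choice is effectively global for the process
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
    k(a, b, out, 4);
    std::printf("out[0] = %.1f (dispatched %s path)\n",
                out[0], k == sum_scalar ? "scalar" : "AVX-512");
    return 0;
}
```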
"Remember, some motherboards allowed you to enable AVX-512 as long as you disabled the E-cores, until Intel took that away. Then they flat-out started fusing off the capability."
The motherboard makers found a hack around the way Intel initially disabled it. Intel came out and stated that they hadn't validated the AVX-512 functionality, so you could hit bugs or manufacturing defects if you tried to use it.
I don't blame Intel for fusing it off in later revs. They don't want a fragmented software ecosystem, or people grumbling about stability or bugs encountered when using AVX-512.
In their shoes, I'd have made the exact same calls they ended up making. Hybrid is a bigger win than AVX-512 is a loss; just fuse off AVX-512.
The only thing I wish they'd done differently is to sell 4-core and 6-core Xeons, based on the small die without E-cores, that included official AVX-512 support. By marketing them as Xeon (E-series), they would avoid market fragmentation and confusion, since all Xeons (since Gen 11) have supported AVX-512. However, that presumes the small die has no defects in its AVX-512 implementation.
"...didn't want to cut into sales of their more expensive Sapphire Rapids platform, which still has AVX-512. If they use the same cores for both, they probably just cheaped out on fusing it off because they didn't think the OEMs would be able to enable it."
Sapphire Rapids has slightly different cores. You can see that from the Locuza link I posted - they added FMA on another port, added AMX, and enlarged the L2 cache. None of that is as big or deep a change as ripping out AVX-512 wholesale.
Regarding the decision not to fuse them off (initially), I wonder if this was indeed done to get AVX-512 on some high-quality silicon for Intel's internal purposes (e.g. hardware validation, software development, or performance tuning).
Wow, I just finished reading that SemiAnalysis analysis of the Meteor Lake die shot (linked above) from almost a year ago, and it's impressive how much they already knew at the time. Nothing about this L4 cache business, but they seem to do a reasonable job of accounting for that huge "SoC" tile.
That site really puts SemiAccurate to shame, especially considering the absurd subscription fee the latter has long charged. And there are no tiresome attempts at humor or "we told you so"s sprinkled in every couple of paragraphs (or worse!). Charlie should pack it in, or at least return to an ad-based model. I don't see why almost anyone would keep subscribing at this point.