Where are you getting the 2% figure? Just from eyeballing this slide?
It looks bigger than that to me (maybe 3-4% for ST?), but we really shouldn't assume unitless graphs are that precise. More importantly, it's a far bigger chunk of the MT performance gains!
Even if Anandtech had bothered to compute the averages of their rate-1 SPEC2017 scores for us, it would still be difficult to divide out the frequency difference when we don't even know what frequency it boosted to, or for how long. For instance, if you simply use the "P-core Max Turbo" frequency, it amounts to a 5.9% improvement. However, there's also the Thermal Velocity Boost, which increased by more than that.
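To make the arithmetic concrete, here's a minimal sketch of the normalization I mean. The scores and frequencies below are purely hypothetical (none of them come from the review), and it assumes performance scales linearly with clock, which memory-bound workloads don't, so treat the result as an upper bound:

```python
def clock_normalized_gain(score_new, score_old, freq_new, freq_old):
    """Performance gain with the frequency difference divided out.

    Assumes perf scales linearly with clock -- an idealization that
    overstates the frequency contribution for memory-bound workloads.
    """
    perf_ratio = score_new / score_old
    freq_ratio = freq_new / freq_old
    return perf_ratio / freq_ratio - 1.0

# Hypothetical illustration: a 12% raw score gain alongside a
# 5.8 GHz vs. 5.5 GHz max-turbo difference.
print(f"{clock_normalized_gain(1.12, 1.00, 5.8, 5.5):.1%}")  # -> 6.2%
```

The catch is exactly the one above: which frequency do you plug in for `freq_new` when the chip spends an unknown amount of time at an unknown boost clock?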
That probably had a lot to do with the higher memory latency of Meteor Lake's interconnect and LPDDR5X memory. However, perhaps it was also impacted by whatever methodology he used to measure clock-normalized performance.
Quite frankly, I don't believe there was a real regression in Redwood Cove, and certainly not one that large. Intel said IPC was about the same, and I think that's pretty consistent with what other reviewers have observed.
Right, cause there's no internet on your side? I'll do it just this once.
https://www.anandtech.com/show/16084/intel-tiger-lake-review-deep-dive-core-11th-gen/8
That's the effect of increasing the L2 cache 2.5x, from 512 KB to 1.25 MB, on Tiger Lake.
Too bad they didn't bother to do the same clock-normalized comparison for the Rate-N (multithreaded) benchmarks, because I think you're falling for the fallacy that MT performance is simply ST * N.
The other thing you have to account for is how the latency changed. They discussed it, earlier in that review:
"The private L2 cache gets the biggest update, with a +150% increase in size. Traditionally increasing the cache size by double will decrease the miss rate by √2, so the 2.5x increase should reduce L2 cache misses by ~58%. The flip side of this is that larger caches often have longer access latencies, so we would expect the new L2 to be slightly slower. After many requests, Intel said that its L2 cache was a 14-cycle latency, which we can confirm, making it only +1 cycle over the previous generation. It’s quite impressive to more than double a cache size and only add one cycle of latency. The cache is also now a non-inclusive cache.
The L3 also gets an update, in two ways. The size has increased for the highest core count processors, from 2 MB per core to 3 MB per core, which increases the L3 cache line hit rate for memory accesses. However, Intel has reduced the associativity from 16-way at 8 MB per 4C chip to 12-way at 12 MB per 4C chip, which reduces the cache line hit rate, but improves the power consumption and the L3 cache latency. There is some L3 latency cycle loss overall, however due to the size increase Intel believes that there is a net performance gain for those workloads that are L3-capacity bottlenecked."
So, L2 latency increased 7.7% (one cycle on top of 13), and the L3 suffered both a loss of associativity and a latency increase of ?? %. As for the L2 becoming non-inclusive, I expect that to have virtually no impact. It just keeps the L2 from duplicating the contents of the tiny L1 caches, and is probably something they did to simplify cache coherence rather than to improve L2 utilization.
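For reference, the rule of thumb Anandtech quotes is a power law: miss rate ∝ size^(-1/2). A quick sketch of my own (not from the review) shows a 2.5x size increase leaves about 63% of misses, i.e. roughly a 37% reduction; the article's "~58%" looks like it took √2.5 ≈ 1.58 (1.58x fewer misses) and read it as a percentage:

```python
def miss_rate_scaling(size_ratio, exponent=0.5):
    """Rule-of-thumb scaling: miss_rate ~ cache_size ** -exponent.

    exponent=0.5 is the classic 'doubling the cache cuts misses by
    sqrt(2)' heuristic; real workloads vary around it.
    """
    return size_ratio ** -exponent

remaining = miss_rate_scaling(2.5)  # fraction of misses remaining
print(f"misses remaining: {remaining:.0%}, "
      f"reduction: {1 - remaining:.0%}")
# -> misses remaining: 63%, reduction: 37%
```

Either way, the point stands: fewer misses traded against slightly higher hit latency, and the net depends on the workload.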
Sure it is. "History doesn't repeat, but it rhymes" is the phrase.
Again, you've done nothing to show any more than superficial similarities. An actual historian would look for evidence of what was happening inside those organizations and why they made the choices they did, before drawing such analogies.
Why do you think AMD's cores perform nearly the same as Intel's, when Intel's cores are nearly 50% larger than AMD's? One is clearly better than the other.
Intel P-cores have larger out-of-order structures to extend their performance at high frequencies (where Zen 3 & 4 tend to hit a brick wall). That becomes very area-intensive. They could afford to make big P-cores, due to their hybrid strategy. If not for the E-cores, Intel probably wouldn't have made the P-cores so big.
Here's a summary of how the different OoO structures in their cores compare:
| Structure | Zen 4 | Zen 3 | Golden Cove | Comments |
|---|---|---|---|---|
| Reorder Buffer | 320 | 256 | 512 | Each entry on Zen 4 can hold 4 NOPs; actual capacity confirmed using a mix of instructions |
| Integer Register File | 224 | 192 | 280 | |
| Flags Register File | 238 | 122 | Tied to integer registers | AMD started renaming the flags register separately in Zen 3 |
| FP/Vector Register File | 192 | 160 | 332 | Zen 4 extends vector registers to 512-bit |
| AVX-512 Mask Register File | 52 measured + 16 non-speculative | N/A | (152 measured via MMX) | Since Skylake, Intel uses one RF for MMX/x87 and AVX-512 mask registers; however, Golden Cove does not officially support AVX-512 |
| Load Queue | 88 (136 measured) | 72 (116 measured) | 192 | All Zen generations can have more loads in flight than AMD's documentation and slides suggest; Intel and AMD have different load queue implementations |
| Store Queue | 64 | 64 | 114 | A bit small on Zen 4; would have been nice to see an increase here |
| Branch Order Buffer | 62 taken / 118 not taken | 48 taken / 117 not taken | 128 | |
Source: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
It was apparent after just 2 years that Netburst was a dead end, yet the entire company milked it for 5 more years. You could also see from extreme-overclocking records that 5 GHz wasn't achievable without exotic cooling, yet a company supposedly full of engineers thought 10 GHz was easily reachable. To this day, with the most exotic cooling and the most heavily binned parts, even the most easily clocked CPUs can't hit it.
Again, you're just looking at it from the outside. What actually happened was that Netburst was architected to scale up to 10 GHz, because they assumed Dennard Scaling would continue. It obviously didn't, but it wasn't immediately clear to them just how insurmountable the challenge of controlling leakage would be. So, for a while, they kept pinning their hopes on new process nodes, but that didn't pan out.
Also, the development times of CPUs are long. It takes between 3 and 5 years to bring a new CPU to market, depending a lot on the scale of the changes involved. To change course and deliver Core 2, they first had to come to terms with the fact that whatever they had planned to follow Netburst needed to be scrapped, and then switch gears to working on Core 2, which involved a lot more work than simply continuing to extend Netburst.
You don't need every explicit detail to understand what's going on, which is what you're saying is needed. Unlike "AI", which does, humans have the ability to read between the lines.
If you don't understand why certain decisions were made, you can't presume the same underlying logic will apply to future decisions.