Most of the credit goes to the Chips & Cheese article for actually collecting the data. Too bad they didn't publish their raw measurements - I had to laboriously estimate each datapoint from their plot bitmaps, then compute the transforms from the image space of the plots to their numerical equivalents. A lot of the remaining credit goes to Excel.
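The image-space-to-numeric transform is simpler than it sounds, if the axes are linear. A rough sketch of what I mean (the pixel coordinates below are made-up calibration points, not the actual ones from their plots):

```python
# Sketch of a pixel-to-data transform for a plot with linear axes.
# Calibrate against two known axis ticks, then map any extracted pixel
# coordinate to its data value.

def make_axis_transform(px_a, val_a, px_b, val_b):
    """Return a function mapping a pixel coordinate to a data value,
    given two reference ticks as (pixel position, labelled value)."""
    scale = (val_b - val_a) / (px_b - px_a)
    return lambda px: val_a + (px - px_a) * scale

# Hypothetical calibration: the 0 W tick sits at x=80 px, the 50 W tick at x=580 px.
to_watts = make_axis_transform(80, 0.0, 580, 50.0)
print(to_watts(330))  # a point midway between the ticks -> 25.0
```

You do the same thing for the y-axis, and each extracted point becomes a (power, performance) pair. Log axes would need a log transform on top, but these plots were linear.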
Well... I took a bit of a leap: I assumed they collected 4 data points from each test run, on each cluster of cores. I made that leap simply because the number of data points in the different plots matched per core type (always 14 points for the E-cores, 15 for the P-cores). So, what I did was extract the data from their plots and combine it into a joint table for each class of cores.
Once I had recovered the raw values (which I validated by reproducing their plots and visually inspecting them to ensure they looked almost identical), I could then plot the aspects which most interested me.
IMO, the most interesting thing I did with it was to feed the data into a Python script I wrote to compute the optimal combination of clock speeds at each power level. I then (naively) summed the throughput from these clock-speed combinations to demonstrate how Alder Lake's E-cores enable it to deliver more performance at any power level! For more on that, look here:
forums.tomshardware.com
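The naive version of that search can be sketched like so. To be clear, the curves below are invented placeholder numbers, not the measured Alder Lake data, and the real script worked from interpolated power/performance curves rather than three discrete points:

```python
# Naive clock-combination search: for each power budget, try every
# (P-cluster clock, E-cluster clock) pairing and keep the one with the
# highest combined throughput. Entries are (clock GHz, power W, throughput
# in arbitrary units) -- placeholder values for illustration only.
P_CURVE = [(2.0, 10, 100), (3.0, 20, 140), (4.0, 35, 170)]
E_CURVE = [(1.5, 5, 60), (2.5, 10, 85), (3.5, 18, 100)]

def best_combo(budget_w):
    """Return (throughput, P clock, E clock) for the best pairing that
    fits under budget_w, or None if nothing fits."""
    best = None
    for p_clk, p_w, p_tp in P_CURVE:
        for e_clk, e_w, e_tp in E_CURVE:
            if p_w + e_w <= budget_w:
                cand = (p_tp + e_tp, p_clk, e_clk)
                if best is None or cand > best:
                    best = cand
    return best

print(best_combo(30))  # -> (225, 3.0, 2.5)
```

With real curves you'd sweep the budget across the whole power range and plot the resulting throughput envelope, which is what showed the E-cores pulling ahead.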
Yes, I had better things to spend my time on, but I like a good challenge and was genuinely curious about the outcome.
Yes! I used the data I had. Sadly, the only Alder Lake I have access to is a 65W non-K variant. So, I would be even further restricted on clockspeeds and boost durations, if I tried to reproduce their experiment.
My hope was simply that neither 4 E-cores nor 4 P-cores would be bottlenecked too badly, in that 16-core chip. However, I'm well aware that the 4 E-cores were sharing a cluster (I believe Alder Lake's E-cores can only be disabled on a per-cluster basis), and all E-cores in a cluster share the same slice of L2 cache. So, that actually makes the E-cores look worse than if we had the same data for 1T on each class of core.
Thanks for reminding me of that. That means, if anything, the data is more pessimistic about E-core performance than P-core performance.
I can think of many plausible reasons for that. One being: as you increase clock speeds, cache fetches & memory accesses take more cycles, because the number of nanoseconds those things take doesn't change but you've got more cycles per nanosecond. So, that stresses the out-of-order buffers' ability to find useful work to do, while the thread is waiting for its data.
I'm sure there are other reasons, but that's one where I think we can probably say these buffer sizes are targeted at a given clockspeed. Running above that target should naturally result in some degree of drop-off. Of course, the effect is going to be workload-dependent - a memory-heavy workload will fill up the request queues, pushing absolute latencies higher.
Anyway, thank you so much for taking the time to read what I posted and think about it. That's not something I take for granted, around here!
; )
I appreciate your keen insight.