AMD's Future Chips & SoC's: News, Info & Rumours.

Status
Not open for further replies.
While reading TOP500 news I stumbled on a couple of sentences that, taken together, make me think we are going to see several AMD Rome systems in future TOP500 lists.

Intel’s claims on Cascade Lake AP’s performance: 1.21x higher Linpack performance compared to the 28-core Xeon 8180 (“Skylake”) processor.
To highlight the chip’s floating point performance, AMD ran the standard C-Ray rendering benchmark on-stage at the event, using a single-socket server outfitted with a pre-production version of Rome. When matched against a dual-socket Xeon Platinum 8180M server, the AMD box ran the benchmark to completion first.
Although C-Ray is not Linpack, it gives an idea of the upcoming Rome performance.

Sources:
https://www.top500.org/news/amd-takes-aim-at-performance-leadership-with-next-generation-epyc-processor/
https://www.top500.org/news/intel-steers-back-to-hpc-with-cascade-lake-ap-processor/

 


EPYC2.png


The slide has EPYC misspelled in it, as it is misspelled in a number of places in the slide deck, along with other mistakes – the speaker’s first language is not English.

The slide describes Hawk as a 640,000-core system using 64-core Rome CPUs. That’s 10,000 CPUs, listed at a throughput of 24.06 PetaFLOPS. That gives 2.4 TeraFLOPS per CPU, which, at 16 SP flops/cycle, works out about right for a 64-core CPU at the 2.35 GHz frequency mentioned in the slide. This would appear to be a base frequency, although AMD’s Naples processors do offer a ‘constant frequency’ mode to ensure consistent performance. Main memory was listed as 665 TB, with 26 PB of disk space. The breakdown of how the disk is split between NAND and HDDs was not disclosed.
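The arithmetic in that paragraph can be checked in a few lines. This is only a sanity check of the quoted figures; the variable names are mine, and the 16 flops/cycle per core is taken as quoted rather than verified.

```python
# Figures as quoted from the Hawk slide write-up above.
TOTAL_CORES = 640_000
CORES_PER_CPU = 64
SYSTEM_PFLOPS = 24.06
CLOCK_GHZ = 2.35
FLOPS_PER_CYCLE = 16  # per core, as quoted (not independently verified)

cpus = TOTAL_CORES // CORES_PER_CPU

# Per-CPU throughput derived two ways: from the system total, and from the clock.
tflops_per_cpu_from_system = SYSTEM_PFLOPS * 1000 / cpus
tflops_per_cpu_from_clock = CORES_PER_CPU * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000

print(f"CPUs: {cpus}")
print(f"TFLOPS per CPU (from system total): {tflops_per_cpu_from_system:.3f}")
print(f"TFLOPS per CPU (from clock):        {tflops_per_cpu_from_clock:.3f}")
```

Both routes land at roughly 2.4 TFLOPS per CPU, which is why 2.35 GHz looks like the clock the peak figure was computed from.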
BZOQS5D.png

https://www.amd.com/en/products/cpu/amd-epyc-7601

***SPECULATION***
If 2.35GHz is the base clock of Rome, that is ~6.8% higher than Epyc 7601.
PgyvZZ7.png

https://www.amd.com/en/products/cpu/amd-ryzen-7-2700x
If that 6.8% higher base clock translates to desktop Ryzen, applying it to the Ryzen 7 2700X gives roughly a 3.95GHz base clock and a 4.59GHz max boost clock.
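For anyone who wants to redo the speculation with other numbers, here is the same scaling as a short sketch. The EPYC 7601 and Ryzen 7 2700X clocks are my reading of the two linked spec pages (2.2GHz base; 3.7GHz base and 4.3GHz max boost), so treat them as assumptions, and the whole thing assumes the uplift carries over linearly, which is pure speculation.

```python
# Assumed clocks from the linked AMD spec pages (not stated in the post text).
EPYC_7601_BASE_GHZ = 2.2
R7_2700X_BASE_GHZ = 3.7
R7_2700X_BOOST_GHZ = 4.3

# Clock mentioned on the Hawk slide, assumed here to be Rome's base clock.
ROME_CLOCK_GHZ = 2.35

uplift = ROME_CLOCK_GHZ / EPYC_7601_BASE_GHZ

projected_base = R7_2700X_BASE_GHZ * uplift
projected_boost = R7_2700X_BOOST_GHZ * uplift

print(f"Uplift over EPYC 7601: {(uplift - 1) * 100:.1f}%")
print(f"Projected desktop base:  {projected_base:.2f} GHz")
print(f"Projected desktop boost: {projected_boost:.2f} GHz")
```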
 
So guys, we have a lot of info coming in. What do you expect from Zen 2? Without Juan we no longer have a lot of speculation.

With the latest leaks, I have to say I expect 5% higher clock speeds, and given the info we got on Rome I expect a good 7% IPC gain. Going by the leaks, AMD now also supports AVX2 the same way Intel does.

https://arstechnica.com/gadgets/2018/11/amd-outlines-its-future-7nm-gpus-with-pcie-4-zen-2-zen-3-zen-4/

Happy to see AMD not only increase core counts but also increase IPC, their main weakness. I hope AMD does everything it can to improve IPC rather than increasing frequency alone.


With the given FP performance and IPC upgrades, dare I say it? I expect Zen 2 to meet or beat Intel's Coffee Lake IPC but lose on overall frequency, while doing so with a higher core count.


What do others expect? I hope I'm right, but I could be wrong! This site is boring without Juan, with barely any movement in posts. I miss the old days, but hey, whatever!
 
Let's take a look at where AMD's IPC is in comparison to Intel clock for clock.
Cinebench.png

Ashes.png

https://www.techspot.com/amp/article/1616-4ghz-ryzen-2nd-gen-vs-core-8th-gen/

i7-8700K@4GHz 174/1325
R5-2600X@4GHz 168/1384

Ryzen trades single-thread performance for a better implementation of SMT, but these metrics are very close.
Intel has about 3.6% better single-thread performance, while AMD has about 4.5% better multithread performance.
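The two leads can be recomputed from the score pairs above. A minimal sketch, assuming the pairs are Cinebench single-thread/multi-thread scores at 4GHz (that reading comes from the surrounding text, not from a label on the numbers); `lead_percent` is just a helper name I made up.

```python
# Scores as posted above: (single thread, multi thread), both chips at 4GHz.
scores = {
    "i7-8700K": (174, 1325),
    "R5-2600X": (168, 1384),
}

def lead_percent(winner, loser):
    """Percentage by which `winner` exceeds `loser`."""
    return (winner / loser - 1) * 100

intel_st, intel_mt = scores["i7-8700K"]
amd_st, amd_mt = scores["R5-2600X"]

st_lead = lead_percent(intel_st, amd_st)   # Intel ahead in single thread
mt_lead = lead_percent(amd_mt, intel_mt)   # AMD ahead in multi thread

print(f"Intel single-thread lead: {st_lead:.1f}%")
print(f"AMD multi-thread lead:    {mt_lead:.1f}%")
```

Either way you round it, both gaps stay under 5%, which is the point of the comparison.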

The first question I ask myself: if single-thread performance is so close, why do we see a big gap in gaming?
i7-8700K@4GHz 126/110
R5-2600X@4GHz 112/98

I think it comes down to infinity fabric vs. ring bus latency.
ojt1Rta.png

https://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-review,5014-2.html



The AMD 2nd Gen Ryzen Deep Dive: The 2700X, 2700, 2600X, and 2600 Tested
The numbers AMD gives are:

13% Better L1 Latency (1.10ns vs 0.95ns)
34% Better L2 Latency (4.6ns vs 3.0ns)
16% Better L3 Latency (11.0ns vs 9.2ns)
11% Better Memory Latency (74ns vs 66ns at DDR4-3200)
Increased DRAM Frequency Support (DDR4-2666 vs DDR4-2933)
https://www.anandtech.com/show/12625/amd-second-generation-ryzen-7-2700x-2700-ryzen-5-2600x-2600/3
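The quoted percentages follow from the before/after latencies if "better" is read as the reduction relative to the older figure. That reading is my assumption; a quick check:

```python
# (old latency ns, new latency ns) as listed in the quoted AMD figures.
latencies_ns = {
    "L1": (1.10, 0.95),
    "L2": (4.6, 3.0),
    "L3": (11.0, 9.2),
    "Memory": (74.0, 66.0),
}

def reduction_percent(old_ns, new_ns):
    """Latency reduction relative to the old value, in percent."""
    return (old_ns - new_ns) / old_ns * 100

reductions = {
    level: reduction_percent(old_ns, new_ns)
    for level, (old_ns, new_ns) in latencies_ns.items()
}

for level, pct in reductions.items():
    print(f"{level}: {pct:.1f}% lower latency")
```

The results sit within a point of the quoted 13%, 34%, 16% and 11%, so the slide figures look like rounded or truncated versions of this calculation.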

Intel benefits from faster clocks, as they reduce internal latency.
AMD benefits from faster, lower-latency RAM, as it reduces internal latency.

If AMD works on its internal latency, we should see better results compared to Intel's ring bus.

Edit: Added for clarity:
pastedImage_11.png

https://community.amd.com/community/gaming/blog/2017/07/14/memory-oc-showdown-frequency-vs-memory-timings
 


As for IPC, rumors point to 10-15% IPC increase.
More specific rumors said 13% over zen+ and 16% over zen1
There are also rumors of UP TO a 29% IPC increase. That is perfectly plausible for selected applications benefiting from the 256-bit data path for AVX. The 256-bit data path came as a surprise to some.
Of course rumors are rumors ;-)

As for clock I've seen no rumors yet. Pessimists point to +5%, optimists suggest up to +20%
Realistically speaking I don't expect top clocks beyond 5GHz, that would be my most optimistic upper bound. But certainly I expect top clocks beyond 4.5 GHz.

 


Based on what I could see from the presentations and slides...there are a couple of things:

L3 cache is doubled on Rome, so I expect the same CCXs in the consumer version. Expect the doubled L3 cache to reduce pulls from system memory, which will account for an overall increase in real-world IPC.

Additionally, the improved branch prediction comments are interesting. Depending on how they plan to do it, it may bring significant gains across practical real-world applications. For example, increasing the size of the micro-op cache would raise IPC for little cost in die space. However, they could also be looking at improved fetch and pre-fetch algorithms to help. If they delve too deeply into that, they need to be mindful of exploits...but AMD seems to have weathered that much better than Intel to this point.

I expect core speeds and turbo speeds to increase in the 5-10% range. 5% is probably close, but optimistically, some EPYC SKUs were as much as 7-8% faster over their predecessors, which could bode well if it pans out in the HEDT SKUs. If they can get around 7% improved clock speed going from 32 to 64 cores, and close to 10% on the 32 core models, I feel like the HEDT side may be in better shape than we suspect.

Having said that, the HEDT SKUs have much smaller surface area, and are compacted into a much smaller space to cool, and so those improvements on the massive MCM server dies may be tempered by less surface area, and more concentrated heat.

I think the chiplets will allow them to do some crazy things...like maybe 10-12 core SKUs on X570, and I expect a further refined turbo core solution yet again.

End result overall, I suspect AMD will be slightly ahead of Intel in IPC, but still lag by close to 10% clock speed (give or take).

So, I anticipate they will be outright better at some things, but still lose out in single thread due to raw clock speed on Intel parts.

If/when Intel jumps to their next node, I anticipate the clock speed advantage will evaporate, and things will be much more interesting.
 
"L3 cache is doubled on Rome, so I expect the same CCXs in the consumer version. Expect the doubled L3 cache to reduce pulls from system memory, which will account for an overall increase in real-world IPC."

Counterpoint: A larger L3 cache will have higher access latency. There's a point where apps that aren't inherently memory bottlenecked start to lose performance when using larger slower caches.
 
Keep in mind that L3 cache on Ryzen is victim cache.
Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested cache line. A victim cache is a hardware cache designed to decrease conflict misses and improve hit latency for direct-mapped caches.
https://en.wikipedia.org/wiki/Victim_cache
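To make the Wikipedia definition concrete, here is a toy simulation of a direct-mapped cache with a small fully associative victim cache behind it. It is only a sketch of the textbook idea: the class name, sizes and access pattern are made up, LRU replacement in the victim cache is my choice, and none of it models Zen's actual L3.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Toy direct-mapped cache backed by a small fully associative victim cache."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one block address per set
        self.victims = OrderedDict()        # evicted blocks, oldest first (LRU)
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            self.hits += 1
            return "hit"
        evicted = self.lines[index]
        if block in self.victims:
            # Swap: the requested block moves back into the main cache.
            del self.victims[block]
            self.victim_hits += 1
            result = "victim hit"
        else:
            self.misses += 1
            result = "miss"
        self.lines[index] = block
        # The displaced line (the "victim") goes into the victim cache.
        if evicted is not None and self.victim_entries > 0:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)
        return result

# Blocks 0 and 4 map to the same set, so they keep evicting each other.
pattern = [0, 4, 0, 4, 0, 4]

with_victim = DirectMappedWithVictim(num_sets=4, victim_entries=2)
without_victim = DirectMappedWithVictim(num_sets=4, victim_entries=0)
for block in pattern:
    with_victim.access(block)
    without_victim.access(block)

print("with victim cache:   ", with_victim.misses, "misses,", with_victim.victim_hits, "victim hits")
print("without victim cache:", without_victim.misses, "misses")
```

With the ping-pong pattern above, only the first two accesses go all the way to memory when the victim cache is present; without it, every access is a conflict miss.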

The Core Complex, Caches, and Fabric
Many core designs often start with an initial low-core-count building block that is repeated across a coherent fabric to generate a large number of cores and the large die. In this case, AMD is using a CPU Complex (CCX) as that building block which consists of four cores and the associated caches.
Each core will have direct access to its private L2 cache, and the 8 MB of L3 cache is, despite being split into blocks per core, accessible by every core on the CCX with ‘an average latency’. L3 hits nearer to the core will also have a lower latency due to the low-order address interleave method of address generation.

The L3 cache is actually a victim cache, taking data from L1 and L2 evictions rather than collecting data from prefetch/demand instructions. Victim caches tend to be less effective than inclusive caches; however, Zen counters this by having a sufficiently large L2 to compensate. The use of a victim cache means that it does not have to hold L2 data inside, effectively increasing its potential capacity with less data redundancy.

It is worth noting that a single CCX has 8 MB of cache, and as a result the 8-core Zen being displayed by AMD at the current events involves two CPU Complexes. This affords a total of 16 MB of L3 cache, albeit in two distinct parts. This means that the true LLC for the entire chip is actually DRAM, although AMD states that the two CCXes can communicate with each other through the custom fabric which connects both the complexes, the memory controller, the IO, the PCIe lanes etc.
HC28.AMD.Mike%20Clark.final-page-013.jpg

The cache representation shows L1 and L2 being local to each core, followed by 8MB of L3 split over several cores. AMD states that the L1 and L2 bandwidth is nearly double that of Excavator, with L3 now up to 5x for bandwidth, and that this bandwidth will help drive the improvements made on the prefetch side. AMD also states that there are large queues in play for L1/L2 cache misses.
https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/9
 


Bandwidth != latency. The fact that bandwidth increased implies an increase in average access time.
 
"Bandwidth != latency. The fact Bandwidth increased infers an increase in average access time."
Increasing bandwidth offsets latency, doesn't it? I know we are not talking about system memory, but wouldn't this equation also apply to cache?
true latency (ns) = clock cycle time (ns) x number of clock cycles (CL)
The latency paradox
Latency is often misunderstood because on product flyers and spec comparisons, it's noted in CL, which is only half of the latency equation. Since CL ratings only indicate the total number of clock cycles, they don't have anything to do with the duration of each clock cycle, and thus, they shouldn't be extrapolated as the sole indicator of latency performance.

By looking at a module's latency in terms of nanoseconds, you can best judge if one module is, in fact, more responsive than another. To calculate a module's true latency, multiply clock cycle duration by the total number of clock cycles. These numbers will be noted in official engineering documentation on a module's data sheet. Here's what these calculations look like.
In the history of memory technology, as speeds have increased, clock cycle times have decreased, resulting in lower true latencies as technology has matured, even though there are more clock cycles to complete. What's more, since speeds are increasing and true latencies are remaining roughly the same, you're able to achieve a higher level of performance using newer, faster, and more energy efficient memory.
Which is more important: speed or latency?
Based on in-depth engineering analysis and extensive testing in the Crucial Performance Lab, the answer to this classic question is speed. In general, as speeds have increased, true latencies have remained approximately the same, meaning faster speeds enable you to achieve a higher level of performance. True latencies haven't necessarily increased, just CAS latencies. And CL ratings are an inaccurate, and often misleading, indicator of true latency (and memory) performance.
J8p9VsO.png


http://www.crucial.com/usa/en/memory-performance-speed-latency
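The true-latency equation quoted above is easy to try out. A minimal sketch: the function name is mine, the two modules are hypothetical examples rather than figures from the Crucial article, and I am assuming the usual DDR convention that the memory clock is half the advertised data rate.

```python
def true_latency_ns(data_rate_mts, cas_latency):
    """True latency (ns) = clock cycle time (ns) x number of clock cycles (CL)."""
    clock_mhz = data_rate_mts / 2        # DDR: two transfers per clock cycle
    cycle_time_ns = 1000 / clock_mhz
    return cycle_time_ns * cas_latency

# Hypothetical modules, chosen only to illustrate the point.
slower_kit = true_latency_ns(data_rate_mts=2400, cas_latency=15)
faster_kit = true_latency_ns(data_rate_mts=3200, cas_latency=16)

print(f"DDR4-2400 CL15: {slower_kit:.2f} ns")
print(f"DDR4-3200 CL16: {faster_kit:.2f} ns")
```

In this example the faster kit has the higher CL rating but the lower true latency, which is the "paradox" the article describes.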
 
"Increasing bandwidth offsets latency, doesn't it?"

No matter what your bandwidth is, you are still bound by your minimum access time. The extra bandwidth certainly helps with larger reads, but the majority of cache access is for smaller reads, where the increased latency will affect performance.
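A very simplified way to see this is to model a read as a fixed access latency plus a transfer time of size divided by bandwidth. The model and every number in it are hypothetical, picked for illustration; real caches pipeline and overlap requests, which this ignores.

```python
def read_time_ns(size_bytes, latency_ns, bandwidth_bytes_per_ns):
    """Simplified model: fixed access latency plus transfer time."""
    return latency_ns + size_bytes / bandwidth_bytes_per_ns

# Two made-up designs: (latency in ns, bandwidth in bytes per ns).
low_latency = (10.0, 32.0)
high_bandwidth = (12.0, 64.0)

for size in (64, 128, 4096):
    t_low = read_time_ns(size, *low_latency)
    t_high = read_time_ns(size, *high_bandwidth)
    print(f"{size:>5} bytes: low-latency {t_low:.1f} ns, high-bandwidth {t_high:.1f} ns")
```

Under these assumptions the low-latency design wins for a single 64-byte line, the two tie at 128 bytes, and the high-bandwidth design only pulls ahead for larger transfers.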
 


The purpose of L3 victim cache is to reduce the penalty for a missed branch prediction.

The way the Ryzen architecture is structured, with large L1/L2 caches and a massive L3 victim cache, the point of the large L3 cache is to reduce the time to fetch if the prediction algorithm misses a prediction.

So, essentially, L3 is only ever used on a missed prediction for 95% of applications...this means that missed predictions will incur a lesser penalty and will result in an overall improvement in IPC because the penalty for a missed prediction of a code fork will be reduced.

Intel cache is *not* victim cache, and because of that, their latency is much more important.
 
AMD Dominates Retail CPU Sales in Germany’s Largest E-Tailer Where It Outsold Intel 2 to 1
By Khalid Moammer
Dec 1

For every one Intel processor sold at Mindfactory.de last month, buyers purchased approximately two AMD processors.
ea2qf7t.png

l5STeKX.png

0yRDMZw.png

y11PjUd.png
https://wccftech.com/amd-dominates-retail-cpu-sales-outselling-intel-2-to-1/

 
AMD EPYC 7371 Review Now The Fastest 16 Core CPU
By Patrick Kennedy - December 3, 2018

A great example of a community that should clamor for the AMD EPYC 7371 is the Windows Server community. The base license for Windows Server 2019 Standard and Datacenter is 16 cores, which aligns perfectly with the 16-core EPYC 7371. With the release of the AMD EPYC 7371, AMD will have a chip with more performance than the Intel Xeon Scalable family at 16 cores. It will have more memory capacity and PCIe lanes than the Intel Xeon Scalable family. That will change the industry’s narrative on AMD EPYC v. Intel Xeon.
The AMD EPYC 7371 is the chip that fixes AMD’s Achilles heel: clock speed. Until now, Intel has had an advantage in clock speed while AMD has had an advantage in core counts with their current generations. With a 3.6GHz all core turbo and a 3.8GHz 8-core turbo, the AMD EPYC 7371 now matches or exceeds virtually every public Intel Xeon Scalable SKU.

Key stats for the AMD EPYC 7371: 16 cores / 32 threads, 3.1GHz base, and 3.6GHz all core turbo. With 8 cores active, the AMD EPYC 7371 hits 3.8GHz turbo boost clocks. Feeding these high-speed cores is a whopping 64MB L3 cache or 4MB per core. The CPU features a 200W TDP which is the hottest AMD EPYC 7001 series CPU we have seen in the lab.
https://www.servethehome.com/amd-epyc-7371-review-now-the-fastest-16-core-cpu/

Lots of benchmarks to look through.
 
Something tells me these specs are fake. They seem too good to be true; hope I'm wrong.

I mean I think we will top out at 12C/24T and that should be enough to compete quite well with Intel's rumored 10C/20T chip.
 


Entry level @ 6 Cores 12 threads - I would love it to happen ... still it looks too good
 