News AMD 4800S Xbox Chip Shows Perils of Pairing CPUs with GDDR6

bit_user

Titan
Ambassador
Thanks for posting, but I already saw benchmarks of this (including latency) months ago. I forget where.

What's interesting is that it did very nicely on memory bandwidth, matching or possibly even beating the bandwidth of anything out there on DDR4. And keep in mind that the CPU portion of the SoC doesn't get access to the entire GDDR6 data rate.
 
  • Like
Reactions: artk2219

bit_user

Titan
Ambassador
Low Latency matters for Real Time Gaming & Responsiveness in Applications
Memory latency is on the order of 100 nanoseconds (10^-7), while humans can only perceive latency on the order of 10s of milliseconds (10^-2). So, that's 5 orders of magnitude difference, or a ratio of about 100k.

Therefore, the only way that you would perceive higher-latency memory is if you're running CPU-bound software that makes heavy use of main memory and gets a comparatively poor L3 cache hit rate. That covers some games, but not all.

Furthermore, it's probably not much different than if you just had low-latency memory that was also a lot lower-bandwidth. In other words, it's just one of many performance parameters that can potentially affect system performance, if it's worsened. It's not that special, particularly when you consider that CPUs can use techniques like caches, prefetching, and SMT to partially hide it.
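To put rough numbers on that hit-rate point, here's a back-of-envelope average-memory-access-time sketch in C++ (the L3 and DRAM latencies are purely illustrative assumptions, not measurements of the 4800S):

```cpp
// Back-of-envelope average memory access time (AMAT):
//   AMAT = hit_rate * L3_latency + miss_rate * DRAM_latency
// All latency figures below are illustrative assumptions, not 4800S data.
#include <cstdio>

int main() {
    const double l3_ns = 10.0;          // assumed L3 hit latency
    const double dram_fast_ns = 70.0;   // assumed low-latency DDR4
    const double dram_slow_ns = 120.0;  // assumed higher-latency GDDR6
    const double hit_rates[] = {0.99, 0.95, 0.80};

    for (double hit : hit_rates) {
        double miss = 1.0 - hit;
        double fast = hit * l3_ns + miss * dram_fast_ns;
        double slow = hit * l3_ns + miss * dram_slow_ns;
        std::printf("L3 hit rate %.0f%%: %5.1f ns vs %5.1f ns (+%.0f%%)\n",
                    hit * 100, fast, slow, (slow / fast - 1.0) * 100);
    }
}
```

With these made-up numbers, a 99% hit rate turns the slower DRAM into a few-percent difference, while an 80% hit rate turns it into roughly a 45% difference, which is exactly the memory-intensive, cache-unfriendly case I'm talking about.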
 
  • Like
Reactions: artk2219

bit_user

Titan
Ambassador
This write-up makes it sound like the choice of GDDR6 is just a bad decision, but it's there for the graphics, and this chip doesn't have a working GPU.
Exactly. Sony and MS knew what they were doing, when they opted for a unified pool of GDDR6 memory. The goal was to build the fastest machine possible, at a specific price point.

I wonder what the fastest new machine is that you could have specced out with mid-2020-era hardware, at mid-2020 MSRPs, while matching the console's price point. Or, if you tried to build an identically-performing machine at launch (again, assuming non-pandemic prices), how much would it have cost?
 
  • Like
Reactions: JamesJones44

tamalero

Distinguished
Memory latency is on the order of 100 nanoseconds (10^-7), while humans can only perceive latency on the order of 10s of milliseconds (10^-2). So, that's 5 orders of magnitude difference, or a ratio of about 100k.

Therefore, the only way that you would perceive higher-latency memory is if you're running CPU-bound software that makes heavy use of main memory and gets a comparatively poor L3 cache hit rate. That covers some games, but not all.

Furthermore, it's probably not much different than if you just had low-latency memory that was also a lot lower-bandwidth. In other words, it's just one of many performance parameters that can potentially affect system performance, if it's worsened. It's not that special, particularly when you consider that CPUs can use techniques like caches, prefetching, and SMT to partially hide it.
I'd say human perception is irrelevant here, since a single delay can cause a cascade of events in the computer that slows the chain down even further, until we see what we call "low fps".
 

bit_user

Titan
Ambassador
I'd say human perception is irrelevant here, since a single delay can cause a cascade of events in the computer that slows the chain down even further, until we see what we call "low fps".
Although they have limits, modern computers are designed with queues, buffering, prefetching, speculation, threading, etc. to maximize system throughput, even in the face of heavy memory accesses.

It's pretty rare that you can have one of these mechanisms backfire, but it's certainly possible. Cache-thrashing is a classic example, where a problematic access pattern can trigger an order of magnitude worse performance than raw memory bandwidth should be able to support. Something similar (but hopefully not nearly as bad) could happen with prefetching and speculative execution.

If games are well-optimized, they should mostly avoid those pitfalls. If not, then something like cache-thrashing could be compounded by longer memory latency, but the effect should still be linear (i.e. only proportional to however much worse the latency is). Nonlinearities are where things get interesting, like when your working set grows by just a little bit, but it's enough to blow out of L3 cache and you're suddenly hit with that much larger GDDR6 latency.
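As a concrete sketch of the kind of access-pattern cliff I mean (the matrix size and the column-major stride are arbitrary; the real threshold depends on the cache hierarchy):

```cpp
// Same work, two access patterns: row-major streams through memory and is
// prefetch-friendly; column-major strides across rows and thrashes the caches
// once the matrix no longer fits in L3. Sizes here are illustrative.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 8192;                      // 8192 x 8192 floats = 256 MiB
    std::vector<float> m(n * n, 1.0f);
    volatile float sink = 0.0f;

    auto time_it = [&](bool row_major) {
        auto t0 = std::chrono::steady_clock::now();
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                sum += row_major ? m[i * n + j]   // contiguous accesses
                                 : m[j * n + i];  // large strides, cache-hostile
        sink = sink + sum;
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    std::printf("row-major:    %.3f s\n", time_it(true));
    std::printf("column-major: %.3f s\n", time_it(false));
}
```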
 

Thunder64

Distinguished
Thanks for posting, but I already saw benchmarks of this (including latency) months ago. I forget where.

What's interesting is that it did very nicely on memory bandwidth, matching or possibly even beating the bandwidth of anything out there on DDR4. And keep in mind that the CPU portion of the SoC doesn't get access to the entire GDDR6 data rate.

Probably here. It is well known that GDDR trades latency for bandwidth. GPUs like bandwidth, as they are very parallel; CPUs want low latency.
 

InvalidError

Titan
Moderator
Memory latency is on the order of 100 nanoseconds (10^-7), while humans can only perceive latency on the order of 10s of milliseconds (10^-2). So, that's 5 orders of magnitude difference, or a ratio of about 100k.
Memory latency penalty hits are cumulative. While individual events may be imperceptible at the human scale, their cumulative effect when they occur by the millions per second does become macroscopically evident and that is how stutters go away when you do things like upgrade from 2400-18 memory to 3200-16.
Although they have limits, modern computers are designed with queues, buffering, prefetching, speculation, threading, etc. to maximize system throughput, even in the face of heavy memory accesses.
You can throw all of the cache, buffering, prefetching, etc. you want at a CPU; there will always be a subset of accesses that cannot be predicted or mitigated, with some algorithms being worse offenders than others. Good luck optimizing memory accesses for things that use hashes to index into a sparse array, such as a key-value store, for one common example. Parsing is chock-full of input-dependent conditional branches that the CPU cannot do anything about until the inputs are known, and the outcomes will likely either get stuffed into a key-value store or require looking things up from one.
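A quick sketch of what I mean (table size and lookup count are arbitrary): hashed keys land at effectively random addresses, so nearly every probe is a dependent miss that no prefetcher can predict.

```cpp
// Sketch: lookups into a large hash map follow effectively random addresses,
// so almost every probe is a cache/TLB miss the prefetcher cannot predict.
// Table size and key count are arbitrary illustrative values.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>
#include <vector>

int main() {
    const size_t entries = 1 << 23;             // ~8M entries, far bigger than L3
    std::unordered_map<uint64_t, uint64_t> kv;
    kv.reserve(entries);
    for (uint64_t k = 0; k < entries; ++k)
        kv.emplace(k * 0x9E3779B97F4A7C15ULL, k);  // spread keys across buckets

    std::vector<uint64_t> probe_keys;
    std::mt19937_64 rng(42);
    for (size_t i = 0; i < 1'000'000; ++i)
        probe_keys.push_back((rng() % entries) * 0x9E3779B97F4A7C15ULL);

    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint64_t k : probe_keys)
        sum += kv.find(k)->second;              // each find() is a dependent miss
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg lookup: %.0f ns (sum=%llu)\n", ns / probe_keys.size(),
                (unsigned long long)sum);
}
```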

GPUs don't mind GDDRx's increased latency because most conventional graphics workloads are embarrassingly parallel and predictable.
 

JamesJones44

Reputable
I'm not sure I would call this a like-for-like comparison. From a pure CPU point of view, I get it and agree. However, without being able to test the integrated GPU vs. an equivalent dGPU, it's hard to say if there are or aren't benefits for a straight-up Zen 2 build. This is especially true for the frame-rate tests.

I also wonder how optimized the drivers are for the 4800S and its features. My guess is not very much, but for this argument I'll assume they are optimized enough.
 

InvalidError

Titan
Moderator
I'm not sure I would call this a like-for-like comparison. From a pure CPU point of view, I get it and agree. However, without being able to test the integrated GPU vs. an equivalent dGPU, it's hard to say if there are or aren't benefits for a straight-up Zen 2 build. This is especially true for the frame-rate tests.
The 4800S is a CPU-only SKU. There is no working IGP to test against anything else.
 
  • Like
Reactions: palladin9479

watzupken

Reputable
I don't know if the GDDR6 really matters, particularly for consoles. They are not meant to run a general-purpose OS. And I am not too sure about the results from DF either. Maybe I'm remembering wrongly, but the 4800S system had a PCIe 4.0 x4 slot vs. at least a PCIe 3.0 x16 slot for the remaining setups. How that impacts performance in games should be factored in, I think.
 
High latency of GDDR-based memory may not be an issue for consoles with regard to gaming anyway, since most of them target either 60 or 30 FPS. It might be a problem if a game is trying to target 120 FPS, but as long as everything is well-structured, this likely isn't much of a problem anyway.

Especially when you consider what Naughty Dog's done with the PlayStation.
 

bit_user

Titan
Ambassador
Memory latency penalty hits are cumulative. While individual events may be imperceptible at the human scale, their cumulative effect when they occur by the millions per second does become macroscopically evident
Yes, but that doesn't negate what I said. The effect of accumulated memory accesses is still linear, and my statement stands: you'll see it when a CPU-bound app is memory-intensive and has a lot of L3 misses.

The main point I was trying to make is that it's just one of many parameters that can affect system performance. Memory latency isn't more important for realtime apps than, say, CPU clock speed. In fact, quite the reverse is true: doubling memory latency will typically have far less impact on such an app than halving the CPU clock speed!

People seem to have a false or exaggerated association between "memory latency" and "frame latency", and that's what I was trying to clear up. However, as usual, you missed the forest for the trees and tried to nitpick all the insignificant details until we're way off track.

and that is how stutters go away when you do things like upgrade from 2400-18 memory to 3200-16.
To make this point more compelling, you should cite an example where only the latency is reduced. In the benchmarks I've seen, DRAM latency plays only a minor role in system performance. It also matters a lot whether you're talking about an iGPU or a dGPU; iGPU performance is disproportionately sensitive to DRAM performance.

You can throw all of the cache, buffering, prefetching, etc. you want at a CPU, there will always be a subset of accesses that cannot be predicted or mitigated with some algorithms being worse offenders than others.
Depends on how many. If the number is small, then you can do the math and find that the total impact on system performance will be small. As I said, sensitivity to memory latency is application-dependent.

More importantly, you're taking my point out of context. I was responding to a notion @tamalero raised that memory latency has a super-linear impact on system performance. In fact, the reverse is true, and this isn't by accident. CPUs are engineered to be insulated as much as possible from memory latency and bandwidth.

Is it so important to "correct" my post, by stating a fact so obvious that I deemed it implicit, that you risk distracting from or even seeming to disagree with my core contention? Really, sometimes I think you should reflect on why you're posting here and what you hope to accomplish. It's only by knowing what you're trying to do that you can evaluate and improve how well you're doing it.

Good luck optimizing memory accesses to things that use hashes for indexing into a sparse array such as a key-value store for one common example.
Heh, nice try. I've actually done some micro-benchmarking to quantify such things. They're rarely as bad as you imagine, because memory allocators are optimized to do things like provide "hot" and contiguous memory. For instance, you have to work fairly hard to get a linked list to have no locality. If you just build a linked list in one go, you'll find that many of the entries will actually be contiguous, and CPUs' prefetchers and caches are very good at optimizing contiguous memory accesses.
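A minimal sketch of that kind of micro-benchmark (node count is arbitrary): build the list in one go, then traverse the same nodes once in allocation order and once in a shuffled link order.

```cpp
// Sketch: a list built in one allocation burst tends to get near-contiguous
// nodes, which prefetchers handle well; randomizing the link order destroys
// that locality. Node count is an arbitrary illustrative value.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Node { Node* next; long value; };

static double traverse(Node* head) {
    auto t0 = std::chrono::steady_clock::now();
    long sum = 0;
    for (Node* p = head; p; p = p->next) sum += p->value;
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum=%ld ", sum);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t n = 1 << 24;                 // ~16M nodes
    std::vector<Node> pool(n);                // allocated "in one go" -> contiguous

    // Sequential link order: node i -> node i+1.
    for (size_t i = 0; i + 1 < n; ++i) pool[i] = {&pool[i + 1], (long)i};
    pool[n - 1] = {nullptr, (long)(n - 1)};
    std::printf("sequential: %.1f ms\n", traverse(&pool[0]));

    // Shuffled link order: same nodes, random successor -> poor locality.
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64(42));
    for (size_t i = 0; i + 1 < n; ++i) pool[order[i]].next = &pool[order[i + 1]];
    pool[order[n - 1]].next = nullptr;
    std::printf("shuffled:   %.1f ms\n", traverse(&pool[order[0]]));
}
```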

GPUs don't mind GDDRx's increased latency because most conventional graphics workloads are embarrassingly parallel and predictable.
It's not really true that they're predictable. Ray tracing would be the prime example, but shaders can be arbitrarily complex and involve divergent branches, scatter/gather, variable numbers of light sources, textures, etc. TAA is implemented as a shader, for instance. The main way GPUs deal with latency is by having rather extreme levels of SMT. When one wavefront/warp is stalled on some loads or stores, there's quite likely another in the EU/CU/SM that's ready to go.

Another key thing about GPUs is that the cores operate at a lower clock speed. If you halve the clock speed of the cores, then memory latency, measured in core clock cycles, is also cut in half. That's not the main reason they use lower clock speeds, but it's a nice side benefit.
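In cycle terms (assuming a flat 100 ns DRAM round-trip, purely for illustration):

```cpp
// The same 100 ns of DRAM latency, seen from cores running at different clocks.
// The 100 ns round-trip figure is illustrative, not a measurement.
#include <cstdio>

int main() {
    const double latency_ns = 100.0;
    const double clocks_ghz[] = {4.0, 2.0, 1.0};   // CPU-like vs GPU-like clocks
    for (double ghz : clocks_ghz)
        std::printf("%.1f GHz core: 100 ns = %.0f stalled cycles\n",
                    ghz, latency_ns * ghz);
}
```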
 
Last edited:

bit_user

Titan
Ambassador
I don't know if the GDDR6 really matters, particularly for consoles. They are not meant to run a general-purpose OS.
Console games might be optimized to do a better job of staying within the limited L3 cache of their particular CPU, but the fact that games tend to benefit so much from the extra L3 cache in AMD's X3D CPUs tells me that they tend to have enough cache misses that they should also be significantly affected by higher DRAM latencies.
 

InvalidError

Titan
Moderator
To make this point more compelling, you should cite an example where only the latency is reduced. In the benchmarks I've seen, DRAM latency plays only a minor role in system performance.
I don't remember which website it was, but years ago there was one that did a rather exhaustive bandwidth-vs-latency performance mapping, and the impact on 0.1% lows sometimes exceeded 50%. You still see 5-10% differences in 1% lows today going merely from 5600-40 to 5600-36, though not as many sites bother to track 0.1% lows anymore.
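For reference, the first-word CAS latencies those timings work out to (CL divided by half the transfer rate; this is only one component of total load-to-use latency, and the kits listed are just examples):

```cpp
// First-word CAS latency in nanoseconds: CL cycles / (MT/s / 2000).
// This is only one component of total load-to-use latency.
#include <cstdio>

int main() {
    struct Kit { const char* name; double mts; double cl; };
    const Kit kits[] = {
        {"DDR4-2400 CL18", 2400, 18},
        {"DDR4-3200 CL16", 3200, 16},
        {"DDR5-5600 CL40", 5600, 40},
        {"DDR5-5600 CL36", 5600, 36},
    };
    for (const Kit& k : kits)
        std::printf("%s: %.1f ns\n", k.name, k.cl / (k.mts / 2000.0));
}
```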

It's not really true that they're predictable. Ray tracing would be the prime example
Which was precisely why I went through the trouble of explicitly mentioning CONVENTIONAL rendering. We aren't going to have sufficiently powerful, power-efficient and affordable hardware for mainstream RT any time soon.
 
  • Like
Reactions: palladin9479

bit_user

Titan
Ambassador
I don't remember which website it was, but years ago there was one that did a rather exhaustive bandwidth-vs-latency performance mapping, and the impact on 0.1% lows sometimes exceeded 50%. You still see 5-10% differences in 1% lows today going merely from 5600-40 to 5600-36, though not as many sites bother to track 0.1% lows anymore.
That would be interesting and informative to see, if you could find it.

Which was precisely why I went through the trouble of explicitly mentioning CONVENTIONAL rendering. We aren't going to have sufficiently powerful, power-efficient and affordable hardware for mainstream RT any time soon.
Based on our prior discussions, your ideas about "conventional rendering" seem to be stuck in the late 1990's. Shaders can be arbitrarily complex and often do quite a lot.

It's interesting to consider that even a mid-range GPU, like the RTX 3070, has about 8.8 trillion shader clocks per second. That works out to 1.1 million shader clocks per pixel per second, at 3840x2160. So, at 100 Hz, you've got up to 11k shader clocks to paint each of those pixels. That should give you just a rough idea how sophisticated modern renderers are, given that most can't maintain a stable 100 Hz framerate at 4k resolution, on such a GPU.

Or, if you want to look at the memory bandwidth aspect, it works out to roughly 54 kB per second per pixel. At a 100 Hz frame rate, that's about 540 bytes per pixel per frame. And yet, it's often not enough!
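If anyone wants to check the arithmetic or plug in a different GPU (approximate RTX 3070 figures assumed: 5888 FP32 lanes at ~1.5 GHz and 448 GB/s of GDDR6 bandwidth):

```cpp
// Rough per-pixel compute and bandwidth budget at 4K / 100 Hz.
// Approximate RTX 3070 figures: 5888 FP32 lanes at ~1.5 GHz, 448 GB/s GDDR6.
#include <cstdio>

int main() {
    const double shader_lanes = 5888;
    const double clock_hz = 1.5e9;
    const double mem_bw = 448e9;            // bytes per second
    const double pixels = 3840.0 * 2160.0;
    const double fps = 100.0;

    double lane_clocks_per_s = shader_lanes * clock_hz;           // ~8.8e12
    std::printf("shader clocks/pixel/s:     %.2e\n", lane_clocks_per_s / pixels);
    std::printf("shader clocks/pixel/frame: %.0f\n", lane_clocks_per_s / pixels / fps);
    std::printf("bytes/pixel/s:             %.1f kB\n", mem_bw / pixels / 1e3);
    std::printf("bytes/pixel/frame:         %.0f B\n", mem_bw / pixels / fps);
}
```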
 
It's simple: general-purpose computing is very serial in nature, mostly reading inputs, doing some operation, then either storing outputs or doing a conditional jump. Each one of those steps requires a memory access, and latency really affects that. GPUs, on the other hand, just need raw data bandwidth; each individual access isn't that important because the GPU is processing thousands of operations at once. For example, applying a shader effect to an object is hundreds of operations, one per pixel, all at the same time.
 
  • Like
Reactions: Amdlova

InvalidError

Titan
Moderator
That would be interesting and informative to see, if you could find it.
The closest thing I have found so far is from Anandtech's Haswell memory-scaling exploration, but those charts aren't half as exhaustive and pre-date the days of reporting on 1% lows. Roughly the same format as the charts I remember, different color scheme, many more blank boxes.

It's interesting to consider that even a mid-range GPU, like the RTX 3070, has about 8.8 trillion shader clocks per second.
It isn't the shader clocks that matter but how many of them are doing useful work: not executing NOPs, waiting for data, waiting on sync, doing speculative work that will get discarded, or doing work that will get nulled or masked out, because it is simpler and cheaper, time-wise, to throw the work away than to take a conditional branch around it, especially when thread waves have to start and end together anyway. It's the same reason we cannot reliably draw performance conclusions from raw FP32 figures: tons of FP32 throughput either goes to throw-away results or simply sits unused.
 
  • Like
Reactions: palladin9479

edzieba

Distinguished
The latency testing was probably on TechReport (likely as part of the 'Inside the Second' series on frame pacing), but TechReport has been so trashed by its clickbait-farming new owners that finding it would be difficult, and all the graphs would be dead links.
 

bit_user

Titan
Ambassador
It isn't the shader clocks that matter but how many of them are doing useful work
Yes, of course. However, if they were mostly idle, then there wouldn't be so many of them. As I mentioned, GPUs rely primarily on SMT for latency-hiding. That avoids most of those cycles going to waste.

For instance, Nvidia currently supports up to 48 warps per SM, and RDNA2 supports 64 wavefronts per WGP.

Warps and wavefronts are the GPU equivalent of CPU threads, and an SM/WGP is basically the GPU equivalent of a CPU core.

because it is simpler and cheaper time-wise to throw the work away than do a conditional branch to avoid it, especially when thread waves have to start and end together anyway.
No, we're not in the stone age anymore:

Another sanity check you could do is just to look at how GPU performance scales with resolution and refresh rate. If modern "conventional rendering" were the same as the late-1990s tech you're thinking of, then today's GPUs would be way overpowered and we should have framerates in the thousands of fps! Also, games would look like just higher-resolution versions of what they did back then. In case anyone needs a refresher, here's what passed for cutting-edge back then:
 
Last edited: