News AMD 4800S Xbox Chip Shows Perils of Pairing CPUs with GDDR6

Yes, of course. However, if they were mostly idle, then there wouldn't be so many of them. As I mentioned, GPUs rely primarily on SMT for latency-hiding.
You don't use SMT to hide latency since SMT increases it, makes it more unpredictable and therefore more difficult to hide. You use SMT to increase the amount of shared resources that can be kept busy while individual threads are stalled waiting for something. High-SMT is for sustained throughput, latency be damned. There's no latency hiding going on there; the embarrassingly parallel nature of GPU workloads simply doesn't care much about it in the first place.

If you want to hide latency, you need more accurate branch prediction, prefetching, deeper out-of-order execution, speculative execution, more non-overlapping execution units, etc. and that is how we end up with Intel supposedly ditching SMT in its future CPUs.
 
You don't use SMT to hide latency since SMT increases it, makes it more unpredictable and therefore more difficult to hide.
In this 2001 paper, they talk about it hiding latency of the functional units:
"Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized."​

It's also described this way, in this Advanced Computer Architecture lecture, from Imperial College London:
SMT threads exploit memory-system parallelism
  • Easy way to get lots of memory accesses in-flight
  • “Latency hiding” – overlapping data access with compute


And this paper from ACM's International Conference on Parallel Architecture and Compilation Techniques:

"Network processors employ a multithreaded, chip-multiprocessing architecture to effectively hide memory latency and deliver high performance for packet processing applications."


I could go on, but you get the point. You're obviously entitled to your opinion, but I'm with the experts on this one.

You and @palladin9479 talk about how the parallel nature of GPU workloads "doesn't care about latency", but SMT is the primary mechanism that enables that decoupling. Otherwise, the shader cores would be mostly idle and performance would be garbage compared with what we have today.

Yes, SMT isn't the only way to hide latency, but that's the main way GPUs do it.
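To put numbers on the distinction, here's a toy round-robin scheduling sketch (purely illustrative: the 1-cycle compute / 80-cycle stall figures and the scheduling policy are made up, and real GPU warp schedulers are far more sophisticated). The latency of each individual access never improves, but with enough resident threads the execution unit stays busy and far more work completes in the same number of cycles:

```python
# Toy model: one execution unit; each thread alternates a 1-cycle compute op
# with an 80-cycle memory stall. A ready thread is picked round-robin.
# Illustrative only -- not how any real GPU scheduler is implemented.

def run(num_threads, ops_per_thread=100, compute=1, mem_latency=80):
    ready_at = [0] * num_threads                # cycle when each thread may issue again
    remaining = [ops_per_thread] * num_threads  # compute ops left per thread
    cycle = busy = 0
    while any(remaining):
        issued = False
        for t in range(num_threads):
            if remaining[t] and ready_at[t] <= cycle:
                cycle += compute                # issue one compute op
                busy += compute
                ready_at[t] = cycle + mem_latency  # its next access still takes 80 cycles
                remaining[t] -= 1
                issued = True
                break
        if not issued:
            cycle += 1                          # nothing ready: the unit sits idle
    return cycle, busy / cycle

for n in (1, 4, 16, 81):
    total, util = run(n)
    print(f"{n:3d} threads: {n * 100:5d} ops in {total:5d} cycles, unit busy {util:.0%}")
```

With one thread the unit is busy about 1% of the time; with 81 threads it approaches 100% and does 81x the work in roughly the same number of cycles, yet every single access still waits the full 80 cycles. That's the sense in which "latency hiding" and "throughput, not latency" are describing the same behavior.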
 
I could go on, but you get the point. You're obviously entitled to your opinion, but I'm with the experts on this one.
They call it "hiding" except increasing the number of things in-flight to keep more things moving while other stuff is waiting is a work-around for latency, not a removal or hiding of it - a thread that was already waiting 80ns on average for memory accesses is still waiting 80+ns on average for the same accesses with SMT, no miraculous latency reduction there. You can improve overall THROUGHPUT and efficiency with SMT, not latency.

Now, more throughput in an embarrassingly parallel task does allow you to complete the job faster, which would be "hiding" latency at a macroscopic scale.
 
They call it "hiding" except increasing the number of things in-flight to keep more things moving while other stuff is waiting is a work-around for latency, not a removal or hiding of it
...
... no miraculous latency reduction there.
You're trying to stuff words in my mouth. Nobody said it was removal, reduction, or elimination. The latency is still there (and even increased, to your point), but the GPU's compute units aren't being directly impacted by it as much, since they can usually execute other threads that aren't blocking on loads, stores, or barriers.
 
Digital Foundry recently tested an AMD 4800S desktop kit, featuring an Xbox Series SoC, and found the chip provides underwhelming performance due to its high-latency GDDR6 memory.

AMD 4800S Xbox Chip Shows Perils of Pairing CPUs with GDDR6 : Read more
We're only now getting this article. These kits popped up fairly cheap towards the start of the hardware crunch at the beginning of the pandemic. I thought about grabbing one, but my current situation doesn't make it as good a choice as the crap laptop I ended up getting instead. There were always a few questions about these, and from what I have seen/heard they are basically all failed Xbox chips, and the board also leaves a lot to be desired in terms of connectivity, especially with regard to PCIe.
 
Imagine if you could tune the GDDR6 RAM settings and get latency down into the 40-60 ns territory. With RDNA2 and Ampere, latency was a big factor in lower-resolution performance. RDNA2 has lower latency, so with small loads it can pull ahead of Ampere.

This could explain AMD’s excellent performance at lower resolutions. RDNA 2’s low latency L2 and L3 caches may give it an advantage with smaller workloads, where occupancy is too low to hide latency. Nvidia’s Ampere chips in comparison require more parallelism to shine. source
The CPU here doesn't appear to have increased cache to help with latency. The issue is GDDR6 has a high latency compared to DDR4 or DDR5.

[Image: ampere_rdna2_mem.png (Ampere vs. RDNA 2 memory latency chart)]


Back to CPU vs GPU latency.

[Image: haswell_rdna2_mem.png (Haswell vs. RDNA 2 memory latency chart)]

You can see the different goals for GPU memory and CPU memory.

Haswell’s cache and DRAM latencies are so low that we had to put latency on a logarithmic scale. Otherwise, it’d look like a flat line way below RDNA 2’s figures. The i7-4770 with DDR3-1600 CL9 can do a round trip to memory in 63ns, while a 6900 XT with GDDR6 takes 226 ns to do the same.


From another perspective though, GDDR6 latency itself isn’t so bad. A CPU or GPU has to check cache (and see a miss) before going to memory. So we can get a more “raw” view of memory latency by just looking at how much longer going to memory takes over a last level cache hit. The delta between a last level cache hit and miss is 53.42 ns on Haswell, and 123.2 ns on RDNA2.


As an interesting thought experiment, hypothetical Haswell with GDDR6 memory controllers placed as close to L3 as possible may get somewhere around 133 ns. That’s high for a client CPU, but not so much higher than server memory latency.
So GDDR6 adds enough latency to affect game performance, and that is going to be the reason CPU workloads perform worse than they do on desktops. For GPUs the priority seems to be bandwidth, with cache and memory latency higher than on CPUs; for CPUs, latency appears to matter more. The CPU could have more cache to help get latency down, but the 4800S doesn't, so the lower performance compared to desktop parts is expected.
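As a quick sanity check (my own arithmetic, using only the figures quoted above), the ~133 ns hypothetical follows directly from the other numbers: take Haswell's latency up to an L3 hit and add RDNA 2's "raw" memory penalty.

```python
# Rebuild the "hypothetical Haswell with GDDR6" estimate from the quoted figures.
haswell_dram_round_trip = 63.0    # ns, i7-4770 with DDR3-1600 CL9
haswell_dram_penalty    = 53.42   # ns, LLC miss minus LLC hit on Haswell
rdna2_gddr6_penalty     = 123.2   # ns, LLC miss minus LLC hit on RDNA 2

haswell_llc_hit = haswell_dram_round_trip - haswell_dram_penalty   # ~9.6 ns to reach L3
hypothetical = haswell_llc_hit + rdna2_gddr6_penalty               # swap in the GDDR6 penalty
print(f"{hypothetical:.1f} ns")   # ~132.8 ns, matching the ~133 ns quoted
```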

There could be a performance advantage for having both CPU and GPU share memory. This possibility is lost with the 4800S.
 
Imagine if you could tune the GDDR6 RAM settings and get latency down into the 40-60 ns territory. With RDNA2 and Ampere, latency was a big factor in lower-resolution performance. RDNA2 has lower latency, so with small loads it can pull ahead of Ampere.
I think that's a misreading of the situation. Yes, infinity cache provided a latency advantage (when you had a cache hit, at least), but the bigger factor was probably the bandwidth advantage. Unfortunately, I think there's probably no way we can isolate one from the other.

Back to CPU vs GPU latency.

[Image: haswell_rdna2_mem.png (Haswell vs. RDNA 2 memory latency chart)]

You can see the different goals for GPU memory and CPU memory.
Most of the GPU plot is showing cache & interconnect latency. You only see the GDDR6 latency at the very right-hand side.

The CPU could have more cache to help get latency down, but the 4800S doesn't, so the lower performance compared to desktop parts is expected.
How expensive do you want it to be? You know the RX 6900 XT had a launch price of $1000, right? As we saw from the X3D processors, adding that much cache to a CPU would be a fairly expensive proposition.

There could be a performance advantage for having both CPU and GPU share memory.
It's theoretically possible, but I've yet to see clear evidence of a real-world case where it is one. I think the benefits you get by doing that are more than overshadowed by the overall benefits you get from using a fast dGPU.
 
I think that's a misreading of the situation. Yes, infinity cache provided a latency advantage (when you had a cache hit, at least), but the bigger factor was probably the bandwidth advantage. Unfortunately, I think there's probably no way we can isolate one from the other.
For GPUs, bandwidth appears to be the most important factor, but what is being shown is that for both the GPU and the CPU, latency and bandwidth can affect performance in different ways. By having the CPU and GPU connected via shared memory, you lower the latency for the CPU and GPU to communicate. This could be an advantage, but I have no information to prove it at this point. Basically they are streamlining the console for one workload. This could benefit games on the console, but the console doesn't have to perform well in other workloads; tasks like browsing the internet don't require much performance.

A PC, on the other hand, has to perform well in all workloads, so memory latency could be much more important than it is for consoles. In a console, once you max out the GPU, extra CPU performance is not really a priority; you can't make the game any faster than the GPU can render it, so the CPU and GPU are balanced against each other. In a PC you can add a better GPU, and thus CPU performance is important. Even in a PC, though, buying a 7950X and an RX 6750 XT makes no sense for gaming on a 4K monitor; the GPU will bottleneck performance.

Anyway, what I am stating is that performance on consoles is based around what the GPU can provide, and in this case it's one type of GPU. The CPU just has to get the most out of that performance. Here the GDDR6 drawbacks in latency make little difference to the outcome.

Update: PCIe latency. There is likely none in the Xbox. Shared memory likely means zero PCIe latency. With the Xbox the focus is on game performance. Here the latency could be reduced by >125 ns, as the CPU doesn't have to send frame data over the PCIe bus to the GPU memory; the data goes straight into GPU memory. So the memory latency could be less of a big deal, so long as the GPU is well fed.

The availability of low-latency switches makes the job of everyone producing a PCIe-based infrastructure easier. Industry-leading switches drop latency to as low as 110ns, or 87 percent lower than competing devices on the market. Low latency switches such as these should be the first choice of system engineers interested in producing high-performance systems.

PCIe is a stable, two-decades-old technology in extensive use, demonstrating low latencies down to 300 ns from end-to-end. source

Table source
PCIe 3.0 vs. PCIe 4.0
  • Transfer rate (x16 slot): up to 32 GB/s vs. up to 64 GB/s
  • Bandwidth (top speed): 8 GT/s per lane vs. 16 GT/s per lane
  • Latency: 250 ns vs. 125 ns
  • Power efficiency: 8 W/GB vs. 4 W/GB
  • Compatibility: compatible with most motherboards vs. backward compatible, but may require hardware and firmware upgrades for some systems
  • Cost: components generally affordable vs. components generally more expensive
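For scale (my own back-of-the-envelope arithmetic, using the latency figures from the table above and a 60 FPS frame budget):

```python
# Relate per-transaction PCIe latency to a frame budget. Rough numbers only;
# the 125 ns / 250 ns figures are the ones quoted in the table above.
frame_budget_ns = 1e9 / 60                     # ~16.7 million ns per frame at 60 FPS

for label, latency_ns in (("PCIe 3.0", 250), ("PCIe 4.0", 125)):
    trips_per_frame = frame_budget_ns / latency_ns
    print(f"{label}: {latency_ns} ns per round trip -> "
          f"~{trips_per_frame:,.0f} round trips fit in one frame")
```

A single round trip is a tiny fraction of a frame, so how much the saving matters depends on how many latency-sensitive trips a frame actually makes.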
 
By having the CPU and GPU connected via shared memory, you lower the latency for the CPU and GPU to communicate.
I understand the theoretical advantage. I'm just saying I haven't encountered real-world data demonstrating this effect. If you run across a real-world example, please share.

Basically they are streamlining the console for one workload.
Cost savings are another reason to use unified memory, and I think probably the main reason they did.

Shared memory likely means zero PCIe latency.
So, it saves you on the order of a few hundred nanoseconds. How much that matters depends entirely on how you're using the link.

I think CPUs can post writes, which means you can basically write something to the GPU controller and then go away and do something else. It's only if you're doing reads over PCIe or your writes get backed up that you would "see" the latency. Because of this, I think the way GPUs usually work is that you write commands to them, like "DMA this block of memory from here to there" and you let the GPU do it instead of tying up a CPU thread by doing "PIO" copies.

Of course, your command rate is still limited, so graphics APIs have workarounds to squeeze more performance out of it. They let you fill up a buffer with multiple draw commands, and then ship that entire buffer over to the GPU in one operation.
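A rough sketch of that batching idea as a cost model (conceptual only: the function names and overhead numbers below are made up for illustration and don't correspond to any real graphics API):

```python
# Why batching draw commands helps: the fixed per-submission overhead
# (link + driver) is paid once per batch instead of once per draw.
# All names and numbers are illustrative assumptions.

SUBMIT_OVERHEAD_NS = 250     # assumed fixed cost per submission to the GPU
DRAW_RECORD_NS = 20          # assumed CPU cost to record one draw command

def one_submit_per_draw(num_draws):
    return num_draws * (DRAW_RECORD_NS + SUBMIT_OVERHEAD_NS)

def record_then_submit_once(num_draws):
    return num_draws * DRAW_RECORD_NS + SUBMIT_OVERHEAD_NS

n = 10_000
print(f"one submit per draw: {one_submit_per_draw(n) / 1e6:.2f} ms of overhead")
print(f"batched submission:  {record_then_submit_once(n) / 1e6:.2f} ms of overhead")
```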
 
Anyway, what I am stating is that performance on consoles is based around what the GPU can provide, and in this case it's one type of GPU. The CPU just has to get the most out of that performance. Here the GDDR6 drawbacks in latency make little difference to the outcome.
To add to this, consoles could be thought of as a soft real-time system, with the real-time requirement being to get everything done within 16 ms (60 FPS) or 33 ms (30 FPS). If this doesn't change (which it won't), then you can just plan everything around whatever frame rate you're targeting.
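A minimal sketch of that frame-budget mindset (generic pacing logic for illustration only, not how any console SDK actually schedules frames):

```python
import time

TARGET_FPS = 60
FRAME_BUDGET = 1.0 / TARGET_FPS           # ~16.7 ms; use 1/30 for a 30 FPS target

def run_frames(num_frames, do_frame):
    for _ in range(num_frames):
        start = time.perf_counter()
        do_frame()                                 # update + render work for this frame
        elapsed = time.perf_counter() - start
        if elapsed < FRAME_BUDGET:
            time.sleep(FRAME_BUDGET - elapsed)     # idle out the rest of the budget
        # else: the frame overran its budget and shows up as a dropped/late frame

run_frames(3, lambda: time.sleep(0.005))           # pretend each frame's work takes 5 ms
```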

A funny thing I remember reading about is that developers complained about the Nintendo 64's use of RAMBUS RAM due to its high latency. But then we have a Super Mario 64 modder who managed to basically double the performance of the game engine compared to the original release by working smartly with how the memory subsystem works (which isn't all that different from how current consoles are laid out).
 
A funny thing I remember reading about is that developers complained about the Nintendo 64's use of RAMBUS RAM due to its high latency.
I'm not calling BS on this, but the N64 was so ground-breaking for its time and you've got to remember that the console before it was the Super Nintendo with its dinky 3.58 MHz 16-bit CPU and its 128 kB of RAM connected by an 8-bit data bus.

Compared to that, the N64's 64-bit, 93.75 MHz CPU and its 4 MB of RDRAM (connected via 32-bit data bus) must've seemed like an abundance of riches. At the time it launched, the fastest PC was probably a Pentium 200 with a 64-bit data bus. So, the N64 was no slouch by contemporary standards.

The idea that they were whining that it wasn't fast enough seems pretty funny, as there wasn't anything out there too much faster! Along with Sony, they were pioneering the whole concept of 3D gaming. So, there weren't really any standards to judge them by. Super Mario 64 was completely unlike any game I'd ever seen.
 
The idea that they were whining that it wasn't fast enough seems pretty funny, as there wasn't anything out there too much faster! Along with Sony, they were pioneering the whole concept of 3D gaming. So, there weren't really any standards to judge them by. Super Mario 64 was completely unlike any game I'd ever seen.
I don't know the exact values for the RDRAM used in the N64, but Wiki claims that PC-800 RDRAM had a latency (I'm guessing CAS) of 45 ns. PC133 at the time had a CAS latency of 3 cycles (around 22 ns). I don't know how bad that is in practice, but apparently it was enough to cause a ruckus.
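(Checking that figure with quick arithmetic of my own: 3 CAS cycles at 133 MHz is about 22.6 ns, so it lines up.)

```python
# CAS latency of PC133 SDRAM at CL3, to verify the ~22 ns figure above.
cas_cycles, clock_hz = 3, 133e6
print(cas_cycles / clock_hz * 1e9)   # ~22.6 ns
```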

This was on top of the memory subsystem issue in that the CPU had to go through the DSP chip and the memory bus could only service one or the other.
 