News GDDR7 graphics memory standard published by JEDEC — Next-gen GPUs to get up to 192 GB/s of bandwidth per device

What we want is the memory directly within the logic silicon.
That would increase perf and efficiency by orders of magnitude.
 
  • Like
Reactions: gg83
What we want is the memory directly within the logic silicon.
DRAM is typically made on a different process node than logic. I can't say exactly why.

There have been some recent announcements of companies working on hybrid HBM stacks, where you have multiple layers of DRAM dies stacked atop a logic die at the bottom. That's probably as close as we're going to get.

That would increase perf and efficiency by orders of magnitude.
Eh, memory bandwidth doesn't seem to be such a bottleneck, for graphics. HBM does tend to be more efficient than GDDR memory, but probably at least 80% of the energy burned by a graphics card is in the GPU die, itself.

And no, you don't get a big latency win by moving the memory closer. Try computing the distance light travels in a single memory clock cycle and then tell me a couple centimeters is going to make any difference.
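
If you actually want to run that exercise, here's a quick sketch; the clock rates and trace propagation speed are just illustrative assumptions:

```python
# Distance a signal travels in one memory clock cycle (all numbers here are
# illustrative assumptions, not any specific card's routing).
C = 299_792_458           # speed of light in vacuum, m/s
SIGNAL_SPEED = 0.65 * C   # assumed propagation speed in a PCB trace (~0.6-0.7c)

for clock_ghz in (1.5, 2.5, 3.5):                 # assumed memory clock rates
    period_s = 1.0 / (clock_ghz * 1e9)
    distance_cm = SIGNAL_SPEED * period_s * 100
    print(f"{clock_ghz} GHz: signal covers ~{distance_cm:.0f} cm per clock cycle")

# A couple of centimetres of trace is a fraction of one cycle, while total
# DRAM latency is hundreds of cycles, so proximity barely moves the needle.
```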
 
  • Like
Reactions: Order 66 and gg83
And no, you don't get a big latency win by moving the memory closer. Try computing the distance light travels in a single memory clock cycle and then tell me a couple centimeters is going to make any difference.
By "memory directly within the logic silicon", I assume ekio means something like 3DSoC, which would have memory layers as close as tens of nanometers from logic. I suppose this is called Processing-near-Memory (PnM).

At some point in the future, we will probably move towards "mega APUs" that move everything including memory into a single 3D package, for stupendous performance and efficiency gains, and indeed lower latency. But it's not coming soon.
 
  • Like
Reactions: gg83
At some point in the future, we will probably move towards "mega APUs" that move everything including memory into a single 3D package,
Like Apple's M-series? Even Intel has gotten on board with this, as some Meteor Lake SoCs supposedly have on-package LPDDR5X memory.

for stupendous performance and efficiency gains, and indeed lower latency. But it's not coming soon.
No, latency of HBM is typically worse than conventional DRAM, though I don't know where current & upcoming HBM stands on that front.

LPDDR5 certainly has worse latency than regular DDR5. So far, HBM and LPDDR are the only types of memory to be found on-package.

This level of integration is about energy-efficiency and increasing bandwidth. The one thing it's definitely not about is improving (best case) latency!
 
  • Like
Reactions: gg83
Like Apple's M-series? Even Intel has gotten on board with this, as some Meteor Lake SoCs supposedly have on-package LPDDR5X memory.
No, nanometers of distance like I said. What Apple and Intel are getting out of that packaging is nothing compared to what we'll see in the future.

LPDDR5 certainly has worse latency than regular DDR5. So far, HBM and LPDDR are the only types of memory to be found on-package.
I'm not addressing those types of memory, just interpreting what ekio said which does correspond to something in development.

Instead of "single 3D package", substitute "single 3D chip".
 
I know what cache is…
I am talking about full memory on the logic die such as Groq LPUs.
Memory transfers are a huge bottleneck in modern processors.
Not enough space on a die for 'full' memory to be resident there. Even if you could fab DRAM on the same process as dense logic (which you can't, so you'd have to use SRAM at an even greater area penalty), you'd end up with a full-reticle die that leaves only a teeny tiny logic area after you've shoved in both all the memory cells themselves AND the extra CPU interfaces to talk to them directly (if you're just going through the DDR controller, then you've gained basically nothing from the exercise anyway). And that still does nothing if latency was not the bottleneck to start with.
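
A quick back-of-the-envelope, using assumed (not vendor) numbers for SRAM cell size and reticle limit:

```python
# How much die area would "full memory" need if built as on-die SRAM?
# Cell size, overhead and reticle limit are rough assumptions for illustration.
SRAM_CELL_UM2 = 0.021     # assumed high-density SRAM bitcell area (um^2), ~5nm-class
ARRAY_OVERHEAD = 1.3      # assumed factor for sense amps, decoders, redundancy
RETICLE_MM2 = 858         # approximate single-exposure reticle limit

capacity_gib = 16                              # a modest 16 GiB of "system memory"
bits = capacity_gib * 8 * 1024**3
area_mm2 = bits * SRAM_CELL_UM2 * ARRAY_OVERHEAD / 1e6
print(f"{capacity_gib} GiB as SRAM ~= {area_mm2:,.0f} mm^2 "
      f"(~{area_mm2 / RETICLE_MM2:.1f} full reticles) before any logic at all")
```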

There's a good reason CPUs do not 'just' make all their cache fast L1 cache rather than the tiered L1/L2/L3(/L4) system currently in use. On die memory comes with some very significant penalties.
 
What about latency? Will the 3rd signaling level affect latency? Or is it unrelated?
It might, but my understanding is that data needs are much easier to predict for most GPU workloads, and so latency doesn't actually matter as much. GDDR5/GDDR6 latencies are far worse than DDR5 for example, but the GPUs that use GDDR are built around those latencies. They're known in advance.

AMD GPUs have an option to use "fast" latencies — a form of overclocking — but if you just flip that switch (and it doesn't cause instability), the real-world benefit is typically less than 1~2% from what I've seen. I'm not even sure what the actual latencies are, though... I'd have to go down the Google rabbit hole to try to find out.

Something else worth pointing out is that the Ryzen 4700S — the same core design as the PlayStation 5, but without the GPU enabled — showed very poor performance overall. The general consensus is that it's because the GDDR6 system memory has horrible latencies and Windows workloads don't like it. Basically, GDDR memory types are fine for GPUs, not so much for CPUs.
 
  • Like
Reactions: Order 66
I know what cache is…
I am talking about full memory on the logic die such as Groq LPUs.
Like everyone else, they're using SRAM on their logic die:

Sometimes, large amounts of SRAM are presented as cache. This fits much better with the system management model of modern, general-purpose CPUs. Sometimes, as in many AI processors and some memories in GPUs, it's presented as a directly-addressed memory. This is more efficient, if it fits your programming model.

In neither case is it a substitute for DRAM. Not even in Cerebras' case, and they have up to 40 GB per wafer!

Memory transfers are a huge bottleneck in modern processors.
Funny you should say that, because people periodically do memory-scaling analysis and the gains (other than for iGPU performance) are typically rather small:

[Chart: i9-13900K DDR5 memory-scaling results]


Comparing the i9-13900K with DDR5-7200 CL34 vs. DDR5-4800 CL40, we see it performing just 4% better, even though the faster DDR5 has 50% more bandwidth and only 56.7% as much CAS latency! That's a very poor return on investment.
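
Here's that arithmetic spelled out, using the rated module timings rather than measured latencies:

```python
# The comparison above, spelled out (rated module timings, not measured latency).
def cas_ns(mts, cl):
    """First-word CAS latency in ns: CL cycles at a clock of (MT/s) / 2."""
    return cl / (mts / 2) * 1000

fast = (7200, 34)   # DDR5-7200 CL34
slow = (4800, 40)   # DDR5-4800 CL40

print(f"Bandwidth: +{fast[0] / slow[0] - 1:.0%}")                        # +50%
print(f"CAS: {cas_ns(*fast):.2f} ns vs {cas_ns(*slow):.2f} ns "
      f"({cas_ns(*fast) / cas_ns(*slow):.1%} as much latency)")          # ~56.7%
```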

Granted, I'm sure you can find programs even more memory-bottlenecked than VRAY Next, but most end users probably aren't likely to be using one.
 
What about latency? Will the 3rd signaling level affect latency? Or is it unrelated?
No. The "PHY" in these chips needs to keep up with line rate. With a per-pin data rate of 24-48 Gbps, it works out to about 0.02 to 0.04 ns per bit. Even if PAM3 encoding/decoding adds a couple cycles of latency, that doesn't compare to the ~227 ns of end-to-end latency we see for GDDR6X, on a GPU like the RTX 4090:

The features of GDDR7 more likely to affect latency are things like on-die ECC. Even that shouldn't be a major contributor, but probably measurable.
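
To put the per-bit numbers next to that end-to-end figure (the 227 ns is the measured GDDR6X value referenced above):

```python
# Bit time at GDDR7 per-pin rates vs. end-to-end DRAM latency.
END_TO_END_NS = 227            # measured GDDR6X latency on an RTX 4090 (from the chart)

for gbps in (24, 32, 48):      # GDDR7 per-pin data rates across the spec range
    bit_time_ns = 1.0 / gbps   # nanoseconds per transferred bit at that rate
    print(f"{gbps} Gb/s per pin: {bit_time_ns:.3f} ns/bit; "
          f"end-to-end latency is ~{END_TO_END_NS / bit_time_ns:,.0f}x larger")

# Even if PAM3 encode/decode adds a handful of PHY cycles, it's lost in the noise.
```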
 
  • Like
Reactions: gg83
my understanding is that data needs are much easier to predict for most GPU workloads, and so latency doesn't actually matter as much.
GPUs are better at latency-hiding. They use techniques like SMT/Hyperthreading, but to a much greater degree. They also have hardware warp/wavefront-scheduling, with super low-overhead context switches (at that level).

GDDR5/GDDR6 latencies are far worse than DDR5 for example, but the GPUs that use GDDR are built around those latencies. They're known in advance.
GPUs accept the higher latencies, because you can deal with those a lot more easily than you can deal with being bandwidth-starved.
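
One way to see why: Little's law tells you how much data has to be in flight to hide a given latency at a given bandwidth. The numbers below are just illustrative round figures:

```python
# Little's-law sketch of latency hiding (bandwidth and latency are assumed
# round numbers, not a specific GPU's figures).
bandwidth_gb_s = 1000    # ~1 TB/s of DRAM bandwidth
latency_ns = 300         # assumed end-to-end memory latency

# bytes in flight = bandwidth * latency  (GB/s * ns conveniently equals bytes)
bytes_in_flight = bandwidth_gb_s * latency_ns
print(f"~{bytes_in_flight / 1024:.0f} KiB of requests must be outstanding "
      f"to keep the memory bus saturated")

# A GPU with tens of thousands of resident threads can keep that much traffic
# in flight; a CPU core with a few dozen outstanding misses cannot, which is
# why CPUs are far more sensitive to latency.
```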

Something else worth pointing out is that the Ryzen 4700S — the same core design as the PlayStation 5, but without the GPU enabled — showed very poor performance overall. The general consensus is that it's because the GDDR6 system memory has horrible latencies and Windows workloads don't like it. Basically, GDDR memory types are fine for GPUs, not so much for CPUs.
Yeah, CPUs can't hide memory latency nearly as well as GPUs. In the above DDR5 memory-scaling article I linked, they conclude that latency is more important than bandwidth, for DDR5 performance. That DDR5-7200 kit they tested just happens to be really good on both fronts.
 
AFAIK, the point of shorter traces is to reduce cost, lower power requirements, reduce heat output, and allow higher operating frequencies.

PCIe 5.0 interconnect between CPU and GPU is a brute force solution, and not particularly power efficient.
Same goes for DDR5 DIMMs. It's especially noticeable with DDR5 SODIMM, as those aren't designed to run at 1.5V or whatever it takes to push 8000MT/s.
LPDDR5 and GDDR aren't limited to being stuck in modules, and can be placed a lot closer to the CPU/GPU.
HBM takes that one step further, and it's why you see it pushing 1.2 TB/s, whereas GDDR7 only does 192 GB/s.

I am guessing you don't see DRAM/GDDR being placed on the same substrate as the processor, like HBM, because the traces only need to be short enough. There is also increased complexity and probably cost.

PS5/M# series works around this by using shared/unified RAM. The GPU doesn't have to go through the CPU to pull relevant data, and can read directly from the memory pool.

Speaking of which, this is what the UCIe (Universal Chiplet Interconnect Express) spec aims to solve.
 
LPDDR5 and GDDR aren't limited to being stuck in modules, and can be placed a lot closer to the CPU/GPU.
Actually, DRAM in general isn't limited to being stuck in modules. However GDDR memory must be soldered down! LPDDR5 can't go in normal SODIMMs, but it can work in CAMMs. In general, it was also designed to be soldered down.

HBM takes that one step further, and it's why you see it pushing 1.2 TB/s, whereas GDDR7 only does 192 GB/s.
You latched onto the wrong theme. HBM isn't faster because it's closer. It's faster because it can have a massively wide interface from being on the same interposer. Furthermore, its width enables its interface to run at a lower clock speed, which (along with its proximity) helps drive power-savings.
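
To put the width-versus-clock point in numbers (the per-pin rates are taken from the headline specs, so treat the results as approximate):

```python
# Peak bandwidth = interface width x per-pin rate. Per-pin rates below are
# headline spec-level figures, so treat the results as approximate.
def peak_gb_s(bus_bits, gb_s_per_pin):
    return bus_bits * gb_s_per_pin / 8

print(f"HBM3 stack (1024-bit @ 6.4 Gb/s):  {peak_gb_s(1024, 6.4):>6.0f} GB/s")  # ~819
print(f"HBM3E stack (1024-bit @ 9.6 Gb/s): {peak_gb_s(1024, 9.6):>6.0f} GB/s")  # ~1229, the ~1.2 TB/s figure
print(f"GDDR7 device (32-bit @ 48 Gb/s):   {peak_gb_s(32, 48):>6.0f} GB/s")     # 192
```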

I am guessing you don't see DRAM/GDDR being placed on the same substrate as the processor, like HBM, because the traces only need to be short enough. There is also increased complexity and probably cost.
DRAM is a generic term. It's all DRAM. However, stacking non-LP DDR5 high enough to fit a decent capacity on-package is probably a no-go for thermal reasons. The same should be doubly true for GDDR.
 
No. The "PHY" in these chips needs to keep up with line rate. With a per-pin data rate of 24-48 Gbps, it works out to about 0.02 to 0.04 ns per bit. Even if PAM3 encoding/decoding adds a couple cycles of latency, that doesn't compare to the ~227 ns of end-to-end latency we see for GDDR6X, on a GPU like the RTX 4090:
[Chart: ada_latency.png, RTX 4090 memory-hierarchy latency]

The features of GDDR7 more likely to affect latency are things like on-die ECC. Even that shouldn't be a major contributor, but probably measurable.
Thanks! And thanks for the chart
 
  • Like
Reactions: bit_user
Odds are that GPU makers will use GDDR7 to further cripple the memory bus. Consider that in the past, wider bus widths were more common. For example, the GTX 760 had a 256-bit memory bus on GDDR5, but once higher-clocking GDDR5 chips became available, GPU makers started moving to a 192-bit memory bus; then, once GDDR6 came out, we started to see GPU makers move to a 128-bit memory bus.

This is why today we see cards like the RTX 4060 with 272 GB/s of VRAM throughput, while cards from 2013-2014 like the GTX 970 were doing 270 GB/s+ with an average VRAM overclock. When you consider the relative GPU performance, it goes to show just how starved for bandwidth these newer cards are.
PS: while extra cache helps minimize the impact of slower VRAM in some workloads, it does far less for throughput-intensive workloads such as dealing with large textures or larger datasets.
 
Odds are that GPU makers will use GDDR7 to further cripple the memory bus. Consider that in the past, wider bus widths were more common. For example, the GTX 760 had a 256-bit memory bus on GDDR5, but once higher-clocking GDDR5 chips became available, GPU makers started moving to a 192-bit memory bus; then, once GDDR6 came out, we started to see GPU makers move to a 128-bit memory bus.
There's a big detail you're missing, which is the massive increase in cache that occurred in RDNA2 and the RTX 4000 series. I think that was the big enabler for each of them to reduce bus width, rather than GDDR6.
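
Roughly, the trade works like this; the hit rates and bandwidth demand below are made-up illustrative numbers, just to show the shape of it:

```python
# Toy model: a bigger last-level cache reduces the DRAM bandwidth a GPU needs.
# Hit rates and raw demand are made-up illustrative values, not measured figures.
def dram_traffic_gb_s(raw_demand_gb_s, hit_rate):
    """DRAM traffic left over when hit_rate of requests are served from cache."""
    return raw_demand_gb_s * (1.0 - hit_rate)

raw_demand = 500  # GB/s of memory demand from the shader cores (assumed)
for cache_mb, hit_rate in [(4, 0.35), (32, 0.65)]:   # small L2 vs. Ada-sized L2
    print(f"{cache_mb:>2} MB cache, ~{hit_rate:.0%} hits -> "
          f"{dram_traffic_gb_s(raw_demand, hit_rate):.0f} GB/s hits DRAM")
```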

One thing I find interesting about RDNA3 is that AMD actually increased bus widths, contrary to that supposed trend.

This is why today we see cards like the RTX 4060 with 272 GB/s of VRAM throughput, while cards from 2013-2014 like the GTX 970 were doing 270 GB/s+ with an average VRAM overclock. When you consider the relative GPU performance, it goes to show just how starved for bandwidth these newer cards are.
Try comparing the amount of L2 cache they have.

Also, if they're so bandwidth-starved, why did performance of the RTX 4070 Ti Super improve much less than its memory bandwidth? It got a 33% bandwidth boost, relative to the non-Super, but its actual performance gains were closer to the 10% compute boost it received.

it does far less for throughput-intensive workloads such as dealing with large textures or larger datasets.
Have you heard of MIP mapping? It's a texture pre-filtering technique that also massively increases access locality, since you access the texture at approximately the same resolution it's seen on the screen. Texture compression also helps quite a lot, and that's another area of improvement since the old Maxwell days you cited.
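
A simplified sketch of the level-of-detail math behind that, ignoring anisotropic filtering and per-API details:

```python
# Sketch of why MIP mapping keeps texture traffic small and local: the level
# sampled tracks the on-screen footprint, so a small or distant object never
# touches the full-resolution texture. (Simplified LOD selection.)
import math

def mip_level(texels_per_pixel_area):
    """Texel area covered by one screen pixel -> mip level (0 = full resolution)."""
    return max(0.0, 0.5 * math.log2(texels_per_pixel_area))

for footprint in (1, 16, 256):   # illustrative pixel footprints, in texels^2
    print(f"{footprint:>3} texels per pixel -> sample around mip level "
          f"{mip_level(footprint):.1f}")
```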

As for large datasets, LOD and tessellation are roughly analogous ideas for geometry, in the same way MIP mapping is for textures.

Finally, you seem to be forgetting about DLSS. That's another trick Nvidia is using to try and get by with narrower memory data paths.
 
It might, but my understanding is that data needs are much easier to predict for most GPU workloads, and so latency doesn't actually matter as much. GDDR5/GDDR6 latencies are far worse than DDR5 for example, but the GPUs that use GDDR are built around those latencies. They're known in advance.

AMD GPUs have an option to use "fast" latencies — a form of overclocking — but if you just flip that switch (and it doesn't cause instability), the real-world benefit is typically less than 1~2% from what I've seen. I'm not even sure what the actual latencies are, though... I'd have to go down the Google rabbit hole to try to find out.

Something else worth pointing out is that the Ryzen 4700S — the same core design as the PlayStation 5, but without the GPU enabled — showed very poor performance overall. The general consensus is that it's because the GDDR6 system memory has horrible latencies and Windows workloads don't like it. Basically, GDDR memory types are fine for GPUs, not so much for CPUs.
Thanks Jarred. It's really interesting how certain workloads and architectures have different tolerances for bit rate, latency, and whatever else. Maybe since GPUs are more parallel, the latency isn't as much of a concern? And the "G" means specifically designed for GPU workloads, huh?
 
Thanks Jarred. It's really interesting how certain workloads and architectures have different tolerances for bit rate, latency, and whatever else. Maybe since GPUs are more parallel, the latency isn't as much of a concern? And the "G" means specifically designed for GPU workloads, huh?
Yeah, the G is for Graphics. I think perhaps in the past the GPU companies (AMD and Nvidia) pushed ahead with faster memory solutions and bypassed JEDEC standards? At least, I seem to recall that happening. But as GDDR memory types have proliferated beyond consumer GPUs, tighter standards have been created.

AFAIK, GDDR6X never went through JEDEC, as an example; that was Nvidia working directly with Micron to roll their own special variant. (GDDR5X, by contrast, did eventually get a JEDEC spec.)
 
  • Like
Reactions: gg83 and bit_user