News GDDR7 graphics memory standard published by JEDEC — Next-gen GPUs to get up to 192 GB/s of bandwidth per device

Maybe since GPUs are more parallel, the latency isn't as much of a concern?
Essentially, yes. That's because the amount of concurrency in graphics workloads lets them keep many more threads (i.e. warps/wavefronts) in flight than they can execute at any one time. So, if one is blocked on memory access, others can execute in its slot.

However, another key thing about GPUs is the amount of directly-addressed on-chip memory they have. This is super low-latency, and substitutes for some of the DRAM/cache usage you'd find in a comparable program written for a general-purpose CPU.
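
To put a rough number on how much concurrency it takes to hide DRAM latency, here's a quick Little's-law sketch. The bandwidth, latency, and access-size figures are purely illustrative assumptions, not the specs of any particular GPU:

```python
# Little's-law estimate of the concurrency needed to hide DRAM latency:
# you need roughly (bandwidth x latency) bytes of requests in flight.
# All numbers are illustrative assumptions, not specs of any real card.

dram_bandwidth_gbs = 500      # assumed GDDR bandwidth, GB/s
dram_latency_ns = 300         # assumed load-to-use latency, ns
bytes_per_access = 32         # assumed bytes fetched per outstanding load

# GB/s * ns conveniently cancels to plain bytes (1e9 * 1e-9 = 1).
bytes_in_flight = dram_bandwidth_gbs * dram_latency_ns
outstanding_accesses = bytes_in_flight / bytes_per_access

print(f"~{bytes_in_flight:,} bytes must be in flight")
print(f"~{outstanding_accesses:,.0f} concurrent accesses to hide the latency")
# A CPU core tracks a few dozen outstanding misses; a GPU with thousands of
# resident threads can supply this much concurrency without breaking a sweat.
```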

And the "G" means specifically designed for GPU workloads, huh?
Heh, I think the "G" in "GDDR" means it's not useful for much, other than graphics!
: )

My take is that GDDR memory and GPUs are co-optimized. GDDR trades latency for higher bandwidth, specifically because GPUs need higher bandwidth and can accommodate higher latency easily enough.
 
There's a big detail you're missing, which is the massive increase in cache that occurred in RDNA2 and the RTX 4000 series. I think that was the big enabler for each of them to reduce bus width, rather than GDDR6.

One thing I find interesting about RDNA3 is that AMD actually increased bus widths, contrary to that supposed trend.


Try comparing the amount of L2 cache they have.

Also, if they're so bandwidth-starved, why did performance of the RTX 4070 Ti Super improve much less than its memory bandwidth? It got a 33% bandwidth boost, relative to the non-Super, but its actual performance gains were closer to the 10% compute boost it received.
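
For reference, here's the arithmetic behind those percentages, using the published shader counts, boost clock, and memory configs as I recall them (treat the figures as approximate):

```python
# Rough check of the 4070 Ti vs. 4070 Ti Super comparison. Shader counts,
# boost clock, and memory specs are quoted from memory, so treat as approximate.

def fp32_tflops(shaders: int, boost_ghz: float) -> float:
    return shaders * 2 * boost_ghz / 1000   # 2 FLOPs per shader per clock

ti       = {"tflops": fp32_tflops(7680, 2.61), "gbs": 21 * 192 / 8}  # ~40 TFLOPS, 504 GB/s
ti_super = {"tflops": fp32_tflops(8448, 2.61), "gbs": 21 * 256 / 8}  # ~44 TFLOPS, 672 GB/s

print(f"bandwidth gain: +{ti_super['gbs'] / ti['gbs'] - 1:.0%}")        # ~+33%
print(f"compute gain:   +{ti_super['tflops'] / ti['tflops'] - 1:.0%}")  # ~+10%
```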


Have you heard of MIP mapping? It's a texture pre-filtering technique that also massively increases access locality, since you access the texture at approximately the same resolution it's seen on the screen. Texture compression also helps quite a lot, and that's another area of improvement since the old Maxwell days you cited.
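
If it helps to make that concrete, here's a rough sketch of the usual mip-level selection idea (pick the level whose texel size roughly matches the on-screen pixel footprint), with made-up numbers:

```python
import math

# Sketch of why MIP mapping improves locality: the sampler picks the MIP
# level whose texel size roughly matches the on-screen pixel footprint, so
# neighboring pixels read neighboring texels instead of striding across the
# full-resolution texture. Numbers are made up for illustration.

def mip_level(texture_size: int, screen_coverage_px: int) -> int:
    """Level at which the sampled image is ~1 texel per screen pixel."""
    texels_per_pixel = texture_size / screen_coverage_px   # along one axis
    return max(0, math.floor(math.log2(max(1.0, texels_per_pixel))))

# A 4096x4096 texture on an object that only covers ~200 pixels on screen:
level = mip_level(4096, 200)
print(f"MIP level {level}: sampling a {4096 >> level}x{4096 >> level} image "
      f"instead of the 4096x4096 base level")
```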

As for large datasets, I mean LOD and tessellation are roughly analogous ideas for geometry as MIP mapping is for textures.

Finally, you seem to be forgetting about DLSS. That's another trick Nvidia is using to try and get by with narrower memory data paths.
The cache improvements do not make up for the significant lack of throughput. While there are some workloads that benefit from it, tasks which are VRAM-throughput intensive, e.g., higher resolutions, higher-res textures, and memory-hard workloads such as high-resolution AI image generation, do not benefit much from extra cache.
Furthermore, scaling with VRAM throughput isn't infinite; you get diminishing returns once the GPU has enough throughput for its needs. For example, a card like the RTX 4060 Ti 16GB sees better scaling from VRAM overclocking than the RTX 4070 Ti Super does.

While the card is a really bad value, the GPU is significantly faster than ones from 2014, and having significantly more cache can help. But it does not make up for all of the downsides of pairing roughly 2.5 times the compute performance with VRAM throughput similar to the cards from that time period.
 
The cache improvements do not make up for the significant lack of throughput.
Source?

While there are some workloads that benefit from it, tasks which are VRAM-throughput intensive, e.g., higher resolutions, higher-res textures,
I already explained how MIP-mapping makes the issue of high-resolution textures much less severe than one might think. As for higher display resolutions, have you not heard of tile-based rendering? That provides much better spatial coherence in texture lookups, which results in better cache hit-rates, and thus a good return on adding more of it.

memory-hard workloads such as high-resolution AI image generation, do not benefit much from extra cache.
These cards are made for gaming. If your point was really that they're bandwidth-starved for inferencing large networks, then you should've said so and I'd have agreed with you.

While the card is a really bad value, the GPU is significantly faster than ones from 2014, and having significantly more cache can help. But it does not make up for all of the downsides of pairing roughly 2.5 times the compute performance with VRAM throughput similar to the cards from that time period.
Leaving aside the bit about LLMs and image generators, the problem with your argument is that it doesn't pass a basic sanity-check. What you're saying is that they provided too much compute for the amount of memory bandwidth. However, GPU compute power is directly proportional to silicon area. Chip cost scales nonlinearly with die size. So, your argument that they built a bandwidth-starved product boils down to an allegation that they wasted money. That's essentially what you're saying, and it makes no sense!

Nvidia and AMD like money. The way you make more money is to sell the cheapest-to-produce hardware for the most revenue. You maximize revenue by providing better value, which means higher framerates. Because the two most important factors in GPU purchases are typically price and gaming performance, it's the framerates that really matter to most buyers - not the specs. So, it's in their interest to balance the amount of compute and bandwidth!

I'm not saying you can never get these cards into situations where they're bandwidth-starved, but that probably means you're using resolutions where they wouldn't perform very well, even with more memory bandwidth! Basically, you're claiming to know better than Nvidia what makes a good gaming GPU. I simply don't understand where such hubris comes from.
 
Not every workload is bandwidth-intensive, and while the cache helps in some areas, it does not help in all of them. Gamers Nexus focused more on the cache in their comparison of the 4060 Ti 8GB and 3060 Ti 8GB. Keep in mind the RTX 4060 Ti has around 21% more raw compute performance than the 3060 Ti, and around 8-12% better performance at 1080p, though the gains diminish and even go negative in some games at higher resolutions.
View: https://youtu.be/Y2b0MWGwK_U


To see more of the impact of reduced VRAM throughput, look at workloads where the data being worked on cannot all fit in cache, and constant use of the VRAM will be needed. https://www.pugetsystems.com/labs/articles/nvidia-rtx-4070-and-4060-ti-8gb-content-creation-review/


PS: overclocking the VRAM offers a decent performance boost in DaVinci Resolve, especially for the Fusion functions and temporal noise reduction, though good GPU compute performance is still needed. For example, a VRAM overclock on a GTX 1070 yields only a very small performance boost.


PS: Blender strongly favors the Nvidia architecture (including RT cores for ray-traced lighting, though that does not account for all of the difference between generations), and it also benefits strongly from cache in its tile rendering, so the RTX 40xx series gets a large performance boost.



I am not claiming to know more than Nvidia, but the same can be said about your claims about balancing compute and bandwidth, as we do not know if that was their goal. The most those of us on the outside can do is draw logical inferences as to why they might have made some changes, and it strongly seems like they are forcing additional resolution segmentation.

When running a game, there are aspects of the game engine that deal in small data sets that are compute intensive, and added cache helps in those areas. Aspects of the game engine leaning more heavily on the cache will see a performance improvement, while aspects leaning more on VRAM throughput will struggle more.

PS: those applications are not PCIe-bandwidth-intensive workloads, so the drop to PCIe 4.0 x8 does not make a difference.

The overall point I am getting at is that when they cripple the VRAM throughput there are tradeoffs, and cache is not a panacea for those tradeoffs.
 
Not every workload is bandwidth-intensive, and while the cache helps in some areas, it does not help in all of them. Gamers Nexus focused more on the cache in their comparison of the 4060 Ti 8GB and 3060 Ti 8GB. Keep in mind the RTX 4060 Ti has around 21% more raw compute performance than the 3060 Ti, and around 8-12% better performance at 1080p, though the gains diminish and even go negative in some games at higher resolutions.
View: https://youtu.be/Y2b0MWGwK_U


To see more of the impact of reduced VRAM throughput, look at workloads where the data being worked on cannot all fit in cache, and constant use of the VRAM will be needed. https://www.pugetsystems.com/labs/articles/nvidia-rtx-4070-and-4060-ti-8gb-content-creation-review/


PS: overclocking the VRAM offers a decent performance boost in DaVinci Resolve, especially for the Fusion functions and temporal noise reduction, though good GPU compute performance is still needed. For example, a VRAM overclock on a GTX 1070 yields only a very small performance boost.


PS: Blender strongly favors the Nvidia architecture and also benefits strongly from cache in its tile rendering, so the RTX 40xx series gets a large performance boost.



I am not claiming to know more than Nvidia, but the same can be said about your claims about balancing compute and bandwidth, as we do not know if that was their goal. The most those of us on the outside can do is draw logical inferences as to why they might have made some changes, and it strongly seems like they are forcing additional resolution segmentation.

When running a game, there are aspects of the game engine that deal in small data sets that are compute intensive, and added cache helps in those areas. Aspects of the game engine leaning more heavily on the cache will see a performance improvement, while aspects leaning more on VRAM throughput will struggle more.

PS: those applications are not PCIe-bandwidth-intensive workloads, so the drop to PCIe 4.0 x8 does not make a difference.
The whole point of having lots of cache is to reduce the need for raw bandwidth. AMD and Nvidia have both said that when discussing their architectures. It won’t universally help — things like crypto hashing don’t benefit at all, for example, because they basically do random memory accesses — but the vast majority of workloads that these GPUs are intended to run (i.e., games) show huge improvements.

4060 Ti has 22% more compute than the 3060 Ti and ends up 10~15 percent faster. But what you didn’t state was that the 3060 Ti has 56% more memory bandwidth, or alternatively the 4060 Ti has 36% less bandwidth.

In situations where cache doesn’t help and a workload depends solely on raw bandwidth, we should expect the 3060 Ti to significantly outperform the 4060 Ti. The fact that this almost never happens (it’s slightly faster at best) shows that the cache is working as expected.
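
As a crude way to picture it: if a fraction of requests hit in the cache, only the misses touch GDDR, so the bandwidth the shaders effectively see gets multiplied. The hit rates below are hypothetical, purely to show the mechanism:

```python
# Crude model: only cache misses go out to VRAM, so a hit rate of h makes the
# DRAM bandwidth look like dram_bw / (1 - h) to the shaders (ignoring the
# cache's own bandwidth limits). Hit rates below are hypothetical.

def effective_bandwidth(dram_gbs: float, hit_rate: float) -> float:
    return dram_gbs / (1.0 - hit_rate)

for hit in (0.0, 0.3, 0.5, 0.65):        # assumed L2 hit rates
    eff = effective_bandwidth(288, hit)  # 288 GB/s = 4060 Ti's raw bandwidth
    print(f"hit rate {hit:.0%}: ~{eff:.0f} GB/s effective")
```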

There are exceptions, like the Puget DaVinci Resolve test you show… but then the 4060 Ti was not designed to run DaVinci Resolve. That’s a professional workload, so even though it can be used on the 4060 Ti and 3060 Ti, doing poorly isn’t a particularly meaningful loss for gamers. It’s like the often horrible SPECviewperf results on GeForce cards: No one really cares that they suck at Siemens NX.

But this is all ignoring the point of this article, which is that GDDR7 should directly attack the problem of having insufficient bandwidth and capacity for a given interface width. A 128-bit GDDR7 solution with 32Gbps 24Gb chips would have 512 GB/s of bandwidth and 12GB of capacity — 78% more than the 4060 Ti on bandwidth and 50% more capacity.
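
For anyone who wants to check that math, here's the same arithmetic spelled out, using the bus width and per-chip figures quoted above (which aren't confirmed specs of any announced card):

```python
# Checking the GDDR7 example above. Bus width and per-chip figures are the
# ones quoted in this post, not confirmed specs of any announced card.

bus_width_bits = 128
data_rate_gbps = 32       # per pin
chip_density_gbit = 24    # 24 Gb devices
chip_io_bits = 32         # GDDR devices are 32 bits wide

bandwidth_gbs = bus_width_bits * data_rate_gbps / 8
capacity_gb = (bus_width_bits // chip_io_bits) * chip_density_gbit / 8

print(f"{bandwidth_gbs:.0f} GB/s and {capacity_gb:.0f} GB, "
      f"vs. the 4060 Ti's 288 GB/s / 8 GB: "
      f"{bandwidth_gbs / 288 - 1:+.0%} bandwidth, {capacity_gb / 8 - 1:+.0%} capacity")
```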
 
Not every workload is bandwidth-intensive, and while the cache helps in some areas, it does not help in all of them. Gamers Nexus focused more on the cache in their comparison of the 4060 Ti 8GB and 3060 Ti 8GB. Keep in mind the RTX 4060 Ti has around 21% more raw compute performance than the 3060 Ti, and around 8-12% better performance at 1080p, though the gains diminish and even go negative in some games at higher resolutions.
View: https://youtu.be/Y2b0MWGwK_U
First, let's be clear about one thing: you cherry-picked the most extreme bandwidth regression from the RTX 3000 -> RTX 4000 series, where the RTX 3060 Ti had a 256-bit bus (@ 14 Gbps) and the RTX 4060 Ti has a 128-bit bus (@ 18 Gbps). So, in terms of raw bandwidth, its predecessor had 55.5% more bandwidth. Or, looking at it the other way, the new model has just 64.3% as much bandwidth.

Now, in spite of that, the new card still managed to chalk up a clear win in Jarred's 15-game Geomean @ 1440p:

Here's the rasterization Geomean, from the same review:

[chart: 1440p rasterization Geomean]


Now, those frame rates are getting low enough that we can already say 1440p is probably the card's upper limit. It doesn't matter how poorly performance scales to 4k, because the GPU wouldn't even have enough compute power by that point.

To see more of the impact of reduced VRAM throughput, look at workloads where the data being worked on cannot all fit in cache, and constant use of the VRAM will be needed. https://www.pugetsystems.com/labs/articles/nvidia-rtx-4070-and-4060-ti-8gb-content-creation-review/
It's a gaming card. That's what Nvidia designed and marketed it as. If it doesn't work well for non-gaming applications, that's beside the point.

I am not claiming to know more than Nvidia, but the same can be said about your claims about balancing compute and bandwidth, as we do not know if that was their goal.
(emphasis added)
Sure, we don't know if they actually try to build cost-effective products or maximize profit, but we have to assume they try. Furthermore, if they weren't very good at it, then it's hard to see how they've managed to stay ahead of AMD and Intel.

Here's what they said about it:

[image: Nvidia slide on the L2 cache]


So, you can't claim they didn't at least analyze the issue. Furthermore, there must've been some thought put into exactly how much L2 to include, which (as I've said) costs a lot of money.

The most those of us on the outside can do is draw logical inferences as to why they might have made some changes, and it strongly seems like they are forcing additional resolution segmentation.
Okay, so how much does overclocking just the memory benefit 1440p performance? That would give us a clear indication of just how bottlenecked it really is, at that resolution. 4k performance doesn't matter, because it's definitely underpowered for 4k. So, 1440p is really where you have the best chance to stake your claim about it being bandwidth-starved.
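
For what it's worth, here's the kind of back-of-the-envelope test I mean; the FPS numbers below are placeholders, not measurements:

```python
# The test I'm asking for: overclock only the memory, hold the core clock,
# and compare the FPS gain to the bandwidth gain. A ratio near 1.0 means
# bandwidth-bound; near 0.0 means not bound. FPS numbers are placeholders.

def bandwidth_sensitivity(fps_base: float, fps_mem_oc: float,
                          bw_base: float, bw_oc: float) -> float:
    return (fps_mem_oc / fps_base - 1) / (bw_oc / bw_base - 1)

# Hypothetical result: +10% memory clock gives +2% FPS at 1440p.
print(f"sensitivity = {bandwidth_sensitivity(100, 102, 288, 288 * 1.10):.2f}")  # 0.20
```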

The way I see it, their entire product stack is addressing different price, resolution, and performance segments. They do that by varying compute and bandwidth.

cache is not a panacea to those tradeoffs.
It's not perfect, but it doesn't have to be. It just has to provide a competitive perf/$ offering.
 
The whole point of having lots of cache is to reduce the need for raw bandwidth. AMD and Nvidia have both said that when discussing their architectures. It won’t universally help — things like crypto hashing don’t benefit at all for example because they basically do random memory accesses
Ah, thank you! That's an excellent point. Given these GPUs were designed near the peak of the crypto craze, perhaps some thought even went into design tradeoffs that would make them less appealing to miners!

4060 Ti has 22% more compute than the 3060 Ti and ends up 10~15 percent faster. But what you didn’t state was that the 3060 Ti has 56% more memory bandwidth, or alternatively the 4060 Ti has 36% less bandwidth.

In situations where cache doesn’t help and a workload depends solely on raw bandwidth, we should expect the 3060 Ti to significantly outperform the 4060 Ti. The fact that this almost never happens (it’s slightly faster at best) shows that the cache is working as expected.
Another excellent point!

this is all ignoring the point of this article, which is that GDDR7 should directly attack the problem of having insufficient bandwidth and capacity for a given interface width.
We got onto this tangent because it was claimed that Nvidia would use the "excuse" of faster memory to "bandwidth starve" their GPUs even further. Again, I think that claim defies simple logic. It's not in Nvidia's economic interest to design imbalanced GPUs, and they seem pretty adept at minding their economic interests.
 
it was claimed that Nvidia would use the "excuse" of faster memory to "bandwidth starve" their GPUs even further. Again, I think that claim defies simple logic.
Good thing no one claimed that.
The original fear was that Nvidia would use the faster memory to further shrink the memory bus width while maintaining similar memory performance figures, which has become a trend lately. They are essentially using a deliberately slower memory bus in order to artificially segment their products beyond what the compute performance of the GPU would naturally lead to, and they potentially won't let GDDR7 get in the way of that. This means more compromises than would normally be made when going with a mid- to lower-end video card.
 
The original fear was that Nvidia would use the faster memory to further shrink the memory bus width while maintaining similar memory performance figures, which has become a trend lately. They are essentially using a deliberately slower memory bus in order to artificially segment their products beyond what the compute performance of the GPU would naturally lead to, and they potentially won't let GDDR7 get in the way of that. This means more compromises than would normally be made when going with a mid- to lower-end video card.
Nvidia, like AMD and Intel, targets different product segments — "deliberately" in all cases! But there's nothing "artificial" about the segmentation. It's been shown, repeatedly, that the lack of raw memory bandwidth only slightly hinders cards like the 4060/4060 Ti, at least in gaming workloads. Large memory overclocks on Ada GPUs (and RDNA 2/3 GPUs) don't generally yield linear scaling in performance, meaning the workloads aren't bandwidth limited, thanks to the larger caches.

The difficulty with modern chips is that memory interfaces don't scale well at all. The external GDDR6 memory interfaces on Turing (TSMC 12nm) are similar in size to those on Ampere (Samsung 8nm), which are in turn similar in size to those on Ada Lovelace (TSMC 4nm). Cache also doesn't scale as well with smaller nodes, but it scales better than external interfaces. So, putting more cache on a chip to reduce the number of external interfaces represents a win.

Look at the RTX 4070 and RTX 3080 10GB — performance is very similar overall. But the 3080 has a 320-bit interface and 760 GB/s of raw bandwidth paired with 30 teraflops of FP32 compute, while the 4070 has a 192-bit interface and 504 GB/s of bandwidth paired with 29 teraflops of compute. So the larger L2 basically makes up for the 34% deficit in bandwidth in most workloads.

Now if you were to do crypto hashing on the two chips, I'd expect the 3080 to be 50% faster, give or take, because of the bandwidth. But in games, it's usually within 5% either direction, and sometimes the extra 2GB of VRAM makes a difference.

There's a cost to everything Nvidia (or AMD or Intel) might want to put on a chip or graphics card. A wider interface can help in some cases, but in games it's often not the critical factor. Compute is usually the most important aspect, followed by VRAM capacity — provided the whole memory subsystem (VRAM plus cache) isn't completely gimped.

The 4060 Ti comes closer to falling into that category of being gimped, because it only has 288 GB/s of bandwidth with 22 teraflops of compute. Notice that this is 24% less compute than the 4070 but 43% less bandwidth. The 4060 on the other hand has 48% less compute and 46% less bandwidth than the 4070. So, the 4060 remains mostly "balanced" but the 4060 Ti encroaches on "unbalanced" territory. I don't think it entirely gets there, though the 8GB card also has the issue of not having enough capacity.
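
If anyone wants to check those ratios, here's the arithmetic with the TFLOPS and bandwidth figures from this post (the FLOPS-per-byte "balance" metric is just shorthand, not an official spec):

```python
# Reproducing the ratios above. TFLOPS and GB/s figures are the ones quoted
# in this post; "FLOPS per byte" is just a rough balance metric.

cards = {
    "RTX 4070":    {"tflops": 29, "gbs": 504},
    "RTX 4060 Ti": {"tflops": 22, "gbs": 288},
    "RTX 4060":    {"tflops": 15, "gbs": 272},
}

ref = cards["RTX 4070"]
for name, c in cards.items():
    balance = c["tflops"] * 1e12 / (c["gbs"] * 1e9)  # FP32 FLOPS per byte of DRAM traffic
    print(f"{name}: {1 - c['tflops'] / ref['tflops']:.0%} less compute, "
          f"{1 - c['gbs'] / ref['gbs']:.0%} less bandwidth than the 4070, "
          f"~{balance:.0f} FLOPS per byte")
```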

We'll have to see what happens with the 50-series before coming to any conclusions. There are some good options that will be available even on 128-bit interfaces. Would Nvidia dare to even try a 96-bit interface on a desktop card? Maybe for an RTX 5050, but I hope nothing above it cuts things that far (because even 3GB chips would only yield 9GB of VRAM). 128-bit and above with 3GB / 24Gb chips should be fine.
 
Good thing no one claimed that.
It's exactly the outcome you're describing, but I'm fine with you putting it in your own words:

The original fear was that Nvidia would use the faster memory to further shrink the memory bus width while maintaining similar memory performance figures,

So, if they "maintain similar memory performance figures", while presumably increasing compute capacity, is the effect not to "bandwidth starve" their GPUs even further?

They are essentially using a deliberately slower memory bus in order to artificially segment their products beyond what the compute performance of the GPU would naturally lead to,
I've asked you for evidence. Show us where a faster memory clock (but the same core clocks) delivers a near-linear speedup in gaming at 1080p or even 1440p. If it doesn't, then that blows a gaping hole in your theory that it's bandwidth-limited.

Furthermore, up and down their product lineup, they vary both compute and memory bandwidth. They don't rely on just one or the other, because that would be wasteful.

Here's something you probably didn't consider: DRAM density. When they designed the RTX 3060 Ti, probably the cheapest way to provide 8 GB was to use a 256-bit bus. That's why they did it - not because the RTX 3060 Ti actually needed all of that bandwidth. Or maybe the RTX 3060 Ti did actually need more than 128-bit, due to having vastly less L2 cache. But with the RTX 4000 series and its larger caches, it's more cost-effective for them to use half as many dies at double the density. So, if they can get enough bandwidth at 128-bit, then why not?
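
To spell out the chip-count math (using the common 8 Gb and 16 Gb GDDR6 densities; this is an illustration, not Nvidia's actual bill of materials):

```python
# Chip-count math behind the density argument. GDDR6/6X devices are 32 bits
# wide, so the bus width fixes the chip count (ignoring clamshell layouts).
# 8 Gb and 16 Gb are the common densities; this is an illustration, not a BOM.

def memory_config(bus_width_bits: int, chip_density_gbit: int):
    chips = bus_width_bits // 32
    return chips, chips * chip_density_gbit / 8   # (chip count, capacity in GB)

print("256-bit bus, 8 Gb chips :", memory_config(256, 8))    # 3060 Ti-style: (8, 8.0)
print("128-bit bus, 16 Gb chips:", memory_config(128, 16))   # 4060 Ti-style: (4, 8.0)
```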