News Nvidia AD106, AD107 Ada Lovelace GPUs May Use PCIe x8 Interface

AgentBirdnest

Respectable
Jun 8, 2022
271
269
2,370
That could really suck if true. :-\
Wouldn't affect me, since I'm using a Zen 2 / X570 platform. But some of my friends are using Comet Lake or Coffee Lake, which are limited to PCIe 3.0. Coffee Lake may be getting old, but its i7 CPUs are still very relevant... but they may be a bit screwed if they planned on upgrading to a mainstream Ada Lovelace card. Or maybe not... Hopefully 8 lanes in a PCIe 3.0 slot won't have too much of a performance loss. Maybe it'll be fine. And maybe it won't even happen.
Wait'n'see...
 
  • Like
Reactions: purple_dragon
If accurate, it would put the GeForce RTX 4060’s performance in between the RTX 3060 Ti and RTX 3070.

When the RTX 3060 was released, it generally outperformed the RTX 2060 Super and RTX 2070 by a few percentage points. The RTX 4060 would be doing the same thing here, being substantially quicker than the RTX 3060, but performing similarly to the RTX 3060 Ti and RTX 3070.


So somehow landing between a 3060 Ti and a base 3070 is the same thing as the 3060 beating the 2060 Super and the 2070?

How does the 3060 beating a 2070 become the same as coming in between a 3060 Ti and a 3070?

It's not beating the 3070 the way the 3060 beat the 2070.

Also, Nvidia's pricing seems to be going up (they plan for the 30 and 40 series to coexist, and those prices... not a good sign), so the cost-to-performance will be worse than the 30 series.
 

hannibal

Distinguished
Well, PCIe 3.0 and older are dead meat to AMD and Nvidia, so why not move to the cheaper solution...

Well, not nice to people who have a slightly older platform, but all in all, it makes more sense to buy a used GPU with a full x16 interface for those older machines anyway. Cheaper, more speed than you can get from the new card, and no bandwidth problems.

They really try to cut costs by any means possible...
 

Zescion

Commendable
Oct 25, 2020
21
6
1,515
Well, PCIe 3.0 and older are dead meat to AMD and Nvidia, so why not move to the cheaper solution...

Well, not nice to people who have a slightly older platform, but all in all, it makes more sense to buy a used GPU with a full x16 interface for those older machines anyway. Cheaper, more speed than you can get from the new card, and no bandwidth problems.

They really try to cut costs by any means possible...
Not just cost saving, but a smart business decision.
They'll keep selling 30xx GPUs when the new cards are out. This will give buyers one good reason to buy the old cards.
 

thisisaname

Distinguished
Feb 6, 2009
913
509
19,760
Sure, if it does not affect performance. I would hope they would pass the cost savings on in the form of a lower price, but I'm not holding my breath on that happening.
 
Last edited:

InvalidError

Titan
Moderator
After all, PCIe 4.0 x8 features the same bandwidth as PCIe 3.0 x16, and the RTX 2080 Ti (the last GPU to run PCIe 3.0) ran just fine with a PCIe 3.0 x16 interface.
As all of the 4GB x4 cards have demonstrated in the past, it is the LOW-END that gets hurt worst by truncated PCIe bandwidth, not the uber-high-end with ginormous VRAM. I expect this to get much worse if DirectStorage gains momentum as low-end GPUs will have to rely far more heavily on PCIe for asset reloads from system memory than high-end ones.
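As a rough illustration of that point, here is a back-of-the-envelope sketch of how long refetching spilled assets takes over narrower links (the 0.5 GB spill size and per-lane rates are illustrative assumptions, not measurements):

```python
# Rough sketch: time to refetch spilled assets from system RAM over various PCIe links.
# Per-lane usable rates (GB/s) are approximate, after encoding/protocol overhead.
PER_LANE_GBPS = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

def refetch_time_ms(spill_gb: float, gen: str, lanes: int) -> float:
    """Time in ms to pull `spill_gb` of assets back over a PCIe `gen` x`lanes` link."""
    return spill_gb / (PER_LANE_GBPS[gen] * lanes) * 1000

# Hypothetical 4GB card spilling 0.5 GB of assets on a scene change:
for gen, lanes in [("3.0", 4), ("3.0", 8), ("3.0", 16), ("4.0", 8)]:
    print(f"PCIe {gen} x{lanes:<2}: ~{refetch_time_ms(0.5, gen, lanes):.0f} ms to refetch 0.5 GB")
```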
 
  • Like
Reactions: atomicWAR

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,326
847
20,060
PCIe x12 lane configurations get no love.

Everybody only cares about PCIe x16 or x8, even x4 gets more love.

Nobody wants to implement x12, despite the fact that it's been part of the PCIe spec for lane configurations since day one.
 

InvalidError

Titan
Moderator
Nobody wants to implement x12, despite the fact that it's been part of the PCIe spec for lane configurations since day one.
Probably because there aren't many actual use-cases for it: there's no point in having x12 on PCs, since the GPU is the only thing that comes remotely close to needing x8. The most sensible ways of splitting CPU lanes when you don't want to throw the whole x16 at the GPU are either x8/x8 or x8/x4/x4, while modern server chips have so many PCIe lanes that the half-step compromise is unnecessary.
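For reference, a quick sketch of the arithmetic: the ways a 16-lane CPU allotment can be carved into the link widths the PCIe spec defines, using at most three links (real platforms only ever expose a couple of these):

```python
# Sketch: ways to carve a CPU's 16 lanes into the link widths the PCIe spec defines.
# (Real CPUs/boards only expose a few of these; this just enumerates the arithmetic.)
from itertools import combinations_with_replacement

SPEC_WIDTHS = [1, 2, 4, 8, 12, 16]  # link widths defined by the PCIe spec

def splits(total: int = 16, max_links: int = 3):
    """Yield width combinations (up to max_links links) that use all `total` lanes."""
    for n in range(1, max_links + 1):
        for combo in combinations_with_replacement(SPEC_WIDTHS, n):
            if sum(combo) == total:
                yield combo

for combo in splits():
    print("/".join(f"x{w}" for w in sorted(combo, reverse=True)))
```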
 
  • Like
Reactions: thisisaname

escksu

Reputable
BANNED
Aug 8, 2019
877
353
5,260
I would actually love them to do this to all their cards!! Then bump it to PCIe 5.0... having 16 lanes of PCIe 5.0 for a graphics card is plain stupid. But x8 would be great. Then this leaves another 8 lanes for other things.
 

escksu

Reputable
BANNED
Aug 8, 2019
877
353
5,260
PCIe x12 lane configurations get no love.

Everybody only cares about PCIe x16 or x8, even x4 gets more love.

Nobody wants to implement x12, despite the fact that it's been part of the PCIe spec for lane configurations since day one.
That could really suck if true. :-\
Wouldn't affect me, since I'm using a Zen 2 / X570 platform. But some of my friends are using Comet Lake or Coffee Lake, which are limited to PCIe 3.0. Coffee Lake may be getting old, but its i7 CPUs are still very relevant... but they may be a bit screwed if they planned on upgrading to a mainstream Ada Lovelace card. Or maybe not... Hopefully 8 lanes in a PCIe 3.0 slot won't have too much of a performance loss. Maybe it'll be fine. And maybe it won't even happen.
Wait'n'see...

No, PCIe 3.0 x8 has hardly any performance loss compared to x16. The only exception to this is when cards are starved of VRAM (4GB cards). But for higher-end cards that have 8-16GB of VRAM, it's not an issue.

In fact, you can even get down to x4 with minimal loss (less than 10%) as long as your GPU has sufficient VRAM to minimize swapping.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,326
847
20,060
Probably because there aren't many actual use-cases for it: there's no point in having x12 on PCs, since the GPU is the only thing that comes remotely close to needing x8. The most sensible ways of splitting CPU lanes when you don't want to throw the whole x16 at the GPU are either x8/x8 or x8/x4/x4, while modern server chips have so many PCIe lanes that the half-step compromise is unnecessary.
But on the consumer end, with bifurcation, you could probably implement a variant of DirectStorage / the Radeon Pro SSG's technology where you have an x12 / x4 split.
A single PCIe x4 NVMe SSD sharing the same bifurcated x16 interface in an x12 (for the GPU) / x4 split would offer much quicker R/W to the GPU, since it would be physically much closer.

We're talking a distance measured in millimeters to centimeters of electrical trace routing between the NVMe controller and the GPU, versus close to a foot for a traditional setup (NVMe SSD to CPU to GPU). The latency difference is HUGE.

And the Radeon Pro SSG proved that having on-board NVMe can easily speed up professional work, if implemented correctly, on top of improving game load times.
 

InvalidError

Titan
Moderator
But on the consumer end, with bifurcation, you could probably implement a variant of DirectStorage / the Radeon Pro SSG's technology where you have an x12 / x4 split.
A single PCIe x4 NVMe SSD sharing the same bifurcated x16 interface in an x12 (for the GPU) / x4 split would offer much quicker R/W to the GPU, since it would be physically much closer.
Not really. Current CPUs already have 16+4 PCIe lanes, which already allow you to connect your main NVMe SSD directly to the CPU, and the current incarnation of DirectStorage has to go through the memory controller regardless of which PCIe lanes the NVMe SSD is on anyway, so the bulk of the latency is going to come from having to copy from NVMe to system memory first.

When loading a 16MB asset from a 5.0x4 NVMe SSD, you are looking at ~1ms of data transfer time. Even if the data has to traverse two PCIe hubs with a 100ns latency penalty each, the 200ns of added first-word latency would only delay loading by ~0.02% the first time data gets loaded to system RAM.
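A quick back-of-the-envelope check of those figures (the link rate and per-hop latency are the assumed values from the paragraph above):

```python
# Back-of-the-envelope: 16MB over a ~16 GB/s (PCIe 5.0x4) link vs. the added hub latency.
ASSET_BYTES = 16 * 1024**2        # 16 MB asset
LINK_BYTES_PER_S = 16e9           # ~16 GB/s usable on a 5.0x4 link (approximate)
HUB_HOPS = 2                      # hypothetical extra PCIe hubs in the path
HOP_LATENCY_NS = 100              # assumed added first-word latency per hop

transfer_ns = ASSET_BYTES / LINK_BYTES_PER_S * 1e9
added_ns = HUB_HOPS * HOP_LATENCY_NS

print(f"transfer time : {transfer_ns / 1e6:.2f} ms")
print(f"added latency : {added_ns:.0f} ns ({added_ns / transfer_ns:.4%} of the transfer)")
```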

And once data is cached in system memory, you may want that whole 5.0x16 to load it from system memory to VRAM as fast as possible.
 
  • Like
Reactions: KyaraM

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,326
847
20,060
Not really. Current CPUs already have 16+4 PCIe lanes, which already allow you to connect your main NVMe SSD directly to the CPU, and the current incarnation of DirectStorage has to go through the memory controller regardless of which PCIe lanes the NVMe SSD is on anyway, so the bulk of the latency is going to come from having to copy from NVMe to system memory first.
The whole point is to cut out the CPU's main memory controller; by virtue of being physically so close to the GPU, its memory controller, and its VRAM pool, you go from the NVMe drive directly to GPU VRAM, bypassing the entire CPU and main memory controller section.

When loading a 16MB asset from a 5.0x4 NVMe SSD, you are looking at ~1ms of data transfer time. Even if the data has to traverse two PCIe hubs with a 100ns latency penalty each, the 200ns of added first-word latency would only delay loading by ~0.02% the first time data gets loaded to system RAM.
Part of AMD's Radeon SSG was to improve latency, thus improving performance & throughput. You guys covered it over here.
Adding an on-board storage volume cuts through the various layers of hardware, software and OS interaction and provides a big increase in large data set performance.
Video games and their expansive modern worlds are usually "large data sets", and simply bypassing the CPU and main memory controller section is the next step in performance.

The PS5 does this to some extent, and I see modern GPUs from both Nvidia and AMD eventually going that route on the PC side to match the consoles' low-latency loading.
The PlayStation 5 features 16GB of GDDR6 unified RAM with 448GB/sec memory bandwidth. This memory is synergized with the SSD on an architectural level and drastically boosts RAM efficiency. The memory is no longer "parking" data from an HDD; the SSD can deliver data right to the RAM almost instantaneously.

Essentially the SSD significantly reduces latency between data delivery and memory itself. The result sees RAM only holding assets and data for the next 1 second of gameplay. The PS4's 8GB of GDDR5 memory held assets for the next 30 seconds of gameplay.

Guess what, this is how we match that feature set from the consoles on the PC gaming side:
an on-video-card PCIe x4 link (at the latest PCIe spec) bifurcated from the x16 lane allotment.
Having x12 lanes from the CPU is more than enough once you're at PCIe 5.0, and you can save a lot of data round-tripping by going straight from NVMe to the GPU's VRAM pool.

Part of AMD's drive toward efficiency is minimizing data movement to only what is necessary, over the shortest possible path. This tech will eventually help that goal by minimizing data path travel and waste.

Also, 200ns is a long time in the CPU/GPU world.
A lot of work can get done in that time; the GPU could be sitting idle, or getting work done if it were quickly fed data to work with.

And once data is cached in system memory, you may want that whole 5.0x16 to load it from system memory to VRAM as fast as possible.
The whole point is to have more than enough bandwidth from the CPU to the GPU at PCIe 5.0 x12, yet still have the option of faster loading when you can bypass the CPU and its memory controller system and go directly from on-board NVMe storage to the GPU's VRAM. Lower latency, higher bandwidth, and more fluid loading of 3D worlds with less pop-in.

That's why I want to bifurcate the modern PCIe 5.0 x16 link into x12/x4 for the GPU and on-board NVMe storage.

Improve performance for gaming and GPU workloads.
 
Last edited:

InvalidError

Titan
Moderator
Also, 200ns is a long time in the CPU/GPU world.
It is insignificant next to the ~1ms it actually takes to transfer the data or even the SSD's ~50us access time latency for the fastest consumer drives currently available, which is already 100+X worse than the hypothetical 200ns from putting the SSD two hops away from the CPU.

Re-loading assets to GPU VRAM from system memory over 5.0x16 (64GB/s) is much faster than a 5.0x4 (16GB/s hypothetical max, 12GB/s for current record holders) NVMe SSD and reloading from system memory has ~300ns of first-word latency vs ~50us for reading directly from NVMe.
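Putting those figures side by side for a single 16MB asset (the bandwidth and latency numbers are the approximate ones quoted above):

```python
# Sketch: reloading a 16MB asset into VRAM from system RAM (5.0x16) vs. straight from an NVMe SSD (5.0x4).
ASSET_BYTES = 16 * 1024**2

# (bandwidth in bytes/s, first-word latency in seconds) -- approximate figures
PATHS = {
    "system RAM over 5.0x16": (64e9, 300e-9),
    "NVMe SSD over 5.0x4":    (12e9, 50e-6),  # ~12 GB/s for current record holders, ~50us access time
}

for name, (bandwidth, latency) in PATHS.items():
    total_us = (latency + ASSET_BYTES / bandwidth) * 1e6
    print(f"{name:24s}: ~{total_us:,.0f} us per 16MB asset")
```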

Consoles don't have this "problem" since they have a single memory pool shared by the GPU and CPU. By the same token of not having separate GPU and CPU RAM, they also don't have the option of caching assets in 32+GB of 2-3X cheaper system memory for very-low-latency access (100+X lower than NVMe) later by the GPU.
 
  • Like
Reactions: KyaraM

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,326
847
20,060
It is insignificant next to the ~1ms it actually takes to transfer the data or even the SSD's ~50us access time latency for the fastest consumer drives currently available, which is already 100+X worse than the hypothetical 200ns from putting the SSD two hops away from the CPU.

Re-loading assets to GPU VRAM from system memory over 5.0x16 (64GB/s) is much faster than a 5.0x4 (16GB/s hypothetical max, 12GB/s for current record holders) NVMe SSD and reloading from system memory has ~300ns of first-word latency vs ~50us for reading directly from NVMe.

Consoles don't have this "problem" since they have a single memory pool shared by the GPU and CPU. By the same token of not having separate GPU and CPU RAM, they also don't have the option of caching assets in 32+GB of 2-3X cheaper system memory for very-low-latency access (100+X lower than NVMe) later by the GPU.
That's assuming your system has a 32+ GB main memory config.

Remember, an end user's PC build varies wildly.
Somebody might have 8-16 GB of main memory.

Other PCs might have < 8 GB if they're older machines.

Others, like me, might have 64 GB of main memory or more.

The variability is wild and exciting.

But having Ultra Low Latency access to the storage doesn't hurt, especially since you're not expected to feed the entire capacity of VRAM all at once, every ms of every frame.

Once you get past the initial load, most of the data should be sitting in VRAM, ready for you, with occasional loads and streams into VRAM.
 

InvalidError

Titan
Moderator
But having Ultra Low Latency access to the storage doesn't hurt, especially since you're not expected to feed the entire capacity of VRAM all at once, every ms of every frame.
As I wrote earlier, even the fastest SSD controllers currently in existence have 30+us of access latency, which makes the 200ns from PCIe completely moot. If you want "ultra low latency" (200-300ns) access, assets have to be cached in system memory to bypass the SSD controller altogether.

Someone with 16GB or less of system memory in a build new enough to be eligible for DirectStorage is unlikely to own a GPU capable of rendering the open-world games that would benefit most from DirectStorage in a remotely satisfactory manner.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
1,326
847
20,060
As I wrote earlier, even the fastest SSD controllers currently in existence have 30+us of access latency, which makes the 200ns from PCIe completely moot. If you want "ultra low latency" (200-300ns) access, assets have to be cached in system memory to bypass the SSD controller altogether.
That's assuming you're leaving an extra copy of the data sitting in main memory while also duplicating it for VRAM.
Some games may do it, some might not. Some might only have one copy of the data and just move it about as needed.

Someone with 16GB or less of system memory in a build new enough to be eligible for DirectStorage is unlikely to own a GPU capable of rendering the open-world games that would benefit most from DirectStorage in a remotely satisfactory manner.
That is until DirectStorage becomes a standard feature across the entire product stack.
 

InvalidError

Titan
Moderator
That's assuming you're leaving an extra copy of the data sitting in main memory while also duplicating it for VRAM.
Some games may do it, some might not. Some might only have one copy of the data and just move it about as needed.
I doubt PC game developers will go through the trouble of implementing PC DirectStorage in their massive open worlds with huge asset collections without leveraging whatever spare system memory is available to avoid stutters from reloading asset clumps from SSD.
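For what it's worth, the strategy described there amounts to an LRU cache in spare system RAM sitting in front of the SSD. A minimal hypothetical sketch (AssetCache and read_from_ssd are made-up names for illustration, not any engine's or DirectStorage's API):

```python
# Hypothetical sketch of the strategy described above: keep recently evicted asset
# clumps in spare system RAM so a VRAM reload doesn't have to go back to the SSD.
# AssetCache and read_from_ssd are made-up names, purely for illustration.
from collections import OrderedDict

def read_from_ssd(asset_id: str) -> bytes:
    # Placeholder for the slow path (a DirectStorage or plain file read).
    return b"..."

class AssetCache:
    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()  # asset_id -> bytes, kept in LRU order

    def get(self, asset_id: str) -> bytes:
        if asset_id in self.entries:              # hit: serve from system RAM
            self.entries.move_to_end(asset_id)
            return self.entries[asset_id]
        data = read_from_ssd(asset_id)            # miss: fall back to the SSD
        self.entries[asset_id] = data
        self.used += len(data)
        while self.used > self.budget:            # evict least-recently-used clumps
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return data
```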