News: Nvidia Says Feature Similar to AMD's Smart Access Memory Tech Is Coming to Ampere

If Nvidia had this capability all along, it raises the question of why they had to wait for AMD to implement it. Looks bad either way.

On the other hand, considering Ryzen's appetite for fast memory and the fact that Nvidia uses GDDR6X, wouldn't it be an interesting day if it turns out Ryzen 5000 runs faster with an Nvidia 3000-series card?
 

InvalidError

Titan
Moderator
Got to love it when one company brags about marketing brand-new features that are merely a firmware tweak away from being matched by every other hardware manufacturer that has the necessary flexibility (maximum BAR mask size, in this case) built into older products, just not exposed yet for whatever reason.
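For anyone wondering what "BAR mask size" means in practice: at enumeration, firmware/OS code sizes a BAR by writing all 1s to the register and reading back which address bits the device refuses to set; that mask fixes the maximum aperture the device can expose. A minimal sketch of just that arithmetic, with made-up read-back values (not from any real card):

```c
#include <stdint.h>
#include <stdio.h>

/* Classic PCI BAR sizing: the OS writes all 1s to the BAR register and
 * reads back a mask; the address bits the device refuses to set encode
 * the aperture size. This only simulates the arithmetic on example values. */
static uint64_t bar_size_from_mask(uint32_t readback)
{
    uint32_t mask = readback & ~0xFu;  /* strip the low flag bits of a memory BAR */
    return (uint64_t)(~mask) + 1;      /* invert, add 1 -> size in bytes */
}

int main(void)
{
    /* Hypothetical values read back after writing 0xFFFFFFFF to a BAR: */
    uint32_t examples[] = { 0xF000000Cu, 0xC000000Cu, 0x0000000Cu };
    for (int i = 0; i < 3; i++)
        printf("readback 0x%08X -> aperture %llu MB\n",
               (unsigned)examples[i],
               (unsigned long long)(bar_size_from_mask(examples[i]) >> 20));
    return 0;
}
```

The first value gives the familiar 256MB aperture; the last shows a 32-bit BAR topping out at 4GB, which is why the spec's 64-bit BAR variant (two consecutive registers) is needed for anything bigger.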
 

BaRoMeTrIc

Honorable
OK, so no matter if it's PCIe 3 or PCIe 4, the PCI base address registers can be raised from 256MB to, what, 256MB x 16 = 4GB? 8GB? A full 16GB?
 

Chung Leong

Reputable
If Nvidia had this capability all along, it raises the question of why they had to wait for AMD to implement it. Looks bad either way.

From a marketing standpoint, it's better to let everyone get subpar performance than to have some consumers get subpar performance due to incompatible hardware.
 

InvalidError

Titan
Moderator
OK, so no matter if it's PCIe 3 or PCIe 4, the PCI base address registers can be raised from 256MB to, what, 256MB x 16 = 4GB? 8GB? A full 16GB?
The PCIe spec has had a 64-bit version of the initialization registers for a long time to accommodate CPUs with >4GB memory space; I'm actually surprised it took this long for hardware manufacturers to add or unlock support for >256MB BARs.

The maximum size of BAR blocks depends on how many bits the address decoders support. Logically, the BAR should support enough address bits to let memory-mapped IO sit outside the maximum physical RAM address space at a bare minimum, so memory-mapped IO doesn't carve usable space out of it. If the system allows a device to use up to half of the memory-mapped IO space, then the maximum BAR size might be 64GB on CPUs with 128GB max RAM.

There was a time when CPUs only decoded the lower 36 bits of memory addresses and OSes used the MSBs to encode flags for internal use, which required kernel rewrites when address decoding got expanded to 40+ bits. Chances are that all of the hardware necessary to support larger BAR sizes has been around since then.
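On Linux you can check the resulting apertures directly: sysfs exposes each device's BAR regions as start/end/flags triplets. A quick sketch, assuming a Linux system (the device address 0000:01:00.0 is a placeholder; get your GPU's from lspci):

```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Print the size of each BAR region of a PCI device via Linux sysfs.
 * Each line of the resource file is "start end flags" in hex. */
int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource";
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    uint64_t start, end, flags;
    int bar = 0;
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
                  &start, &end, &flags) == 3) {
        if (end > start)   /* unused BARs read back as all zeroes */
            printf("BAR %d: %" PRIu64 " MB\n", bar, (end - start + 1) >> 20);
        bar++;
    }
    fclose(f);
    return 0;
}
```

On today's cards the VRAM aperture typically reports 256MB; with resizable BAR enabled it should report the full VRAM size instead.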
 
If Nvidia had this capability all along, it raises the question of why they had to wait for AMD to implement it. Looks bad either way.

On the other hand, considering Ryzen's appetite for fast memory and the fact that Nvidia uses GDDR6X, wouldn't it be an interesting day if it turns out Ryzen 5000 runs faster with an Nvidia 3000-series card?
Yeah, it does seem a bit questionable why they wouldn't have enabled such functionality previously. Are there any drawbacks to doing so?

It's also possible that AMD's solution covers more than just adjusting the PCIe BAR size though, and the BAR change might only be the part of it that affects existing games. I think it was suggested that games would have to be optimized for Smart Access Memory to get the most from it, so perhaps it also enables something like direct control over the contents of the "infinity cache", for example. Nvidia's performance gains from adjusting the BAR size alone might not be as large, though we have only speculation to work with for now.

As for Ryzen and RAM speed, that's not how it works. Ryzen's fabric matches the speed of system RAM, but isn't affected by VRAM, and applications are typically processing data stored in system RAM, not on the graphics card. And again, at least from what AMD has shown, the memory bandwidth of their 6000-series cards can effectively be far higher than GDDR6X for data that can fit inside the large 128MB block of L3 cache that they are calling the "infinity cache", which accounts for a relatively large portion of the GPU chip itself. That cache can hold the framebuffer, for example, allowing the GPU to perform operations on it much more quickly and efficiently than if it were stored in VRAM. I'm sure there will be some cases where having faster VRAM would be better, but this new cache is a large part of where AMD's performance and efficiency gains come from this generation, and it reduces the need for faster graphics memory.
 

hannibal

Distinguished
Most likely the difference between PCIe 3.0 and 4.0 is just the usual one: PCIe 4.0 has more bandwidth, so it can access the memory faster than 3.0. I think there was news that this has been working on Linux for some time.
 

InvalidError

Titan
Moderator
but this new cache is a large part of where AMD's performance and efficiency gains come from this generation, and it reduces the need for faster graphics memory.
Modern games use GBs worth of assets to render a scene; a larger L3$ won't alleviate the need for more VRAM. All extra cache does is reduce the frequency of cache misses so the GPU as a whole can make better use of available VRAM and PCIe bandwidth. I'd also bet modern GPUs have far more pressing uses for L3$ than the frame buffer, such as all of the temporary data shaders need to pass around between passes.
 
There was a comment on the Phoronix forums a while ago about it, and yes, it's about modifying the BAR. This, however, requires BIOS and kernel support, and wasn't possible on Windows for quite a while.
AMD enabled the feature on Windows when they had enough control to make sure it works; they never said it was a feature only they could support.
Yes, one could ask why Microsoft/Intel/Nvidia never came together to make it possible before... Oh wait.
 

VforV

Respectable
BANNED
Got to love it when one company brags about marketing brand-new features that are merely a firmware tweak away from being matched by every other hardware manufacturer that has the necessary flexibility (maximum BAR mask size, in this case) built into older products, just not exposed yet for whatever reason.
So you're hating on AMD for bringing this forward, just because they brag about it? Sure, it would have been better to let this go, ignore it like Nvidia did, and we the gamers would have never had it... perfect reasoning.

Does it really matter if it's a brag or easy to implement, as long as NOW, thanks to AMD, we will get it on both of them?

Some people have such narrow minds... meh.
 
As for Ryzen and RAM speed, that's not how it works. Ryzen's fabric matches the speed of system RAM, but isn't affected by VRAM, and applications are typically processing data stored in system RAM, not on the graphics card. And again, at least from what AMD has shown, the memory bandwidth of their 6000-series cards can effectively be far higher than GDDR6X for data that can fit inside the large 128MB block of L3 cache that they are calling the "infinity cache", which accounts for a relatively large portion of the GPU chip itself.

The point of Smart Access Memory is to let applications address the entire VRAM buffer; that's precisely what it does. I understand it does not affect Infinity Fabric frequency (in fact, I never said that), but adding faster memory means you can feed the CPU faster, and AMD has shown the gains themselves. Higher VRAM bandwidth from GDDR6X should improve on these gains.

As for the RX 6000, the on-die cache is a GPU cache, and SAM memory operations are handled by the CPU. The GPU cache cannot cache operations that were not processed by it. Notice how AMD never mentions the cache and repeatedly talks about the "high bandwidth GDDR6 memory" when talking about SAM. For reference:

[Image: gpu-management-model.png]
 

Deleted member 2851593

Guest
Got to love it when one company brags about marketing brand-new features that are merely a firmware tweak away from being matched by every other hardware manufacturer that has the necessary flexibility (maximum BAR mask size, in this case) built into older products, just not exposed yet for whatever reason.
Do you mean like when Nvidia renames standard implementations of API features to make it look like they came up with it and it's their exclusive? Yeah, it's deceptive and scummy.
 
It's a fundamental change in how the CPU and GPU communicate, so I am not surprised AMD wants to start slow and expand as they go. It's logical to ship to a small group first and stabilize before updating everyone.
Nvidia here is that kid who tells the teacher there's an easier way to do stuff while he's trying to explain the methodology.
 
Yeah, it does seem a bit questionable why they wouldn't have enabled such functionality previously. Are there any drawbacks to doing so?
Because they made RTX IO, where they transfer compressed data with low CPU usage, which makes it unnecessary to use large chunks of RAM for transferring data because now the data fits into small chunks of RAM.
They only enabled it now because of... why not?!
 

InvalidError

Titan
Moderator
So you're hating on AMD for bringing this forward, just because they brag about it? Sure, it would have been better to let this go, ignore it like Nvidia did, and we the gamers would have never had it... perfect reasoning.
I find it stupid when companies house-brand generic stuff, regardless of who the company is, and I've bashed all companies for doing it at one time or another; it isn't something I "only do on AMD". As I wrote previously, the hardware to support larger BAR sizes has likely been in place for several years already. As for why they waited until now to expand it, I wouldn't be surprised if OSes used "unused" BAR bits to internally track stuff and needed time to clean that up, the same way OSes had to clean up when server CPUs became capable of decoding memory addresses larger than 32 bits about 20 years ago.

Because they made RTX IO, where they transfer compressed data with low CPU usage, which makes it unnecessary to use large chunks of RAM for transferring data because now the data fits into small chunks of RAM.
The size of the IO window isn't the problem; having to move that window around the GPU address space to read/write different chunks is. Having flat access to the entire GPU memory eliminates all the operations and latency associated with moving the 256MB window around the GPU's 4-24GB VRAM buffer.

Before:
1- check if the VRAM IO hits the currently active 256MB GPU VRAM memory page
2- send commands to the GPU to move the IO window to the target VRAM memory range when the window is currently at the wrong address
3- wait for the command to complete
4- read/write
5- rinse and repeat for every VRAM IO; threads have to take turns accessing VRAM when they need access to different pages

Now:
1- read/write
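
A toy model of the difference, with illustrative names and miniature sizes (a 16MB "VRAM" and a 1MB "window" standing in for the real 4-24GB and 256MB; the real bookkeeping lives in the driver):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define VRAM_SIZE   (16u << 20)   /* miniature stand-in for 4-24GB VRAM */
#define WINDOW_SIZE (1u << 20)    /* miniature stand-in for the 256MB BAR */

static uint8_t  vram[VRAM_SIZE];            /* simulated VRAM */
static uint64_t window_base = UINT64_MAX;   /* VRAM offset the window maps */
static long     window_moves;               /* counts the expensive remaps */

/* Before: check the window, move it if needed, then copy. */
static void write_windowed(uint64_t off, const void *src, size_t len)
{
    uint64_t page = off & ~(uint64_t)(WINDOW_SIZE - 1);
    if (page != window_base) {   /* steps 1-3: check, command the GPU, wait */
        window_base = page;
        window_moves++;
    }
    memcpy(vram + off, src, len);            /* step 4: the actual write */
}

/* Now: the BAR covers all of VRAM, so it's just a store. */
static void write_flat(uint64_t off, const void *src, size_t len)
{
    memcpy(vram + off, src, len);            /* step 1: write, nothing else */
}

int main(void)
{
    uint8_t buf[64] = {0};
    /* Writes scattered across "VRAM": the window keeps getting remapped. */
    for (uint64_t off = 0; off + sizeof buf <= VRAM_SIZE; off += WINDOW_SIZE)
        write_windowed(off, buf, sizeof buf);
    printf("windowed: %ld remaps for 16 scattered writes\n", window_moves);

    for (uint64_t off = 0; off + sizeof buf <= VRAM_SIZE; off += WINDOW_SIZE)
        write_flat(off, buf, sizeof buf);
    printf("flat:     0 remaps\n");
    return 0;
}
```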
 
Even in AMD's own first-party benchmarks WITH "rage mode", the benefit could be minimal, 2-4%, nothing that will make or break playability or make the next performance tier, so I wouldn't be surprised if this is as overhyped as hardware-accelerated GPU scheduling... Although it -will- be interesting to see if the two have any combined effect to boost performance, as I believe AMD still lacks that feature due to an insufficient userbase in the Microsoft Insider Program.
 

InvalidError

Titan
Moderator
Even in AMD's own first-party benchmarks WITH "rage mode", the benefit could be minimal, 2-4%
Sometimes it isn't the improvement in gross performance that matters as much as the overall smoothing out of the entire process: with software and drivers having flat access to VRAM, you eliminate thread contention for the 256MB window and, with it, all the hiccups that can happen when threads end up waiting for each other to release the window so another thread can move it and do its thing. It's the 1% and 0.1% lows that ruin the user experience, and threads no longer having to compete for the VRAM window may help quite a bit with that.
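To make the contention point concrete, here's a toy pthread sketch (illustrative names and shrunken sizes again, not any driver's real internals; compile with -pthread): four threads each stick to their own VRAM page, yet under the windowed model every access still has to queue up behind one lock and may force a remap.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE (1u << 20)                 /* stand-in for the 256MB window */

static uint8_t  vram[4 * PAGE];         /* simulated VRAM, one page per thread */
static pthread_mutex_t window_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t window_base = UINT64_MAX;
static long     remaps;

/* Windowed model: every access serializes on the shared window. */
static void *windowed_worker(void *arg)
{
    uint64_t base = (uintptr_t)arg * PAGE;      /* this thread's page */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&window_lock);       /* take turns on the window */
        if (window_base != base) {              /* another thread moved it */
            window_base = base;                 /* stand-in for remap + wait */
            remaps++;
        }
        vram[base + (uint64_t)i % PAGE] = (uint8_t)i;
        pthread_mutex_unlock(&window_lock);
    }
    return NULL;
}

/* Flat model: no shared window, so no lock and no remaps at all. */
static void *flat_worker(void *arg)
{
    uint64_t base = (uintptr_t)arg * PAGE;
    for (int i = 0; i < 100000; i++)
        vram[base + (uint64_t)i % PAGE] = (uint8_t)i;
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, windowed_worker, (void *)(uintptr_t)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("windowed: %ld remaps caused purely by thread interleaving\n", remaps);

    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, flat_worker, (void *)(uintptr_t)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("flat: 0 remaps, no lock\n");
    return 0;
}
```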
 
Now:
1- read/write
The GPU still has to know if the memory address is empty, find the closest empty address if it isn't, write the data, and move to the next empty address.

It's like increasing the block size on a hard drive; everything else stays the same except that you can send more data at once.

If the argument is that you could send 10 or however many GB the GPU has at once to the VRAM, then what's the point?! What game will ever do that?
 
Sometimes it isn't the improvement in gross performance that matters as much as the overall smoothing out of the entire process: with software and drivers having flat access to VRAM, you eliminate thread contention for the 256MB window and, with it, all the hiccups that can happen when threads end up waiting for each other to release the window so another thread can move it and do its thing. It's the 1% and 0.1% lows that ruin the user experience, and threads no longer having to compete for the VRAM window may help quite a bit with that.
There is always one IO thread that is scheduled to do things in the most efficient way; all the files are put in the best order possible and are sent out in 256MB chunks.
 

Victor_S

Distinguished
As it's already been said, it's really convenient that Nvidia claims now that they have had this capability all along (assuming all of this is fact), and NOW that AMD is enabling it, Nvidia will be too. But will the scalpers scoop up this feature before we get it? lol. They RUSHED Ampere to beat AMD, and now all of a sudden they've had an equivalent feature to SAM all along and were just too greedy or lazy to implement it... no thanks.

I switched over to the "Nvidia camp" with the GTX 10 series, but honestly, between the botched RTX 20 AND 30 series launches as well as their apparent greed with pricing... I've had enough. I already have a Ryzen 5600X CPU and am awaiting the 6800 XT launch; I'm supporting AMD for now.
 

InvalidError

Titan
Moderator
The GPU still has to know if the memory address is empty, find the closest empty address if it isn't, write the data, and move to the next empty address.
The GPU does not need to know anything; keeping tabs on what memory is used for what, or free, is the driver's job.

They RUSHED Ampere to beat AMD, and now all of a sudden they've had an equivalent feature to SAM all along and were just too greedy or lazy to implement it... no thanks.
I'd bet many GPUs and CPUs going back 10+ years have sufficient BAR flexibility in hardware to implement SAM-like functionality and could get it with nothing more than a firmware update to unlock extra address mask bits. The BAR itself already had to support 40+ bit address decoding to accommodate servers with 200+GB of RAM.
 

Makaveli

Splendid
Got to love it when one company brags about marketing brand-new features that are merely a firmware tweak away from being matched by every other hardware manufacturer that has the necessary flexibility (maximum BAR mask size, in this case) built into older products, just not exposed yet for whatever reason.

I would still wait until both AMD's solution and Nvidia's are fully reviewed by third parties before making a final decision on this.