News DirectStorage Performance Compared: AMD vs Intel vs Nvidia

And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.

I don't know for sure, but I would guess you're describing version 2.0 of DirectStorage.
 
Another important question to ask is how much GPU power it takes away while doing it. If this causes stutters when loading on the fly in the middle of a map or something, it may not be a good thing to use in the GPU directly? Maybe?

Regards.

I've had the same question since DirectStorage was originally announced. I've not yet found anyone who's done significant testing to find out how much RAM and how many GPU cycles it actually saves, to see if it's really "faster" overall. I put "faster" in quotes because in a vacuum it is (i.e., tested on its own), but as a whole I haven't found a good answer.
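One way to approach the "faster overall" question is to compare wall-clock load time against CPU time actually burned during the load; an offloaded (GPU) decompression path should show up as lower CPU utilisation even if wall time is similar. A minimal sketch of that measurement idea, using zlib purely as a stand-in for a game's asset codec (the payloads and sizes are made up):

```python
# Hypothetical benchmark sketch: compare wall time vs. CPU time during a
# load, so CPU-side decompression cost is visible, not just throughput.
import time
import zlib

def load_and_decompress(blobs):
    """CPU-side decompression, standing in for the traditional path."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    out = [zlib.decompress(b) for b in blobs]
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return out, wall, cpu

# Fake "assets": compressible payloads standing in for game data.
assets = [zlib.compress(bytes(range(256)) * 4096) for _ in range(50)]
data, wall_s, cpu_s = load_and_decompress(assets)

print(f"wall time: {wall_s * 1000:.1f} ms, CPU time: {cpu_s * 1000:.1f} ms")
```

Running the same harness with decompression offloaded (so the CPU only issues I/O) would show where the cycles actually go, which is the part no published test seems to cover.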
 
Another important question to ask is how much GPU power it takes away while doing it. If this causes stutters when loading on the fly in the middle of a map or something, it may not be a good thing to use in the GPU directly? Maybe?

Regards.
In theory, it could take a good amount of GPU power, but during loading the GPU is often idle, waiting for assets anyway. Spending those otherwise wasted cycles on decompression can therefore actually reduce stuttering.

It might be something the game programmers can optimize for; it especially sounds like something a console developer could leverage. And if testing shows the traditional loading pipeline is better for a given title, they can still fall back to it.
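The overlap argument above can be made concrete with a toy timing model. All the per-batch numbers here are illustrative assumptions, not measurements; the point is only that the GPU path shrinks both the serial CPU decompression step and the upload (since the data crosses the bus still compressed):

```python
# Toy timing model (illustrative numbers, not measurements) of why GPU
# decompression can reduce stutter even though it "costs" GPU cycles:
# the cycles it spends are often ones the GPU would have wasted waiting.

def cpu_path(io_ms, cpu_decomp_ms, upload_ms):
    # Traditional path: read -> CPU decompress -> upload to VRAM, serial.
    return io_ms + cpu_decomp_ms + upload_ms

def gpu_path(io_ms, upload_ms, gpu_decomp_ms):
    # DirectStorage-style path: compressed data goes straight to VRAM,
    # then the GPU decompresses it (typically much faster than the CPU).
    return io_ms + upload_ms + gpu_decomp_ms

# Assumed per-batch numbers for a mid-range system (pure illustration):
t_cpu = cpu_path(io_ms=20, cpu_decomp_ms=60, upload_ms=15)
t_gpu = gpu_path(io_ms=20, upload_ms=8, gpu_decomp_ms=6)

print(f"CPU-decompress path: {t_cpu} ms, GPU-decompress path: {t_gpu} ms")
```

With these assumed figures the GPU path finishes in roughly a third of the time, which is in the same ballpark as the article's 3x claim; real ratios depend entirely on the hardware and codec.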
 
Considering that many a rig doesn't have such a CPU, the difference would likely be even bigger there. On the other hand, many don't have such a GPU either, so the question remains how well it works on an average rig, including whether a 4 GB GPU actually has room for all the data the game wants to load directly into VRAM (instead of using system memory as a buffer for decompressed data).

Another important question to ask is how much GPU power it takes away while doing it. If this causes stutters when loading on the fly in the middle of a map or something, it may not be a good thing to use in the GPU directly? Maybe?

Regards.

Game developers have the option to give the user a setting that forces CPU decompression for the whole workload. And if I understand it correctly, devs can designate some data to go to the CPU for decompression, which some may want to do for the mentioned on-the-fly loading. Or they may skip that and still use some form of transition, such as a simple corridor, in which the GPU doesn't have much to render and can process the data for the next area.

And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.

If a GPU has plenty of RAM, there's certainly an argument for it. System DRAM still has far lower latency than NVMe SSDs, though, by roughly two to three orders of magnitude, so it's hardly a bottleneck (given enough GB). And buffering data the GPU doesn't need right now in system memory still makes sense, as it can be accessed faster there than from a storage device.
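The latency gap is easy to sanity-check on the back of an envelope. The figures below are assumed ballpark values (they vary a lot by part): DRAM random access on the order of ~100 ns, NVMe 4K random reads in the tens of microseconds:

```python
# Back-of-envelope check on the DRAM vs. NVMe latency gap.
dram_ns = 100      # ~100 ns DRAM access (assumed typical value)
nvme_ns = 50_000   # ~50 us NVMe 4K random read (assumed typical value)

ratio = nvme_ns / dram_ns
print(f"NVMe is roughly {ratio:.0f}x slower than DRAM per access")
```

Depending on which parts you pick, the ratio lands anywhere from around 100x to over 1000x, which is why system RAM remains a sensible staging area for data the GPU will want soon.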
 
I'm still waiting for a Radeon SSG-like PCIe x4 interface mounted on the back of the video card to help shorten the traces between the storage and the video card.

Also have bifurcation on PCIe 5.0 into x12 + x4 so that the GPU can get x12 worth of bandwidth while the SSD gets x4 lanes' worth.

That should really help cut latency by not routing traffic through the CPU, giving an ultra-short path from storage to GPU.
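For what the proposed split would be worth in bandwidth terms, the arithmetic is straightforward (noting, as a later reply points out, that x12 isn't a link width the PCIe spec actually defines; this is just the math). PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding:

```python
# Rough bandwidth math for a hypothetical x12 + x4 split on PCIe 5.0.
GT_PER_LANE = 32          # 32 GT/s per lane at PCIe 5.0
ENCODING = 128 / 130      # 128b/130b encoding overhead

def gbps_per_lane():
    # 32 GT/s * payload fraction / 8 bits per byte -> GB/s per lane
    return GT_PER_LANE * ENCODING / 8

x16 = 16 * gbps_per_lane()
x12 = 12 * gbps_per_lane()
x4 = 4 * gbps_per_lane()
print(f"x16: {x16:.1f} GB/s, x12: {x12:.1f} GB/s, x4: {x4:.1f} GB/s")
```

So the GPU would give up roughly a quarter of its x16 bandwidth (about 63 GB/s down to about 47 GB/s) to hand the SSD a dedicated ~15.75 GB/s link.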
 
Another important question to ask is how much GPU power it takes away while doing it. If this causes stutters when loading on the fly in the middle of a map or something, it may not be a good thing to use in the GPU directly? Maybe?
What about the stutters and asset pops from having the CPU bogged down by asset decompression and the GPU having to wait for the CPU 3x as long? I suspect most gamers have nowhere near an i9-12900K or AMD equivalent either.
 
And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.
NVLink
 
And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.
Having the data staged in system memory isn't such a bad thing as it allows you to cache compressed data from the NVMe drive so the GPU can reload it without NVMe and file system overhead next time it is needed. Putting an NVMe SSD directly on the GPU seems overkill and having to manage what gets stored where sounds like a hassle.
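The staging idea described above is essentially an in-RAM cache of compressed assets keyed by asset, evicting the least recently used entries when it fills up. A minimal sketch of that, with all names, sizes, and the disk-read stand-in being hypothetical:

```python
# Minimal sketch of caching *compressed* assets in system RAM so a
# re-load skips the NVMe and filesystem overhead. Illustrative only.
from collections import OrderedDict

class CompressedAssetCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._cache = OrderedDict()  # asset_id -> compressed bytes

    def get(self, asset_id, read_from_disk):
        if asset_id in self._cache:
            self._cache.move_to_end(asset_id)   # mark as recently used
            return self._cache[asset_id]        # RAM hit: no disk I/O
        blob = read_from_disk(asset_id)         # miss: pay the NVMe cost
        self._cache[asset_id] = blob
        self.used += len(blob)
        while self.used > self.capacity:        # evict least recently used
            _, evicted = self._cache.popitem(last=False)
            self.used -= len(evicted)
        return blob

reads = []
def fake_disk_read(asset_id):
    reads.append(asset_id)
    return b"x" * 100  # stand-in for a compressed asset

cache = CompressedAssetCache(capacity_bytes=250)
cache.get("rock", fake_disk_read)
cache.get("tree", fake_disk_read)
cache.get("rock", fake_disk_read)  # second "rock" served from RAM
print(f"disk reads: {reads}")
```

Caching the compressed form (rather than decompressed data) keeps the RAM footprint small, and the GPU can re-decompress on reload, which is the trade-off this reply is arguing for.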
 
And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.
This won't happen for a few reasons:
  • The GPU has to know where storage is in the system in the first place, as every device has an address in the system that only the CPU knows. The CPU could tell it, but that's the first hurdle.
  • There's also the question of whether PCIe devices are even allowed to talk to each other directly (peer-to-peer transfers exist in the protocol, but I'm not certain every platform permits them)
  • The GPU has to implement the NVMe protocol in order to talk to the drive
  • The GPU has to implement a driver for the file system, because as far as it knows, those 0s and 1s don't mean anything useful
  • Then finally, the GPU has to know what files to even look for and when. The GPU can't hold all of the assets in VRAM so... how does it know when to grab the right one?
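The filesystem and "which file, when" bullets can be illustrated with a toy lookup. Today the CPU's filesystem driver is what translates a path into the raw block addresses an NVMe command actually reads; all names and numbers below are made up:

```python
# Toy illustration: to fetch "rock.tex" on its own, a device needs what
# the filesystem driver knows, i.e. which on-disk blocks (extents) the
# file occupies. Hypothetical table; a real FS resolves this from its
# on-disk metadata.
extent_table = {
    # file name -> list of (start_lba, block_count) extents on the SSD
    "rock.tex": [(4096, 64), (9000, 32)],
    "tree.mesh": [(5120, 128)],
}

def blocks_for(path):
    """Translate a file path into raw read ranges (LBA, block count)."""
    if path not in extent_table:
        raise FileNotFoundError(path)
    return extent_table[path]

# Without this translation, a GPU staring at the raw drive sees only
# anonymous blocks; with it, "read rock.tex" becomes two block reads:
print(blocks_for("rock.tex"))
```

That translation layer, plus knowing *when* an asset is needed, is exactly what the CPU still provides even in a GPU-direct design.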
 
Well... the GPU could have storage on it; AMD already made them (Radeon SSG). I would think DirectStorage is good enough for most people, but I suppose if you could have a second SSD for your games (or at least part of your games, or caching, or whatever) right on the GPU, maybe it does help with load times and stuttering and whatnot. Sounds like a way to make GPUs cost another few hundred dollars 🙂. I'd be interested in how much faster that would be. Also, if GPUs supported PCIe 5.0 and you had a PCIe 5.0 SSD, how much faster would DirectStorage be?
 
And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.

ARM notebooks, consoles and phones already solved that problem with shared memory.

Shared memory is vastly superior because data is directly addressable by both the CPU and GPU without having to load everything from one memory pool to another.

It's one of the reasons ARM is so efficient and x86 is a battery hog. While you can use shared memory on x86, it was never designed for it.

PCs will have to move to shared memory one day soon too, if only to lower cost, since PC users pay twice for volatile memory instead of once like every other system.
 
Another important question to ask is how much GPU power it takes away while doing it. If this causes stutters when loading on the fly in the middle of a map or something, it may not be a good thing to use in the GPU directly? Maybe?

Regards.
Or maybe not. The GPU has to wait for the CPU to tell it what to render anyway. If the CPU is too busy decompressing assets, you'd get the same stutters.
 
OK, let's get dirty....

The elephant in the room here is not compression, decompression, or even awful programming on the part of the clueless game developers who have never heard of speed optimizations or preloading in their lives... No. It's not.

The elephant in the room is Windows file access.

Windows doesn't just open a file and start transferring data. No. First it has to lock down the entire directory tree just to get to the file, wade through 18 layers of cached crap just because everyone on the OS development team thinks more cache is always better (hint: not!), and then, eventually, after copying the directory entry around a few times, it grabs a handle to the file. Oh no. Not done yet. Now it has to read the first block and run it through its file-type identifier routines, yes, even though it really uses file extensions anyway... After determining that it might really just be a data file after all, it says hey, this is a jpg. I'll send it to the jpg processor and index it into the thumbnail database, because, why not do this every time... and now, oopsies. Forgot to send it off to the virus scanner, because we've only scanned this file a thousand times already, and who knows, it might have changed while no one was looking. Finally, done with virus scanning, oh crap... Wait. The searchindexer... Gotta scan the entire contents and send this thing off to the search indexer just in case the user wants to search their own computer (which has never in the history of Microsoft worked anyway!)... OK. Identified, thumbnailed, cached, scanned, indexed, and sent off to the pre-processor for that particular filetype... Wait... we need an icon. Let's go look up the icon for that file... Phew... Maybe it's time to send some data to the game?

Turn off your searchindexer service, and disable virus scanning on your game data directory and test this yourself.
Windows file access is the problem. Windows file access is PAINFUL. Yeah, Linux isn't too much better. Don't get smug.
Decompression of the file is trivial relative to the other silliness going on.

If you really want speed, then give me a raw partition. I'll slap a little UFS filesystem on it and be done.
Get the OS out of the way.
 
And of course what you really want is to never have that data in system memory at all, the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.

Hmm, not really; as others have posted, there are tons of issues with that. Another thing to keep in mind is that system memory needs a copy of the screen contents at all times for preemptive multitasking operating systems to work. Whenever you Alt+Tab, switch screens, or move multiple windows around, all of that is shuffled through system memory and has to be copied in and out of the GPU.
 
the real holy grail will be for the pci bus to be replaced with something that allows the GPU to load directly from an attached storage device without the cpu or system ram being involved at all.
CXL is half of this: it will enable data to be sent directly from the SSD to GPU memory.

However, we'll probably never see a case where the CPU isn't what's initiating the reads. That's because the CPU runs the filesystem driver, which is needed in order to know where on the SSD a given asset is located. There are also security implications and possibly race-conditions involved in having both the GPU and the CPU sending commands to the SSD.

CXL is rapidly chipping away at the use cases for NVLink.

Furthermore, we're seeing CXL-based SSDs coming onto the market, but I'm not aware of any NVLink SSDs ever made.
 
I'm still waiting for Radeon SSG like PCIe x4 interface mounted on the back of the Video Card to help shorten the traces between the Storage and Video Card.
Won't happen. SSD latencies are orders of magnitude too high for that to make any difference.

Radeon SSG was a specialty product aimed at a couple specific market verticals. NVMe already lessened the value proposition of doing something like that, and CXL erodes it further. Such a product is also complicated and expensive to support.

Also have bifurcation on PCIe 5.0 into x12 + x4 so that the GPU can get x12 worth of bandwidth while the SSD gets x4 lanes' worth.
The GPU would also need to support x12, which I'm guessing most/all don't. A better solution is just for Intel to upgrade the CPU's x4 connection to PCIe 5.0, like AMD did. Not that it's very consequential, but I mean if it were...