News DeepSeek brings disruption to parallel file systems, releases powerful new open-source Fire-Flyer File System

I'd be pretty surprised if OpenAI, Google, and others hadn't also done a lot of optimization in this area. I wonder what they're using. I'm also not sure Ceph is the best point of comparison, but distributed filesystems are an area I really know very little about.

I am reminded of the substantial kernel optimizations Facebook/Meta contributed on storage I/O. At the time, I didn't connect this to AI, but perhaps that was among the driving factors:

In particular, the RWF_UNCACHED optimization seems relevant here. I would also point out that the above optimizations are orthogonal to what you do at the distributed-filesystem layer, and it's likely DeepSeek took advantage of at least some of these.

Kudos to DeepSeek for releasing and publicizing their solution. Let's hope we see more of this.
 
Indeed, there are other similar filesystems like BeeGFS


I think the remarks in the article that the filesystem will be popular with many users are a little over-optimistic. The cacheless nature only works when you have a very fast network, and you are still stuck with a latency penalty compared to caching data in RAM.

The aggregate 7.7 TB/s speed is very nice for a cluster, but newer CPUs and existing GPUs can easily achieve TB/s memory access speeds. Thus, for a cluster, some caching has to happen somewhere or the filesystem will be overwhelmed. In DeepSeek's case, 3FS works well because each piece of data needs a lot of compute to process, but this is not true for every workload.

What would be interesting is to see the improvements in 3FS and BeeGFS incorporated into regular NFS, so you can have the best of both worlds - caching when you are reusing the data and fast random access.
 
Indeed, there are other similar filesystems, like BeeGFS

Thanks!

I think the remarks in the article that the filesystem will be popular with many users are a little over-optimistic. The cacheless nature only works when you have a very fast network, and you are still stuck with a latency penalty compared to caching data in RAM.
Caching wastes memory if you're doing reads with no reuse. Cacheless reads can potentially save overhead in the kernel, because it doesn't continually need to find page-cache entries to evict, although I'm not sure how much overhead that eviction actually causes.
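To make the "cacheless read" idea concrete, here's a minimal local-file sketch using O_DIRECT. This is not how 3FS's client works (it's a user-space client talking over the network); it's just the simplest way to issue a read that never touches the kernel's page cache:

```c
// Minimal sketch (not DeepSeek's code): a read that bypasses the kernel page
// cache entirely via O_DIRECT. Buffer, offset, and length must be aligned to
// the device's logical block size (4096 is a safe assumption for NVMe).
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    const size_t len = 1 << 20;           /* 1 MiB per request */
    void *buf;
    if (posix_memalign(&buf, 4096, len)) { perror("posix_memalign"); return 1; }

    ssize_t n = pread(fd, buf, len, 0);   /* never lands in the page cache */
    if (n < 0) perror("pread");
    else printf("read %zd bytes without populating the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```

The data goes straight into the application's buffer, so there are no cache pages to track or reclaim later - which is exactly the overhead being discussed.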

The aggregate 7.7 TB/s speed is very nice for a cluster, but newer CPUs and existing GPUs can easily achieve TB/s memory access speeds.
Well, GPUs are limited in persistent or network data access by PCIe speeds, which currently top out at around 64 GB/s for an x16 connection. That said, Nvidia has been doing a lot with NVLink and InfiniBand (as well as Ultra Ethernet?), so it's possible that's the avenue over which the data arrives.
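For reference, that ~64 GB/s figure is just the PCIe 5.0 x16 link math (a rough estimate ignoring protocol overhead beyond line coding):

```latex
% PCIe 5.0 x16, per direction, back-of-the-envelope:
% 32 GT/s per lane, 16 lanes, 128b/130b line coding, 8 bits per byte
\[
  32\ \tfrac{\mathrm{GT}}{\mathrm{s}} \times 16\ \text{lanes}
  \times \tfrac{128}{130} \div 8\ \tfrac{\mathrm{bits}}{\mathrm{byte}}
  \approx 63\ \tfrac{\mathrm{GB}}{\mathrm{s}}
\]
```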

What would be interesting is to see the improvements in 3FS and BeeGFS incorporated into regular NFS, so you can have the best of both worlds - caching when you are reusing the data and fast random access.
I think that's not in the cards. NFS is very much about centralized storage and point-to-point reads & writes. For its maintainers and key users, I'd expect simplicity and reliability are far more important. Scalable, parallel, network-based filesystems have been around for a long time, yet NFS hasn't really veered into such territory.
 
Caching wastes memory if you're doing reads with no reuse. Cacheless reads can potentially save overhead in the kernel, because it doesn't continually need to find page-cache entries to evict, although I'm not sure how much overhead that eviction actually causes.
I think it depends a lot on the algorithm. If you are limited by the overhead in the kernel, then it means you are limited by the bandwidth to the networked data, and not by the compute your CPUs and GPUs can provide.

I would personally consider such a situation unsatisfactory and would try to change the algorithm so it becomes compute-limited or, at least, memory-bandwidth-limited.

The way DeepSeek and others are running, they need to stream a large amount of data through compute, and the filesystem optimizes this streaming part. Even then, they have to buffer some data - otherwise it would vanish before you can use it. BeeGFS does have a cache, but it is limited, and it does its own cache eviction, thus avoiding that kernel overhead.

For general applications, however, the cacheless or small-cache nature is a problem, because those general applications usually have some measure of locality, hitting the same data repeatedly.

Obviously, a memory-mapped database such as MariaDB, Parquet, or RMVL would want to be cached, or you hit a bottleneck.

But even for LLM applications you want caching - llama.cpp memory-maps the weight tensors and has provisions to improve locality. This way, if you have the 600+ GB DeepSeek model on your SSD, you can still run it at usable speed with less than half a terabyte of RAM. This would not work if the tensors were served using 3FS or BeeGFS.
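To illustrate the point (this is not llama.cpp's actual code, just a minimal sketch of the memory-mapping idea it relies on):

```c
// Minimal sketch (not llama.cpp's code): map a large weights file read-only,
// so the kernel page cache, not the application, decides which parts stay
// resident in RAM between accesses and between runs.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <weights-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Pages are faulted in on first touch and kept in the page cache; a
       second pass over the same tensors is served from RAM, not the SSD. */
    unsigned char *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that access is roughly sequential so kernel readahead kicks in. */
    madvise(weights, st.st_size, MADV_SEQUENTIAL);

    unsigned long long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += weights[i];          /* touch one byte per page */
    printf("touched %lld pages (checksum %llu)\n",
           (long long)(st.st_size / 4096), sum);

    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```

The whole point is that the working set that happens to be hot stays cached across accesses, which is the behavior a cacheless network client deliberately gives up.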

Well, GPUs are limited in persistent or network data access by PCIe speeds, which currently top out at around 64 GB/s for an x16 connection. That said, Nvidia has been doing a lot with NVLink and InfiniBand (as well as Ultra Ethernet?), so it's possible that's the avenue over which the data arrives.
I was thinking more in terms of systems with unified memory. Ryzen AI MAX+ is an example of such a system in the consumer space; Xeon Max and the newer Xeon 9xxx parts also have much larger memory bandwidth, and the same goes for newer Epycs. The fact that Nvidia GPUs have to suck data through a straw is their big weakness, one they have tried to fix through dedicated server architectures.
I think that's not in the cards. NFS is very much about centralized storage and point-to-point reads & writes. For its maintainers and key users, I'd expect simplicity and reliability are far more important. Scalable, parallel, network-based filesystems have been around for a long time, yet NFS hasn't really veered into such territory.
You are right, it does depend on the preferences of the maintainers. So either they decide to add features that let NFS scale up like the parallel filesystems, or some of those filesystems will add features (like proper caching) to compete with NFS. Not sure which is easier.
 
“In Fire-Flyer 2, DeepSeek utilized 180 storage nodes, each loaded with 16 16TB SSDs and two 200Gbps NUCs.”

Just to be sure, should “NUC” be “NIC”? If not, I’d love to understand the setup better.
 
Caching allows readahead.
You ask for and receive the first 4k, but the transfer was 64k, and the track (remember tracks?) was 2 MB, and the cylinder even more so. Of course, this matters far less with SSDs; I don't even know what the current structures look like, and even the flash chips/modules/controllers may have their own cache, etc.
 
Caching allows readahead.
You ask for and receive the first 4k, but the transfer was 64k,
An optimized app doesn't need the kernel to do readahead for it. You can request data in large enough chunks to be efficient and maintain a queue of multiple outstanding reads via io_uring.
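Something along these lines (a rough sketch with liburing; the chunk size, queue depth, and overall structure are made-up illustrative values, not anything from DeepSeek's or anyone else's code):

```c
// Sketch: application-managed readahead with io_uring. Keep QD reads of
// CHUNK bytes in flight at all times instead of relying on kernel readahead.
// Build with: gcc -o qread qread.c -luring   (requires liburing)
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define QD    8                 /* outstanding requests (illustrative) */
#define CHUNK (1 << 20)         /* 1 MiB per request (illustrative) */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(QD, &ring, 0) < 0) { perror("queue_init"); return 1; }

    char *bufs[QD];
    off_t next_off = 0;
    long long total = 0;

    /* Prime the queue with QD reads at increasing offsets. */
    for (int i = 0; i < QD; i++) {
        bufs[i] = malloc(CHUNK);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], CHUNK, next_off);
        io_uring_sqe_set_data(sqe, bufs[i]);
        next_off += CHUNK;
    }
    io_uring_submit(&ring);

    /* Reap completions; resubmit each buffer at the next offset until EOF. */
    for (;;) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        int res = cqe->res;
        char *buf = io_uring_cqe_get_data(cqe);
        io_uring_cqe_seen(&ring, cqe);
        if (res <= 0) break;      /* EOF or error */
        total += res;             /* ...process buf here... */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, CHUNK, next_off);
        io_uring_sqe_set_data(sqe, buf);
        next_off += CHUNK;
        io_uring_submit(&ring);
    }
    printf("read %lld bytes with %d reads in flight\n", total, QD);

    for (int i = 0; i < QD; i++) free(bufs[i]);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

With enough requests in flight, the app gets the benefit of readahead without the kernel having to guess the access pattern.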

Of course, this matters far less with SSDs; I don't even know what the current structures look like, and even the flash chips/modules/controllers may have their own cache, etc.
Modern SSDs that include DRAM seem to have about 1 GB per couple of TB, I think. They can also buffer directly in host memory. For the most part, this memory seems to be used for optimizing writes and caching FTL (Flash Translation Layer) structures, but I can't say the drives aren't doing any readahead. If we found a good benchmark comparing QD=1 sequential 4k vs. QD=1 random 4k reads, it'd be pretty clear whether the drive was doing it (I'd expect so).

Even so, modern kernels still default to some modest amount of readahead (128 kB, IIRC) when doing normal I/O (i.e. not O_DIRECT).
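For completeness, an application can also steer that default readahead per file descriptor without going all the way to O_DIRECT. A minimal, plain-POSIX sketch (the file name is just a placeholder):

```c
// Sketch: per-descriptor readahead hints via posix_fadvise().
// POSIX_FADV_RANDOM effectively disables kernel readahead for this fd;
// POSIX_FADV_SEQUENTIAL roughly doubles the readahead window instead.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    /* Expecting scattered 4k reads: ask the kernel not to read ahead
       (offset/len of 0 means "the whole file"). */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    /* ... pread() calls at random offsets here ... */

    close(fd);
    return 0;
}
```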
 
Pure random reads result in 100% cache misses, which means a read cache is pure overhead and waste.
Actually no, you have to work to make that happen. For example, suppose your data is 10TB, while your memory is only 1TB. Then there is a 10% chance for a random read to hit data already in memory and you have a 10% speedup - nothing to sneer at. For many applications the chance you will want recently used data is higher and you get a higher speedup.

To make cache useless you need to engineer your code to always request data that is not in memory and in a random order so that readahead is useless too.
 
Actually no, you have to work to make that happen. For example, suppose your data is 10TB, while your memory is only 1TB. Then there is a 10% chance for a random read to hit data already in memory and you have a 10% speedup - nothing to sneer at. For many applications the chance you will want recently used data is higher and you get a higher speedup.

To make cache useless you need to engineer your code to always request data that is not in memory and in a random order so that readahead is useless too.
Your thinking is somehow both too complex and too simplistic at the same time.

In this context and per the article, the application is engineered to be random. Therefore, darn close to 100% cache misses. It's no more complicated than that.

To your other point, you'd have to populate your theoretical 1TB of cache ahead of time, and that act would be insanely expensive (wasted hardware, energy, time) to have effectively (in your example) a 90% cache miss rate. Also, cache isn't infinitely faster than SSD, so 10% hit rate doesn't equal 10% performance improvement.

In the real world, cache gets populated by data that was previously pulled from SSD. In the training of LLMs, you pull data <generally> no more than once per epoch. By the time you get around to needing that data again, the cache entries would long since have been evicted. So, in this respect, there is 0 benefit from cache. However, all that cache management isn't free. There are upfront system costs, energy, and wasted processing cycles. Therefore, the more cache, the more wasteful the system (in the context of the system this article is describing).
 
Your thinking is somehow both too complex and too simplistic at the same time.

In this context and per the article, the application is engineered to be random. Therefore, darn close to 100% cache misses. It's no more complicated than that.
But that's a wrong statement. You are basically saying that if you are tossing darts blindfolded in random directions, you will always avoid the target board. You've got to be able to see the board to avoid it!
To your other point, you'd have to populate your theoretical 1TB of cache ahead of time, and that act would be insanely expensive (wasted hardware, energy, time) to have effectively (in your example) a 90% cache miss rate. Also, cache isn't infinitely faster than SSD, so 10% hit rate doesn't equal 10% performance improvement.
Agreed on the fraction. I rounded numbers for simplicity. On the kind of hardware they are using the incoming data is at best ~20 GB/s, while the memory is at least 200 GB/s, likely much more. So it would be 9% or better.
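As a back-of-the-envelope check, using those same rounded numbers and ignoring everything except raw bandwidth:

```latex
% Effective read bandwidth with cache hit rate h, cached-path bandwidth B_mem,
% and network/storage bandwidth B_net (harmonic mix of the two paths):
\[
  B_{\mathrm{eff}} = \frac{1}{\dfrac{h}{B_{\mathrm{mem}}} + \dfrac{1-h}{B_{\mathrm{net}}}}
  \quad\Rightarrow\quad
  \frac{1}{\dfrac{0.1}{200} + \dfrac{0.9}{20}} \approx 22\ \tfrac{\mathrm{GB}}{\mathrm{s}}
\]
% i.e. roughly a 10% improvement over the 20 GB/s no-cache baseline.
```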

With regard to cache: on Linux, to use the data you have to populate the cache with it. For example, suppose you have a 10 TB memory-mapped file and you access a page in it. What happens is that the data is read into memory, you get a page of data to use, and that very same page becomes part of the page cache, so if you or another application needs it again it does not have to be retrieved from storage. Thus that 1 TB cache will get populated automatically as the application starts up and loads data, provided you have 1 TB or more of RAM.

If you don't memory-map but just use read() calls, the situation is similar, except the data is copied from the page cache into your program's address space, so you end up using twice the memory you would with a memory map.

What these "cacheless" filesystems do is evict those pages on a shorter timescale than Linux would on its own.
At best, they are trying to create a hierarchy where some pages, like executable code or other data, are managed in the usual way with a long cache lifetime, while other pages get a shorter one.

Lastly, Linux actually has a way to tell the kernel that you won't need some data anytime soon: the madvise() call. So if you are sure you don't need the data, you can evict it. It does not work with hugepages, though.
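Something like this minimal sketch (the file name is made up; note that this drops the pages from your mapping, while the closely related posix_fadvise(POSIX_FADV_DONTNEED) is the usual way to push the file data out of the page cache itself):

```c
// Sketch: explicitly dropping pages you are done with, instead of waiting
// for the kernel's reclaim to get around to them.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big_input.bin", O_RDONLY);      /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... process data[0 .. st.st_size) exactly once ... */

    /* We won't touch this range again: drop its pages from the mapping now,
       so the kernel can reuse the memory without a reclaim pass later. */
    if (madvise(data, st.st_size, MADV_DONTNEED) != 0)
        perror("madvise");

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```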

In the real world, cache gets populated by data that was previously pulled from SSD. In the training of LLMs, you pull data <generally> no more than once per epoch. By the time you get around to needing that data again, the cache entries would long since have been evicted. So, in this respect, there is 0 benefit from cache. However, all that cache management isn't free. There are upfront system costs, energy, and wasted processing cycles. Therefore, the more cache, the more wasteful the system (in the context of the system this article is describing).
My original comment was that the cacheless nature is not the best fit for general applications. But even for the training of LLMs, it might simply be that, with the amount of hardware they have, there is no need for them to optimize cache usage - the existing speed is enough to get the job done. Which would make sense, because if they were limited by disk I/O, I'd expect them to keep optimizing until I/O is no longer a bottleneck.
 
But that's a wrong statement. You are basically saying that if you are tossing darts blindfolded in random directions, you will always avoid the target board. You've got to be able to see the board to avoid it!
The key point is the ratio of training data vs. memory. Also, server memory burns quite some power. If the cache hit rate is relatively low, then you might just be better off with less memory.

Agreed on the fraction. I rounded numbers for simplicity. On the kind of hardware they are using the incoming data is at best ~20 GB/s, while the memory is at least 200 GB/s, likely much more. So it would be 9% or better.
As mentioned before, GPUs don't have local access to that much bandwidth from main memory. Nvidia GPUs themselves have far too little memory to waste any of it on caching data with a low hit rate (we now know DeepSeek did in fact use billions of dollars' worth of Nvidia GPUs). Therefore, the difference to them between a cache hit vs. a miss is much smaller than you suggest.

If you don't memory map, but just use read calls the situation is similar, but the data will be copied from the page into the address space of your program and you will be using twice the amount of memory you would with memory map.
You speak as if caching is free, but it's not. The clearest example of this was in the development of the RWF_UNCACHED flag I mentioned in post # 2:

"Uncached buffered IO is back, after a 5 year hiatus. Simpler and cleaner now. Up to 65-75% improvement, at half the CPU usage on my system. And none of the nonsense of the unpredictability of the page cache.

You can think of uncached buffered IO as being the much more attractive cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it will copy the data, but unlike regular buffered IO, it doesn't run into the unpredictability of the page cache in terms of reclaim. As an example, on a test box with 32 drives, reading them with buffered IO looks as follows:

<data omitted>

where it's quite easy to see where the page cache filled up, and performance went from good to erratic, and finally settles at a much lower rate.
...
If the same test case is run with RWF_UNCACHED set for the buffered read, the output looks as follows:

<more data omitted>

which is just chugging along at ~155GB/sec of read performance.
...
where just the test app is using CPU, no reclaim is taking place outside of the main thread. Not only is performance 65% better, it's also using half the CPU to do it."

https://www.phoronix.com/news/Linux-RWF_UNCACHED-2024

He followed up with more specifics:

"Performance results are in patch 8 for reads and patch 10 for writes, with the tldr being that I see about a 65% improvement in performance for both, with fully predictable IO times. CPU reduction is substantial as well, with no kswapd activity at all for reclaim when using uncached IO. "

https://www.phoronix.com/news/Uncached-Buffered-IO-2024

I think that's from a bandwidth-intensive server application of some sort that's doing streaming reads/writes with no reuse (i.e. a 0% cache hit rate). I'd expect low-reuse workloads also to benefit from the same approach. It'd be interesting to know where the crossover point is between this and regular cached I/O, although I'm sure it has something to do with the hardware setup.
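For anyone who wants to experiment once their kernel carries it, the user-facing interface is just a flags argument to preadv2()/pwritev2(). A minimal sketch, with the caveat that the flag only exists on kernels/headers carrying the patch series quoted above, and its final upstream name may differ:

```c
// Sketch: uncached buffered read via preadv2(). RWF_UNCACHED is only defined
// on kernels/headers with the patch series discussed above (the flag's final
// upstream name may differ), so guard for it at compile time.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(1 << 20);
    if (!buf) { perror("malloc"); return 1; }
    struct iovec iov = { .iov_base = buf, .iov_len = 1 << 20 };

#ifdef RWF_UNCACHED
    /* Buffered read that is dropped from the page cache once it completes:
       no O_DIRECT alignment rules, but also no reclaim pressure later. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
#else
    /* Headers too old for the flag: fall back to a plain buffered read. */
    ssize_t n = preadv2(fd, &iov, 1, 0, 0);
#endif
    if (n < 0) perror("preadv2");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}
```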
 
The key point is the ratio of training data vs. memory. Also, server memory burns quite some power. If the cache hit rate is relatively low, then you might just be better off with less memory.
And memory costs money, so you can save some and buy more compute if you have less RAM. That said, the cost of RAM is very small compared to the H100s that DeepSeek was using.

As mentioned before, GPUs don't have local access to that much bandwidth from main memory. Nvidia GPUs themselves have far too little memory to waste any of it on caching data with a low hit rate (we now know DeepSeek did in fact use billions of dollars' worth of Nvidia GPUs). Therefore, the difference to them between a cache hit vs. a miss is much smaller than you suggest.
True, but Nvidia also has a mechanism to read directly from storage, so you don't need to bounce through main memory at all. The bounce is only useful if the GPUs reuse the data, which means you are caching some of it in main memory.
You speak as if caching is free, but it's not. The clearest example of this was in the development of the RWF_UNCACHED flag I mentioned in post # 2:
"Uncached buffered IO is back, after a 5 year hiatus. Simpler and cleaner now. Up to 65-75% improvement, at half the CPU usage on my system. And none of the nonsense of the unpredictability of the page cache.
You can think of uncached buffered IO as being the much more attractive cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it will copy the data, but unlike regular buffered IO, it doesn't run into the unpredictability of the page cache in terms of reclaim. As an example, on a test box with 32 drives, reading them with buffered IO looks as follows:
This is a useful feature, and I will keep an eye on it as it makes it into distribution kernels and clusters. But it is very tied to a specific kernel version, and I generally prefer to write code that works well with plain POSIX. With a little cleverness one can often figure out a way to not need it.
I think that's in a bandwidth-intensive server application, of some sort, that's doing streaming reads/writes with no reuse (i.e. 0% cache hit rate). I'd expect low-reuse also to benefit from the same approach. It'd be interesting to know where the crossover point is between this and regular cached I/O, although I'm sure it has something to do with the hardware setup.
Agreed. This also supports my observation that such cacheless filesystems can be very useful, but only for a rather narrow set of applications.
 
I generally prefer to write code that works well with plain POSIX. With a little cleverness one can often figure out a way to not need it.
I generally agree with the idea of writing standards-based portable code, but POSIX AIO is a disaster. For high-performance apps with extreme I/O demand, I think it's not feasible just to rely on POSIX. Meanwhile, the cost of optimizing specifically for whatever small diversity of systems one must support should be easily justifiable.

POSIX APIs really need updating. I'm not sure they've seen any major revisions in more than 25 years!
 
I generally agree with the idea of writing standards-based portable code, but POSIX AIO is a disaster. For high-performance apps with extreme I/O demand, I think it's not feasible just to rely on POSIX. Meanwhile, the cost of optimizing specifically for whatever small diversity of systems one must support should be easily justifiable.

POSIX APIs really need updating. I'm not sure they've seen any major revisions in more than 25 years!
Well, let's say POSIX+, i.e. whatever is commonly available on Debian 2 releases back and is not about to be obsolete :)

I think the key part is whether you need to run on different clusters, or you have a specific one that you optimize for. In the latter case, you could optimize the storage and the app together. For example, if I needed optimization along the same lines as DeepSeek did and there was no existing network filesystem available, I might have tried writing a storage server that streams directly to GPUs. Though Linux drivers are easy and fun to program, so it could have been a networked filesystem too, or a networked block device.
 
On the kind of hardware they are using the incoming data is at best ~20 GB/s, while the memory is at least 200 GB/s, likely much more. So it would be 9% or better.
Hhmmmm... I get the feeling I'm talking about caching on the file server where the data is stored, and you're talking about caching on the application server where the data is used. That might account for our difference of opinion on the matter... does that sound about right?
 
Hhmmmm... I get the feeling I'm talking about caching on the file server where the data is stored, and you're talking about caching on the application server where the data is used. That might account for our difference of opinion on the matter... does that sound about right?
Makes sense! The cacheless nature of 3FS and BeeGFS means that, on the nodes that mount the filesystems (not on the servers that host the data), the filesystems try to keep very little cached data. I think BeeGFS keeps a few GB - maybe less, maybe more, depending on how it is configured.

This is optimized for the situation where an application such as LLM training needs to stream a lot of data over and over again.

However, if you try to use BeeGFS as a general-purpose filesystem, it will work at a surface level, but you will run into slowdowns due to increased I/O any time your programs need to repeatedly access files several GB in size - they will just keep rereading them over and over. I think the same is true for 3FS.
 
[***Long sentence warning***]
With DeepSeek unable (or unwilling) to elaborate on the Uyghurs, offer a detailed description of the events of 15 Apr 1989, or even name one political blunder committed by the General Secretary of the CCP, and with it also known for easily exploited programming and training issues that let some end users get detailed descriptions of how to make terror weapons, it's easy to see why there is plenty of "anti-China fear" to go around. So no, blockbusters are out of the question. Blockbusters are for the movie stars (Sam Altman, Mark Zuckerberg, etc...); they aren't for troubled communist-bloc nations.
 
“In Fire-Flyer 2, DeepSeek utilized 180 storage nodes, each loaded with 16 16TB SSDs and two 200Gbps NUCs.”

Just to be sure, should “NUC” be “NIC”? If not, I’d love to understand the setup better.
NIC: Network Interface Card?
We already know what a NUC is... It's the Navy Unit Commendation award ... 🤣
 