Caching wastes memory if you're doing reads with no reuse. Cacheless reads can potentially save overhead in the kernel, because it doesn't continually need to find entries in the block cache to evict, although I'm not sure how much overhead that actually causes.
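To make the cacheless idea concrete, here is roughly what "read once and don't pollute the cache" looks like from userspace on Linux: a minimal Python sketch using posix_fadvise (O_DIRECT would be the heavier-handed route). The path and chunk size are placeholders, and this only illustrates the technique - it's not what 3FS or BeeGFS actually do internally.

    import os

    CHUNK = 1 << 20  # 1 MiB reads; placeholder size

    def stream_once(path):
        # Read a file front to back, telling the kernel not to keep the pages around.
        fd = os.open(path, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)  # hint: sequential scan
            offset = 0
            while True:
                buf = os.read(fd, CHUNK)
                if not buf:
                    break
                # ... hand buf to the consumer ...
                # drop the pages we just read so they never compete for cache space
                os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
                offset += len(buf)
        finally:
            os.close(fd)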
I think it depends a lot on the algorithm. If you are limited by the overhead in the kernel, then it means you are limited by the bandwidth to the networked data, and not by the compute your CPUs and GPUs can provide.
I would personally consider such a situation unsatisfactory and would try to change the algorithm so it becomes compute-limited, or at least memory-bandwidth limited.
The way DeepSeek and others run, they need to stream a large amount of data through compute, and the filesystem optimizes this streaming part. Even then they have to cache some data - otherwise it would vanish before you can use it. BeeGFS does have a cache, but it is limited, and it does its own cache eviction, thus avoiding that kernel overhead.
For general applications, however, the cacheless or small-cache nature is a problem, because those applications usually have some measure of locality and hit the same data repeatedly.
Obviously, a memory-mapped database such as MariaDB or parquet or RMVL would want to be cached, or you hit a bottleneck.
But even for LLM applications you want caching - llama.cpp memory-maps the weight tensors and has provisions to improve locality. This way, if you have the 600+ GB DeepSeek model on your SSD, you can still run it at usable speed with less than half a terabyte of RAM. This would not work if the tensors were served using 3FS or BeeGFS.
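For contrast with the cacheless case, the mmap approach works roughly like this on Linux (a simplified sketch of the idea, not llama.cpp's actual loader; the file name and region size are made up). The point is that the kernel's page cache decides what stays resident, so tensors you touch every token stay in RAM while cold ones get evicted.

    import mmap, os

    def map_weights(path):
        # Map a weight file read-only; the page cache keeps the hot parts resident.
        fd = os.open(path, os.O_RDONLY)
        size = os.fstat(fd).st_size
        mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
        os.close(fd)  # the mapping holds its own reference to the file
        return mm

    weights = map_weights("model.gguf")              # placeholder file name
    weights.madvise(mmap.MADV_WILLNEED, 0, 1 << 26)  # prefetch a hypothetical hot 64 MiB region
    hot = weights[:1 << 26]  # first access faults pages in from SSD; repeats are served from RAM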
Well, GPUs are limited in persistent or network data access by PCIe speeds, which currently top out at about 64 GB/s per direction for a PCIe 5.0 x16 connection. That said, Nvidia has been doing a lot with NVLink and InfiniBand (as well as Ultra Ethernet?), so it's possible that's the avenue over which the data arrives.
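For reference, that figure is just the PCIe 5.0 line rate worked out (per direction, before packet/protocol overhead):

    # Back-of-the-envelope PCIe 5.0 x16 bandwidth
    line_rate = 32           # GT/s per lane for PCIe 5.0
    lanes = 16
    encoding = 128 / 130     # 128b/130b encoding
    print(line_rate * lanes * encoding / 8)   # ~63 GB/s per direction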
I was thinking more in terms of systems with unified memory. Ryzen AI MAX+ is an example of that in the consumer space; Xeon Max and the newer Xeon 9xxx also have much larger memory bandwidth, and the same goes for newer Epycs. The fact that Nvidia GPUs have to suck data through a straw is their big weakness, one they have tried to fix through dedicated server architectures.
I think that's not in the cards. NFS is very much about centralized storage and point-to-point reads & writes. For its maintainers and key users, I'd expect simplicity and reliability are far more important. Scalable, parallel, network-based filesystems have been around for a long time, yet NFS hasn't really veered into such territory.
You are right, it does depend on the preferences of the maintainers. So either they decide to add features that let NFS scale up like those parallel network filesystems, or some of those filesystems will add features (like proper caching) to compete with NFS. Not sure which is easier.