@Maxxify, I appreciate your shedding some light on the subject. I get now where the term "folding" comes from. Some more questions if you don't mind:
1. How do DRAM and DRAM-less drives compare in performance? Do NVMe SSDs with DRAM generally draw more power, and thus generate more heat?
2. Does DRAM act as a tier 1 cache before data overflows into an SLC cache (if any) as tier 2, and then to pSLC as tier 3?
3. How has SSD performance improved over successive generations, and how has that improvement filtered down from premium SSDs to mainstream and value offerings? (This last question is a bit expansive, so you can just summarize the highlights.)
The term "folding" comes from SanDisk and later their "nCache" technology. You can find articles on this from 10+ years ago, I think. In any case the diagram/graphic they used to explain it at the time was basically showing a DMA-like (direct memory access) operation where 3 blocks of SLC/pSLC compacted into 1 block of TLC. It's more complex than this today as there are different ways to merge blocks but essentially that is the idea. One advantage that was noted is that this can be done on-die without controller interaction, which means you don't have the overhead that killed host (incoming) I/O. You're still limited by simultaneous die operations, though.
DRAM-less drives are more often 4-channel, so they pull less power as a result, but if you're comparing like-for-like (such direct comparisons were more common in the past, e.g. SM2263 vs. SM2263XT) then DRAM-less technically pulls less power, as it does away with external DRAM (which pulls some power itself) and with the DRAM memory controller needed to drive it. However, performance can be worse in some cases, which can make the DRAM-less drive less efficient in certain workloads/scenarios.
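To show why "lower power draw" and "more efficient" aren't the same thing, here's some rough arithmetic. The wattage and throughput figures are entirely hypothetical, just to illustrate the point:

```python
# Efficiency as work done per unit of energy: MB moved per joule.
# All numbers below are made up for illustration.

def efficiency_mb_per_joule(throughput_mbps, power_w):
    """MB moved per joule = throughput (MB/s) / power (W)."""
    return throughput_mbps / power_w

with_dram = efficiency_mb_per_joule(throughput_mbps=3000, power_w=5.5)  # hypothetical 8-ch + DRAM
dram_less = efficiency_mb_per_joule(throughput_mbps=1800, power_w=4.0)  # hypothetical 4-ch DRAM-less

print(f"with DRAM: {with_dram:.0f} MB/J")  # ~545 MB/J
print(f"DRAM-less: {dram_less:.0f} MB/J")  # ~450 MB/J: lower draw, but less work per joule here
```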
DRAM can act as a write cache but usually does not (or if it does, not in the way an HDD's DRAM cache does); it primarily serves as a metadata cache for mapping, wear management, etc. FYI, I explain this on my subreddit with my SSD Basics, which, although outdated, covers some of this. SSDs do use a volatile write cache, but you don't need a lot of memory for that when you're writing at a superpage level (e.g. 16KiB x 4 planes/die x 4 dies/channel x 4 channels). It makes more sense to take advantage of DRAM's latency for logical page (4KiB) mapping and other things. There can be multi-tier non-volatile caching, though. Static SLC -> dynamic SLC -> native is very common (e.g. static is FIFO, since it has different wear than dynamic), and it's possible to do pSLC -> pMLC/pTLC -> TLC/QLC and other arrangements, but those are not at all common.
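Here's the back-of-the-envelope math behind that, using the geometry from the post plus the common rule of thumb of one ~4-byte mapping entry per 4KiB logical page (the exact entry size varies by controller, so treat it as an assumption):

```python
# Superpage (write-stripe) size and L2P mapping-table size, roughly.

KiB, GiB, TB = 1024, 1024**3, 10**12

# Write stripe: page size x planes/die x dies/channel x channels.
page, planes, dies_per_channel, channels = 16 * KiB, 4, 4, 4
superpage = page * planes * dies_per_channel * channels
print(f"superpage: {superpage // KiB} KiB")  # 1024 KiB -> a small buffer covers write caching

# L2P (logical-to-physical) map: one entry per 4 KiB logical page.
capacity = 1 * TB
entries = capacity // (4 * KiB)
map_bytes = entries * 4  # assuming ~4-byte physical addresses per entry
print(f"L2P map for 1 TB: ~{map_bytes / GiB:.2f} GiB")  # ~0.91 GiB -> the usual ~1 GB DRAM per 1 TB
```

That's why the DRAM on a drive is sized roughly in proportion to capacity: it's there to hold the mapping table at low latency, not to buffer user writes.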
SSD performance has improved at the controller level and at the flash level (and for DRAM too, though mostly in power efficiency). Controllers are more efficient, have far higher IOPS (and deeper queues), better error correction (necessary for denser flash over time), more intelligent algorithms, etc. Flash has also improved a lot, even though people often say it hasn't. They point out that 4KiB random still feels the same, but in reality, at the flash level there have been significant improvements in latency as well as power efficiency, throughput, etc. Today's DRAM-less NVMe SSDs are insanely fast and efficient as a result.
Not to advertise, but I only post here from time to time (and mostly just in the Memory/Storage forum); you can find resources at my subreddit, including the Discord. Not to derail the thread: the SN5000 is a good example of the above, since the QLC on the 4TB has specifications that would've been pretty good for Gen3 TLC drives. So people saying "flash hasn't improved" should surely take no issue with this drive, but then they want those juicy sustained-write graphs. It otherwise looks to be an SN770, which was a great drive born of WD's experience with the hardware.