I do not think we will ever see software "optimized to take advantage of NVMe speed", just like we aren't seeing much of a hurry to make use of extra CPU cores/threads: for software to load things faster, its initialization code would need to be parallelized too, offloading load-time processing to worker threads so more stuff can load in parallel.
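To make that concrete, here's a minimal C++ sketch of that kind of parallel load. It only works because the assets are assumed independent of each other; the file names and the load_file() helper are hypothetical:

```cpp
// Sketch: loading independent assets in parallel with std::async.
// File names and load_file() are made up for illustration.
#include <fstream>
#include <future>
#include <iostream>
#include <iterator>
#include <string>

std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

int main() {
    // Only valid because these three files don't depend on each other.
    auto textures = std::async(std::launch::async, load_file, "textures.pak");
    auto sounds   = std::async(std::launch::async, load_file, "sounds.pak");
    auto models   = std::async(std::launch::async, load_file, "models.pak");

    // .get() blocks until each background load finishes.
    std::cout << "loaded "
              << textures.get().size() + sounds.get().size() + models.get().size()
              << " bytes\n";
}
```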
Much of the time, though, what needs to be loaded depends on stuff that has yet to be loaded and processed. Take an OS as an example: the computer cannot load the kernel before it has gone through POST, loaded the MBR, and booted into the MBR code, which loads the OS bootloader; the bootloader then loads the kernel along with critical drivers before handing control to the kernel, which picks up loading the remaining drivers, boot-time tasks, the GUI, and so on. The boot order cannot be circumvented because each stage relies on the previous stages completing their tasks. Applications do much the same thing with libraries that have their own dependencies nested several libraries deep.
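The same dependency chain shows up in ordinary application startup. Here's a tiny sketch (all names hypothetical) of why threads can't help: each stage consumes the previous stage's output, so the total time is the sum of the stages no matter how many cores are available:

```cpp
// Sketch of a dependent init chain -- each step needs the previous one's
// result, so the steps serialize regardless of core count.
#include <iostream>
#include <string>

struct Config  { std::string plugin_dir; };
struct Plugins { int count; };

Config  read_config() { return {"plugins/"}; }
Plugins scan_plugins(const Config& c) {        // needs Config first
    std::cout << "scanning " << c.plugin_dir << "\n";
    return {3};
}
void init_ui(const Plugins& p) {               // needs Plugins first
    std::cout << "UI ready with " << p.count << " plugins\n";
}

int main() {
    Config  cfg     = read_config();      // stage 1
    Plugins plugins = scan_plugins(cfg);  // stage 2: blocked on stage 1
    init_ui(plugins);                     // stage 3: blocked on stage 2
}
```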
Many tasks are intrinsically sequential and will never run in parallel, at least not without exceptional (read: impractical) effort. Unless a task is exceptionally compute-intensive and takes an unacceptably long time to complete, it usually just isn't worth the development effort. That's why most games have one main thread active ~100% of the time and a bunch of much lighter support threads: developers don't bother delegating work to threads unless the job is easily decoupled from the main code (e.g. rewriting a configuration or save file, as in the sketch below) or it is absolutely necessary for performance reasons.
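The save-file case is the textbook "easily decoupled" job: hand the worker a copy of the data and let the main loop keep running. A minimal sketch, with a hypothetical file name and game state:

```cpp
// Sketch: writing a save file on a worker thread while the main
// loop keeps running. "savegame.dat" and the state string are hypothetical.
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

void write_save(std::string data) {       // takes a copy, so the main thread
    std::ofstream out("savegame.dat");    // can keep mutating its own state
    out << data;                          // while the write happens
}

int main() {
    std::string state = "player_x=10;player_y=20";
    std::thread saver(write_save, state); // kick off the save in the background

    // ... main loop keeps simulating/rendering here ...
    std::cout << "game keeps running while the save is written\n";

    saver.join();                         // only wait for it at shutdown
}
```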