Hand optimization has always created faster executing code than relying on the compiler/assembler.
I once programmed a VLIW DSP that was practically impossible to code for in hand-written assembly language, due to all of the scheduling constraints, register bank hazards, etc. that you'd have to keep track of. Messing up could result in program bugs or even locking the CPU (IIRC). It also had 128 general-purpose registers, which would be a heck of a lot to try and allocate manually. In highly-unrolled loops, you did actually need that many.
Instead, what I did was to write C code with intrinsics. Then, I'd look at the assembly language generated by the compiler and see how close it looked to what I expected. While there was any gap, I'd revisit the C code and do some more tweaking to try and resolve whatever was tripping up the compiler. In the end, I think the code came extremely close to the theoretical performance limits of the CPU, just based on the density of actual arithmetic operations. Granted, these were DSP-type operations, like convolutions and other fairly tight loops.
This was about 25 years ago, and compilers were already that good. The main thing people are missing is how to express their intent in a way that's not overspecified and won't otherwise trip up compiler optimizations. Here's where detailed knowledge of the language and a bit of compiler knowledge are even more valuable than detailed knowledge of a particular microarchitecture.
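To give a rough idea of what "not overspecified" can look like, here is a minimal sketch in plain C. It is not the DSP code described above; the name `fir_filter` and its parameters are made up for illustration. The loop is written in the most direct form, and `restrict` promises the compiler that the three buffers don't overlap, which is often what it needs to know before it will unroll or vectorize a loop like this. Whether a given compiler actually does so is something you'd confirm by reading the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: direct-form FIR filter.
 * out[i] = sum over k of in[i + k] * taps[k], for i in [0, n_out).
 * The caller must supply at least n_out + n_taps - 1 input samples.
 * `restrict` states that out, in, and taps never alias each other. */
void fir_filter(float *restrict out, const float *restrict in,
                const float *restrict taps, size_t n_out, size_t n_taps)
{
    assert(n_taps > 0);
    for (size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        /* Plain indexing, no manual unrolling or pointer tricks:
         * the trip count and access pattern stay obvious to the compiler. */
        for (size_t k = 0; k < n_taps; ++k)
            acc += in[i + k] * taps[k];
        out[i] = acc;
    }
}
```

Compiling with something like `-O2 -S` and comparing the inner loop against what you expected is the same workflow as described above, just without the target-specific intrinsics.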
Also, like I mentioned, it's worth really understanding how CPU caches work. Most programmers only have some kind of vague or idealized sense of how they think caches should work, if that.
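A small sketch of the kind of thing that matters here, assuming a matrix stored row-major in one flat array (the function names are made up for this example). Both functions compute the same sum. The first walks memory contiguously, so consecutive accesses tend to land in a cache line that is already loaded. The second jumps `cols` elements between accesses, so on a large matrix it typically touches a different cache line almost every time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: m holds rows * cols ints in row-major order. */

/* Inner loop moves along a row: stride of one element. */
long sum_row_order(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

/* Inner loop moves down a column: stride of cols elements. */
long sum_column_order(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

How big the difference is depends on the matrix size, the cache hierarchy, and whether the compiler interchanges the loops for you, so it's worth timing both on the actual target rather than assuming.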
Given that the basics are no longer taught to prospective programmers, you are correct. However, I have yet to find any compiler/assembler that can rival an experienced assembly language programmer.
I knew a guy who worked in the compiler group at SGI in the mid-1990s. He said they would treat it as a compiler bug if someone could hand-code assembly language that was faster than the code generated by their optimizing compiler.
I also worked with a former DEC employee who said the hand-optimized MMX/SSE code I wrote at the time was doing transformations their optimizing compiler could do with ease.