JQB45:
I guess what I was getting at was that 16-bit assembly is easier than C. It's also easier than 32- and 64-bit assembly, in my opinion.
I have to disagree with this. Intel's use of segmentation in real mode and 16-bit protected mode resulted in tons of bugs and unnecessary complexity. Being able to reach the entire 4 GiB address space with flat 32-bit offsets in 32-bit protected mode, and the near-total elimination of segmentation in 64-bit long mode (only the FS and GS base addresses remain in use), greatly simplified the architecture.
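To illustrate one source of that complexity: in real mode a physical address is formed as segment × 16 + offset, so many different segment:offset pairs alias the same byte, and addresses past 1 MiB wrap around on an 8086. The helper below (`real_mode_linear` is a hypothetical name, used only for illustration) is a minimal sketch of that translation, assuming 8086 behavior with the A20 line disabled.

```c
#include <stdint.h>

/* Hypothetical helper modeling 8086 real-mode address translation:
 * linear = segment * 16 + offset, truncated to the 20-bit address bus
 * (the behavior when the A20 line is disabled). */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    return linear & 0xFFFFFu; /* keep only the low 20 address bits */
}
```

Here `real_mode_linear(0x1234, 0x0010)` and `real_mode_linear(0x1235, 0x0000)` both yield 0x12350, while `real_mode_linear(0xFFFF, 0x0010)` wraps to 0 instead of reaching 0x100000. In flat 32-bit or 64-bit code, a pointer is simply an offset into one address space, so this class of aliasing bug doesn't come up.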
C is very hard to master, but it is very easy to learn. With proper use of the language's constants, symbols, macros, and pragmas, it's possible to make almost any C program completely portable across platforms. In theory, a single well-written source tree can be used to build an identically functioning program for any number of architectures, operating systems, and ABIs, provided that a compiler supports each target.
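As a small sketch of what that looks like in practice (the names `read_le32`, `platform_name`, and `PLATFORM_NAME` are hypothetical, not from the original post): fixed-width types from `<stdint.h>` plus byte-wise shifts make binary parsing independent of host byte order, and the few genuinely platform-specific details are isolated behind the compilers' predefined macros.

```c
#include <stdint.h>

/* Decode a 32-bit little-endian value from a byte buffer.
 * Shifting individual bytes gives the same result on little- and
 * big-endian hosts, unlike casting the buffer to (uint32_t *), which
 * depends on host byte order and may violate alignment rules. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Unavoidable platform differences are confined to one place using
 * predefined macros, so the rest of the source tree stays identical. */
#if defined(_WIN32)
#  define PLATFORM_NAME "windows"
#elif defined(__APPLE__)
#  define PLATFORM_NAME "apple"
#elif defined(__linux__)
#  define PLATFORM_NAME "linux"
#else
#  define PLATFORM_NAME "unknown"
#endif

static const char *platform_name(void)
{
    return PLATFORM_NAME;
}
```

Given the bytes 0x78 0x56 0x34 0x12, `read_le32` returns 0x12345678 whether the host is little-endian or big-endian; only `PLATFORM_NAME` changes from one build target to the next.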
Assembly is platform-specific by its very nature. It's great for familiarizing oneself with the intricacies of the architecture and OS ABI, but there are very few cases where assembly is strictly required.
The only cases I can think of where assembly is strictly necessary are changing processor modes, performing very low-level system maintenance for which no compiler support is possible (such as complex hardware interrupt handling), and performing real-time tasks where specific instructions must be issued in a specific order. Unlike many application-level operations such as vector arithmetic, these cannot easily be performed via compiler intrinsics. For example, it's not possible to inline a switch from real mode to protected mode in the middle of a C function, as this would almost certainly screw up the compiler's view of the address space and muck with the code generator.
There are many cases where assembly is not strictly necessary, but is desirable as a matter of optimization or where a compiler deficiency results in undesirable behavior. The correct approach in these situations is to address the defect or deficiency in the compiler if possible.
For example, Intel's autovectorization in ICC is much more effective than GNU's autovectorization in GCC. Compiler intrinsics that explicitly invoke certain microprocessor behavior (these are usually just wrappers for the corresponding assembly instructions) can sometimes work around such issues, but they are usually not standardized and will break platform portability for that particular codepath. ICC, by contrast, can generate x87/MMX, SSE, and AVX codepaths from the same fully portable source code and then select the most appropriate one at load time based on the microprocessor's actual capabilities.
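A sketch of that tradeoff, with hypothetical function names: the portable loop leaves vectorization entirely to the compiler, while the hand-vectorized version uses the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` from `<xmmintrin.h>`, so it needs a preprocessor guard and a fallback on non-x86 targets. This only picks a codepath at compile time; it does not reproduce ICC's load-time dispatch.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#  include <xmmintrin.h>  /* SSE intrinsics: x86/x86-64 only */
#endif

/* Portable version: plain C that an autovectorizing compiler may turn
 * into SIMD code on whatever target it is building for. */
static void add_floats_portable(float *dst, const float *a,
                                const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Hand-vectorized version. Each intrinsic wraps roughly one instruction
 * (_mm_add_ps corresponds to ADDPS), so this path only exists on x86;
 * elsewhere it falls back to the portable loop. */
static void add_floats_simd(float *dst, const float *a,
                            const float *b, size_t n)
{
#if defined(__SSE__) || defined(_M_X64)
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                    /* scalar tail for n % 4 */
        dst[i] = a[i] + b[i];
#else
    add_floats_portable(dst, a, b, n);
#endif
}
```

Both functions produce identical results; the difference is that the second one ties that codepath to one instruction set. GCC's `target_clones` function attribute offers a comparable load-time dispatch from a single source function on targets that support ifunc resolution.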
If it's possible to hand-optimize a program or routine at the assembly level to a degree greater than that achievable through a compiler, then it's also possible to generalize and parameterize that optimization such that the compiler can perform that optimization automatically from portable code.