The problem is one of maintaining the program model. This is why we can't have processors with 16 execution units (well, we could, but most of them would sit idle) and get 16 IPC. Programs aren't just a bag of instructions that can be executed at any time; they have to be executed in a certain order. There are stretches where that ordering doesn't bind, but such instruction-level parallelism (i.e. what can be done side by side instead of one after the other) is very limited in typical code (an average of around 2 instructions at a time if you're lucky).
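Here's a toy C sketch of what that means (the function names and numbers are made up purely to illustrate):

```c
/* Serial dependency chain: each line needs the previous result,
   so even a 16-wide core can only work on one of these at a time. */
int chain(int a) {
    int b = a + 1;   /* depends on a */
    int c = b * 2;   /* depends on b */
    int d = c - 3;   /* depends on c */
    return d;
}

/* Independent operations: x, y, and z don't depend on each other,
   so a superscalar core can issue all three at once. */
int wide(int a) {
    int x = a + 1;
    int y = a * 2;
    int z = a - 3;
    return x + y + z;  /* only this final sum has to wait */
}
```

In `chain`, extra execution units buy you nothing; in `wide`, the three middle operations can all go in parallel. Real code is mostly `chain` with occasional bursts of `wide`.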
When you throw in multiple cores, it's even more restrictive. Think about it: on a single processor, if one instruction depends on another, it just has to wait until that instruction finishes and writes its result back to the register file (or forwards it directly). On a dual-core, if we split those two instructions up, one per core, and one depends on the other, we have to wait until the first core's result gets written out to *cache* and the second core reads it back in. That is a *huge* delay in processor terms, especially on a chip with a 30+ stage pipeline like Prescott.
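For the curious, here's roughly what that handoff looks like as code (a minimal pthreads sketch; the variable and function names are mine, not from any real workload):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int result = 0;
atomic_int ready  = 0;

/* "Instruction 1" runs on one core and publishes its result. */
void *producer(void *arg) {
    (void)arg;
    atomic_store(&result, 42);   /* the value the other core needs   */
    atomic_store(&ready, 1);     /* write has to travel through cache */
    return NULL;
}

/* "Instruction 2" runs on another core and depends on instruction 1. */
void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load(&ready)) /* spin until the write is visible   */
        ;                        /* this wait IS the cross-core delay */
    printf("%d\n", atomic_load(&result));
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, producer, NULL);
    pthread_create(&b, NULL, consumer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

The consumer's spin loop is exactly the wait described above: on one core the dependent instruction gets its input in a cycle or two via forwarding, while across cores the value has to crawl through the cache hierarchy first.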
Taking a single Von Neumann machine model (single thread, single memory, single processor, etc.) and automatically splitting it into semi-autonomous, parallel-executing sequences is a hard problem, and it's the main topic of research in parallel and grid computing.
That's not to say Intel isn't working on it. ICC 8 already includes an auto-parallelization feature that attempts to turn single-threaded code into multithreaded code. How well it does that, however, is another matter. Compilers are only so smart...
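To see why, consider two loops (my own example, not Intel's; `-parallel` is how I recall classic ICC enabling auto-parallelization, so take the exact flag as an assumption):

```c
#define N 1000000

/* Iterations are independent, and restrict tells the compiler the
   arrays don't overlap, so an auto-parallelizer can safely split
   this loop across threads. */
void scale(float *restrict a, const float *restrict b) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] * 2.0f;
}

/* Loop-carried dependency: a[i] needs a[i-1] from the previous
   iteration, so this loop must run serially no matter how smart
   the compiler is. */
float prefix_sum(float *a) {
    for (int i = 1; i < N; i++)
        a[i] += a[i - 1];
    return a[N - 1];
}
```

The first loop parallelizes trivially; the second has an ordering requirement that no compiler can remove, only restructure. Most real code sits awkwardly in between, which is why auto-parallelization results tend to underwhelm.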
"We are Microsoft, resistance is futile." - Bill Gates, 2015.