It seems to me the issue is not so much the decoders as the execution units.
For example, the old, now nearly dead NetBurst architecture could, at least in theory, decode/issue 6 instructions per clock cycle, but the bottleneck was that it could only retire two instructions per cycle.
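To put that bottleneck in numbers, here is a toy model in Python. It assumes sustained throughput is simply capped by the narrowest pipeline stage, which ignores stalls, cache misses and branch mispredictions. The function name is made up for illustration, and the widths are just the figures quoted above, not measurements.

```python
def sustained_ipc(stage_widths):
    """Upper bound on instructions per clock: the narrowest stage wins."""
    return min(stage_widths.values())

# Figures as quoted in the post (illustrative, not measured).
netburst = {"issue": 6, "retire": 2}
print(sustained_ipc(netburst))  # 2
```

However wide the front end is, this simple model never gets past the retire width.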
By contrast, the K8 has 3 complex execution units, while Conroe has 3 complex + 1 simple execution units, allowing more instructions to actually be completed per clock.
The huge advantage Conroe has is that all three of its SSE units are 128 bits wide versus 2 x 64 bits for the K8 - this is where the bulk of the performance gap seems to be.
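As a rough back-of-the-envelope check on the SSE widths, the peak number of SIMD bits processed per clock can be sketched like this. It assumes every unit can be kept busy every cycle, which real code rarely manages, and it uses only the unit counts and widths quoted above. The helper name is hypothetical.

```python
def simd_bits_per_clock(units, width_bits):
    """Peak SIMD bits per clock if every unit is busy every cycle."""
    return units * width_bits

conroe = simd_bits_per_clock(3, 128)  # 3 x 128-bit units
k8 = simd_bits_per_clock(2, 64)       # 2 x 64-bit units
print(conroe, k8, conroe / k8)  # 384 128 3.0
```

So on paper that is up to a 3x difference in peak SSE throughput per clock, though actual gains depend heavily on the workload.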
It should, of course, be noted that the numbers of instructions issued/retired are not quite comparable between these designs.
Both AMD and Intel take x86 instructions and break them down into simpler instructions internally (Intel calls these micro-ops, AMD calls them RISC86). Due to micro-op fusion the Core CPUs tend to do more work per internal instruction, while the K8 had a pretty good advantage over NetBurst in this respect.