This is very, very true, and the reason I hate "benchmarking" with a single app. You can only parallelize so much before you're required to radically change the architecture of your game. More cores means you can have more things going at once without two heavy apps being forced to share CPU resources. This becomes blatantly obvious when you start engineering systems using virtualization. Running 5~6 VMs on a platform with eight to twelve cores, you quickly see the benefit of having so many independent processor resources.
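To put a rough number on the "you can only parallelize so much" part, here's a quick Amdahl's-law sketch. The 10% serial fraction is just a number I picked for illustration, not a measurement of any real game:

```c
/* Rough sketch of Amdahl's law: even a modest serial fraction caps how much
 * extra cores can buy you. The 10% figure is my own assumption. */
#include <stdio.h>

int main(void)
{
    double serial = 0.10;                     /* assumed serial fraction of the work */
    int cores[] = { 1, 2, 4, 6, 8, 12, 16 };

    for (size_t i = 0; i < sizeof cores / sizeof cores[0]; i++) {
        int n = cores[i];
        /* speedup = 1 / (serial part + parallel part spread over n cores) */
        double speedup = 1.0 / (serial + (1.0 - serial) / n);
        printf("%2d cores -> %.2fx speedup\n", n, speedup);
    }
    return 0;
}
```

The payoff per extra core shrinks fast, which is exactly why extra cores show their worth running several independent heavy workloads (or VMs) rather than one app.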
BD's uarch was pretty solid in concept: most operations are integer ops / logic compares or memory operations, and actual FPU ops are pretty rare as a percentage. They stood to save a lot of die real estate by sharing one big FPU and coherent MMUs between cores while keeping separate ALUs. They screwed up in the execution of this idea: the cache has way too high a latency, coupled with mispredicts that only emphasize that latency. If they could find a way to fix the latency issue and get better prediction rates and data preload rates, then we would see a significant (25~30% would be my WAG) jump in performance per cycle. After all, pretty much every ALU in the world is the same; there are only so many ways to add 0 and 1 together.
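If you want to see how much that load-to-use latency matters, a pointer-chasing microbenchmark is the usual trick. This is just my own sketch (nothing to do with AMD's internal testing): every load depends on the previous one, so the time per iteration is roughly raw memory/cache latency, which is exactly what dependent integer code and mispredict recovery keep paying for.

```c
/* Pointer-chasing latency microbenchmark sketch: each load's address comes
 * from the previous load, so the loop can't be overlapped or prefetched and
 * the per-iteration time approximates load-to-use latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* ~4M pointers (~32 MB), well past any cache -- my pick */

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's shuffle builds one big cycle, so the walk below visits every
     * slot and the hardware prefetcher can't guess the next address. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t p = 0;
    for (size_t i = 0; i < N; i++)
        p = next[p];          /* each load waits on the last one: pure latency */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (end=%zu)\n", ns / N, p);

    free(next);
    return 0;
}
```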
GPGPU for physics would be easy to implement once they got their algorithms worked out. Right now they're being slow due to consoles not supporting it.
@Viridian,
A GPU will never be as good as a CPU at generic integer ops / logic compares; they'll always be a few orders of magnitude behind. It has to do with VLIW operations in general, and no amount of secret magic sauce will fix it. Intel found this out the hard way when they tried to have the Itanium compete with UltraSPARC and IBM POWER. In a VLIW CPU you execute multiple instructions simultaneously, but those instructions cannot be co-dependent. If one of those instructions has an operand that depends on another, then it must wait until the previous one is finished before it can be executed. With VLIW setups, instructions are not evaluated for dependencies until they're loaded for processing, and the dependent instructions have their results discarded and are run again. Basically, you can't do branch prediction or out-of-order execution very effectively with a VLIW uarch.

What a VLIW uarch can do very well is process large amounts of non-dependent calculations, basically vector math. That's perfectly fine for GPUs, as they're processing pixel elements and vertices, all of which require independent calculations, with the final result being a rendered image; they typically render one pixel per "thread". A 1920 x 1080 resolution is 2,073,600 pixels per frame; at 60 fps you get 124,416,000 pixels that need to be rendered per second. That's without considering AA or shaders being applied. 124 million separate vector calculations per second would be nearly impossible on a generic processor but easy on a vector processor with 200~500 separate processing engines (aka cores).

This is also why memory access seems so much more important to a GPU than a CPU. CPUs have ultra-fast cache inside them that typically holds all the data they're going to need for that moment in time. A GPU's massive amount of independent operations requires that it process very large amounts of data, data that won't even remotely fit into cache.
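Here's a toy pair of loops (my own illustration, and the per-pixel function is made up) showing the difference: the first is a strict serial chain that no amount of issue width can speed up, the second is per-pixel independent work, the kind those hundreds of little engines chew through.

```c
/* Dependent chain vs. independent per-pixel work, at 1920x1080. */
#include <stdint.h>
#include <stdio.h>

#define W 1920
#define H 1080

int main(void)
{
    static uint32_t pixels[W * H];   /* one frame's worth of pixels */

    /* Serial chain: every step needs the previous result, so a wide/VLIW
     * machine can't issue these side by side no matter how many lanes it has. */
    uint64_t acc = 1;
    for (int i = 0; i < W * H; i++)
        acc = acc * 1664525u + 1013904223u;

    /* Independent per-pixel work: every iteration stands on its own, so it can
     * be spread across hundreds of shader engines (or SIMD lanes on a CPU). */
    for (int i = 0; i < W * H; i++)
        pixels[i] = (uint32_t)i * 2654435761u;   /* stand-in per-pixel function */

    printf("%d pixels/frame, %d pixels/s at 60 fps (acc=%llu, px0=%u)\n",
           W * H, W * H * 60, (unsigned long long)acc, (unsigned)pixels[0]);
    return 0;
}
```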