You still haven't explained how you confused 8086 Integer / Logic instructions with SIMD vector instructions.
And you're still demonstrating a lack of knowledge: SIMD units aren't fed the kind of code that doesn't lend itself to massive parallelism. The whole point of creating SIMD (vs. MIMD) ISAs was to perform thousands of identical operations simultaneously from a single instruction stream. That is exactly how GPUs work: tens of thousands of operations in flight at once.
Ex: a 1920x1080 screen is 2,073,600 pixels. Each pixel is composed of three 8-bit color values and one 8-bit alpha value, for a total of 32 bits per pixel. To perform an operation such as making the screen brighter, the GPU must add a value to each of the three 8-bit color values. That ends up being 6,220,800 integer operations.
And even the lowest-end GPUs can perform this with ease, along with rendering everything else on the screen.
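Just to make it concrete, here's a minimal CPU-side sketch of that same brighten operation using plain SSE2 intrinsics (the function name and buffer layout are mine, assuming 32-bit RGBA pixels with alpha in the top byte of each little-endian dword, and a length that's a multiple of 16 bytes; tail handling omitted). One PADDUSB covers four whole pixels per instruction; a GPU's SIMD array just scales the same idea across far more lanes:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Brighten an RGBA framebuffer by adding `amount` to the three color
 * channels of every pixel, leaving alpha untouched. */
void brighten(uint8_t *pixels, size_t count, uint8_t amount)
{
    /* amount * 0x010101 places `amount` in bytes 0..2 of each 32-bit
     * pixel and 0 in byte 3 (the alpha lane). No carries: amount <= 255. */
    __m128i delta = _mm_set1_epi32((int)((uint32_t)amount * 0x00010101u));

    for (size_t i = 0; i + 16 <= count; i += 16) {
        __m128i px = _mm_loadu_si128((const __m128i *)(pixels + i));
        /* Saturating unsigned add: channels clamp at 255 instead of wrapping. */
        px = _mm_adds_epu8(px, delta);
        _mm_storeu_si128((__m128i *)(pixels + i), px);
    }
}
```

Sixteen of those 6,220,800 byte-adds retire per instruction even on a decade-old CPU; the GPU simply has vastly more of these lanes running at once.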
So tell us again: how do SIMD units not handle parallel operations well?
What I find truly funny is your ranting about integer / logic code: Add / Sub and CMP / JMP instructions, the types that tend to have long chains of serial dependencies. Yet you were attacking AMD's idea to integrate a GPU's SIMD units, hardware that doesn't even process Add / Sub / CMP / JMP. A quick sketch of what "serial dependencies" means is below.
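Here's a toy example (names are mine) of the kind of integer loop in question. Each iteration consumes the result of the previous one, so no amount of SIMD width helps; this is exactly what the scalar ALUs exist for:

```c
#include <stdint.h>
#include <stddef.h>

/* Loop-carried dependency: every Add/CMP/JMP in iteration i needs the
 * value produced in iteration i-1, forcing serial execution. */
uint32_t serial_chain(const uint32_t *data, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++) {
        h = h * 31 + data[i];   /* depends on the previous h */
        if (h == 0)             /* CMP / JMP on the value just computed */
            h = 1;
    }
    return h;
}
```

Contrast that with the brighten loop above, where every pixel is independent. The two workloads belong on different hardware, which is the entire premise of the design you're attacking.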
To integrate this GPU they first need to decouple the SIMD unit from the rest of the processing core. In doing so, they decided it would save die space to add another integer unit; after all, they're really small and do most of the heavy lifting for the CPU. While doing this they decided to use a shared L2 cache, and this is where I think they went wrong. Shared caches have higher latencies than dedicated caches due to coherency checks and locks on segments.
Anyhow, the ultimate idea is to have four to sixteen integer units coupled with a GPU's SIMD array. Instead of two 128-bit fused FPUs, you have 24+ 128-bit fused FPUs. These fused FPUs can be used for regular SIMD instructions (AVX / FMA4 / SSE / XOP), or for rendering textures and processing video data, or any combination thereof.
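For reference, this is the workhorse operation one of those 128-bit FMAC pipes executes: four single-precision a*b+c results per instruction, with a single rounding step. A minimal sketch, assuming a Bulldozer-era chip compiled with -mfma4 (the FMA4 spelling; later FMA3 hardware spells it _mm_fmadd_ps in <immintrin.h>):

```c
#include <x86intrin.h>  /* pulls in the FMA4 intrinsics with -mfma4 */

/* One fused multiply-add across four float lanes: (a * b) + c.
 * The same primitive underlies both AVX math and shader arithmetic,
 * which is why one pool of FMAC pipes can serve both roles. */
__m128 fmac_example(__m128 a, __m128 b, __m128 c)
{
    return _mm_macc_ps(a, b, c);
}
```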
Actually ... I've already answered all your questions in various posts. And the last one is the funniest. Assuming BD was finished in early 2011 and heading to production, they've had 10~12 additional months to continue working on it. With PD going to final silicon soon, that will signal when this phase of their design cycle is finished. I said it wasn't implausible for them to have worked out the cache latency issues and branch prediction issues during that time. Whether they have or have not is something nobody here actually knows, and we'll have to wait to see. My prediction is some small improvements but nothing major.
Your comparison of BD to a P4 is very telling of your knowledge of uArch and ISA. Back, foul fiend: buzzwords will avail you not.