To be honest, I actually do think this could very well be the next step in CPU evolution. It DOES seem logical, as we've seen a steady "build-up" of units throughout the history of the CPU:
■1980: Shortly after the release of the 8086 and 8088, Intel produces the 8087 FPU, a co-processor giving the computer native floating-point capability. It was a separately installed chip, somewhat analogous to the modern-day video card as an add-on.
■1989: With the introduction of the 80486, the FPU is integrated into the main core, rather than being a separate chip.
■1993: Intel introduces the Pentium, the first superscalar CISC CPU for home use, with a pair of ALUs to work with.
■1996: Intel introduces the Pentium MMX, which includes a special SIMD capability, allowing it to process multiple packed integers (two, four, or eight, depending on their size) with a single instruction.
■1998: AMD introduces, with the K6-2, their "3DNow!" extension, which follows up on MMX by allowing the CPU to process two floating-point numbers with a single operation, immediately doubling its "FLOPS" rating.
■1999: Intel introduces the Pentium III, bringing forth the "SSE" instruction set, which goes even further, giving the CPU full-fledged vector capability: up to FOUR floating-point numbers in one pass (see the short sketch after this list).
■2005: AMD introduces their "Toledo" Athlon 64 X2, the first home-use dual-core desktop CPU.
■2006: Intel introduces their "Kentsfield" Core 2 Quad, the first home-use quad-core desktop CPU.
■2010: Intel introduces their "Gulftown" Core i7 CPU, the first home-use hexa-core desktop CPU.
■2010/11: AMD introduces their "Llano" Fusion CPU, coupling traditional CPU cores with a large number of smaller, very-high-speed processing units, yielding the first desktop example of a "hybrid" design between traditional multi-core CPUs and super-wide, vector-based GPUs.
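To make that "SSE" step concrete, here's a minimal sketch (in C, using the standard SSE intrinsics from xmmintrin.h; the function name and data layout are just made up for illustration) of what "four floating-point numbers in one pass" looks like:

#include <xmmintrin.h>  /* SSE intrinsics, introduced with the Pentium III */

/* Adds two arrays of four floats; a single packed-add instruction (ADDPS)
   handles all four lanes, instead of four separate scalar additions. */
void add_four(const float *a, const float *b, float *out)
{
    __m128 va   = _mm_loadu_ps(a);      /* load four packed floats */
    __m128 vb   = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);   /* one instruction, four additions */
    _mm_storeu_ps(out, vsum);           /* store four results at once */
}

The same idea, scaled up to hundreds of lanes, is essentially what a GPU's stream processors do.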
So, over time, we've watched the CPU go from a simple, basic ALU, to having an FPU on a separate chip, to integrating that chip into itself, to gaining a SECOND ALU, then SIMD ability on first the ALUs and then the FPU, and on to multi-core designs. Now, finally, we're stacking large numbers of extra stream processors alongside traditional cores.
While this progress is thrilling, it also poses challenges, of course. As with any CPU design, there's the question of how to keep it adequately busy: extra care needs to be taken in the design so that the control and branching units can properly distribute the load and keep all of the stream processors equally busy.
[citation][nom]future_fusion_owner[/nom]therefore a 70 gigaflop Core i7 might be competing with a 1 teraflop AMD processor.[/citation]
This is pretty much exactly my point, with my comment about "Cell on crack" (though a slight nitpick: IIRC, the Core i7 920 starts at ~128 gigaflops, with the i7 980X reaching 230). The Cell's SPEs, which are entirely responsible for its (previously) impressive floating-point throughput, are not very different at all from the stream processors on a modern GPU.
Hence, it wouldn't take all that much to let the CPU issue instructions to those stream processors directly; it'd go beyond GPGPU and simply make that capability part of the CPU itself. And given that, as far as I understand, the clock-rate limit in GPUs lies with the texturing units and not the SPs, they could be clocked MUCH higher as part of a CPU, yielding a multi-teraflop CPU while other designs (hexa-core i7, Cell) are barely scraping the 200-gigaflop mark.
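For a rough sense of where those gigaflop ratings come from, the usual back-of-envelope formula is cores × clock × floating-point operations per cycle. A small sketch in C (the core counts, clocks, and ops-per-cycle figures below are purely illustrative assumptions, not vendor specs):

#include <stdio.h>

/* Theoretical peak throughput: cores x clock (GHz) x FP ops per cycle.
   Real, sustained throughput is always lower than this. */
static double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}

int main(void)
{
    /* Assumed: a quad-core at 2.66 GHz doing 8 single-precision ops/cycle via SSE */
    printf("CPU-style peak: ~%.0f GFLOPS\n", peak_gflops(4, 2.66, 8));
    /* Assumed: 480 stream processors at 1.5 GHz doing 2 ops/cycle (multiply-add) */
    printf("GPU-style peak: ~%.0f GFLOPS\n", peak_gflops(480, 1.5, 2));
    return 0;
}

Which is exactly why bolting a wide array of stream processors onto a CPU, and clocking them higher, gets you into multi-teraflop territory so quickly.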
Of course, it IS possible for others to still compete... External units are quite usable. The x87 FPUs were separate chips from their parent x86 CPUs, which issued instructions to them.