[citation][nom]oparadoxical_[/nom]Makes me wonder just what we will have in ten years from now... Especially for personal computers.[/citation]
GPUs like the Radeon HD 6970 have around 1,500 stream processors (vector cores). Like the FPUs in the OP, they can't execute the full spectrum of x86 instructions; they only handle a specialized subset aimed at one kind of task.
Likewise, we have growing numbers of do-everything cores on a die.
One important abstraction is that a "core" is just an FPU, SPU, TLB, etc., all on a die. A 4-core chip is basically 4 processors on one piece of silicon sharing one bus. A GPU is well over a thousand VPUs with shared memory, shared FPUs, and a shared bus and output.
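To make that split concrete, here's a minimal C++ sketch (the function names and sizes are just made up for illustration): the first loop is the kind of independent, element-wise float work a pile of VPUs handles well, while the second is the serial, branchy kind that still needs a general-purpose core.

[code]
#include <cstddef>
#include <cstdio>
#include <vector>

// VPU-friendly work: every iteration applies the same float math to
// independent data, so thousands of vector lanes can chew on it at once.
void saxpy(float a, const std::vector<float>& x,
           const std::vector<float>& y, std::vector<float>& out) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = a * x[i] + y[i];  // no branches, no cross-iteration dependencies
}

// VPU-hostile work: each step depends on the previous one and branches on the
// data, so it belongs on a general-purpose core.
long collatz_steps(long n) {
    long steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

int main() {
    std::vector<float> x(1024, 1.0f), y(1024, 2.0f), out(1024);
    saxpy(3.0f, x, y, out);                               // trivially parallel
    std::printf("%.1f %ld\n", out[0], collatz_steps(27)); // inherently serial
    return 0;
}
[/code]

A GPU runtime would happily farm the first loop out across its VPUs; the second would just leave them idle.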
The end game is processor chips with specialized parts doing different, specialized tasks, all on one die. Sandy Bridge's integrated graphics is an early taste of that: it's really just a fancy way of throwing a bunch of VPUs onto the die and letting them share a slice of the L3 with the CPU cores.
In a decade, expect processor chips to have much more cache, plus a collection of VPUs / SPUs / etc. sitting on top of however many register sets and TLBs the practical limits of parallelism allow.
Merge the cores and you get processors of, say, 256 cores, where 64 are general-purpose TLB/register sets and 192 are mixed FPUs/VPUs doing the heavy computation for the general cores. If you add a couple of floats, that work gets sent to an FPU; if you have a bunch of float math to do in parallel, the process gets each operation delegated to its own FPU/VPU.
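That 64/192 split is obviously speculation, but the delegation itself already exists in miniature: a lone float add goes through the scalar FPU, while a pile of independent float math can be handed to the vector unit a few lanes at a time. A rough C++/SSE sketch of the difference (nothing here is tied to any particular future chip):

[code]
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    // "Add some floats": a lone operation like this just runs on the scalar FPU.
    float a = 1.5f, b = 2.25f;
    float c = a + b;

    // "A bunch of float math in parallel": independent element-wise work can be
    // delegated to the vector unit instead -- four lanes per instruction with SSE.
    alignas(16) float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) float y[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(16) float z[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 vx = _mm_load_ps(&x[i]);            // load 4 floats
        __m128 vy = _mm_load_ps(&y[i]);
        _mm_store_ps(&z[i], _mm_add_ps(vx, vy));   // 4 adds in one instruction
    }

    std::printf("%.2f %.0f %.0f\n", c, z[0], z[7]);
    return 0;
}
[/code]

The hypothetical 256-core chip would just push the same idea further: the general cores spot the parallel float work and hand it off to whichever FPU/VPU blocks are free.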
That's mainstream computing, I think. Server markets are heading toward hardware with specialized, subset instruction sets that can't do normal computing tasks, but doesn't need to, and actually shouldn't, to save power. Every instruction you throw onto the CPU pile means more transistors dedicated to instruction decoding that you could be spending on more FPUs and such.