I just don't see it replacing the FPU completely, but I must say I'm pretty excited about how OpenCL improves performance. For things that don't demand low latency, using the GPU's resources is efficient.
Maybe you're getting your instruction sets confused. When we say "FPU" we're usually referring to the SIMD execution unit inside modern CPUs. Actual "FPU" instructions are 80-bit x87 and are very old. The FPU was originally a co-processor used to execute specialized math functions on values that had a floating decimal place. Doing such math in integer registers is cumbersome and slow, so the 80-bit FPU could do those operations much faster. It was expensive and only used in specific circumstances. With the advent of "gaming," floating-point math became more popular, and the 486s had their FPUs integrated into the CPU die instead of being a separate co-processor.
Today we have 64- and 128-bit Single Instruction Multiple Data (SIMD) units. SIMD is a way to do math on multiple data sets using a single instruction: adding A to B, C, and D in sequence is comparatively slow when you can add A to B, C, and D in a single go, with no need to clear registers, do multiple memory reads, or PUSH / POP on the stack. Early demand for multimedia pushed the creation of what we now recognize as modern SIMD units (SIMD had been around a long time, but was used for specialized computing tasks). MMX and 3DNow! were the initial consumer implementations of SIMD, though they were separate from the x87 processing unit. Flash forward, and SSE is faster than both MMX and x87. Execution units are now SIMD-native and merely emulate the processing of x87 FPU instructions.
So when we say "FPU" we're really referring to SIMD instructions, not legacy 80-bit x87 (though they're still used). GCN can process SSE SIMD instructions along with general integer operations, no need for another ISA or special compiler support. GPUs started as raster accelerators but have since evolved into very powerful SIMD array processors. They've since eclipsed the combined x87 FPU + SSE SIMD units inside our BD / Core CPUs. The next step would be to fuse those SIMD units into the CPUs directly and cut out the now-useless FPU. In essence your iGPU becomes your FPU, with no change in instruction sets required. Whereas before your program would issue an SSE instruction to the CPU, and the CPU would issue it to the SIMD FPU, now the CPU would issue it to the iGPU interface. The iGPU would then process it and send the result back.
What does this mean?
Right now every "core" on a chip has its own SIMD FPU. They would remove all of those and have each core utilize the iGPU's array. Remember, GPUs are the equivalent of 12~30+ FPUs, so even with eight cores sharing one SIMD iGPU there is plenty of power to go around. It's also a more efficient use of processing resources.
Honestly the "FPU" has already been replaced by SIMD units. We just refer to them both by the same word.
How is this different from OpenCL / GPGPU?
OpenCL / GPGPU are APIs that allow a software program to offload code to a special co-processor, namely the video card. You can use them to send work to your dGPU or iGPU. There is latency and it's not seamless; your program must be written specifically for those languages. What I was referring to previously is using the iGPU's SIMD arrays to process SSE / AVX / XOR / etc. instructions rather than individual per-core SIMD FPUs. Now that GPUs have progressed to the point where they're just giant SIMD arrays, it's become a waste of silicon to include an individual SIMD FPU in every core on a CMT die. The i5-2500K, for example, has four SIMD FPU units; Intel could remove those and instead have each core dispatch SIMD FPU instructions to the HD2/3/4/5K unit. The BD / PD 81xx has four large FPU units, each comprised of two smaller FPUs; that's eight SIMD FPUs that could be removed, with the instructions sent to the SIMD array unit instead.