Inside Flex FP
Bulldozer's shared floating-point unit has drawn some criticism, since an eight-core Bulldozer would have only four physical FPUs. Instead of a dedicated 128-bit FPU per core, as in current Phenom II designs, the Bulldozer architecture will feature a single 256-bit FPU shared by two integer cores.
The reason for this is simple: adding a second integer core to a Bulldozer module increases CPU real estate by only 12 percent. This is similar to the “shared resources” strategy Intel employs on a per-core basis with Hyper-Threading, but Flex FP does it at the component level and with a larger resource base.
AMD's stance is that because most programs contain significantly more integer code than floating-point code, each integer core does not need its own dedicated 256-bit FPU. By adding a second integer core that shares the same FPU, AMD can target Bulldozer directly at the most common instructions.
AMD's Flex FP also includes additional enhancements designed to improve performance and keep the data pipelines full. Bulldozer has dedicated schedulers for integer and floating-point instructions, rather than a single scheduler for both, as Intel uses on its Core-based processors.
With a separate FPU scheduler, floating-point operations can be scheduled independently of integer work. This can speed up floating-point operations and keep the FPU path filled, and it also takes the floating-point scheduling load off the integer cores. The only caveat is that there is a scheduler for each physical unit: two for integer and one for the FPU per module.
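The per-module counts described so far can be tallied in a short sketch. The function name and dictionary keys below are illustrative only, not AMD terminology; the ratios (two integer cores, one shared FPU, and one scheduler per physical unit in each module) are taken from the description above.

```python
def bulldozer_resources(cores):
    """Tally units for a chip built from two-core modules.

    Assumes every module holds two integer cores, one shared FPU,
    and one scheduler per physical unit, as described in the article.
    """
    if cores <= 0 or cores % 2 != 0:
        raise ValueError("cores must be a positive even number (two per module)")
    modules = cores // 2
    return {
        "modules": modules,
        "integer_cores": cores,
        "fpus": modules,               # one shared 256-bit FPU per module
        "integer_schedulers": cores,   # one per integer core
        "fp_schedulers": modules,      # one per shared FPU
    }


if __name__ == "__main__":
    for name, count in bulldozer_resources(8).items():
        print(f"{name}: {count}")
```

For an eight-core part this gives four modules, four FPUs, eight integer schedulers and four floating-point schedulers.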
AMD's Flex FP is designed around a full 256-bit FPU that can be split into two 128-bit data pipes. Flex FP lives up to its name: a single core can push two 128-bit SSE instructions through the FPU, or both cores can each process one 128-bit floating-point instruction at the same time. Support for AVX (Advanced Vector Extensions) instructions allows Flex FP to handle full 256-bit floating-point execution, but programs need to be recompiled to take advantage of it.
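The configurations listed above can be expressed as a toy capacity check. This is a deliberately simplified model, not a description of the real scheduler: it assumes only that the FPU offers two 128-bit pipes and that a 256-bit operation occupies both. The names `pipes_needed` and `fits_in_one_pass` are made up for illustration.

```python
PIPE_WIDTH_BITS = 128  # each data pipe is 128 bits wide
PIPE_COUNT = 2         # two pipes make up the 256-bit FPU


def pipes_needed(width_bits):
    """Return how many 128-bit pipes an operation of this width occupies."""
    if width_bits not in (128, 256):
        raise ValueError("this sketch models only 128-bit and 256-bit operations")
    return width_bits // PIPE_WIDTH_BITS


def fits_in_one_pass(ops):
    """Check whether a set of operations fits in the shared FPU at once.

    ops is a sequence of (core_id, width_bits) pairs. The core id is kept
    only for readability; the capacity check does not depend on it.
    """
    return sum(pipes_needed(width) for _core, width in ops) <= PIPE_COUNT


if __name__ == "__main__":
    print(fits_in_one_pass([(0, 128), (0, 128)]))  # one core, two SSE ops
    print(fits_in_one_pass([(0, 128), (1, 128)]))  # one SSE op per core
    print(fits_in_one_pass([(0, 256)]))            # one AVX op
    print(fits_in_one_pass([(0, 256), (1, 128)]))  # exceeds the two pipes
```

The first three cases fit within the two pipes; the last does not, which is the sense in which the two cores contend for the shared unit in this model.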
AMD is promoting Flex FP as a more flexible design that can easily handle both standard 128-bit floating-point code and the enhanced 256-bit AVX instructions. This differs from what Intel will offer with the Sandy Bridge FPU, which can process 1x128-bit in legacy mode and 1x256-bit with AVX code. Flex FP allows multiple configurations, so Bulldozer should be able to operate as a full 256-bit FPU, just not in the same way.
The difference is that, regardless of the configuration, Flex FP handles only 128-bit pieces, pairing them up as 2x128-bit for a 256-bit AVX instruction. Intel can handle a full 256-bit floating-point AVX instruction per core, and also has a dedicated 128-bit path for legacy applications. The two approaches may sound equivalent, but this slight difference means that a Sandy Bridge multi-core processor should be faster when using the AVX instruction set.
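To make the 2x128-bit pairing concrete, the sketch below adds two 256-bit vectors by processing their lower and upper 128-bit halves separately and joining the results. It shows only that the arithmetic result is the same as a full-width operation; it does not model how either vendor's hardware sequences the halves, or the resulting throughput. Lanes are stood in for by ordinary Python floats, and all function names are hypothetical.

```python
LANES_256 = 8  # eight 32-bit lanes in a 256-bit vector
LANES_128 = 4  # four 32-bit lanes in a 128-bit half


def add_128(a, b):
    """Lane-wise add of two 128-bit halves (four lanes each)."""
    if len(a) != LANES_128 or len(b) != LANES_128:
        raise ValueError("expected four lanes per operand")
    return [x + y for x, y in zip(a, b)]


def add_256_as_two_halves(a, b):
    """Add two 256-bit vectors as a pair of independent 128-bit adds."""
    if len(a) != LANES_256 or len(b) != LANES_256:
        raise ValueError("expected eight lanes per operand")
    low = add_128(a[:LANES_128], b[:LANES_128])
    high = add_128(a[LANES_128:], b[LANES_128:])
    return low + high  # concatenate the halves back into eight lanes


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [10.0] * LANES_256
    print(add_256_as_two_halves(a, b))
```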