juanrga :
No. 4 SP "numbers" per vector can be obtained with both FMA4 and SSE2. The difference is that one set provides 2 ops per element and the other provides only 1. It is evident which set the computation above is using, unless you confuse the two.
You are again confusing modules with cores. The Kaveri APU comes in 2-core and 4-core configurations. It is the possibility of a 6-core part that was dropped.
No. SSE, like AVX1, has up to 2 operands in the same vector instruction; AVX2 has up to 3 (FMA3) and XOP has up to 4 (FMA4). How many "numbers" those operations work on, in the context of SIMD, depends on the width of the vector, but be assured the count of "numbers" is not the most useful measure (a single SSE instruction can operate on up to 16 byte-sized numbers or 8 16-bit short integers per vector); what matters is what those operations can do... look it up on Wikipedia.
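To make the counting concrete, here is a small sketch (my own illustration, not an official vendor formula): the number of packed "numbers" is just vector width divided by element width, and FLOPs per instruction is lanes times ops-per-element (1 for a plain SSE add/mul, 2 for a fused multiply-add such as FMA4).

```python
# Sketch: how many "numbers" fit in a vector, and FLOPs per instruction.
# These conventions are my own illustration of the counting argument.

def lanes(vector_bits, element_bits):
    """Number of packed elements ("numbers") per vector register."""
    return vector_bits // element_bits

def flops_per_instruction(vector_bits, element_bits, ops_per_element):
    # ops_per_element: 1 for a plain add/mul, 2 for a fused multiply-add
    return lanes(vector_bits, element_bits) * ops_per_element

# A 128-bit SSE register holds:
print(lanes(128, 8))    # 16 byte-sized numbers
print(lanes(128, 16))   # 8 16-bit short integers
print(lanes(128, 32))   # 4 single-precision floats

# Same 4-lane SP vector, different instruction sets:
print(flops_per_instruction(128, 32, 1))  # plain SSE mul OR add: 4 FLOPs
print(flops_per_instruction(128, 32, 2))  # FMA (mul + add fused): 8 FLOPs
```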
It is you who is confusing modules with cores...
Use your brain: that "CHART" is for a Llano. The integer clusters/cores on AMD BD don't do FLOPS, or are you not sure?
Here is the diagram of K10
https://pt.wikipedia.org/wiki/AMD_K10
https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/AMD_K10_Arch.svg/300px-AMD_K10_Arch.svg.png
And here are the instructions
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Llano has 2 128-bit pipes capable of SSE (up to 4.1) per core, while BD only has 2 128-bit pipes per 2-core cluster (module) — this in the context of FLOPS, i.e. FP ops. And while an SSE instruction can hold up to 4x 32-bit FP "numbers", that is only in the context of SIMD: the same instruction doing the same simple operation on all of those numbers, which is not that terribly useful. SSE doesn't have multiply-accumulate; in terms of pure arithmetic operations it does only one at a time, a mul or an add. And though Llano theoretically has double the FLOP rate of BD — 2 128-bit pipes per core against 2 per 2 cores (module) on BD — that doesn't make its performance better than BD's. On the contrary, most of the time it is worse. (The FMISC pipe in Llano's K10 is identical to one of the MMX pipes in BD, and the other MMX pipe in BD also doesn't do FP ops.)
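The per-core comparison above can be sketched numerically (my arithmetic, using only the pipe counts claimed in this post):

```python
# Peak SP FLOPs/cycle per core, from the pipe counts claimed above.
SP_LANES = 128 // 32  # 4 single-precision lanes per 128-bit pipe

# Llano: 2 x 128-bit SSE pipes per core, 1 op per lane (no FMA)
llano_per_core = 2 * SP_LANES * 1            # 8 FLOPs/cycle/core

# Bulldozer with SSE: 2 x 128-bit pipes shared by a 2-core module
bd_sse_per_core = (2 * SP_LANES * 1) / 2     # 4 FLOPs/cycle/core

# Bulldozer with FMA4: same shared pipes, but 2 ops per lane
bd_fma4_per_core = (2 * SP_LANES * 2) / 2    # 8 FLOPs/cycle/core

print(llano_per_core, bd_sse_per_core, bd_fma4_per_core)
```

With plain SSE, Llano's theoretical per-core rate is double BD's; with FMA4, BD pulls back level on paper — which is why the peak numbers alone settle nothing.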
So theoretical peak rates can, most of the time, be quite deceiving... good only for entertaining morons.
SSE floating point
Floating point instructions
* Memory-to-register/register-to-memory/register-to-register data movement
--- Scalar– MOVSS
--- Packed – MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS
* Arithmetic
--- Scalar – ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
--- Packed – ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
* Compare
--- Scalar – CMPSS, COMISS, UCOMISS
--- Packed – CMPPS
* Data shuffle and unpacking
--- Packed – SHUFPS, UNPCKHPS, UNPCKLPS
* Data-type conversion
--- Scalar – CVTSI2SS, CVTSS2SI, CVTTSS2SI
--- Packed – CVTPI2PS, CVTPS2PI, CVTTPS2PI
* Bitwise logical operations
--- Packed – ANDPS, ORPS, XORPS, ANDNPS
So as you can see (I hope), 128 GFLOPS at 4 GHz is only true for a Llano, and only in the context of not-very-useful single precision... BD's rate is half that per core. But I wouldn't count too much on SSE: BD's FMA4 can deliver quite a bit better performance even though its theoretical peak is half, all the more because BD has the same number of load/store engines as Llano but considerably more advanced ones, and in real-world performance, counting real sustained data rates for those SIMD units, BD delivers a beating. Got it now?
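The 128 GFLOPS figure does check out arithmetically for Llano (my back-of-envelope sketch, assuming a quad-core part as in the top Llano configurations):

```python
# Back-of-envelope check of the 128 GFLOPS @ 4 GHz figure for Llano.
cores = 4              # quad-core Llano (assumed configuration)
pipes_per_core = 2     # 128-bit SSE-capable pipes per core
sp_lanes = 128 // 32   # 4 single-precision floats per 128-bit pipe
ops_per_lane = 1       # plain SSE: one mul OR one add per lane
clock_ghz = 4.0

peak_gflops = cores * pipes_per_core * sp_lanes * ops_per_lane * clock_ghz
print(peak_gflops)  # 128.0
```

The same formula with BD's shared pipes (2 per module instead of 2 per core) gives half that, which is exactly the "half per core" point above.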
[I prefer to count only real arithmetic ops per vector — so SSE is 2, AVX is 2 or 3 (FMA3) and XOP is 4 (FMA4) — and to count the width of the issue ports. In that case BD can do 4 ops per 128-bit pipe, but only with 128-bit FMA4... and in this context that Llano chart is very wrong (the count of "numbers" in those vectors is not that useful). But believe me, THERE IS MORE THAN ONE WAY TO EXTRAPOLATE PEAK FLOP RATES; different vendors use slightly different methods in what they count... YOU ARE WASTING YOUR TIME... and mine...]
Sorry to be aggressive, but this is the last time... don't bother anymore.