juanrga :
hcl123 :
juanrga :
AMD claims 8 FLOPs per cycle for the Steamroller FPU. The Piledriver FPU is also 8 FLOPs per cycle. I already devoted a couple of posts in this thread to explaining the 8 FLOPs per cycle.
Piledriver 2M @ 4GHz: 128 GFLOP
Steamroller 2M @ 4GHz: 128 GFLOP
AFAIR, at the presentation of BD, AMD claimed the ability to do 4 discrete 64-bit FP operations on the two 128-bit FMACs. Now if each of those 64-bit ops can be a MADD, it will be 8... but then why didn't they claim 8 instead of 4?
Nevertheless, the 8 FLOPS could come from SR having 2 FlexFPUs... it will have "kind of" doubled decoders and for sure a dedicated "dispatcher" per integer cluster/core, so 2 FlexFPUs, given the great modularity of the design, is not far-fetched, even with those 2 FlexFPUs sharing the same FP dispatcher front-end.
The 8 FLOPS could also come from each FMAC pipe now being 256 bits wide and capable of 256-bit ops per cycle without halves... which is one of the reveals going around (FP256)... in that case, following the logic, each FMAC pipe will be capable of 4 discrete 64-bit ops per cycle, and if MADD, 8 ops per pipe, 16 per FlexFPU... and if there are 2 of them, it will be 32 ops...
Now that would be something lol... everybody talks about Jim Keller, but I would rather see Gustafson's mark on it lol...
32 x 4GHz = 128 GFLOPS per module (with 2 FPUs), 256 GFLOPS for the APU, 512 GFLOPS for the CPU... umm, why do I think it's too much lol
Juanrga, 8 x 4GHz = 32 GFLOPS, not 128... and following the x86 scheme of the architectural RF, 32 ops will be the same number of 64-bit registers; the larger registers are zeroed. And even if the 8 ops are 64-bit, so that 32-bit single precision is double that, it will be 16 ops per cycle or 64 GFLOPS, not 128.
Yes, 128 GFLOPS by your logic is for 4-module chips and for 64-bit/32-bit FP ops, but APUs won't have 4 modules. So I think it is pertinent to say which one you are referring to, APU or CPU.
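The arithmetic in the two paragraphs above can be checked with a small sketch. It assumes the usual peak formula (FLOPs per cycle per unit x clock x number of units); the function name `peak_gflops` is hypothetical, and the 8 and 16 FLOPs per cycle and the 4 GHz clock are simply the figures quoted in this thread, not measured values.

```python
def peak_gflops(flops_per_cycle: float, clock_ghz: float, units: int = 1) -> float:
    """Theoretical peak in GFLOPS: FLOPs per cycle per unit x clock (GHz) x units."""
    return flops_per_cycle * clock_ghz * units


# Figures quoted in the thread (assumptions, not measurements).
CLOCK_GHZ = 4.0

one_fpu = peak_gflops(8, CLOCK_GHZ)                 # 8 ops per cycle, one FlexFPU
one_fpu_sp = peak_gflops(16, CLOCK_GHZ)             # if single precision doubles the rate
two_modules = peak_gflops(8, CLOCK_GHZ, units=2)    # 2-module chip at 8 per module
four_modules = peak_gflops(8, CLOCK_GHZ, units=4)   # 4-module chip at 8 per module

print(one_fpu, one_fpu_sp, two_modules, four_modules)
```

Under these inputs a 128 GFLOPS total needs either 4 units at 8 FLOPs per cycle or 2 units at 16 FLOPs per cycle, which is exactly where the two readings in this exchange diverge.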
At the presentation of BD, AMD claimed "4 DP/BD" for SSE2 and "8 DP/BD" for FMA4.
The 8 FLOPs claimed by AMD correspond to a single (shared) FlexFPU per module. The diagram of the Steamroller module has been posted here numerous times. I posted it again a couple of posts above: there are no "2 FlexFPUs".
The 8 FLOPs claimed by AMD correspond to 128-bit FMAC units.
The diagram of the Steamroller module has been posted here numerous times. I posted it again a couple of posts above: there is no "each FMAC pipe is now 256bit large", but the two 128-bit FMAC units can be fused into one 256-bit FMAC superunit, if needed.
The 128 GFLOP are not per module but per CPU (2 modules).
The 128 GFLOP are not per 4 modules but per CPU (2 modules).
Adding the 128 GFLOP of the 2 Steamroller modules to the 922 GFLOP of the GPU we obtain the 1050 GFLOP claimed by AMD for the APU.
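As a quick check of the sum above, and of what per-cycle rate the 128 GFLOP figure implies, here is a minimal sketch. The helper name `implied_flops_per_cycle` is hypothetical, the 4 GHz clock is the one used earlier in the thread, and treating each module as 2 cores is an assumption based on the two integer clusters per module discussed here.

```python
def implied_flops_per_cycle(total_gflops: float, clock_ghz: float, units: int) -> float:
    """Back out FLOPs per cycle per unit from a claimed peak figure."""
    return total_gflops / (clock_ghz * units)


cpu_gflops = 128   # CPU figure quoted in the thread
gpu_gflops = 922   # GPU figure quoted in the thread
apu_total = cpu_gflops + gpu_gflops

CLOCK_GHZ = 4.0          # clock used earlier in the thread
MODULES = 2
CORES = MODULES * 2      # assumption: 2 integer cores per module

per_module = implied_flops_per_cycle(cpu_gflops, CLOCK_GHZ, MODULES)
per_core = implied_flops_per_cycle(cpu_gflops, CLOCK_GHZ, CORES)

print(apu_total, per_module, per_core)
```

With these inputs the total is 1050, and the 128 GFLOP figure works out to 16 FLOPs per cycle per module, or 8 per cycle if counted per core. Whether "8 FLOPs per cycle" was meant per core or per FlexFPU is not settled by this arithmetic.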
You are a confused and confusing guy lol
What is "DP/BD" ... double precision and ?
"" The 8 FLOPs claimed by AMD correspond to a single (shared) FlexFPU per module. "" ... yes, I doubted it but did not contest it. But only if it is 4 ops per 128-bit pipe, and only for 32-bit FP ops. What you don't understand is that a "vector" is like an agglomeration of simpler operations, in this case multiply+add, which has to be presented with up to 4 operands. Since the "issue port" is only 128 bits wide, to be per cycle it must correspond to 128-bit FMA4 instructions, i.e. vectors of 4x 32-bit ops, because with 64-bit ops it will not fit "per cycle" through 128-bit ports (it must take 2 or more cycles, or be split in halves across 2 pipes).
Those (128-bit FMA4) exist and are part of the XOP package. But since Intel with its clout imposed an embargo, nobody really uses XOP outside of a very few specialized apps, so that claim is not about a pure, pervasive FLOP capability but only a special case. Besides, vectors of 32-bit FP ops are not what is most useful; vectors of 64-bit ops would be better.
So your 128 GFLOPS can only be for a 4-module CPU, because 8 FLOPs per FlexFPU is 32 GFLOPS, and only a 4-module chip could have 128 GFLOPS from the CPU side... if that is what you are stating...
FLOPS = number of ops x frequency; in that case, per FlexFPU it is 8 x 4GHz = 32 GFLOPS, not 128... and yet not very useful, because it holds only for 128-bit FMA4 instructions. 256-bit FMA4 XOP instructions use both FMAC pipes; though they are vectors of 64-bit ops, the rate is half. Got it now?
This is what led me to the speculation about 256-bit FMACs, that is, a single FMAC that is 256 bits wide internally and *must* have a 256-bit port... and using 2 FlexFPUs per module leads to up to 32 64-bit ops per 4-module CPU. Now this could be very useful for all those scientific applications.
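The lane counting argued over in the paragraphs above can be sketched as follows. This is only an illustration: the function names are hypothetical, counting a multiply+add as 2 FLOPs per lane is the usual convention but an assumption here, and the sketch does not model whether the issue ports or the register file can actually feed the pipes every cycle, which is the point in dispute.

```python
def lanes(port_bits: int, element_bits: int) -> int:
    """How many elements of a given width fit through a port of a given width."""
    return port_bits // element_bits


def flops_per_cycle(port_bits: int, element_bits: int,
                    pipes: int = 1, multiply_add: bool = True) -> int:
    """Peak FLOPs per cycle: lanes x (2 if multiply+add else 1) x number of pipes."""
    per_lane = 2 if multiply_add else 1
    return lanes(port_bits, element_bits) * per_lane * pipes


# 128-bit pipe: 4 lanes of 32-bit or 2 lanes of 64-bit.
sp_128_one_pipe = flops_per_cycle(128, 32)             # 4 lanes x 2
dp_128_one_pipe = flops_per_cycle(128, 64)             # 2 lanes x 2
sp_128_two_pipes = flops_per_cycle(128, 32, pipes=2)
dp_128_two_pipes = flops_per_cycle(128, 64, pipes=2)

# Speculated 256-bit pipe with 64-bit elements.
dp_256_one_pipe = flops_per_cycle(256, 64)             # 4 lanes x 2
dp_256_two_pipes = flops_per_cycle(256, 64, pipes=2)

print(sp_128_one_pipe, dp_128_one_pipe, sp_128_two_pipes,
      dp_128_two_pipes, dp_256_one_pipe, dp_256_two_pipes)
```

Under this counting, one 128-bit pipe gives 8 FLOPs per cycle only with 32-bit elements, while a hypothetical 256-bit pipe would give 8 per pipe and 16 per FlexFPU with 64-bit elements.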
"" The diagram of the Steamroller module has been posted here numerous times. I posted again a pair of post above: there is no "each FAMC pipe is now 256bit large ""... have you seen an actual diagram, after all these delays, of an actual Steamroller chip?... NO!
Yes, it is quite possible that the APU modules only have 1 FlexFPU; after all, they have a GPU, which has much more FLOP capability, and compute programming is here to stay. But a server/FX chip could very well have 2 FlexFPUs per module; the "kind of double" decode, the double dispatch, and other improvements could be sufficient to sustain the needed rates even with 256-bit FMACs on 2 FlexFPUs per module.
Yes, those were not revealed in any "slide presentation", but they were disclosed in a "programmer guide". Just google AMD FP256; you'll probably end up at planet3dnow... a help...
http://www.planet3dnow.de/photoplog/file.php?n=24314&w=o
256-bit AVX instructions executed with full-width internal operations and "pipeline", rather than decomposing them into 128-bit sub-operations... that can only mean the pipeline executing them has 256-bit "issue ports" and so is 256 bits wide as well, even if internally it is composed of bridged 128-bit sub-pipelines also suitable for other operations. With two 128-bit pipes working together, it would have to be done in halves, or the arbitration for register file access will be hell and the chip will clock slower.
Now that seems to me like they need 256-bit FMAC pipes... doesn't it to you?
juanrga :
hcl123 :
No, Intel is not
i7-3770k: 224 GFLOP
i7-4770k: 448 GFLOP
For FP ops those chips have 3 128-bit ports; the max possible will be 6 MADD, or 12 64/32-bit FP ops.
12 x 3.6GHz = 43.2 GFLOPS per core, x 4 cores = 172.8 GFLOPS (max possible) for either IB or Hasfail. It is less in practice in any case, because that FP throughput would have to be sustained from the L/S buffers: Intel designs only have 1 L/S engine per core (2 threads), while AMD has 2 L/S engines per module (2 threads), so AMD could "potentially" sustain double what Intel does for the same number of threads per chip. (edt)
And no... I doubt in any case that Steamroller uses the same FPU as Piledriver. Officially presenting an FPU 30% smaller due to HDLs, with only 1 MMX pipe, and presenting FP256 wouldn't make sense then.
It could be quite different... and better... more so with 2 FlexFPUs per module, which is quite possible.
Wrong again. The above GFLOP figures are the numbers claimed by Intel for the CPU. The 448 GFLOP of the Haswell i7 CPU are reported on several sites.
Intel likes to use DP in its technical datasheets. If you want to obtain DP values you only need to divide the above SP numbers by 2. E.g. the 224 GFLOP (SP) correspond to 112 GFLOP (DP).
You are indeed confusing and confused... you must tell from those numbers what is due to the GPU and what is due to the CPU cores, what corresponds to "single precision" and what to "double precision", and what kind of vectors they use... 244GFLOPS seems a little low even for a GT2, but must be single precision, since the GPU barely moves on "double precision"... but if there is magic and it is only the CPU side, just
"present the math please" that leads to those numbers.... FLOPS = number of ops x frequency
( no way in hell could haisfail have 244GFLOPS from the CPU cores side with only 3.6Ghz... gzz.. what propaganda does to ppl's heads lol (edt))
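For the "present the math" request, a minimal sketch that puts the two sets of numbers from this exchange side by side. It uses only figures quoted in the thread (12 ops per cycle at 3.6 GHz on 4 cores, and the claimed 224 and 448 GFLOP single precision), except for the 3.5 GHz clock, which is an assumption and is not stated anywhere above. The helper names are hypothetical.

```python
def peak_gflops(flops_per_cycle: float, clock_ghz: float, cores: int) -> float:
    """Theoretical peak: FLOPs per cycle per core x clock (GHz) x cores."""
    return flops_per_cycle * clock_ghz * cores


def implied_flops_per_cycle(total_gflops: float, clock_ghz: float, cores: int) -> float:
    """Per-core, per-cycle rate that a claimed peak figure would require."""
    return total_gflops / (clock_ghz * cores)


CORES = 4

# Port-based estimate from the thread: 12 ops per cycle at 3.6 GHz.
port_estimate = peak_gflops(12, 3.6, CORES)

# Single-precision figures claimed in the thread, and DP as half of SP.
claimed_sp = {"i7-3770k": 224, "i7-4770k": 448}
claimed_dp = {chip: sp / 2 for chip, sp in claimed_sp.items()}

# Assumption (not from the thread): a 3.5 GHz clock for both chips.
ASSUMED_CLOCK_GHZ = 3.5
implied = {chip: implied_flops_per_cycle(sp, ASSUMED_CLOCK_GHZ, CORES)
           for chip, sp in claimed_sp.items()}

print(f"port-based estimate: {port_estimate:.1f} GFLOPS")
for chip in claimed_sp:
    print(f"{chip}: SP {claimed_sp[chip]}, DP {claimed_dp[chip]:.0f}, "
          f"implied SP FLOPs/cycle/core {implied[chip]:.0f}")
```

Under the assumed clock, the claimed figures require 16 and 32 single-precision FLOPs per core per cycle, against the 12 used in the port-based estimate, so the disagreement comes down to the per-cycle rate rather than the formula.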