juanrga :
hcl123 :
You are a confused and confusing guy lol
What is "DP/BD" ... double precision and ?
Double precision per BD module.
Which is wrong: 8 FLOPs per FPU can only be achieved with 128bit FMA4, and those are single precision, not double. The main problem is the "issue port": though those are SIMD instructions, to claim a sustainable peak FLOP rate you have to count the width of the issue port.
juanrga :
hcl123 :
"" The 8 FLOPs claimed by AMD correspond to a single (shared) FlexFPU per module. "" ... yes, I doubted it but did not contest it. But only if it is 4 ops per 128bit pipe, and only for 32bit FP ops. What you don't understand is that a "vector" is like an agglomeration of simpler operations, and in this case multiply+add has to be present with up to 4 operands. Since the "issue port" is only 128bit wide, to be per cycle it must correspond to 128bit FMA4 instructions, or vectors of 4x 32bit ops, because with 64bit ops it will not fit "per cycle" through 128bit ports (it must take 2 or more cycles, or be split in halves for 2 pipes).
Those (128bit FMA4) exist and are part of the XOP package. But since Intel, with its clout, imposed an embargo, nobody really uses XOP outside of a very few specialized apps, so that claim is not about a pure pervasive FLOP capability but only a special case. Besides, vectors of 32bit FP ops are not what is most useful; better would be vectors of 64bit ops.
Did you read my posts? I already gave FLOP values for both SSE2 and FMA4 instructions. I also discussed both 32bit and 64bit.
I'm reading and answering now... what do you think ? And i don't care where you picked those values to parrot them, just give the math leading to them, don't pretend to have the authority just because you picked something from somewhere which you think is authoritative. There could be mistakes and there could be very wrong interpretations, which would make you automatically wrong not right, even if those are not your mistakes.
juanrga :
hcl123 :
So your 128 GFLOPS can only be for a 4 module CPU, because 8 FLOPs per FlexFPU is 32 GFLOPS, and only a 4 module chip could have 128 GFLOPS from the CPU side... if that is what you are stating... FLOPS = number of ops per cycle x frequency; and in that case per FlexFPU it is 8 x 4GHz = 32 GFLOPS, not 128... and yet not very useful, because it only holds for 128bit FMA4 instructions. 256bit FMA4 XOP instructions use both FMAC pipes, but as they are vectors of 64bit ops, the rate is half. Got it now ?
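The rule of thumb used in this post (peak FLOPS = units x FLOPs per cycle x clock) can be sketched in a few lines of Python. The function name `peak_gflops` and its parameters are made up for illustration; the 8 FLOPs per cycle and 4 GHz figures are the ones being argued over in this thread, not measured values.

```python
def peak_gflops(units, flops_per_cycle, ghz):
    """Theoretical peak in GFLOPS: units x FLOPs per cycle x clock in GHz."""
    return units * flops_per_cycle * ghz

# Figures taken from the discussion: 8 FLOPs per cycle per FlexFPU at 4 GHz.
per_module = peak_gflops(units=1, flops_per_cycle=8, ghz=4.0)
four_modules = peak_gflops(units=4, flops_per_cycle=8, ghz=4.0)

print(f"1 FlexFPU : {per_module:.0f} GFLOPS")
print(f"4 FlexFPUs: {four_modules:.0f} GFLOPS")
```

Under these assumptions the 128 figure only appears once four such units are counted, which is the point being made above.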
Did you read my posts? I already gave FLOP values for both SSE2 and FMA4 instructions. I also discussed both 32bit and 64bit.
Here we go again, sounds exactly like a parrot
juanrga :
hcl123 :
This is what led me to the speculation about 256bit FMACs, that is, one single FMAC is 256bit wide internally and *must* have a 256bit port... and using 2 FlexFPU per module leads to up to 32 64bit ops per 4 module CPU. Now this could be very useful for all those scientific applications.
"" The diagram of the Steamroller module has been posted here numerous times. I posted again a pair of posts above: there is no "each FMAC pipe is now 256bit large ""... have you seen an actual diagram of an actual Steamroller chip, after all these delays ?... NO!
YES, for instance in the japanese article (July this year) cited by 8350rocks. Besides that, AMD already gave (at Computex this year) the GFLOP for kaveri and the value coincides with that resulting from analysing the Steamroller module that everyone knows.
AMD "supposedly" gave the FLOP number for Kaveri, counting the GPU, which will be "single precision". You went along and extrapolated the frequencies and DP based on your "no math", incredibly likely very wrong assumptions.
juanrga :
hcl123 :
Yes, it is quite possible that the APU modules only have 1 FlexFPU; after all they have a GPU that has much more FLOP capability, and compute programming is here to stay. But a server/FX chip could very well have 2 FlexFPU per module; the "kind of double" decode, the double dispatch, and other improvements could be sufficient to sustain the needed rates even with 256bit FMACs on 2 FlexFPU per module.
No. AMD only releases a single Steamroller module. Kaveri APU is 'derived' from Berlin APU. There is no steamroller server/FX chip in the roadmaps.
Nobody really knows anything concrete about Berlin yet. AMD gave an interview where it stated the intention of continuing with FX... the rest are assumptions that could be wrong (even mine).
At least i have the care to expose that 2 FlexFPU could be technically possible, not that they will have it.
With you, is how you say and thats the end of it LOL (presumptions are the mother of all F ups lol)
juanrga :
hcl123 :
Yes, those were not revealed in any "slide presentation", but were disclosed in a "programmer guide". Just google AMD FP256, you'll probably land on planet3dnow... a help...
http://www.planet3dnow.de/photoplog/file.php?n=24314&w=o
256bit AVX instructions executed with full-width internal operations and "pipeline" rather than decomposing them into 128bit sub-operations ... can only mean the pipeline executing them has 256bit "issue ports" and so is 256 bit wide also, even if internally it is composed of bridged 128bit sub-pipelines also suitable for other operations. Going by 2 128bit pipes working together, it would have to be in halves, or the arbitration for Register File access will be hell and the chip will clock slower.
Now that seems to me like they are needing 256bit FMAC pipes...doesn't to you ?
No. In the first place, nothing in that manual mentions Steamroller. Some people speculate that the entry is for the future Excavator module. I disagree. It seems related to this
http://images.bit-tech.net/content_images/2011/10/amd-fx-8150-review/bulldozer-fp-unit.jpg IMAGE
In fact your hypothetical FPU with 256bit FMAC units gives GFLOP values that disagree with those given by AMD.
The correct values are:
- Steamroller module has 2 x 128bit FMAC units.
- Each unit can do 2 DP or 4 SP.
- SSE2 => 1 op; FMA4 => 2 op
- Steamroller Kaveri performance is 4C x (4SP x 2 FMA) x 4GHz = 4C x 8 FLOPs x 4GHz = 128 GFLOP
- The GPU performance is 512C x 2 FLOPs x 0.9GHz = 922 GFLOP (SP)
- Total APU performance: 922 + 128 = 1050 GFLOP. This is the value claimed by AMD officially. Moreover, AMD labs has confirmed that Steamroller gives "8 FLOPs per core".
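The arithmetic in the list above can be reproduced with a short Python sketch. All inputs (4 cores, 8 FLOPs per cycle, 4 GHz, 512 shader cores, 2 FLOPs per cycle, 0.9 GHz) are the figures quoted in this post; the variable names are illustrative only, and the result is a theoretical peak, not a benchmark.

```python
# CPU side as claimed in the post: 4 cores x 8 SP FLOPs per cycle x 4 GHz.
cpu_gflops = 4 * 8 * 4.0

# GPU side as claimed in the post: 512 shader cores x 2 FLOPs (one FMA) x 0.9 GHz.
gpu_gflops = 512 * 2 * 0.9

total_gflops = cpu_gflops + gpu_gflops

print(f"CPU  : {cpu_gflops:.1f} GFLOP (SP)")
print(f"GPU  : {gpu_gflops:.1f} GFLOP (SP)")
print(f"Total: {total_gflops:.1f} GFLOP, rounded {round(total_gflops)}")
```

The GPU term comes out at 921.6, which the post rounds up to 922, so the sum lands on the quoted 1050 only after rounding.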
Everything else is your own misunderstanding or pure fantasy.
What an arrogant little... quite an imagination lol... but you are right (*some*) in the simple math. I will only mention that 4 SP per 128bit pipe is possible only with 128bit FMA4 instructions, no other instructions could sustain that rate... and there isn't a 4C APU version according to what was revealed; the possibility of 3C was floated, but has been discarded by now.
So Kaveri will be only 2 modules, 2 FlexFPUs, and the rate is half of what you claim on the CPU side... unless there are 2 FlexFPUs per module LOL ... otherwise you have to re-evaluate your fantasy.
If the 4C means "4 cores" like in Integer Cluster/Cores, then you are more F up than i imagine, those Integer Cluster/Cores don't do any FP calculations, and that is exactly the strong point of the design.
Those vector/FP instructions are SIMD in nature: the same instruction can run several times with different data. 1 single core module could be enough to fill a 2 FMAC FPU (which again leads to the possibility of 2 FlexFPUs), but obviously it would tend to leave much performance on the floor; it would have to be implemented wisely. Mixing in Integer cores for the FP calculations is completely F up; GPUs don't have Integer cores per se, yet that doesn't mean you can't calculate FLOP rates.
And no, *SSE2 => 1 op; FMA4 => 2 op*... SSE as well as FMA4 are vector instructions; SSE instructions have 2 or 3 operands and FMA4 has 4 operands, and of those only very few correspond to an actual 1 Add + 1 Mul, or 2 Adds + 2 Muls in the case of FMA4; usually 1 or more operands are destination registers (mem ops).
And no, the *IMAGE* doesn't present anything new; that is exactly how the FlexFPU deals with 256bit instructions now, it uses both FMAC pipes of the FPU... the scheduler is/was unified since BDver1.
For FP256 as revealed, the issue port must be 256bit wide, and the addressing is 256bit wide in nature (not 128bit in halves) [EDIT : and AMD could do this because they have a "load buffer" before the FP pipes; the load buffer could be 256bit wide while the rest of the data paths remain 128bit, so as not to penalize clock ability]. For 128bit instructions on 256bit pipes, those could be packed beforehand, interacting with the scheduler *at runtime*, as in the "Uops fusion" schemes of either Intel or the AMD K10 ALU+AGU, or *at compile time*, so that the code is transformed into AVX 256 or FMA4 256 and there are no more 128bit vectors. If those are not packed or compiled for 256bit, a 128bit instruction can only be issued one at a time per pipe, wasting half the capacity of a 256bit pipe.
The same happens today for 64bit or 32bit FP ops on 128bit pipes, and since there isn't any *runtime* packing so far, that is why 8 FP ops per FlexFPU only if the code is compiled for 128bit FMA4 instructions, which is single precision.
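The width argument above boils down to how many elements of a given size fit through a port in one issue. A minimal Python sketch of that lane count follows; the helper name `lanes` is hypothetical, and this only illustrates the arithmetic under discussion, not how any real scheduler or FMAC pipe actually behaves.

```python
def lanes(port_bits, element_bits):
    """How many elements of element_bits fit in a single issue of port_bits."""
    return port_bits // element_bits

# Port widths and element sizes mentioned in the thread.
for port_bits in (128, 256):
    for element_bits in (32, 64):
        n = lanes(port_bits, element_bits)
        print(f"{port_bits}bit port, {element_bits}bit elements: {n} lanes")
```

With a 128bit port this gives 4 single precision lanes but only 2 double precision lanes, which is why the precision matters so much for the per-cycle figures quoted by both sides.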
juanrga :
hcl123 :
juanrga :
Wrong again. The above GFLOP are the numbers claimed by Intel for the CPU. The 448 GFLOP of haswell i7 CPU are reported in several sites.
Intel likes to use DP in its technical datasheets. If you want to obtain DP values you only need to divide the above SP numbers per 2. E.g. the 224 GFLOP (SP) correspond to 112 GFLOP (DP).
You are indeed confusing and confused... you must tell from those numbers what is due to the GPU and what is due to the CPU cores, what corresponds to "single precision" and "double precision", and what kind of vectors they use ... 244GFLOPS seems a little low even for a GT2, but must be single precision, since the GPU barely moves on "double precision"... but if there is magic and it is only CPU side, just "present the math please" that leads to those numbers.... Flops is = number of Ops x frequency
( no way in hell could haisfail have 244GFLOPS from the CPU cores side with only 3.6Ghz... gzz.. what propaganda do to ppl heads lol (edt))
And this proves that you don't read my posts. As quoted above by yourself, I am saying that those GFLOPs are for the CPU alone. I am also saying, in the same quote, which figures are double precision (DP) and which are single precision (SP).
Haswell is 3.5GHz, not 3.6GHz; besides that mistake, the above GFLOP values are given officially by Intel. It is very easy to obtain them:
http://www.realworldtech.com/wp-content/uploads/2012/10/haswell-3.png?b22ba0 IMAGE
Sandy Bridge doubled Nehalem's FP capabilities with double-wide units. Ivy Bridge maintains the same architecture. From the diagram:
- ( 8SP x 1MUL ) + ( 8SP x 1ADD ) = 16 FLOP
- i7-3770k performance is 4C x 16 FLOPs x 3.5GHz = 224 GFLOP (SP). For DP the value is one-half: 112 GFLOP (DP). This is the value claimed by Intel officially in their technical datasheets.
As observed in the above diagram Haswell introduces FMA support. Therefore:
- ( 16SP x 1FMA ) = 32 FLOP
- i7-4770k performance is 4C x 32 FLOPs x 3.5GHz = 448 GFLOP (SP). For DP the value is one-half: 224 GFLOP (DP). Again this is the value claimed by Intel officially.
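For anyone who wants to check the CPU figures above, here is a small Python sketch of the same calculation. The helper name `cpu_peak_gflops` is made up for illustration; the inputs (4 cores, 16 or 32 SP FLOPs per cycle, 3.5 GHz, DP taken as half of SP) are exactly the ones stated in this post and describe theoretical peaks only.

```python
def cpu_peak_gflops(cores, sp_flops_per_cycle, ghz):
    """Return (SP, DP) theoretical peak in GFLOP, with DP taken as half of SP."""
    sp = cores * sp_flops_per_cycle * ghz
    return sp, sp / 2

# i7-3770k as described: 8 SP MUL + 8 SP ADD = 16 FLOPs per cycle per core.
ivy_sp, ivy_dp = cpu_peak_gflops(cores=4, sp_flops_per_cycle=16, ghz=3.5)

# i7-4770k as described: 16 SP lanes x 1 FMA (2 FLOPs) = 32 FLOPs per cycle per core.
has_sp, has_dp = cpu_peak_gflops(cores=4, sp_flops_per_cycle=32, ghz=3.5)

print(f"i7-3770k: {ivy_sp:.0f} GFLOP (SP), {ivy_dp:.0f} GFLOP (DP)")
print(f"i7-4770k: {has_sp:.0f} GFLOP (SP), {has_dp:.0f} GFLOP (DP)")
```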
Now the total performance (CPU + GPU):
- i7-3770k: 224 + 294 = 518 GFLOP. This is the value claimed by Intel officially.
- i7-4770k: 448 + 400 = 848 GFLOP. This is the value claimed by Intel officially.
- Kaveri A10 APU: 922 + 128 = 1050 GFLOP. This is the value claimed by AMD officially.
Now please, stop ignoring what has been said. Stop ignoring what both Intel and AMD officially claim about their products, and stop fantasizing about imaginary Steamroller modules that exist only in your head.
What a confusion lol ... And because is (could be) 1050 GFLOPS, it doesn't mean is divided like you say... it could be only GPU (more likely), and single precision... how can one not ignore when you don't really know nothing concrete !? lol
And RWT is wrong, they went along to be part of the propaganda machine.
In that *IMAGE*, Port 0, Port 1 and Port 5 are 128bit wide... just tell me how you fit 8 SP (8x 32bit ops) through a 128bit port !!?? ... they are counting only on the SIMD nature of those instructions: once issued they can run several times. But even this is quite "borged", because even RWT, with a little effort (LOL), mentions that loading 256bit data through 128bit data paths, employing a single L/S engine for the purpose, can sometimes take up to 5 cycles for some instructions. big LOL...
No.. in spite of all the flash and the potential of those exec clusters, the reality is that the 256bit rate on Intel is ~ the same as on BD, which uses halves and 2 L/S engines. Intel is dropping more potential performance on the floor than AMD, because they have 3 exec pipes (3 issue ports) for FP calculation while AMD only has 2.
That is reality, and a point in favor of the AMD design of separated FPUs and modules; the rest is propaganda. The Intel design on the CPU side is not really prone to "simple math" for FLOP calculations, no matter how many issues are forgotten and *theoretical* peak flop rates are pulled out of the arse. And worse, they don't have FMA4, which makes those *theoretical* numbers very hard to swallow, especially concerning single precision.
Worse, the GPU side is not simple to extrapolate either.
The 448 GFLOP bigLOL... if GPU corresponds to 20 EU, 80MADs
From
http://translate.googleusercontent.com/translate_c?depth=1&hl=pt-BR&rurl=translate.google.com&sandbox=0&sl=ja&tl=en&u=http://pc.watch.impress.co.jp/docs/column/kaigai/20130602_601851.html&usg=ALkJrhhPzMDgw2L2K6-JhljRXPUDnNtkBA
Chart images of different GPUs
http://translate.googleusercontent.com/translate_c?depth=1&hl=pt-BR&rurl=translate.google.com&sandbox=0&sl=ja&tl=en&u=http://pc.watch.impress.co.jp/img/pcw/docs/601/851/html/20.jpg.html&usg=ALkJrhiciI2SqQZM48SAolYQbJi2E0FtbQ
Now if 80 MADs correspond to 160 ops at 1.2GHz, that means 160 x 1.2 = 192 GFLOPS, meaning the CPU cores would have to have 256 GFLOPS LOL, or 64 GFLOPS per core, double that of AMD, and that is not what the benchmarks tell...
If 448 GFLOPS is Iris... then it is double (160 MADs), or 384 GFLOPS for the GPU alone and 64 GFLOPS for the CPU (more like reality)... but then that is not the 4770k, is it ? ... too much confusion in all of Intel; I gave up long ago. You can entertain yourself with these completely futile, worthless, pointless academic exercises.
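The two scenarios in this post (80 MADs versus 160 MADs at 1.2 GHz, against a 448 GFLOPS total) can be worked through with a short Python sketch. The helper name `split_total` is hypothetical, counting one MAD as 2 FLOPs is the convention the post itself uses, and treating 448 as a combined CPU + GPU figure is this poster's reading rather than an established fact.

```python
def split_total(total_gflops, mads, ghz, cores=4):
    """Split a claimed total into GPU, CPU, and per-core CPU GFLOPS.

    Assumes one MAD counts as 2 FLOPs per cycle, as in the post.
    """
    gpu = mads * 2 * ghz
    cpu = total_gflops - gpu
    return gpu, cpu, cpu / cores

for mads in (80, 160):
    gpu, cpu, per_core = split_total(448.0, mads, 1.2)
    print(f"{mads} MADs: GPU {gpu:.0f}, CPU {cpu:.0f}, per core {per_core:.0f} GFLOPS")
```

With 80 MADs the remainder is 256 GFLOPS for the CPU (64 per core); with 160 MADs it is 64 GFLOPS for the whole CPU, matching the numbers written out above.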