You would typically see these type of instructions repeated a few times in a unrolled loop scenario
vfmadd231ps zmm16, zmm0, real4 bcst [rsi]
vfmadd231ps zmm17, zmm0, real4 bcst [rsi+64]
vfmadd231ps zmm18, zmm0, real4 bcst [rsi+2*64]
vfmadd231ps zmm19, zmm0, real4 bcst [rsi+3*64]
Did you mix up the first and second operands? It seems like you intended zmm0 to be your accumulator, but the docs say the first operand is the accumulator. Also, I'm puzzled by your address arithmetic on the 3rd operand: the 64-byte stride suggests you're treating each as a 512-bit vector, yet you're only loading and broadcasting the first element of each. Your example would make more sense to me if it looked like this:
Code:
vfmadd231ps zmm0, zmm16, real4 bcst [rsi+4*0]
vfmadd231ps zmm0, zmm17, real4 bcst [rsi+4*1]
vfmadd231ps zmm0, zmm18, real4 bcst [rsi+4*2]
vfmadd231ps zmm0, zmm19, real4 bcst [rsi+4*3]
So, each is loading a different scalar fp32 weight and multiplying it by all of the elements, before accumulating the result in zmm0.
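To make that reading concrete, here is a rough scalar model in plain C of what the four broadcast FMAs above compute. It's only a sketch: the names `fma231_bcst` and `weighted_sum4` are made up for illustration, the arrays stand in for the zmm registers, and the separate multiply and add don't reproduce the single rounding of a real fused multiply-add.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* number of fp32 lanes in one 512-bit register */

/* Hypothetical scalar model of: vfmadd231ps acc, v, real4 bcst [w]
 * One float is read from memory, broadcast to every lane, multiplied by
 * the matching lane of v, and added into acc.
 * (A real FMA rounds once; this mul-then-add model may round twice.) */
void fma231_bcst(float acc[LANES], const float v[LANES], const float *w)
{
    assert(acc != NULL && v != NULL && w != NULL);
    for (size_t i = 0; i < LANES; ++i)
        acc[i] += v[i] * (*w);
}

/* The four-instruction sequence: vals + k*LANES stands in for zmm16+k,
 * weights + k for [rsi+4*k], and acc for zmm0. */
void weighted_sum4(float acc[LANES], const float *vals, const float *weights)
{
    for (size_t k = 0; k < 4; ++k)
        fma231_bcst(acc, vals + k * LANES, weights + k);
}
```

Note how the weight pointer advances by one float (4 bytes) per step, not by 64 bytes, since only a single element is read each time.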
That reading is based on this description:
I actually got clang-19.1 to do it! You can see it here:
For those who'd rather not follow the link, here's the source and resulting asm. Note that it uses Intel syntax, which has a different decorator (i.e. {1to16}) for bcst:
Code:
#include <immintrin.h>
__m512 f(__m512 *vals, float *weights)
{
    __m512 w0 = _mm512_set1_ps(weights[0]);
    __m512 w1 = _mm512_set1_ps(weights[1]);
    __m512 w2 = _mm512_set1_ps(weights[2]);
    __m512 w3 = _mm512_set1_ps(weights[3]);
    __m512 acc = _mm512_set1_ps(0.0f);
    acc = _mm512_fmadd_ps(vals[0], w0, acc);
    acc = _mm512_fmadd_ps(vals[1], w1, acc);
    acc = _mm512_fmadd_ps(vals[2], w2, acc);
    acc = _mm512_fmadd_ps(vals[3], w3, acc);
    return acc;
}
Code:
vxorps xmm1, xmm1, xmm1
vmovaps zmm0, zmmword ptr [rdi]
vmovaps zmm2, zmmword ptr [rdi + 64]
vmovaps zmm3, zmmword ptr [rdi + 128]
vmovaps zmm4, zmmword ptr [rdi + 192]
vfmadd132ps zmm0, zmm1, dword ptr [rsi]{1to16}
vfmadd231ps zmm0, zmm2, dword ptr [rsi + 4]{1to16}
vfmadd231ps zmm0, zmm3, dword ptr [rsi + 8]{1to16}
vfmadd231ps zmm0, zmm4, dword ptr [rsi + 12]{1to16}
ret
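In case the 132 form on the first FMA looks odd: the three digits just say which operands get multiplied and which one is added. Here's a one-lane scalar sketch of the three forms and of the sequence above. The function names are made up for illustration, and plain mul-then-add stands in for the fused operation, so rounding can differ from real hardware.

```c
#include <assert.h>
#include <stddef.h>

/* One-lane models of the three FMA operand orders.
 * op1 is the destination (and first source), op2 the second register,
 * op3 the register-or-memory operand (the broadcast weight here). */
float fmadd132(float op1, float op2, float op3) { return op1 * op3 + op2; }
float fmadd213(float op1, float op2, float op3) { return op2 * op1 + op3; }
float fmadd231(float op1, float op2, float op3) { return op2 * op3 + op1; }

/* Hypothetical one-lane walk through the Clang output:
 * v[k] is one lane of vals[k], w[k] is weights[k]. */
float clang_sequence_lane(const float v[4], const float w[4])
{
    assert(v != NULL && w != NULL);
    float zmm1 = 0.0f;                  /* vxorps xmm1, xmm1, xmm1      */
    float zmm0 = v[0];                  /* vmovaps zmm0, [rdi]          */
    zmm0 = fmadd132(zmm0, zmm1, w[0]);  /* zmm0 = zmm0 * w0 + zmm1      */
    zmm0 = fmadd231(zmm0, v[1], w[1]);  /* zmm0 += v1 * w1              */
    zmm0 = fmadd231(zmm0, v[2], w[2]);  /* zmm0 += v2 * w2              */
    zmm0 = fmadd231(zmm0, v[3], w[3]);  /* zmm0 += v3 * w3              */
    return zmm0;
}
```

So the 132 form lets the register that already holds vals[0] become the accumulator, with the zeroed zmm1 as the addend, and from there on the 231 form accumulates into zmm0 as usual.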
If you follow the link, you can also see how GCC handles it, which pretty much directly translates the intrinsics and doesn't involve the bcast mode. They both work out to 9 instructions (excluding ret), so I can't readily say it's worse. The point of the exercise was just to see if either of these compilers knows how to fuse the intrinsics, which Clang clearly does! :)