News: FFmpeg devs boast of up to 94x performance boost after implementing handwritten AVX-512 assembly code


bit_user

You would typically see these types of instructions repeated a few times in an unrolled loop scenario:
vfmadd231ps zmm16, zmm0, real4 bcst [rsi]
vfmadd231ps zmm17, zmm0, real4 bcst [rsi+64]
vfmadd231ps zmm18, zmm0, real4 bcst [rsi+2*64]
vfmadd231ps zmm19, zmm0, real4 bcst [rsi+3*64]
Did you mix up the first and second operands? It seems like you intended zmm0 to be your accumulator, but the docs say the first operand is the accumulator. Also, I'm puzzled by your address arithmetic on the 3rd operand: it seems like you're treating it as a 512-bit vector, yet you're only loading & broadcasting the first element of each. Your example would make more sense to me if it looked like this:

Code:
vfmadd231ps zmm0, zmm16, real4 bcst [rsi+4*0]
vfmadd231ps zmm0, zmm17, real4 bcst [rsi+4*1]   
vfmadd231ps zmm0, zmm18, real4 bcst [rsi+4*2]  
vfmadd231ps zmm0, zmm19, real4 bcst [rsi+4*3]

So, each instruction loads (and broadcasts) a different scalar fp32 weight, multiplies it by all 16 elements of one of the vectors, and accumulates the result in zmm0.

That's based on the instruction's description in the docs (i.e. the first operand is the accumulator that the product gets added to).
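
In plain C terms (my own paraphrase, just to spell out the math those four FMAs perform), the sequence works out to something like this:
Code:
/* Scalar equivalent of the 4 broadcast FMAs above: vals is 4 vectors of 16
 * floats, weights is 4 scalar fp32 values, acc accumulates 16 partial sums. */
void fma4_scalar(float acc[16], const float vals[4][16], const float weights[4])
{
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 16; i++)
            acc[i] += weights[j] * vals[j][i];
}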

I actually got clang-19.1 to do it! You can see it here:

For those who'd rather not follow the link, here's the source and resulting asm. Note that it uses Intel syntax, which has a different decorator (i.e. {1to16}) for bcst:
Code:
#include <immintrin.h>

__m512 f(__m512 *vals, float *weights)
{
    // Broadcast each scalar weight across all 16 lanes of a 512-bit vector.
    __m512 w0 = _mm512_set1_ps(weights[0]);
    __m512 w1 = _mm512_set1_ps(weights[1]);
    __m512 w2 = _mm512_set1_ps(weights[2]);
    __m512 w3 = _mm512_set1_ps(weights[3]);

    // Multiply each vector of values by its weight and accumulate via FMA.
    __m512 acc = _mm512_set1_ps(0.0f);
    acc = _mm512_fmadd_ps(vals[0], w0, acc);
    acc = _mm512_fmadd_ps(vals[1], w1, acc);
    acc = _mm512_fmadd_ps(vals[2], w2, acc);
    acc = _mm512_fmadd_ps(vals[3], w3, acc);
    return acc;
}

Code:
vxorps  xmm1, xmm1, xmm1
vmovaps zmm0, zmmword ptr [rdi]
vmovaps zmm2, zmmword ptr [rdi + 64]
vmovaps zmm3, zmmword ptr [rdi + 128]
vmovaps zmm4, zmmword ptr [rdi + 192]
vfmadd132ps zmm0, zmm1, dword ptr [rsi]{1to16}
vfmadd231ps zmm0, zmm2, dword ptr [rsi + 4]{1to16}
vfmadd231ps zmm0, zmm3, dword ptr [rsi + 8]{1to16}
vfmadd231ps zmm0, zmm4, dword ptr [rsi + 12]{1to16}
ret

If you follow the link, you can also see how GCC handles it: it pretty much directly translates the intrinsics and doesn't use the broadcast mode. Both versions work out to 9 instructions (excluding the ret), so I can't readily say GCC's is worse. The point of the exercise was just to see whether either of these compilers knows how to fuse the broadcast loads into the FMA instructions, which Clang clearly does!
: )
 

bit_user

Can someone with a twitter/X account please tell me exactly what's being presented here? I wanted to dig a bit deeper into this, but it seems the code hasn't actually landed in their main git repo. I even looked through the ffmpeg developers' mailing list to try to find some discussion of these optimizations, but nothing jumped out when searching the email subjects over the past year. I also checked the newsfeed on the official ffmpeg website, but found no mention of this or of any recent presentations/conferences.

What is this presentation from (conference, etc.)? Who was the presenter? Is the code still being qualified as "experimental"?

Thanks in advance.
 

bit_user

It turns out that this isn't even ffmpeg/libavcodec! It's actually a software AV1 decoder called DAV1D!

I still have no idea why ffmpeg was tweeting about it.

Anyway, to build the tests, run meson with the -Dtrim_dsp=false option.

Then, the command line shown in the slide works as advertised. I ran this on an Alder Lake-N (E-cores only), and here's what I got (blank lines inserted for clarity):
Code:
mc_8tap_regular_w64_0_8bpc_c:         332.3 ( 1.00x)
mc_8tap_regular_w64_0_8bpc_ssse3:      87.0 ( 3.82x)
mc_8tap_regular_w64_0_8bpc_avx2:       84.4 ( 3.94x)

mc_8tap_regular_w64_h_8bpc_c:        7703.9 ( 1.00x)
mc_8tap_regular_w64_h_8bpc_ssse3:     985.0 ( 7.82x)
mc_8tap_regular_w64_h_8bpc_avx2:      985.6 ( 7.82x)

mc_8tap_regular_w64_hv_8bpc_c:      16161.9 ( 1.00x)
mc_8tap_regular_w64_hv_8bpc_ssse3:   2858.8 ( 5.65x)
mc_8tap_regular_w64_hv_8bpc_avx2:    4201.2 ( 3.85x)

mc_8tap_regular_w64_v_8bpc_c:        7711.1 ( 1.00x)
mc_8tap_regular_w64_v_8bpc_ssse3:     972.9 ( 7.93x)
mc_8tap_regular_w64_v_8bpc_avx2:     1219.9 ( 6.32x)

Compiler version is gcc-13.2.0 and no special build options were used. I don't yet know what kind of units those are, but here's how my machine compares to the ones in the slide:

Method    c      ssse3   avx2
0         0.72   1.11    0.65
h         2.47   0.67    0.35
hv        2.32   0.62    0.33
v         2.68   0.53    0.25

First, it seems pretty clear that my CPU is about 70% as fast as whatever they tested it on, between clock speed and IPC. That makes sense, since my machine is limited to just 3.6 GHz. So, they could easily be using an Ice Lake laptop or server CPU.

In light of that, the speedups seen on the C version of the h, hv, and v methods would suggest that they either optimized the generic C code or were using a debug build. As for why my AVX2 numbers are off by another factor of 2, I believe that's because Gracemont implements AVX2 the same way Zen 1 did: by executing most 256-bit instructions as two 128-bit halves.
 

bit_user

Okay, I just rebuilt with the -Dbuildtype=debug option and here's what I got:
Code:
mc_8tap_regular_w64_0_8bpc_c:         377.8 ( 1.00x)
mc_8tap_regular_w64_0_8bpc_ssse3:      85.6 ( 4.41x)
mc_8tap_regular_w64_0_8bpc_avx2:       86.6 ( 4.36x)

mc_8tap_regular_w64_h_8bpc_c:       63750.4 ( 1.00x)
mc_8tap_regular_w64_h_8bpc_ssse3:     989.5 (64.43x)
mc_8tap_regular_w64_h_8bpc_avx2:      963.3 (66.18x)

mc_8tap_regular_w64_hv_8bpc_c:     132461.5 ( 1.00x)
mc_8tap_regular_w64_hv_8bpc_ssse3:   2857.7 (46.35x)
mc_8tap_regular_w64_hv_8bpc_avx2:    4181.8 (31.68x)

mc_8tap_regular_w64_v_8bpc_c:       72946.5 ( 1.00x)
mc_8tap_regular_w64_v_8bpc_ssse3:     957.6 (76.18x)
mc_8tap_regular_w64_v_8bpc_avx2:     1218.1 (59.89x)

If we leave aside the fact that AVX2 performance on my CPU is poor, we can see speedup ratios between the C and SSSE3 versions that much more closely match what's shown in their screen shot.

Therefore, I suspect we've been had. Either someone has optimized the C versions since their slide was created (which is not remotely supported by the commit log), or they compared against a debug/unoptimized build of the C code. Now, if you're a developer who's spent a while optimizing some code, you should have a pretty good idea of how the plain C version performs, and it should be pretty obvious that the data shown in that screen grab is too good to be true. So, I suspect this was an intentional fudge (one with plausible deniability), done to cast their optimizations in an artificially better light.
 
Jun 5, 2024
Therefore, I suspect we've been had. Either someone has optimized the C versions since their slide was created (which is not remotely supported by the commit log), or they compared against a debug/unoptimized build of the C code.
That might've been their attempt to compile a non-SIMD version since the compiler will make some attempt to use SIMD on its own.

Of course, SIMD instruction sets can be disabled explicitly through compiler flags, while still producing an otherwise optimized build.

But even if that comparison had been done properly, it would still be misleading/dishonest.

More meaningful would be to compile their original implementation with compiler optimizations including "auto vectorization", etc.

Compare the best that the compiler can do against their hand-coded assembly. And of course that difference would be much less than 94x.
 

bit_user

That might've been their attempt to compile a non-SIMD version since the compiler will make some attempt to use SIMD on its own.

Of course, SIMD instruction sets can be disabled explicitly through compiler flags, while still producing an otherwise optimized build.

But even if that comparison had been done properly, it would still be misleading/dishonest.
Let's back up and consider what we know.

The tweet features a photograph of what appears to be a slide from a presentation. The slide then shows the output of a testing front end, following a couple of bullet points which seem to discuss how the assembly language implementations are tested. The slide itself doesn't really seem to focus on the performance claims.


Tweet: https://x.com/FFmpeg/status/1852542388851601913

Furthermore, we now know the software project being discussed was DAV1D, whereas the tweet was made from the ffmpeg account. So, there's potentially an innocent explanation: that the slide was prepared by someone using a debug build, which was perhaps appropriate for their purposes. But the performance numbers were noticed, taken out of context by an audience member, and tweeted by them or one of their associates.

More meaningful would be to compile their original implementation with compiler optimizations including "auto vectorization", etc.

Compare the best that the compiler can do against their hand-coded assembly. And of course that difference would be much less than 94x.
The default C compiler options on Ubuntu 24.04 are (excluding any that don't seem relevant to code generation):

-DNDEBUG -std=c99 -O3 -fomit-frame-pointer -ffast-math -fPIC

When meson is run with -Dbuildtype=debug, the C compiler options change to:

-std=c99 -O0 -g -fPIC

The architecture targeted is just base x86-64, which limits vector instruction usage to just SSE2. No options are specified that would influence autovectorization, other than -ffast-math, which notably implies -funsafe-math-optimizations. So, it does seem like they allow the C compiler to do what it can with just basic SSE2.

The bigger issue holding it back is that the generic C code probably isn't structured in a way that's very friendly towards autovectorization. Furthermore, their entire tree contains no instances of the restrict keyword, which is sometimes necessary to convince the compiler that subsequent statements are truly independent (and thus very relevant to autovectorization). AFAICT, they also don't use any compiler hints that would tell the compiler which loops will tend to have lots of iterations, which conditional branches represent unlikely error conditions (and can therefore be pessimized), etc.
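
To illustrate what I mean, here's a hypothetical 8-tap horizontal filter (my own sketch, not dav1d's actual code) written in a way that gives the autovectorizer a fighting chance. The restrict qualifiers are the key part: they tell the compiler that dst can't alias src or coef, so the iterations over x are independent:
Code:
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not taken from dav1d's source: an 8-tap horizontal
 * filter written to be autovectorization-friendly. The restrict qualifiers
 * promise the compiler that dst never aliases src or coef, so the iterations
 * over x are independent and can be mapped onto SIMD lanes. */
void filter8_h(int16_t *restrict dst, const uint8_t *restrict src,
               const int8_t *restrict coef, size_t w)
{
    for (size_t x = 0; x < w; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += src[x + k] * coef[k];
        dst[x] = (int16_t)sum;
    }
}

Without the restrict qualifiers, the compiler generally has to assume dst might overlap src, and will either emit scalar code or guard the vectorized loop with runtime overlap checks.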