FFmpeg devs boast of up to 94x performance boost after implementing handwritten AVX-512 assembly code

Nov 25, 2023
Hand optimization has always created faster-executing code than relying on the compiler/assembler.
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
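To illustrate the difference (a minimal sketch with a hypothetical saturating-add kernel, not anything from FFmpeg): the intrinsics version pins down the instruction choice, while register allocation and scheduling are still left to the compiler.

Code:
#include <immintrin.h>  // AVX2 intrinsics; compile with e.g. -O3 -mavx2
#include <stddef.h>
#include <stdint.h>

// Plain C version: a modern compiler will often auto-vectorize this loop.
void add_sat_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}

// Intrinsics version: explicitly requests vpaddusb, 32 pixels per iteration.
void add_sat_avx2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_adds_epu8(va, vb));
    }
    for (; i < n; i++) {  // scalar tail for the last n % 32 pixels
        unsigned s = a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}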
 
  • Like
Reactions: SirStephenH
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
Given that the basics are no longer taught to prospective programmers, you are correct. However, I have yet to find any compiler/assembler that can rival an experienced assembly language programmer. Running a compiler/assembler against already-optimized code is a bad way to do things.
 

JamesJones44

Reputable
Jan 22, 2021
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.
In performance-critical situations it's still often used, even today. Games even use it for critical-path situations.

Should it be the go-to solution? Hell no, but when performance matters and an area is slow, dropping into ASM can yield impressive results.
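For what it's worth, "dropping into ASM" nowadays usually means a few lines of inline assembly for the one sequence you insist on, not whole hand-written files. A sketch (GNU extended asm; illustrative only, not from any real codebase):

Code:
#include <stdint.h>

// Serialized timestamp read: lfence orders prior loads, then rdtsc
// returns the cycle counter in EDX:EAX.
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\trdtsc"
                     : "=a"(lo), "=d"(hi)  // outputs: EAX -> lo, EDX -> hi
                     :
                     : "memory");
    return ((uint64_t)hi << 32) | lo;
}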
 
  • Like
Reactions: NinoPino

atmapuri

Distinguished
Sep 26, 2011
And ffmpeg developers were looking down on AVX-512 and pretending it was not there as long as it was an Intel-only feature for the last 10 years. Now that it is an AMD-only feature, it is OK.
 
  • Like
Reactions: SirStephenH

ekio

Reputable
Mar 24, 2021
Nice, but x86 assembly means that it only targets the dead x86 architecture.
Since these are dead-ISA optimizations, it's like making an optimization for an obsolete car engine that was engineered in the '60s and is phasing out anyway.
 

NinoPino

Respectable
May 26, 2022
Wrong. This hasn't been the case in at least ~20 years.
I suppose @ex_bubblehead meant that handwritten assembly code always has better, or at worst equal, performance compared to compiled code.

There are very few places left where manual assembly code is still desirable.
Agreed, but there are probably a lot more places that can benefit from hand optimization than we usually think.
In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.
Why should handwritten assembly do such exotic things?

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
Why?
There is nothing better than writing specific parts in pure assembly.
 

bit_user

Titan
Ambassador
The article said:
This group of developers tried to implement a handwritten AVX512 assembly code path, something that has rarely been done before, at least not in the video industry.
This is nonsense. I know x264 likes to use hand-coded assembly language for various architecture-specific optimizations, and that's been true for at least a decade and a half. Another example: libjpeg got hand-optimized for MMX, more than 20 years ago.

More specifically, ffmpeg has long done this. Simply looking at their git repo shows x86 asm files as much as 9 years old!

The article said:
AVX-512 enables processing large chunks of data in parallel using 512-bit registers, which can handle up to 16 single-precision FLOPS or 8 double-precision FLOPS in one operation.
Most video compression code doesn't even use floating point. Instead, it's mostly packed 8-bit and 16-bit ints, which AVX-512 can also handle with the corresponding density increases.
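To put a number on that density: with AVX-512BW, one instruction covers 64 packed bytes. A trivial illustration (hypothetical helper, nothing from ffmpeg):

Code:
#include <immintrin.h>  // requires AVX-512BW; compile with e.g. -mavx512bw

// One vpaddusb (zmm form) saturating-adds 64 packed 8-bit pixels at once.
__m512i add_sat_64px(__m512i a, __m512i b)
{
    return _mm512_adds_epu8(a, b);
}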

The article said:
On the other hand, AMD's Ryzen 9000-series CPUs feature a fully-enabled AVX-512 FPU so the owners of these processors can take advantage of the FFmpeg achievement.
Ryzen 7000 CPUs also demonstrated quite some benefit from AVX-512 code, even though execution is (mostly) split into two portions of 256-bits. AVX-512 has features such as more registers and per-lane predication, which improve code efficiency, even on 256-bit operands.
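A quick sketch of what per-lane predication buys even on 256-bit operands (AVX-512VL; illustrative function, not ffmpeg code): the loop tail is handled with a lane mask instead of a separate scalar epilogue.

Code:
#include <immintrin.h>  // requires AVX-512F+VL; compile with e.g. -mavx512vl
#include <stddef.h>
#include <stdint.h>

void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        // On the final iteration, mask off lanes past the end of the arrays.
        __mmask8 m = (n - i >= 8) ? (__mmask8)0xFF
                                  : (__mmask8)((1u << (n - i)) - 1);
        __m256i va = _mm256_maskz_loadu_epi32(m, a + i);
        __m256i vb = _mm256_maskz_loadu_epi32(m, b + i);
        _mm256_mask_storeu_epi32(dst + i, m, _mm256_add_epi32(va, vb));
    }
}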

The article said:
highlighting the efficiency of hand-optimized assembly code for AVX-512.
I think it's a bit silly, to be honest. Modern compilers are usually pretty good at dealing with C intrinsics, if you give them some hints & sometimes other help. I wonder if they even seriously tried just using intrinsics. Although the function names are a bit hairy, such C code is a lot more readable and maintainable than assembly language, particularly since you don't have to worry about allocating registers by hand.

The article said:
due to the complexity and specialized nature of AVX-512, such optimizations are typically reserved for performance-critical applications and require expertise in low-level programming and processor microarchitecture.
In general, modern compilers will even do a decent job on auto-vectorization. Usually, what holds them back are requirements of strict language conformance, which you can loosen with some compiler options and sometimes a few code tweaks.
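As a small example of the "strict conformance" point (illustrative, not ffmpeg code): aliasing rules are a common blocker, and a restrict qualifier is often all the hint the compiler needs.

Code:
#include <stddef.h>

// Without __restrict, the compiler must allow for dst overlapping src, so it
// either emits runtime overlap checks or refuses to vectorize. With it,
// gcc/clang vectorize this cleanly at -O3.
void scale(float *__restrict dst, const float *__restrict src,
           float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

// Floating-point reductions are similar: strict evaluation order blocks
// vectorization unless reassociation is allowed (e.g. -fassociative-math
// or -ffast-math).
float sum(const float *__restrict v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}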

Oh and you definitely don't need expertise in processor microarchitecture. It sometimes helps, but modern CPUs shoulder a lot of the burden of dealing with things like sub-optimal instruction ordering. The main thing you should really understand is how CPU caches work.
 
  • Like
Reactions: P.Amini

Thunder64

Distinguished
Mar 8, 2016
And ffmpeg developers were looking down on AVX-512 and pretending it was not there as long as it was an Intel-only feature for the last 10 years. Now that it is an AMD-only feature, it is OK.

Lol at this conspiracy theory.

Nice, but x86 assembly means that it only targets the dead x86 architecture.
Since these are dead-ISA optimizations, it's like making an optimization for an obsolete car engine that was engineered in the '60s and is phasing out anyway.

x86 isn't going anywhere anytime soon. Despite what the ARMada or RISC-V people have been saying for years already.
 

bit_user

Titan
Ambassador
Hand optimization has always created faster-executing code than relying on the compiler/assembler.
I once programmed a VLIW DSP that was practically impossible to code for in hand-written assembly language, due to all of the scheduling constraints, register bank hazards, etc. that you'd have to keep track of. Messing up could result in program bugs or even locking the CPU (IIRC). It also had 128 general-purpose registers, which would be a heck of a lot to try and allocate manually. In highly-unrolled loops, you did actually need that many.

Instead, what I did was to write C code with intrinsics. Then, I'd look at the assembly language generated by the compiler and see how close it looked to what I expected. While there was any gap, I'd revisit the C code and do some more tweaking to try and resolve whatever was tripping up the compiler. In the end, I think the code came extremely close to the theoretical performance limits of the CPU, just based on the density of actual arithmetic operations. Granted, these were DSP-type operations, like convolutions and other fairly tight loops.

This was about 25 years ago, and compilers were already that good. The main thing people miss is how to express their intent in a way that's not overspecified and doesn't otherwise trip up compiler optimizations. Here's where detailed knowledge of the language and a bit of compiler knowledge is even more valuable than detailed knowledge of a particular microarchitecture.

Also, like I mentioned, it's worth really understanding how CPU caches work. Most programmers only have some kind of vague or idealized sense of how they think caches should work, if that.
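The classic demonstration, for anyone who hasn't seen it (illustrative): the two functions below do identical arithmetic, but traversal order decides whether each fetched cache line is fully used or touched for a single element.

Code:
enum { N = 1024 };
static float a[N][N], b[N][N];

// Column-wise walk over row-major arrays: each access strides N*sizeof(float)
// bytes, so a fresh cache line is pulled in for every single element.
void sum_colwise(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += b[i][j];
}

// Row-wise walk: unit stride, every byte of each fetched line gets used.
// Same arithmetic, often several times faster on large arrays.
void sum_rowwise(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += b[i][j];
}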

Given that the basics are no longer taught to prospective programmers, you are correct. However, I have yet to find any compiler/assembler that can rival an experienced assembly language programmer.
I knew a guy who worked in the compiler group at SGI in the mid-1990s. He said they would treat it as a compiler bug if someone could hand-code assembly language that was faster than the code generated by their optimizing compiler.

I also worked with a former DEC employee who said the hand-optimized MMX/SSE code I wrote at the time was doing transformations their optimizing compiler could do with ease.
 

bit_user

Titan
Ambassador
Should it be the go-to solution? Hell no, but when performance matters and an area is slow, dropping into ASM can yield impressive results.
Nah, just use C intrinsics. In some cases, even those are overkill.

Programmers tend to think compilers are just dumb, when they see it doesn't generate good code. The dummy usually turns out to be the programmer, who doesn't know the language well enough to see why the compiler didn't do what they expected.

The benefit of learning this stuff is that it helps you write better C/C++ code. There's never enough time to hand-optimize everything, but if you just learn how to make better use of the language & tools, it helps in far more cases than you could ever hand-tune.

Why should handwritten assembly do such exotic things?

There is nothing better than writing specific parts in pure assembly.
Modern compilers tend to have detailed knowledge of CPU microarchitectures and use techniques like resource tables to make various low-level optimization decisions. Even if you perfectly hand-tune assembly language for one specific generation of CPU microarchitecture, it could turn out that the next generation comes out and your instruction stream isn't optimal for it. Here's where writing C code is a win - because you can just recompile it for the newer microarchitecture, whereas the hand-written assembly language would have to be overhauled, sometimes even for minor changes in the CPU.

I'll give you a specific and relevant example, although it's only one class of code-generation complication. Zen 4 has a microcoded implementation of the vpcompressd instruction (probably to work around a hardware bug). It takes 142 cycles, which is vastly longer than expected. A good, up-to-date optimizing compiler should know that and avoid emitting such an instruction. If the programmer even knows that and avoided it because they were hand-tuning for Zen 4, then they would have to go back and rewrite that code if/when they want to optimize for a microarchitecture that implements it at the expected performance.
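For reference, here's the instruction in question as you'd reach it from C (a hypothetical "keep the nonzero elements" filter, purely illustrative; n is assumed to be a multiple of 16):

Code:
#include <immintrin.h>  // requires AVX-512F; compile with e.g. -mavx512f -mpopcnt
#include <stddef.h>
#include <stdint.h>

// Compress-stores the nonzero lanes of each 16-int block and returns how many
// were kept. The masked compress-store is the vpcompressd discussed above.
size_t keep_nonzero(int32_t *dst, const int32_t *src, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(src + i);
        __mmask16 m = _mm512_test_epi32_mask(v, v);          // lane != 0
        _mm512_mask_compressstoreu_epi32(dst + kept, m, v);  // vpcompressd
        kept += (size_t)_mm_popcnt_u32((unsigned)m);
    }
    return kept;
}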

There are more examples we could dig into, like the penalty for fault suppression, which takes 255 or 355 cycles on Zen 4, for reads and writes respectively. To do a good job of writing hand-optimized AVX-512, every coder would have to know this. Or, just teach the compiler about it and then they don't.
 

bit_user

Titan
Ambassador
And ffmpeg developers were looking down on AVX-512 and pretending it was not there as long as it was an Intel-only feature for the last 10 years. Now that it is an AMD-only feature, it is OK.
I think it's about market share. Intel only implemented it in server/workstation CPUs, a couple generations of laptops (Ice Lake & Tiger Lake) and Rocket Lake (a relatively unpopular desktop CPU that launched only half a year before Alder Lake), which most end users didn't have. Now that AMD implemented it in all their CPUs, many more client machines have it, and that's probably why it's interesting to the FFMPEG devs.

BTW, Ryzen 7000 launched over 2 years ago. So, even after AMD added AVX-512 support, it's not like they just put in the optimizations the very next day.

P.S. once Intel client CPUs ship with AVX10/256, it shouldn't be very hard to port the AVX-512 code to it. AVX10.1 is basically the same as AVX-512, except for a couple minor details (which are enough to make them incompatible) and limiting operand width to 256-bit.

Nice, but x86 assembly means that it only targets the dead x86 architecture.
Since these are dead-ISA optimizations, it's like making an optimization for an obsolete car engine that was engineered in the '60s and is phasing out anyway.
libavcodec (which sits at the core of ffmpeg) has architecture-specific optimizations for lots of different CPU ISAs. I think what's interesting to the devs is what CPUs people are actually using (plus, whatever optimizations someone is willing to pay for them to do).
 
  • Like
Reactions: P.Amini
Jun 5, 2024
I'm doubtful that hand-written assembly would've produced much better code than compiler intrinsics. Compiler intrinsics quite often have a 1-to-1 correspondence to SIMD instructions anyway. Where they don't, there are "reasons" (emulation, abstraction, or optimization).

Just as important for SIMD performance (besides the obvious step of actually using SIMD instructions) is organizing your data structures and code paths in a way that's suitable for SIMD. You're going to have to think about that whether you're using compiler intrinsics or hand-written assembly. And once you've done that work, compiler intrinsics and the optimizer can take it from there.
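A sketch of that data-organization point (a hypothetical particle update, nothing to do with ffmpeg): the arithmetic is identical, but only the SoA layout gives the unit-stride loads that SIMD wants.

Code:
#include <stddef.h>

struct ParticleAoS { float x, y, z, vx; };   // array-of-structures layout

void advance_aos(struct ParticleAoS *p, float dt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i].x += dt * p[i].vx;   // x fields sit 16 bytes apart: gathers/shuffles
}

struct ParticlesSoA { float *x, *vx; };      // structure-of-arrays layout

void advance_soa(struct ParticlesSoA *p, float dt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] += dt * p->vx[i]; // unit stride: maps directly onto SIMD lanes
}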

Intel also had AVX-512 in their 11th generation desktop CPUs, Rocket Lake, by the way. That's exactly why I'm still running the i7-11700K.
 

Kamen Rider Blade

Distinguished
Dec 2, 2013
And ffmpeg developers were looking down on AVX-512 and pretending it was not there as long as it was an Intel-only feature for the last 10 years. Now that it is an AMD-only feature, it is OK.
It was an Intel feature before it was an AMD feature; blame Intel's crap CPU architecture designers who made the stupid decision to disable it on the 12th/13th/14th gen while AMD was able to keep it going with Zen 4 & Zen 5.

If they didn't have the "Economy Cores" as part of the CPU cluster, then they would've had AVX-512 for the 12th/13th/14th gens. That failure is on Intel, nobody else. They shot themselves in the foot.

AMD just managed to implement AVX-512 in a better fashion by taking their time to do it right.

Nice, but x86 assembly means that it only targets the dead x86 architecture.
Since these are dead-ISA optimizations, it's like making an optimization for an obsolete car engine that was engineered in the '60s and is phasing out anyway.
The x86 architecture looks very alive & well with all the x86 computers in use around the world and in server-land.

For an obsolete ISA, we sure do get new CPUs very often, nearly every 2 years.

As much as you think ARM & RISC-V are going to take over the world, the most realistic long-term scenario is that RISC-V eats ARM's lunch and puts them in a world of hurt.

Then it'll become a RISC-V & x86 world where ARM is seen as "Old & Busted" vs the "New Hotness of RISC-V".

Meanwhile, the x86 ISA is aging like "fine wine", just like the famous 'Uptown Funk' song by Bruno Mars:
“This hit, that ice cold / Michelle Pfeiffer, that white gold.”

x86 is still going strong and will keep on going with new instruction extensions and features as time goes on.
 

bit_user

Titan
Ambassador
Also note that this is 94% faster than pure C code, but between -3% and 60% faster than the AVX2 code path. (See this slide.)
Thanks for the link. As noted by others, it's actually 94.78 times as fast as the C version, but your other statements are correct.

Yeah, it's worth taking a moment to study the slide and look at what it's actually telling us. There are 4 different methods benchmarked, with 4 different implementations of each. The performance ratio of each is compared with the generic C implementation, which I'll bet is written in the simplest possible way and completely devoid of typical optimizations.

I've reformatted the data here, with an additional two columns to show the speedup of AVX2 over SSSE3 and AVX-512 over AVX2:

Method | ssse3 vs. C | avx2 vs. C | avx512icl vs. C | avx2 vs. ssse3 | avx512icl vs. avx2
0      | 2.49        | 4.38       | 4.24            | 1.76           | 0.97
h      | 29.00       | 55.70      | 64.19           | 1.92           | 1.15
hv     | 21.23       | 27.46      | 43.99           | 1.29           | 1.60
v      | 40.29       | 67.61      | 94.78           | 1.68           | 1.40

Note that the least speedup is in the "0" case, where AVX-512 is only 4.24x as fast as the generic C version.

To know what the different methods actually mean, you'd have to study the code. The h, v, and hv seem to imply some sort of separable processing, with hv perhaps incorporating both the h and v passes. I'd hazard a guess that 0 is probably some sort of identity case, especially given how its times compare with the other methods.

Again, just pure speculation, but I wonder if the v case sped up so much only by avoiding repeated packing & unpacking of pixel values, or something silly like that. For there to be such a big speedup, it's got to be either that or cache thrashing. It's common to see vectorized code deliver an order-of-magnitude speedup, but two orders of magnitude suggests the two versions aren't algorithmically identical.
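For anyone wondering what "repeated packing & unpacking" looks like, a sketch (illustrative AVX2, not the actual filter): pixels get widened from 8 to 16 bits for arithmetic headroom and narrowed back afterwards, and doing that around every operation instead of once per pass adds up fast.

Code:
#include <immintrin.h>  // AVX2; compile with e.g. -mavx2

// Multiply 32 packed u8 pixels by 3 with saturation: unpack to u16, do the
// 16-bit math, pack back down. The unpack/pack pair is pure overhead if the
// next operation immediately unpacks again.
__m256i triple_pixels(__m256i px)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i lo = _mm256_unpacklo_epi8(px, zero);        // u8 -> u16 (low half of each lane)
    __m256i hi = _mm256_unpackhi_epi8(px, zero);        // u8 -> u16 (high half of each lane)
    lo = _mm256_mullo_epi16(lo, _mm256_set1_epi16(3));
    hi = _mm256_mullo_epi16(hi, _mm256_set1_epi16(3));
    return _mm256_packus_epi16(lo, hi);                 // u16 -> u8, saturating
}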
 
  • Like
Reactions: P.Amini and jp7189
Jun 5, 2024
Note that C => SSSE3 is where the big performance gain occurs.

From there, SSSE3 => AVX512 gives only another ~2x improvement.

Their results show that SIMD is much faster than no-SIMD for their application.

But their performance margin between SSSE3 and AVX512? That's "meh" and basically guaranteed from the now 4x wider registers.
 

setx

Distinguished
Dec 10, 2014
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.
It's still completely true when it comes to SIMD instructions, especially exotic ones. Convincing a high-level language compiler to emit what you want is very tiring and not guaranteed to succeed at all.

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
First of all, intrinsics are basically the same assembly, but very ugly. As for scheduling, not only is it less important on modern superscalars, compilers also make a lot of mistakes: it's not that rare for Clang to produce slower code for Zen4 with Zen4 tuning compared to Zen3 tuning (both AVX2-level).
 

bit_user

Titan
Ambassador
It's still completely true when it comes to SIMD instructions, especially exotic ones. Convincing a high-level language compiler to emit what you want is very tiring and not guaranteed to succeed at all.
For exotic instructions, perhaps. For more generic ones, it's not as hard as you think. You can find various tips and tricks on how to achieve better autovectorization, if you do a little searching.

First of all, intrinsics are basically the same assembly, but very ugly.
I agree that they're ugly, but they usually just encode the instruction name and are therefore certainly no worse or less readable than assembly language. Importantly, they save you from having to do register allocation, which is a pain in the neck, especially when unrolling loops and trying to manage the 32 vector registers available with AVX-512.

In the past, I've wrapped intrinsics with inline functions that were more intuitively named and overloaded on the operand type. Makes for much more readable code that's also easier to modify in certain ways, like changing the vector size or element type.
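Something like this, from memory (a sketch of the style, not the original code): in C++ the wrappers can overload on the vector type, so one kernel source serves both widths.

Code:
#include <immintrin.h>  // compile with e.g. -mavx2

// Thin, readable wrappers over the raw intrinsic names.
static inline __m128i add_sat_u8(__m128i a, __m128i b) { return _mm_adds_epu8(a, b); }
static inline __m256i add_sat_u8(__m256i a, __m256i b) { return _mm256_adds_epu8(a, b); }
static inline __m128i avg_u8(__m128i a, __m128i b)     { return _mm_avg_epu8(a, b); }
static inline __m256i avg_u8(__m256i a, __m256i b)     { return _mm256_avg_epu8(a, b); }

// A kernel written against the wrappers compiles for either vector width.
template <typename Vec>
Vec blend_and_boost(Vec a, Vec b)
{
    return add_sat_u8(avg_u8(a, b), b);
}

// Explicit instantiations for the 128-bit (SSE2) and 256-bit (AVX2) paths.
template __m128i blend_and_boost<__m128i>(__m128i, __m128i);
template __m256i blend_and_boost<__m256i>(__m256i, __m256i);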

compilers also make a lot of mistakes: it's not that rare for Clang to produce slower code for Zen4 with Zen4 tuning compared to Zen3 tuning (both AVX2-level).
Source? If you find such a case, you should attach a code snippet in a bug report.
 
  • Like
Reactions: Scraph