News FFmpeg devs boast of up to 94x performance boost after implementing handwritten AVX-512 assembly code


Opcod

Distinguished
May 12, 2006
All this stuff - C, Node.js, and other crap... Why do you think the norm is just to push an update at every comment, giving a commit? The worst is Android, then Linux, and now Windows too. Proper code runs much faster, as you have direct access to everything.
 

bit_user

Titan
Ambassador
While compiler heuristics can generate very different optimizations based on architecture, I've not seen different optimizations for specific processors in a very long time.
Here's the GCC commit adding -march=znver5 support (i.e. Zen 5 uarch-specific tuning):

Of particular interest are the cost table and scheduler:

GCC and Clang can give very different optimizations for a specific architecture though.

Simple example:
https://godbolt.org/z/q3cdMzzYs
They're completely different compilers. So, of course the generated code differs.
 

JamesJones44

Reputable
Jan 22, 2021
Here's the GCC commit adding -march=znver5 support (i.e. Zen 5 uarch-specific tuning):
I was thinking of general output optimizations for a specific CPU (Motorola 68000 vs. 68010, for example). I wasn't thinking of cost-based optimizations based on supported functions, but it's a very valid point.
 

jackt

Distinguished
Feb 19, 2011
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
<Mod Edit>. Software written in assembler will always be faster than C or any other language. If the developer is good, of course. @ET3D is right.
 

setx

Distinguished
Dec 10, 2014
For more generic ones, it's not as hard as you think.
It's not what I think; it's what I see with my code. It's pretty much impossible to write a single version of the code, without compiler-specific #ifdefs, that reliably vectorizes on at least Clang, MSVC, and the Intel compiler.

Source? If you find such a case, you should attach a code snippet in a bug report.
The code is from a project I have no rights to share.

Software written in assembler will always be faster than C or any other language. If the developer is good, of course.
That's not quite true. If you are hitting some hardware limit, like memory bandwidth or the divider pipeline, both versions will run at exactly the same speed, even if one of them uses slightly fewer instructions or registers.
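To illustrate (a toy sketch of my own, not from the post): for large arrays, a streaming loop like this saturates memory bandwidth whether the compiler emits scalar, SSE, or AVX-512 code, so the "better" instruction sequence wins nothing:

#include <cstddef>

// One load and one store per element; once n is far beyond the cache,
// DRAM bandwidth is the wall and wider vectors can't move it.
void scale_stream(double *out, const double *in, std::size_t n, double a)
{
    for (std::size_t i = 0; i < n; i++)
        out[i] = a * in[i];
}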
 

bit_user

Titan
Ambassador
It's not what I think; it's what I see with my code. It's pretty much impossible to write a single version of the code, without compiler-specific #ifdefs, that reliably vectorizes on at least Clang, MSVC, and the Intel compiler.
Like I said, you can find advice and guidance if you search for it. There's a lot of ground to cover, and obviously you should structure your code the same way you would if you vectorized it by hand.

Even then, I get that the compiler still might not do what you want, which is why I recommended C + intrinsics. I'm just saying that it's not always necessary to go that far. Sometimes, pure C can indeed vectorize quite well.
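For instance (a toy sketch of my own): a simple counted loop with unit-stride access, and with __restrict (a common compiler extension) ruling out aliasing, is exactly the shape that gcc -O3 or clang -O3 vectorizes reliably:

#include <cstddef>

// saxpy-style loop: unit stride, no aliasing, trip count known at runtime;
// modern compilers turn this into packed SIMD code without any intrinsics.
void saxpy(float *__restrict y, const float *__restrict x, float a, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}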

The code is from a project I have no rights to share.
I'm sure they don't want your whole project. What they would prefer is for people to reproduce the symptom in an isolated and abstract code snippet that you can attach to the bug report.

Optimizing compilers are a little bit of a black art. With virtually every optimization, there will be a few regressions. The better their test suite, the better they can tune their compiler optimizations to minimize the number and extent of such regressions.

I think it won't be long before AI starts creeping into compilers, helping decide which optimizations to apply where, and in what order. That would probably motivate me to get a machine with more AI inferencing horsepower.
 
Nov 4, 2024
Wrong. This hasn't been the case in at least ~20 years. There are very few places left where manual assembly code is still desirable.

In fact, it prevents the compiler from making most optimizations to anything using the assembly code, and it provides no latency information, essentially ruining scheduling pipelines.

In the majority of cases, code like this would likely be much more performant if it were written using intrinsics (or generic SIMD or a combination thereof) instead of assembly.
I'd like to know more about the latency information you mentioned. Would you point me to reference material please?
 

bit_user

Titan
Ambassador
I'd like to know more about the latency information you mentioned. Would you point me to reference material please?
I already posted a link to GCC's cost tables:

Perhaps someone could dig up the equivalent in LLVM, or maybe I'll do it if I get bored.

There are also folks who try to reverse engineer some of this information through microbenchmarking. Agner Fog is probably one of the most widely known providers of this information. I've looked at some of his stuff and he seems to know what he's talking about, but I can't personally endorse it. See the documents under # 4:
 
Nov 4, 2024
I already posted a link to GCC's cost tables:

Perhaps someone could dig up the equivalent in LLVM, or maybe I'll do it if I get bored.

There are also folks who try to reverse engineer some of this information through microbenchmarking. Agner Fog is probably one of the most widely known providers of this information. I've looked at some of his stuff and he seems to know what he's talking about, but I can't personally endorse it. See the documents under # 4:
Ok, I think I understand your point now. The compiler can reorder instructions using the timing information. You're saying that assembler prevents it from doing that?
 

bit_user

Titan
Ambassador
The compiler can reorder instructions using the timing information.
I think it's more that the compiler uses the cost table to help it decide which instructions to issue, when it has a choice. Perhaps the cost table has uses beyond that, but that's going beyond my knowledge of the subject.

You're saying that assembler prevents it from doing that?
Contiguous spans of assembly language aren't reordered or altered by the compiler at all. If we're talking about inline assembler, then I honestly don't know how much visibility these compilers have into it that might inform what other instructions the compiler generates; but certainly, if it's processed by an external tool like nasm and simply linked in, then the compiler doesn't touch it, doesn't see it, and knows nothing about it.
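As a hypothetical illustration of that opacity (GNU extended asm on x86; my sketch, not from the thread): the compiler sees only the constraint list, not the instruction itself, so it has no cost or latency data to schedule around:

// The template string "addl %2, %0" is opaque text to the optimizer; only
// the constraints tell it which registers flow in and out.
static inline unsigned add_asm(unsigned a, unsigned b)
{
    unsigned r;
    asm("addl %2, %0"          // computes r = a + b, but gcc can't reason about it
        : "=r"(r)              // output operand
        : "0"(a), "r"(b));     // "0" ties input a to the output register
    return r;
}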

BTW, I wasn't the one who said it. That was @AdelaideSimone, to whom you originally replied.
 

edzieba

Distinguished
Jul 13, 2016
Good compiler-produced assembly will be as fast or faster than hand-optimised assembly. Bad compiler-produced assembly (or compiler-produced assembly that does not know about some specific instruction set extension in the first place) can be slower than hand-optimised assembly.

The actual question is: Do you spend your time hand-optimising assembly for your particular function (for each architecture you intend to run on), or do you spend time fixing the compiler so it produces better assembly (for each architecture you intend to run on)?
 
Mar 10, 2020
In performance-critical situations it's still often used, even today. Games even use it for critical-path situations.

Should it be the go-to solution? Hell no, but when performance matters and an area is slow, dropping into ASM can yield impressive results.
Years ago, a colleague had a data-acquisition task to program for. He had a hard 100 ms window to grab the voltage, convert it to digital, process it, and store it. Using C, his tightly coded routine took 141 ms. The guru programmer was asked for help. He approved the code… it was like a high-end modernist Swedish home: minimalist, with no extraneous steps. So the guru turned to the compiler output. He looked at the intermediate code… that's redundant… that's redundant… don't need to check that… etc. After 15 minutes he submitted his modified code to the compiler to finish off, and when the resulting .exe was run, 65 ms was the result.

Ideally, a compiler would optimise to the nth degree, but this would arbitrarily disadvantage one of the processor manufacturers (due to differences in hardware implementation) unless two heavily optimised code paths were produced. Then there are the differences between models within generations and between generations. But hand-optimising or coding in assembly for a limited number of device types isn't economically feasible unless the performance gain is essential.
 
Nov 5, 2024
If we compare AVX512 to AVX2, the improvement is only between -3% and 60%. Even with a C implementation, a recent compiler shouldn't perform this poorly.
The kernels in the diagram use w64, which is a relatively large block, so having more AVX512 units should definitely provide a boost. However, for smaller blocks (which are also very common), the improvement is quite limited. When testing the implementation of this kernel for small blocks on Zen 5 with gcc 8, the difference between AVX2 and AVX512 was minimal:


checkasm: all 12 tests passed
mc_8tap_regular_w8_0_8bpc_c: 54.9 ( 1.00x)
mc_8tap_regular_w8_0_8bpc_ssse3: 7.1 ( 7.72x)
mc_8tap_regular_w8_0_8bpc_avx2: 7.2 ( 7.66x)
mc_8tap_regular_w8_0_8bpc_avx512icl: 7.2 ( 7.67x)
mc_8tap_regular_w8_h_8bpc_c: 629.6 ( 1.00x)
mc_8tap_regular_w8_h_8bpc_ssse3: 25.9 (24.33x)
mc_8tap_regular_w8_h_8bpc_avx2: 19.2 (32.83x)
mc_8tap_regular_w8_h_8bpc_avx512icl: 17.8 (35.33x)
mc_8tap_regular_w8_hv_8bpc_c: 1378.4 ( 1.00x)
mc_8tap_regular_w8_hv_8bpc_ssse3: 83.8 (16.45x)
mc_8tap_regular_w8_hv_8bpc_avx2: 62.2 (22.17x)
mc_8tap_regular_w8_hv_8bpc_avx512icl: 35.5 (38.86x)
mc_8tap_regular_w8_v_8bpc_c: 743.1 ( 1.00x)
mc_8tap_regular_w8_v_8bpc_ssse3: 22.3 (33.36x)
mc_8tap_regular_w8_v_8bpc_avx2: 17.0 (43.74x)
mc_8tap_regular_w8_v_8bpc_avx512icl: 17.0 (43.59x)
 
Good compiler-produced assembly will be as fast or faster than hand-optimised assembly. Bad compiler-produced assembly (or compiler-produced assembly that does not know about some specific instruction set extension in the first place) can be slower than hand-optimised assembly.

The actual question is: Do you spend your time hand-optimising assembly for your particular function (for each architecture you intend to run on), or do you spend time fixing the compiler so it produces better assembly (for each architecture you intend to run on)?
Neither.

Barring *very* specific use cases where hand-dropping in some AVX instructions does in fact make sense, we are well past the days when you would manually plop in any SSE opcodes, let alone write blocks in assembly (especially since doing that will actively harm performance). Never mind that you aren't going to see any real improvement in overall performance by optimizing a few instructions (again, barring the specific AVX use cases); it's program architecture that is going to drive performance.

As for compilers, no one outside of the people who actively work on them is going to touch them. Even recompiling for new CPU architectures is rare (especially given that, for most programs, the generic SSE3 codepath should be more than good enough), although it's sometimes necessary if a given CPU architecture is different enough that it trips up the compiler to the point where it affects performance. [And outright compiler *bugs* are rare; I've seen two in my lifetime, both in Diab.]
 
Years ago, a colleague had a data-acquisition task to program for. He had a hard 100 ms window to grab the voltage, convert it to digital, process it, and store it. Using C, his tightly coded routine took 141 ms. The guru programmer was asked for help. He approved the code… it was like a high-end modernist Swedish home: minimalist, with no extraneous steps. So the guru turned to the compiler output. He looked at the intermediate code… that's redundant… that's redundant… don't need to check that… etc. After 15 minutes he submitted his modified code to the compiler to finish off, and when the resulting .exe was run, 65 ms was the result.
...HOW? A 100 ms window is an eternity; grabbing a voltage reading should only take a handful of ms (and most of that within whatever interface library to the HW is being used). I literally read dozens of sensors in that same timespan on a day-to-day basis.

It sounds to me like the code is *so* optimized that it's tripping up the optimizer. Even 65 ms sounds absurdly long to me.
 
Mar 10, 2020
...HOW? A 100 ms window is an eternity; grabbing a voltage reading should only take a handful of ms (and most of that within whatever interface library to the HW is being used). I literally read dozens of sensors in that same timespan on a day-to-day basis.

It sounds to me like the code is *so* optimized that it's tripping up the optimizer. Even 65 ms sounds absurdly long to me.
Don’t remember the exact details; it was over 30 years ago… just the overall difficulty and how the solution was done. The 486 was the king at the time; the embedded processor we were using was a lot less powerful. How? Use your imagination. Today it would barely take a handful of milliseconds.

The point was that hand optimisation has its place.
 
Don’t remember the exact details; it was over 30 years ago… just the overall difficulty and how the solution was done. The 486 was the king at the time; the embedded processor we were using was a lot less powerful. How? Use your imagination. Today it would barely take a handful of milliseconds.

The point was that hand optimisation has its place.
Oh, ok that at least makes sense.

From what I've found, though: instruction-level optimization more often than not ends up reducing performance by interfering with the optimizer. It has its place, but in recent years pretty much all of my performance gains have come through architectural improvements.
 

ottonis

Reputable
Jun 10, 2020
Apart from the "manual assembly vs. compiler-generated code" debate, I think everybody can agree that the voluntary work of the FFmpeg developers deserves a good load of appreciation, and that it doesn't hurt when programmers are familiar with assembly code and capable of using those skills to check the quality of compiler-generated code and to directly or indirectly optimize the compiler output.
So, kudos to these guys, not only for optimizing the speed of their codecs but also for utilizing modern hardware features such as AVX512.
 

bit_user

Titan
Ambassador
Years ago, a colleague had a data-acquisition task to program for. He had a hard 100 ms window to grab the voltage, convert it to digital, process it, and store it. Using C, his tightly coded routine took 141 ms. The guru programmer was asked for help. He approved the code… it was like a high-end modernist Swedish home: minimalist, with no extraneous steps. So the guru turned to the compiler output. He looked at the intermediate code… that's redundant… that's redundant… don't need to check that… etc. After 15 minutes he submitted his modified code to the compiler to finish off, and when the resulting .exe was run, 65 ms was the result.
How many years ago? Which compiler? I'm sure you can no longer answer this, but I also wonder what optimization flags they tried and whether they tried PGO or compiler hints (i.e. expect, unreachable, assume_aligned, prefetch, etc.).

There are some options like -fomit-frame-pointer and -fPIC that can also affect performance, as well as hardening options which do things like array bounds-checks.

However, based on your description, I'm guessing the real culprit is pointer aliasing. Here's where it pays to know your language and your tools, because you can simply assert that two pointers or variables don't alias, and the compiler will then treat them as if they don't. The main reason Fortran is traditionally faster is that it has no possibility of aliasing.
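A minimal sketch of that assertion (my example, using the __restrict extension accepted by GCC, Clang, and MSVC):

// Without __restrict, the compiler must assume out and in may overlap,
// so in[0] gets reloaded every iteration and vectorization is hampered:
void scale_alias(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] *= in[0];
}

// With __restrict, in[0] can be hoisted into a register once and the
// loop vectorized freely:
void scale_noalias(float *__restrict out, const float *__restrict in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] *= in[0];
}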

It's a shame that people often just ditch the compiler and drop straight to assembly language, in cases like this, because if they just put the time into figuring out why the compiler is generating this "unnecessary" code, they might actually learn something about their tools and how to use them better. Instead, hubris tends to rear its ugly head and they just assume it's because the compiler is dumb and they're ever so smart.

Ideally a compiler would optimise to the nth degree,
It's a tool, not magic. Don't think of them in idealized terms, any more than you would regard another complex machine, like a car engine or a printing press. One thing I always try to drill into junior programmers is: learn your tools! This is common sense, in any other trade, yet somehow I've met more than one working programmer, near the end of their career who almost seem proud of how little they know about them.

this would arbitrarily disadvantage one of the processor manufacturers (due to differences in hardware implementation) unless two heavily optimised code paths were produced. Then there are the differences between models within generations and between generations. But hand-optimising or coding in assembly for a limited number of device types isn't economically feasible unless the performance gain is essential.
You can usually specify both which machine you want the generated code to be compatible with and which it should be tuned for, via separate -mtune and -march options. There are blended targets, as well. I believe GCC has long supported function multi-versioning, though it's not an area I've delved into. This page describes the basics, but it's quite old and I'd imagine support for it has probably evolved since:
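As a minimal sketch of what function multi-versioning looks like today in GCC/Clang (my example; the attribute and ISA names are real, the function is made up):

// The compiler emits one clone per listed target plus a resolver that
// picks the best one at load time, based on the CPU it finds itself on.
__attribute__((target_clones("avx512f", "avx2", "default")))
void scale_fmv(float *out, const float *in, float a, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}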

Update:
Don’t remember the exact details; it was over 30 years ago… just the overall difficulty and how the solution was done. The 486 was the king at the time,
Oh. Well, some compilers were indeed a lot more primitive, back then. It's risky to apply lessons from such ancient history to modern tools and software ...at least, not without more specifics.
 

Tech0000

Reputable
Jan 30, 2021
Arguing the pros/cons and opportunities of using AVX512 assembly vs. C/C++:
1. AVX512 is a very large and rich instruction set - it is its own language. It allows for so much more than just compute, and I've seen few compilers take advantage of its full richness (although it's much better now than when it was launched in 2016/2017).
2. Yes, modern compilers are now usually very good at generating performant machine code. It will be very hard and time-consuming to improve on optimized compiler-generated code, but a good AVX512 assembly programmer can usually make some improvements in the critical inner loops where necessary.
3. Yes, AVX512 intrinsics will get you even closer to the optimal solution, since they let you hand-code much more productively while still allowing the compiler to optimize the integrated code (locally and globally) - the compiler can't globally optimize around pure asm (it just gets linked in at link time).

However:
1. Intrinsics unfortunately do not cover all AVX512 instruction forms. For instance this line, using the memory-operand embedded broadcast ({bcst}) decoration, cannot be expressed with AVX512 intrinsics (shown here in MASM syntax); you have to hand-code it in assembly yourself:

vfmadd231ps zmm28, zmm0, real4 bcst [rsi+64*5]

As it turns out, the embedded memory broadcast feature is extremely useful and in practice gets used over and over again, forcing you to drop to asm in certain code areas. But again, asm is very unproductive, so intrinsics are preferred wherever possible...

Here is the much bigger point that I feel a lot of the discussion is missing.

2. Vectorization using larger vector sizes, like AVX512's, allows you to vectorize your algorithm, and that is where the much bigger performance gain with AVX512 is.
This is critically important to understand: a compiler cannot intelligently transform a sequentially written algorithm in C/C++ into a vectorized algorithm.

For the interested programmer: to illustrate that this applies broadly, not just to compute-heavy algorithms, let's look at search as an example:

Try to vectorize the search algorithm that finds the maximum value in an array (of any length) of float32s and returns the index where the max value was found, together with the max value itself. A sequential solution will be an order of magnitude slower, and the compiler has no chance of turning the sequential algorithm into effective use of vectors.
In AVX512 you would use the fact that you can do 16 float32 compares, blends, etc. per instruction (i.e. using vmaxps, vcmpps, and vblendmps), change the algorithm, and use AVX512 intrinsics to take advantage of that (and with two AVX512 FMA units, as Intel Xeon has, per-instruction throughput drops to 0.5 cycles). Before looking up the solution, try to think how you would do it yourself first (Intel's site: MAXLOC).
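As a rough sketch of that approach (my own illustration, not Intel's code; it uses masked moves rather than vblendmps directly, assumes n is a multiple of 16, and ignores tie-breaking between duplicate maxima):

#include <immintrin.h>
#include <cfloat>
#include <cstddef>

// 16 lanes each track their own running max and the index it came from;
// a single horizontal reduction at the end picks the winning lane.
// Compile with -mavx512f.
void maxloc_avx512(const float *data, std::size_t n, float *max_out, std::size_t *idx_out)
{
    __m512  vmax = _mm512_set1_ps(-FLT_MAX);
    __m512i vidx = _mm512_setzero_si512();
    __m512i cur  = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    const __m512i step = _mm512_set1_epi32(16);

    for (std::size_t i = 0; i < n; i += 16) {        // assumes n % 16 == 0
        __m512 v = _mm512_loadu_ps(data + i);
        __mmask16 gt = _mm512_cmp_ps_mask(v, vmax, _CMP_GT_OQ);
        vmax = _mm512_mask_mov_ps(vmax, gt, v);      // blend in new maxima
        vidx = _mm512_mask_mov_epi32(vidx, gt, cur); // and their indices
        cur  = _mm512_add_epi32(cur, step);
    }

    float best = _mm512_reduce_max_ps(vmax);         // horizontal max
    __mmask16 m = _mm512_cmp_ps_mask(vmax, _mm512_set1_ps(best), _CMP_EQ_OQ);
    int lanes[16];
    _mm512_storeu_si512(lanes, vidx);
    *max_out = best;
    *idx_out = (std::size_t)lanes[__builtin_ctz((unsigned)m)]; // gcc/clang builtin
}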
Now, try to apply this to a common sorting algorithm, e.g. quick sort. Can this be vectorized similarly effectively?

3. There are academic attempts at creating libraries that take explicit advantage of the vectorization AVX512 (and AVX2, etc.) allows, e.g. Agner Fog's C++ vector class library (look it up: "C++ vector class library"). As a computer scientist, I at least found it interesting to start thinking about how to vectorize day-to-day C/C++ code and algorithms in general, to take better advantage of the AVX512 silicon, since compilers do not really do that (at least, I have not found a compiler that does it for me).
 

bit_user

Titan
Ambassador
I think everybody can agree that the voluntary work of the FFmpeg developers deserves a good load of appreciation
Sometimes, contributions come from full-time employees of heavy users, such as Netflix. Other times, they will contract the core developers of these projects, in order to have desired changes, improvements, or bug fixes made.

kudos to these guys, not only for optimizing the speed of their codecs but also for utilizing modern hardware features such as AVX512.
I fully agree that it's nice to have tools & libraries add support for newer CPU features. Optimized runtime libraries are one way programs can gain benefits by running on newer CPUs, even if they don't incorporate any AVX-512 code directly (or use compiler options to generate it).

At the end of the day, I don't really care whether someone hand-coded some assembly language that maybe they didn't really need to, as long as it works and I can enjoy the fruits of their labor.
 
Mar 10, 2020
How many years ago? Which compiler? I'm sure you can no longer answer this, but I also wonder what optimization flags they tried and whether they tried PGO or compiler hints (i.e. expect, unreachable, assume_aligned, prefetch, etc.).

There are some options like -fomit-frame-pointer and -fPIC that can also affect performance, as well as hardening options which do things like array bounds-checks.

However, based on your description, I'm guessing the real culprit is pointer aliasing. Here's where it pays to know your language and your tools, because you can simply assert that two pointers or variables don't alias, and the compiler will then treat them as if they don't. The main reason Fortran is traditionally faster is that it has no possibility of aliasing.

It's a shame that people often just ditch the compiler and drop straight to assembly language, in cases like this, because if they just put the time into figuring out why the compiler is generating this "unnecessary" code, they might actually learn something about their tools and how to use them better. Instead, hubris tends to rear its ugly head and they just assume it's because the compiler is dumb and they're ever so smart.


It's a tool, not magic. Don't think of them in idealized terms, any more than you would regard another complex machine, like a car engine or a printing press. One thing I always try to drill into junior programmers is: learn your tools! This is common sense, in any other trade, yet somehow I've met more than one working programmer, near the end of their career who almost seem proud of how little they know about them.


You can usually specify both which machine you want the generated code to be compatible with and which it should be tuned for, via separate -mtune and -march options. There are blended targets, as well. I believe GCC has long supported function multi-versioning, though it's not an area I've delved into. This page describes the basics, but it's quite old and I'd imagine support for it has probably evolved since:

Update:

Oh. Well, some compilers were indeed a lot more primitive, back then. It's risky to apply lessons from such ancient history to modern tools and software ...at least, not without more specifics.
Your detailed dissection of what I was suggesting is redundant.

Yes they are tools, they can’t perform magic. I suggested that if you read beyond the ideal “the nth degree”.

Specifics… it’s over 30 years ago… come on, use your brain. Also, it wasn’t my project. I was an interested observer. The numbers stuck because of the relative magnitude of the improvement. WRT juniors, I’m retiring in a couple of years. And yes, I get people to use tools and recognise limitations.

You can’t compare the feature set of the current tools to the tool set available back in 1992 unless it’s to see how primitive they were. They got the job done most of the time.


Try to be less critical.
 

bit_user

Titan
Ambassador
However:
1. Intrinsics unfortunately do not cover all AVX512 instruction forms. For instance this line, using the memory-operand embedded broadcast ({bcst}) decoration, cannot be expressed with AVX512 intrinsics (shown here in MASM syntax); you have to hand-code it in assembly yourself:

vfmadd231ps zmm28, zmm0, real4 bcst [rsi+64*5]

As it turns out, the embedded memory broadcast feature is extremely useful and in practice gets used over and over again, forcing you to drop to asm in certain code areas. But again, asm is very unproductive, so intrinsics are preferred wherever possible...
Are you certain there's no idiom by which a pair of intrinsics can be used to produce the same instruction? Compilers tend to be pretty good at using idioms as a way of accessing features of the underlying hardware.

2. Vectorization using larger vector sizes, like AVX512's, allows you to vectorize your algorithm, and that is where the much bigger performance gain with AVX512 is.
This is critically important to understand: a compiler cannot intelligently transform a sequentially written algorithm in C/C++ into a vectorized algorithm.
A lot of vectorization tends to be loop-level. I mentioned before that people still need to structure their code in a vectorizable fashion, even if they're not vectorizing it by hand. Yes, this means you need to have a basic handle on the ISA and think about how to map your code onto it, but you'd have to do that anyway and it's still much less work than doing it all manually.

Now, try to apply this to a common sorting algorithm, e.g. quick sort. Can this be vectorized similarly effectively?
To my earlier point about optimized libraries, there are ready-made options available for this:


A popular example of something higher level is this SIMD-optimized JSON parser:

3. There are academic attempts at creating libraries that take explicit advantage of the vectorization AVX512 (and AVX2, etc.) allows, e.g. Agner Fog's C++ vector class library (look it up: "C++ vector class library").
I would point out that the compiler intrinsics use a set of types that do support type safety. So, you can make your own typedefs and overloaded wrapper functions and operators without going as far as he did and actually defining wrapper classes. Better yet, when you just use lightweight type aliases, you can pass your vectors directly into any intrinsics you want to use.
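A minimal sketch of that alias approach (my illustration; the wrapper names are made up):

#include <immintrin.h>

typedef __m512 vfloat16;   // lightweight alias, no wrapper class

// Thin overloaded wrappers; they compile away entirely.
static inline vfloat16 vadd(vfloat16 a, vfloat16 b) { return _mm512_add_ps(a, b); }
static inline vfloat16 vmul(vfloat16 a, vfloat16 b) { return _mm512_mul_ps(a, b); }

// Because vfloat16 *is* __m512, values pass straight into raw intrinsics:
static inline vfloat16 fma_example(vfloat16 a, vfloat16 b, vfloat16 c)
{
    return _mm512_fmadd_ps(a, b, c);   // a*b + c
}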

I think there are some better examples, FWIW. It's been a while since I've looked. Eigen is one popular option:

At one point, there was a bid to add a SIMD library to Boost. I gather it wasn't accepted, but I haven't followed what the author has done with it since. The library review would probably be an interesting discussion thread to read, if you're interested in the various options and best methods of trying to model SIMD programming in C++.
 

Tech0000

Reputable
Jan 30, 2021
Are you certain there's no idiom by which a pair of intrinsics can be used to produce the same instruction? Compilers tend to be pretty good at using idioms as a way of accessing features of the underlying hardware.
Unless something has changed recently, I'm certain. It's a pretty commonly known example: memory-operand embedded broadcast cannot be expressed with intrinsics, and as I said, it is very useful. Is there a way to trick the compiler into doing it anyway? Well, I've not found one.

You would typically see this type of instruction repeated a few times in an unrolled-loop scenario:
vfmadd231ps zmm16, zmm0, real4 bcst [rsi]
vfmadd231ps zmm17, zmm0, real4 bcst [rsi+64]
vfmadd231ps zmm18, zmm0, real4 bcst [rsi+2*64]
vfmadd231ps zmm19, zmm0, real4 bcst [rsi+3*64]

A lot of vectorization tends to be loop-level. I mentioned before that people still need to structure their code in a vectorizable fashion, even if they're not vectorizing it by hand. Yes, this means you need to have a basic handle on the ISA and think about how to map your code onto it, but you'd have to do that anyway and it's still much less work than doing it all manually.
Correct, a lot of it is in the inner loops, BUT you do need to think broader than that - for instance, how you organize the data structures. If you naively design your data structures for sequential algorithms, you will get horrible memory throughput; good use of SIMD requires that you organize the data for SIMD, to maximize SIMD instruction throughput (see the sketch below). And I agree with you: you do need to understand the AVX512 ISA and write your C/C++ accordingly, to give the compiler at least a chance to use the rich AVX512 set effectively. But even then you may need to (in my experience) help the compiler with explicit intrinsics here and there when it still doesn't get it right. It's an iterative optimization process.
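A sketch of that data-layout point (my illustration): an array-of-structs layout forces strided, gather-like access, while a struct-of-arrays layout gives the unit-stride loads SIMD wants:

// Array-of-structs: x, y, z interleaved; a loop over x means strided loads.
struct PointAoS { float x, y, z; };

// Struct-of-arrays: each component contiguous, so a loop over x maps
// directly onto 16-wide AVX-512 loads and stores.
struct PointsSoA {
    float *x;
    float *y;
    float *z;
};

void scale_x(PointsSoA &p, float a, int n)
{
    for (int i = 0; i < n; i++)
        p.x[i] *= a;           // unit stride: trivially vectorizable
}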

To my earlier point about optimized libraries, there are ready-made options available for this:

A popular example of something higher level is this SIMD-optimized JSON parser:


I would point out that the compiler intrinsics use a set of types that do support type safety. So, you can make your own typedefs and overloaded wrapper functions and operators without going as far as he did and actually defining wrapper classes. Better yet, when you just use lightweight type aliases, you can pass your vectors directly into any intrinsics you want to use.

I think there are some better examples, FWIW. It's been a while since I've looked. Eigen is one popular option:

At one point, there was a bid to add a SIMD library to Boost. I gather it wasn't accepted, but I haven't followed what the author has done with it since. The library review would probably be an interesting discussion thread to read, if you're interested in the various options and best methods of trying to model SIMD programming in C++.
Yes, there are many SIMD library examples, and the ones you mention are great too. My point was larger than making good use of effective AVX512 vector libraries. I wanted to point out that vectorizing your code requires deeper thinking about both data structures and algorithms, and I used maxloc as an example (since Intel illustrates it quite well publicly). As you mentioned, Intel of course has an AVX512 quicksort as well (which performs quite well). I was hoping that, with the maxloc example as guidance, people would be stimulated to try to solve an AVX512 quicksort themselves, because it could start a process of thinking: "what else, and how, could I vectorize in my day-to-day work?"...

The issue is that most CS students do not really learn vectorized algorithms and data structures in their standard algorithms-and-data-structures or advanced programming classes, and are therefore taught to think sequentially, and maybe multi-core/multi-processor with threading. I feel that needs to change...

CPUs are not getting much faster sequentially, since sequential speed scales with clock frequency, which is limited by physics. The way forward to performance that is magnitudes faster is through effective use of MIMD and SIMD: multi-core/multi-CPU plus vectorization. CPU vectorization is underutilized, and unfortunately AVX512 got a bad reputation because of the initial implementation back in 2017 and Linus T's comment. Implementations are much better today (no more downclocking), and I am all in on AVX512 and vectorization.
 
Jun 5, 2024
1. Intrinsics unfortunately do not cover all AVX512 instruction forms. For instance this line, using the memory-operand embedded broadcast ({bcst}) decoration, cannot be expressed with AVX512 intrinsics (shown here in MASM syntax); you have to hand-code it in assembly yourself:

vfmadd231ps zmm28, zmm0, real4 bcst [rsi+64*5]
To be clear, intrinsics certainly do cover the vfmadd231ps instruction: _mm512_fmadd_ps covers it, along with the 132 and 213 variants. The difference between the three isn't relevant at the C abstraction level, so there would be no sense in distinguishing between them there. Likewise, whether the third operand is a register or a broadcast from memory is also not relevant at this abstraction level.

One can broadcast from memory using the appropriate intrinsic, and then perform FMA with its intrinsic. It would then be a trivial optimization for the compiler to ultimately omit the explicit 'broadcast' and use this 'bcst' decoration instead.
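For instance (a minimal sketch of my own of the pair just described; the intrinsic names are per Intel's intrinsics guide):

#include <immintrin.h>

// Broadcast-from-memory followed by FMA; a compiler that implements the
// folding described above would be free to emit a single vfmadd231ps with
// an embedded {1to16} broadcast operand. Check the generated asm to confirm.
static inline __m512 fma_bcast(__m512 acc, __m512 coeff, const float *p)
{
    __m512 b = _mm512_set1_ps(*p);           // scalar load + broadcast
    return _mm512_fmadd_ps(coeff, b, acc);   // acc += coeff * b
}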

And if clang/gcc don't currently implement this optimization (new w/ AVX512), do the world a favor and submit a feature request/bug report against it.
 