News Intel fires back at AMD's AI benchmarks, shares results claiming current-gen Xeon chips are faster at AI than next-gen EPYC Turin

Page 3 - Tom's Hardware community

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
425
760
Yes, like I said: it's really a convolution engine. That's a degenerate case.

It's funny to me that he thinks multiplying by lots of zeros is better than the conventional approaches of de-interleaving. Planar image formats are better for most purposes, anyways. If he started with a planar format, he could get an easy ~4x speedup! (And if you don't have a planar image, then you can write a simple AVX-based deinterleaver that's almost as fast as a memcpy().)
That's not the point, he is averaging the whole image color and whether it is planar or packed doesn't really matter -- you still have to do the math, just that with the tile engine you can do more at once than even with AVX-512.

And yes, multiplying by zero (and very likely by one too) usually gets a special-cased code path even in hardware, so it's probably faster than normal shuffling.
 

bit_user

Titan
Ambassador
That's not the point, he is averaging the whole image color and whether it is planar or packed doesn't really matter
LOL, it absolutely does matter! Multiplies by zero are wasted products. On a planar image, you can compute average color with no multiplication by zeros, resulting in a potential 400% speedup. If you can't see that, I don't know what else to say.

you still have to do the math, just that with tile engine you can do more at once than even with AVX 512.
My point wasn't whether it's better or worse than AVX-512, but that he wasn't doing anything surprising or even being very clever about efficiently using his hardware.

And yes, multiplying by zero (and very likely with one too) is usually special-cased code path even in hardware so it's probably faster than normal shuffling.
At best, they could've special-cased it to use less energy, but it's still wasting a slot where you could otherwise be doing a multiply.
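To make the slot-counting concrete, here's a toy sketch in plain Python (not AMX or AVX code; the buffer and coefficients are made up for illustration): averaging one channel of interleaved RGBA as a dot product spends 3 of every 4 multiply slots on zero coefficients, while the planar layout wastes none.

```python
# Toy model of the interleaved-vs-planar argument (not real AMX/AVX code).
# Interleaved RGBA pixels: R,G,B,A,R,G,B,A,...
pixels = [(10, 20, 30, 255), (20, 40, 60, 255),
          (30, 60, 90, 255), (40, 80, 120, 255)]
interleaved = [c for px in pixels for c in px]

# Averaging R from the interleaved buffer as a dot product needs a
# coefficient vector that is zero in 3 of every 4 slots.
coeff = [1, 0, 0, 0] * len(pixels)
products = [a * b for a, b in zip(interleaved, coeff)]
r_avg_interleaved = sum(products) / len(pixels)
useful = sum(1 for c in coeff if c != 0)   # multiply slots doing real work
wasted = len(coeff) - useful               # slots multiplying by zero

# Planar layout: the R channel is contiguous, so every slot is useful.
r_plane = [px[0] for px in pixels]
r_avg_planar = sum(r_plane) / len(r_plane)

assert r_avg_interleaved == r_avg_planar == 25.0
assert wasted == 3 * useful  # 75% of the slots are zeros -> the claimed ~4x
```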
 

CmdrShepard

On a planar image, you can compute average color with no multiplication by zeros, resulting in a potential 400% speedup. If you can't see that, I don't know what else to say.
Questions:

1. Why do you think you need to convert to planar to average R, G, B, and A separately?
2. Why only 4x speedup?
3. Why do you assume AVX-512 would be faster than TMUL in this case?
 

bit_user

1. Why do you think you need to convert to planar to average R, G, B, and A separately?
If you convolve them separately, then you don't need to waste 75% of your multiply slots on zeros.

2. Why only 4x speedup?
Because you're going from 25% of your multiply slots being nonzero to 100%.

3. Why do you assume AVX-512 would be faster than TMUL in this case?
I didn't say that. I just said you're wasting AMX throughput. I started out saying that processing a planar image would be 4x as fast. If you look at an image format like JPEG, it's not intrinsically interleaved. The interleaving must be explicitly performed by the decoder. So, you could just output planar, instead.

However, you could write an AVX-512 loop to deinterleave them and even inline it with your AMX code, so that you're concurrently using the AVX-512 pipeline for that, while the AMX engine is doing actual multiplies.
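For illustration, here's what that deinterleaving step produces, sketched in Python (the real version would use AVX-512 shuffles or gathers; this only shows the layout transform, on a made-up two-pixel buffer):

```python
# Sketch of what a deinterleaving pass produces (an AVX-512 version would
# do this with shuffles/gathers; this just shows the layout transform).
def deinterleave_rgba(buf):
    """Split an interleaved R,G,B,A,... byte sequence into four planes."""
    assert len(buf) % 4 == 0
    return buf[0::4], buf[1::4], buf[2::4], buf[3::4]

interleaved = bytes([10, 20, 30, 40,   50, 60, 70, 80])  # two RGBA pixels
r, g, b, a = deinterleave_rgba(interleaved)
assert r == bytes([10, 50]) and a == bytes([40, 80])
```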
 

CmdrShepard

If you convolve them separately, then you don't need to waste 75% of your multiply slots on zeros.
In that case yes, but I was under the impression you were talking about AVX-512 code here.
I didn't say that. I just said you're wasting AMX throughput. I started out saying that processing a planar image would be 4x as fast. If you look at an image format like JPEG, it's not intrinsically interleaved. The interleaving must be explicitly performed by the decoder. So, you could just output planar, instead.
Planar output from most JPEG decoders is usually YUV, not RGB, so you'd need to convert that too and, depending on subsampling, maybe upscale the U/V planes. It's not really practical or trivial to get planar output from most image format decoders.
However, you could write an AVX-512 loop to deinterleave them and even inline it with your AMX code, so that you're concurrently using the AVX-512 pipeline for that, while the AMX engine is doing actual multiplies.
Or you could just write AVX-512 code that does everything without deinterleaving, but the point is that involving AVX-512 would change power efficiency (AVX-512 is the single most power-hungry thing in an Intel CPU) and AVX-512 also has a different multiplier offset than TMUL, so you'd get different clocks too.
 

bit_user

Planar output from most JPEG decoders is usually YUV, not RGB
libjpeg is incredibly flexible. You can hook in your own colorspace transform that outputs how you want. Also, I was just using JPEG as an example.

It's not really practical or trivial to get planar output from most image format decoders.
I'm not sure about that. libjpeg is the only one I have direct experience with, but I've used OpenCV a fair amount and it has broad support for different image layouts.

Or you could just write AVX-512 code that does everything without deinterleaving but the point is that involving AVX-512 would change power efficiency
You don't think using 1/4th as many TMULs is more efficient? Even if the CPU has special handling of 0's, that still involves a ton of data movement to funnel 2kB of data from the AMX register file to the TMUL unit, every time!
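The 2 kB figure is easy to check from Intel's published AMX tile limits (16 rows of 64 bytes per tile; a tile multiply reads two source tiles, not counting the accumulator traffic):

```python
# Back-of-envelope for the "2 kB per multiply" point. AMX tiles are at most
# 16 rows x 64 bytes (per Intel's AMX tile limits); a tile multiply reads
# two source tiles, so the operand traffic per TDP* instruction is:
ROWS, ROW_BYTES = 16, 64
tile_bytes = ROWS * ROW_BYTES      # 1024 bytes per full tile
operand_bytes = 2 * tile_bytes     # two source tiles per tile multiply
assert tile_bytes == 1024
assert operand_bytes == 2048       # ~2 kB moved even if many values are 0
```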

(AVX-512 is the single most power-hungry thing in an Intel CPU)
That's for things like FMACs, not a simple shuffle operation or scatter/gather.
 

CmdrShepard

libjpeg is incredibly flexible. You can hook in your own colorspace transform that outputs how you want. Also, I was just using JPEG as an example.
libjpeg since being taken over by that Guido Vollbeding person has become total trash.

It never had good performance to begin with, and the codebase doesn't lend itself to SIMD optimizations at all -- that's what libjpeg-turbo is for and what I recommend using (unless you can use NVJPEG and skip the CPU completely).
You don't think using 1/4th as many TMULs is more efficient? Even if the CPU has special handling of 0's, that still involves a ton of data movement to funnel 2kB of data from the AMX register file to the TMUL unit, every time!
As I said before, multiplying by 0 and 1 are special cases (one produces zero, the other produces the second operand), so they are usually short-circuited in hardware. I don't know whether they actually "use" a multiplication slot for that, though; it depends on the implementation.
That's for things like FMACs, not a simple shuffle operation or scatter/gather.
Well, the reason AVX-512 is power-hungry is not just multiplication; it's 512 data lines for 512 bits of data from up to 3 sources and 1 destination throughout all 3 cache levels and the register file, which you need to keep powered up. When not executing AVX-512 code, the CPU can power down the upper 256 bits of the bus.
 

bit_user

libjpeg since being taken over by that Guido Vollbeding person has become total trash.
I was using that term generically. The one I first worked with was the classic v6. Lately, what I've used (and most distros have adopted) is libjpeg-turbo.

As I said before multiplying by 0 and 1 are special cases
Seems plausible, but how do you actually know?

I don't know whether they actually "use" multiplication slot for that though, depends on implementation.
Never mind that, it uses register space where you could have a nonzero coefficient.

Well the reason AVX-512 is power-hungry is not just multiplication, it's 512 data lines
And you don't think the 8192 data lines from AMX registers count for anything?

Also, most instructions only use 16 bits from the opmask register - not 512 bits.

throughout all 3 cache levels
First, that's not how CPUs work - you don't have each functional unit separately interacting with multiple levels of the cache hierarchy. Second, the AMX registers quite likely share the same ports to L1D as the ZMM registers.

and register file which you need to keep powered up. When not executing AVX-512 code you can power down upper 256 bits of the bus.
I think you're confused. You're citing the case for using vzeroupper, but that's only relevant if you're executing a mix of narrower SIMD instructions.

Anyway, I don't care if you believe me or not. I really don't begrudge you your CPU, nor do I have any stake in what you do with it.
 

CmdrShepard

Seems plausible, but how do you actually know?
I don't, I am just hoping that the tech has reached the level where they can afford a couple of value checks to avoid extra work.
Never mind that, it uses register space where you could have a nonzero coefficient.
That's true, it reduces the possible number of operations per unit of time. It just remains to run a test and see how much slower or faster it is than AVX-512 code.
And you don't think the 8192 data lines from AMX registers count for anything?
There aren't 8192 data lines -- the width is 64 bytes, which is still 512 bits. You can't have more data lines than your maximum bus width.
Also, most instructions only use 16 bits from the opmask register - not 512 bits.
If we are talking about zmm registers, they are 512 bits wide, so instructions utilizing them can't use less.
First, that's not how CPUs work - you don't have each functional unit separately interacting with multiple levels of the cache hierarchy. Second, the AMX registers quite likely shares the same ports to L1D as the ZMM registers.
That's not what I meant; I was trying to say that you will have everything active when using AVX-512 -- that's when the most of the CPU is powered on and it pulls the most power.
I think you're confused. You're citing the case for using vzeroupper, but that's only relevant if you're executing a mix of narrower SIMD instructions.
The first CPUs which supported AVX-512 still had a 256-bit data bus and worked with two halves of a zmm register at a time. Later CPUs actually expanded the data bus to 512 bits. However, if I am not mistaken, they could still turn off the upper 256 bits of the data bus to save power when no code that requires it is executing. Take this with a grain of salt, because I am unable to find the document where I read it right now.
 

bit_user

There aren't 8192 data lines -- width is 64 bytes which is still 512 bits. You can't have more data lines than your maximum bus width.
Between the AMX register file and the TMUL unit, there must be 2x 8192 bit data paths (not counting the return path), in order for it to sustain the computation rate you quoted.

Not sure what "bus width" has to do with it, or where you got the idea it was 64 bytes.

If we are talking about zmm registers they are 512 bits wide so instructions utilizing them can't use less.
You counted the opmask register as one of the operands. Those aren't 512 bits. They're 16-64 bits, but most instructions just use 16 bits.

That's not what I meant, I was trying to say that you will have everything active when using AVX-512 -- that's the most powered on CPU and pulls the most power.
AVX-512 was the most power-intensive, since it could do 16 fp64 FMAs and 8 fp64 adds per cycle. Each fp64 multiplier should be about 44 times as big as a BFloat16 multiplier. So, even though the AMX unit has 512 BFloat16 multipliers, that should only use as much die area (excluding muxes and other overhead) as 11.6 fp64 multipliers. That's why AVX-512 is such a fire-breathing dragon and running fp64 workloads is how to push it to its limits.
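The ~44x figure follows from the common rule of thumb that multiplier area grows roughly with the square of significand width; here is the arithmetic as a first-order sketch (ignoring muxes and other overhead, as noted above):

```python
# Rough area model: a WxW-bit array multiplier costs ~W^2 partial products.
FP64_SIG, BF16_SIG = 53, 8              # significand bits incl. implicit 1
ratio = (FP64_SIG / BF16_SIG) ** 2      # relative multiplier area
amx_in_fp64_units = 512 / ratio         # 512 bf16 multipliers, fp64-equivalent
assert round(ratio, 1) == 43.9          # the "~44 times" in the post
assert 11 < amx_in_fp64_units < 12      # the "~11.6 fp64 multipliers"
```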

First CPUs which supported AVX-512 still had 256-bit data bus
That's not true. What the Skylake (server) cores actually split was the register file.


Technically, Xeon Phi was the first CPU to implement AVX-512, though I haven't read much about those cores.

Later CPUs actually expanded the data bus to 512 bits.
Not sure where you get this idea of a "data bus". CPUs have lots of different data paths of varying sizes.

However, if I am not mistaken they could still turn off the upper 256 bits of the data bus to save power when code is not executing which requires it. Take this with a grain of salt, because I am unable to find the document where I read it right now.
Again, you're almost certainly thinking of this issue:

...but that only matters when you're executing 128-bit or 256-bit vector instructions. Not when you're just doing scalar logic/arithmetic (or probably x87 FP, for that matter).
 

CmdrShepard

Not sure what "bus width" has to do with it, or where you got the idea it was 64 bytes.
The width of the tile register is 64 bytes.
You counted the opmask register as one of the operands. Those aren't 512 bits. They're 16-64 bits, but most instructions just use 16 bits.
No, I was talking about loads and stores from/to a zmm register / memory. That takes a 512-bit bus, not the opmask register.
That's not true. What the Skylake (server) cores actually split was the register file.
I was referring to this, it's been a while since I wrote assembly code:
The Skylake-X and Cannon Lake support the AVX512 instruction set. There are two 256-bit vector execution units at port 0 and 1, respectively. These two units can be combined into one 512-bit unit when 512-bit vector instructions are executed. This combined 512-bit unit is accessed through port 0, while port 1 can be used for other purposes simultaneously.
Also this:
All vector units are divided into two or four lanes of 128 bits each
And finally this:
Warm-up period for YMM and ZMM vector instructions
The processor turns off the upper parts of the vector execution units when it is not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs. A sequence of code containing 256-bit vector operations will run at full speed after this warm-up period. The processor returns to the mode of slow 256-bit execution 2.7 million clock cycles, or 675 μs, after the last 256-bit instruction (These times were measured on a 4 GHz processor). Similar times apply to 512-bit vectors.
Perhaps I didn't phrase that very well, but as I said it's been a while.
Not sure where you get this idea of a "data bus". CPUs have lots of different data paths of varying sizes.
I am referring to memory read and write ports. Those can be 256 or 512 bit wide. With 256 bit ports you need two read ports to be able to execute 512 bit read from memory. Utilizing two ports for a single read will obviously use more power.
 

bit_user

The width of the tile register is 64 bytes.
A data path only that wide wouldn't be able to support the sustained compute figures you quoted. So, even though they're constrained to that byte width, it still doesn't mean one row is the maximum that gets sent to the TMUL unit per cycle.
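As a sanity check, taking the 512 BFloat16 multipliers mentioned earlier as the premise (the exact internal widths aren't public, so treat this as a sketch): a full 16x32 by 32x16 bf16 tile multiply is 8192 MACs, and feeding its 2 KiB of source operands over that many cycles already requires more than one 64-byte row per cycle.

```python
# Feasibility check on the bandwidth point, taking 512 bf16 MACs/cycle per
# TMUL (the multiplier count quoted earlier in the thread) as the premise.
MACS_PER_TILE_MUL = 16 * 16 * 32      # full 16x32 bf16 A times 32x16 B
MACS_PER_CYCLE = 512
cycles_per_tile_mul = MACS_PER_TILE_MUL // MACS_PER_CYCLE
operand_bytes = 2 * 16 * 64           # two 1 KiB source tiles
bytes_per_cycle = operand_bytes / cycles_per_tile_mul
assert cycles_per_tile_mul == 16
assert bytes_per_cycle == 128.0       # > one 64-byte row per cycle
```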

No, I was talking about loads and stores from/to a zmm register / memory. That takes a 512-bit bus, not the opmask register.
The context was that you were reaching for some explanation of why AVX-512 can be so energy-intensive, somehow arriving at 4x 512-bits worth of data movement. That's not loads and stores.

Well the reason AVX-512 is power-hungry is not just multiplication, it's 512 data lines for 512 bits of data from up to 3 sources and 1 destination ...

I was referring to this, it's been a while since I wrote assembly code:
Not the same thing as a "256-bit data bus", but whatever. We don't need to go off on another pointless tangent.

Also, please cite your sources, when you quote something from somewhere. Otherwise, I can't tell if you're quoting that from some random blog post or official Intel documentation (though I doubt it).

Furthermore, just writing assembly language wouldn't require you to know anything about these implementation-level details.

Also this:
Again, both irrelevant and questionable.

And finally this:
The funny thing about that is that it's defined first (implicitly) in contrast to 128-bit vector operations. Secondly, it's talking about pipeline logic, not data busses.

I am referring to memory read and write ports. Those can be 256 or 512 bit wide. With 256 bit ports you need two read ports to be able to execute 512 bit read from memory. Utilizing two ports for a single read will obviously use more power.
According to Wikichip:

  • Store is now 64B/cycle (from 32B/cycle)
  • Load is now 2x64B/cycle (from 2x32B/cycle)

So, no. Skylake-X could do 2x loads and 1x store at full 512-bit width.

This is really getting ridiculous. The only reason we're even down in the weeds of Skylake-X's memory subsystem is that you're just grasping at straws for why AVX-512 is so energy-intensive. You don't actually know. Please don't waste my time and yours just throwing stuff at the wall.
 

CmdrShepard

And how is your response disproving anything I said about AVX-512 and power without some counter-proof from a reputable source?

If you can't recognize quotes from Agner Fog's optimization manuals which I posted, that's your problem, not mine. I'll trust anything he says on the subject of internal CPU architecture because he knows his stuff much better than me (and apparently you).

I also had Skylake-X (the i7-9800X was my previous CPU), and I have both AVX-512 and TMUL and I ran tests with both, so I do have proof of how much power each of them pulls. You can choose not to trust me on that, but you can't claim I am wrong.
 

bit_user

And how is your response disproving anything I said about AVX-512 and power without some counter-proof from a reputable source?
The whole thing is a non-sequitur. It's just a tangent you went onto, after I pointed out how dumb it was to operate on interleaved images like that.

I never said AVX-512 isn't a power hog. I don't have the data to say which is the more severe power hog in Sapphire Rapids, between it and AMX. I do believe that Sapphire Rapids largely resolved the issue of AVX-512-based clock throttling, according to frequency data from when Phoronix tested it.

If you can't recognize quotes from Agner Fog's optimization manuals which I posted that's your problem, not mine.
Don't troll. You can't blame your failure to cite your sources on people not having memorized them.

I'll trust anything he says on the subject of internal CPU architecture because he knows his stuff much better than me (and apparently you).
He's not the only authority out there and I haven't studied his methodology enough to know how it compares with others, nor gone back and validated what he's said about older architectures that we now know more about.

Someone widely regarded as a quack by other molecular biologists could spin a convincing tale and I'd probably believe them, absent any other information on the topic. I lack the expertise in that field to spot a convincing fraud. As an outsider, you should be a little skeptical of anything anyone tells you, absent supporting information or at least some details about their methodology and how they arrived at their conclusions.

I'm not saying Agner Fog isn't good or right on most things, but I can't blindly accept what he says as the ground truth. It's just one data point, as far as I'm concerned.

I also had Skylake-X (the i7-9800X was my previous CPU),
I've dealt with AVX-512 clock-throttling issues on these, but in a professional capacity.

I have both AVX-512 and TMUL and I ran tests with both so I do have proof of how much power each of them pulls.
No, you merely have data points. You don't know that what you measured constitutes a realistic worst-case for either.

You can choose not to trust me on that, but you can't claim I am wrong.
LOL, you can't claim you're right! At least, not any sweeping conclusions you try to draw.
 

CmdrShepard

I never said AVX-512 isn't a power hog. I don't have the data to say which is the more severe power hog in Sapphire Rapids, between it and AMX. I do believe that Sapphire Rapids largely resolved the issue of AVX-512-based clock throttling, according to frequency data from when Phoronix tested it.
There's no such thing as AVX-512-based clock throttling -- there's only a negative turbo multiplier offset which is applied when those instructions are executed. That's not throttling, that's just down-clocking, which has separate offsets for AVX2, AVX-512, and TMUL, and it's by design. Unlocked Xeons like mine have those offsets fully adjustable -- you can even set them all to 0 if you can keep the CPU cool and provide enough current.
Don't troll. You can't blame your failure to cite your sources on people not having memorized them.
You are the one trolling, because you are more than capable of verifying the claim using Google. Here's another analysis that corroborates what I say about CPU shutting down lanes:

https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

He's not the only authority out there and I haven't studied his methodology enough to know how it compares with others, nor gone back and validated what he's said about older architectures that we now know more about.
Maybe you should ask Jim Keller for an opinion, he once worked on CPUs so he must know better.
As an outsider, you should be a little skeptical of anything anyone tells you, absent supporting information or at least some details about their methodology and how they arrived at their conclusions.
I have worked on code optimization since the Pentium MMX and I have used his manuals since Core and Nehalem; everything he ever claimed there was spot on. If you can't verify it, that's on you.
I'm not saying Agner Fog isn't good or right on most things, but I can't blindly accept what he says as the ground truth. It's just one data point, as far as I'm concerned.
You are welcome to find your own data points then and stop whining about me not citing my sources.
I've dealt with AVX-512 clock-throttling issues on these, but in a professional capacity.
Again, not throttling.
No, you merely have data points. You don't know that what you measured constitutes a realistic worst-case for either.
I ran the most intensive AVX-512 benchmark out there (Prime95, small FFTs). Feel free to do that and measure your own power draw with and without AVX-512 (there's a checkbox).

I also ran the most intensive TMUL AI micro-benchmark I could find and compile (from onnxruntime). The difference was 200W for TMUL vs 300W for AVX-512.

You can call that data points if you want, but I know for certain that Prime95 small FFT is the worst possible case for AVX-512; it's indisputably the benchmark with the highest power draw. I can't be sure for TMUL since it is still not widely used and the only benchmark I could find was the one in the ONNX runtime, so until there's one that draws more power when using TMUL we'll have to settle for that.
LOL, you can't claim you're right! At least, not any sweeping conclusions you try to draw.
You are now arguing on a "NO U" level of discussion, this has gone long enough for my taste -- have a nice day.
 

Ogotai

Reputable
Feb 2, 2021
348
231
5,060
I ran the most intensive AVX-512 benchmark out there (Prime95, small FFTs)
I also ran the most intensive TMUL AI micro-benchmark I could find and compile (from onnxruntime). The difference was 200W for TMUL vs 300W for AVX-512.
here is a thought, CmdrShepard.

post screen shots of these tests you claim to have run as well as the parameters of these tests and post em here. then we can all see the data you claim to have from your own tests...

You are welcome to find your own data points then and stop whining about me not citing my sources.
and i bet if bit_user didnt cite his sources, for some of the posts he has made, you would be saying the same thing, asking him to show his sources.
 

bit_user

There's no such thing as AVX-512-based clock throttling
LOL, you have no clue...

Now, you said you had a Skylake-X, correct? If you aren't using a server CPU, then you've got more power & thermal headroom to play with. However, server CPUs can't just blow past their rated power limits! So, what happens? They throttle their clocks, often dropping far below their supposed "base frequency", if need be.

You are the one trolling, because you are more than capable of verifying the claim using Google.
No, I'm not going to rely on Google, because I might find someone else who copied and pasted part of the same text and I don't want to try and check your source only to waste time reading something that wasn't actually what you were reading.

So, here's where I draw the line: either cite your sources or I will ignore the quotes.

Here's another analysis that corroborates what I say about CPU shutting down lanes:
I don't know why you're still talking about this! It's a tangent off of a tangent and proves nothing about anything!

Again, not throttling.
Indeed, it was! We got rid of the AVX-512 code and everything miraculously ran faster, accompanied by a proportional increase in the clock speeds relative to what we saw in the AVX-512 versions.

I ran the most intensive AVX-512 benchmark out there (Prime95, small FFTs).
I already pointed out the futility of this argument. Then, I told you how much bigger the fp64 multipliers are in the core's AVX-512 than its AMX multipliers. So let me try yet another tack: for your comparison to have any practical relevance to the discussion, you should try comparing them on similar tasks.

We could take de-interleaving an image, as an example. Or using bf16 AI, as another (although that might be flawed, as I mention below).
I also ran the most intensive TMUL AI micro-benchmark I could find and compile (from onnxruntime). The difference was 200W for TMUL vs 300W for AVX-512.
Heh, but what you don't know is how much it's memory-bottlenecked. That tends to be a limiting factor for AI performance, which is why Intel made the Xeon Max and why AI GPUs have stacks of super-fast HBM.

this has gone long enough for my taste -- have a nice day.
That's cool with me.
 

CmdrShepard

post screen shots of these tests you claim to have run as well as the parameters of these tests and post em here. then we can all see the data you claim to have from your own tests...
You don't get to make demands because I am not your mom -- I don't have to post anything other than what I already did, nor do I have any desire to prove myself to anyone here.

I gave more than enough information for those who doubt me to try to reproduce what I claim I tested and to compare their results to mine. You don't have access to a CPU with AVX-512 and TMUL to verify? Tough luck, you either accept my results or you can rent a cloud server for whatever smallest unit of time they sell to run some benchmarks and put your money where your mouth is.
and i bet if bit_user didnt cite his sources, for some of the posts he has made, you would be saying the same thing, asking him to show his sources.
Are you his pillow princess, to feel the need to defend him like this? He's a grown man; I am sure he can handle an argument without you as his advocate.

And we are done arguing anyway -- it's one thing to be sceptical, but totally another thing to not know who Agner Fog is and compare him to a quack if you claim you did any x86 code optimization at any point over the last 20+ years.

It's even worse when you ask for proof of my claim that the upper part of the AVX registers and data bus is shut off to save power, you get a quote, you get a source when you ask, and then you try to dismiss all of it by claiming it's tangential.
 

bit_user

And we are done arguing anyway -- it's one thing to be sceptical, but totally another thing to not know who Agner Fog is
Who said I don't know Agner Fog? I've seen his stuff for decades, also.

and compare him to a quack if you claim you did any x86 code optimization at any point over the last 20+ years.
Oh, somebody is triggered! I didn't say he was a quack, but I've seen him state a couple suspect things and I'm just not sure how much stock to put in everything he says without cross-checking some of his findings with what others are saying or validating them in my own testing (which I unfortunately haven't had occasion to do).

One thing I can say about his site & data is that it seems messy and disorganized to me, which raises a potential red flag. Like, all of his uOps details were mashed into one giant document that's not the easiest thing to jump into and find a specific fact about a given microarchitecture. There's the old stereotype of the disorganized genius, and I can't say that's wrong, but I've had more experience with disorganized people who are also disorganized thinkers. When trying to keep lots and lots of little details straight, good organization isn't just about cosmetics.

It's even worse when you ask for proof of my claim that the upper part of the AVX registers and data bus is shut off to save power,
Not what I said. I took issue with your use of the term "data bus".

then you try to dismiss all of it by claiming it's tangential.
It totally is! Let's not forget the only reason we even got onto that topic is that you reached for it to somehow explain why AVX-512 is such a power hog, which itself is already a tangent.

I can assure you Prime95 isn't burning that much power because of an extra 256 bits merely being read from & written back to the vector register file! You can see this yourself, if you write a microbenchmark that just does AVX-512 bitwise operations (i.e. where the amount of actual computation is trivial) in a loop. Run it and measure the power, compared to doing the same bitwise operations at 256-bit width. Now, try fp64 FMAs at 256 and 512 bits. You'll see it's the computation and not the mere data-movement that's the reason AVX-512 can be so energy-intensive.
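One way to run that comparison on Linux is to sample the RAPL energy counters around each loop (a sketch that assumes the intel-rapl powercap interface is present and readable; the path and the choice of workload are illustrative):

```python
# Measuring package power around a microbenchmark on Linux, via the RAPL
# powercap counters (assumes /sys/class/powercap/intel-rapl:0 exists and is
# readable; energy_uj counts microjoules and eventually wraps around).
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj(path=RAPL):
    with open(path) as f:
        return int(f.read())

def avg_watts(e0_uj, e1_uj, seconds):
    """Average power between two energy_uj samples (ignores counter wrap)."""
    return (e1_uj - e0_uj) / 1e6 / seconds

def measure(workload):
    e0, t0 = read_energy_uj(), time.monotonic()
    workload()                  # run the AVX-512 or AMX loop under test
    e1, t1 = read_energy_uj(), time.monotonic()
    return avg_watts(e0, e1, t1 - t0)

# The conversion itself: 150 J consumed over 0.5 s is 300 W.
assert avg_watts(0, 150_000_000, 0.5) == 300.0
```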

Yes, they optimized the data-movement aspect (did you not see the multiple times I mentioned VZEROUPPER?), but it's flawed logic to conclude that it constitutes the main power drain in AVX-512.
 

Ogotai

You don't get to make demands because I am not your mom -- I don't have to post anything other than what I already did, nor I have a desire to prove myself to anyone here.
i wasnt demanding anything, it was a reasonable request to see what your tests showed, to compare to the data that is already available. if that is not reasonable, then i dont know what is.

gave more than enough information for those who doubt me to try to reproduce what I claim I tested and to compare their results to mine.
what results ? all your posts were text, with no way to verify anything, hence the reasonable request for screen shots and the setup of what you did.
Tough luck, you either accept my results or you can rent a cloud server for whatever smallest unit of time they sell to run some benchmarks and put your money where your mouth is.
ok then, without screen shots, i wont take your results, and i will assume your results are fud, and false, and treat any posts you make like that as such.

put my money where my mouth is ? the same can be said about you.

Are you his pillow princess to feel the need to defend him like this? He's a grown up man, I am sure he can handle an argument without you as his advocate.
thanks for the insult... i wasnt being his advocate, i was attempting to prove a point. if he didnt cite his sources, you would be saying the same thing. and as i said, not citing your sources means your data could be made up.
 

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
425
760
i wasn't demanding anything, it was a reasonable request to see what your tests showed, to compare to the data that is already available. if that is not reasonable, then i don't know what is.
You wrote, and I quote:
post screen shots of these tests you claim to have run as well as the parameters of these tests and post em here. then we can all see the data you claim to have from your own tests...
It's been a while since I went to school, but when we learned languages that was the imperative mood, a.k.a. demanding. It wasn't reasonable, since if you can read you already have all the details (and if you think you don't, you could have asked for specifics), and it wasn't even polite -- it was accusatory.
what results? all your posts were text.
Don't they teach kids how to read nowadays?
with no way to verify anything, hence the reasonable request for screenshots and the setup of what you did.
Read what I claimed again; it's all there in black and white. You don't need screenshots to run Prime95 small FFT, right? I mentioned the exact CPU model many times in past discussions with bit_user, and given how your request wasn't polite, I can't be arsed to repeat myself. If you are interested you can find it in my post history.
ok then, without screenshots i won't take your results, and i will assume your results are FUD and false, and treat any posts you make like that as such.
You see, there's the problem -- nobody ever asked you to accept my results.

You butted into a conversation between two people (rude), started taking sides without trying to understand the context (obnoxious), and started making demands while accusing me of faking results (totally uncalled for because I have nothing to gain either way).
i wasn't being his advocate, i was attempting to prove a point. if he didn't cite his sources, you would be saying the same thing. and as i said, not citing your sources means your data could be made up.
And a screenshot can't be faked?

You have the power consumption numbers; I said which program I used, which test I ran, and on which CPU.

If you can't reproduce that simple process without screenshots, then you have no business making any assumptions, much less talking about it at all.
 

Ogotai

Reputable
Feb 2, 2021
348
231
5,060
It's been a while since I went to school but when we learned languages that was imperative tone, a.k.a. demanding. It wasn't reasonable since if you can read you already have all details (and if you think you don't you could have asked for specifics), and it wasn't even polite -- it was accusatory.
another insult.... but granted, i could have made the request a little nicer.
Don't they teach kids how to read nowadays?
yet another insult.
Read what I claimed again; it's all there in black and white. You don't need screenshots to run Prime95 small FFT, right? I mentioned the exact CPU model many times in past discussions with bit_user, and given how your request wasn't polite, I can't be arsed to repeat myself. If you are interested you can find it in my post history.
and i don't have that hardware to run it on; hence asking for your results.
You butted into a conversation between two people (rude), started taking sides without trying to understand the context (obnoxious), and started making demands while accusing me of faking results (totally uncalled for because I have nothing to gain either way).
ok sure, if it's wrong to actually be interested in a conversation and to ask to see the results that were run, then i don't know what to say. posting a message on here is not rude; i have seen others on here jump in all the time. FYI, i also wasn't taking sides, i just would have liked to see your results, and sorry, but you typing them doesn't prove they are right or wrong.

And a screenshot can't be faked?
and any graph or table can be faked as well; what's your point?

If you can't reproduce that simple process without screenshots then you have no business making any assumptions, much less talking about it at all.
which i can't, as i don't have the hardware you do, hence why i asked...

but considering practically your whole post was insulting, it shows the type of person you are. my post wasn't meant to be insulting; it was meant to ask for your results in a way that could be verified, and i apologize if it came across that way.

but whatever...
 

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
425
760
another insult.... but granted, i could have made the request a little nicer.
No insults there, just facts.
yet another insult.
Nope. If I wanted to insult you I would have done it, but you ain't even worth getting a ban for.
and i don't have that hardware to run it on; hence asking for your results.
So be nicer when you ask people to do you a favor, then. If I posted numbers, you can assume I didn't pull them out of my arse -- I ran the tests.

You are asking me to waste my time re-running them just because you don't trust my written word and crave screenshots instead.

In my book, asking me to take the time to do it again is at least worth being polite about, instead of outright accusing me of cheating and threatening not to take any of my posts seriously because I refused to be bossed around by some anonymous forum user.
 
  • Like
Reactions: bit_user