And we are done arguing anyway -- it's one thing to be sceptical, but totally another thing to not know who Agner Fog is
Who said I don't know Agner Fog? I've seen his stuff for decades, also.
and compare him to a quack if you claim you did any x86 code optimization at any point over the last 20+ years.
Oh, somebody is triggered! I didn't say he was a quack, but I've seen him state a couple suspect things and I'm just not sure how much stock to put in everything he says without cross-checking some of his findings with what others are saying or validating them in my own testing (which I unfortunately haven't had occasion to do).
One thing I can say about his site & data is that it seems messy and disorganized to me, which raises a potential red flag. Like, all of his uOps details were mashed into one giant document that's not the easiest thing to jump into and find a specific fact about a given microarchitecture. There's the old stereotype of the disorganized genius, and I can't say that's wrong, but I've had more experience with disorganized people who are also disorganized thinkers. When trying to keep lots and lots of little details straight, good organization isn't just about cosmetics.
It's even worse when you ask for proof of my claim that upper part of AVX registers and data bus are being shut off to save power,
Not what I said. I took issue with your use of the term "data bus".
then you try to dismiss all of it by claiming it's tangential.
It totally is! Let's not forget the only reason we even got onto that topic is that you reached for it to somehow explain why AVX-512 is such a power hog, which itself is already a tangent.
I can assure you SuperPi isn't burning that much power because of an extra 256 bits merely being read from & written back to the vector register file! You can see this yourself, if you write a microbenchmark that just does AVX-512 bitwise operations (i.e. where the amount of actual computation is trivial) in a loop. Run it and measure the power, compared to doing the same bitwise operations at 256-bit width. Now, try fp64 FMAs at 256 and 512 bits. You'll see it's the computation and not the mere data-movement that's the reason AVX-512 can be so energy-intensive.
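For anyone who wants to try it, here's a rough sketch of the kind of kernels I mean. Treat it as a starting point under assumptions, not a finished benchmark: the function names and operand values are made up for illustration, it assumes GCC or Clang on x86-64 (the `target` attributes let each function use its ISA extension without global `-mavx*` flags), and each loop is a single dependency chain, so it's latency-bound. To really load the execution units you'd want several independent accumulators per loop.

```c
#include <immintrin.h>
#include <stdint.h>

/* Illustrative kernels only; names and constants are hypothetical. */

/* 256-bit bitwise loop: three XORs per iteration, almost no real computation. */
__attribute__((target("avx2")))
uint64_t bitwise_256(uint64_t iters) {
    __m256i a = _mm256_set1_epi64x(0x0123456789abcdefLL);
    __m256i b = _mm256_set1_epi64x(0x0f0f0f0f0f0f0f0fLL);
    __m256i c = _mm256_set1_epi64x(0x00ff00ff00ff00ffLL);
    for (uint64_t i = 0; i < iters; i++) {
        a = _mm256_xor_si256(a, b);
        b = _mm256_xor_si256(b, c);
        c = _mm256_xor_si256(c, a);
    }
    uint64_t out[4];
    _mm256_storeu_si256((__m256i *)out, a);
    return out[0];              /* use the result so the loop is not dropped */
}

/* Same bitwise loop at 512-bit width. */
__attribute__((target("avx512f")))
uint64_t bitwise_512(uint64_t iters) {
    __m512i a = _mm512_set1_epi64(0x0123456789abcdefLL);
    __m512i b = _mm512_set1_epi64(0x0f0f0f0f0f0f0f0fLL);
    __m512i c = _mm512_set1_epi64(0x00ff00ff00ff00ffLL);
    for (uint64_t i = 0; i < iters; i++) {
        a = _mm512_xor_si512(a, b);
        b = _mm512_xor_si512(b, c);
        c = _mm512_xor_si512(c, a);
    }
    uint64_t out[8];
    _mm512_storeu_si512(out, a);
    return out[0];
}

/* 256-bit fp64 FMA loop: a = a * 0.5 + 1.0, which settles at 2.0. */
__attribute__((target("avx2,fma")))
double fma_256(uint64_t iters) {
    __m256d a = _mm256_set1_pd(1.0);
    const __m256d b = _mm256_set1_pd(0.5);
    const __m256d c = _mm256_set1_pd(1.0);
    for (uint64_t i = 0; i < iters; i++)
        a = _mm256_fmadd_pd(a, b, c);
    double out[4];
    _mm256_storeu_pd(out, a);
    return out[0];
}

/* Same FMA loop at 512-bit width. */
__attribute__((target("avx512f")))
double fma_512(uint64_t iters) {
    __m512d a = _mm512_set1_pd(1.0);
    const __m512d b = _mm512_set1_pd(0.5);
    const __m512d c = _mm512_set1_pd(1.0);
    for (uint64_t i = 0; i < iters; i++)
        a = _mm512_fmadd_pd(a, b, c);
    double out[8];
    _mm512_storeu_pd(out, a);
    return out[0];
}
```

Call one kernel per run from a small `main` with an iteration count large enough to last a few seconds, pin it to a core, and read package energy externally. On Linux systems that expose RAPL counters, something like `perf stat -a -e power/energy-pkg/` should do. Check that the CPU actually supports AVX-512 first (e.g. `__builtin_cpu_supports("avx512f")`), or the 512-bit kernels will fault.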
Yes, they optimized the data-movement aspect (did you not see the multiple times I mentioned VZEROUPPER?), but it's flawed logic to conclude that it constitutes the main power drain in AVX-512.