So here we go.
"Yes, I did read all of your posts, in full. I simply don't accept your power and performance claims, based solely on extrapolations of old generation designs, using old process technology. So, if you don't have better data, then let's move on."
Then you have a problem with accepting things.
As a matter of fact, I do have better data, and I am going to discredit you completely, since you asked for it. I will also show that you don't know what you are talking about.
"Yes, I did read, but they are rather vague about exactly how they compute GOPS, for instance. Anyway, they are clearly talking about 8-bit MACs, of which Knights Landing cores can each do 128/cycle. However, these run at 3x the clock speed, resulting in the equivalent of 384."
Yes, they are, and not only SIMD MACs but combined ones.
I hate to break it to you, but the 512-bit AVX SIMD units on those Phis don't support 8-bit precision; they only support single (32-bit) and double (64-bit) precision (they do support unaligned access, though no one will use it), and neither is suitable for deep learning, which actually uses 4-, 8- & 12-bit math. So it's actually 16 (32-bit) lanes per 512-bit SIMD unit, or 32 per Atom core, compared to 64 per 512-bit SIMD on that DSP and 256 for the quad-core cluster; that comes to 8x per clock (the sketch below sums up the lane math). You did mention the 3x clock speed of the Atoms (which isn't even the case; it's only 1.9x) and tried to claim that equals 3x the performance, but you aren't counting the architectural disadvantages CISC has compared to DSPs, where simultaneous combined MACs per cycle come into play; the DSPs will be vastly faster in any case. Even though the AVX implementation on the Phi doesn't support 8- and 16-bit (byte & word) precision (there is an AVX-512BW extension, so far implemented only on some upcoming Skylake Xeons), it does have a prefetch instruction extension (AVX-512PF), and that certainly helps a lot, but it's still a far cry from what you find in the CNN block of this DSP.
Source:
https://en.m.wikipedia.org/wiki/Advanced_Vector_Extensions
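Just to sum up the lane math in one place, here is a minimal Python sketch. The lane and unit counts are the figures I'm claiming above, not vendor-verified specs:

```python
# A minimal sketch of the per-clock lane comparison above. The lane and
# unit counts are the figures claimed in this post, not verified specs.

VECTOR_BITS = 512

# Knights Landing: FP32 lanes per 512-bit vector, two vector units per Atom core.
knl_lanes_per_core = (VECTOR_BITS // 32) * 2      # 16 lanes x 2 = 32

# DSP: 8-bit MAC lanes per 512-bit SIMD, four cores in the quad cluster.
dsp_lanes_per_cluster = (VECTOR_BITS // 8) * 4    # 64 lanes x 4 = 256

# Raw per-clock ratio, before clock speed or architecture enter the picture.
print(dsp_lanes_per_cluster / knl_lanes_per_core) # 8.0
```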
This actually explains how the Phi could come close to Nvidia's new GPU offering (along with Nvidia dropping support for FP16 math). So you can actually see how neither Atoms nor Nvidia GPUs are exactly the best match for deep-learning algorithms, judging by their architecture.
"Regarding power, Knights Landing can run 72 cores @ 1.5 GHz + 16 GB of HBM + 6-channel DDR4 memory controller + 40 lanes of PCIe 3.0 + the cache-coherent bus tying it all together, in only 245 W. So, while we don't know exactly how many watts each core uses, it's likely in the vicinity of 2 W."
Speculating again, aren't you? You really need to work on your math! It's more likely they use a bit over 3 W per core, as HBM is really low on power consumption. Really, your math is nonexistent (3x clock speed & 2 W per core)...
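A quick back-of-the-envelope check (my arithmetic only, assuming the uncore parts take a modest share of the TDP):

```python
# Rough per-core power: divide the full 245 W TDP across the 72 cores and
# treat the uncore (HBM, DDR4 controller, PCIe) as a small slice on top.
TDP_W = 245
CORES = 72
print(TDP_W / CORES)  # ~3.4 W per core before any uncore power is subtracted
```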
"And, BTW, the 8.2 mm^2 figure you cited excludes FPU" not really relevant in this case"
Now for the calculations:
Let's say this DSP quad-core cluster, on the same advanced node (Intel's 14 nm FinFET), would use about 45% of the energy it uses on the 28 nm LL planar process (assuming we don't change anything else). That equals 585 mW; let's say 600 mW to keep it fair. The Atom consumes 3 W at 1.9 GHz, and since that is more than 2x the 800 MHz clock, for the sake of calculation let's say it would consume around 1200 mW at 800 MHz (and I am being generous here). That is actually twice as much, so we can draw the conclusion that it has 2x the gates, meaning it's twice the size, which is actually much worse than I first suggested. So 8x more DSP lanes per clock times 2x the size equals 16x, but that is only the theoretical peak. Now we add the architectural difference and say the DSP will have at least 150% of the performance of the SIMD/CISC Atom, which makes it 24x. That is much less than 40x, but there is no compiler or math library that can actually harvest the full capability of the Phi's AVX SIMD units, and that will significantly cripple its performance; the Synopsys compiler, on the contrary, is well optimized and can utilize the full potential of their DSP, so you will easily come to a 40x index. If you use open-source compilers, which are about 2x slower, it rises to 80x.
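Here is the whole chain in one place (a sketch of the estimates above; the 45% process factor, the power-as-gate-count proxy, and the 1.5x architecture and compiler multipliers are all assumptions, not measurements):

```python
# DSP quad-cluster power, scaled from 28 nm planar to Intel's 14 nm FinFET.
dsp_28nm_mw = 1300                  # implied: 585 mW is ~45% of this figure
dsp_14nm_mw = dsp_28nm_mw * 0.45    # ~585 mW
dsp_14nm_mw = 600                   # rounded up "to keep it fair"

# Atom core, scaled down from ~3 W @ 1.9 GHz; a generous estimate at 800 MHz.
atom_800mhz_mw = 1200

size_ratio  = atom_800mhz_mw / dsp_14nm_mw  # 2x power taken as a proxy for 2x gates
lanes_ratio = 8                             # from the per-clock lane comparison above
arch_factor = 1.5                           # assumed DSP architectural advantage

theoretical = lanes_ratio * size_ratio      # 16x raw throughput advantage
with_arch = theoretical * arch_factor       # 24x
print(theoretical, with_arch)               # 16.0 24.0

# Compiler maturity carries the rest of the argument: 24x grows to ~40x
# against poorly utilized AVX-512, and to ~80x if the x86 side is built
# with an open-source compiler assumed to be 2x slower.
print(with_arch * 40 / 24, with_arch * 40 / 24 * 2)  # 40.0 80.0
```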
"You might recall that your position was that it's bad, ineffective, and overpriced" so isn't it? You can quote me on that as much as you like in the future.
"My position is simply that it's neither bad, nor ineffective" so your position is deadly wrong. Seams somehow you are stuck with it.
After reconsidering all of this, Knights Landing is an actual product that is highly overpriced, inefficient, and that you don't want to have, although they did make significant improvements compared to the previous generation.
"I don't understand the strong distinction you're drawing between DSPs and GPUs, and I think I do know a bit about each. Please enlighten us, if you wish".
I won't really go into any details here!
Although both of them employ SIMD ALU units, they are significantly different from an architectural standpoint. The most colorful example I can think of to describe it: GPUs are like one-way roads, while DSPs are two-way highways with lots of on- and off-ramps along the way. So basically you can drive cars in only one direction on a GPU, while on a DSP you can drive (several) different cars in different directions simultaneously, and they can jump in and out along the way much faster, to be replaced by others. This is much more efficient if you need it; if you don't, GPUs could win in the most highly parallel tasks, but even then they won't, as they also carry other building blocks needed for graphics processing that specialized digital signal processors don't have. Besides, only about 3% of all tasks would benefit from such a high level of parallelism. Then again, FPGAs are like an open field that you can program however you want (for any task), and helped by DSP blocks they can do MACs much more efficiently than ever before, so all in all they are the best possible choice for architecting anything (making your highways, and as many of them, as you need).
I hope this is enough for you; if not, try to find some good scientific papers and read them. Although I didn't find any that use modern DSPs purpose-adapted for deep learning, you can certainly find ones that will give you a better overview of the architecture.
Best regards. As far as I'm concerned, this topic is closed.