DCT blocks vary in size for video, and I am sure with some clever tricks you could do more than one block at a time given the tile size.
I trust you know what a dot product is, right? It has one answer. Multiplying an MxK matrix A by a KxN matrix B amounts to MxN dot products, each over K-element vectors (A's column count has to match B's row count). I'm no math genius, but I don't see a generalized way to build a joint matrix multiply out of that. For an 8x8 matrix, I think it always amounts to 64 dot products of 8-element vectors. Those vectors are so small that even the largest 32x32 blocks can fit in regular AVX registers. AMX has no special advantage here.
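To spell that out, here's a minimal NumPy sketch (nothing AMX- or DCT-specific is assumed) showing that an 8x8 product really is just 64 independent 8-element dot products:

```python
import numpy as np

# Two 8x8 blocks, like the DCT example above
A = np.random.rand(8, 8)
B = np.random.rand(8, 8)

# Build the product one dot product at a time:
# C[i, j] = dot(row i of A, column j of B)
C = np.empty((8, 8))
for i in range(8):
    for j in range(8):
        C[i, j] = np.dot(A[i, :], B[:, j])  # one 8-element dot product

# 8 * 8 = 64 dot products total, matching the library matmul
assert np.allclose(C, A @ B)
```

Each of those 8-element rows/columns fits comfortably in a single AVX register, which is the point: a dedicated tile unit isn't buying you much at these sizes.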
Newsflash: you can do quality audio even with a 1-bit ADC.
There's no free lunch. After delta-sigma conversion, your bitrate is usually even higher and the processing is now much more complex! I think most non-trivial processing on delta-sigma bitstreams probably converts back to PCM before anything else. Delta-sigma was mostly done as a hack to make high-performance DACs cheaper.
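To make that concrete, here's a rough Python sketch of the round trip: a first-order 1-bit modulator plus a crude moving-average decimation filter to get back to PCM. The oversampling ratio, sample rates, and filter are just illustrative assumptions; real converters use higher-order modulators and much better filters, but the structure is the same:

```python
import numpy as np

def delta_sigma_1bit(x):
    """Crude first-order delta-sigma modulator: samples in [-1, 1] in,
    a +/-1 bitstream out (a rough model of a 1-bit oversampling ADC)."""
    bits = np.empty_like(x)
    integrator = 0.0
    feedback = 0.0
    for n, sample in enumerate(x):
        integrator += sample - feedback           # error accumulates
        bits[n] = 1.0 if integrator >= 0.0 else -1.0
        feedback = bits[n]
    return bits

def bitstream_to_pcm(bits, osr):
    """Back to multi-bit PCM: low-pass filter (a moving average here)
    and decimate by the oversampling ratio."""
    kernel = np.ones(osr) / osr
    filtered = np.convolve(bits, kernel, mode="same")
    return filtered[::osr]

# 10 ms of a 1 kHz tone, oversampled 64x relative to a 48 kHz target rate.
osr = 64
fs = 48_000 * osr
t = np.arange(int(0.01 * fs)) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)

bits = delta_sigma_1bit(tone)      # 1 bit at 3.072 MHz = 3.072 Mbit/s...
pcm = bitstream_to_pcm(bits, osr)  # ...vs 16 bit at 48 kHz = 768 kbit/s PCM
```

Note the bitstream runs at roughly 4x the bitrate of 16-bit/48 kHz PCM, and you still end up back in PCM before doing anything like EQ or mixing.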
But to answer your question, if your dynamic range is controlled you don't really need 16 bits to get good enough quality for speech.
Oh, sure. If speech at the quality of legacy telephone systems is all you want, then 8 bits is plenty. As I said, you can do some audio analysis at 8 bits (including speech recognition). However, when people speak more broadly of audio signal processing, I figure they're including applications like processing for studio production and live music performance.
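For what it's worth, part of how legacy telephony squeezes acceptable speech out of 8 bits is companding. Here's a minimal sketch of the mu-law idea (the concept behind G.711, not the bit-exact codec tables):

```python
import numpy as np

MU = 255.0  # mu-law parameter used in North American/Japanese telephony

def mulaw_encode(x):
    """Compress samples in [-1, 1] logarithmically, then quantize to 8 bits.
    Small (quiet) samples get finer steps than large ones."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_decode(codes):
    """Expand 8-bit codes back to floating-point samples."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# Quiet speech-level signal: 8 companded bits keep it usable, where
# plain 8-bit linear PCM would be much coarser at this level.
speech = 0.05 * np.sin(2 * np.pi * 300 * np.arange(480) / 8000)
roundtrip = mulaw_decode(mulaw_encode(speech))
```

That's the "controlled dynamic range" trade-off in action: fine for phone-grade speech, not for studio material.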
It is nowhere near the performance of an RTX 4090, of course, but it provides a good speedup for CPU inference.
Not only for training.
Yeah, but because you can use cheaper client GPUs for inference, and memory size doesn't tend to be such a limiting factor, I think the main use case for AMX is in training.
That's not what I am saying -- think of a hypothetical code completion model, for example. You type a character in a text editor, it is sent to the GPU along with context, the model analyzes it and returns results to display. The GPU round trip is always going to have more latency than CPU processing here, and you can't really hide it by batching.
Yes, realtime inferencing in a closed loop is an example of where you can't use batching. Realtime control applications, like robotics or self-driving, are where you tend to find examples of that. However, those things typically don't involve a big server CPU. Instead, they tend to use something like these Nvidia AGX systems:
The world's first family of systems for Autonomous Machines, Self Driving Cars, and Medical Imaging.
www.nvidia.com
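Back to the batching point for a second: here's a toy calculation with entirely made-up numbers, just to illustrate why batching buys throughput but not latency for a single interactive user:

```python
# All numbers below are hypothetical, purely for illustration.
transfer_ms = 0.5       # PCIe round trip per request (assumed)
gpu_batch_ms = 1.0      # GPU compute per batch, roughly flat in batch size (assumed)
keystroke_gap_ms = 50   # gap between requests from one interactive user (assumed)

def per_request_latency(batch_size):
    # A single user's request must wait for the batch to fill before it is
    # even dispatched, then it pays the transfer and compute costs.
    wait_to_fill = (batch_size - 1) * keystroke_gap_ms
    return wait_to_fill + transfer_ms + gpu_batch_ms

for n in (1, 4, 8):
    print(n, per_request_latency(n))  # 1: 1.5 ms, 4: 151.5 ms, 8: 351.5 ms
```

Throughput per GPU-second goes up with batch size, but the interactive user's latency is dominated by waiting for the batch to fill, which is exactly why closed-loop realtime cases can't lean on batching.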
Considering the prices of such GPUs, that seems like a smart move. Not only do you control the core count and RAM amount, allowing you to scale out as needed,
Oh, but your scaling is effectively limited to 8 sockets. Training needs high inter-processor bandwidth, hence Nvidia's focus on NVLink. Sure, you can scale out to multiple chassis, but it doesn't scale linearly in performance, and especially not in cost. People do it, as I said, but it's not a panacea.
you also aren't limited to using only CUDA.
Yes, and that's a good thing. The near-term practical benefits of avoiding CUDA in AI are virtually nonexistent, but I want to see us move away from CUDA as much as anyone.
Sierra Forest parts are E-core Xeons, which have different accelerators -- they don't have AVX-512 or AMX.
My point was that Sapphire & Emerald Rapids have AMX accelerators - you're paying for that silicon even if you don't need it (e.g. for web serving or general database serving). This problem wasn't solved until Sierra Forest, but that came years late.
There's also another upside (provided that the tech works as advertised) to using CPUs for AI instead of GPUs -- Intel TDX. You can have VMs fully isolated from the hypervisor and everything else, with encrypted memory. As far as I know, GPUs don't support this level of protection yet.
You really should give Nvidia more credit, given how long they've been in the cloud computing game. If features like that are important for CPUs, you can bet Nvidia's customers have been asking for the same sorts of capabilities in their GPUs. I don't know much about it, but Nvidia claims to have a feature called Confidential Computing:
"While data is encrypted at rest in storage and in transit across the network, it’s unprotected while it’s being processed. NVIDIA Confidential Computing addresses this gap by protecting data and applications in use. The NVIDIA Hopper architecture introduces the world’s first accelerated computing platform with confidential computing capabilities.
With strong hardware-based security, users can run applications on-premises, in the cloud, or at the edge and be confident that unauthorized entities can’t view or modify the application code and data when it’s in use. This protects confidentiality and integrity of data and applications while accessing the unprecedented acceleration of H200 and H100 GPUs for AI training, AI inference, and HPC workloads."
World’s most advanced GPU.
www.nvidia.com
Let's be clear about something:
even Intel didn't expect AMX would effectively counter GPUs!
I think AMX was part of a hedge on the bets Intel was making in Ponte Vecchio (GPU Max) and Habana Labs (Gaudi). They knew Xeon Phi was going away and they'd have to transition to those new product lines, but they weren't sure how good or quick uptake would be, or how many customers would come along.