AMD has published benchmarks of DeepSeek's R1 AI model running on its flagship RX 7900 XTX that show the GPU outperforming both the Nvidia RTX 4090 and RTX 4080 Super.
In other news, Congress begins talks on embargoing exports of the 7900 XTX to China...
Maybe this illustrates the difference when software is explicitly written for one vendor or the other? I'm pretty sure most games are written for Nvidia, seeing as they own something like 85% of the market, though a few AMD-sponsored games do much better on AMD. Could this be the same effect? It's not hard to imagine the Chinese developers writing this for AMD, since the RTX 4090 is embargoed and the 7900 XTX is not.
This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
I was testing this in LM Studio last week, in the LM Studio Discord, with a 4090 user; there is no grain of salt needed, it's been verified.
On the 7B, 8B, and 14B models the XTX is faster. The 4090 is a little faster on the 32B model, by about 4%.
This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.
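For a sense of scale, here's a rough back-of-envelope sketch of that bandwidth ceiling (the ~960 GB/s and ~1008 GB/s figures are the two cards' published specs; the Q4 quantization is my assumption):

```python
# Single-batch token generation streams all model weights from VRAM once
# per token, so tokens/s can't exceed (memory bandwidth) / (model bytes).

gpus_gbs = {"RX 7900 XTX": 960, "RTX 4090": 1008}  # published specs, GB/s

for name, bw in gpus_gbs.items():
    for params_b in (7, 14, 32):
        q4_bytes = params_b * 1e9 * 0.5  # ~0.5 bytes per parameter at Q4
        print(f"{name}, {params_b}B @ Q4: <= {bw * 1e9 / q4_bytes:.0f} tok/s")
```

With nearly identical bandwidth, the two cards get nearly identical ceilings, which would explain the parity in those benchmarks.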
I didn't find an official number for how many TOPS the 7900 XTX is good for, but the figure 123 did pop up. That's only about 37% of Nvidia's dense TOPS (and half that ratio, once you count Nvidia's extra throughput on matrices with optimal sparsity).
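Taking that 123 number at face value, and assuming ~330 dense TOPS for the 4090 (which is what the ~37% ratio implies), the compute ceiling sits way above the bandwidth ceiling anyway:

```python
# Compute ceiling: a forward pass costs roughly 2 ops per parameter per
# token (one multiply + one add), so tokens/s <= TOPS / (2 * params).
xtx_tops, nv_dense_tops = 123, 330  # 330 is assumed from the ~37% ratio

for name, tops in (("RX 7900 XTX", xtx_tops), ("RTX 4090", nv_dense_tops)):
    for params_b in (7, 14, 32):
        print(f"{name}, {params_b}B: <= {tops * 1e12 / (2 * params_b * 1e9):,.0f} tok/s")

print(xtx_tops / nv_dense_tops)        # ~0.37 of Nvidia's dense TOPS
print(xtx_tops / (2 * nv_dense_tops))  # ~0.19 vs sparse, i.e. half the ratio
```

Thousands of tokens per second on the compute side versus a few hundred from the bandwidth bound, so memory-bound looks right.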
It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.
I didn't find an official number for how many TOPS the 7900 XTX is good for, but the figure 123 did pop up. That's only about 37% of Nvidia's dense TOPS (and half that ratio, once you count Nvidia's extra throughput on matrices with optimal sparsity).
It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
The last bit is mostly semantics, as matrix workloads running on Nvidia's Tensor Cores don't also use the SM's standard ALUs at the same time. And AMD has added dedicated hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that. It's still dedicated AI/matrix math hardware, just not split out into completely separate "cores" à la Nvidia post-Turing or CDNA.
In other words, when matrix operations run on either architecture, they're executed (at least partially) on dedicated fixed-function matrix hardware; it's just that AMD put said hardware inside their standard CUs (which of course limits its capability) whereas Nvidia broke it out into its own, totally separate unit.
AMD's method of running matrix operations on the standard CUs (basically having them pull double duty) instead of using fully dedicated matrix cores saves significant die space, but of course at the cost of peak performance (which, honestly, makes perfect sense for an architecture aimed at gamers and mainstream consumers 🤷).
Pretty sure that's not true. I think Tensor cores are a separate pipeline from their vector pipes, hence you should be able to overlap vector ops with Tensor ops. Confirmed here:
And AMD has added dedicated hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that.
That's not how I read it. To me, it sounds like WMMA uses their vector pipeline, but simply short-circuits the VGPRs by using dedicated storage for intermediates. This means you can't overlap WMMA and other instructions in the same CU.