AMD has published benchmarks of DeepSeek's R1 AI model running on its flagship RX 7900 XTX that show the GPU outperforming both the Nvidia RTX 4090 and RTX 4080 Super.
In other news, Congress begins talks on embargoing exports of the 7900 XTX to China...
Maybe this illustrates the difference when software is explicitly written for one vendor or the other? I'm pretty sure most games are written for Nvidia, seeing as they own something like 85% of the market, though a few AMD-sponsored games do much better on AMD. Could this be the same effect? It's not hard to imagine the Chinese developers writing this for AMD, since the RTX 4090 is embargoed and the 7900 XTX is not.
This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
I was testing this in LM Studio last week, in the LM Studio Discord, with a 4090 user; there is no grain of salt needed, it's been verified.
On the 7B, 8B, and 14B models the XTX is faster. The 4090 is a little faster on the 32B model, by about 4%.
This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.
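For a sense of scale, here's a rough back-of-envelope sketch of that bandwidth ceiling (the ~960 GB/s and ~1008 GB/s figures are the two cards' published specs; the Q4 quantization is my assumption):

```python
# Single-batch token generation streams all model weights from VRAM once
# per token, so tokens/s can't exceed (memory bandwidth) / (model bytes).

gpus_gbs = {"RX 7900 XTX": 960, "RTX 4090": 1008}  # published specs, GB/s

for name, bw in gpus_gbs.items():
    for params_b in (7, 14, 32):
        q4_bytes = params_b * 1e9 * 0.5  # ~0.5 bytes per parameter at Q4
        print(f"{name}, {params_b}B @ Q4: <= {bw * 1e9 / q4_bytes:.0f} tok/s")
```

With nearly identical bandwidth, the two cards get nearly identical ceilings, which would explain the parity in those benchmarks.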
I didn't find an official number for how many TOPS the 7900 XTX is good for, but the figure 123 did pop up. That's only about 37% of Nvidia's dense TOPS (and half that ratio, once you count Nvidia's extra throughput on matrices with optimal sparsity).
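Taking that 123 number at face value, and assuming ~330 dense TOPS for the 4090 (which is what the ~37% ratio implies), the compute ceiling sits way above the bandwidth ceiling anyway:

```python
# Compute ceiling: a forward pass costs roughly 2 ops per parameter per
# token (one multiply + one add), so tokens/s <= TOPS / (2 * params).
xtx_tops, nv_dense_tops = 123, 330  # 330 is assumed from the ~37% ratio

for name, tops in (("RX 7900 XTX", xtx_tops), ("RTX 4090", nv_dense_tops)):
    for params_b in (7, 14, 32):
        print(f"{name}, {params_b}B: <= {tops * 1e12 / (2 * params_b * 1e9):,.0f} tok/s")

print(xtx_tops / nv_dense_tops)        # ~0.37 of Nvidia's dense TOPS
print(xtx_tops / (2 * nv_dense_tops))  # ~0.19 vs sparse, i.e. half the ratio
```

Thousands of tokens per second on the compute side versus a few hundred from the bandwidth bound, so memory-bound looks right.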
It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.
I didn't find an official number for how many TOPS the 7900 XTX is good for, but the figure 123 did pop up. That's only about 37% of Nvidia's dense TOPS (and half that ratio, once you count Nvidia's extra throughput on matrices with optimal sparsity).
It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
The last bit is mostly semantics, as matrix workloads running on Nvidia's Tensor Cores don't also use the SM's standard ALUs at the same time. And AMD has added dedicated hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that. It's still dedicated AI/matrix math hardware, just not split out into completely separate "cores" à la Nvidia post-Turing or CDNA.
In other words, when matrix operations run on either architecture, they're executed (at least partially) on dedicated fixed-function matrix hardware; it's just that AMD put said hardware inside their standard CUs (which of course limits its capability) whereas Nvidia broke it out into its own, totally separate unit.
AMD's method of running matrix operations on the standard CUs (basically having them pull double duty) instead of using fully dedicated matrix cores saves significant die space, but of course at the cost of peak performance (which, honestly, makes perfect sense for an architecture aimed at gamers and mainstream consumers 🤷).
Pretty sure that's not true. I think Tensor cores are a separate pipeline from their vector pipes, hence you should be able to overlap vector ops with Tensor ops. Confirmed here:
And AMD has added dedicated hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that.
That's not how I read it. To me, it sounds like WMMA uses their vector pipeline, but simply short-circuits the VGPRs by using dedicated storage for intermediates. This means you can't overlap WMMA and other instructions in the same CU.