News: Chinese algorithm claimed to boost Nvidia GPU performance by up to 800X for advanced science applications

It makes a lot of sense that China is investing heavily on the software side in order to do more with less.

All of these U.S. government restrictions kind of miss the point: they only restrict the top-end models, and then here come software optimizations like this.
 
Sounds like good work!
Not shocking; this kind of thing has happened with Fourier transforms and even matrix solvers, year after year.
That sort of thing is probably a very minor part of NVDA sales.
 
I just have to roll my eyes at this. Any time you see the popular press report on something from a scientific journal, it should be viewed through a lens of skepticism. It's probably well outside the expertise of the article's author, who might not be very accustomed to this type of research and who lacks the context needed to interpret its results. In virtually every case of this I've seen on Tom's, they're actually basing their article on an article in another publication, which adds yet another unknown into the mix.

The article said:
Their PD-General framework achieved up to 800x speed gains on an Nvidia RTX 4070 compared to traditional serial programs
Okay, but who's using serial, anyhow? Especially when they talk about the GPU replacing "costly, high-performance computing clusters." This is the same kind of marketing BS we get from Nvidia, where they like to trumpet how much faster a CUDA version of something is than a slow, lame old implementation that no one serious would actually run.

The article said:
... and 100x faster performance than OpenMP-based parallel programs.
Again, this raises lots of questions, like: which backend, CPU or GPU? If CPU, what kind?

In general, OpenMP isn't usually very good. It's what you use when you've got a bunch of legacy code and you just want a quick, easy, low-risk way of speeding it up on a multi-core or GPU-accelerated machine. So, 100x probably isn't very surprising, here.
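To give a sense of what "quick, easy, low-risk" means in practice: a typical OpenMP retrofit of legacy simulation code is literally one pragma on the hot loop. A minimal sketch (the loop body is a made-up placeholder, not anything from the paper):

Code:
#include <vector>

// Hypothetical legacy hot loop; the physics is a placeholder.
// The only change needed for OpenMP is the pragma; the data layout
// and algorithm stay exactly as they were.
void update_forces(std::vector<float>& force, const std::vector<float>& disp)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)force.size(); ++i)
        force[i] = 2.0f * disp[i] - disp[i] * disp[i];  // stand-in for per-particle work
}

That's why OpenMP numbers make for a soft baseline: you get some multicore scaling, but none of the data-layout and memory-hierarchy work that a serious CUDA port forces on you.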

I don't know if this is a nothing burger, where someone just got the standard GPU compute-level speedups from porting their code to a GPU, or if there was any novel algorithmic breakthrough that enabled the CUDA version to be quite so much faster.

I will say it's interesting they opted to use a consumer GPU for this. That's because they have pretty lousy fp64 performance, which is usually what scientific and engineering code uses. So, one innovation might be careful management of numerical precision, in order to utilize fp32 arithmetic.
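For context, GeForce cards run fp64 at only a small fraction of their fp32 rate. One common way the precision choice shows up in code (purely hypothetical here, not something from the paper) is to make the working precision a template parameter:

Code:
#include <cuda_runtime.h>

// Hypothetical: with precision as a template parameter, the same kernel can run
// in fp32 on GeForce cards or in fp64 where the hardware supports it well.
template <typename Real>
__global__ void saxpy_like(int n, Real a, const Real* x, Real* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // stand-in for the real per-particle arithmetic
}

// Usage sketch:
//   saxpy_like<float> <<<grid, block>>>(n, 2.0f, d_x, d_y);   // consumer-GPU path
//   saxpy_like<double><<<grid, block>>>(n, 2.0,  d_x, d_y);   // fp64 path

Then you could instantiate the whole pipeline in float on a 4070 and in double on an HPC card, and compare the results to see where fp32 is good enough.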

If anyone has a link to the paper, I'd be interested in having a glance at it. During a brief search for it, I found this paper from 2017, claiming a speed-up of 12 to 100x relative to sequential code. The link is just to an abstract, so I don't know what kind of GPU they used (but probably a fair bit slower than an RTX 4070):

Accelerating Peridynamics Program Using GPU with CUDA and OpenACC

J.X. Li, J.M. Zhao, F. Xu, and Y.J. Liu
  1. Institute for Computational Mechanics and Its Applications (NPUiCMA), Northwestern Polytechnical University, Xi’an, 710072, P. R. China.
  2. Mechanical Engineering, University of Cincinnati, Cincinnati, Ohio, 45221-0072, USA.

https://www.sci-en-tech.com/ICCM2017/PDFs/2404-8302-1-PB.pdf

BTW, OpenACC is a related/derivative of OpenMP.
 
If anyone has a link to the paper, I'd be interested in having a glance at it.
Here, I found the 2025 article for you. I’m interested in your opinion on it.

https://doi.org/10.1016/j.enganabound.2025.106133

To me, when I see fantastical numbers like 800x, my first thought is that they are probably talking about a single step in a long chain of steps that they optimized by 800x, and when taken in context of the entire operation minimally reduces compute time. But I’ll dig into it as well to see if I’m right or if this really is a breakthrough.

Edit: after reading the article, it seems that this is simply a targeted optimization for the resource structures of an RTX 4070 and only an RTX 4070. This means that to get the kind of performance improvement in a way that competes with existing methods, the researchers would need to R&D an optimization scheme for each individual GPU used in this market. It’s cool, but software tool providers are not going to spend this kind of time trying to make 50+ unique optimization schemes only compatible with 1 specific GPU, so comparing this to the GPU agnostic software tools on the market is a bit like comparing an ASIC to a CPU’s general computing core.
 
Simulation running in parallel instead of in series? As a mechanical engineer, I'd need a TON of testing before trusting it. This is OpenMP-based? My teams ONLY use that for hacked-together solutions to run old sims on more CPU threads, and only when they can't hack it into some form of CUDA code instead. It seems this isn't actually parallel simulation replacing serial simulation, though.
 
Here, I found the 2025 article for you. I’m interested in your opinion on it.

https://doi.org/10.1016/j.enganabound.2025.106133
Thanks! How very resourceful of you!
; )

Since I'm not about to pay the $25 for the published version and haven't found a preprint copy, I just have to go by what's in that excerpt. Fortunately, it includes a part where the authors discuss prior work. According to them, the best CUDA versions, to date, have achieved 100x and "hundreds of times". So, this narrows their achievement to < 8x, although that's not a trivial win, when improving on an already accelerated solution.

In that section, they also outline their novel contributions, which include actually tuning the GPU-ported algorithm to better align with the hardware structures, which they say hadn't been done in prior efforts. They highlighted the following performance bottlenecks and other limitations, which their implementation targeted:
  1. The memory space allocated for storing neighborhood points does not have a predetermined size, which leads to inefficient use of thread and memory resources and makes it challenging for GPUs to handle large-scale problems.
  2. Most GPU parallel calculations still heavily rely on global memory and have not fully utilized CUDA's memory structure, resulting in wasted memory bandwidth.
  3. Most PD parallel algorithms lack general utility. Some may restrict the size of the neighborhood, handle only uniformly distributed and undamaged discrete structures, or support only a particular PD theory.

BTW, what they mean by "global memory" is the GDDR6X. An alternative is to use local SRAM, which has orders of magnitude higher bandwidth.
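To make that concrete: the textbook fix for over-reliance on global memory is to stage the data a whole thread block will reuse into __shared__ memory and read it from there. A generic tiling sketch, assuming a flat array of particle values; this is not the paper's code:

Code:
#include <cuda_runtime.h>

#define TILE 128  // threads per block = elements staged on-chip per pass

// Hypothetical kernel, launched with TILE threads per block: each block copies a
// tile of particle data into shared memory (on-chip SRAM) once, then every thread
// reuses it many times, instead of issuing redundant loads to global memory (GDDR).
__global__ void interact(int n, const float* __restrict__ pos, float* __restrict__ out)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float xi  = (i < n) ? pos[i] : 0.0f;
    float acc = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : 0.0f;  // one global load per element
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += tile[k] - xi;  // placeholder for a real pairwise interaction term
        __syncthreads();
    }

    if (i < n)
        out[i] = acc;
}

Shared memory is small (tens of KB per SM), which is part of why their point 1 matters: it's hard to tile neighborhood data whose size you haven't bounded.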

Some specific optimizations and improvements mentioned in the excerpt include:
  • The algorithm implements a particle-parallel mode, as well as a more efficient access and storage strategy that exploits register bandwidth. This removes the limit on the number of neighborhood points and speeds up the neighbor search, resulting in significant speedups compared to the serial program and other parallel algorithms. It allows for quick analysis of deformation and crack propagation in BBPD, and the framework is also applicable to other PD theories.
  • The thread blocks are organized as 32 * k (k is a positive integer). The study compares the performance of different thread-block organizations in three models, using the computation of internal forces as the baseline.
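In plain CUDA terms, "particle parallel mode" means roughly one thread per material point, and the 32 * k block size just keeps every block a whole number of 32-thread warps, so no lanes go idle. A hypothetical launch sketch (the kernel body and names are mine, not theirs):

Code:
#include <cuda_runtime.h>

// Hypothetical one-thread-per-particle kernel; the body is a placeholder.
__global__ void internal_force(int n, const float* disp, float* force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // particle index
    if (i < n)
        force[i] = disp[i];  // stands in for the real bond-force accumulation
}

void launch_internal_force(int n, const float* d_disp, float* d_force)
{
    const int k = 4;                             // tunable knob; 32 * k keeps whole warps busy
    const int block = 32 * k;                    // 128 threads per block
    const int grid  = (n + block - 1) / block;   // enough blocks to cover every particle
    internal_force<<<grid, block>>>(n, d_disp, d_force);
}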

Edit: after reading the article, it seems that this is simply a targeted optimization for the resource structures of an RTX 4070 and only an RTX 4070.
I think it's not nearly so GPU-specific. Nvidia GPUs have used a warp size of 32 for a while. The compute parameters are pretty consistent over all models sharing the same CUDA Compute Capability and generally don't change much, from one generation to the next.

According to this, the Compute Capability of the RTX 3000 series is 8.6. For RTX 4000, it's 8.9. You have to click where it says GeForce and TITAN Products:

It seems they have yet to update that for RTX 5000, but a web search tells me it's 10.1.
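If you want to see what your own card reports, the CUDA runtime exposes the compute capability and warp size directly:

Code:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // An RTX 4070 should report 8.9 here, with a warp size of 32.
    std::printf("CC %d.%d, warpSize %d, sharedMemPerBlock %zu KiB\n",
                prop.major, prop.minor, prop.warpSize,
                prop.sharedMemPerBlock / 1024);
    return 0;
}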

You have put your finger on an interesting area of work, which is the performance-portability of parallel programs. Parallelizing these sorts of programs typically introduces a set of variables that should be tuned to the hardware, in order to achieve optimal performance. It's not something I've closely followed, so I can't say much about the state of the art, in that area.
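As a small example of what that tuning can look like: instead of hard-coding a block size for one card, you can ask the runtime to suggest an occupancy-friendly one for whatever device the code lands on. It's only a partial answer to performance portability, but it gives the flavor:

Code:
#include <cuda_runtime.h>

__global__ void internal_force_kernel(int n, const float* disp, float* force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        force[i] = disp[i];  // placeholder body
}

void launch_portable(int n, const float* d_disp, float* d_force)
{
    int minGrid = 0, block = 0;
    // Ask the runtime for an occupancy-friendly block size on *this* GPU,
    // rather than baking in a number tuned for one specific card.
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, internal_force_kernel, 0, 0);
    int grid = (n + block - 1) / block;
    internal_force_kernel<<<grid, block>>>(n, d_disp, d_force);
}

It says nothing about the deeper decisions, like how the neighborhood data is laid out or how much shared memory to use per block, which is where the real per-GPU tuning tends to live.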

This means that to get the kind of performance improvement in a way that competes with existing methods, the researchers would need to R&D an optimization scheme for each individual GPU used in this market.
Based on what I've been able to see in the summary, I'd hazard a guess that it's going to run with comparable efficiency on most Nvidia GPUs from recent generations. You might be right that a few knobs would need tuning for best performance when going from Ada to Ampere, or between smaller and larger GPUs. Still, we're probably not talking about a huge difference; I'd guess something less than a factor of 2 in efficiency.
 
Why is it so hard to write 'scientists optimize large-scale physics simulations by 100-800x using off-the-shelf RTX 4070s'?

They did not optimize RTX performance in general. They optimized the algorithm, which previously ran in a fairly generic OpenMP environment, by 100x. Since they also quote an 800x comparison with serial execution, it's likely the baseline was a CPU, not even a GPU.

It's perhaps interesting as a sign that the HPC crowd is slowly embracing more scalable computational methods, but it's hardly new or groundbreaking, and it definitely won't make anyone's RTX go faster (at least until someone decides to make a physics-accurate Fruit Ninja).
 
Wonder whose IP the Chinese and Russians stole this time?
Why shouldn't it be legit? With 3x the population of the USA, China should have about 3x as many geniuses.

I wouldn't even say it's genius-level work, from what I've seen of it. Mainly, just sensible code optimization. This isn't surprising, because a lot of scientific software is written by people whose expertise is mainly in their scientific discipline and not in software development. So, a lot of scientific software tends not to be terribly well-written or highly-optimized. This situation is improving, but it's taken time.

Their approach was obviously informed by prior work, but that's completely standard and the paper acknowledges this, in detail. Basically every contribution of this nature and level is going to reference prior work. It would be irresponsible not to.
 
Not really a breakthrough here; it's just coding the way the processor is designed to work, instead of trying to force it to work the way the code was traditionally written, probably in Fortran. While science moves quickly, scientists move at a glacial pace.
 
Why is it so hard to write 'scientists optimize large-scale physics simulations by 100-800x using off-the-shelf RTX 4070s'?

They did not optimize RTX performance in general. They optimized the algorithm, which previously ran in a fairly generic OpenMP environment, by 100x. Since they also quote an 800x comparison with serial execution, it's likely the baseline was a CPU, not even a GPU.

It's perhaps interesting as a sign that the HPC crowd is slowly embracing more scalable computational methods, but it's hardly new or groundbreaking, and it definitely won't make anyone's RTX go faster (at least until someone decides to make a physics-accurate Fruit Ninja).
Because the truth is not as "shock and awe" as the clickbait article needs it to be.
 
I don't know if this is a nothing burger, where someone just got the standard GPU compute-level speedups from porting their code to a GPU, or if there was any novel algorithmic breakthrough that enabled the CUDA version to be quite so much faster.

I will say it's interesting they opted to use a consumer GPU for this. That's because they have pretty lousy fp64 performance, which is usually what scientific and engineering code uses. So, one innovation might be careful management of numerical precision, in order to utilize fp32 arithmetic.
That it's being done on normal consumer GPUs is the point of it. With sanctions restricting Chinese and Russian access to higher-grade Nvidia GPUs, they looked for a way to improve scientific-calculation performance on GPUs that aren't affected by sanctions. It looks like they succeeded. A head of a Russian software association said elsewhere that this breakthrough means they'll also need to buy fewer Nvidia GPUs.
 
There's nothing to wonder about; they stole the idea from the Nvidia drivers and game code themselves.
Drivers? No. Not even game code, since those use HLSL - not CUDA.

There are like hundreds of books written about CUDA programming, by now, given that it's been around since like 2006. The CUDA-level stuff they did is pretty much textbook CUDA optimization.

Don't believe me? Just ask over on Nvidia's CUDA developers forum. I'm sure the experienced hands over there would tell you it sounds like pretty standard CUDA code optimization techniques.
 
That it's being done on normal consumer GPUs is the point of it. With sanctions restricting Chinese and Russian access to higher-grade Nvidia GPUs, they looked for a way to improve scientific-calculation performance on GPUs that aren't affected by sanctions.
I dunno. Now that I've seen excerpts of the paper, no mention was made of managing arithmetic precision. So, maybe we're overthinking it.

Aside from that, CUDA is sort of a write-once, run-anywhere kind of technology. You can do your development on pretty much any of their GPUs and run it on the others (within limits). So, just having an optimized CUDA version doesn't actually mean they won't run it on the 100-series GPUs. In fact, you can get P100s and V100s awfully cheap these days. Probably even in China.
 
Drivers? No. Not even game code, since those use HLSL - not CUDA.

There are like hundreds of books written about CUDA programming, by now, given that it's been around since like 2006. The CUDA-level stuff they did is pretty much textbook CUDA optimization.

Don't believe me? Just ask over on Nvidia's CUDA developers forum. I'm sure the experienced hands over there would tell you it sounds like pretty standard CUDA code optimization techniques.
Lol
 
Since I'm not about to pay the $25 for the published version and haven't found a preprint copy,

I've got access to the full paper through my institution.

They basically developed a new algorithm; the paper describes two variants of it. The algorithm is well tailored to massive parallelism, and on top of that they applied additional hand-tuned optimizations.
They compare three implementations running on different platforms:
  1. Serial: runs on a single CPU core. It's not clear if it's a previous algorithm or a serial implementation of their proposal. It is possible (my guess) that their proposal is very efficient on parallel systems but performs worse than previous sequential algorithms on a single core.
  2. OpenMP: an OpenMP implementation of their algorithm that runs on a multicore CPU.
  3. CUDA: a CUDA implementation that runs on a GeForce RTX 4070 GPU.

100X is the (best case) speed-up of their OpenMP implementation (2) over the serial one (1).
800X is the (best case) speed-up of their CUDA implementation (3) over the serial one (1). That is, GPU vs. single CPU core.

They never compare with previous work on GPUs other than the related work references.

I've seen many applications where the GPU implementation shows speedups in the range of 100X to 1000X vs. an unoptimized serial version on a single CPU core. So, nothing to write home about.

Maybe they do achieve a great speed-up over serial in this particular use case, and their implementation is more efficient on GPUs than previous ones. But they never boosted Nvidia GPU performance by up to 800X, which is the reason people are dropping Nvidia stock. If anything, the paper is a reason to sell Intel and buy Nvidia. But I guess the application is so narrow that the actual impact will be nil.

Best.
 
