I just have to roll my eyes at this. Any time you see the popular press report on something from a scientific journal, it should be viewed through a lens of skepticism. The subject is probably well outside the expertise of the article's author, who might not even be very accustomed to this type of research and lacks the context needed to interpret its results. In virtually every case of this I've seen on Tom's, they're actually basing their article on an article in another publication, which adds yet another unknown into the mix.
Okay, but who's using serial, anyhow? Especially when they talk about the GPU replacing "costly, high-performance computing clusters." This is the same kind of marketing BS we get from Nvidia, where they like to trumpet how much faster a CUDA version of something is than a slow, lame old implementation that no one serious would actually run.
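To put rough numbers on why a serial baseline flatters the GPU, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 100x headline figure, the core counts, the 80% parallel efficiency, and the function name `speedup_vs_parallel_cpu` are all made up by me, not taken from the article or the paper.

```python
def speedup_vs_parallel_cpu(gpu_vs_serial, cores, efficiency):
    """Rescale a GPU-vs-serial speedup to a GPU-vs-multicore-CPU speedup.

    gpu_vs_serial: headline speedup over single-threaded code (hypothetical).
    cores:         CPU cores a well-parallelized baseline would use (hypothetical).
    efficiency:    fraction of ideal scaling the CPU code achieves, 0..1 (hypothetical).
    """
    cpu_parallel_speedup = cores * efficiency  # speedup of parallel CPU over serial
    return gpu_vs_serial / cpu_parallel_speedup


if __name__ == "__main__":
    # Illustrative only: the same "100x" claim against three CPU baselines.
    for cores in (8, 16, 64):
        fair = speedup_vs_parallel_cpu(100.0, cores, 0.8)
        print(f"{cores:3d} cores: GPU is about {fair:.1f}x faster")
```

The point is just that the headline number shrinks a lot once the baseline is a properly threaded CPU implementation rather than serial code.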
Again, this raises lots of questions, like: which backend - CPU or GPU? If CPU, what kind?
In general, OpenMP isn't usually very good. It's what you use when you've got a bunch of legacy code and you just want a quick, easy, low-risk way of speeding it up on a multi-core or GPU-accelerated machine. So, 100x probably isn't very surprising, here.
I don't know if this is a nothing burger, where someone just got the standard GPU compute-level speedups from porting their code to a GPU, or if there was any novel algorithmic breakthrough that enabled the CUDA version to be quite so much faster.
I will say it's interesting they opted to use a consumer GPU for this, because consumer GPUs have pretty lousy fp64 performance, and fp64 is usually what scientific and engineering code uses. So, one innovation might be careful management of numerical precision, in order to utilize fp32 arithmetic.
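To show what I mean by managing precision, here's a small sketch of one well-known trick: compensated (Kahan) summation, done entirely in fp32. I have no idea whether the paper does anything like this; it's only an example of the kind of care fp32 needs. The helper names (`f32`, `naive_sum_f32`, `kahan_sum_f32`) are mine, and fp32 is emulated by rounding Python's fp64 floats through `struct`, which gives correctly rounded fp32 results for a single add or subtract.

```python
import math
import struct


def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Accumulate in fp32, rounding after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def kahan_sum_f32(values):
    """Accumulate in fp32 with a running compensation term (Kahan summation)."""
    total = 0.0
    comp = 0.0  # estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)          # apply the correction to the next term
        t = f32(total + y)              # this add drops low-order bits of y
        comp = f32(f32(t - total) - y)  # recover what was just dropped
        total = t
    return total


if __name__ == "__main__":
    values = [0.1] * 300_000
    reference = math.fsum(f32(v) for v in values)  # fp64 sum of the fp32 inputs
    print("naive fp32 error:", abs(naive_sum_f32(values) - reference))
    print("kahan fp32 error:", abs(kahan_sum_f32(values) - reference))
```

The naive loop drifts because every addition rounds the same way once the running total is large, while the compensated version stays within about one fp32 rounding step of the fp64 reference.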
If anyone has a link to the paper, I'd be interested in having a glance at it. During a brief search for it, I found this paper from 2017, claiming a speedup of 12 to 100x relative to sequential code. The link is just to an abstract, so I don't know what kind of GPU they used (but probably a fair bit slower than an RTX 4070):
Accelerating Peridynamics Program Using GPU with CUDA and OpenACC
J.X. Li, J.M. Zhao, F. Xu, and Y.J. Liu
- Institute for Computational Mechanics and Its Applications (NPUiCMA), Northwestern Polytechnical University, Xi’an, 710072, P. R. China.
- Mechanical Engineering, University of Cincinnati, Cincinnati, Ohio, 45221-0072, 210072, USA.
BTW, OpenACC is a close relative of OpenMP: a directive-based standard in the same style.