This is a good question and I'm not aware of anyone having done that analysis. I think Michael (the author at Phoronix) knows which tests tend to scale well.
An interesting distinction might also be whether they scale well to multiple physical CPUs. That subset is easy to eyeball: just look for tests where a 2P configuration scores about double what a 1P setup does (or about half, for lower-is-better metrics like runtime). Among these, I see:
- Algebraic Multi-Grid Benchmark 1.2
- Xcompact3d Incompact3d 2021-03-11 - Input: input.i3d 193 Cells Per Direction
- LULESH 2.0.3
- Xmrig 6.18.1 - Variant: Monero - Hash Count: 1M
- John The Ripper 2023.03.14 - Test: bcrypt
- Primesieve 8.0 - Length: 1e13
- Helsing 1.0-beta - Digit Range: 14 digit
- Stress-NG 0.16.04 - Test: Matrix Math
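The eyeballing heuristic above is easy to automate. Here's a minimal sketch in Python; the test names and scores are made up for illustration, not real PTS results:

```python
# Flag tests that scale to a second socket by comparing 2P vs. 1P scores.
# All data below is hypothetical, just to show the ratio check.
results = {
    # test name: (1P score, 2P score, higher_is_better)
    "xmrig/monero-1m":  (9500.0, 18800.0, True),   # hash rate: higher is better
    "primesieve/1e13":  (210.0,  108.0,   False),  # seconds: lower is better
    "some-serial-test": (60.0,   58.0,    False),  # barely benefits from 2P
}

def scaling_factor(one_p, two_p, higher_is_better):
    """Effective 2P speed-up over 1P, regardless of metric direction."""
    return two_p / one_p if higher_is_better else one_p / two_p

def scales_to_2p(one_p, two_p, higher_is_better, tolerance=0.15):
    """True if the 2P result is within `tolerance` of a perfect 2x speed-up."""
    return abs(scaling_factor(one_p, two_p, higher_is_better) - 2.0) <= 2.0 * tolerance

for name, (p1, p2, hib) in results.items():
    print(f"{name}: {scaling_factor(p1, p2, hib):.2f}x  scales={scales_to_2p(p1, p2, hib)}")
```

With real data you'd feed in paired 1P/2P results exported from PTS (e.g. from its XML result files) instead of the hard-coded dict.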
I think categorizing the benchmarks in the Phoronix Test Suite (which allegedly takes a couple of months to run in its entirety!) by attributes like multi-processor scalability, sensitivity to memory bottlenecks, intensity of disk I/O, etc. would be fertile ground for someone to tackle. It would also let you run just the sub-category that stresses the aspect you care about.
It would also be interesting to do some clustering analysis of the benchmarks in PTS, so that you could skip those that tend to be highly correlated and run just the minimal subset needed to fully characterize a system.
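One simple way to approach that clustering idea: treat each benchmark as a vector of normalized scores across many systems, then greedily drop any benchmark that is highly correlated with one already kept. A sketch with synthetic data (the benchmark names and the 0.95 threshold are arbitrary choices, not anything PTS defines):

```python
import numpy as np

# Synthetic (benchmarks x systems) score matrix: "cpu-b" is deliberately
# constructed as a near-duplicate of "cpu-a", so it should get dropped.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 20))  # three underlying "traits" over 20 systems
scores = np.vstack([
    base[0],
    base[0] * 1.1 + 0.01 * rng.normal(size=20),  # near-copy of the first row
    base[1],
    base[2],
])
names = ["cpu-a", "cpu-b", "mem-a", "io-a"]

def minimal_subset(scores, names, threshold=0.95):
    """Keep only benchmarks whose |correlation| with every kept one is below threshold."""
    corr = np.corrcoef(scores)
    kept = []
    for i in range(len(names)):
        if all(abs(corr[i, j]) < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

print(minimal_subset(scores, names))
```

A real version would use actual cross-system results (OpenBenchmarking.org has a lot of them) and something sturdier than a greedy pass, e.g. hierarchical clustering on the correlation matrix, but the principle is the same.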
Something else that tends to come up is how heavily a benchmark uses vector instructions, and whether it contains hand-optimized code paths for certain architectures.
As PTS is an open-source project, it might be possible to add some of these things (or at least enough logging that such metrics could be computed), but I have no idea how open Michael might be to having such contributions upstreamed.