Info Examining Per Thread CPU usage in games

Alternative Title: Why single core performance still matters

A question popped into my head recently: does Windows have a tool or utility for monitoring the CPU time of an application's threads? The main reason for asking was to find out whether there's a way to see trends in how games utilize their threads. Some of us here claim that, despite consumers having access to 8+ cores at a relatively affordable price, games still rely on a small number of threads, so all those cores don't really matter and what matters is single core performance. In other words, games haven't been "properly" multithreaded. Of course, rather than just make that claim, why not put my money where my mouth is?

One concern was that this sort of deep dive tends to be limited to development environments, but Windows does keep tabs on threads, so maybe there was hope. And there was! Process Explorer can show CPU time per thread if you double-click the application and select the "Threads" tab. The metric that matters, though, is "Cycles Delta", which isn't enabled by default. Why is this important? Because it tells you how many cycles the thread spent on the CPU since the last report. If this value floats around a specific range, it indicates how busy the thread tends to be over time. There's also another way through Performance Monitor, but setting that up is a pain.
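To make the idea of a "delta" concrete, here's a minimal Python sketch of how such a value could be derived from two cumulative readings. This is only an illustration: the thread IDs and cycle counts are made up, and it says nothing about how Process Explorer actually gathers its numbers.

```python
def cycles_delta(previous, current):
    """Return per-thread cycles consumed between two cumulative snapshots.

    Both arguments map a thread ID to its cumulative cycle count.
    Threads that first appear in `current` are treated as starting from zero.
    """
    return {tid: total - previous.get(tid, 0) for tid, total in current.items()}


# Hypothetical cumulative counters, sampled one refresh interval apart.
snapshot_a = {1001: 5_000_000_000, 1002: 800_000_000, 1003: 120_000_000}
snapshot_b = {1001: 8_900_000_000, 1002: 1_000_000_000, 1003: 120_000_000}

deltas = cycles_delta(snapshot_a, snapshot_b)

# List the busiest threads first, like sorting the column in a thread view.
for tid, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"thread {tid}: {delta:,} cycles since last sample")
```

A thread whose delta stays at zero across samples (1003 here) exists but never got on the CPU during that interval.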

So I ran a few games to see what their per-thread usage looks like. For most of them I just let the game sit there, so this isn't indicative of, say, when things get busy, but I think it's useful as a baseline at least. The games I ran:
  • Black Mesa Source: I just wanted a Source Engine game and this was installed
  • Cities Skylines: A popular simulation game, so it'd be interesting to see its per-thread usage
  • Call of Duty: Modern Warfare 2 (2022). Spent a few minutes in one of the Single Player levels
  • Cyberpunk 2077: Just to throw an open world modern game in there
  • F1 2019: Another simulation game
  • Final Fantasy XIV: An MMORPG
  • Outer Worlds: I wanted an Unreal Engine 4 game, and this is one I happened to have installed
  • Quake 2 RTX: It's built off the original Quake 2 game, but otherwise this is to see what a really old game looks like
  • Resident Evil 4 Chainsaw Demo: Also just wanted to throw in some semblance of a modern game
  • Stray: Also another Unreal Engine 4 game
wYuoVLp.png


This is going to be a common theme: despite a ton of threads being spawned, a good number of them sit there doing nothing. In any case, you can see that three threads dominate the CPU time in Black Mesa. The most I can gather about "tier0.dll" is that it's tied to Source games; given how Source games actually work, I'm guessing it's a server that runs the game logic. The other two represent the main game executable and the NVIDIA driver.


hwiyVQj.png


Others have claimed that Cities Skylines relies heavily on a single thread to do everything, and this pretty much proves it. The map I used was https://steamcommunity.com/workshop/filedetails/?id=785933283 , which is a highly populated map. As a point of comparison, here's the usage from a map that barely has anyone in it.
H58DxDZ.png


qyMSF6k.png

For Call of Duty: Modern Warfare 2, about five threads dominate the CPU time, with one of them showing higher usage than the rest. Unfortunately, the "Start Address" column doesn't point to anything useful, so it's hard to tell what's what.

For Cyberpunk 2077 I wanted to try something different: run the game at RT Ultra settings at 1440p, then drop it down to Low quality settings at 720p and see the difference.

So here's the usage at 720p with low quality settings
VMaj5Ik.png


And here it is at 1440p RT Ultra quality
IQdBLbl.png


The one interesting thing to note is that 11 threads dropped in CPU time at 1440p. Given that Cyberpunk 2077 is a DX12 game, it's possible these are threads for rendering graphics.

Either way, this illustrates why lowering the resolution puts more strain on the CPU.
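A rough way to picture this, as a Python sketch: assume (and this is a simplification of my own, not something measured here) that each frame takes as long as the slower of the CPU and the GPU, and that the CPU spends a fixed number of cycles preparing each frame. All the numbers below are hypothetical.

```python
def cpu_load(cpu_ms_per_frame, gpu_ms_per_frame, cycles_per_frame):
    """Return (fps, CPU cycles spent per second) under a simple bottleneck model.

    Assumption: the frame rate is set by whichever of the CPU or GPU takes
    longer per frame, and CPU work per frame does not depend on resolution.
    """
    frame_ms = max(cpu_ms_per_frame, gpu_ms_per_frame)
    fps = 1000.0 / frame_ms
    return fps, fps * cycles_per_frame


# Hypothetical figures: the CPU needs 8 ms and 30 million cycles per frame.
fps_high, cycles_high = cpu_load(8.0, 20.0, 30_000_000)  # GPU-heavy, e.g. 1440p
fps_low, cycles_low = cpu_load(8.0, 5.0, 30_000_000)     # GPU-light, e.g. 720p

print(f"GPU-bound: {fps_high:.0f} FPS, {cycles_high:,.0f} CPU cycles/s")
print(f"CPU-bound: {fps_low:.0f} FPS, {cycles_low:,.0f} CPU cycles/s")
```

With the GPU out of the way, the same per-frame CPU work has to be done more times per second, so the per-thread cycle counts go up.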


For F1 2019, I took this while running the benchmark, which simulates a race.
DH8iALB.png

The interesting thing is that there are 8 threads running fairly evenly. This makes me think these are for the AI racers (even though there are 20 or so racers in total). I also ran this in DX11, which would explain the busy driver thread.

For Final Fantasy XIV, I took readings from several scenarios:

ruUUjWd.png

720p High Desktop quality, non-busy area

vSR4loQ.png

1440p High Desktop quality, non-busy area

NX9q2XH.png

1440p High Desktop quality with a 120FPS cap, non-busy area

2fTEXxK.png

1440p High Desktop, Limsa Lominsa Aetheryte plaza (one of the busiest areas in game)

Strangely enough, the game's main thread increases in activity at lower resolution while the driver activity decreases. Either way, these two dominate the CPU usage. The third one is likely the network handler or something related to it since the main thread didn't jump up.

NteEVX0.png

Outer Worlds is where things get interesting. Unreal Engine 4 games (Stray, the other one I tried, also does this) seem to spawn a ton of threads that do a lot of work. However, it's also clear that one thread still dominates the entire game.

p0k6GQk.png


In Quake 2 RTX, it's funny that the driver thread completely overtakes the main game thread. I figured this was going to happen, but I thought it'd be interesting to throw it in anyway.

kwy5w4J.png


Two threads still dominate the Resident Evil 4 demo, but quite a handful of others get a non-trivial amount of work.

tEXErG8.png


I ran Stray with the -dx12 option on. Similar to Outer Worlds, a ton of threads got spawned and they seem to be doing something. But in the end, two threads dominate the game.

Link to the album: https://imgur.com/a/e5SszIt


Conclusion
So basically, all of these games typically have one or two threads that are busy all the time, far more so than the other threads they spawn. And note that just because a game spawned other threads doesn't mean they're actually ready to run. You can't look at one of the Unreal Engine 4 games and go "look! it spawned 30+ threads, clearly this should run great on a Threadripper!", because those threads aren't all running at the same time. And depending on the order in which these threads run, it's conceivable that you'd get similar performance on a CPU with fewer cores/threads, because the lighter threads could run back to back in the same amount of time the heaviest thread needs.
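The back-to-back argument can be sketched in Python as a toy scheduling exercise. The workload is hypothetical (one heavy thread needing 10 ms per frame and thirty light ones needing 0.5 ms each), and the model ignores dependencies between threads, context-switch cost, and everything else a real scheduler deals with.

```python
import heapq


def makespan(thread_times_ms, cores):
    """Greedy longest-first assignment of thread run times to cores.

    Each thread goes to the currently least-loaded core; the result is the
    time at which the busiest core finishes.
    """
    loads = [0.0] * cores
    heapq.heapify(loads)
    for run_time in sorted(thread_times_ms, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + run_time)
    return max(loads)


# Hypothetical frame: one dominant thread plus thirty small worker threads.
workload = [10.0] + [0.5] * 30

for cores in (2, 4, 8, 32):
    print(f"{cores:>2} cores: frame takes {makespan(workload, cores)} ms")
```

In this toy case, 4 cores already finish the frame in the 10 ms the heavy thread needs, and going to 8 or 32 cores changes nothing. Only a faster core shortens that 10 ms.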

In any case, this is why single threaded performance in games is still important. Games are still designed in a way that there's not a whole lot of work to do at once, and things tend to be shoved into a single thread.
 
If you look at delta cycles, then only Cities utilises 4GHz of your single core CPU; the rest of the games have headroom.
That's not how it works.

If I estimate the total cycle count for Cities Skylines, it comes to about 15,561,014,000 cycles. For, say, Cyberpunk 2077 at 1440p, it's 22,994,165,000 cycles. If a thread has a "Cycles Delta" value, it ran during the last sampling period. So all of those threads are going to run anyway.
 
I think he means that you don't max out your single thread speed so it also doesn't matter since you are basically wasting it.
Games are coded for consoles that are much weaker than high end CPUs.

Cycles alone is a bad metric anyway since in one cycle the workload might be very low and in another very high.
Many cycles at low workload could still produce less work than fewer cycles with higher workload.
You can see that in this pic 12584 has a super high cycles count but doesn't do anything much (main game loop?).

Intel has a little tool called pcm-core that will show you instructions per cycle for every core both individually and as IPC.
YDEzxpL.jpg
 
I think he means that you don't max out your single thread speed so it also doesn't matter since you are basically wasting it.
Keep in mind that the thread isn't running from start to finish or whatever. It's not spending, in the Cities Skylines example, 4 billion cycles on that thread, then moving on to another thread. The metric is simply how many CPU cycles that thread spent on the CPU. It provides no other context.

Cycles alone is a bad metric anyway since in one cycle the workload might be very low and in another very high.

Many cycles at low workload could still produce less work than fewer cycles with higher workload.
You can see that in this pic 12584 has a super high cycles count but doesn't do anything much (main game loop?).
I'd like you to explain to me what you're talking about. I have an idea, but I mean this explanation in a vacuum makes very little sense to someone who's actually studied CPU microarchitectures.

Anyway the point of this is solely to find out if the activity in games tends to be bunched up into a handful of threads, rather than spread out evenly. Ignore the value the thread actually got, but pay attention to the fact that for a lot of the games I did this on, it looked like at least 50% of the cycles were spent on 2 or 3 threads.
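For what it's worth, that "share of cycles" check is easy to sketch in Python. The per-thread numbers below are invented for illustration and are not the values from my screenshots.

```python
def top_share(cycles_by_thread, k):
    """Fraction of all cycles consumed by the k busiest threads."""
    counts = sorted(cycles_by_thread.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


# Hypothetical "Cycles Delta" readings keyed by thread ID.
sample = {
    101: 4_000_000_000,
    102: 2_000_000_000,
    103: 1_000_000_000,
    104: 200_000_000,
    105: 200_000_000,
    106: 200_000_000,
    107: 200_000_000,
    108: 200_000_000,
}

print(f"top 2 threads: {top_share(sample, 2):.1%} of cycles")
print(f"top 3 threads: {top_share(sample, 3):.1%} of cycles")
```

The absolute values don't matter for this; only the ratio between threads does.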

Intel has a little tool called pcm-core that will show you instructions per cycles for every core both individually and as IPC.
That doesn't help because it's per-core. I'm not interested in that. I'm interested in the per-thread breakdown of a game. Per-core metrics are polluted by whatever else is running on the system, and per-process doesn't really tell me anything either, because I'm trying to see if only a handful of threads are the busy ones. So unless that tool has a way to get a per-thread breakdown, it's useless for what I'm doing.
 
Keep in mind that the thread isn't running from start to finish or whatever. It's not spending, in the Cities Skylines example, 4 billion cycles on that thread, then moving on to another thread. The metric is simply how many CPU cycles that thread spent on the CPU. It provides no other context.
Yes, that's the main issue here.
You do show that a few threads do run more often than others but you do not show the threshold, the point at which the game starts to run slower/doesn't run any faster.
To make a point for "Why single core performance still matters" you have to show that it matters by showing how many FPS you lose by not having enough of it.
I'd like you to explain to me what you're talking about. I have an idea, but I mean this explanation in a vacuum makes very little sense to someone who's actually studied CPU microarchitectures.

Anyway the point of this is solely to find out if the activity in games tends to be bunched up into a handful of threads, rather than spread out evenly. Ignore the value the thread actually got, but pay attention to the fact that for a lot of the games I did this on, it looked like at least 50% of the cycles were spent on 2 or 3 threads.
Ever read about how HTT manages to increase performance?!
Here it shows you a CPU that can run 4 things at once on the same core
But it doesn't need to run 4 things on every line/cycle
So cycles alone is a bad metric because it doesn't show you how many instructions it runs each cycle.
https://www.analyticssteps.com/blogs/hyper-threading-technology-advantages-and-disadvantages



Bottom line:
To make a point about why single core matters, you have to show that it needs to run at every available cycle; if it runs at fewer cycles/lower clocks than available, then it matters less.

As it stands you are showing that your GPU or something else is limiting your FPS and so you would have the exact same FPS at much lower single core performance.
It's the same thing the advocates of multi-threaded do, saying you absolutely need such and such many cores and then show gameplay with their CPUs at 10% load or something...
 
Yes, that's the main issue here.
You do show that a few threads do run more often than others but you do not show the threshold, the point at which the game starts to run slower/doesn't run any faster.
To make a point for "Why single core performance still matters" you have to show that it matters by showing how many FPS you lose by not having enough of it.
And finding that relationship would only really matter for this particular CPU I'm running. Cycle count can still give you an overall picture of how busy a thread is regardless of microarchitecture, because the higher the number, the more it implies it had more work to do.

Ever read about how HTT manages to increase performance?!
Here it shows you a CPU that can run 4 things at once on the same core
But it doesn't need to run 4 things on every line/cycle
So cycles alone is a bad metric because it doesn't show you how many instructions it runs each cycle.
https://www.analyticssteps.com/blogs/hyper-threading-technology-advantages-and-disadvantages
Instructions executed per cycle isn't a useful metric either, because what gets executed on the processor at any given point in time is, for the most part, random. You can run a benchmark a million times and your "instructions per cycle" values will be vastly different for each cycle on each run. The only time you can get a consistent reading for instructions per cycle is if the application runs on one thread, is the only thing running on the entire computer, and doesn't require any input from some source, be it a user or another computer/thing/whatever.

The number of cycles the thread spent on a CPU is not a useless metric because of what the end goal of my assessment is: to figure out which threads are busy. It doesn't matter if there's no "real work" being done on the thread or not, because as long as that thread is occupying the CPU for that cycle, no other thread is running (although this is highly simplified).

Bottom line:
To make a point about why single core matters, you have to show that it needs to run at every available cycle; if it runs at fewer cycles/lower clocks than available, then it matters less.

As it stands you are showing that your GPU or something else is limiting your FPS and so you would have the exact same FPS at much lower single core performance.
It's the same thing the advocates of multi-threaded do, saying you absolutely need such and such many cores and then show gameplay with their CPUs at 10% load or something...
There's no reason for me to profile my computer at various CPU performance levels to show that single core performance strongly correlates with higher maximum FPS most of the time. Plenty of websites do this. What I was after was an explanation of why this is the case. Sure, you could point out that IPC could make the "Cycles Delta" metric pointless, but flip it around. A CPU that averages 2 IPC means that a thread that spent 4 billion cycles on the CPU in a second executed 8 billion instructions. It doesn't matter what CPU you throw at it, you have to execute those 8 billion instructions. And if these instructions aren't spread across the cores through multithreading, the maximum performance of the game (if we measure performance by FPS) is likely going to be limited by how quickly you can get this single thread processed.
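Here's that arithmetic as a small Python sketch. The 2 IPC and 4 billion cycles come from the example above; the alternative CPUs (5 GHz at the same IPC, or 4 GHz at 2.5 IPC) are hypothetical, and the sketch assumes the instruction count stays the same across CPUs and that average IPC is a meaningful number for this thread.

```python
def instructions_executed(cycles, ipc):
    """Instructions retired by a thread that spent `cycles` on the CPU."""
    return cycles * ipc


def seconds_on_one_core(instructions, ipc, clock_hz):
    """Time one core needs to retire `instructions` at a given IPC and clock."""
    return instructions / (ipc * clock_hz)


# From the example: 4 billion cycles at an average of 2 IPC.
work = instructions_executed(4_000_000_000, 2)
print(f"instructions to execute: {work:,}")

# Hypothetical CPUs that all have to run the same single thread.
baseline = seconds_on_one_core(work, 2, 4_000_000_000)        # 2 IPC @ 4 GHz
faster_clock = seconds_on_one_core(work, 2, 5_000_000_000)    # 2 IPC @ 5 GHz
higher_ipc = seconds_on_one_core(work, 2.5, 4_000_000_000)    # 2.5 IPC @ 4 GHz

print(f"baseline:     {baseline:.2f} s")
print(f"faster clock: {faster_clock:.2f} s")
print(f"higher IPC:   {higher_ipc:.2f} s")
```

Adding cores doesn't appear anywhere in that formula, which is the point: only clock speed or IPC shortens the time this one thread takes.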
 
And finding that relationship would only really matter for this particular CPU I'm running. Cycle count can still give you an overall picture of how busy a thread is regardless of microarchitecture, because the higher the number, the more it implies it had more work to do.
But the only thing you are showing with this is that you do need individual cores...and I don't think anybody ever argued that you don't.

Single core performance matters...in that you do need single (individual) cores...

Yes, there are cores that run threads and some are more active...that's the only thing you are showing.
Everybody knows that no game is perfectly multithreaded, and the last time we had a perfectly single-threaded game was in DOS times.

If you want to show that single core matters, find a really well multithreaded game and run it at high all-core clocks compared to low all-core clocks, which will show that single core performance still increases FPS.
But then again that game on more cores at low clocks will also run faster compared to fewer cores at the same clocks.