Linus Torvalds: Linux Scheduler Not To Blame For Google Stadia Port Issues

Nice Torvalds quotes.

I read his posts and agree with this summary. It's kinda sad to see someone trying so hard to optimize a fundamentally wrong-headed approach, as the original developer did.

Anyone who has a clue what a mutex or a spinlock is would probably benefit from reading Torvalds' posts:
  1. https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723
  2. https://www.realworldtech.com/forum/?threadid=189711&curpostid=189755
  3. https://www.realworldtech.com/forum/?threadid=189711&curpostid=189759
 
Why didn't these issues crop up on all the other platforms?
Different schedulers are optimized for different use cases. Linus makes this point several times. His posts are really worth reading, if you can follow the arguments.

He supposes that the developers originally did some profiling and tuned their code for their main platform (i.e. either Windows or games consoles). If you think about it, games consoles' schedulers were probably modeled on the Windows schedulers of the day, in order to smooth over these kinds of issues.

And the only other analogous cases would be Mac and mobile ports. But Mac gaming is pretty niche, and mobile ports probably have even bigger issues to deal with.
 
He supposes that the developers originally did some profiling and tuned their code for their main platform (i.e. either Windows or games consoles). If you think about it, games consoles' schedulers were probably modeled on the Windows schedulers of the day, in order to smooth over these kinds of issues.
I'm pretty sure PS4's stock OS is FreeBSD based, and it seems like the Rage 2 codebase was running on fairly diverse hardware and software combinations without issue. I'm not saying these issues aren't caused or exacerbated by the game code, but rather that I agree with your "Different schedulers are optimized for different use cases" statement. MS has tuned and retuned their scheduler multiple times in recent history, for example. I don't think the Linux scheduler is perfect in all cases, and the attitude of "you're coding it wrong" makes me chuckle a bit when other schedulers aren't having any such issues with the same implementation. Maybe the truth lies somewhere in the middle... the code is flawed (well, what complex codebase isn't, actually?), but maybe Linux could handle it better.
 
I'm pretty sure PS4's stock OS is FreeBSD based,
That says nothing about its scheduler, though, which they certainly at least tweaked and possibly completely replaced.

I'd certainly expect that a scheduler designed for general-purpose computing wouldn't be ideal for latency-sensitive or soft-realtime workloads. There's a basic tradeoff, which Linus repeatedly touches upon, between latency and throughput. For desktop computing, you'd probably try to balance the two. For server or HPC workloads, you'd typically want throughput-oriented behavior. The other extreme is hard-realtime systems, where you give up significant throughput and efficiency in order to get low, deterministic response times.

seems like the Rage 2 codebase was running on fairly diverse hardware and software combinations without issue.
It might not use any userspace spinlocks. While I get the impression they're not uncommon among games, they have various downsides and aren't exactly a "best practice". They're best characterized as a fragile optimization that sometimes isn't.
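For anyone who hasn't seen one, here's a minimal sketch of the kind of userspace spinlock being discussed. It's my own illustration in C++, not code from Rage 2 or from Linus' posts, and the names `SpinLock` and `count_with_spinlock` are hypothetical. The thing to notice is that a waiting thread never tells the kernel what it's waiting for: it just loops and calls `std::this_thread::yield()`, hoping the scheduler happens to run the lock holder next.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal userspace spinlock, for illustration only.
// Waiters busy-loop in userspace, so the kernel has no idea which thread
// they're waiting on and can't prioritize the thread holding the lock.
class SpinLock {
public:
    void lock() {
        // exchange() returns the previous value: true means someone else holds it.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Yield and hope the scheduler runs the lock holder instead of us.
            // Whether this helps depends entirely on the scheduler's behavior.
            std::this_thread::yield();
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Increments a shared counter from several threads, guarded by the spinlock.
long count_with_spinlock(int threads, int iters_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters_per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

It produces the right answer, but how fast it gets there when there are more runnable threads than cores depends on what `yield()` happens to do on a given OS, which is exactly the kind of assumption that breaks when you port.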

MS has tuned and retuned their scheduler multiple times in recent history, for example. I don't think the Linux scheduler is perfect in all cases,
I don't know the history of Linux' default thread scheduler, but Linus implied that it's been the subject of much tweaking and experimentation over the years. I'd be surprised if that weren't the case, considering how much both workloads and computer architectures have evolved.

the attitude of "you're coding it wrong" makes me chuckle a bit when other schedulers aren't having any such issues with the same implementation.
You can chuckle all you want, but if you really dig into the argument, I think that's the only logical conclusion. I won't try to summarize his argument; you can read it if you care. But I'd caution you against taking a position without understanding the case for and against it.

In my previous post, I suggested that games tuned their spinlocks against the Windows scheduler's behavior, and that consoles probably didn't deviate much from it, for the sake of portability. I think that adequately explains why "other schedulers aren't having any such issues with the same implementation". That said, we don't really know whether, or how much, locking typically changes in ports to and from consoles. I've heard that the original game developers typically aren't the ones who do the ports, so maybe swapping out the locks is standard practice for these porting houses. So it's also risky to generalize from so little information about the subject.

Maybe the truth lies somewhere in the middle...
That's a common response to complex issues, but I doubt it holds true any more often than in simpler disputes.

the code is flawed (well, what complex codebase isn't actually), but maybe Linux could handle it better.
No, I disagree with that. The userspace code is simply operating on assumptions about kernel scheduling behavior that it has no business making, and it isn't actually communicating with the kernel to let it know how it wants to be scheduled. When you're waiting on something, you should tell the kernel what you're waiting for. The fundamental problem with userspace spinlocks is that they don't do this.
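To make "tell the kernel what you're waiting for" concrete, here's a rough sketch of the blocking alternative, again my own illustration with a hypothetical `Mailbox` type rather than anything from the game or from Linus. The consumer sleeps on a `std::condition_variable` and the producer wakes it explicitly. On Linux, glibc builds its mutexes and condition variables on futexes, so the sleeping thread is blocked in the kernel on a specific object and gets woken when that object is signaled, instead of the scheduler having to guess who should run.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <thread>

// Hypothetical one-slot mailbox: the consumer blocks until a value is posted.
// The waiter sleeps instead of spinning and is woken only when there's work.
class Mailbox {
public:
    void post(int value) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            value_ = value;
        }
        ready_.notify_one();  // wake the thread waiting on this specific event
    }

    int take() {
        std::unique_lock<std::mutex> guard(mutex_);
        // The predicate handles spurious wakeups and values posted before we waited.
        ready_.wait(guard, [this] { return value_.has_value(); });
        int v = *value_;
        value_.reset();
        return v;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::optional<int> value_;
};

// Posts a value from a second thread and receives it on the calling thread.
int round_trip(int value) {
    Mailbox box;
    std::thread producer([&] { box.post(value); });
    int received = box.take();
    producer.join();
    return received;
}
```

The design difference is the whole point: here the wait and the wakeup both go through objects the kernel knows about, so it doesn't matter much how a particular scheduler decides to order runnable threads.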

But here I go, doing what I said I wouldn't. Just read Linus' posts. They're longish, but to the point and pretty straightforward.
 