If an algorithm splits a problem into 16 equal chunks for multi-threading, your ultimate speedup ends up limited by the thread spending the most time on a slow core.
But that's basically what I mean by "poor form". Normally, if you had 16 worker threads (which should be <= the number of hardware threads), you'd probably split the work into more like 64 or even 256 chunks. Often, the time needed to complete each chunk is variable, even in isolation from all other factors.
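Here's a rough sketch of what I mean, in C with pthreads (the thread/chunk counts and the process_chunk stub are just placeholders, not anything from a real codebase): workers pull chunks off a shared atomic counter, so a thread stuck on a slow core simply ends up doing fewer chunks instead of dragging out the whole job.

```c
/* Hypothetical sketch: 16 workers dynamically pulling from 256 chunks. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_THREADS 16
#define NUM_CHUNKS  256

static atomic_int next_chunk = 0;

static void process_chunk(int chunk)
{
    /* placeholder for the real per-chunk work */
    (void)chunk;
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* grab the next unclaimed chunk; faster threads naturally take more */
        int chunk = atomic_fetch_add(&next_chunk, 1);
        if (chunk >= NUM_CHUNKS)
            break;
        process_chunk(chunk);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("all %d chunks done\n", NUM_CHUNKS);
    return 0;
}
```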
scheduling rotates cores unless core affinity was assigned and the overall impact is still fairly even,
Not to my knowledge. There's no reason it should, either. The more often you context-switch on a core, the worse your L2 hit rate gets. So, a compute-bound job usually stays on the same core for consecutive timeslices.
What the OS should be trying to even out is the amount of execution time each thread gets! And it doesn't need to move a job from one core to another in order to make that happen. If a job has been using more than its share of time, simply don't run it for some number of timeslices. Maybe the next time it's run, it gets assigned to the same core, or maybe a different one - by then, the L2 cache contents have probably been replaced, so it doesn't much matter (as long as you keep it on the same NUMA node). But it's no good just moving it from one core to the next if there's no other reason to suspend it. That doesn't help anyone.
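You can observe this on Linux with a quick (hypothetical) test: a busy loop that polls sched_getcpu() and reports whenever it lands on a different core. With no affinity set, a compute-bound thread will typically sit on the same core for long stretches; the loop bound and check interval below are arbitrary.

```c
/* Linux-only sketch: count how often a compute-bound loop migrates cores. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    int last_cpu = -1;
    unsigned long migrations = 0;

    for (unsigned long i = 0; i < 2000000000UL; i++) {
        if ((i & 0xFFFFFF) == 0) {      /* check occasionally, not every iteration */
            int cpu = sched_getcpu();
            if (cpu != last_cpu) {
                if (last_cpu != -1)
                    migrations++;
                printf("iteration %lu: now on CPU %d\n", i, cpu);
                last_cpu = cpu;
            }
        }
    }
    printf("observed %lu migrations\n", migrations);
    return 0;
}
```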
As for how the OS can tell what should and shouldn't be scheduled on low-speed cores, I'd say the thread priority API already got some chunk of that covered:
Most application code doesn't tweak priorities, in my experience. The risk of introducing a priority-inversion bug or related performance problems far outweighs the potential upsides. AFAIK, thread priorities are really the province of realtime embedded code, running on a proper RTOS.
Now, what happens at the process level is a different story. Something like a virus scanner doing a full scan will frequently run at a lower priority. On Linux, people frequently use 'nice' to run similar background jobs at a low priority, and I've even used it to keep long-running simulations from compromising system responsiveness.
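For completeness, here's roughly what that looks like from inside the program on Linux, using setpriority() rather than the nice command (the niceness value of 10 and the placeholder work are arbitrary choices for illustration):

```c
/* Minimal sketch of a long-running job lowering its own priority,
 * roughly equivalent to launching it with 'nice -n 10'. */
#include <sys/resource.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    /* Raise our own niceness (lower priority) so interactive tasks win. */
    if (setpriority(PRIO_PROCESS, 0, 10) != 0) {
        fprintf(stderr, "setpriority failed: %s\n", strerror(errno));
        return 1;
    }

    printf("running at niceness %d\n", getpriority(PRIO_PROCESS, 0));

    /* ... long-running, CPU-bound simulation would go here ... */
    return 0;
}
```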