I think this is more a case of how the NT scheduler is handling thread assignments. I noticed earlier that when running one heavy thread (Single Thread CB), NT wouldn't keep it on one core and kept shifting it around. Instead of one core at 100% I had four cores at 25%. This is really bad for two reasons.

First, boost technology activates based on individual core thermal loading; by constantly moving a thread around, the scheduler prevents it from ever kicking in. When I force a program onto one core it immediately boosts to 2.6GHz and stays there, vs constantly shifting from 800MHz to 1.9 to 2.6 as the thread gets bounced all over the place.

The second reason is how CPUs execute code. A CPU can't really do multiple instruction streams at once; it executes one stream at a time (per logical processor), and to run a different stream it must first save the processor state and use sliding register windows / register rename files to give the new code stream a clean set of registers. This is known as a state change (a context switch): all of the processor's state info is saved off to cache and either a blank state is loaded or a previously saved state is fetched and restored. This takes time and stalls the CPU. Constantly moving a thread across multiple cores causes a ton of unnecessary state changes and needless stalling.

Finally, L2 cache is dedicated to each core (or shared across two cores with BD). Cache contents are context sensitive to the code being executed, so moving a thread from one core to a different core leaves its working set stranded in the old core's L2, causing an immense number of misses and reloads.
Core 0 => executing Thread A, OS priority interrupt, Task Switch (code moved to Core 3)
Core 3 => executing Thread A, Thread A's data was in Core 0's L2 cache, not available in Core 3's L2 cache, need to fetch it and invalidate Core 0's L2 cache.
Now multiply this across four or eight threads being constantly moved around to different cores, and it would be a nightmare for the caching system. This is where Intel's approach of an inclusive L3 makes the most sense; they practically designed the CPU for poorly-scheduling OSes. Core 3 would be able to reload the L3 copy of Core 0's L2 cache upon the thread being task switched, vs Core 3 having to issue an interrupt to Core 0 to read its L2 cache (if allowed) or having to (heaven forbid) go to main memory to fetch the data.