The L1D cache is far too small and needs more associativity, at least according to various opinions I've seen. A fast L1 is only good if it's actually accomplishing something. From what I've read here, there's a lot of cache writing going on. However, fixing Bulldozer this way looks like it would require a complete redesign.
For a second, let's say that FC3 is limited to four cores (logical or otherwise). HT would be a major disadvantage in that case, but it seems the Intel CPUs are able to hit the GPU limit anyway. Judging by the i3-2100's performance, overclocking the 3960X here is meaningless. Dropping to low details, or overclocking the graphics card, should alleviate the GPU bottleneck (you'd hope so, considering this is a 7970). We could then see a situation where the i3 results only improve a little and the i5 takes the lead, thanks to its four physical cores against the i7's HT setup. Disabling HT on the i7 would make more sense, especially as the software can see 12 logical cores but will only work with four, HT or no HT. With HT left on, even heavy overclocking wouldn't bring the 3960X on par with the i5-3550. So the 3550 and an HT-less 3960X would be far in front of the i3, which barely leads the 8350.
Any chance of a low details test please, Toms?