You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows that this is the way Barcelona handles it.
It only has to keep the L3 coherent so that's only 4 caches.
In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.
There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.
The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.
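For reference, the cache counts being argued over can be written out as simple arithmetic. This is only a rough Python sketch that restates the numbers claimed in this thread (4 sockets, 4 cores each); the function name and dictionary keys are made up for illustration, and nothing here is a measured or vendor-documented figure.

```python
def coherent_cache_counts(sockets=4, cores_per_socket=4):
    """Count the caches that must be kept in sync under each claim in the thread."""
    cores = sockets * cores_per_socket  # 16 cores in a 4S quad-core box

    return {
        # Claim: every core's L1D and L2 hold different data, so each counts.
        "k10_private_l1d_and_l2": cores * 2,
        # Claim: only the one L3 per socket has to be kept coherent.
        "k10_l3_only": sockets * 1,
        # Claim: Clovertown's L1D is included in L2, and one L2 serves two cores.
        "clovertown_shared_l2": cores // 2,
    }


if __name__ == "__main__":
    for model, count in coherent_cache_counts().items():
        print(f"{model}: {count}")
```

The whole disagreement is about which of the first two rows applies to K10; the third row is just the Clovertown figure quoted above.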
I don't think there's anything about cache here, since rendering as a process is more or less a streaming job, where cache is more or less cut out and only FP efficiency and, to some extent, branching count, especially in the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon64 and a 2.0GHz / 128K L2; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8X:
This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.
Again you should look at some of the Barcelona analysis articles. The way it works is AMD uses a MOESI protocol to determine if a query is needed. In K8 all cores were always queried, but Barcelona has a new method where if a cache line is not marked shared then the query doesn't happen.
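To make the "query only if shared" idea concrete, here is a minimal Python sketch of generic MOESI behaviour on a write. It is not taken from any Barcelona documentation: the names `State`, `needs_probe_always` and `needs_probe_filtered` are hypothetical. Treating Owned like Shared, and treating Invalid as a miss that still has to query, are my assumptions based on textbook MOESI rather than anything stated in the articles.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # only copy, dirty
    OWNED = "O"      # dirty, but other caches may hold copies
    EXCLUSIVE = "E"  # only copy, clean
    SHARED = "S"     # other caches may hold copies
    INVALID = "I"    # not present locally


def needs_probe_always(state):
    """K8-style behaviour as described above: every write queries the other cores."""
    return True


def needs_probe_filtered(state):
    """Filtered behaviour: skip the query when this cache holds the only copy.

    Assumption: M and E mean no other cache has the line, so a write can
    proceed silently. S and O may have remote copies to invalidate, and I is
    a miss that has to go and find the data.
    """
    return state not in (State.MODIFIED, State.EXCLUSIVE)


def count_probes(states, policy):
    """Number of writes in `states` that would trigger a query under `policy`."""
    return sum(1 for s in states if policy(s))


if __name__ == "__main__":
    writes = [State.MODIFIED, State.EXCLUSIVE, State.SHARED, State.OWNED, State.INVALID]
    print("always probe:", count_probes(writes, needs_probe_always))
    print("filtered:    ", count_probes(writes, needs_probe_filtered))
```

With mostly independent threads, most lines would sit in M or E, which is why the filtered policy would cut the query traffic so much, if the analysis is right.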
Even then a 4P Opteron has seriously low NUMA latency. They also don't say if the Barcelona is using DDR2-800.
I guess the difference between what I think and what others think is that most of us can't have it faster than C2Q because you all talk too much (include yourself if you want).
I believe that AMD has Itanium, Alpha, and K7 engrs/designers so they will do what they set out to. Beat the crap out of OPTERON.