In what circumstances would a computer with a larger L2 cache execute a program faster than a computer with a smaller L2 cache and why?
All else being the same, a larger L2 cache is preferable. However, all else is rarely ever the same.
Cache architecture has many tradeoffs. In particular, there is a relationship between cache size, cache associativity, cache access time, and cache power consumption.
L1 caches are very small, which keeps the access time low. They are also highly associative, which ensures that the most relevant data stays in the comparatively small cache and that the least relevant data is preferentially evicted. However, greater associativity increases logic complexity and power consumption.
Holding associativity the same, increasing the size of the L1 cache may necessitate increasing the access time. This adds a stage to the minimum instruction pipeline depth (two if both the instruction and data caches are enlarged), which increases the complexity of the core frontend and increases the penalty incurred on an incorrect branch prediction. Intel's Pentium 4 microarchitecture had a very deep pipeline (20 stages for Willamette and Northwood, 30+ for Prescott), and this is often cited as a reason for its poor performance. Increasing the size of the L1 cache also reduces the cache miss rate, but this benefit may not be sufficient to outweigh the detriment to the pipeline. A 64KiB L1 cache split into instruction and data halves (32KiB each) with a three or four cycle access time seems to be the sweet spot as far as Intel is concerned, as they've used this layout for many years.
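To put rough numbers on the branch misprediction point, here is a minimal sketch using the textbook approximation CPI = base CPI + (fraction of branches) x (misprediction rate) x (flush penalty). The 20% branch fraction, the 5% misprediction rate, and the idea that the penalty is roughly the pipeline depth are all assumptions chosen for illustration, not measurements of any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Cycles per instruction once branch misprediction stalls are included.

    flush_penalty is the number of cycles lost per mispredicted branch;
    here it is assumed to be about equal to the pipeline depth.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Assumed workload: 20% of instructions are branches, 5% of those mispredict.
for depth in (14, 20, 31):
    cpi = effective_cpi(1.0, 0.20, 0.05, depth)
    print(f"{depth}-stage pipeline: CPI ~ {cpi:.2f}")
```

Under these assumptions every extra pipeline stage costs the same small amount per instruction, so a slower L1 has to buy enough extra hits to pay that back.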
The same logic applies to the L2 and L3 caches. Each higher level cache is usually larger and less associative than the cache below it. Reducing the cache's internal storage and comparison logic leaves more room for actual data and lowers power consumption without having a drastic impact on the cache lookup time. Whereas the L1 instruction cache is accessed on every cycle that is not stalled due to a structural hazard, the L2 cache is only accessed when data needs to be loaded into the L1 cache. The L1 cache has an access time of three or four cycles, but the L2 cache usually has an access time of ten or eleven cycles. The L3 cache access time is around thirty cycles.
Intel seems to feel that the sweet spot for L2 cache is 256KiB per core, and they've used this for over five years. Increasing the L2 cache from 256KiB to, say, 1MiB may increase the latency from ten cycles to fifteen cycles while increasing the L2 hit rate from, say, 85% to 90%. The L2 miss rate falls from 15% to 10%, which reduces L3 lookups by a third (if an L3 cache exists), and by any measure that is a pretty respectable improvement. Indeed, this design was used in the Core 2 microarchitecture, which had a large L2 cache and no L3 cache. However, a modern dynamically scheduled microprocessor can easily mask the latency of an L1 cache miss by setting the missed instruction aside and continuing on, reordering everything at the end.
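The tradeoff between L2 latency and L2 hit rate can be sketched with the usual average memory access time (AMAT) formula. The L1, L2, and L3 latencies and the two L2 hit rates come from the figures above; the 95% L1 hit rate, the L3 hit rates, and the 200-cycle memory latency are assumptions made up for illustration. Treating each level's latency as an additive penalty, and ignoring the latency hiding that out-of-order execution provides, is also a simplification.

```python
def amat(l1_time, l1_hit, l2_time, l2_hit, l3_time, l3_hit, mem_time):
    """Average memory access time in cycles for a three-level hierarchy."""
    # Cost of an access that misses L2: L3 lookup plus memory on an L3 miss.
    beyond_l2 = l3_time + (1 - l3_hit) * mem_time
    # Cost of an access that misses L1: L2 lookup plus the cost beyond L2.
    beyond_l1 = l2_time + (1 - l2_hit) * beyond_l2
    return l1_time + (1 - l1_hit) * beyond_l1


# Assumed values: 95% L1 hit rate, 200-cycle main memory.
for l3_hit in (0.5, 0.8):
    small = amat(4, 0.95, 10, 0.85, 30, l3_hit, 200)  # 256KiB-style L2
    large = amat(4, 0.95, 15, 0.90, 30, l3_hit, 200)  # 1MiB-style L2
    print(f"L3 hit rate {l3_hit:.0%}: small L2 {small:.3f}, large L2 {large:.3f}")
```

With these numbers the larger L2 only comes out ahead when an L2 miss costs more than about 100 cycles on average, which is the point where five extra cycles of L2 latency are paid back by five percentage points fewer misses. A good L3 behind it pushes the balance toward the smaller, faster L2.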
In many cases, an overly large cache mostly just increases power consumption; a smaller, more associative cache works better for dynamically scheduled machines. On the other hand, statically scheduled machines such as Intel's Itanium microprocessors need massive caches because they cannot reorder around a miss, so cache misses are much more costly.