Ags1

Honorable
Apr 26, 2012
255
0
10,790
I've noticed that in the past few years CPU cache size has started increasing more rapidly. Does anyone know if there is a theoretical limit to the amount of cache available to the CPU? I would imagine that memory is bound to the physical number of transistors on the chip (double the cache = double the number of required transistors) so I guess that Intel/AMD cannot simply continue increasing the cache size.

On the other hand, if a core can be sacrificed in favor of an integrated GPU, why not sacrifice a core for a massive cache instead?
 

monu_08

Distinguished
May 26, 2011
1,060
0
19,310
Because competition is getting tough between AMD and Intel, just like ATI and Nvidia. So far the best gaming processors are the AMD FX-8150, the i5-2500K and the i7-2600K.
 

xtreme5

Distinguished
Cache doesn't really matter that much in the dual-core series, which have 2MB or 4MB of cache, but that is a decent amount for those CPUs.
Cache is just like RAM, but cache is indeed much faster memory than RAM. Manufacturers use less cache in newer CPUs; if they used more it would just be a waste. An example: the Core 2 Quad Q9650 has 12MB of cache, whereas the i3-2100 uses only 3MB of cache but has 100MHz more clock speed. If we compare the two, the Q9650 is beaten by the i3-2100 in some benchmarks, even with such a huge difference: the 3MB part beats the 12MB part in some cases.
www.anandtech.com/bench/Product/289?vs=49

On the other hand, the Q9650 has 4 cores whereas the i3 uses only 2 cores.
 

monu_08

Distinguished
May 26, 2011
1,060
0
19,310
It's because the i3 has better technology: it's a 32nm Sandy Bridge chip, which makes it faster, while the Core 2 Quad is a 45nm chip with DDR2 support. The i3 starts from DDR3, so it performs well.
 

mathew7

Distinguished
Jun 3, 2011
295
0
18,860
Keep in mind the following guidelines:
- cache size and latency are proportional (increasing the cache increases latency)
- cache size and manufacturing cost are proportional (a smaller cache lets a single wafer yield more CPUs, thus more units at a given cost)
- cache size and power consumption are... well, I don't know if the relationship is proportional or worse; I have no data to back it up.

So when they decide what cache size to use, they weigh overall power usage, die size and the speed benefit. The returns diminish: going from 1MB to 2MB brings more performance than going from 2MB to 4MB. So while 4MB would be the best performer, the other factors may keep it at 2MB (for the same power budget, an increased clock may give better overall performance).

As for Core 2 Duo versus i3 (and any Core i series, for that matter), you cannot compare cache alone. The i3 has an on-die memory controller, which reduces RAM-to-cache latency, and it also has improvements in the execution engine.

PS: The Core 2 Quad/Duo can use DDR3, as the memory controller is integrated in the motherboard chipset, not in the CPU (as it is in the i3).
 

Ags1

Honorable
Apr 26, 2012
255
0
10,790


Thanks mathew7, that's good info. It confirms what I was thinking: the cost of cache (in die area, energy and cash) goes up faster than the benefit, so caches are self-limiting. The background is that I'm writing a benchmark utility and I don't want to have to future-proof it against theoretical 64MB caches.
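
A rough sketch of one way a benchmark can discover cache boundaries at runtime instead of assuming them: sweep the working-set size with a dependent pointer chase and watch where the latency steps up (the sizes and iteration counts below are arbitrary choices, not measurements from this thread):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Measure average access latency for a growing working set. Latency steps
// roughly mark the L1/L2/L3/DRAM boundaries, so a benchmark can adapt to
// whatever cache sizes the CPU actually has.
static double ns_per_access(std::size_t slots) {
    // Link all slots into one random cycle so every load depends on the
    // previous one (this measures latency, not bandwidth).
    std::vector<std::size_t> order(slots);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    std::vector<std::size_t> next(slots);
    for (std::size_t i = 0; i < slots; ++i)
        next[order[i]] = order[(i + 1) % slots];

    const std::size_t steps = 20'000'000;
    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t sink = idx;   // keep the loop from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // Sweep working sets from 16 KiB up to 64 MiB.
    for (std::size_t kib = 16; kib <= 64 * 1024; kib *= 2) {
        std::size_t slots = kib * 1024 / sizeof(std::size_t);
        std::printf("%6zu KiB: %6.2f ns/access\n", kib, ns_per_access(slots));
    }
    return 0;
}

The idea is that the utility adapts to whatever sizes it finds, 64MB L3 or not.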
 

gurgadrgen

Reputable
Mar 26, 2015
2
0
4,510


While I don't disagree that obnoxiously large cache sizes do increase latency, there are two problems with your comparison of the Core 2 Quad Q9650 and the i3-2100:

1. You're assuming clock speed = performance. All clock speed means is that the CPU executes that many cycles per second. It does not in ANY way indicate how many instructions it completes in that time or, for that matter, how much work one instruction actually does (though instructions per second does tend to be a better metric for modern CPUs than pure clock rate). If CPU A runs at 5 times the clock speed of CPU B, but CPU B completes every instruction in one clock cycle while CPU A takes 6 clock cycles at minimum, then B has 20% higher performance than A despite cycling 5 times "slower" (see the sketch after this list).

2. You're linking the clock speed to the cache access speed. Cache is memory. Registers are also memory, but they are SIGNIFICANTLY faster than cache. In fact it is entirely possible to keep an entire program in registers, and execute some VERY complex code without once pushing data into or accessing data from memory. This is most easily done with assembly language, where you can directly command the system to more or less access things in a very specific pattern. That being said, there is still SOME cache accessing, but not in the way you would think. The best you can get is to restrict it to instruction-level accessing, if done right, most CPUs will know how to pre-load all that into the instruction cache, which tends to be L1 and L2 cache (the fastest cache levels). I promise you, not ONE company is going to measure the metrics of their CPU's core clock rate (e.g. clock rate) based on higher-level data-cache latency. Not only is it too specific of a problem (since it's not exactly always necessary, and it's not really related to the base-level speed of the CPU) but it also tends to make them LOOK bad, because it's going to be slower. Cache latency and cache architecture are fundamentally separate from CPU clock rate, and are not to be confused.
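
A tiny sketch of the arithmetic in point 1 above (the clock rates and cycle counts are just the hypothetical CPUs A and B from that example; throughput is clock rate divided by cycles per instruction):

#include <cstdio>

int main() {
    // Hypothetical CPUs from point 1: A clocks 5x faster than B but needs
    // 6 cycles per instruction; B completes one instruction per cycle.
    const double clock_b = 1.0e9;            // pick 1 GHz as a baseline
    const double clock_a = 5.0 * clock_b;    // 5 GHz
    const double cpi_a = 6.0, cpi_b = 1.0;   // cycles per instruction

    const double ips_a = clock_a / cpi_a;    // instructions per second
    const double ips_b = clock_b / cpi_b;

    // Prints B/A = 1.20, i.e. B is 20% faster despite the 5x lower clock.
    std::printf("A: %.3g instr/s  B: %.3g instr/s  B/A = %.2f\n",
                ips_a, ips_b, ips_b / ips_a);
    return 0;
}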

That all being said, a bigger cache can still quite often be a much better thing. You might wonder why that is, and the short answer would be simply because RAM access is almost ALWAYS going to be CONSIDERABLY slower.

You might be wondering, "well, how much slower?" The answer is, as one might expect, it varies. However, on average, modern RAM takes about 3 times longer than the SLOWEST cache level. This cache level is, on many architectures, about 9 times slower than the FASTEST cache level, making memory access a whopping 27 times slower than the fastest cache access.

I don't know about you, but I'd definitely be on board with even half the speed of the highest-level cache if it meant getting a 30% speed boost in memory access time and preventing cache misses in well-designed software. I specify well-designed because it is fairly easy to build software that misses cache on pretty much EVERY instruction and has to constantly go out to main memory.

To put this in perspective, an average CPU could perform a square-root operation in approximately 27 cycles. This operation is NOTORIOUSLY slow, as a standard instruction, in the programming community. However, how long does a cache miss take? Somewhere in the ballpark of 200-500 cycles. I could perform almost 30 square roots in the time it takes for you to simply ACCESS something. However, if that thing was already cached, depending on how deep in the caching layers it was, you could have it as soon as one square-root operation.

At the end of the day, a smaller cache does not mean faster run speed; it simply means faster cache access. If I had a cache with only enough space for a single "word" of data (4 bytes, or 32 bits, by x86 standards, at least), it would likely be comparable in speed to a register, the fastest form of "memory", which passes data around within the CPU's actual processing architecture, such as the ALU (arithmetic logic unit). However, it would be virtually pointless, because the point of cache is not to be as fast as humanly possible. What would be even faster would be to not use memory at all, but that's simply not reasonable from a programming standpoint, as it would require us to limit our programs to only as much memory as can be held in registers (on the order of less than half a kilobyte on the most elaborate of CPUs). So why do we use cache? Again, it's WAY faster than main memory. So, more or less, larger cache = faster run speed (usually).

However, there is actually somewhat of a cap to this speed boost, and that cap happens to be the clock rate. Basically, if your program is running straight down a cache line (cache is organized into "lines" of memory, by the way), the CPU will generally notice this and will automatically go out and grab the next line FOR you. This means the next line will be ready to go, or at the very least already on the way, when your CPU reaches the end of the current one. So there is a point where having MORE cache wouldn't really do you much good anyway, because it would take more time to fetch it than to run through it, or significantly more time to run through it because it's simply so long.
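
A toy illustration of that point (my own example, not from this thread): summing the same array once with unit stride, which the hardware prefetcher follows easily, and once with a page-sized stride that touches a new cache line on every access and gives the prefetcher nothing to work with:

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;                  // 16 Mi ints = 64 MiB
    std::vector<int> data(n, 1);
    const std::size_t stride = 4096 / sizeof(int);  // jump one page per step

    // Sum all n elements with a given step; total work is identical either way.
    auto time_sum = [&](std::size_t step) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t start = 0; start < step; ++start)
            for (std::size_t i = start; i < n; i += step)
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %5zu: sum=%lld, %.1f ms\n", step, sum,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    time_sum(1);       // sequential: the prefetcher runs ahead of the loop
    time_sum(stride);  // page-sized jumps: a new line every access, little prefetch help
    return 0;
}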

Intel (if it's not terribly hard to notice) has come into a sort of niche of cache size for the last few generations. While the caches were actually getting LARGER about four or five generations of CPU ago, they are now stuck at roughly 4MB. And that's not the only thing - they also have it split into more levels than before. This way, missing in one level of cache more linearly increases time waiting, and we don't get so many leaps in performance loss.

So while yes it is somewhat true that a smaller cache is faster than a bigger one, again, it's not about how fast the cache is. It's about how well it performs with regards to not MISSING cache lines and needing to walk AAALLL the way over to main memory and get the data back to the cache.
 
You're linking the clock speed to the cache access speed. Cache is memory. Registers are also memory, but they are SIGNIFICANTLY faster than cache.

The L1 instruction cache is hit on pretty much every single cycle with a 4-5 cycle pipeline delay on Intel's microarchitecture. The L1 data cache is not much different.

In fact it is entirely possible to keep an entire program in registers, and execute some VERY complex code without once pushing data into or accessing data from memory.

No it isn't. There's no mechanism in x86 to execute from registers. The general purpose registers are ill-equipped to store instruction words.

That being said, there is still SOME cache accessing, but not in the way you would think. The best you can get is to restrict it to instruction-level accessing, if done right, most CPUs will know how to pre-load all that into the instruction cache, which tends to be L1 and L2 cache (the fastest cache levels).

On x86, caching policies are established page-wise and handled block-wise. Certain pages, such as those that decode to PCI MMIO ranges, need to be marked as non-cacheable or write-through cacheable.

I promise you, not ONE company is going to measure the metrics of their CPU's core clock rate (e.g. clock rate) based on higher-level data-cache latency.

I'm not even sure what this means. All companies design their caches to maximize the aggregate hit rate and minimize the aggregate miss penalty. That's what optimization simulations are for.

Cache latency and cache architecture are fundamentally separate from CPU clock rate, and are not to be confused

Um, no they're not. The CPU pipeline is inextricably linked to the low level CPU caches. Intel didn't decouple the CPU core and L3 cache clock domains until Haswell. This added about a 6 cycle (CPU side) penalty to accessing the L3 cache. Furthermore, the micro-ops that handle load and store instructions are designed to interact with the appropriate cache.

That all being said, a bigger cache can still quite often be a much better thing. You might wonder why that is, and the short answer would be simply because RAM access is almost ALWAYS going to be CONSIDERABLY slower.

Bigger caches certainly can be better but they are rarely ever contiguous and uniform in latency. Intel actually breaks the L3 cache up such that some cores have lower latency to certain cache blocks than others. The wonky cache architecture on AMD's Bulldozer CPUs is partly responsible for the microarchitecture's performance problem. Bulldozer has a 2MiB L2 cache per module and the latency is in the range of 25-27 cycles; compare this to the 256KiB per core on Intel's i7 series microprocessors, which has a 12 cycle latency. The miss penalty on Bulldozer is enormous and this is a result of the higher level caches being too big and the lower level caches not being associative enough.

You might be wondering, "well, how much slower?" The answer is, as one might expect, it varies. However, on average, modern RAM takes about 3 times longer than the SLOWEST cache level. This cache level is, on many architectures, about 9 times slower than the FASTEST cache level, making memory access a whopping 27 times slower than the fastest cache access.

Tck for a 4GHz microprocessor is 1/(4E9) seconds, or 250 picoseconds. Median access time for an L3 cache hit is 30 cycles on Nehalem/Sandy Bridge/Ivy Bridge, and about 36 cycles on Haswell. That's 7.5 nanoseconds to 9 nanoseconds. During this time, the microprocessor keeps chugging along thanks to out-of-order execution and speculation. The median round trip time for a read to DDR3-SDRAM is about 53 nanoseconds on top of the 30-36 cycles. The SDRAM is 6-7 times slower than the L3 cache when the cache is operating at its normal clock rate, not 3 times.

To put this in perspective, an average CPU could perform a square-root operation in approximately 27 cycles. This operation is NOTORIOUSLY slow, as a standard instruction, in the programming community. However, how long does a cache miss take? Somewhere in the ballpark of 200-500 cycles. I could perform almost 30 square roots in the time it takes for you to simply ACCESS something. However, if that thing was already cached, depending on how deep in the caching layers it was, you could have it as soon as one square-root operation.

Are you talking about the fsqrt instruction? If so, that instruction is performed entirely in microcode and takes about 50-60 cycles depending on the microarchitecture. Some language libraries may lean on the CPU instruction, while others may implement it entirely in software. It's possible to get a faster approximation by using a fast inverse square root and then taking the reciprocal but this is only an approximation.
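
For what it's worth, a small sketch of the "approximate reciprocal square root, then multiply back" idea, here using the SSE rsqrtss approximation plus one Newton-Raphson refinement step (this is only one way to do it, and the exact cycle counts above vary by microarchitecture):

#include <cmath>
#include <cstdio>
#include <xmmintrin.h>   // _mm_rsqrt_ss (SSE)

// Approximate sqrt(x) as x * rsqrt(x), where rsqrt comes from the hardware's
// ~12-bit approximation, refined with one Newton-Raphson iteration.
static float approx_sqrt(float x) {
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // r ~ 1/sqrt(x)
    r = r * (1.5f - 0.5f * x * r * r);                     // one NR refinement
    return x * r;                                          // x / sqrt(x) = sqrt(x)
}

int main() {
    for (float x : {2.0f, 10.0f, 12345.0f}) {
        std::printf("x=%10.2f  approx=%12.6f  sqrtf=%12.6f\n",
                    x, approx_sqrt(x), std::sqrt(x));
    }
    return 0;
}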

Intel (if it's not terribly hard to notice) has come into a sort of niche of cache size for the last few generations. While the caches were actually getting LARGER about four or five generations of CPU ago, they are now stuck at roughly 4MB. And that's not the only thing - they also have it split into more levels than before. This way, missing in one level of cache more linearly increases time waiting, and we don't get so many leaps in performance loss.

They've fallen into a standard (not niche) cache hierarchy because that's where their simulations have placed optimal performance.
 

gurgadrgen

Reputable
Mar 26, 2015
2
0
4,510
The L1 instruction cache is hit on pretty much every single cycle with a 4-5 cycle pipeline delay on Intel's microarchitecture. The L1 data cache is not much different.

And registers have as little as 0 cycle latency with some instructions. Please explain how 0 is greater than 4. And no, the L1 instruction cache is not hit on pretty much every single cycle. Also, please explain to me how it is you almost NEVER get a cache miss from L1 cache when in order to get ANY cache misses you would have to miss the first level of cache? If there were hardly any cache misses on the L1, then why would we even have the L2 and L3? Seriously, if it were that rare, going to memory every once in a great while would be like nothing.

Run a profiler on an infinite loop with a branch in it for about ten seconds, and set the profiler to pick up cache misses. Then tell me there are "hardly any" instruction cache misses.

Now take a 2D array of size 1000x1000. Access it in column-major order and record the cache misses. I'll be back on in a week to see if your computer managed to finish it yet. You can tell me how rare your L1 data cache misses are when it misses virtually every call to another row :p.
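
For anyone who wants to try it, a minimal sketch of that experiment: a 1000x1000 int array stored row-major (as C++ does it), walked both ways. Run it under a profiler that counts cache misses to see the difference; the column-major walk strides a full 4000-byte row per access.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1000;
    std::vector<int> a(N * N, 1);       // row-major layout: a[row * N + col]
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t row = 0; row < N; ++row)      // row-major walk:
        for (std::size_t col = 0; col < N; ++col)  // consecutive addresses
            sum += a[row * N + col];
    auto t1 = std::chrono::steady_clock::now();

    for (std::size_t col = 0; col < N; ++col)      // column-major walk:
        for (std::size_t row = 0; row < N; ++row)  // jumps 4000 bytes per access
            sum += a[row * N + col];
    auto t2 = std::chrono::steady_clock::now();

    std::printf("row-major:    %.2f ms\ncolumn-major: %.2f ms\n(sum=%lld)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count(),
                sum);
    return 0;
}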

No it isn't. There's no mechanism in x86 to execute from registers. The general purpose registers are ill-equipped to store instruction words.

Please, allow me to clarify what I am talking about when I say data. See, I'm talking about data. Does that make it more clear? Because instructions, last I checked, especially when referring to the cache, are not data. As in data cache. As in loading data into the program. Not loading the program itself. My apologies for not making that more clear.

On x86, caching policies are established page-wise and handled block-wise. Certain pages, such as those that decode to PCI MMIO ranges, need to be marked as non-cachable or write-through cachable.

Correction: should be marked as non-cacheable. It wouldn't make sense to leave response to an external PCI event sitting in cache, anyway. Also, cache is loaded in cache lines. If it wasn't clear, that's what I was referring to. As far as I know (I may be wrong), cache lines have been used since well into the era of x86 architectures.

I'm not even sure what this means. All companies design their caches to maximize the aggregate hit rate and minimize the aggregate miss penalty. That's what optimization simulations are for.

It means a company is not going to report cache latency as its clock rate. They will most likely run the CPU with the most intense instructions and attempt to bog it down, but cache isn't used to measure the clock rate. If you programmed your system to constantly drop down to L2 and L3 cache, or god forbid memory... *shudder*... and failed to allow for any sort of prefetch, you wouldn't get a very accurate depiction of the maximum speed of that CPU, now would you? The absolute maximum speed of the CPU is going to be virtually unhindered by the cache (hence it being the "absolute maximum"), not to mention the fact that most CPUs CAN clock faster than they are set to, though that doesn't mean they would actually run significantly better. Faster clock speed isn't going to have an effect on cache latency. That would be like saying that driving a Ferrari makes a red light turn green faster. While the car is certainly CAPABLE of moving more quickly, I promise you they didn't test the maximum speed of the car at a series of stop lights. They probably got a big, open road (*cough* cache lines *cough*) and ran straight down it.

And again, I come back to the fact that clock speed is simply a result of how the instruction set is executed. If you watch instruction timings change even within Intel architectures, you'll notice that latency and throughput change, which means that cycles-per-instruction also changes. There are instructions that took 1 cycle in the past five architectures that now take 3, and I know for a fact that Intel has not improved average clock speed by a factor of 3 in the last few architectures.

Um, no they're not. The CPU pipeline is inextricably linked to the low level CPU caches. Intel didn't decouple the CPU core and L3 cache clock domains until Haswell. This added about a 6 cycle (CPU side) penalty to accessing the L3 cache. Furthermore, the micro-ops that handle load and store instructions are designed to interact with the appropriate cache.

Only in that it requires memory to pass through them. It does not imply that the instructions are loaded directly from the low-level cache. They are loaded in from the L1, then, assuming there is a miss, the next level must be checked, and so on until the last level, then main memory. While it is linked, that doesn't mean prefetching isn't happening in the background, making all latency issues seem almost nonexistent between cache levels. Again, if you're going to test one particular aspect of your CPU, you'd best not sully it up with test code that doesn't actually test its best performance. I.e. run down the cache line so the pre-fetcher will work properly, and you will get very cache-friendly speeds, with very few misses, even on higher levels.

Bigger caches certainly can be better but they are rarely ever contiguous and uniform in latency. Intel actually breaks the L3 cache up such that some cores have lower latency to certain cache blocks than others. The wonky cache architecture on AMD's Bulldozer CPUs is partly responsible for the microarchitecture's performance problem. Bulldozer has a 2MiB L2 cache per module and the latency is in the range of 25-27 cycles; compare this to the 256KiB per core on Intel's i7 series microprocessors, which has a 12 cycle latency. The miss penalty on Bulldozer is enormous and this is a result of the higher level caches being too big and the lower level caches not being associative enough.

That's all well and good, but you said it yourself: "can be better." That was what I was getting at. They have the POTENTIAL to be better, for the simple fact that memory access is still worse and, in an ideal world, having more space for cache would mean fewer cache misses as a program's execution progresses.

Tck for a 4Ghz microprocessor is 1/4E9 seconds, or 250 picoseconds. Median access time for an L3 cache hit is 30 cycles on Nehalem/Sandybridge/Ivybridge, and about 36 cycles on Haswell. That's 7.5 nanoseconds to 9 nanoseconds. During this time, the microprocessor keeps chugging along thanks to out-of-order execution and speculation. The median round trip time for a read to DDR3-SDRAM is about 53 nanoseconds on top of the 30-36 cycles. The SDRAM is 6-7 times slower than the L3 cache when the cache is operating at its normal clock rate, not 3 times.

4GHz processor? (it's Hz, by the way, not hz, if we're really going to be playing semantics here :p). Since when is the most recent, ridiculously high-power CPU "average"? Intel JUST released its FIRST 4GHz processor last year. I was using figures from average computers. Not factory-new, first shipment off the line computers. Either way, my point still stands: Main memory is slow. Cache is not necessarily ideal (all the time) but it's certainly faster than main memory.

Also, I said "many architectures." Not "the newest architectures."



Are you talking about the fsqrt instruction? If so, that instruction is performed entirely in microcode and takes about 50-60 cycles depending on the microarchitecture. Some language libraries may lean on the CPU instruction, while others may implement it entirely in software. It's possible to get a faster approximation by using a fast inverse square root and then taking the reciprocal but this is only an approximation.

Honestly, I don't even remember where I got that figure... you've got me on that one :p. However, sqrt is still significantly faster than running out to memory XD

They've fallen into a standard (not niche) cache hierarchy because that's where their simulations have placed optimal performance.

Niche simply means a position. You could also call it a standard, but it would not be incorrect to call it a niche. Let's not argue semantics, please.
 
And registers have as little as 0 cycle latency with some instructions. Please explain how 0 is greater than 4. And no, the L1 instruction cache is not hit on pretty much every single cycle. Also, please explain to me how it is you almost NEVER get a cache miss from L1 cache when in order to get ANY cache misses you would have to miss the first level of cache? If there were hardly any cache misses on the L1, then why would we even have the L2 and L3? Seriously, if it were that rare, going to memory every once in a great while would be like nothing.

The register file and the caches are accessed at different stages in the pipeline. Neither one is "faster" because "fast" is not a defined unit of measurement in that scope. The L1 instruction cache is accessed on almost every single cycle by the x86 CISC instruction decoder, and the micro-op cache is accessed on pretty much every cycle as well. The aggregate L1 hit rate (for cacheable memory) for a well-designed cache architecture and a well-optimized program is well over 90%.

Now take a 2D array of size 1000x1000. Access it in column-major order and record the cache misses. I'll be back on in a week to see if your computer managed to finish it yet. You can tell me how rare your L1 data cache misses are when it misses virtually every call to another row :p .

That's why there's an explicit prefetch instruction and non-temporal write instructions. Calculate the stride of the array, manually prefetch the non-unit-stride data, and watch those misses disappear.
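
A hedged sketch of what that can look like for the column-major walk above, using the SSE prefetch intrinsic _mm_prefetch (the 8-row prefetch distance and the T0 hint are arbitrary choices here, and this only shows the prefetch half, not the non-temporal stores; real code would tune and measure):

#include <cstdio>
#include <vector>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

int main() {
    const std::size_t N = 1000;
    std::vector<int> a(N * N, 1);        // row-major layout: a[row * N + col]
    const std::size_t stride = N;        // elements between column neighbours
    const std::size_t lookahead = 8;     // prefetch 8 rows ahead (untuned guess)
    long long sum = 0;

    for (std::size_t col = 0; col < N; ++col) {
        for (std::size_t row = 0; row < N; ++row) {
            if (row + lookahead < N) {
                // Ask the hardware to start fetching a future line now, so it
                // is (hopefully) resident by the time the loop reaches it.
                _mm_prefetch(reinterpret_cast<const char*>(
                                 &a[(row + lookahead) * stride + col]),
                             _MM_HINT_T0);
            }
            sum += a[row * stride + col];
        }
    }
    std::printf("sum=%lld\n", sum);
    return 0;
}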

Please, allow me to clarify what I am talking about when I say data. See, I'm talking about data. Does that make it more clear? Because instructions, last I checked, especially when referring to the cache, are not data. As in data cache. As in loading data into the program. Not loading the program itself. My apologies for not making that more clear.

I figured that's what you meant, but you did write program which implies both the instruction and data components. You're absolutely right though in that compact arithmetic operations can be performed entirely in registers after filling with little to no spilling.

Correction: should be marked as non-cacheable. It wouldn't make sense to leave response to an external PCI event sitting in cache, anyway. Also, cache is loaded in cache lines. If it wasn't clear, that's what I was referring to. As far as I know (I may be wrong), cache lines have been used since well into the era of x86 architectures.

Using write-back caching on an MMIO range is a good way to cause a kernel panic. Write-through caching requires very careful handling because that line may be cached for reading. Common practice is to write to an MMIO address and then read back from it after the hardware has done its thing; caching can screw that up.
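
To sketch the write-then-read-back pattern in the abstract (everything here is made up: the register layout, the bit meanings, and the assumption that the page backing the registers has already been mapped uncached by the OS; the dummy main just exercises the pattern without real hardware):

#include <cstdint>

// Hypothetical device register block, mapped somewhere by the kernel/driver.
// 'volatile' stops the compiler from caching or reordering these accesses in
// registers; it does NOT control the CPU cache -- that is the job of the
// page attributes (uncached / write-through) discussed above.
struct DeviceRegs {
    volatile std::uint32_t control;
    volatile std::uint32_t status;
};

void kick_device(DeviceRegs* regs) {
    regs->control = 0x1;                 // tell the device to start (made-up bit)
    while ((regs->status & 0x1) == 0) {  // read back until it reports done
        // spin; a real driver would bound this or sleep
    }
}

int main() {
    // No real device here: exercise the pattern against a dummy block whose
    // status bit is already set, so the wait loop returns immediately.
    DeviceRegs dummy{0u, 0x1u};
    kick_device(&dummy);
    return 0;
}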

As for the cache blocks, you're right. Modern x86 architectures use 64-byte (aligned) cache blocks. I believe that these blocks are virtually indexed, physically tagged at the L1 and physically indexed, physically tagged L2 and above.

Faster clock speed isn't going to have an effect on cache latency.

Sure it is. The L1 and L2 cache are on the same clock domain as the rest of the core. They're synchronous. On Intel's Nehalem/Sandy/Ivy/Haswell microarchitectures it takes four cycles to access the L1 cache using a simple pointer (register indirect, or immediate indirect) or five cycles using an offset (thanks to an extra addition). Lowering the clock rate to save power doesn't adjust the pipeline; it stays at 4-5 cycles and the real-time latency changes accordingly. Similarly, cache stability may place an upper limit on clock rate, requiring the cache and fetch components of the pipeline to be reworked. This is what happened during the Pentium 4 era when Prescott jumped to 31 pipeline stages; the backend got a lot faster than the frontend.
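
A back-of-the-envelope sketch of that point: the L1 load-to-use latency is a fixed number of cycles set by the pipeline, so the wall-clock latency simply scales with whatever the core clock happens to be (the example frequencies below are arbitrary):

#include <cstdio>

int main() {
    const double l1_cycles = 4.0;               // load-to-use latency in cycles
    for (double ghz : {1.2, 2.4, 3.6, 4.0}) {   // example core clocks
        double ns = l1_cycles / ghz;            // same 4 cycles, different real time
        std::printf("%.1f GHz: L1 hit = %.0f cycles = %.2f ns\n",
                    ghz, l1_cycles, ns);
    }
    return 0;
}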

4GHz processor? (it's Hz, by the way, not hz, if we're really going to be playing semantics here :p ). Since when is the most recent, ridiculously high-power CPU "average"? Intel JUST released its FIRST 4GHz processor last year.

Intel had a commercially available 3.8GHz microprocessor more than a decade ago. They were working on a 4GHz Pentium 4 model for a while too, but I have no idea what happened to it. Many consumer i7 microprocessors can turbo boost into the 3.8-3.9GHz range, and AMD also has many microprocessors in the 4GHz+ range. It's just an example anyway.