AMD Piledriver rumours ... and expert conjecture

Reynod · Oct 27, 2011

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...

lilcinw · Jan 27, 2012

@bawchicawawa Because you are arguing semantics. They are based on the same core. Inclusion of the L3 cache has little impact on performance. Besides, there should be a direct correlation when comparing llano to PhII and trinity to PD as they are the same arch without the corresponding L3.

If Ph II > llano, PD > trinity and trinity > llano it should follow that PD > Ph II.

Looking back at benchmarks, it looks like Ph II beats Llano by ~20% (mostly due to clocks being ~17% higher). If the rumored 20-25% improvement is true for the CPU side, then PD should be a clear improvement over Ph II.
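For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted above. It assumes performance scales linearly with clock speed and that the rumored 20-25% gain is measured against Llano; neither assumption is confirmed by any source in this thread, and the variable names are made up for illustration.

```python
# Figures quoted in the post: Ph II beats Llano by ~20%, with clocks ~17% higher.
ph2_over_llano = 1.20
clock_ratio = 1.17

# Assumption: performance scales linearly with clock speed.
ipc_ratio = ph2_over_llano / clock_ratio
print(f"Ph II per-clock advantage over Llano: {(ipc_ratio - 1) * 100:.1f}%")

# Assumption: the rumored 20-25% CPU gain is measured against Llano.
rumored_gain_low, rumored_gain_high = 1.20, 1.25
trinity_vs_ph2_low = rumored_gain_low / ph2_over_llano
trinity_vs_ph2_high = rumored_gain_high / ph2_over_llano
print(f"Trinity vs Ph II: {trinity_vs_ph2_low:.2f}x to {trinity_vs_ph2_high:.2f}x")
```

Under those assumptions the per-clock gap between Ph II and Llano is only a few percent, and Trinity would land about level with Ph II, so any edge for PD would have to come from its L3 and clocks.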

-Fran- · Jan 27, 2012

But if the CPU side of the equation gets worse, Trinity becomes very hard to recommend. For its intended use [IE: Not gaming], you are more likely to notice a weak CPU over a weak GPU.

My argument is a simple one: If BD < PII, and Llano is based on Athlon, then it's quite possible Trinity could be weaker than Llano, at least as far as the CPU goes.

Not quite, actually.

The intended uses for Llano actually include light gaming and HTPC duty. You don't think that of Intel CPUs, because they just can't do it at the same price points. You'd need a discrete card, adding like 40 USD or more. I've read on Phoronix about the "HD4000" that Ivy sports... Looks like it's still at 60%-70% of current Llano performance: http://phoronix.com/forums/showthread.php?68294-50-slower-than-AMD!-Intel-FAIL-again-with-the-HD4000-graphic-hardware

And the difference in price and value is even more pronounced in notebooks. Llanos are dirt cheap compared to their Intel counterparts and perform better when gaming. YouTube has a lot of examples of it. So, I'd say this is not about the CPU arch inside alone; it's the balance they have.

Like I said, the thing is that if the PD modules (CPU side) don't deliver enough to push the GPU side, Trinity won't become a better value proposition than Llano and THAT will be bad.

Cheers!

fazers_on_stun · Jan 27, 2012

http://www.anandtech.com/show/5465/amd-q411-fy-2011-earnings-report-169b-revenue-for-q4-657b-revenue-for-2011

In terms of product shipments, Q4 marked the launch of AMD’s Bulldozer architecture. AMD technically began shipping Bulldozer products for revenue in Q3, but Q4 was the first complete quarter. For that reason server and chipset revenue grew by double-digits over Q3, while desktop Bulldozer sales went unmentioned in AMD’s report. Meanwhile compared to Q4 of 2010 AMD’s CPU & chipset revenue was up slightly, with the bulk of the difference due to higher mobile CPU (Brazos and Llano) and chipset sales. Unfortunately for AMD this didn’t do anything to help their ASP for the quarter, and as a result it’s flat versus 2010.

So my guess is the reason all those desktop BDs get sold out is short supply, due to (1) GloFo yield issues continuing to some extent, and (2) AMD making mostly server versions, as they can charge more $$ for them.

g4114rd0 · Jan 27, 2012

- "Is the just or the unjust the hops?"
…really don't believe APU$ , until you see one on System Builder Marathon, If the purpose of this is to increase your fps 4 Desktop gaming.
It is simple, at the moment CPU+GPU will be better than APU+ GPU, maybe that wasn't such a good example.
The modest G850 + GPU vs your favorite APU + low cost GPU.

lilcinw · Jan 27, 2012

APUs are not for serious desktop gaming rigs. They are to provide a cheap CPU + GPU on a single die. OEMs love them and they give better graphics prowess than any IGP that came before.

The office I work in is a complete Dell shop and the only computers that have a discrete GPU are 'workstation' class with workstation class GPUs.

APUs are a cheap value add for those that are going to buy a box that they never open.

gamerk316 · Jan 27, 2012

Also, second gen BD will have improvements from the first gen, so your argument isn't really valid...

Someone, please, tell me what these improvements are? I hear this statement a lot, but I've yet to hear it quantified.

gamerk316 · Jan 27, 2012

Also: http://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested/3

I'm very torn about what MS has to do to the scheduler to make BD [and designs like it] work better. Schedulers should remain simple, for fairly obvious reasons.

-Fran- · Jan 27, 2012

Someone, please, tell me what these improvements are? I hear this statement a lot, but I've yet to hear it quantified.

Well, even if they haven't done anything with the arch itself, at least they should have a better 32nm process. If the sheets are right and Trinity will be sporting something like 3.2-3.4GHz on the same 32nm process with around the same GPU consumption, it can be considered an improvement... I guess... lol

But, come on gamerk, at least give them the benefit of doubt.

Also: http://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested/3

I'm very torn about what MS has to do to the scheduler to make BD [and designs like it] work better. Schedulers should remain simple, for fairly obvious reasons.

Schedulers are long gone from being "simple" IMO. You can keep the principle behind it as "simple", but when looking at the implementation, they're usually around the same. I recall taking a look at the linux scheduler and it was not "simple" xD

Damn hardware makers! >_<'

Cheers!

noob2222 · Jan 27, 2012

@bawchicawawa Because you are arguing semantics. They are based on the same core. Inclusion of the L3 cache has little impact on performance. Besides, there should be a direct correlation when comparing llano to PhII and trinity to PD as they are the same arch without the corresponding L3.

If Ph II > llano, PD > trinity and trinity > llano it should follow that PD > Ph II.

Looking back at benchmarks, it looks like Ph II beats Llano by ~20% (mostly due to clocks being ~17% higher). If the rumored 20-25% improvement is true for the CPU side, then PD should be a clear improvement over Ph II.

L3 actually does impact performance, especially in games.

http://www.techspot.com/review/458-battlefield-3-performance/page7.html

athlon II 625 is borderline useless in single player, while the PII 560 is capable of hanging in with the rest, both are 3.3ghz

http://www.techpowerup.com/reviews/AMD/Phenom_II_X4_840/9.html

945, 3.0 ghz full l3 cached chip beating the snot out of the 3.2 ghz PII 840 cacheless chip. The only exception was dirt 3, so yes L3 cache does matter, a lot.

esrever · Jan 27, 2012

Also: http://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested/3

I'm very torn about what MS has to do to the scheduler to make BD [and designs like it] work better. Schedulers should remain simple, for fairly obvious reasons.

the scheduler can be very simple in terms of computing and be very complex to understand. Adding a lot of lines of code to the scheduler will not affect the performance of the OS because of how light it is compared to the actual code trying to run.

g4114rd0 · Jan 28, 2012

The whole scene It should pulse with scheduler issues & the futile exotic circuit techniques.

palladin9479 · Jan 28, 2012

L3 actually does impact performance, especially in games.

http://www.techspot.com/review/458-battlefield-3-performance/page7.html

athlon II 625 is borderline useless in single player, while the PII 560 is capable of hanging in with the rest, both are 3.3ghz

http://www.techpowerup.com/reviews/AMD/Phenom_II_X4_840/9.html

945, 3.0 ghz full l3 cached chip beating the snot out of the 3.2 ghz PII 840 cacheless chip. The only exception was dirt 3, so yes L3 cache does matter, a lot.

I've corrected this many times over. L3 will only have an impact if your CPU's prediction is crap. If your CPU's prediction is good then it'll rarely ever hit L3 cache. L3 is a very very BAD place to be; the latency to access it guarantees your CPU will be stalled.

Another misconception is that L3 is some sort of "buffer" for main memory. This is not true. L2 cache is filled from main memory by the branch prediction / data prefetcher. L3 is used as a pool for data that won't fit into L2 and that ~might~ be needed. Typically when you do an L2 cache fill you grab nearby memory and put that in L3, "just in case". This only comes into play once you're running code / data that can't fit into your L2 cache, or your predictor is so bad that it can't keep in L2 what needs to be in L2.

In short, L3 acts as a safety net to try to catch cache misses before they hit main memory. If there are few or no cache misses then L3 sits there doing absolutely nothing. L3 also takes an inordinate amount of die space for what little performance it provides. Better to add more L2 than to resort to L3. The only reason I can think of for AMD devoting that much die space to L3 is that they knew the predictor / prefetch unit was having issues and implemented the L3 cache as a mitigation strategy. If you're looking to conserve die space, the first thing that needs to go is L3.

palladin9479 · Jan 28, 2012

Also: http://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested/3

I'm very torn about what MS has to do to the scheduler to make BD [and designs like it] work better. Schedulers should remain simple, for fairly obvious reasons.

Umm what?

I don't want to sound like an a$$, but that's not how schedulers work. They're not some three lines of C++ code somewhere. They're complicated as far as logic goes, just not nearly as massive as other components.

All MS did was alter thread priorities depending on their originating process so as to prevent heavy workload threads from being scheduled on the same module and to ensure that a light workload thread was put with other threads of the same memory space. It's a set of logic that only kicks in if the OS detects a BD uArch CPU installed. MS had to do something similar when Intel first introduced HT.

Also, adding more lines of code does NOT slow a program down. Only the binary code is executed, and only the code that is called is executed. The worst it does is make it harder to debug that particular section for errors, as there are more places for an error to occur.
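To make that placement idea concrete, here is a toy sketch. It is not the actual Windows scheduler logic, which isn't shown anywhere in this thread. The function name `place_threads`, the heavy/light flag and the two-cores-per-module layout are all assumptions for illustration, and the "same memory space" grouping is not modeled at all.

```python
def place_threads(threads, num_modules, cores_per_module=2):
    """Toy placement: spread heavy threads across modules, pack light ones together.

    threads is a list of (name, is_heavy) tuples. Returns one list of thread
    names per module. Hypothetical model, not any real OS scheduler.
    """
    if len(threads) > num_modules * cores_per_module:
        raise ValueError("more threads than cores in this toy model")
    modules = [[] for _ in range(num_modules)]

    def has_free_core(m):
        return len(modules[m]) < cores_per_module

    def holds_heavy(m):
        return any(is_heavy for _, is_heavy in modules[m])

    # Heavy threads go to the emptiest module so they avoid sharing one.
    for thread in (t for t in threads if t[1]):
        target = min(range(num_modules), key=lambda m: len(modules[m]))
        modules[target].append(thread)

    # Light threads prefer the fullest module that has a free core and no
    # heavy thread; if none exists, any module with a free core will do.
    for thread in (t for t in threads if not t[1]):
        free = [m for m in range(num_modules) if has_free_core(m)]
        preferred = [m for m in free if not holds_heavy(m)] or free
        target = max(preferred, key=lambda m: len(modules[m]))
        modules[target].append(thread)

    return [[name for name, _ in module] for module in modules]


layout = place_threads(
    [("encode-0", True), ("encode-1", True), ("ui", False), ("audio", False)],
    num_modules=4,
)
print(layout)
```

In this model the two heavy threads end up on separate modules and the two light ones share a third, which is the behavior described above. A real scheduler has to make this call continuously as load changes, which is where the complexity comes from.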

esrever · Jan 28, 2012

I would like to see AMD eventually add eDRAM to the Fusion chip. It would be a good way to increase performance without the need for discrete high-performance memory.

noob2222 · Jan 28, 2012

I've corrected this many times over. L3 will only have an impact if your CPU's prediction is crap. If your CPU's prediction is good then it'll rarely ever hit L3 cache. L3 is a very very BAD place to be; the latency to access it guarantees your CPU will be stalled.

Another misconception is that L3 is some sort of "buffer" for main memory. This is not true. L2 cache is filled from main memory by the branch prediction / data prefetcher. L3 is used as a pool for data that won't fit into L2 and that ~might~ be needed. Typically when you do an L2 cache fill you grab nearby memory and put that in L3, "just in case". This only comes into play once you're running code / data that can't fit into your L2 cache, or your predictor is so bad that it can't keep in L2 what needs to be in L2.

In short, L3 acts as a safety net to try to catch cache misses before they hit main memory. If there are few or no cache misses then L3 sits there doing absolutely nothing. L3 also takes an inordinate amount of die space for what little performance it provides. Better to add more L2 than to resort to L3. The only reason I can think of for AMD devoting that much die space to L3 is that they knew the predictor / prefetch unit was having issues and implemented the L3 cache as a mitigation strategy. If you're looking to conserve die space, the first thing that needs to go is L3.

Sb-E must have the worst branch prediction in cpu history if that's the case.

palladin9479 · Jan 29, 2012

Sb-E must have the worst branch prediction in cpu history if that's the case.

Umm what?

Intel just added a ton of L3 as a selling point.

Can't stress this enough: L3 doesn't do squat unless you're taking cache misses on L2, which is a very bad place to be to begin with. We've had "L3 cache" since the mid-90s; AMD was actually the first to create it. It didn't do squat then and it's not doing squat now.

MU_Engineer · Jan 29, 2012

Umm what?

Intel just added a ton of L3 as a selling point.

Can't stress this enough: L3 doesn't do squat unless you're taking cache misses on L2, which is a very bad place to be to begin with. We've had "L3 cache" since the mid-90s; AMD was actually the first to create it. It didn't do squat then and it's not doing squat now.

Not quite. L3 cache as we know it- a third level of cache on the CPU die- started around 2000 for x86 processors with the Xeon MP version (Foster MP) of the Pentium 4 Willamette. AMD did have a "TriLevel Cache" as they put it, with a combination of an L2-containing K6-2+/K6-III/K6-III+ and a Super Socket 7 motherboard with on-motherboard cache SRAM cells. The cache SRAM was supposed to act as an FSB-speed L2 cache for L2-less Socket 7 CPUs like Pentium MMXes and K6/K6-2 CPUs, but would become a third level of cache with an L2-containing CPU. It ran at FSB speed and had a lot less performance relative to the L2 cache than a modern on-die L3 cache has.

Also, L3 cache does have its uses. Server CPUs have large amounts of L3 cache because the L3 cache helps them handle the massive amount of bus traffic and congestion. The old NetBurst Xeon MPs pretty much had to have L3 cache as the relatively low-clocked shared FSB architecture caused massive bottlenecks, which the large L3 caches helped to improve noticeably. Current IMC-equipped multi-socket CPUs benefit from L3 cache as the L3 helps with keeping cache coherence between the various CPUs. A good example of that is AMD's HyperTransport Assist feature on newer Opterons, which devotes a MB or two of each L3 cache to keeping coherency data, allowing for better scaling. Single-socket CPUs may not always benefit as much from L3 cache as multiprocessor CPUs, but rarely do you see an L3-less CPU perform better than an otherwise similar CPU with loads of L3 cache.

The continued integration of more and more formerly discrete functions such as GPUs onto CPU dies will continue to drive bandwidth demands up significantly faster than RAM bandwidth is improving. Adding yet more cache to the CPU certainly seems like a more workable solution than having yet more massively interleaved memory controllers due to chip complexity, board complexity, and RAM costs. Maybe that cache will be something closer to VRAM than L3 cache, but trying to put something like an 8-channel DDR3-2400 controller on a CPU is not very workable.

palladin9479 · Jan 29, 2012

Not quite. L3 cache as we know it- a third level of cache on the CPU die- started around 2000 for x86 processors with the Xeon MP version (Foster MP) of the Pentium 4 Willamette. AMD did have a "TriLevel Cache" as they put it, with a combination of an L2-containing K6-2+/K6-III/K6-III+ and a Super Socket 7 motherboard with on-motherboard cache SRAM cells. The cache SRAM was supposed to act as an FSB-speed L2 cache for L2-less Socket 7 CPUs like Pentium MMXes and K6/K6-2 CPUs, but would become a third level of cache with an L2-containing CPU. It ran at FSB speed and had a lot less performance relative to the L2 cache than a modern on-die L3 cache has.

Also, L3 cache does have its uses. Server CPUs have large amounts of L3 cache because the L3 cache helps them handle the massive amount of bus traffic and congestion. The old NetBurst Xeon MPs pretty much had to have L3 cache as the relatively low-clocked shared FSB architecture caused massive bottlenecks, which the large L3 caches helped to improve noticeably. Current IMC-equipped multi-socket CPUs benefit from L3 cache as the L3 helps with keeping cache coherence between the various CPUs. A good example of that is AMD's HyperTransport Assist feature on newer Opterons, which devotes a MB or two of each L3 cache to keeping coherency data, allowing for better scaling. Single-socket CPUs may not always benefit as much from L3 cache as multiprocessor CPUs, but rarely do you see an L3-less CPU perform better than an otherwise similar CPU with loads of L3 cache.

The continued integration of more and more formerly discrete functions such as GPUs onto CPU dies will continue to drive bandwidth demands up significantly faster than RAM bandwidth is improving. Adding yet more cache to the CPU certainly seems like a more workable solution than having yet more massively interleaved memory controllers due to chip complexity, board complexity, and RAM costs. Maybe that cache will be something closer to VRAM than L3 cache, but trying to put something like an 8-channel DDR3-2400 controller on a CPU is not very workable.

The logic behind AMD's L3 cache and the later on-die L3 cache was the same: it's used as an emergency net for when you miss in L2.

In the server world it's a bit different, not due to the I/O but due to the sheer number of independent processes running. With so much happening, even the best predictor / prefetch engine won't be able to fit it all into L2, so L3 acts as a good buffer region.

In the desktop world, since that's usually what we're talking about, L3 gives next to no performance benefit for the large amount of die space it takes up. We're talking less than an 11~15% increase from doubling the size, and that's on tasks oriented to utilize large amounts of L3. On something like a video game / AV processing / archiving / office automation you'll get next to no improvement. Best example is an i5-2500K @ 3.6GHz vs an i7-3930K @ 3.6GHz (make them the same speed / core count). The 3930K has double the L3 cache (12MB vs 6MB) yet rarely shows much improvement over the i5. Now given the exact same CPU uArch, having some L3 > having no L3, as you will always have some L2 misses; the more misses, the more valuable the L3 cache becomes. But if you need die space for something like an IGP, then the first thing to remove is the L3, as it provides the least performance benefit.

MU_Engineer · Jan 29, 2012

In the desktop world, since that's usually what we're talking about, L3 gives next to no performance benefit for the large amount of die space it takes up. We're talking less than an 11~15% increase from doubling the size, and that's on tasks oriented to utilize large amounts of L3. On something like a video game / AV processing / archiving / office automation you'll get next to no improvement. Best example is an i5-2500K @ 3.6GHz vs an i7-3930K @ 3.6GHz (make them the same speed / core count). The 3930K has double the L3 cache (12MB vs 6MB) yet rarely shows much improvement over the i5. Now given the exact same CPU uArch, having some L3 > having no L3, as you will always have some L2 misses; the more misses, the more valuable the L3 cache becomes. But if you need die space for something like an IGP, then the first thing to remove is the L3, as it provides the least performance benefit.

L3 does take up some die space, but not an enormous amount on most desktop chips. The typical "core" of logic + L1 + L2 takes up much more space than the L3, and if you have a reasonable amount of L2 cache, especially so. Video games often do like L3 cache as the Athlon II vs. Phenom II tests show. Encoding, not so much. Yes, the performance difference between L3-less Athlon IIs and L3-equipped Phenom IIs is often in that 10-15% range. But "enthusiasts" often say that one chip "absolutely slaughters" another when the performance delta is far less than 10-15%. I would expect that the enthusiasts would be all up in arms if Intel or AMD decided to drop L3 from all of their chips, claiming it makes them "glacially slow" or some other hyperbole...

Comparing a 3.6 GHz i5-2500K to a 3.6 GHz i7-3930K and saying they differ just in cache size is about like comparing apples to oranges. You have a four-core, four-thread chip with a 128-bit IMC versus a six-core, 12-thread chip with a 256-bit IMC. Any program that runs similarly on both chips is single-threaded or very close to it, and essentially just looks at single-core throughput and maximum Turbo Boost clock speed. That is a shrinking number of programs these days, and that number will continue to shrink in the future.

Also, die space isn't really that big of a concern. The general argument about larger dies being bad essentially boils down to a higher cost to make larger dies. That is true, to a point as larger dies have higher intrinsic costs due to fewer possible candidate dies per wafer due to sheer die area, as well as lower percentage yields of larger dies compared to smaller ones. However, it costs a lot of money to make a separate smaller-die mask and production line, and most people forget that. It is often LESS expensive to use the great big old server die with the mountain of L3 rather than make a special smaller mask with a small or no L3 that gives overall similar performance. How do I know this? We have seen many, many parts from both makers where this is the case. That i7-3930K you mentioned above is a 415 mm^2 monstrosity of a die, and a decent chunk of the die is actually fused off, as the die physically has 8 cores and 20+ MB of L3 cache onboard! Other big wastes of space on desktop chips are the massively wide memory interfaces seen in the LGA2011 desktop chips and the disabled HT/QPI interfaces that are not active on anything but the 4+ socket versions of those chips. At least the L3 may help performance a little, disabled parts do nothing at all. IGPs also do essentially nothing on enthusiast chips as well, just ask anybody with an unlocked LGA1155 Sandy Bridge. Who buys an i7-2700K to use it with the crappy onboard IGP? But, the reason it is there is because Intel uses the same parent die for their quad-core mobile chips, and many of those really do just have the on-die IGP paired with the CPU.
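The die-cost trade-off above can be put into rough numbers. This is only a sketch: it uses a common dies-per-wafer approximation and a simple Poisson yield model, and the wafer cost, defect density, mask cost, volumes and the 216 mm^2 "small die" are all made-up inputs for illustration. Only the 415 mm^2 figure comes from the post.

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Common approximation: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Simple Poisson model: the chance that a die has zero defects."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

def cost_per_good_die(die_area_mm2, wafer_cost, defects_per_cm2,
                      mask_cost=0.0, volume=1):
    """Wafer cost spread over good dies, plus mask cost spread over volume."""
    good = dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good + mask_cost / volume

# Hypothetical inputs, not real foundry numbers.
WAFER_COST = 5000.0
DEFECTS = 0.2           # defects per cm^2
NEW_MASK_COST = 2_000_000.0

# Big die: mask set assumed already paid for by the server parts.
big = cost_per_good_die(415, WAFER_COST, DEFECTS)
# Small die: needs its own mask set, amortized over the units sold.
small_low_volume = cost_per_good_die(216, WAFER_COST, DEFECTS, NEW_MASK_COST, 20_000)
small_high_volume = cost_per_good_die(216, WAFER_COST, DEFECTS, NEW_MASK_COST, 200_000)

print(f"reuse big die:         ${big:.2f}")
print(f"small die, 20k units:  ${small_low_volume:.2f}")
print(f"small die, 200k units: ${small_high_volume:.2f}")
```

With these invented inputs, reusing the big die wins at low volume and a dedicated small die wins at high volume. The real crossover point depends entirely on numbers nobody in this thread has.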

palladin9479 · Jan 30, 2012

L3 does take up some die space, but not an enormous amount on most desktop chips. The typical "core" of logic + L1 + L2 takes up much more space than the L3, and if you have a reasonable amount of L2 cache, especially so. Video games often do like L3 cache as the Athlon II vs. Phenom II tests show. Encoding, not so much. Yes, the performance difference between L3-less Athlon IIs and L3-equipped Phenom IIs is often in that 10-15% range. But "enthusiasts" often say that one chip "absolutely slaughters" another when the performance delta is far less than 10-15%. I would expect that the enthusiasts would be all up in arms if Intel or AMD decided to drop L3 from all of their chips, claiming it makes them "glacially slow" or some other hyperbole...

Comparing a 3.6 GHz i5-2500K to a 3.6 GHz i7-3930K and saying they differ just in cache size is about like comparing apples to oranges. You have a four-core, four-thread chip with a 128-bit IMC versus a six-core, 12-thread chip with a 256-bit IMC. Any program that runs similarly on both chips is single-threaded or very close to it, and essentially just looks at single-core throughput and maximum Turbo Boost clock speed. That is a shrinking number of programs these days, and that number will continue to shrink in the future.

Also, die space isn't really that big of a concern. The general argument about larger dies being bad essentially boils down to a higher cost to make larger dies. That is true, to a point as larger dies have higher intrinsic costs due to fewer possible candidate dies per wafer due to sheer die area, as well as lower percentage yields of larger dies compared to smaller ones. However, it costs a lot of money to make a separate smaller-die mask and production line, and most people forget that. It is often LESS expensive to use the great big old server die with the mountain of L3 rather than make a special smaller mask with a small or no L3 that gives overall similar performance. How do I know this? We have seen many, many parts from both makers where this is the case. That i7-3930K you mentioned above is a 415 mm^2 monstrosity of a die, and a decent chunk of the die is actually fused off, as the die physically has 8 cores and 20+ MB of L3 cache onboard! Other big wastes of space on desktop chips are the massively wide memory interfaces seen in the LGA2011 desktop chips and the disabled HT/QPI interfaces that are not active on anything but the 4+ socket versions of those chips. At least the L3 may help performance a little, disabled parts do nothing at all. IGPs also do essentially nothing on enthusiast chips as well, just ask anybody with an unlocked LGA1155 Sandy Bridge. Who buys an i7-2700K to use it with the crappy onboard IGP? But, the reason it is there is because Intel uses the same parent die for their quad-core mobile chips, and many of those really do just have the on-die IGP paired with the CPU.

*Cough*

Core i7 LGA 2011

"Shared L3 Cache" is the size of 4~5 "cores". It's the single largest component on the die.

I picked those two Intel CPUs because they're identical uArchs. Same branch prediction, instruction prefetch / scheduler. Disable the additional cores to make the core count the same, clock them both at the same frequency, disable HT and you have a pure test of L3. For something that eats the die space of 4+ cores, it provides less than a 15% performance increase from doubling it (6MB to 12MB).

As I said earlier, having L3 > not having L3, but of all the components on a die, the one that can be removed with the least effect on performance would be the L3. We've seen time and time again how little difference it makes. You could cut it in half and include two more cores and get more total processing power, or remove it entirely and replace it with 4~5 cores to nearly double your total processing power. L3 is cheap to implement and if you get a defect you can just down bin the die, which is essentially what Intel does. You throw it in there because you ran out of options or want to make a big marketing statement (20MB of L3!!!!)

I think the context of this discussion is being lost in the details. This isn't about whether L3 does or does not improve performance as any idiot can see that 1MB of L3 > 0MB of L3. It's about the degree that it improves performance and whether it's required to reach high performance. Especially as AMD has removed it from their lower power APU's to create space for a bigger GPU. The exact scenario I mentioned above about which component to remove to make room to something else while keeping die space small for low cost production.

palladin9479 · Jan 30, 2012

Picture of a Thuban die

Its 6MB of L3 takes up space that could easily fit 2 cores' worth of processing power.

Now I think I need to make a clarification because I know someone is about to try this with me. L3 only matters when your predictor sucks, or when your L2 is too small to contain all the data it needs to contain. On uArchs with small amounts of dedicated L2, L3 makes more of a difference as it can serve as a pseudo slow L2 to prevent misses from hitting main memory.

Here is a good place describing what I'm talking about

http://www.anandtech.com/show/2702/5

On an Intel Bloomfield design, L1 cache latency is 4 cycles, L2 is 11, L3 is 42. That puts L3 at roughly 3.8x the latency of L2 (about 280% slower). Intel got significantly better with Sandy Bridge: L3 cache is around 25 cycles, making it slightly more than 2x the L2 latency.

Phenom II is 3 cycles for L1, 15 for L2 and approx 50+ for L3.

Waiting 40+ cycles for data that should be there in 3~15 cycles is very bad for your CPU, it means your CPU is stalling out. Waiting for main memory is even worse though.
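A back-of-envelope way to see this is an average memory access time estimate. The sketch below uses the Bloomfield latencies quoted above (4 / 11 / 42 cycles). The 200-cycle main memory latency and all of the hit rates are assumptions picked for illustration, not measurements, and the model ignores prefetching and overlapping accesses.

```python
def average_access_cycles(l1_hit, l2_hit, l3_hit,
                          l1=4, l2=11, l3=42, memory=200):
    """Expected load latency in cycles for a simple three-level cache model.

    Each hit rate is the fraction of accesses reaching that level that hit
    there. Set l3_hit=0.0 to model a chip with no L3 at all.
    """
    reach_l2 = 1.0 - l1_hit
    reach_l3 = reach_l2 * (1.0 - l2_hit)
    reach_memory = reach_l3 * (1.0 - l3_hit)
    return (l1_hit * l1
            + reach_l2 * l2_hit * l2
            + reach_l3 * l3_hit * l3
            + reach_memory * memory)

# L3 latency relative to L2, from the figures in the post.
print(f"L3 / L2 latency: {42 / 11:.2f}x")

# Hypothetical hit rates: a well-behaved L2 versus one that misses a lot.
for label, l2_hit in (("good L2 (99% hit)", 0.99), ("poor L2 (80% hit)", 0.80)):
    with_l3 = average_access_cycles(0.95, l2_hit, 0.80)
    without_l3 = average_access_cycles(0.95, l2_hit, 0.0)
    print(f"{label}: {with_l3:.2f} cycles with L3, {without_l3:.2f} without")
```

In this model the L3 barely moves the average when L2 is hitting almost all the time, and it saves a lot more once L2 starts missing. The size of the effect depends heavily on the assumed hit rates.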

As I've said before, L3 will not make a good design great, but it'll make a bad design mediocre (Bulldozer). BD utilizes a large L3 cache to mask its poor scheduling / prediction rates. If they haven't fixed that, then PD / Trinity is going to suck hard.

-=Edit=-

Found a better break down of various latencies to illustrate what I'm talking about.

http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Looks like the i5-2500K has hit the "sweet spot" for L3 cache to core to latency ratio.

It also demonstrates the importance of prefetchers and prediction. Sequential searches showed a 400~500% increase in efficiency from the prefetcher grabbing data before the program asked for it, resulting in lower latencies. This is where I believe Bulldozer screwed up badly.

gamerk316 · Jan 30, 2012

But, come on gamerk, at least give them the benefit of doubt.

If anyone cares to remember, I was the first [and only?] person who WAY back in the BD rumor thread speculated that, due to suspected IPC regressions, BD could be slower than PII if it wasn't clocked in the 4GHz range.

So now I see a die shrink, with about the same clocks, and no news of any actual improvements in the design, and you want me to stay optimistic?

L3 cache is around 25 cycles, making it slightly more than 2x the L2 latency.

That's actually outstanding though, considering 20-30 cycles for the L2 is considered "normal". Compared to 80-100 cycles for main memory?

I'm actually VERY interested to see how DDR4, and its vastly improved clocks, affects this equation. With CPU-RAM clocks approaching 1:1 again, large amounts of cache may actually become a hindrance [IE: The die space might be better off used for something else, rather than cache].

jdwii · Jan 30, 2012

I'm with palladin9479: L3 cache is not that important and should be taken away to save die space, especially in APUs, where I would rather see an extra module or more Radeon cores. Now on real high-end CPUs, such as in the server market, L3 cache is needed. But I really think AMD should strip the L3 cache from the 4-core BD to make it really cheap, like what they did with the 4-core Athlons. If they did this they could charge around $80 for the 4-core BD and it wouldn't be that much slower. Heck, they might be able to save 20-30% power consumption and then clock it up to around 4.2GHz stock and have a faster 4-core BD than they do now.

OR if the prediction unit is as weak as some say, maybe they should just improve that and then strip the L3 away on the lower-end models.

esrever · Jan 30, 2012

The L3 is really simple logic and easy to put into a chip, which is why it's used.

bawchicawawa · Jan 30, 2012

You guys also have to remember that the current APUs have twice as much L2 cache as the Phenoms do, which is good.

Does anyone know if AMD will do the same for Trinity: take the L2 cache from Piledriver and double it?
