AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
Right now every "core" on a chip has its own SIMD FPU. They will remove all of those and have each core use the iGPU's array instead. Remember, GPUs are the equivalent of 12~30+ FPUs, so even with eight cores sharing one SIMD iGPU there is plenty of power to go around. It's also a more efficient use of processing resources.

Honestly the "FPU" has already been replaced by SIMD units. We just refer to them both by the same word.

This seems like an interesting idea, but I'm not buying it. The FPU uses up so little real estate as it is that introducing long wires running across the chip from the CPU pipeline to the iGPU and forcing CPU SIMD instructions to compete with graphics seems illogical.

You cannot simply route long wires running from one end of the die to the other without incurring penalties in terms of power, timing and signal integrity.

If they do manage to make this work I'd be very interested to see what the die shot looks like.
 

iGPUs now consume ~50% of the CPU die; they have their own instruction/data caches, registers and stack.

Most of what you said would make sense ~IF~ it were an integer unit. A SIMD FPU, on the other hand, is designed to be a co-processor: it's a separate processing unit with its own registers and stack, though it shares the cache. Modern "FPUs" are just SIMD units, and modern GPUs are just huge SIMD arrays. A 7870 has a config of 1280:80:32; that's 1280 32-bit SIMD execution units, or 320 128-bit SIMD FPUs. Just to give you an idea of what's inside a GPU. Now obviously an iGPU is much smaller; next generation should be the equivalent of 24~32 128-bit SIMD FPUs (96~128 texture units seems more than reasonable).
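
To put that conversion in plain numbers, here's a tiny C++ sketch; the 1280:80:32 config and the 32-bit-per-unit width come from above, everything else is just illustration:

    #include <cstdio>

    int main() {
        // Assumed from the post above: a 7870-class GPU with a 1280:80:32 config,
        // i.e. 1280 shaders, each a 32-bit SIMD execution unit.
        const int shaders        = 1280;
        const int bits_per_lane  = 32;
        const int fpu_width_bits = 128;  // width of one "classic" SIMD FPU (an SSE register)

        // 1280 lanes * 32 bits of execution width, expressed as 128-bit FPUs:
        std::printf("%d shaders ~= %d 128-bit SIMD FPUs\n",
                    shaders, shaders * bits_per_lane / fpu_width_bits);      // 320

        // Same conversion for the hypothetical next-gen iGPU mentioned above.
        std::printf("iGPU: %d~%d 128-bit FPU equivalents\n",
                    96 * bits_per_lane / fpu_width_bits,                     // 24
                    128 * bits_per_lane / fpu_width_bits);                   // 32
        return 0;
    }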

And while I know you're trying to use hyperbole, it doesn't help much on this kind of forum. You're not "running wires" inside an integrated circuit made out of silicon. There will be a path created from the instruction decoder / predictor to the GPU's instruction queue, with a higher priority on those instructions being processed. Seeing as iGPUs are already 50% of a CPU's die, it becomes a matter of physically ensuring that the instruction control units are placed near each other.

Now for the interesting part. This entire time I've been speaking about APUs, which have fully functioning iGPUs inside them. High-performance desktop CPUs use L3 instead of the iGPU for a slight performance boost. If you need to implement a centralized, GPU-style SIMD array for the purpose of processing SSE-type instructions, then taking out some L3 becomes the natural choice. Since this unit doesn't need to actually process graphics data itself, you can further remove components and make it much smaller than it would be on an APU. Even at "only" 64 texture units, that is 16 128-bit FPUs' worth of power, or a 2x increase over what you have now.

People need to realize that the "FPU" is just another part of a generic CPU (ALU / AGU / MMU / etc.). It's a specialized co-processor that is addressed separately and treated like an additional CPU. We've had integrated FPUs for so long that people forget they're not part of the x86 ISA. Decoupling them is easy.
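
For anyone who wants to see the split concretely, here's a minimal sketch (GCC/Clang on x86 assumed; the function names are mine) of the same multiply done through the legacy x87 co-processor path and through SSE — two different instruction sets, two different register files, both doing "FPU" work:

    #include <emmintrin.h>
    #include <cstdio>

    // Legacy x87 path: runs on the co-processor's 8-entry st() register stack,
    // driven by its own instruction set (fld/fmulp/fstp).
    double mul_x87(double a, double b) {
        double r;
        __asm__ ("fldl %1\n\t"
                 "fldl %2\n\t"
                 "fmulp\n\t"
                 "fstpl %0"
                 : "=m"(r) : "m"(a), "m"(b));
        return r;
    }

    // SSE2 path: the same multiply, but in the xmm register file, which is what
    // modern compilers emit and what the FP/SIMD scheduler actually sees.
    double mul_sse(double a, double b) {
        return _mm_cvtsd_f64(_mm_mul_sd(_mm_set_sd(a), _mm_set_sd(b)));
    }

    int main() {
        std::printf("x87: %f  sse: %f\n", mul_x87(3.0, 7.0), mul_sse(3.0, 7.0));
        return 0;
    }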
 
iGPUs now consume ...

I just walked downstairs after watching The Big Bang Theory and heard Sheldon's voice as I read this post. :lol:

It seems to me that if you were to replace the FPU with an iGPU you would literally replace it, i.e. pair up multiple EUs/shaders with each integer core/module. That removes the latency problem of re-routing SSE etc. instructions to the other side of the die to be processed by the iGPU. If AMD bound 8 GCN shaders to each module it would replace the current 2x128-bit FPU, but the current SIMD engines in Trinity's iGPU have 16 shaders, which, as palladin9479 points out, would double the shared FPU of BD/PD.
 

The idea is also to combine those units into a central one. Eight "cores" means eight FPUs, and even in BD's case, where they bonded two 128-bit FPUs into a single big one, you're still looking at a lot of unused potential. A core's single FPU is rarely kept busy the whole time, and when you're talking eight cores, the chances of even half of those being busy are slim. There are rare times in general computing when you do need SIMD performance, and when you need it you REALLY need it. It'll be the equivalent of having 16~32 "SIMD FPUs" in a general pool that any core can use, and in the case of APUs they can also be used for GPU processing. Couple that with OpenCL / GPGPU and you can see what direction AMD (and now Intel) is going in.
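
Just to make "SIMD performance" concrete, this is the sort of packed-SSE loop that would get queued onto such a shared array (plain SSE intrinsics, purely illustrative; nothing here is specific to any AMD design):

    #include <emmintrin.h>
    #include <cstddef>

    // dst[i] = a[i] * k + b[i] over a buffer -- the kind of burst SIMD work being
    // talked about. Names and the 4-wide SSE width are just illustrative.
    void scale_add(float* dst, const float* a, const float* b, float k, std::size_t n) {
        const __m128 vk = _mm_set1_ps(k);
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {                       // 4 floats per 128-bit op
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(_mm_mul_ps(va, vk), vb));
        }
        for (; i < n; ++i)                                 // scalar tail
            dst[i] = a[i] * k + b[i];
    }

    int main() {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {}, d[8];
        scale_add(d, a, b, 2.0f, 8);
        return 0;
    }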
 

Source is a great engine. L4D was the first to have multicore rendering but I think it only scales to about 4 cores and then it sort of falls off:

http://images.anandtech.com/graphs/graph4083/35039.png

This sort of helps prove my point as well:

http://images.hardwarecanucks.com/image/mac/reviews/AMD/Bulldozer/26.jpg

http://images.hardwarecanucks.com/image/mac/reviews/AMD/Bulldozer/26a.jpg


TF2 as well, for a bit of a different game.

And trust me, I am one of the biggest VALVe fans. I have every game, buy every game on Steam, have well over 1500 hours in TF2 and am in a few of their current betas as well. But while I love the modularity of Source and the fact that it's one of the best at utilizing multiple cores, it doesn't scale much beyond 4 cores.
 
Love Valve; I think every upcoming company (gaming or not) should use them as a role model.
I agree that Source doesn't seem to benefit much past 4 cores, but it definitely uses more than 4 in my experience. It isn't much, but there is some load on the other cores.

[CPU benchmark chart from an Ivy Bridge 3770K review: IvyBridge_3770K_88.jpg]


Looking at this, there is an obvious jump in performance from the 2600k to a 3960X, and from a 920 to a 980X.

Well, let me try getting rid of that GPU bottleneck from my quick test. That should help clear this up a little. Either way, Source is still one of the best game engines created to date, IMO. Props to Valve for giving benefit to the >5% of people who use Intel's hexa-cores.

EDIT: I wasn't aware of how big the clock speed difference between the 920 and 980X is. My bad.
 
Thing to note with those Trinity benchmarks is that Trinity has to worry about energy efficiency, not to mention sharing half its die with an iGPU and having no L3 cache. Vishera won't have to worry about those things, so 15% is easily doable and maybe 20% could even be a possibility. If it comes out at 15-20%, it will be within a reasonable distance of Intel, and as long as they don't price them stupidly like they did with the 8150 at nearly $300, it can be very competitive. If they bring their flagship out at $200, they've got something. $200 for a very overclockable 8-core processor that is only around 10% slower than Intel would not be a bad deal. Vishera doesn't have to be better, just competitive.




It depends on what you're running now. I needed 4.3 to get my FX to perform better than my 4 GHz 1090. If you've got something like an X4 running in the mid-3s, then 4.2 on an FX would indeed be a decent upgrade. Might not be worth $160 though, unless you count the coolness of having a shiny new toy to play with as being worth anything. 😀
So you are sure AMD will equal Intel by the release of the 8320?
 
Since this unit doesn't need to actually process graphics data itself, you can further remove components and make it much smaller than it would be on an APU. Even at "only" 64 texture units, that is 16 128-bit FPUs' worth of power, or a 2x increase over what you have now.


Now I understand. Where the iGPU is used as a general-purpose SIMD array, I can see it as a viable option for throughput-sensitive applications, but I still don't see this happening without increasing latency or power, both of which matter for consumer-level devices.

Where SIMD instructions have to compete with graphics work, I don't see this improving overall performance.
 
The idea is also to combine those units into a central one. Eight "cores" means eight FPUs, and even in BD's case, where they bonded two 128-bit FPUs into a single big one, you're still looking at a lot of unused potential. A core's single FPU is rarely kept busy the whole time, and when you're talking eight cores, the chances of even half of those being busy are slim. There are rare times in general computing when you do need SIMD performance, and when you need it you REALLY need it. It'll be the equivalent of having 16~32 "SIMD FPUs" in a general pool that any core can use, and in the case of APUs they can also be used for GPU processing. Couple that with OpenCL / GPGPU and you can see what direction AMD (and now Intel) is going in.

This is where my understanding of x86/x87 breaks down. The way I understand it, the OS schedules threads from a process onto individual 'cores' (or register stacks), and each core decides the best way to process the instructions from those threads. So if a program makes an SSE call, it is executed in the register stack of the core that thread has been assigned to. If the integer core is decoupled from a large SIMD array then you are introducing latency to route that instruction to another scheduler.

It might be beneficial to have semi-shared schedulers or at least place the two as close as possible since the calls would come from the integer cores.
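
Here's the mental model I'm working from, as a minimal Windows sketch (the explicit pinning is just for illustration; normally the scheduler picks the core, and none of this is specific to the scheme being discussed):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Pin this thread to logical core 0, just to make "the core it was
        // assigned to" explicit; normally the OS scheduler picks the core.
        if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
            std::printf("SetThreadAffinityMask failed\n");
            return 1;
        }

        // Any FP/SSE work this thread now issues is decoded and executed by the
        // core (or module) it is running on -- it doesn't get handed elsewhere.
        volatile float acc = 0.0f;
        for (int i = 0; i < 1000000; ++i)
            acc = acc + i * 0.5f;
        std::printf("done, acc=%f\n", static_cast<float>(acc));
        return 0;
    }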

The way around this would be an ISA extension, and those are amusing to watch since the cross-license agreement means both companies get to muddy the waters; we have seen how successful AMD has been at pushing extensions on its own (FMA4, XOP).

That doesn't do anything for legacy code that assumes the FPU and integer stacks are integrated and doesn't expect an SSE call to be routed across the die, or the need to wait for the results to come back from the iGPU. If an SSE instruction is significantly slower due to the latency introduced, then this should be a non-starter for AMD. Increased complexity for reduced performance? Will they really make the same mistake twice?

Please correct any assumptions that are incorrect (possibly this entire post 😗 ). As I said this is not an area where I have a great depth of knowledge and I would appreciate additional insight.
 
If the integer core is decoupled from a large SIMD array then you are introducing latency to route that instruction to another scheduler.

FP/SIMD instructions already have a separate scheduler from the integer pipeline.

[Block diagram of a Bulldozer module showing the separate integer and FP/SIMD schedulers: 3.png]


That doesn't do anything for legacy code that assumes the FPU and integer stacks are integrated and doesn't expect an SSE call to be routed across the die, or the need to wait for the results to come back from the iGPU.

Legacy x87 code already assumes that the FPU is an independent co-processor, which palladin has referenced as a reason this new implementation will work. SSE, AVX, etc. are ISA extensions that tell the CPU which scheduler to use, and this is apparent to the OS as well.
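
As an aside, this is also why software probes for these extensions at runtime before choosing a code path; a minimal sketch using GCC/Clang builtins (assumed toolchain, nothing AMD-specific):

    #include <cstdio>

    int main() {
        __builtin_cpu_init();
        std::printf("SSE2: %d\n", __builtin_cpu_supports("sse2") != 0);
        std::printf("AVX : %d\n", __builtin_cpu_supports("avx")  != 0);
        std::printf("FMA4: %d\n", __builtin_cpu_supports("fma4") != 0);  // AMD-only extension
        return 0;
    }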

If an SSE instruction is significantly slower due to the latency introduced then this should be a non-starter for AMD. Increased complexity for reduced performance? Will they really make the same mistake twice?

Please correct any assumptions that are incorrect (possibly this entire post 😗 ). As I said this is not an area where I have a great depth of knowledge and I would appreciate additional insight.

Higher latency does not affect throughput as burst size goes to infinity, so on server or HPC workloads, where there may be a constant stream of "FP" instructions in the pipeline, the increased latency does not necessarily affect performance.

It was my concern, however, that the increased latency and complexity would make this implementation detrimental to consumer-level workloads.
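
To illustrate the burst-size point with numbers (the 30-cycle penalty and the 1-op-per-cycle throughput below are purely hypothetical, not claims about any real design):

    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double extra_latency = 30.0;  // hypothetical extra round-trip, in cycles
        const double cycles_per_op = 1.0;   // pipelined throughput: one op per cycle
        for (double burst : {1.0, 10.0, 1000.0, 1000000.0}) {
            // Amortized cost per op: the one-time latency is spread over the burst.
            double effective = (extra_latency + burst * cycles_per_op) / burst;
            std::printf("burst of %10.0f ops -> %.3f cycles/op\n", burst, effective);
        }
        return 0;
    }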
 
It still looks like it all comes from the module's shared front-end. How would that work if the integer cores were decoupled from the SIMD array in an iGPU? Would it still need to decode the instructions in the module and then hand them off to the iGPU's scheduler?
 

Yes exactly. I agree with you here. Palladin stated the following:

Seeing as iGPUs are already 50% of a CPU's die, it becomes a matter of physically ensuring that the instruction control units are placed near each other.

The data still has a much longer physical path to traverse through the silicon. Consider the worst case, where the "FP" data uses the execution resources furthest away from the integer/FP fork. That's an extra distance equal to roughly "50% of the CPU's die" that must be traveled twice. There would certainly be added latency.
 
Not necessarily. An AMD and an Intel chip will run neck and neck in almost all games at 1920x1080 with a single GTX 570, but crank it up to 2560x1600 or even higher and add a second GTX 570 and you'll see the Intel start to pull ahead. High-resolution and multi-GPU configurations are becoming more and more popular, and those systems will see the benefits of a faster processor.

Duh? Adding a second GPU will make games CPU limited. Hence why Intel is faster in multi-GPU configs.

I also note that, cost wise, with the exception of the almighty 8800/9800 GTs, SLI/CF setups don't tend to age well...
 
Also, there is a really easy way to test pure CPU scaling in any game: Turn all settings to minimum and bench. This will 99.9% of the time remove the GPU bottleneck.

Then if you want, start cranking up the settings, and you can see where certain architectures start to rapidly fall back into the pack, and when the GPU becomes the major factor.
 
http://i1063.photobucket.com/albums/t505/ViridianCrystal/L4DCoreScaling4.png
^locked at 4 cores

http://i1063.photobucket.com/albums/t505/ViridianCrystal/L4DCoreScaling8.png
^all cores available

I'd say Source scales up to 6 cores, with minimal returns after that. For a game released back in 2008, that is very good. Portal 2 probably does even better.

Point is, cores matter now, and will in the future. Hopefully we will see some advancement from Intel soon.
 

But here's the point you miss: those cores, not being worked very hard, don't GAIN you anything. Reduce the number of cores to four and average core usage jumps to 60%, but you still aren't maxing the CPU. As a result, clock speed and IPC remain the dominant factors, even when "only" four cores are available.

Now, if the cores were stressed 90%+, then you'd have a valid point. But core scaling isn't the dominant factor yet; IPC and clock speed are.
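
A toy throughput model of that argument, with made-up IPC and clock numbers (only the utilization figures follow from the observation above):

    #include <cstdio>

    int main() {
        // Work per second ~ cores * utilization * IPC * clock. The ~30% at eight
        // cores and 60% at four cores follow the post; IPC and clock are made up.
        const double ipc = 1.5, clock_ghz = 4.0;
        double eight_cores = 8 * 0.30 * ipc * clock_ghz;
        double four_cores  = 4 * 0.60 * ipc * clock_ghz;
        std::printf("8 cores @ 30%%: %.1f   4 cores @ 60%%: %.1f  (same throughput)\n",
                    eight_cores, four_cores);
        return 0;
    }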
 
1) In both of these cases I am still GPU limited. If I had CrossFire 7970s and got rid of the bottleneck, then this changes. I would easily stress 4 cores to 100%. However, the frame rate would be lower than if the game ran on 8 cores at 80-90%.

2) If the only thing I was doing was playing the game, that would be true, but that often isn't the case. People multitask. Half a dozen tabs open in Chrome, watching a show on a second monitor, hosting/talking in a Skype call, recording gameplay, and exporting video can all happen at the same time. Six single-threaded tasks will not all run on one thread; they will move around, using up to 6 threads.
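
For the curious, that's all the multitasking case amounts to from the OS's point of view; a minimal C++11 sketch with stand-in workloads:

    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        // Six independent single-threaded jobs (stand-ins for browser tabs, a
        // Skype call, recording, etc.) become six OS threads, which the
        // scheduler is free to spread across however many cores exist.
        std::vector<std::thread> jobs;
        for (int id = 0; id < 6; ++id) {
            jobs.emplace_back([id] {
                volatile unsigned long long x = 0;
                for (unsigned long long i = 0; i < 100000000ULL; ++i)
                    x = x + i;                       // busy work
                std::printf("job %d finished\n", id);
            });
        }
        for (auto& t : jobs)
            t.join();
        return 0;
    }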
 

The easiest way to remove a GPU bottleneck is to lower the resolution to something very low. That's where the CPU will do more of the work.

Still, overall I max out every Source game and easily hit the 300 FPS limit. As for core scaling, I still don't think the difference between 4, 6 and 8+ cores is enough. It's at most a 20% performance increase for twice the cores (4 -> 8), which is not that great overall.

Add to that the fact that most games still don't use more than two cores, and it's not a necessary part yet; I doubt more cores will be useful very soon. I am all for more cores for less $$, but the truth is that software is well behind hardware and games are still mostly stuck at the current console level, DX9. Until they move to DX11, which actually makes better use of multiple cores, it will stay that way, and that probably won't happen until 2013, probably Q4 around the holidays. Plus we will be on DX12 by then.
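
On that 4 -> 8 core point, a quick Amdahl's-law sketch shows roughly what parallel fraction a ~20% gain from doubling the cores corresponds to (the fractions below are assumed values, not measurements of Source):

    #include <cstdio>
    #include <initializer_list>

    int main() {
        // Amdahl: speedup on n cores = 1 / ((1 - p) + p / n), p = parallel fraction.
        // A ~20% gain from 4 -> 8 cores lands somewhere around p = 0.65~0.7 here.
        for (double p : {0.50, 0.75, 0.90}) {
            double s4 = 1.0 / ((1.0 - p) + p / 4.0);
            double s8 = 1.0 / ((1.0 - p) + p / 8.0);
            std::printf("p = %.2f: 8 cores vs 4 cores = %.2fx\n", p, s8 / s4);
        }
        return 0;
    }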
 
1) In both of these cases I am still GPU limited. If I had CrossFire 7970s and got rid of the bottleneck, then this changes. I would easily stress 4 cores to 100%. However, the frame rate would be lower than if the game ran on 8 cores at 80-90%.

2) If the only thing I was doing was playing the game, that would be true, but that often isn't the case. People multitask. Half a dozen tabs open in Chrome, watching a show on a second monitor, hosting/talking in a Skype call, recording gameplay, and exporting video can all happen at the same time. Six single-threaded tasks will not all run on one thread; they will move around, using up to 6 threads.

1: Reduce resolution/settings to minimum, then bench. Repeat with the CPU you are comparing to.

I'm still upset HardOCP stopped doing its low-resolution testing, since virtually no one tests CPUs in non-GPU-bottlenecked situations anymore...

2: I would argue very, very few people multitask and game at the same time.

I'd also point out that in Windows, foreground tasks get a priority bump, so those extra Chrome tabs? Yeah, they'll run when a game thread hits some sort of I/O wait and gets replaced by some other waiting thread. In Windows, the highest-priority thread that is capable of running will be run, period. I'd argue multitasking situations are really the only place where you'd run into a memory/HDD bottleneck, as I/O waits due to paging become much more important... [anyone willing to test out that theory?]
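
For anyone curious, here's a small Windows API illustration of process priority classes (manually bumping a class here, which is related to, but not the same mechanism as, the automatic foreground boost):

    #include <windows.h>
    #include <cstdio>

    int main() {
        DWORD before = GetPriorityClass(GetCurrentProcess());

        // Manually raise this process (say, a foreground game) above normal;
        // background work stays at NORMAL_PRIORITY_CLASS and only runs when the
        // higher-priority threads are blocked or waiting on I/O.
        if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS)) {
            std::printf("SetPriorityClass failed: %lu\n", GetLastError());
            return 1;
        }

        std::printf("priority class: 0x%lx -> 0x%lx\n",
                    (unsigned long)before,
                    (unsigned long)GetPriorityClass(GetCurrentProcess()));
        return 0;
    }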
 
Exactly what does it matter if you have to run minimum resolution to "see a virtual bottleneck"? Who gives a ____ really ...

"dude I run 5,000 fps at 480x320 16 colors" " wow man, thats some major bragging rights, how does that game look?" "dude, like the atari 2600" "hahahaha you idiot"

http://video.answers.com/battlefield-3-meets-minecraft-517188631

If you run with video settings maxed out and it's bottlenecking your GPU, guess what, it doesn't frigging matter what happens at low resolutions; you're still GPU bottlenecked at your play settings. Lowering the settings doesn't change that fact, period, end of story, because it's not something you're going to do just to play.

I didn't buy 2 video cards so I could run 640x480, 800x600, or 1280x1024. I bought them to play at 1920x1200, ultra maxed AA, that's it. I don't care about any other resolution and neither does my CPU, since I don't play there.

So again, what does it matter what happens at low resolutions? 2, 3, 4 years down the road, if you upgrade your video cards, you might see a bottleneck? Guess what, by then you're looking for a CPU anyway, so again, what does it matter today?

Yes, it's interesting to see how CPUs compare at low resolutions, but that's it; you don't use those settings to see how your system is going to perform, just like you don't use single-player benches for multiplayer gaming.
 