AMD CPU speculation... and expert conjecture


szatkus

Honorable


K10 is slightly better than Piledriver clock for clock, but overall ST performance is more or less the same. Except for AVX code, of course :)
 

logainofhades

Titan
Moderator


I disagree with that. A 4.2GHz FX 4350 mostly loses to a PhII X4 965 @ 4.0GHz in application performance.

[Chart: Combined applications performance]


In gaming it is just slightly better.
[Chart: Combined average gaming performance]


Faildozer and Piledriver are not better, unless you take into account the multitasking of the FX 83xx chips. A PhII X6 vs an FX 6300 would not really be any different from the PhII X4 vs an FX 4350, clock for clock.

 

szatkus

Honorable


It's not a fair comparison. 1 module != 2 cores (unless you work in AMD's marketing department). Two K10 cores with 2x1MB L2 are slightly larger than a Bulldozer module, and that's without FMACs, etc.
 

jdwii

Splendid


Actually, I meant x86 decoders; there are only 4 per module in Bulldozer's design.
 

jdwii

Splendid


This is actually not true all the time. When I owned my 1100T at 3.9GHz, it was even with my FX 8350 at 4.3GHz in the Dolphin emulator; they have a benchmark that can prove that. Fritz Chess also reported similar findings in my tests. The only areas where I see improvements are programs that can use all 8 cores, or newer programs that take advantage of the architecture, for example AVX.
 

jdwii

Splendid


Again, my third post, but in newer games expect better performance as programmers optimize their code for Bulldozer or Piledriver. For example, take a look at Watch Dogs here:
http://www.techspot.com/review/827-watch-dogs-benchmarks/page5.html

When I upgraded, gaming was easily the bigger improvement, with normal applications staying around the same speed.
 

logainofhades

Titan
Moderator
The real argument, though, is that a tweaked K10 would beat FX. Give K10 all the actual improvements that FX had, plus a die shrink, and it would be a better chip. AMD should have done that in the first place. They would have been far more competitive than FX is.
 

juanrga

Distinguished
BANNED


Wow! What are you proposing? That Bulldozer and Piledriver be compared only to themselves because no one else in the industry uses CMT modules? Are you also proposing to compare only processors of the same die size?

The point is that my above link to Tom's review shows that an ancient PhII X4 965 @ 4GHz can match or even beat the newer FX 4350 @ 4.2GHz, contrary to what some here, such as 8350rocks, claimed.
 

szatkus

Honorable


To be precise: "On Bulldozer and Piledriver, the shared decode unit can handle four instructions per clock cycle."

Source: http://www.agner.org/optimize/microarchitecture.pdf

That's 4 instructions per clock for a single thread; when both cores of a module are active, the two threads share those slots. On Steamroller it isn't a major bottleneck even in MT benchmarks, since each core gets its own decoder.
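A quick back-of-the-envelope sketch of that sharing math (my own illustration, using Agner's 4-wide figure and assuming Steamroller's duplicated per-core decoders):

```c
/* Rough arithmetic only, not a measurement: effective x86 decode slots per
 * core per clock. Bulldozer/Piledriver share one 4-wide decoder per module;
 * Steamroller is assumed to give each core its own 4-wide decoder. */
#include <stdio.h>

int main(void) {
    const double shared_width  = 4.0;  /* BD/PD: one decoder per module (Agner Fog) */
    const double percore_width = 4.0;  /* SR: assumed one decoder per core */

    printf("BD/PD, one thread running : %.1f instr/clk\n", shared_width);
    printf("BD/PD, both cores running : %.1f instr/clk per core\n", shared_width / 2);
    printf("Steamroller, both running : %.1f instr/clk per core\n", percore_width);
    return 0;
}
```

Which is why the front end hurts most in MT loads on BD/PD, but stops being the obvious choke point on SR.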
 

szatkus

Honorable


Bulldozer was designed to achieve high clocks. Just look at Llano and Trinity.
A 32nm K10 FX would have been 3.4/3.9GHz at best. As I said, you would need a major redesign to get it higher.
They found out that Bulldozer was flawed around 2009, too late to do that and still deliver in 2011.
 

szatkus

Honorable


Performance/mm^2 and performance/watt, with the same features as Bulldozer (AVX, FMA, etc.). In both cases K10 would likely lose. Only people at AMD could do proper simulations, and I believe they did; that's why FX was Bulldozer, Llano was K10, and Trinity was Piledriver.
 
Well, the 4350 is on par with a PhII in my eyes, because of games. And if you take into account the higher clocks and AVX for programs that support it, it's an upgrade (finally).

Now, due to CMT, a PhII should be compared to the 8350, not the 4350. 80% of a core is still not 100% of a core :p

In any case, no need to discuss K10.5, since I'm sure all the good things they had in K10 are inside PD and SR. They just have to improve on them. CMT is not a bad idea at all and I agree with palladin: the problem is not CMT, but other stuff.

Cheers!
 
Bulldozer's problem has a lot to do with a lot of things. I feel the main ones are:
- the memory interface: Intel is able to get so much more bandwidth at much lower latency than AMD. HyperTransport in the current CPUs seems ancient compared to what Intel can do.
- the Flex FPU, which never really worked like AMD was hoping. It massively underperforms, and I'm sure AMD can't even fix it without a large change to the architecture.
- the caches: compared to Intel, AMD's caches have way too much latency, and their organization due to the modular design looks terrible; they suck a lot of power and occupy a lot of die area.

Those are the main things that need to be fixed. They are what really put AMD behind, and they are why just putting more ALUs and AGUs into a core won't help at all. Considering that 2 ALUs per core (Piledriver) get pretty much exactly the same performance as 3 ALUs per core (Phenom), just increasing the count won't do a thing except leave an ALU idle 99% of the time. It is probably why AMD cut the ALU in the first place.
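To illustrate that last point with a generic example (my own sketch, nothing AMD-specific): a serial stream of dependent integer ops can't keep extra ALUs busy no matter how many you add, while independent work can.

```c
/* Hypothetical micro-benchmark: both loops do the same number of xor+add
 * operations, but the first is one long dependency chain (each result feeds
 * the next), so extra ALUs sit idle, while the second runs four independent
 * chains the out-of-order scheduler can overlap. Compile with: gcc -O2 ilp.c */
#include <stdio.h>
#include <time.h>

#define N 400000000L

int main(void) {
    volatile long seed = 1;            /* keeps the compiler from pre-computing results */
    clock_t t0;

    t0 = clock();
    long a = seed;
    for (long i = 0; i < N; i++)
        a = (a ^ i) + i;               /* each iteration depends on the previous one */
    double chained = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    long b0 = seed, b1 = seed, b2 = seed, b3 = seed;
    for (long i = 0; i < N; i += 4) {  /* four independent dependency chains */
        b0 = (b0 ^ i) + i;
        b1 = (b1 ^ i) + i;
        b2 = (b2 ^ i) + i;
        b3 = (b3 ^ i) + i;
    }
    double split = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("chained: %.2fs   split: %.2fs   (%ld %ld)\n",
           chained, split, a, b0 + b1 + b2 + b3);
    return 0;
}
```

The split version runs noticeably faster on any wide out-of-order core; how big the gap gets depends on how many ALUs the chip has and how much parallelism its front end can extract. Most real integer code looks a lot more like the chained loop.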

CMT in its current form is not great, and I don't think AMD will use it in Zen. Too many programs are lightly threaded, so AMD will have to try to make more powerful cores. Intel's approach with SMT is much better at tackling both multithreaded and single-threaded applications, even if it isn't as efficient at running many threads. I don't know if AMD will be able to get SMT working well on their first try, though. Intel's Hyper-Threading took many generations to get where it is today, and when they first used it way back in the day, it was terrible.
 

szatkus

Honorable

Bandwidth is connected with BD performance and the cache. When you get data from memory it must be scheduled efficiently.

No, actually the Flex FPU looks OK. AVX performance maybe isn't too good, but on legacy code it doesn't look like a bottleneck.

Agree 100%.

CMT and SMT each have their own disadvantages. We'll see; I won't be surprised if Zen is a CMT architecture, but I bet on something between pure SMT and CMT.
 
The bandwidth issue is not just from the Bulldozer cache; it's the same in Phenom. Even though AMD supports faster memory speeds, their memory reads and writes were always slower than Intel's. The Flex FPU was supposed to give double the performance of the old Phenom FPU and should have been able to do the same number of operations as Intel's Sandy Bridge and Ivy Bridge; obviously it never got close to that. Even on legacy instructions it's only doing 128-bit operations one at a time, when it was designed to do two 128-bit operations.
 

8350rocks

Distinguished


Because something in the pipeline is not quite right...not sure how or why...but something is bottlenecking in the pipeline preventing the flex FPU from doubling up.
 

juanrga

Distinguished
BANNED
Benchmarks show that quad-core Phenom II @ 4GHz matches quad-core Piledriver FX @ 4.2GHz, but this doesn't mean that K10 is competitive against Intel designs.

K10 arch was at its limits and, lacking any better idea, K11 engineers decided to use the old MCMT architecture invented in 1999 by Andy Glew.

The whole CMT (Clustered MultiThreading) concept is nonsense; not only was it rejected by Intel when Andy worked for them, but the whole industry has rejected it. CMT doesn't solve anything that isn't solved by standard CMP, but it generates new troubles.

On top of CMT, Bulldozer engineers decided to take a crazy speed-demon approach (fueled by the usual SOI hype), pretended a CPU is a GPU, added weak memory and cache subsystems, and used automated tools, which increased the die area and power consumption.

What could AMD do? It doesn't matter; they don't have the resources, the engineers, or the time.

What will AMD do? I only see two possibilities:

(a) They take Puma as a base and scale it up, doing a wider core that hits higher clocks, or

(b) They take the Excavator module and CMP-ify it, transforming it into a wider core.

I have the feeling that Keller will try (b) due to his experience with the 21264 design. I think he will transform the cores into coupled integer clusters.
 

juanrga

Distinguished
BANNED


Sandy Bridge and Ivy Bridge have 256-bit FP per core. AMD has 256-bit FP per module. The shared FP unit can do 2x 128-bit operations at one time, but it can then only be accessed by one core on a given cycle, which effectively cuts the throughput in half.
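A rough peak-throughput sketch of that comparison (textbook per-clock numbers, my own arithmetic, not benchmark results):

```c
/* Peak single-precision FLOPs per clock, under the usual assumptions:
 * a Sandy/Ivy Bridge core can issue one 256-bit FP add and one 256-bit FP
 * multiply per cycle; a Bulldozer/Piledriver module has two shared 128-bit
 * FMAC pipes, and an FMA counts as two FLOPs per lane. */
#include <stdio.h>

int main(void) {
    const int intel_core_flops = 8 + 8;                  /* 8-lane add + 8-lane mul = 16/clk per core */
    const int amd_module_flops = 2 * 4 * 2;              /* 2 pipes x 4 SP lanes x 2 FLOPs = 16/clk per module */
    const double amd_per_core  = amd_module_flops / 2.0; /* shared between the module's two cores */

    printf("Intel SB/IVB : %d SP FLOPs/clk per core\n", intel_core_flops);
    printf("AMD BD/PD    : %d SP FLOPs/clk per module, ~%.0f per core when both cores use it\n",
           amd_module_flops, amd_per_core);
    return 0;
}
```

So on paper the module matches one Intel core, which is exactly the per-module vs per-core point above.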
 

szatkus

Honorable


Sounds like:
http://semiaccurate.com/2011/10/17/bulldozer-doesnt-have-just-a-single-problem/

@juanrga you forget about a few things.

OK, now I'm a little too drunk. Cya tomorrow.
 
For those saying more execution units wouldn't improve performance over Piledriver, I would like them to look at Intel's design, since that seems like a flawed argument. Adding more resources is only the beginning; other advancements need to be made as well, for example more IPC per core (isn't it 2 per core instead of 3 with Phenom, and doesn't Intel have 4?), an improved scheduler, and the cache speeds, which severely need improvement.

Except they won't. IPC and ALU aren't the same thing, and I do believe I've mentioned before that "IPC" itself is meaningless without context. x86 is not vector code; it doesn't do multiple integer operations simultaneously, and neither does any form of RISC processor. It's one integer operation at a time, in a perfectly serial fashion. In other words, it's not possible for a native x86 CPU to execute more than one integer operation at a time.

Thankfully, neither Intel nor AMD produce native x86 CPUs anymore. Instead they take the serialized x86 operations up front, dissect and analyze them in hardware, then break them into smaller micro-operations and dispatch them to multiple internal integer units. By correctly predicting which data is needed and which instruction comes next after a jump, the CPU can virtually execute multiple integer operations simultaneously. The efficiency of the predictor largely determines the maximum number of simultaneously executed operations.

Intel's design can execute about 2~2.5 integer operations under optimal circumstances, AMD's about 1~2. This is the reason AMD went with 2 ALUs per core: outside of extremely situational lab code, you simply aren't going to see more than 2 ALUs being used simultaneously with their predictor. Intel sees the same wall at around 3; the third is rarely used.
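A generic illustration of how much the predictor matters (my own toy example, not tied to any specific chip): the same loop runs much faster when its data-dependent branch is predictable.

```c
/* Toy demo: summing values behind a data-dependent branch. With random data
 * the branch mispredicts roughly half the time; after sorting, it becomes
 * almost perfectly predictable and the same code runs several times faster.
 * Compile with: gcc -O1 predict.c (at -O2 some compilers emit a conditional
 * move, which removes the branch and hides the effect). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static long long sum_big(const int *v, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)               /* the data-dependent branch */
            s += v[i];
    return s;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = 0;
    for (int r = 0; r < 100; r++) s1 += sum_big(data, N);
    double random_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(data, N, sizeof *data, cmp_int);   /* same data, now predictable */

    t0 = clock();
    long long s2 = 0;
    for (int r = 0; r < 100; r++) s2 += sum_big(data, N);
    double sorted_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("random: %.2fs   sorted: %.2fs   (sums: %lld %lld)\n",
           random_time, sorted_time, s1, s2);
    free(data);
    return 0;
}
```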

And now everyone is about to ask why Intel included a fourth ALU in their Haswell design, and the answer is simple. Since they use HT, the ALUs are split between both threads when they are being loaded. That fourth ALU is primarily for increasing performance in the i3 and i7, and it is one of the reasons we saw such a big improvement in the multi-thread performance of those two models.

As for K10, we've already been there and discussed this. There already exists a 32nm, die-shrunk, enhanced K10.5 CPU: Llano. There has been a ton of testing and benchmarking, much of it including a dGPU with the iGPU turned off. Llano indeed had slightly better performance than the Phenom II when run at the same clock, but the design has hit a wall in speed and performance. All CPU designs have a wall. Earlier I mentioned how improvements after a uArch is released are typically made via further optimizations: making the traces between two components slightly shorter so you get a 3-tick latency instead of a 4-tick latency; little things like that add up quickly. Eventually you get to a point where you can no longer improve a CPU uArch design: you've optimized it as much as it can be, shortened all the traces, tuned all the timers, made the buffers as fast as possible. When that happens, the only solution is to create a new design, inheriting many of the previous one's components and features but designed from scratch with different interconnects and different organizations. You then start that tweaking process all over again; if the CPU is released before you've done enough tweaking, it will actually have lower performance than the highly tweaked/optimized previous model (sound familiar?).

So ultimately the answer is no, they couldn't have just made another K10 with X more of Y and somehow "won". AMD did the only thing they could do, and the modular design actually works really well for them, as it enables them to produce new CPUs cheaply. I really can't overstate how important cost is here; it is probably the single biggest factor in CPU design. Modularity allows them to continuously reuse the same components without needing to constantly re-tweak the connections between them. They can still tweak and enhance individual components, and those enhancements become standard for future releases, but on a per-core / per-CPU basis they don't need to spend much more money. This is one of those design decisions that pays large dividends in future use.
 

truegenius

Distinguished
BANNED
My previous comment of "bump" got published :oops:

I was actually testing whether the thread works now or not :miam:

BTW, here is my comment that I kept safe in a txt file for many days :pt1cable:


The fishy thing is this: a mere dual core beats an X4/X6 Phenom II :pfff:
Maybe it's a new-instructions advantage.


exactly ;)


Which makes me even more angry.
Why did they waste precious money and time just because they didn't have any new idea? They could have used that wasted time and money to do more experiments on the modular design, to make it beat Sandy clock for clock for real, and they could have released a real 32nm K10 with those improvements discussed above*.
The engineers at AMD must be using Intel PCs; that is why they didn't spot the uncore angle, which would have brought 5-10% better performance (even a mere +5% is better than BD's -20%) in general software and even more in games.



*Discussed but skipped, with no satisfactory conclusion.
**Your argument that Llano performs 5% better than Phenom II shows the remaining potential of K10, because the Llano CPU is an Athlon (no L3 cache), and we all know how Athlon compares with Phenom II.
***We didn't even see Phenom I speeds in Llano: 3GHz for the top part! Really?
The engineers at AMD designed Llano very badly. Why?
Because it needs a stock 1.4+V for a mere 3GHz at 32nm (A8-3870K).
1.4V is the stock voltage even for 45nm K10, and there we have seen stock clocks up to 3.7GHz, which is 23% higher at the same voltage.
Generally, when components get shrunk, the voltage required for the same speed decreases, but in the case of Llano they pushed 1.4+V for a decreased clock.
I know that BD/PD can reach higher clocks than K10, but 3GHz isn't a wall for K10; many AM3 K10s run at much higher clocks than 3GHz.
We can even get 4GHz @ the stock 1.4V on a Phenom, which shows its true speed potential (and also shows us that AMD overvolted their Phenom chips).
[Chart: Phenom II X4 965 BE overclock]


My Phenom 1090T can do 3GHz at just 1V (on 4 cores), while the stock voltage is 1.4V for 3.2GHz (6 cores; it can do 3.2GHz on 6 cores @ 1.15V).
So why did they push Llano to 1.4V for a mere 3GHz?
Because they (and the fab) messed it up.
So Llano does not indicate the clock potential of K10, but it does show that K10 could still be improved. And after reading the 20-core Zen joke, I now think K10 would have served better (it would perform better at the same power consumption even if they didn't improve its performance).

Here are some more things to show K10's clock potential (the table below will be used for further comparison).
Source (yes, I know it's Juan's link :p): http://www.tomshardware.com/reviews/piledriver-k10-cpu-overclocking,3584-17.html
[Chart: AMD power consumption]


Phenom II X4 965: 4GHz @ 1.4V (probably stock volts), an ~18% overclock, and from the power consumption table above we can see that the overclocked power consumption (CPU load) was only ~10% higher than stock.
[Chart: Phenom II X4 965 BE overclock]


FX-4350: it required 35% more power for just a 12% clock boost.
If we take a linear power consumption increase with overclock, it would need ~53% more power to reach the same 18% clock boost as the 965, while the 965's increase was only about 1/5th of the FX-4's for an 18% boost.
[Chart: FX-4350 overclock]


Even in the case of the FX-6350: 30% more power for a 15% boost, which means ~36% more power consumption (taking a linear increase) for an 18% boost. A rough version of that extrapolation is sketched below.
[Chart: FX-6350 overclock]
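A small sketch of that linear extrapolation (same naive assumption as above, using only the percentages quoted from the Tom's tables):

```c
/* Naive linear scaling of the quoted overclock power increases to an 18%
 * clock boost, to match the Phenom II X4 965 figure. Real power scales worse
 * than linearly once voltage has to rise, so these are optimistic for the FX parts. */
#include <stdio.h>

static double scale_to(double power_pct, double clock_pct, double target_pct) {
    return power_pct * target_pct / clock_pct;   /* assume power grows linearly with clock */
}

int main(void) {
    const double target = 18.0;  /* % clock boost of the overclocked 965 */

    printf("PhII X4 965 : ~10%% more power for an 18%% boost (measured)\n");
    printf("FX-4350     : 35%% more power for 12%% -> ~%.0f%% for 18%%\n", scale_to(35, 12, target));
    printf("FX-6350     : 30%% more power for 15%% -> ~%.0f%% for 18%%\n", scale_to(30, 15, target));
    return 0;
}
```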



BTW, what happened to the idea of replacing the FPU with the GPU in APUs for HSA? Is it dead?

Oh crap, I am getting error 500 even for a new comment :(
 