AMD CPUs, SoC Rumors and Speculations Temp. thread 2



Per AnandTech's review of the Haswell uarch, Haswell is 4 ALU/2 AGU. Do you have a source to disprove this?
 


Where are you getting 4/4?

Looking at that flow chart it looks like 4/3...

I see 4 ALUs

2 Load/Store

1 Data Store

You surely are not counting the HTT register stack at the end as a full AGU, are you?
 


The 4th store port can be used to free up the two load/store ports, which makes it a 4th AGU. It is not an HTT-specific address unit.
 


From what I can see it has 2 full AGU (the 'Load / Store' blocks), and 2 partial ones ('Store Data' and 'Store Address').

As usual it isn't so black and white.

I mean it's quite possible Zen could feature similar things and I'm assuming the latter 2 in Haswell are only useful in certain circumstances.
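
For what it's worth, here is the sort of loop where that 'Store Address' block would earn its keep, at least as I read the diagram (my own illustrative snippet, not anything from Intel's docs; the per-iteration counts in the comments are rough):

#include <stdint.h>

/* Each iteration needs three address generations: two for the loads and one
 * for the store (the separate Store Data port only supplies the value).
 * With just two general load/store AGUs, the store address steals a port
 * from the loads, so address generation alone needs ~1.5 cycles per
 * iteration; a dedicated store-address port lets all three issue in the
 * same cycle. */
void vec_add(uint32_t *dst, const uint32_t *b, const uint32_t *c, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = b[i] + c[i];
}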
 

indeed, seems like the count of agu isn't black and white. perhaps that's why it's rarely mentioned.
 


Yes, it clearly is not 4 full AGUs...and I equate 2 partial to 1 whole...hence the 4/3 comment.

@jimmysmitty:

The big thing in my mind is that you are overlooking the fact that HTT enabled parts do not necessarily have a different core in them...so in non-HTT enabled parts, the additional register stack could be disabled...

(Though, I would be very open to evidence suggesting otherwise...)

Essentially, Haswell is 4 ALU/2 AGU, plus other stuff that may or may not be AGUs but acts similarly in part.
 
The main thing is that with those changes it looks like AMD can certainly get the 40% IPC uplift they were going for.

By the time it launches and with more DX12 titles in full swing, a Zen+Fiji2 rig could perform quite well.
 


This is the core execution engine. If you actually look at the Nehalem/Sandy Bridge core execution unit, they did not have these two extra ports or AGUs. So why would these extra parts be disabled in non-HT parts when a 2600K, which had HT, did not have them at all?

They may not be full AGUs, but they are AGUs that can be used in certain workloads to help offload some work from the primary AGUs.

If anything it would be 4+2+2.

Cazalan, with DX12 I am doubting CPUs will bottleneck much except for the very low end. What I am more interested in is GPU performance there, as the 980Ti supports DX12 FL 12_1 while Fiji supports 12_0, although it is 11_1 that is the performance level.

I am interested to see how Zen does. But again, the 40% increase will depend on a lot of factors and on where they get the number from. Is it a per-core, per-clock increase, or are they considering it a 40% increase over Excavator including the cores and SMT?
 
A balanced design for an architecture (x86, ARM, ...) is one that has a 1:1 ratio of ALUs to memory ports.

Sandy Bridge and Ivy Bridge are 3:3 microarchitectures.

Haswell and Broadwell are 4:4.

K10 is 3:3.

Bulldozer, Piledriver, Steamroller and Excavator are 2:2.

Nvidia Denver is 2:2.

And so on. Zen is unbalanced with a 4:2 ratio. This disappoints, as mentioned by David Kanter in the quote that I gave before. The problem is that the amount of code that requires 4 integer ops and only 2 loads, or 2 stores, or one load + one store per cycle is virtually nonexistent. Therefore 2 of the ALUs in Zen will be fed continuously whereas the extra 2 ALUs will remain idle most of the time. That is the reason why I predicted 3 ALU + 3 AGU for Zen. My design was balanced.
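
To put an illustrative snippet behind that claim (my own example, nothing from AMD; the per-iteration counts in the comments are approximate):

#include <stdint.h>

/* Per iteration, roughly 2 loads + 1 store against ~3 simple integer ops
 * (the AND, the ADD and the loop-counter update; the compare-and-branch
 * goes to a branch port).  That is close to a 1:1 ALU:memory ratio, so two
 * memory ports saturate long before four ALUs do. */
void blend(uint32_t *out, const uint32_t *a, const uint32_t *b,
           uint32_t mask, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (a[i] & mask) + b[i];
}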
 


No.

4 ALU + 2 AGU cannot provide 40% higher IPC than 2 ALU + 2 AGU, because most of the time the two extra ALUs will remain idle, because the system cannot load/store fast enough: e.g. if one memory port on Zen is being used for a store, then Zen can only do one load.

My 40% prediction over Piledriver was based on my model of Zen being 3 ALU + 3 AGU, i.e. 50% more computational and memory ports than Piledriver (2 ALU + 2 AGU).

However Zen only adds two extra ALUs, and this will only increase integer performance by a few percent. The only remaining option is that most of the IPC gain comes from a better cache. However, the L2 cache is 512KB, which has higher latency than the 256KB on Intel chips.

Although in the past I was confident that AMD could hit 40% higher IPC, now I am not sure.

DX12 is largely irrelevant, because Zen's main targets are servers and HPC, not desktops. DX12 will not help AMD gain server market share, and that is what AMD needs to remain afloat and pay its debts.

Some time ago I asked why all customers rejected Zen for future supercomputers and chose instead future chips from IBM and Intel. I said then that this implied some problem with Zen. Now we know why:

 


You are making a lot of assumptions in your hypothesis.

Also, @jimmysmitty:

Sure, those 3 ALU architectures did not have the extra ALU either. I am talking about Haswell i5 versus i7. Those parts could be disabled in i5s, and activated in i7s. Unless you have something to point to the contrary...

Intel uses the same core for everything so far as I know, and some of them have disabled sections, some of them do not.
 


This part we are discussing has nothing to do with SMT though. This is the very basics of the cores of the CPU. SMT is handled after the fact. Why would they disable something like this?

That is what I am getting at. i5 and i7 are still the same. Even an i3 is the same. Any core based on the Haswell uArch will be this layout, with other features added after the fact.

The best way to test it would be to disable SMT on an i7, then clock it and an i5 the same and see the performance difference. Most likely it will be very similar, minus a few advantages for the i7 in applications that can utilize the extra L3 cache (2MB more) in the i7.
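
Something like this would do for the timing side (my own rough sketch, not a validated benchmark: it assumes HT is switched off in the BIOS, both chips are fixed at the same clock with Turbo disabled, and the process is pinned to one core, e.g. with taskset -c 0; the array sizes are chosen to stay inside a 32KB L1D so the extra store-address port, if it is really there, is what gets exercised):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define N    2048                 /* 3 arrays x 8 KB = 24 KB, fits a 32 KB L1D */
#define REPS 1000000

static uint32_t a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { b[i] = (uint32_t)i; c[i] = (uint32_t)(2 * i); }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];          /* 2 loads + 1 store per element */

    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("checksum %u\n", (unsigned)a[N - 1]);   /* keep the loop alive */
    printf("%.2f s total, %.3f ns per element\n",
           secs, secs / ((double)REPS * N) * 1e9);
    return 0;
}

If the i5 really had those ports fused off, a store-heavy loop like this run at the same clock should show a measurable gap; if the cores are identical, the times should be within noise.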
 


I am aware the cores are the same...

However, K series do not have most of the virtualization features enabled...what else could they possibly be disabling between one set of chips and another in spite of having the same cores in the processor?

My point is that just because it is in the uarch does not mean all the models get to play with it equally. For all we know...some of that may only be specifically for Xeons and disabled in everything else across the board...(honestly, the more I think about it...the less that would surprise me...but I digress...)
 


Sure I am leaving out many details, but they are irrelevant and don't change the conclusion.
 
I'm reading a TON of discussion on AGUs and ALUs and nobody discusses how IPC can change other than by putting in more of these.

It can't be that simple. If it was, AMD would put in 5/5 or 6/6 and cream the crop. They claim 40% IPC over Excavator; now I doubt this as much as the next guy, but we can't simply say that this is or is not based on the number of AGUs and ALUs! Sure it has something to do with it, but what about core design? Fetch improvements? And how do we know the latency of a 512KB L2 is slower than that of a 256KB one? I mean, you guys all make what look to me like some serious assumptions. I would not expect to see less than a 30% IPC improvement over Excavator, which would give us a full 40% over Piledriver, and that gives us roughly Skylake perf. Just sayin'.
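
On the "how do we know the latency" part: you can't know it from capacity alone, you measure it; bigger SRAM arrays tend to take longer to index and read, but the actual cycle count depends on the design. The standard trick once silicon exists is a pointer chase. A rough sketch of the technique (entirely my own illustration, nothing Zen-specific; the 64-byte line size and the 384KB buffer, sized to spill a 256KB L2 but fit a 512KB one, are assumptions you would tune per chip):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE  64                     /* assumed cache-line size in bytes    */
#define BUF   (384 * 1024)           /* working set: tune per cache level   */
#define STEPS 100000000L             /* dependent loads to time             */

int main(void)
{
    size_t n = BUF / LINE;
    char *buf = malloc(BUF);
    size_t *order = malloc(n * sizeof *order);
    if (!buf || !order) return 1;

    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(1);                                       /* shuffle the lines so  */
    for (size_t i = n - 1; i > 0; i--) {            /* the prefetcher cannot */
        size_t j = (size_t)rand() % (i + 1);        /* guess the next one    */
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                  /* each line stores a    */
        *(void **)(buf + order[i] * LINE) =         /* pointer to the next   */
            buf + order[(i + 1) % n] * LINE;        /* line in the cycle     */

    void *p = buf + order[0] * LINE;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < STEPS; i++)
        p = *(void **)p;                            /* serial, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / STEPS;
    printf("%p\n%.2f ns per load\n", p, ns);
    return 0;
}

Because every load depends on the previous one, the time per step approximates the load-to-use latency of whichever cache level holds the buffer. Whether a 512KB L2 on GF 14nm ends up slower than Intel's 256KB is exactly the kind of thing only a measurement like this will settle.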
 
The number of ALUs and AGUs and their balance doesn't matter in itself, without the context of the entire microarchitecture and its utilization of the execution units. Performance has so many factors outside ALUs and AGUs. I still believe we need more information to actually have a well-grounded guesstimate in regards to Zen performance.

+1 Vogner16
 


It is easier to put it this way:

The Bulldozer design is a lot like NetBurst. It is a narrow pipeline, meaning it can't process as many instructions per clock, but it can reach higher clock speeds more easily, hence the 5GHz FX series CPUs. Core 2 and later, and AMD's K8/K10 series CPUs, were wide pipelines with lower clocks but much higher instructions per clock.


The ALU/AGU count tells us some of the performance story. If it was a 3/3, even with a better execution engine, that would mean it would be slower than Haswell. With what we see it should, theoretically, be able to equal Haswell in some scenarios.

Of course it is correct not to assume too much since we don't know the whole picture yet, but we can surmise; after all, this is a rumors and speculation thread, and some of the people here have been doing this for a very long time, so they know the basics, and some the advanced points, of CPU uArch.
 
I have a hunch AMD went with a mixed bag of SMT and CMT with those extra ALUs. They must have learned something from using CMT and they're keeping it for themselves very closely. Or, that diagram is bonkers and in reality is 3/3 😛

In any case, the 32KB 8-way L1 D$ still has my attention. They are betting on great prediction and efficiency there, since that is the best middle ground there is for typical workloads. Why not throw all the meat at it and use 64KB fully associative instead? That implies they are going for the smallest core they can make that provides good efficiency. Remember PD's is 2-way associative.
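
A quick back-of-the-envelope on why 32KB/8-way keeps showing up everywhere (this assumes 64-byte lines, which the leak doesn't spell out): 32KB / (8 ways × 64B) = 64 sets, so 6 index bits plus 6 line-offset bits = 12 bits, exactly a 4KB page. That means the L1 can be indexed from the untranslated page-offset bits in parallel with the TLB lookup, with no aliasing headaches. Going to 64KB fully associative would mean comparing against every one of the 1024 line tags on every access, which is a latency and power non-starter for an L1, so the "smallest efficient core" reading fits.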

Cheers!
 


I doubt they are using CMT. They already said it is a standard core with SMT. Not sure it would benefit them in any way really.
 
I have to agree with jimmy here. We are simply too far out to know where the final performance will lie. All we really have to work with are the rumored uArch and AMD's lofty claim of a 40% IPC improvement. We still don't know what clocks AMD is capable of here! That's basically the biggest part of this, isn't it?

Hell, if they hit 6GHz or something absurd then your worries about a narrow pipeline are irrelevant.

Still, speculation is fun 😛
 


The problem with a narrower pipeline can be seen with Intel's NetBurst and AMD's BD uArchs. While they can hit absurdly high clock rates (the Pentium 4 topped out at 3.8GHz stock clocks well before the Core i series did), they have such low IPC that it is almost a bad trade-off. Core 2 came in and the 2.4GHz E6600 was pounding the higher-clocked Pentium 4s. That was due to the wider pipeline but also a much more efficient design overall.
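
To put rough numbers on that (the exact models don't matter much): per-thread performance ≈ IPC × clock, so for a 2.4GHz E6600 to even tie a 3.8GHz Pentium 4, Core 2's IPC has to be about 3.8 / 2.4 ≈ 1.6× NetBurst's, and to pound it the gap has to be bigger still. The clock deficit tells you the minimum IPC advantage.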

BD is in that same bucket right now. You can get a 5GHz FX CPU yet a stock i5/i7 beats it in most situations at a much lower clock speed.

The changes they are making will for sure increase performance over Excavator; they are going back to a core design similar to K10 but adding SMT and other improvements. How much depends on a lot of factors, and as with anything, most companies use situational benchmarks to tell us what it should be, so until we know more we can only assume that the 40% figure is one of many possibilities.

As for clock speeds, the info right now points towards 3.5-4GHz much like Intel. I highly doubt that GF has a mature enough 14nm yet to do much higher clock speeds than that or to even match Intel. That we will have to wait and see.

Personally I think the increase includes SMT and is not a per-core, per-clock increase. But that is just my opinion.
 


Which is what I was getting at. Despite the narrow pipeline, AMD matched Ivy and Sandy perf dollar for dollar. AMD claims 40% more IPC than Excavator; let's say we only get 30% over Piledriver (worst case scenario). If we take that info of 3.5 to 4GHz (stock; I overclock and assume everyone else here does as well, so could we assume 4.5GHz effective?), that puts us at an overclocked Skylake's clock speed, clock for clock. The only difference is in the IPC, and from what it looks like, 30% over Piledriver should get us there!
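
Making that arithmetic explicit (these are assumptions, not measurements): per-thread performance ≈ IPC × clock, so if both chips end up overclocked to roughly the same ~4.5GHz, the clock term cancels and the whole comparison reduces to the IPC ratio. The open question is whether 1.3-1.4× Piledriver's IPC actually lands at Skylake's IPC; the clocks mostly wash out of it.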

I see all arrows pointing to closely matching Intel's i7 performance, and with pricing expected to be around a K-series i5, we have a good chip on hand! What's not to like about this? Nobody expects it to be faster than Intel's i7, but nobody expects it to cost that much either, so dollar for dollar we have another win in my book. There is a reason I built two 8350 computers, and it's because I'm cheap, not for top-end perf. That title changes to a new CPU every year.
 