AMD CPU speculation... and expert conjecture

Status
Not open for further replies.

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I already provided a link earlier in this thread to an article showing that Jaguar and Piledriver have similar IPC, with a difference of less than 3% on non-synthetic benchmarks.

I will link it again

http://www.extremetech.com/computing/174980-its-time-for-amd-to-take-a-page-from-intel-and-dump-steamroller
 


I had a good read through that one. It's worth noting that it looks at *multi threaded* efficiency. The author notes in the comments that the *single thread* efficiency (which is more relevant to IPC) is more in the region of 10% - 20% less, depending on the application. The comparison was 4 cores vs 4 cores, so the module's multi threaded penalty accounts for why the gap is much smaller.
 

That's only on mobile. Some desktop parts will have large IGPs, but most will have HD5400 or w/e.
 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780


The simplest way to measure the efficiency of the two chips is to divide their respective benchmark scores in a given application by (CPU Frequency * Core Count). This normalizes both variables and gives us a measure of intrinsic core performance

Jaguar and Bulldozer family modules do not scale the same, and this article assumes that 2 Bulldozer cores scale the same as 2 Jaguar cores. I will believe Jaguar and Steamroller have similar IPC when I see 2m/4c Steamroller vs 4m/4c Steamroller vs Jaguar, all at the same frequency.
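A minimal sketch of the normalization quoted above (score divided by frequency times core count). The scores, clocks and chip names here are made up for illustration and are not taken from the ExtremeTech article. Note that it treats every core as independent, which is exactly the assumption being questioned for module-based chips.

```python
def per_core_per_clock(score: float, freq_ghz: float, cores: int) -> float:
    """Benchmark score divided by (CPU frequency * core count)."""
    if freq_ghz <= 0 or cores <= 0:
        raise ValueError("frequency and core count must be positive")
    return score / (freq_ghz * cores)


# Hypothetical chips; the numbers are illustrative only.
chip_a = per_core_per_clock(score=400.0, freq_ghz=2.0, cores=4)
chip_b = per_core_per_clock(score=760.0, freq_ghz=4.0, cores=4)

# Ratio of normalized scores: below 1.0 means chip B does less per core per clock.
ratio = chip_b / chip_a
print(f"chip A: {chip_a:.2f}, chip B: {chip_b:.2f}, B/A: {ratio:.3f}")
```

A fairer comparison for a module design would feed in scores measured with one thread per module as well as two, rather than relying on a single core count.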
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


IPC means Instructions Per Cycle, not Instructions Per Core. You can also check the Steamroller results, whose module lacks the 20% penalty of Piledriver.

Jaguar IPC is just between Piledriver and Steamroller.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Note that Excavator will bring 10-15% higher IPC over Steamroller.

As an anecdote: do you know that an AMD engineer said that the goal for Zen was 100% higher IPC? Yes, I did laugh.
 

jdwii

Splendid
IPC can be measured with a multicore benchmark, since it is just instructions per cycle, but I find it a bit ridiculous to do so.

IPC alone is actually kinda meaningless; I actually prefer the term CPI, or cycles per instruction.
 
The problem with measurements is IPC/CPI varies by workload, and you have to factor in the effects of the CPU cache/memory access. What I typically do, for a specific benchmark, is do a comparison between the two. I correct for clockspeed and number of cores (I have to assume perfect scaling, which I note hurts Intel in these comparisons), and solve for relative IPC.

Take an AMD quad versus an Intel quad at the same clock. Intel takes 30 seconds to run some benchmark to completion. AMD takes 60. Solving for RELATIVE IPC, I normalize the run times to 1 for Intel and 2 for AMD, so Intel's IPC for that benchmark is twice AMD's. I do NOT make any attempt to solve for actual IPC, because it's unsolvable.
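The 30 seconds vs 60 seconds example can be sketched as below. This assumes both chips execute the same instruction count and that the work scales perfectly with clock and core count (the same assumption stated above); the function name and parameters are hypothetical.

```python
def relative_ipc(time_a_s: float, time_b_s: float,
                 freq_a_ghz: float = 1.0, freq_b_ghz: float = 1.0,
                 cores_a: int = 1, cores_b: int = 1) -> float:
    """IPC of chip A relative to chip B for one benchmark.

    Work per cycle per core is proportional to 1 / (time * frequency * cores),
    assuming identical instruction counts and perfect scaling.
    """
    rate_a = 1.0 / (time_a_s * freq_a_ghz * cores_a)
    rate_b = 1.0 / (time_b_s * freq_b_ghz * cores_b)
    return rate_a / rate_b


# Same clock, same core count: only the completion times differ.
intel_vs_amd = relative_ipc(time_a_s=30.0, time_b_s=60.0)
print(f"relative IPC (Intel / AMD): {intel_vs_amd:.2f}")
```

The result is only a ratio for that one benchmark; it says nothing about the absolute instruction count.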
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


No. The ExtremeTech article is about IPC, which can be measured in either singlecore or multicore benchmarks.

Performance = IPC * Frequency

They measured performance, then divided by frequency to get the IPC of the processor (1) and then divided by number of cores to obtain the IPC of each core.

(1) http://en.wikipedia.org/wiki/Instructions_per_cycle



If you know one then you know the other, because they are inverses of each other

CPI = 1/IPC

http://en.wikipedia.org/wiki/Cycles_per_instruction
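As a small sketch, the two relations used in this post (Performance = IPC * Frequency and CPI = 1/IPC) look like this in code. The IPC and clock values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cpi_from_ipc(ipc: float) -> float:
    """Cycles per instruction is the reciprocal of instructions per cycle."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return 1.0 / ipc


def instructions_per_second(ipc: float, freq_hz: float) -> float:
    """Performance = IPC * Frequency, in instructions per second."""
    return ipc * freq_hz


# Hypothetical processor: whole-chip IPC of 2.0 at 3 GHz.
ipc = 2.0
cpi = cpi_from_ipc(ipc)
perf = instructions_per_second(ipc, freq_hz=3.0e9)
print(f"CPI = {cpi}, performance = {perf:.3e} instructions/s")
```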



Any performance measurement of a processor depends on the application used and on other elements in the system. What ExtremeTech measured was the average IPC over a collection of different synthetic and non-synthetic tests.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790
Having clarified what IPC is, I will now clarify my claim:


  • ~40% higher IPC over Jaguar/Piledriver

ExtremeTech article shows that IPC grows as

Piledriver < Jaguar < Steamroller

The differences are small. Piledriver is about 3-13% slower than Jaguar, whereas Jaguar is about 2-8% slower than Steamroller. Piledriver has the module penalty, but Steamroller mostly eliminates it with the doubled decoder: Piledriver is 11-15% slower than Steamroller and the module penalty is about 20% (check my BSN article about Kaveri).

Note that ExtremeTech article mentions something very relevant: Jaguar is partially bottlenecked by the L2 cache running at half the clock. Puma+ runs the cache at full speed.

The reason I didn't differentiate between Jaguar and Piledriver when I chose the baseline for my ~40% claim is that the error on the ~40% is larger than the difference between Jaguar and Piledriver. Therefore a finer prediction didn't make sense.

I want to mention that I agree with ExtremeTech that AMD should abandon the big core and focus on improving the cat line.

Finally, I also note that I mentioned a quote from Keller stating clearly that Zen will be a high frequency "small dense core", but nobody commented on it, especially the people who insist that Zen will be a 'giant' core of about the size of a whole Bulldozer/Piledriver module.

I would like to know your opinion about what Keller said.
 
Well, "dense" implies "more transistors per square mm", so "big" won't be an accurate term when moving to a dense process and comparing to a non-dense one. Just like Kaveri and Llano. The amount of transistors is *very* different, but the size is not that much (IIRC).

Now, I'd like to see the technical differences and what they're adding to the GPU and CPU portions. At 14nm they have a LOT of room to add stuff.

Cheers!
 
Performance = IPC * Frequency

You forgot #Cores, which matters if you want to solve per core rather than for the CPU as a whole.

The full equation is rather complicated:

Performance = IPC * Frequency * [ (Core[x] Loading * (1 - Core[x] Performance Penalty)) + (Core[x+1] Loading * (1 - Core[x+1] Performance Penalty)) + ... ]

The "1 - Core[x] Performance Penalty" term is trying to solve for the inherent loss in performance for virtual cores involved in either HTT or CMT. For HTT, which is about 20% effective, you get 1 - 80% (.8) = .2, or 20% of the performance of a normal core.

So we simplify down to Performance = IPC * Frequency, though this tends to understate IPC as average core load decreases, which hurts Intel when comparing IPC.

Isn't math fun?
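A rough sketch of the per-core sum above, under the assumption that each core contributes its loading scaled by (1 - penalty). The function name and the (loading, penalty) pair layout are hypothetical; the 80% HTT penalty is the figure used in the post.

```python
from typing import Iterable, Tuple


def effective_performance(ipc: float, freq_ghz: float,
                          cores: Iterable[Tuple[float, float]]) -> float:
    """Performance = IPC * Frequency * sum(loading * (1 - penalty)) over cores.

    Each entry in `cores` is (loading, penalty), both in the range 0.0 to 1.0.
    """
    core_sum = sum(loading * (1.0 - penalty) for loading, penalty in cores)
    return ipc * freq_ghz * core_sum


# One fully loaded physical core plus one fully loaded HTT virtual core
# carrying an 80% penalty, with IPC and frequency set to 1 for clarity.
htt_pair = effective_performance(ipc=1.0, freq_ghz=1.0,
                                 cores=[(1.0, 0.0), (1.0, 0.8)])
print(f"physical + virtual core: {htt_pair:.2f}x a single core")
```

With every core fully loaded and no penalties, the sum collapses to the core count, which is the simplified form.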
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790
Well, the whole quote is "small dense cores (Cat)". You can do a big dense core as well. Excavator is a big core that uses HDL. Excavator is about 27% smaller than Steamroller, but Excavator is still a much bigger core than Jaguar/Puma.
 

truegenius

Distinguished
BANNED


I am wondering whether they (or anyone saying jaguar/ph2/pd have the same per-clock, per-core performance) did their math homework, or whether they are using benchmarks that do not scale with clock speed and core count, like Sandra, gaming, real world, office, etc.
My homework shows different results, unless I am making some repeated mistake that I am unable to see.
http://www.anandtech.com/bench/product/1223?vs=1271
http://www.anandtech.com/bench/product/1223?vs=362
http://www.anandtech.com/bench/product/1223?vs=23

summary with performance percentage delta in comparison to jaguar (athlon 5350)
all results are normalized to 2.05ghz 4 core
using multithreaded to remove result inconsistency due to turbo
18% modular penalty means it will perform 22% better with 4c/4m config
these results will surely vary with different ram bandwidth, nb etc, but i can't normalize that by myself
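A minimal sketch of the normalization described in the summary, assuming throughput-style scores (higher is better) that scale linearly with clock and core count. The input score and the 4.1 GHz clock are hypothetical; the 2.05 GHz / 4 core reference and the 18% penalty come from the post.

```python
REF_FREQ_GHZ = 2.05  # reference clock from the post (Athlon 5350)
REF_CORES = 4        # reference core count from the post


def normalize_score(score: float, freq_ghz: float, cores: int) -> float:
    """Scale a score to the reference clock and core count (linear scaling assumed)."""
    return score * (REF_FREQ_GHZ / freq_ghz) * (REF_CORES / cores)


def remove_module_penalty(score: float, penalty: float) -> float:
    """Estimate the score with one core per module from a shared-module score."""
    return score / (1.0 - penalty)


# Hypothetical 2-module/4-core chip scoring 200 at 4.1 GHz.
normalized = normalize_score(score=200.0, freq_ghz=4.1, cores=4)
without_penalty = remove_module_penalty(normalized, penalty=0.18)
uplift = without_penalty / normalized - 1.0
print(f"normalized: {normalized:.1f}, uplift without penalty: {uplift:.1%}")
```

The uplift is 1 / (1 - 0.18) - 1, which is where the 22% figure comes from.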
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


As I said before #cores is already included in "IPC", which is the instructions per cycle of the whole processor

http://en.wikipedia.org/wiki/Instructions_per_cycle

if you want the IPC per core, just divide the IPC of the processor by number of cores. That is what ExtremeTech did.



Yes, you can complicate it in lots of ways. Instead of writing an effective IPC, you can write the product of a pure architectural IPC and a term that depends on the number of instructions in the task. This last term accounts for compiler optimizations and other effects. Then you can go further and separate, within the architectural IPC, the part due to the execution core from the parts associated with other elements of the architecture. You then finish with a more complex expression that takes into account the latency of the caches and of the functional units, the size of the ROB and other stuff.

The complex expression lets us see how each element of the architecture affects the IPC and why some architectures run some workloads better than others. But when measuring the performance of the architecture as a whole, we fold all those complexities back into the above "IPC" term and so we simplify down to the well-known

Performance = IPC * Frequency

Yes, it is fun but off-topic. I will stop here.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I never use Anandtech numbers. I was curious about your claims, so I just picked a random benchmark, compared the Athlon 5350 and A8-6500T scores, and found that the A8 has 93.2% of the IPC of the Athlon, i.e. Jaguar gives 7.3% more IPC than Piledriver, which agrees with the numbers found by ExtremeTech.
 

8350rocks

Distinguished
@juan:

When they were realistically considering 16 core CPUs for Zen, that was true; it had to be. That number is now halved to 8 cores, so what happens to those cores when there is no GPU on die?
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810
Not sure any of those are really comparing the cores themselves. You'd have to pick benchmarks that don't stress memory bandwidth or IO bandwidth. I.e. remove the uncore bottlenecks.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Hopefully they go back to having some L3 because even the Apple A8 has 4MB L3.
 