AMD CPU speculation... and expert conjecture

Page 219
Status
Not open for further replies.

8350rocks

Distinguished


These are the IPC improvement modifications for steamroller:

•Store to load forwarding optimization
•Dispatch and retire up to 2 stores per cycle
•Improved memfile, from last 3 stores to last 8 stores, and allow tracking of dependent stack operations.
•Load queue (LDQ) size increased to 48, from 44.
•Store queue (STQ) size increased to 32, from 24.
•Increase dispatch bandwidth to 8 INT ops per cycle (4 to each core), from 4 INT ops per cycle (4 to just 1 core). 4 ops per cycle per core remains unchanged.
•Accelerate SYSCALL/SYSRET.
•Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
•Improved loop prediction.
•Increase PFB from 8 to 16 entries; the 8 additional entries can be used either for prefetch or as a loop buffer.
•Increase snoop tag throughput.
•Change from 4 to 3 FP pipe stages.

Taken from this article:

http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-and-steamroller-core-architectural-enhancements-unveiled.aspx

Given what they're talking about there, the interdependency constraints will be slightly more relaxed, the pipelines will be increased, and the prefetch increased as well. Plus with double the dispatch bandwidth, they actually will be able to feed the cores faster...

It looks good.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Agree that 2M/4C comes first for those reasons.

My guess is that a future 8C Steamroller chip will replace 12C/16C Piledriver Warsaw CPUs in the 2P/4P servers. I suppose this will happen in 2015.

I see each Piledriver chip being replaced by a Steamroller with half the cores.



There are two estimates. The first is from Hot Chips 2012, which claims about 30% over Piledriver. Using that, the 4C SR APU will be at the performance level of an i5 IB

4C SR APU ~ i5-3570k

The second is from the Feldman interview (June 2013), where he says that a Steamroller core is twice as fast as a Jaguar core. Using that, the 4C SR APU will be at the performance level of an i7 SB

4C SR APU ~ i7-2600

Of course this all refers to the CPU alone. The GPU will be very powerful (Kaveri GPU has more GFLOPs than the entire trinity CPU+GPU).

Moreover, the above computations ignore the role of HSA. If half of the CUs are used for compute, I would expect a performance boost of about 5x. Indeed, developers "are seeing five times the performance" {*}

http://www.expertreviews.co.uk/processors/1299913/the-big-interview-apus-hsa-and-where-next-for-amd

{*} With HSA enabled software kaveri is years ahead of an i7-3930k :)
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780

You mean this? http://assets.vr-zone.net/17088/AMD_SteamrollerServer.jpg
+30% more "ops delivered per cycle". But it's about FPU, not IPC of the whole module.


Do you see any "per clock" there? Top clock for a Jaguar core is ~2.0GHz, while Bulldozer can easily achieve > 4.0GHz. Too imprecise to even call it an "estimate".

There's exactly one meaningful estimate: http://www.tomshardware.co.uk/AMD-Kaveri-APU-Gaming,news-44336.html
+15-20% IPC. Seems realistic.
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010
With that 4C SR CPU being ~ i7-2600, which is hyper-threaded, wouldn't that mean that the per-core performance would be greater on the SR CPU than on the 2600? I hardly believe an ounce of that. It being anywhere from a low-end i5 to an i5-3570k would be more reasonable. There's no way that AMD can pull off something where the per-core performance doubles and it doesn't have the same, if not higher, price than the equivalent Intel CPU.

As for it "exceeding an i7-3930k": the 3930k, with the right cooling, can be overclocked to 5.0 GHz, though that is the absolute max. If this chip can get near the power of the 3930k at stock clock and vcore within the next year, I'd be very surprised.

I have to say, since AMD is able to put fewer cores in the CPUs, I wonder what this means for heat and power consumption? :3
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Yeah, there are rumours I am going to be banned for posting on-topic, he he.

By the above

A_
_B
A_
_B
...

I did mean your "A ->B -> A ->B -> ..."

Now I need time to decode what you mean by "A-A ->B-B->A-A-> ..."
 


I would assume the Steamroller Kaveri A10 APU is around i5 2550K CPU performance for the most part.
 

cowboy44mag

Guest
Jan 24, 2013
315
0
10,810
Juanrga:
"http://www.expertreviews.co.uk/processors/1299913/the-b...

{*} With HSA enabled software kaveri is years ahead of an i7-3930k :)"

Very interesting read, and I think it sums up the difference in thinking between Intel fans and AMD fans when it comes to performance. Intel fans are still going by current and past ways of measuring performance, where raw CPU power was the most important factor. AMD fans are looking to the future, with fully developed HSA applications, where the combined CPU/GPU power is what is important. The raw CPU power of an A10 Kaveri APU may only equal an i5 2550K; however, when running HSA enabled applications its benchmarks will far surpass the i7 4770K. Whether Steamroller or Haswell is more powerful will come down to what application you are running and what technology is being used. Intel will doubtless still have better benchmarks by the old standards of benchmarking; however, Intel fans have an impossible time accepting that those benchmarking methods are going to become obsolete very soon.

We are no longer interested in CPU performance alone, but in CPU/GPU performance in HSA enabled applications (which we will be seeing more and more of). There are already some top-notch companies in the HSA Foundation, and I expect many more to jump on board once its full potential is realized.
 

+1
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010
I was wondering what to post for a little bit, as there was nothing I could really build on. But I thought I would bring to attention this:

001e8108_medium.jpeg


It's music generation software made by Yamaha that uses real actresses' voices to create songs. The actresses sing (more or less) all the vowels in Japanese, which can then be used at ~ any pitch to make a song. I thought this would be nice to add since it is pretty CPU dependent: my Core 2 Quad @ 2.5 GHz (idles @ 40-50% usage) gets ramped up to 90-100% easily just by putting a few measures of voice in. I think that running it, along with audio encoding, would be a great benchmarking tool.

Otherwise, it's pretty neat software. I'm actually recreating a song from an anime that I watched (Kobato). The song is Ashita Huru Hi, it's a nice song. Thus far, 8 measures in, the lyrics ~ sound exactly like they're pronounced (with the exception of a few vowels), and the pitches are spot on. ^_^

Hatsune Miku, the most famous Vocaloid Actress:

hatsune_miku_girl_cute_posture_look_25061_256x256.jpg
 

cowboy44mag

Guest
Jan 24, 2013
315
0
10,810


Intel has nothing like HSA. Intel is not interested in developing true APUs, not to mention their attempt with Haswell integrated GPUs isn't faring so well now, is it? Companies take one look at Haswell and put dedicated graphics solutions in them.

HSA has so much potential that it is truly mind blowing. IF you bothered to read the linked article you would see that software applications with HSA support are showing 500% improvement. I believe that Haswell was a 10% improvement at best, more like 6% real world.

Simply put take a look at the HSA Foundation. Those companies are no "joke", and they wouldn't be supporting HSA development if it were a joke.
 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460


Considering HSA is faster than Quick Sync, your quoted post was a load of nonsense. And all those Intel benchmarks you posted fall under the same category: a load of nonsense.

(I still have no idea why we feed this guy)
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780
But you know that the x5 boost is the best case scenario? It highly depends on the type of computation. There are plenty of algorithms for which the performance boost could be much smaller.

BTW, QuickSync is competing with OpenCL, not HSA. HSA is just a hardware feature which can be used by OpenCL.
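The point about the 5x figure depending on the type of computation can be sketched with Amdahl's law. This is only an illustration, not anything from the interview: it assumes the quoted 5x applies to the part of the run time that can be offloaded to the GPU, and the offload fractions below are made-up values.

```python
def overall_speedup(offload_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the
    original run time is accelerated by kernel_speedup."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - offload_fraction) + offload_fraction / kernel_speedup
    return 1.0 / remaining


# Assumption: the offloaded part runs 5x faster (the figure quoted above).
# The fractions are hypothetical and only show the trend.
for fraction in (0.0, 0.25, 0.5, 0.9, 1.0):
    gain = overall_speedup(fraction, 5.0)
    print(f"{fraction:.0%} of run time offloaded -> {gain:.2f}x overall")
```

Under these assumptions the full 5x only shows up when essentially all of the run time can be offloaded; with half of it offloaded the overall gain is closer to 1.7x.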
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I did mean this

Screen%20Shot%202012-08-28%20at%204.38.09%20PM.png


You are right that Feldman does not say per clock, but neither does he say 4GHz vs 2.0GHz.

Imagine that you are right and he did mean Steamroller at 4GHz is twice as fast as Jaguar at 2GHz (which is a rather trivial statement); then Steamroller would be about the same as Jaguar per clock and I would correct my estimate by a x0.5 factor. The conclusion being that Kaveri 4C would be slower than trinity 4C. Congrats!

I think that Feldman did mean per clock, and using that I obtain the i7 SB level, which is close to the i5 IB level obtained with the 30% IPC.
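The two readings of Feldman's "twice as fast" come down to one division. A minimal sketch, using the 2x figure and the ~2.0GHz / 4.0GHz clocks mentioned in this thread; the function name is just for illustration.

```python
def per_clock_ratio(perf_ratio, clock_a_ghz, clock_b_ghz):
    """Per-clock performance of chip A relative to chip B, given their
    overall performance ratio and their clock speeds."""
    if clock_a_ghz <= 0.0 or clock_b_ghz <= 0.0:
        raise ValueError("clock speeds must be positive")
    return perf_ratio / (clock_a_ghz / clock_b_ghz)


# Reading 1: Steamroller at 4.0 GHz is 2x Jaguar at 2.0 GHz.
different_clocks = per_clock_ratio(2.0, 4.0, 2.0)

# Reading 2: the 2x claim already holds at equal clocks (per clock).
same_clocks = per_clock_ratio(2.0, 2.0, 2.0)

print(f"2x at 4.0 vs 2.0 GHz -> {different_clocks:.1f}x per clock")
print(f"2x at equal clocks   -> {same_clocks:.1f}x per clock")
```

The first reading leaves no per-clock advantage at all, which is where the x0.5 correction comes from; the second is the per-clock interpretation.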



The 2600 has poorer per-core performance than the 3570k. My above estimates were for multithreaded scenarios (that is why the 2600 scores higher than the 3570k), because this is what matters more in real use for most people.

Even if the 3930k could be overclocked up to 8 GHz (which I doubt) using extreme cooling, it couldn't compete with the performance of an HSA enabled chip offering a 500% boost at stock clocks. And don't forget that kaveri can be overclocked as well, which would increase the gap.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


A safe estimate. I would add that the difference between the 3570k and the 2550k is about 10%.



Thanks. I also found it very interesting. Note that the i7-4770k is not an HSA chip, but even if it were a fully enabled HSA chip it couldn't compete with kaveri under HSA software, because the 4770k has a total (CPU+GPU) of 848 GFLOPs, whereas kaveri 4C has a total (CPU+GPU) of 1052 GFLOPs.

I don't know the performance of the 3770k's iGPU offhand, but its total (CPU+GPU) has to be less than 624 GFLOPs. More exactly, 624 minus the GFLOPs difference between the HD4600 and the HD4000.
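For scale, the gap between the two totals quoted above works out as follows. A small sketch that uses only the 1052 and 848 GFLOPs figures from this post; these are theoretical peaks, so the ratio says nothing about delivered performance in a given application.

```python
# Peak CPU+GPU totals as quoted in the post above (theoretical GFLOPs).
KAVERI_4C_TOTAL_GFLOPS = 1052.0
I7_4770K_TOTAL_GFLOPS = 848.0


def peak_ratio(a_gflops, b_gflops):
    """Ratio of two peak-throughput figures."""
    if b_gflops <= 0.0:
        raise ValueError("b_gflops must be positive")
    return a_gflops / b_gflops


ratio = peak_ratio(KAVERI_4C_TOTAL_GFLOPS, I7_4770K_TOTAL_GFLOPS)
print(f"Kaveri 4C / i7-4770k peak ratio: {ratio:.2f}x "
      f"({(ratio - 1.0) * 100:.0f}% higher)")
```

On paper that is roughly a 1.24x advantage for kaveri.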

Your point about old benchmarks/apps versus new is perfect! I found this slide which gives a good idea of the tendency towards using the true performance of APUs

900x900px-LL-b48347b3_apps.jpeg
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780

But Jaguar is already quite close to Piledriver. Double the performance per clock would be just impossible. Even +30% IPC is too big to be plausible (no one has made that big a jump in the last 10 years). Also, I think that the information from last month (+15-20%) is more reliable than some slide from last year.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


The interview does not say "up to" or "best case scenario". But, at the same time, we don't expect 5x for every application. In fact, it would be ridiculous to expect 5x when typing in a word processor. The GPU plays no role there, and the current performance of CPUs is more than enough for that task.



HSA is more than that. Tom's has an article discussing HSA and OpenCL.
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780

It's obvious. 400% is much more than any estimate I saw earlier (not by AMD but by some developers). It's possible to get such a large boost, but only in very rare cases. They just took the biggest number.

You're right, it's also a buzzword.

Seriously, HSA is invisible to the developer. OpenCL was designed with shared pointers in mind. It will just work faster on HSA-capable hardware.
 
At the risk of igniting a firestorm:

1: The official 7-zip build is compiled by MSVC, not ICC.

2: From S/A:

http://semiaccurate.com/2013/07/15/company-of-heroes-2-relic-performance/

For those of you debating building a new system optimized specifically for playing this game, Relic told us that Company of Heroes 2 is much more likely to be CPU-bound, rather than GPU-bound. Apparently the Essence engine supports the use of up to eight threads simultaneously. But all the threads being run on those cores at some point have to come back and finish executing on the first core. Thus single threaded performance is very important according to Relic.

*whistles*
 

8350rocks

Distinguished


Well, you're only partially right, OpenCL did imagine using shared pointers at some point, and GPUs doing general compute.

However, HSA goes a step further: HSA enabled GPUs have the capability to be used as a GPGPU when necessary on a much larger scale, whereas OpenCL capable GPUs have fewer compute pipelines. Look at the PS4, which is HSA enabled... it has 8x as many compute pipelines as the next best GPU compute capable card out there... (HD 7990 has 8 compute pipelines, HD 7970 has 4).

So, it isn't just an expansion of OpenCL, but it's really moving in a forward direction ahead of what OpenCL originally envisioned/foresaw.

EDIT: Think of it like this, HSA is OpenCL in the same way that a house is a bathroom because it contains one...
 

szatkus

Honorable
Jul 9, 2013
382
0
10,780


I was talking from a software perspective. At the hardware level it's completely different from traditional GPGPU.
 


Sure they can! Granted, there's currently no way to funnel the DX/OGL APIs to them, but nothing is stopping a developer from writing their own graphics implementation for use on a GPGPU.
 


But again, look at the results by architecture generation: i3s hanging around with BD? That's your per-core performance. The +8 FPS going from an 8150 to the 8350? Per-core performance. The 20 extra FPS the 4770k gives you on top of that? Per-core performance. Heck, that's the reason the PII X4 is still hanging around (and really makes me wonder where a Q9650 would end up on the chart...)

So again: another game that uses 8+ threads, that performs better on quad core processors with higher per-core performance rather than the 8 core processor with lower per-core performance.
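The argument above (work fans out to up to eight threads but has to come back to the first core) can be sketched with a toy frame-time model. Everything here is an assumption for illustration: the work units, the 1.5x per-core speed and the split between serial and parallel work are made up, not measurements from the game or from any CPU.

```python
MAX_ENGINE_THREADS = 8  # "up to eight threads", per the Relic quote


def frame_time(serial_work, parallel_work, cores, per_core_speed):
    """Toy model: the serial part runs on one core, the parallel part is
    split evenly across the cores the engine can actually use."""
    if cores < 1 or per_core_speed <= 0.0:
        raise ValueError("need at least one core and a positive speed")
    usable_cores = min(cores, MAX_ENGINE_THREADS)
    serial_time = serial_work / per_core_speed
    parallel_time = parallel_work / (usable_cores * per_core_speed)
    return serial_time + parallel_time


# Hypothetical chips: a quad core with 1.5x per-core speed versus an
# eight core with 1.0x per-core speed.
for serial, parallel in ((6.0, 10.0), (0.0, 10.0)):
    quad = frame_time(serial, parallel, cores=4, per_core_speed=1.5)
    octo = frame_time(serial, parallel, cores=8, per_core_speed=1.0)
    print(f"serial={serial:4.1f} parallel={parallel:4.1f} -> "
          f"quad {quad:.2f}, octo {octo:.2f} (lower is better)")
```

In this model the quad core wins as soon as a meaningful share of the frame is stuck on one core, and the eight core only pulls ahead when the serial share is close to zero.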
 