AMD CPU speculation... and expert conjecture

Status
Not open for further replies.

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460


Supposedly Excavator! And that's gonna be really friggin incredible if it comes!

I'm really excited for what AMD is doing. I'm so looking forward to Steamroller. I know it won't be the chip to rule them all, but at least we'll see some decent OC support, better IPC, and new motherboards.

IMO, what sucks most about Intel is that they held back on Haswell. Even LavoandPrice were like, "wtf bro?" (Don't get butt hurt if you like Intel, you know this is a fact). It's pretty obvious what Intel did: they removed TSX from the 4770K. And they could have done that for a reason. Maybe if AMD posed a threat with Steamroller, they could put TSX back in their K series CPUs during the Haswell refresh to be competitive. Or, they could put it in the Haswell refresh just to justify you buying an all-new i7 CPU with MINIMAL performance improvement (except if you use Intel QuickSync). Or they could just skip it altogether again!
 

kebbz

Honorable
Jul 27, 2012
212
0
10,680


Nice thanks for the input!



I'm not butthurt at all! :D AMD user here, my FX-6100 is fail but I'm definitely looking at the FX-6300. Totally agree with you though.
 


Can you tell me what TSX does in a practical way, i.e. what programs benefit from it and how? I have been trying to work this out for ages, and despite creating 3 forum posts and reading a few by others I still have no idea.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


There’s no change in the execution capabilities of the FPU

AMD claims 8 FLOPs per cycle for the Steamroller FPU. Piledriver FPU is also 8 FLOPs per cycle. I already devoted a pair of posts in this thread to explain the 8 FLOPs per cycle.

Piledriver 2M @ 4GHz: 128 GFLOP
Steamroller 2M @ 4GHz: 128 GFLOP



I was considering office/scientific computations where the GPU is usually idle and can be used for heavy compute. I can't offer a detailed answer for games, because AMD has not released detailed info about the GPU in Kaveri. However, the consoles offer some hints about the future.

We know that the new consoles (PS4/Xbox One) are moving heavy compute to the GPU: physics effects and other intensive computations are being done on the GPU without any major performance penalty for graphics. We already saw a demo of next-gen physics engines this year

http://www.youtube.com/watch?v=zPnwmsTokso

We also know that hUMA plays a central role in GPU compute, because it eliminates the PCI bottleneck and redundant copies

http://www.tomshardware.com/news/AMD-HSA-hUMA-APU,22324.html
 


For the record I didn't say Stars cores, but AMD may revive a K10-like design, predominantly for the enthusiast market where CPUs are sold to gamers and overclockers. Small and big cores will continue as is, with adaptive APUs targeting anything from mobile to server segments and other low-power devices.

Phenom II X6 1100T: die around 346 mm², ~900 million transistors, 45nm
FX 8350: 315 mm², ~1.3 billion transistors, 32nm

There is no doubt a re-engineered Phenom X6 at 32nm would be not only faster but more efficient. I believe Stars cores are cheaper to engineer than modular Bulldozer cores. A 45nm X6 at 3.3GHz can match an FX octo-core at 4.1GHz with fewer transistors and an older, leakier design. On 32nm with 1.3B transistors at 3.5GHz, a Phenom would not only beat an FX from pillar to post, it would probably compete directly against Intel parts between the i5 and i7 and be more efficient than the FX line.
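
For a rough sense of scale, the die figures quoted above can be turned into transistor densities. This is only a sketch: it uses the numbers from this post as-is, and the "ideal shrink" assumes die area scales with the square of the node name, which real processes do not achieve. The function names are made up for illustration.

```python
def density_mtr_per_mm2(transistors, area_mm2):
    """Transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


def ideal_shrink_area(area_mm2, old_nm, new_nm):
    """Die area after a shrink, ASSUMING perfect (node ratio)^2 scaling."""
    return area_mm2 * (new_nm / old_nm) ** 2


# Figures as quoted in the post above (approximate)
phenom_density = density_mtr_per_mm2(900e6, 346)   # Phenom II X6, 45nm
fx_density = density_mtr_per_mm2(1.3e9, 315)       # FX 8350, 32nm
phenom_at_32nm = ideal_shrink_area(346, 45, 32)    # hypothetical shrink

print(f"Phenom II X6: {phenom_density:.2f} MTr/mm^2")
print(f"FX 8350:      {fx_density:.2f} MTr/mm^2")
print(f"X6 die at 32nm (ideal scaling): {phenom_at_32nm:.0f} mm^2")
```

Under that idealized assumption the X6 die would roughly halve in area at 32nm, which is the headroom the argument above relies on. Whether the clocks and efficiency would follow is a separate question the arithmetic cannot answer.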

On the socket side, Intel has a fully integrated NB and IMC; AMD does not. AMD's NB is barely capable of 3GHz while Intel is beyond 5GHz with a fully on-chip IMC. I have heard about SB1150 chipset motherboards, or whatever it was, for some time, yet nothing has come out, and that is because there are no more AM socket chips slated. Meanwhile FM2+ integrates high-speed USB and PCIe 3 along with newer technologies. AM3+ is 4 years old and is very much a reason AMD FX processors are held back in some regards.

Bring back K10esque designs, lower clocks, lower power, well moarrrr performance and a new socket and chipset.

 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790
As mentioned before, Bulldozer was a good design when it is understood in its correct economic, legal... context. This is the epoch when AMD spent billions on the fabs. This is the epoch of the costly acquisition of ATI. This is the epoch of the litigation with Intel, because Intel was bribing OEMs to reject AMD products...

The CMT nature of each module allowed for a heavy reduction in costs, power consumption, and die space over a traditional non-clustered core design such as that used in Phenom. This is why AMD was able to release an 8-core chip. The problem with Bulldozer is that AMD was subject to budget constraints (again, recall the context) and had to make sacrifices in the final design: high-latency caches, an inefficient front-end...

Piledriver corrected several defects and was able to outperform Ivy Bridge in multithreading while losing in single-threaded.

Now Steamroller corrects the main defect of Bulldozer/Piledriver and provides a dual decoder. The shared decoder alone was responsible for about a 20% performance penalty in Bulldozer/Piledriver. Note that 20% is about the performance gain from Sandy Bridge to Haswell, i.e. the gain of the last two Intel generations.

If the penalty is so large, why did AMD not implement a dual decoder in Bulldozer? The answer is development cost. Implementing a dual decoder is much more difficult and costly than implementing a shared one.

Steamroller is very much what Bulldozer would have been. AMD is not going to go backwards, because there is no gain in doing that. On the contrary, AMD continues the evolution of Steamroller with Excavator.

I already explained why I doubt that AMD will release new eight-core CPUs a la FX. I will add that I also doubt that AMD will release eight-core APUs.

Evidently, AMD needs something new in the top high-end market. My doubt is whether AMD will release a six-core version of the Kaveri/Berlin APU by the end of 2014, or whether it will release the four-core Carrizo APU with Excavator modules in 2015, plus something similar for servers.
 

8350rocks

Distinguished
Juan:

http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-and-steamroller-core-architectural-enhancements-unveiled.aspx

The document lists the following changes to improve instructions per clock (IPC):
•Store to load forwarding optimization
•Dispatch and retire up to 2 stores per cycle
•Improved memfile, from last 3 stores to last 8 stores, and allow tracking of dependent stack operations.
•Load queue (LDQ) size increased to 48, from 44.
•Store queue (STQ) size increased to 32, from 24.
•Increase dispatch bandwidth to 8 INT ops per cycle (4 to each core), from 4 INT ops per cycle (4 to just 1 core). 4 ops per cycle per core remains unchanged.
•Accelerate SYSCALL/SYSRET.
•Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
•Improved loop prediction.
•Increase PFB from 8 to 16 entries; the 8 additional entries can be used either for prefetch or as a loop buffer.
•Increase snoop tag throughput.
•Change from 4 to 3 FP pipe stages.

At minimum they are reducing the FPU pipeline stages, which means the FPU will be able to act more quickly.

http://www.xbitlabs.com/news/cpu/display/20130331080217_AMD_We_Are_On_Track_With_Steamroller_Micro_Architecture_in_2013.html

It should be noted that while integer pipes of Steamroller will not be too different from existing ones, the floating point pipe will be a bit redesigned. In general, AMD promises that both integer and floating point per-core performance of Steamroller will be higher than they are today with Bulldozer micro-architecture.

The improvements to the FPU, combined with split decoders, should improve FPU performance hugely.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Supposed to be the same GCN cores used in Kabini/Temash (2 CUs), except with 8 CUs (512 "Radeon Cores", like the HD 7750).
 


Oh, and perhaps improve the caching system; that seemed to be K10's Achilles heel. Agena had too little clock speed and cache to compete beyond the Q8XXX, while Deneb went all out on clocks and cache and, while successful, could be better with improvements here and there.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Good catch. I overlooked that change as well.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


What you're talking about has largely already been done. It's the modern Jaguar core, which had its roots in the K10 cores. They've already scaled it to 8 cores for PS4/XBone. The limitation currently is the TSMC bulk process, which likely tops out around 3GHz before the TDP goes into orbit. For higher clocks they need GF.

Jaguar has some room to grow already if they increase the TDP. It supports turbo modes which are not enabled in Kabini/Temash to keep to 25W max.

I could see them resurrecting the Phenom name. It will be odd at first having a GPU in there, but Intel did the same with their product lines. Even the lowly Celerons/Pentiums have embedded GPUs now.

 

8350rocks

Distinguished


Just thinking out loud here, however...

Jaguar on GF's 28nm FD-SOI process anyone...?

You might see a 3.0 GHz base clock laptop CPU out of that with a relatively low TDP if that were to happen...

Hmm...suddenly, with Kabini and Temash being candidates for the same treatment...Intel's ULP designs look like they're the ones playing catch-up.
 


Perhaps, let's not get TOO ambitious though.
 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460


This link should help you. (From my understanding, it just improves performance; I guess this is the reason why Intel Haswell has lower memory reads than Ivy.)

http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

@kebbz, my last two paragraphs in that post weren't directed at you. It was more like adding to a thought I had. Just so you know, I don't want you or others to think I was attacking you, or Intel. Thanks for agreeing with me! Lol.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810



Have to be realistic here. The roadmap is already laid for 2013/2014.

We'd be talking 20nm GF in 2015 to pull that all together.
 

8350rocks

Distinguished


I know...it was just a bit of a day dream really...but that would be amazing anyway!
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Those are claims made by "Bright Side of News" based on supposed secret documents that they saw. In the same link that you gave us, they claim that Kaveri comes with 6 Steamroller cores and a GDDR5 memory controller.

In a later article, they claim to have seen new documents contradicting the previous ones

http://www.brightsideofnews.com/news/2013/6/1/amd-updates-roadmaps2c-shows-28nm-kaveri-for-socket-fm22b.aspx

Now seriously. I am giving you AMD's official claims. During the official presentation of Kaveri at Computex this year, AMD reported that the FPU in Steamroller is 8 FLOP per cycle, as in Piledriver

2M Piledriver @ 4GHz: 128 GFLOP
2M Steamroller @ 4GHz: 128 GFLOP

This contrasts with Haswell doubling the FPU capacity

i7-3770k: 224 GFLOP
i7-4770k: 448 GFLOP
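
The peak figures quoted here follow from cores × FLOP per cycle × clock. A rough sketch, with several assumptions that are not stated in the post: that the 8 FLOP per cycle is a per-core single-precision figure (so 2 modules = 4 cores), and that the Intel numbers use 16 and 32 FLOP per cycle per core for the i7-3770K and i7-4770K at their 3.5GHz base clock.

```python
def peak_gflops(cores, flop_per_cycle, clock_ghz):
    """Theoretical peak, assuming every core retires flop_per_cycle each cycle."""
    return cores * flop_per_cycle * clock_ghz


# (cores, FLOP per cycle per core, clock in GHz)
# The per-core FLOP values and the Intel clocks are assumptions, see above.
CHIPS = {
    "2M Piledriver @ 4GHz": (4, 8, 4.0),
    "2M Steamroller @ 4GHz": (4, 8, 4.0),
    "i7-3770k": (4, 16, 3.5),
    "i7-4770k": (4, 32, 3.5),
}

for name, spec in CHIPS.items():
    print(f"{name}: {peak_gflops(*spec):.0f} GFLOP")
```

Under those assumptions the sketch reproduces all four numbers. They are theoretical peaks; real code rarely sustains them.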

Steamroller doesn't because AMD is offloading to the GPU. This is the rumour spread a year ago:

It's not clear what AMD means by "streamlined execution hardware." Typically that's execu-speak for "We got rid of some stuff," but that may not be a problem here. Sunnyvale is pushing the idea that the GPU effectively becomes the floating-point heavy lifter at some point in the not-too-distant future

And that future is here, with AMD replacing Opteron CPUs with Berlin APUs (Berlin CPUs are secondary and probably APUs with a disabled iGPU) and, almost certainly, replacing FX CPUs with Kaveri APUs (maybe with CPU versions a la Athlon).



The same quote that you give mentions Bulldozer. Yes, the FPU has been redesigned compared to Bulldozer

[Image: AMD_BulldozerVsSteamroller.jpg]


but I am comparing it to Piledriver. A final note: although "there's no change in the execution capabilities of the FPU" compared to Piledriver, we should expect better performance from the improved front-end of the module.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Yes, AMD officially claims GCN cores and up to 8 CUs. From their official claim of 1050 GFLOP we can guess that GPU cores will be clocked at 900MHz and that is all...
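
The 900MHz guess can be reproduced with some arithmetic. A sketch, assuming 64 shaders per GCN CU (so 8 CUs = 512 shaders), 2 FLOP per shader per cycle (one fused multiply-add), and that the 1050 GFLOP figure is the sum of the GPU peak and the 128 GFLOP CPU peak quoted earlier. That last point is an assumption about how the total was composed, and the function name is made up for illustration.

```python
SHADERS_PER_CU = 64          # width of one GCN compute unit
FLOP_PER_SHADER_CYCLE = 2    # one fused multiply-add counts as 2 FLOP


def implied_gpu_clock_ghz(total_gflops, cpu_gflops, compute_units):
    """GPU clock needed to make up the rest of the total (ASSUMES total = CPU + GPU)."""
    shaders = compute_units * SHADERS_PER_CU
    gpu_gflops = total_gflops - cpu_gflops
    return gpu_gflops / (shaders * FLOP_PER_SHADER_CYCLE)


clock = implied_gpu_clock_ghz(total_gflops=1050, cpu_gflops=128, compute_units=8)
print(f"Implied GPU clock: {clock * 1000:.0f} MHz")
```

With those inputs the implied clock comes out at roughly 900MHz. If the 1050 GFLOP were GPU-only, the same formula would give a noticeably higher clock, so the guess depends on that assumption.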

Most people expect Cape Verde technology, but SemiAccurate has just spread the rumour that Kaveri comes with Hawaii technology.

We don't know the relevant details. Are they ordinary Radeon cores? Or improved Radeon cores with improved compute queues, as in the PS4 APU? Without that information I cannot answer the original question put to me.
 

8350rocks

Distinguished


Then why is it that in the diagrams shown in this article, slides from AMD no less, they discuss an FPU "rebalance" and pipeline adjustments?

http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20130704_606220.html%3Fref%3Drss

Translated from Japanese...
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


As mentioned before, it is not possible to design a single architecture that does well across the entire range from ultramobile to high-end desktop/server. The problem is not in the process. Look at Intel: Haswell was optimized for mobile and, as a consequence, performs worse than its predecessor (Ivy) at the high-end desktop. Intel moving to '22nm' was of no real help here.

There is no way that Jaguar can scale up to 4 GHz. This is a low-frequency design that tops out at about 2 GHz (the optimal frequency is probably about 1.5-1.6 GHz). AMD is taking a wise approach here, with two (three if you count ARM) architectures optimized for different markets.

Steamroller is the big brother of Jaguar. As mentioned by Feldman (AMD), a Steamroller core offers double the performance of a Jaguar core.

Finally, just to mention that the console APUs don't come in a new 8-core configuration, but as a dual 4-core configuration.


 

8350rocks

Distinguished


ES's for the PS4 APU topped out at ~2.6 GHz and were close to 50W power consumption there. They will likely end up scaled back to around 2.2-2.4 GHz, wherever they find a sweet spot.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Those slides are from Hot Chips 2012, where AMD discussed the improvements of Steamroller over Bulldozer. Those slides are included and discussed in every one of the three or four sources that I gave you, including the one clearly stating that the FPU capability is the same as Piledriver's.

The question is why you insist on ignoring that AMD officially claimed this year at Computex that the Steamroller FPU is exactly 8 FLOP per clock, the same as Piledriver. Why?

2M Piledriver @ 4GHz: 128 GFLOP
2M Steamroller @ 4GHz: 128 GFLOP

If you have different data or sources that contradict AMD, please share them with us. Links to sites reporting what they saw in hypothetical secret documents don't count. Give me a link to AMD claiming that the FPU has been improved and is capable of more than 8 FLOP per cycle.

The same goes for SOI. AMD officially claims that Kaveri is bulk. Your own Japanese site claims that Kaveri is going bulk. Still waiting for your source that Kaveri is SOI.
 

8350rocks

Distinguished
http://www.eetimes.com/electronics-news/4405121/FDSOI-production-video

This article and connecting the dots leads me to believe it's FD-SOI.

Here's why...

(1) FD-SOI will be cheaper per die at SHP level performance because of fewer masks

(2) FD-SOI and planar bulk share many of the same processes and designs translate easily from bulk to FD-SOI, so there would literally be nothing preventing them from holding out a few months and getting their chips on SOI over bulk...(except the maturity of the process of course)

(3) The timing of the delay announcement, coming within 30 days of GF announcing that its 28nm FD-SOI will be available for full-scale production in Q1 2014 with production ramping up from Q4 2013, is a bit conspicuous, if not altogether expected.

FPUs:
(1) While the FPU itself may be unchanged, the decoders and pipelines are not. In engineering terms this would be the equivalent of opening the secondaries on a car's carburetor. Nothing in the unit changed, you just gave it a lot more to work with, and it produces more performance/acceleration.

(2) AMD knows its gaming performance deficit comes strictly from the weaker FPU...they would have to. By acknowledging that they need to get the FPU they have to work more efficiently/effectively, they are using a workaround to get maximum use out of what they have in place without a complete redesign of the FPU itself.

(3) Piledriver didn't inherently change a lot of the actual core performance in so much as it changed how the cores were fed instructions. We are seeing a trend here again with SR, minor pipeline adjustments and a few tweaks to feed the architecture more effectively. This denotes that they know their architecture was designed well, it was just implemented poorly. Though they had to work out the growing pains through development to get the product to do what they wanted. Additionally, it was adapted from server structures originally, which means it was geared heavily toward integer calculations to start with.
 


Cheers for the info, but does anything benefit from it? The 4770 does use it but the 4770K doesn't; is there anything the 4770 performs better in?
 