AMD CPU speculation... and expert conjecture



That was never actually stated. They just haven't released any info indicating an "FX" CPU made with SR.

Also remember "enthusiast" is the $300+ USD range for CPUs, not the $200 USD range.

4-5 year old tech; expect 4-5 year old performance.

This is not remotely correct. There is nothing available in FM2+ that isn't available in AM3+ aside from interconnects to a video output. What's important is the chipset the CPU is connecting to, in this case the 990FX. The 990FX provides 32 PCIe 2.0 lanes for video output in either a 2 x 16 or 4 x 8 configuration, along with some x4 and x1 slots, plus HT 3.0 @2.6GHz. The BD CPU supports HT 3.1 @3.2GHz and DDR3 memory.

There is currently no need for anything else. The socket is merely an interface between the CPU, the memory and the chipset; as long as its needs are met there is no need for anything different. DDR4 or a 32/32 (vs 16/16) HT bus are the only two things that would possibly require a different connector, and SR has neither (as of the most recent info). FM2 just adds a video output channel for the iGPU to use.
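For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch in C (my own illustration only, assuming the usual 16-bit HT link width and 500 MB/s per PCIe 2.0 lane per direction):

/* Rough bandwidth math behind the point above. */
#include <stdio.h>

int main(void) {
    double pcie2_lane_gbs = 0.5;            /* GB/s per PCIe 2.0 lane, per direction */
    double ht_width_bytes = 2.0;            /* 16-bit HyperTransport link            */

    double pcie = 32 * pcie2_lane_gbs;      /* 990FX graphics lanes                  */
    double ht30 = 2.6 * 2 * ht_width_bytes; /* HT 3.0 @2.6GHz, double data rate      */
    double ht31 = 3.2 * 2 * ht_width_bytes; /* HT 3.1 @3.2GHz, double data rate      */

    printf("PCIe 2.0 x32: %.1f GB/s per direction\n", pcie); /* 16.0 */
    printf("HT 3.0:       %.1f GB/s per direction\n", ht30); /* 10.4 */
    printf("HT 3.1:       %.1f GB/s per direction\n", ht31); /* 12.8 */
    return 0;
}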
 

Ranth

Honorable
May 3, 2012


Why is it you say that? What is it that decreases the performance? What is missing from the 990FX? Just curious.

 

griptwister

Distinguished
Oct 7, 2012
I'm not buying this FM2+ "myth." I think the high-end Gigabyte boards are there to represent how good the APU is gonna be, not that there will be a high-end lineup. If this rumor is true, there go my hopes of buying a 6-core Steamroller. Lol.
 

hcl123

Honorable
Mar 18, 2013
This deserves an answer



You are being really helpful lol... and I never said it does... but let me help you.

Instruction set
http://en.wikipedia.org/wiki/Instruction_set

Operand
http://en.wikipedia.org/wiki/Operand

"Operands may be complex, and may consist of expressions also made up of operators with operands. "

" In computer programming languages, the definitions of operator and operand are almost the same as in mathematics."


In computing, an operand is the part of a computer instruction which specifies what data is to be manipulated or operated on, whilst at the same time representing the data itself. A computer instruction describes an operation such as add or multiply X, while the operand (or operands, as there can be more than one) specify on which X to operate as well as the value of X.

Additionally, in assembly language, an operand is a value (an argument) on which the instruction, named by mnemonic, operates. The operand may be a processor register, a memory address, a literal constant, or a label. A simple example (in the x86 architecture) is

MOV DS, AX

where the value in register operand 'AX' is to be moved into register 'DS'. Depending on the instruction, there may be zero, one, two, or more operands.

Now everybody is enlightened LOL. And I doubt you didn't know any of this, since it was you that wrote that part of Wikipedia lol :??:



#$%% ... I'm not registered, so upon posting the system doesn't allow me to post images, only the links. (Doubt it as much as you like, complain to the admin.)

And yes, I made a mistake, and that is the primary reason I'm answering: that is Trinity, not Llano, but the rationale is the same. Integer cores don't do FLOPS, so those "cores" mentioned in the chart can only be the "FMAC pipes", the only ones that AFAIK can do FLOP operations... right? And there are 4 of them (FMAC pipes) in Trinity, across 2 modules.

Trinity has this in its instruction set:
http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-5800K.html


But that leaves a big problem... which instructions are executed that have 8 vector numbers per pipe!? If we consider "operands", that is, "chunks" (values) that are to be operated upon, only FMA4 has 4 operands, and those "chunks" are defined by registers, that is, they are "inside" what is defined as a register.

Those are
http://en.wikipedia.org/wiki/FMA_instruction_set
Mnemonic (AT&T)   Operands                        Operation
VFMADDPDx         xmm, xmm, xmm/m128, xmm/m128    $0 = $1×$2 + $3
VFMADDPDy         ymm, ymm, ymm/m256, ymm/m256    $0 = $1×$2 + $3
VFMADDPSx         xmm, xmm, xmm/m128, xmm/m128    $0 = $1×$2 + $3
VFMADDPSy         ymm, ymm, ymm/m256, ymm/m256    $0 = $1×$2 + $3
VFMADDSD          xmm, xmm, xmm/m64,  xmm/m64     $0 = $1×$2 + $3
VFMADDSS          xmm, xmm, xmm/m32,  xmm/m32     $0 = $1×$2 + $3

I looked around and I must confess I've not found anything clearly stated. A 128-bit vector can only have 8 "values"/"chunks" with 16-bit "numbers" (8x16=128), and that leads to SSE2 16-bit FP or XOP's 16-bit CVT16 FP... that is, vectors of 8x 16-bit chunks (pardon the language, so everybody understands)... The problem now is the number of operand registers: if the max that an operation (by mnemonic) can operate upon is determined by 4 registers, how are 8 "chunks" going to be operated on at the same time in the same cycle? Can you answer please?
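To make the widths concrete, here is a quick C sketch of how many values fit in one register at each element width (just my own illustration, nothing more):

/* Element counts per SSE/AVX register at different widths. Note that a single
 * 256-bit ymm register already holds 8 x 32-bit single-precision floats, and a
 * fused multiply-add (d = a*b + c) is counted as 2 FLOPs per element, so one
 * 128-bit FMAC doing 4 single-precision lanes is 4 x 2 = 8 FLOPs per cycle. */
#include <stdio.h>

int main(void) {
    int widths[] = {16, 32, 64};                   /* bits per element */
    for (int i = 0; i < 3; i++)
        printf("%2d-bit elements: %d per xmm (128-bit), %d per ymm (256-bit)\n",
               widths[i], 128 / widths[i], 256 / widths[i]);
    return 0;
}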

And I think in modern "processors" the only real logic operations that exist "hard wired" are ADD, *SHIFT*, MULTIPLY and DIVIDE (oops! an omission lol... nobody helps? (EDITED))... please correct me... copy, move, shuffle, permute, pardon my presumption, are about moving values/numbers around inside registers and from register to register, and are not what I'd call real logic operations. (edt)

I used Wikipedia, but if you have anything that can enlighten us, please share.

Streaming SIMD Extensions
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

SSE2
http://en.wikipedia.org/wiki/SSE2

Advanced Vector Extensions
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

SSE5 (led to XOP)
http://en.wikipedia.org/wiki/SSE5

XOP
http://en.wikipedia.org/wiki/XOP_instruction_set

Making a long story short: 8 FLOPS/core, assuming here it's the FMAC pipes, because no other processing element in Trinity is capable of FLOP operations (AFAIK)... and the FlexFPU in BD uarches is a co-processor (like a processor inside a processor)... and using a very loose definition, with a lot of caveats, considering that in fact 8 FLOPS (meaning 8 floating point numbers) are operated upon in each cycle by each FMAC pipe... then it can only be possible with 16-bit floating point numbers... no!??

See the dilemma?... do you understand?... and is this correct when those 8 "numbers" must reside inside 4 registers, which is the max stipulated per operation "mnemonic", and that for FMA4?

But if it is, and I'm bad and ignorant (the opposite of you lol), then the only way I see it is with vectors of 8x 16-bit "chunks" (whatever... for ppl to understand... and FP "chunks", since we are talking FLOPs)... if there is something escaping me please correct me;

Now that isn't terribly useful, is it?... unless we were using 16-bit DOS operating systems and applications, which hardly deserves to be mentioned... even if not fundamentally wrong, because in a very small set of cases 16-bit FP operations are still used to perform some complementary calculations in programs (it is used in games).

There is more than one article where one vendor raised suspicion about another vendor's FLOP calculations... if this gets worse we will have a "fake FLOP war"... and in light of what I expose, and the complexity of the matter, I find it no surprise that some vendors use slightly different methods, which might lead to some awkward charts even if those are not fundamentally incorrect.

If you in your infinite wisdom can correct me, please do (boy! more insults coming lol)... I don't feel embarrassed or find reasons to be, I don't even mind, if it's the TRUTH... I'm all for the TRUTH.
 

juanrga

Distinguished
BANNED
Mar 19, 2013


AMD clearly stated (May 2012) that its entire CPU line at 28nm was bulk instead of SOI. AMD explained to us why they selected bulk over SOI. If SOI plans didn't exist as you claim, then AMD wouldn't have needed to explain why they preferred bulk over SOI.

In fact, Glofo announced its 28nm SOI at least a month before: April 2012.

AMD claims officially that they selected bulk for flexibility. In the past, AMD had serious problems as a consequence of delays at Glofo. For instance, Cray lost some supercomputer contracts because AMD couldn't provide them Opteron chips due to production delays at Glofo. In my opinion, moving away from Glofo is a wise option. Now AMD can select the best fab for each product at any time and avoid delays and similar problems.

Moreover, AMD must be saving money by unifying all its products around bulk.

A priori SOI is superior to bulk. However, I have heard rumours that Glofo has mastered its bulk process up to a point where the differences with SOI are minimized.

There are rumours that Carrizo (Kaveri's successor) is 20nm, and Glofo claims that its 20nm bulk process offers the best performance/price option for customers.

Why would AMD move Kaveri from bulk to SOI and then return to bulk a year after with Carrizo?

Glofo claims that its 28nm SOI production volume starts in 1H 2014. AMD claims officially (and did so again a couple of weeks ago) that Kaveri starts shipping to customers this 2013.

In July this year PC Watch confirmed that Kaveri is going bulk at Glofo and added (sorry, this is an automatic translation from Japanese):

However, I found that whereas 32nm is SOI and 28nm is bulk, there is no great difference in transistor performance.

which seems to confirm the rumours I know about Glofo mastering bulk and minimising the differences with SOI. In fact, the article even claims that Glofo needed much time before being able to master the 28nm bulk process for massive production of high performance parts at a constant ratio.

http://translate.google.com/translate?hl=en&sl=ja&tl=en&u=http%3A%2F%2Fpc.watch.impress.co.jp%2Fdocs%2Fcolumn%2Fkaigai%2F20130704_606220.html

Now, there remains the part of your post about who is using Glofo 28nm SOI. I don't know. I only know that Glofo seems to be working hard at publicising its 28nm SOI. Maybe they are trying to attract customers.
 

8350rocks

Distinguished
It is easier to translate a design from bulk to SOI than to go from SOI to bulk.

Bulk requires more masking layers. Additionally, as I said before, planar bulk is at a disadvantage compared to SOI. Look at Intel using Tri-gate FinFET on bulk... they have heat issues, among many other things, with a far more advanced bulk process. Also, GF can say many things about how great one process is, but they cannot overcome the basic thermodynamic properties of the substrate.

FD-SOI will cost less at 28nm, 20nm and 14nm with better performance, per the GF presentation at the conference covered by advanced substrate where they and IBM presented their research. I don't see any way that could possibly change... GF themselves showed that FD-SOI is such an advanced process over the HKMG PD-SOI used in Trinity/Vishera/Richland that it blew it away in power consumption and efficiency, as well as performance.

That would be a step backward.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
My thoughts about High-End:

1) AMD has clearly stated its transformation into a gaming company.

For gaming, most users don't really need an ordinary i7 or an eXtreme chip. An i5 or an FX-6xxx is good enough.

The 4C Kaveri CPU will be at i5/FX-6 performance level. I consider that high-end.

2) Server/desktop roadmaps seem to demonstrate that there are no 6/8 core Steamroller chips. If those existed they would be used in the Warsaw CPU; however, Warsaw will be using Piledriver.

The rumoured 6 core version of Kaveri is not in the official roadmap. AMD is releasing only 2/4 core versions for desktop/mobile. Moreover, Berlin for servers comes in 2/4 core versions as well.

3) AMD is focusing on HSA (consoles, desktop, mobile, servers). In my opinion, a hypothetical 8-core Steamroller does not fit with the HSA 'paradigm', because the CPU is better at serial workloads whereas the GPU is better at parallel workloads.

[Image: AMD Kaveri APU hUMA slide]


It doesn't make sense to increase the number of CPU cores if the goal is to crush the parallel work on the GPU.

Note how the above image suggests that the CPU will be a quad at best; this is more evident in the following image:

[Image: hUMA introduction slide]


4) As many rumours seem to confirm, AMD is abandoning AM3+ and the FX line and focusing on the FM2+ platform. This makes sense; it is better to focus resources on a single platform than to split them.

Berlin comes as both a CPU and an APU. Kaveri comes only as an APU. However, I think it is safe to hope for a Steamroller CPU for desktop under the Athlon brand, as successor to the current Richland-based Athlons.

5) What is the use of the ordinary i7, FX-9000, and eXtreme series? Video encoding? Password crunching? Spreadsheets? Scientific compute? For all that, an APU will be better with appropriate software, because a 200-400 GFLOP CPU cannot compete against an APU with 1000+ GFLOP.

[Image: hUMA slide]
 

Ags1

Honorable
Apr 26, 2012
Not every parallel task can be converted to 1000 threads executing the exact same instruction in lock-step. There is still plenty of room for multicore CPUs, particularly in gaming and the like. For encoding, image processing and so forth, GPU computing offers a real benefit.
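A quick illustration of the difference (my own sketch, nothing more): the first loop is independent per element and maps well to a GPU or wide SIMD, while the second has a loop-carried dependency and really wants a fast CPU core.

/* Data-parallel: every element is independent, so thousands of lock-step
 * threads (or wide vectors) work fine. */
void brighten(float *px, int n, float k) {
    for (int i = 0; i < n; i++)
        px[i] *= k;
}

/* Serial recurrence: each step depends on the previous one, so it cannot be
 * split into independent lock-step threads. */
float smooth(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s = 0.9f * s + 0.1f * x[i];
    return s;
}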
 

juanrga

Distinguished
BANNED
Mar 19, 2013


NO. The 8 FLOPs per core are for single precision, i.e., 32-bit. I already demonstrated this before and I am not going to repeat the demonstration.



Trinity A10 (Piledriver) CPU has 121.6 GFLOP (Single Precision).
Kaveri A10 (Steamroller) CPU has 128 GFLOP (Single Precision).
i5-3570k (Ivy Bridge) CPU has 217.6 GFLOP (Single Precision).
Kaveri A10 (Steamroller) APU has 1050 GFLOP (Single Precision).
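For anyone who wants to check the arithmetic behind those figures, here is a quick sketch in C (my own illustration; the clocks are the ones assumed in the estimates, and the counting is the usual one where a fused multiply-add on a 128-bit FMAC is 4 single-precision lanes x 2 ops = 8 FLOPs per cycle):

#include <stdio.h>

static double gflops(double flops_per_cycle, double ghz) {
    return flops_per_cycle * ghz;   /* GFLOPS = FLOPs/cycle x GHz */
}

int main(void) {
    /* Trinity A10: 2 modules x 2 FMACs x 8 FLOPs = 32 FLOPs/cycle at 3.8 GHz */
    printf("Trinity A10:  %.1f GFLOPS\n", gflops(2 * 2 * 8, 3.8));   /* 121.6 */
    /* Kaveri A10: same 32 FLOPs/cycle; the quoted 128 GFLOPS implies 4.0 GHz */
    printf("Kaveri A10:   %.1f GFLOPS\n", gflops(2 * 2 * 8, 4.0));   /* 128.0 */
    /* i5-3570K: 4 cores x (8-wide AVX mul + 8-wide AVX add) = 64 FLOPs/cycle */
    printf("i5-3570K:     %.1f GFLOPS\n", gflops(4 * 16, 3.4));      /* 217.6 */
    return 0;
}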
 

8350rocks

Distinguished


This is all well and good, except that mainstream software lags about 2 years behind hardware. Hence we are only just now seeing games that can actually take advantage of Bulldozer architecture.

Additionally, because the lowest common denominator is not Vishera or IB, and is really still the C2D and C2Q and P2X4 etc, we run into situations where many programs do not even incorporate the newest instruction sets on the new CPUs because they won't run on the older stuff still in a *large* number of homes.

AVX code isn't even in most modern software right now because only the most recent generations of CPUs have it. It does no good to ship AVX code paths when most of your customers are on ~Nehalem or ~Thuban era hardware, give or take, as they would see no real performance increase.
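The usual workaround is runtime dispatch: ship both code paths and pick one at run time, so newer CPUs get AVX while older ones fall back. A rough sketch of the idea (my own illustration, using GCC/Clang's __builtin_cpu_supports; in a real build the two kernels would be compiled with different -m flags):

#include <stdio.h>

/* Identical here; only the dispatch is being illustrated. */
static void scale_avx(float *x, int n, float k)  { for (int i = 0; i < n; i++) x[i] *= k; }
static void scale_sse2(float *x, int n, float k) { for (int i = 0; i < n; i++) x[i] *= k; }

static void scale(float *x, int n, float k) {
    __builtin_cpu_init();                 /* initialise the CPU feature probe */
    if (__builtin_cpu_supports("avx"))
        scale_avx(x, n, k);               /* Sandy Bridge / Bulldozer and newer */
    else
        scale_sse2(x, n, k);              /* Core 2, Nehalem, Phenom II, ...    */
}

int main(void) {
    float v[4] = {1, 2, 3, 4};
    scale(v, 4, 2.0f);
    printf("%.0f %.0f %.0f %.0f\n", v[0], v[1], v[2], v[3]);
    return 0;
}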

HSA is great, and in 2 years when software at a mainstream level can utilize it, then Kaveri will benefit greatly. So far though, outside of a few productivity apps, you won't see much of it in newer software. In games, you might see some and that would be only on console games and ports from consoles. I would even argue the ports to PC might ignore HSA and HUMA simply because so much of the industry is still on older architecture and/or Intel architecture.

The greatness of the design is not in question, the capability of the software to utilize it is my chief concern, as that is still the quintessential Achilles' heel for Piledriver.

EDIT: In the past, AMD's solution to this has always been raw blunt force solutions like 6 and 8 core monsters that do through raw power what Intel does through efficiency.
 

hcl123

Honorable
Mar 18, 2013


??.. I completely missed it; I'm an occasional poster, can you repeat it again please.

See!... in my understanding the only instructions capable of such vectors are AVX and XOP, and in 256-bit format. The problem is that to execute those, Trinity uses 2 cores (FMAC pipes) working together... not one... so it couldn't be 8 FLOPS/core but 8 FLOPS/2 cores?

What instructions (kind, format, whatever) are you "using" that fit that calculation?... after all NO number is crunched without an instruction, no CPU does anything without an instruction (you could even have 1Mb vectors).

Something like the "COMPLETE LIST" is in here; what's missing is linked from other pages (point it out in there please) (edt)
http://en.wikipedia.org/wiki/X86_instruction_listings
 

hcl123

Honorable
Mar 18, 2013


Oh! but AVX2 will change that, I'm convinced; it has a plethora of very compelling instructions for integer as well as FP... and the integer side will perhaps be even more juicy than FP.

See Intel AVX FMA instructions on http://en.wikipedia.org/wiki/X86_instruction_listings

XOP can still have some advantage because it provides non-destructive "register operands", either for FMA4 or IMA4... but it is not as extensive a pack as AVX; more so, I think the IMA (integer multiply add or accumulate) part of XOP is only 128-bit vectors.

I think one feature Steamroller couldn't miss, even if the software developers are late, is support for AVX2. Just because of this it will make those FlexFPUs different. If AMD fails to support this, the CPU IPC can be even 40% better, but it will fail, period.

It will give a golden opportunity for Intel to "twist arms" so that all Windows benchmarks are built exactly around what AMD doesn't support... it will be a mistake or omission AMD will pay dearly for (running AVX2 by microcode will be tremendously slower; it will be a tremendous benchmark beating).

" EDIT: In the past, AMD's solution to this has always been raw blunt force solutions like 6 and 8 core monsters that do through raw power what Intel does through efficiency. "

That is clearly not true. Intel's efficiency is due to a fab process now clearly tweaked for low power (while SOI is better for clocks... umm, which do you prefer?) and quite excellent power management features. None of this is strictly due to design, or more particularly to the micro-architecture... I'm convinced the AMD BD design with Intel's fab process, and especially with Intel's power management, could be more power efficient than current Intel designs.

Another point about the past is that SSE5 was an AMD invention; those were/are instruction extensions clearly cleaner and more efficient than AVX 1 (edt)... not only did Intel not adopt it, it forced AMD to re-code the extension pack onto the same prefix as AVX, which became XOP... We haven't had FMA (clearly more efficient for FP calculations) since the K10 and Core 2 era, because of Intel.

Also, putting more cores inside a single chip die was and is due to the superior HyperTransport-based Xbar/north bridge in all AMD chips since the Athlon 64. Sharing L2 like in Core 2 can be efficient, but L2 sharing with more than 2 cores is not that clean or efficient (cache pollution due to prefetch and contention comes to mind), and is tremendously more complex (edt). Intel responded for many-core chips with a very good ring bus, but this tends to have increasingly higher latencies as the number of cores/nodes grows and is not more efficient. I think for future many-core chips Intel will have several ring buses or a segmented ring bus (they already do for the EP versions)... and those are bulky.
 

8350rocks

Distinguished


I was speaking about brute force in relation to the era since Intel overtook the performance crown. The era before that AMD was clearly better, not just in performance, but in efficiency as well. I would love to see Jim Keller resurrect that in this day and age, though I fear a departure from SOI would be a step backwards for them at this point.

I also think AVX and AVX2 would be beneficial, but it won't become widespread in adoption until a few years down the road.
 

Cazalan

Distinguished
Sep 4, 2011
If AMD just wants efficiency (performance-, power-, and cost-wise), why not just use 8 Jaguar cores and bump the clocks? At 3.1 mm² each they are super tiny and efficient cores.
 

hcl123

Honorable
Mar 18, 2013

(edited)

There have been phases... K7 was clearly better than Pentium III or the first versions of P4.

You have sensible arguments, but the problem with those multimedia-like extensions is that ppl don't judge a chip by the software they use but by benchmarks... most "current software" doesn't go beyond SSE2 to SSE4; AVX is starting to appear, but rest assured benchmarks don't care about representativity, they will use whatever extension pack best suits and gives advantage to whoever controls those benchmarks.

About the departure from SOI: if the bulk versions are only the first iterations of Kaveri, and most notably the "mobile" versions, then no harm will come of it... if a transition is made to FD-SOI then it is a complete gain... if bulk is definitive and here to stay for all SKUs, then I completely agree with you.
 

Krnt

Distinguished
Dec 31, 2009
Jaguar cores are around 87% of the IPC of the Phenom II, which is almost the same IPC as Piledriver, while being decidedly very efficient. Now the thing is that they are not designed around high performance like the Steamroller cores are, because of some details in the front end, L/S and cache clocks, making SR the far better option for the job. Steamroller should match or even beat Phenom II IPC by a fair amount (even 10% is good enough).

About the instructions game, it is funny that when AMD supports or even leads with new instructions, Intel rapidly makes new instructions that do the same thing but are incompatible with the ones made by AMD, or just makes new versions of the existing ones with almost zero improvement.
 

hcl123

Honorable
Mar 18, 2013


Because it would lose in performance across the whole line, even at the same clock. And worse, it would never be at the same clock, because certain particularities of the BD design were the enabling factor for the tweaks that allow superior clocks.

Also, in the middle range a tweaked Jaguar (even more so than in the PS4 or Xbox) at the same clock would not lose much integer IPC, but Jaguar is not tailored for all those multimedia-like extensions, in which the strong particularity of the BD design (and its best one) is exactly having a co-processor, with vector/FP out of the integer code paths. Those vector/FP capabilities are and will be even more important for HPC and some server jobs.

If you go and put a FlexFPU in Jaguar, then with all the balancing and searching for the best efficiencies and tradeoffs, I'm afraid you'll end up with something terribly resembling the BD design LOL (and this one is already done, all it needs is tweaks).


 

Krnt

Distinguished
Dec 31, 2009

I agree about the clock tweaks and multimedia extensions, but for me the Jaguar already has a FlexFPU, well, not sure, but it uses the same method to work on AVX as BD, with its two 128-bit FPU pipelines; it also has less than half of the integer performance of a BD module.
 

hcl123

Honorable
Mar 18, 2013


You forgot to mention it's ~87% of an L3-less Phenom II. Ppl tend to only see "cores", but even the best core in the world would be crap if it is put together with cache and other subsystems that are crap.

There are notable differences; it depends on the workload of course, but a PhII with L3 could add 5 to 10% or more to that comparison for a lot of workloads (Jaguar is for ultramobile). The same goes for the Piledriver FX, which has more than double the cache compared to a Ph II with L3.

No, I believe the overall performance differences are bigger than what you say: at 10% or more in overall performance, PD is already ahead of PhII (with L3). There are a few benchmarks, and perhaps fewer real applications, where PD is behind if we consider the same clock. But the BD uarch can never be considered at the same clock; for other chips to have the same clock abilities they would have to go through the same tweaks... and by virtue of that perhaps be completely re-designed. It's not all about SOI; as a matter of fact the SOI version of BDver1 was almost complete crap. (AMD waited until Samsung developed SOI "stress" tech for Glofo - a fab-lite partnership - since Glofo at the time, through whatever abhorrent mis-management, fired/dismissed most of its SOI engineers, most of them of AMD origin.)

 

hcl123

Honorable
Mar 18, 2013


I'm not sure, but I suspect Jaguar doesn't implement all of AVX 1. XOP is not there, so multiply-add or accumulate is not there (either fused FP or integer). And this is the most important part, especially for HPC jobs (some server ones too). So Jaguar's vector/FP capabilities are clearly inferior across the line, no matter the AVX label.

Doesn't matter for games... yet... that is why Sony and MSFT chose it, and it doesn't matter much for most single-threaded desktop applications. But where it does matter, the gap is abysmal.

 

Krnt

Distinguished
Dec 31, 2009

Well, I was comparing it vs a full Phenom II, but it was just a not-so-accurate mathematical comparison, using the Cinebench 11.5 numbers I found here:
http://www.anandtech.com/show/6976/amds-jaguar-architecture-the-cpu-powering-xbox-one-playstation-4-kabini-temash/3

Yeah, I know Cinebench is not a good bench for that kind of comparison, and performance is not always linear. I was assuming a theoretical Jaguar quad-core at 4GHz (obviously it will never reach that) could achieve 4.0 points in that bench, vs a Phenom II X4 at 4GHz that achieves 4.6 points; not accurate at all, but it gives an idea.
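The back-of-the-envelope ratio behind that estimate (my own sketch; the two scores are just the assumed ones above):

#include <stdio.h>

int main(void) {
    double jaguar_x4_4ghz  = 4.0;   /* assumed Cinebench 11.5 score, quad Jaguar scaled to 4 GHz */
    double phenom2_x4_4ghz = 4.6;   /* Phenom II X4 at 4 GHz */
    printf("Jaguar per-clock performance vs Phenom II: ~%.0f%%\n",
           100.0 * jaguar_x4_4ghz / phenom2_x4_4ghz);   /* ~87%, matching the earlier figure */
    return 0;
}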

About the PD cache, it is double that of the Phenom II in almost all areas except L3, where it has 1 MB per core compared to the Phenom II's 1.5 (with the exception of Thuban), but that is not going to make a huge difference. The main difference is that the PD cache is slower than the Phenom II's, but that was done to bump clock rates and keep consumption under control, as far as I know.
 