AMD "Zambezi"

Status
Not open for further replies.

illfindu

Distinguished
Nov 30, 2009
Hey, I'm looking towards the future and eyeing the 8-core AMD Bulldozer CPUs coming down the line. I've seen some sources say they're going to use an AM3+ socket, and I'm wondering if that means you'll be able to toss one into a current AM3+ compatible board?
Will my current http://www.newegg.com/Product/Product.aspx?Item=N82E16813130297&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Motherboards+-+AMD-_-MSI-_-13130297 MSI 870A Fuzion work? I noticed just now that it's AM3, not AM3+. I'm guessing that means I'll need a new motherboard, since the sockets aren't compatible?
 
Solution


That "market" is no different from any other market. Not all software makes the best use of 24 cores; some of the tests in that link didn't even make use of 12 cores. How is a lower-clocked 24-core server supposed to perform against a higher-clocked 12-core server when the workloads are only optimised for 12 cores?



The result of this scaling is that for once, you can notice which CPUs have real cores vs. ones that have virtual (Hyper...


For something to qualify as a Core it must have all of the elements required to be a Core. In this case... these modules share resources (and not just any resources)... they share Floating Point Units.

They are therefore not true Cores. You can count the number of true Cores by counting the number of clustered Floating Point Units. In this case... 4.

I would argue that you can therefore only call it:

4 Cores with CMT Technology.
or
4 Bulldozer Modules (each module consisting of two Integer Processing Units and one shared FP cluster).

They cannot empirically qualify as being "Cores". If anything... they're Clusters.
 


Or, one could say that Intel could not catch AMD in core count, so they are throwing megahertz at the problem.

Here's the difference: Cores are not constrained like clock speed is. Look at the last 5 years and ask yourself, how have cores progressed? (3X) How has clock speed progressed?

If you had to lay a bet on where performance in 2013 was going to come from, how many of you would place your bet on clock speed and how many would bet on core counts?

The reality is that clock speed isn't going anywhere. Nice to see that Intel has cornered the market on that, but where is it going?
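For a rough sense of that trend, here's a back-of-envelope sketch with illustrative numbers (my assumptions, not measured data: mainstream core counts roughly tripling over five years, while top stock clocks crept from about 3.0 GHz to 3.3 GHz):

```python
# Illustrative compound annual growth rates; the 3x core figure and
# the 3.0 -> 3.3 GHz clock figures are assumptions for the sketch.
cores_growth = 3.0 ** (1 / 5) - 1          # ~24.6% per year
clock_growth = (3.3 / 3.0) ** (1 / 5) - 1  # ~1.9% per year

print(f"cores: {cores_growth:.1%}/yr, clocks: {clock_growth:.1%}/yr")
```

Under those numbers, core counts compound an order of magnitude faster than clocks, which is the whole bet being described.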




So, according to you, there needs to be a 1:1 ratio between integer and FPU in order to be a true "core", right?

Thanks for verifying that Bulldozer will have true cores. Glad we got that out of the way.

 


According to the definition of a "Core" it is supposed to have its own computational components (not shared resources). A similar argument has been made about the RV670/RV770 etc. architectures in the GPU arena. Not every Stream Processor (ALU) is capable of being issued an instruction; in fact, only clusters of them can be issued instructions. In conclusion, an RV770 (for example) is not an 800-Core unit but rather 10 SIMD cores.

Unless Bulldozer isn't actually sharing FPU capacity (not my interpretation of the slides), I wouldn't call them "cores".
 
Each bulldozer core has access to its own FMAC. So, apparently, each Bulldozer core is actually a core in your definition. Thanks for verifying that you are on track with the rest of us.
 


They're sharing the same FP Scheduler though are they not?

[Bulldozer module slide from AMD's Hot Chips presentation]



Contrast that with Barcelona:

[Barcelona core block diagram]


Bulldozer is sharing resources. So I have to ask: how will Bulldozer perform as it is, relative to how it would perform if each core had its own dedicated FPU, under FP-heavy scenarios?

Now I do understand that the Bulldozer Module comes with FMAC (Fused Multiply-Accumulate) units, which give Bulldozer FMA4 support. But if you're scheduling FP workloads into a single unit then you are creating a bottleneck in FP-heavy scenarios.

Now I do understand that FP-heavy scenarios are few and far between these days (Integer performance is king) as dedicated GPUs tackle the heaviest of FP loads, but I still see a degradation of performance based on the sharing of resources between two cores.

My take is that the performance benefits were too little to justify the transistor space needed, and that sharing the FP Units allowed for some cost/implementation savings.

I still do not view it as two complete cores. While not in the area of being one physical and one logical core (SMT) it still shares resources and as such has the potential to cause bottlenecks under certain scenarios (less than SMT for sure but still likely to occur under heavy FP loads).

Those are my observations. No amount of marketing can change them. I'll have to wait and see when the product ships and extensive testing can be done.


http://www.hardwarecentral.com/features/article.php/3911856/AMD-Flexes-New-Floating-Point-Unit.htm
Inside Flex FP
The single floating-point unit of the Bulldozer has come under a bit of fire, as an eight-core Bulldozer would have only four physical FPUs. Instead of a dedicated 128-bit FPU per core as in current Phenom II designs, the Bulldozer architecture will feature a single 256-bit FPU shared by two integer cores.

The reason for this is simple - adding a second integer core to a Bulldozer module increases CPU real estate by only 12 percent. This is similar to the “shared resources” strategy Intel employs on a per-core basis with Hyper-Threading, but Flex FP is doing it at the component level and with a larger resource base.

AMD's stance is that as most programs have significantly more integer code than floating-point, a single integer core does not require its own dedicated 256-bit FPU. By adding a second integer unit, and sharing the same FPU, AMD can target the Bulldozer directly at the most common instructions.

AMD's Flex FP also includes some additional enhancements designed to improve performance and keep the data pipelines flowing. Bulldozer has dedicated schedulers for both integer and floating-point commands, rather than using a single scheduler for both units like Intel does on their Core-based processors.

By designing a separate FPU scheduler, each of the floating-point processes can be handled independently. This can not only speed up floating-point operations and keep the FPU path filled up, but it also drops the scheduling load from the integer processor. The only caveat is that there is a scheduler for each physical unit, two for integer and one for FPU per module.

AMD's Flex FP is designed around a full 256-bit FPU that can be further segmented into dual 128-bit data pipes. Flex FP certainly lives up to its name, and can handle two 128-bit SSE instructions through a single core, or both cores can simultaneously process a 128-bit FPU command. Support for AVX (Advanced Vector Extensions) instructions allows Flex FP to handle full 256-bit floating-point execution, but programs need to be recompiled to take advantage of it.

AMD is promoting Flex FP as a more flexible design that can easily handle both standard 128-bit floating-point code and the enhanced 256-bit AVX instructions. This differs from what Intel will offer with the Sandy Bridge FPU, which can process 1x128-bit in legacy mode and 1x256-bit with AVX code. Flex FP allows multiple configurations, so AMD Bulldozer should be able to process as a full 256-bit FPU, just not in the same form.

The difference is that regardless of the configuration, Flex FP can handle only 128-bit pieces, and pairs them up into 2x128-bit for a 256-bit AVX instruction. Intel can handle a full 256-bit floating-point AVX command per core, as well as a dedicated 128-bit path for legacy applications. This may sound equivalent, but this slight difference means that a Sandy Bridge multi-core processor should be faster when using the AVX instruction set.

In other words... if you've got an AVX command... you only end up with 4 units capable of processing it per Bulldozer chip. An 8 core Sandy Bridge would have 8 such units, 6 core Sandy Bridge 6 such units and Quad Core Sandy Bridge... 4 such units.

This should significantly reduce the Folding@Home performance of Bulldozer (compared to Sandy Bridge) and also significantly reduce the transcoding/encoding performance (relative to Sandy Bridge).

It can, and ought to, cause a bottleneck under FP heavy workloads unless there is something AMD hasn't yet told us.
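One way to make the shared 2x128-bit design concrete is a toy cycle model (my own back-of-envelope sketch, not anything AMD has published), assuming a 256-bit AVX op occupies both 128-bit pipes of a module for a cycle while two independent 128-bit ops can pair up in one cycle:

```python
import math

def module_cycles(avx256_ops, sse128_ops):
    """Idealized cycle count for one Bulldozer module's Flex FP.

    Assumes a 256-bit AVX op ties up both 128-bit pipes for a cycle,
    while 128-bit ops can issue two per cycle; ignores latency,
    dependencies, and everything else a real pipeline does."""
    return avx256_ops + math.ceil(sse128_ops / 2)

# Pure 128-bit work keeps both pipes busy...
print(module_cycles(0, 8))  # 4
# ...but the same count of 256-bit AVX ops takes twice as long:
print(module_cycles(8, 0))  # 8
```

That halved AVX issue rate per module is exactly the bottleneck being argued about above.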

Share and Share Alike
There are some concessions inherent in Flex FP, some of them real and others more of a mindset. Consumers like to get what they pay for, and no matter the innovative design and theoretical 128-bit FPU per core, some may view the Bulldozer module as “missing” a floating-point unit. Even taking the high road, it is slightly disingenuous to market a four module Bulldozer as a true 8-core processor, at least against Intel's Sandy Bridge and its dedicated 128-bit/256-bit floating-point engine.
Exactly my sentiment

With AMD painting a big bull's-eye on its shared Flex FP, you can also bet that floating-point performance evaluations will be a major part of any upcoming Bulldozer review. The shared resources of the module also extend to the integer unit, which AMD has already stated has only 80 percent the performance of a dedicated multi-core. If FPU performance tapers off even lower, that could spell bad news for Flex FP.
Doesn't sound like it is two full cores based on what AMD has stated. If it performs at an average of 80% that of dedicated Multi-Core then it is not a Dedicated Multi-Core. So they're not two true Cores. It seems logical to me.
 


I think Intel and AMD are trading blows with core count. I don't see one always upping the other. I guess it depends on how we look at it too. If we go by MCM, then AMD has the most cores. If we go by monolithic, then Intel has the most cores. And on a technicality, Intel will have the most cores for a while if you count their 48-core CPU going out to researchers for testing in cloud computing. I don't count it myself as it's not something the vast majority could use, but it is still an amazing feat.

As for the clock speed, I don't think Intel is really adding to it. Their 32nm process has been around for a while and they have been doing semiconductor manufacturing for quite a while. They know the tricks better than most on how to mature a process to its peak, and I think their 32nm is hitting its peak. And as new processes come out (22nm, and then 14nm) they will start off low and end up higher than 32nm.

If we look at both AMD and Intel CPUs from 65nm to 45nm we can see this. 65nm Core 2 Duos and Athlon X2s had a hard time really pushing 4GHz on air. But the move to 45nm pushed Intel past 4GHz on air, and most Phenom IIs will hit 4GHz on air with ease.

I would suspect that BD will start off low like Nehalem, pushing 4GHz to 4.4GHz on air, and then as the process matures it should be able to push 5GHz, or at least 4.5GHz like SB does.

If I were to look to the future of performance, I wouldn't necessarily say that cores or clock speed alone will be the main performance champ. It will be a mix of both. I consider this because if you look at software, not all software gets coded for multiple cores, and the majority of what does is very inefficient. The best I have ever seen in a game is Source. It utilizes cores very well and pushes at least a 50% boost in FPS from dual to quad. But most games will use one core for the game, one for physics, one for sound and the rest for background tasks (as long as you have Vista or 7).

But to say that only clock speed or only cores will determine performance boosts is wrong. It will be both together in a decent upgrade for each.
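As a sanity check on that Source-engine figure, Amdahl's law can tell us what a ~50% FPS gain from dual to quad would imply about how much of the work is actually parallel (a sketch under the usual idealized assumptions; the 50% number is just the one quoted above):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when fraction p of the work
    is perfectly parallel (no scheduling or memory overhead)."""
    return 1.0 / ((1.0 - p) + p / n)

# A 50% gain going from 2 to 4 cores corresponds to roughly 80%
# of the frame time being parallel work:
ratio = amdahl_speedup(0.8, 4) / amdahl_speedup(0.8, 2)
print(round(ratio, 3))  # 1.5
```

So even a well-threaded engine like Source still leaves a substantial serial fraction, which is why both clocks and cores matter.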
 


I think "per socket" performance pretty much nullifies the "cores vs. clock speed" argument. That is what I always recommend.
 


Yes.. for their Core architecture they're using a single scheduler (SMT is not part of the argument; as we all know, it can on average deliver a ~15-20% performance boost).

The difference is that each Intel Core has access to a full 256-bit FPU capable of an AVX command. FPU performance ought to be decidedly in Intel's corner, while Integer performance is a bit of a question mark.

We don't know how well Bulldozer will perform relative to Sandy Bridge but we do know it will be taken out to the back of the shed when it comes to Floating Point Performance.

Say you're transcoding using an application that makes good usage of AVX commands. This application schedules some number x of commands; x = 8 for the sake of argument.

You have a machine running a Bulldozer Quad Module processor and another running an 8-Core Sandy Bridge processor.

It would take two cycles for the Bulldozer Quad Module to compute 8x 256-bit AVX commands, while Sandy Bridge could do it in a single cycle. This is of course not a real-world scenario, but it does illustrate what I am trying to say. You don't truly have 8 Cores when you have 4 Bulldozer Modules, because when running FP-heavy workloads... your Bulldozer will function more like a Quad than an 8-Core processor.
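That two-cycles-versus-one arithmetic is just a ceiling division over the number of 256-bit-capable units; a minimal sketch (same idealized one-AVX-op-per-FPU-per-cycle assumption as before, and the 8-core Sandy Bridge part is hypothetical):

```python
import math

def cycles_for_avx(ops, units_256bit):
    """Idealized: cycles to issue `ops` 256-bit AVX instructions on a
    chip with `units_256bit` units, each taking one op per cycle."""
    return math.ceil(ops / units_256bit)

# 8 AVX ops: a 4-module Zambezi exposes 4 shared 256-bit FPUs,
# while an 8-core Sandy Bridge-class chip would expose 8.
print(cycles_for_avx(8, 4))  # 2
print(cycles_for_avx(8, 8))  # 1
```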

Is AMD not interested in competing with Intel and nVIDIA in the Scientific computing market? First Radeon drops computational power in exchange for more traditional gaming performance and now AMD drops their FPU performance on their CPUs.

Cost cutting measures are good but at some point you have to compete in terms of overall throughput.

At the end of the day, how you view Flex FP comes back to your concept of a CPU core as it relates to the Bulldozer. If terms like "processor module" and "shared resources" are not your thing, then AMD could have a tough time, especially if core efficiency and price-performance are ignored in favor of a straight core-to-core performance comparison.
Couldn't agree more.
 
But, in order to do 256-bit they merge a 128-bit FPU with the 128-bit SSE registers, so, they cannot do an AVX and an SSE instruction on the same cycle.

Because they are sharing these registers, does that mean it is not a real core?
 
Here's the bottom line. You are arguing that shared resources mean something is not a "real core" but your frame of reference is an intel processor that has plenty of shared circuitry.

Let's face it, this is a non-issue, it is all about per socket price, performance and power consumption. Anything else is simply people trying to split hairs.
 


Subjective Values you're putting forward.

I couldn't care less about Socket Price. I give some credence to Power Consumption but am most enthused about Performance. Everyone has their own opinions.

And it is not a non-issue. I've just illustrated (above) how it can be an issue... especially for people who like top performance and enjoy FP heavy workloads (F@H, ABC@Home, Grid@Home, Einstein@Home, Transcoding, Encoding etc).

 


I think the problem is that this was supposed to be AMD's answer to SMT: CMT. It's confusing, as it was said to be a better version of HT when in reality they are cores that share a few more resources than normal.

I think the main goal of this is to lower die size if anything, sort of the same approach ATI had with lower die size and better performance per square mm. Maybe. Who knows.

Still its best to wait until it comes out and see what happens performance wise.
 
I think the bottom line is whether BD cores will be faster than i7 cores clock for clock.

According to a lot of benchmarks an i7 950 often outperforms a 1090T EXCEPT for some highly threaded apps, but overall 4 Intel cores are better than 6 AMD cores. Throwing more cores at the problem isn't always the solution.

http://www.anandtech.com/bench/Product/100?vs=146

Forgetting price for a moment, comparing 6 Intel cores vs 6 AMD cores really shows where both companies CURRENTLY stand clock for clock and core for core.
http://www.anandtech.com/bench/Product/142?vs=146

John, do you know if any info will be available in March that can give us an idea of Bulldozer's real world performance? I'm kinda tired of learning about the architecture and how great it sounds on paper. ( I can't speak for all, JMO )

Since BD is scheduled for Q2 release, is it early Q2 or late Q2 that we'll see BD in the market?
 


Yeah - I'm hoping that some real stuff starts getting released in March, and by real, I mean solid reliable numbers by reliable sources. We can talk about the architecture all day long, but what does it translate to in terms of performance?
 

Fixed that for you.
 


Thanks. 6 AMD cores are still getting spanked by 6 Intel cores; what's your point?

I know, I know, they're cheaper, blah blah, I'm just comparing 6 core vs 6 core, clock for clock, the gap widens once they're both overclocked, but let's not go there.

When AMD had the Dual Core FX series that spanked Intel, those too cost $1000 at the time; the shoe is simply on the other foot right now, and I sure hope BD will close this gap, core for core and clock for clock.

I wonder how high BD will overclock too.

This is the most important info people care about at the end of the day, all we can do is theorize on the architecture and its potential performance, but none of us know how well it will translate in real world.
 

Just nit-picking lol. 3.3 vs 3.3. I know it still makes no difference.

And I never said anything about price. I know your point.

Kinda interesting:
http://www.anandtech.com/bench/Product/99?vs=142

Two additional cores make little difference in most benchmarks, if anything. 4 cores is really the max any "normal" or even heavy user today should need. Heck, my 6 cores are overkill even for all the folding and virtualization I do.
 