AMD Releases Interlagos Opterons With Up to 16 Cores

Wait until tomorrow and see what Intel will announce...
Also, isn't there a relationship between memory per core in servers? Does this indicate that AMD needs almost 3X the memory in a server to get 80% better performance against an Intel chip that has been available since Q1 2010?
Shouldn't they compare their Top CPU with the Top CPU from Intel (X5690) and not the SIXTH fastest CPU in the Intel line?
Also, forget synthetic benchmarks. Show me VMware scores, and then let's talk! While you're at it, please include the license cost as part of your deployment cost to run a real application like SQL Server, Oracle, or VMware.
 
The fact that AMD would follow up their dual-die, 12-core Magny-Cours with a dual-die, 16-core Opteron was a no-brainer.

Perhaps the biggest concern here is that AMD's targeted price-point competition, the Xeon X5670, may be beaten thoroughly on performance, but a lot of people look only at the superficial per-package TDP, and Sandy Bridge's low TDP DOES make it look pretty attractive to professionals, even if, at face value, the performance-per-watt is lower. After all, that PPW only considers the cost of the CPU itself; the cost in money, power, and space for cooling may balance it out, and is harder to pin down, so many professionals may, in fact, just go with the lower-TDP option in hopes that it will yield the best deal in the long run.

As for the debate of cores vs. modules... Each of the 16 "cores" in an Interlagos *IS* as capable, on paper, as a single core on a Sandy Bridge CPU. Each can execute its own instructions and operations, independent of the other. The issue here mostly stems from AMD's chosen implementation of AVX; like Sandy Bridge, Bulldozer cannot fully retire an AVX instruction in a single cycle with just a single core. However, while Intel's FPUs are more isolated, AMD's solution was to join the FPUs of both cores in a "module" at the hip; yes, the FPUs may be shared, but there are twice as many per module as a single Phenom II core has.

It's not the number that's the problem, but rather collisions; so far, with the architecture being alien to every operating system out there, there's no system in place to ensure that, with both cores in heavy use, both get smooth access to the FPU resources. If it turns out software can't ever fix this and it's purely a hardware issue, then it's AMD's fault for not including better thread-management circuitry there. At the very least, the most likely short-term solution is an OS that adjusts its thread handling to act much like it does with an Intel Hyper-Threading CPU: fill all the even-numbered threads FIRST, then do the odd-numbered ones. That way, the most intensive threads won't have to clash with each other, sapping performance.
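To illustrate the idea (purely a sketch, not anything any kernel actually does today; it assumes the two cores of each module show up as logical CPU pairs 0/1, 2/3, 4/5, etc., which you'd want to verify against the CPU topology on a real box):

[code]
/* Sketch: pin the first N heavy worker threads to one core per module
 * before doubling up -- the same trick schedulers use for Hyper-Threading.
 * ASSUMPTION: logical CPUs 0,2,4,... and 1,3,5,... are the two cores of
 * each module; check /sys/devices/system/cpu/cpuN/topology/ to be sure. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_worker(pthread_t t, int worker_idx, int n_modules)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* first pass fills the even-numbered core of each module,
     * second pass fills the odd-numbered siblings */
    int cpu = (worker_idx < n_modules)
                ? worker_idx * 2                      /* primary core  */
                : (worker_idx - n_modules) * 2 + 1;   /* sibling core  */

    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}
[/code]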

As for the desktop market... Barring a significant adjustment to compensate for this threading weakness, the per-clock performance of Bulldozer is not going to match up to Intel's offerings. On the server side, with embarrassingly parallel workloads, this can be compensated for by brute force: simply up the total number of cores, and damn the TDP. It may or may not work, though I can't help but notice that if you replace "up the number of cores and damn the TDP" with "up the clock speed and damn the TDP," you have Intel's entire design philosophy from 2001-2003, and we know how well THAT turned out. Plus, of course, given the decidedly fewer-thread nature of home computing & gaming, this strategy won't have much impact outside the server market.

Ironically, though, if AMD either wants to grab more market share to displace the 2500K & 2600K et al., or simply grab a bigger slice of the "extreme enthusiast" crowd, they could opt to tap into the apparent clock-speed headroom of Bulldozer and get over their fear (a fear shared by Intel, admittedly) of actually binning a CPU at the dreaded 4.0 GHz mark or above. AMD claimed the 1 GHz mark, and Intel quickly beat them to the 2 GHz and 3 GHz ones... But right now AMD's best-equipped to hit 4.0 GHz first.

[citation][nom]otacon72[/nom]Um the Xeon X5670 has 6 cores. Intel should make an X5670 with 16 cores and blow the Opteron 6276 out of the water just to shut AMD up.[/citation]
Keep in mind that the X5670 MSRPs for $1,440 US (which, given the history of the 12-core Magny-Cours that preceded Interlagos, is likely around the 6200's MSRP), so price might've factored in here; remember that the "performance crown" is meaningless outside of the gaming sector, and "performance-per-price/watt" is king: after all, in the enterprise market, if one chip isn't powerful enough, your always-there solution is to just buy more.

[citation][nom]de5_roy[/nom]would love to see some server benches against xeons... and some real world general purpose benches e.g. gaming and transcoding on the side.[/citation]
Gaming isn't exactly a real-world benchmark for a server. LINPACK actually is more realistic; after all, that's the benchmark used for ranking the world's most powerful supercomputers.

[citation][nom]shqtth[/nom]fpu was never designed to take orders from two cpus at once. two threads at once on an fpu is hard due to the stack structure of the fpu and the register structure of 3dnow/sse2.[/citation]
Actually, as I recall, Bulldozer removes support for 3DNow! entirely... Ironically, perhaps retaining it would've allowed a potential workaround for FPU resource collisions... But requiring an entirely separate extension just to compensate for that would be unfeasible anyway.

[citation][nom]hetneo[/nom]This is such a huge blunder on AMD's part. This is supposed to be their flagship server CPU but they stacked it against almost 2 years old[/citation]
Again, keep in mind that they're likely going for a price-point comparison. I think that, at this point, AMD's essentially decided to ditch the "go for the crown jewel" strategy; after all, it's mostly a marketing gimmick, and the halo part is never the one that rakes in the sales. And again, in the server marketplace, IT professionals don't need the highest-end chip to get the most FPS: if they need more power for their server farm, they just buy MORE CPUs; hence 2x 2.0 GHz CPUs for $200 US each are EASILY a better deal than 1x 4.0 GHz CPU for $1,000 US, just to give an example.
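Back-of-the-envelope, using nothing but the made-up numbers from my own example (and ignoring per-socket overhead, scaling losses, board costs, etc.):

[code]
/* crude "aggregate GHz per dollar" comparison of the hypothetical
 * figures above -- a sketch of the logic, not a real benchmark */
#include <stdio.h>

int main(void)
{
    double two_cheap  = (2 * 2.0) / (2 * 200.0);   /* 0.0100 GHz/$ */
    double one_pricey = (1 * 4.0) / (1 * 1000.0);  /* 0.0040 GHz/$ */

    printf("2x 2.0 GHz @ $200 each: %.4f GHz/$\n", two_cheap);
    printf("1x 4.0 GHz @ $1000:     %.4f GHz/$\n", one_pricey);
    return 0;
}
[/code]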

In this case, to be more direct: while Intel's current server flagship, the 10-core E7-8870 (12-core Sandy Bridge-based Xeons aren't out till 2012), probably would best the Opteron 6200, it also MSRPs for $4,616 US. And actually, given that its net theoretical throughput (instructions & ops) is only 36.5% greater than the X5670's, thanks to its reduced clock speed (which is slightly below the 36.8% increase in TDP), there's a chance that the 6200 would still come out on top given the margins of victory AMD's been boasting.
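For anyone who wants to check those percentages, a quick sketch assuming the published specs (E7-8870: 10 cores @ 2.40 GHz, 130 W; X5670: 6 cores @ 2.93 GHz, 95 W); core count times clock is only a crude proxy for peak throughput, nothing more:

[code]
#include <stdio.h>

int main(void)
{
    double e7_cores = 10, e7_ghz = 2.40, e7_tdp = 130;   /* E7-8870 specs  */
    double x5_cores = 6,  x5_ghz = 2.93, x5_tdp = 95;    /* X5670 specs    */

    double throughput_gain = (e7_cores * e7_ghz) / (x5_cores * x5_ghz) - 1.0;
    double tdp_increase    = e7_tdp / x5_tdp - 1.0;

    /* prints roughly +36.5% throughput vs. +36.8% TDP */
    printf("throughput: +%.1f%%  TDP: +%.1f%%\n",
           throughput_gain * 100.0, tdp_increase * 100.0);
    return 0;
}
[/code]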

Of course, that just makes me wonder what AMD was thinking in picking the chip to compare it to; they could've retained an "it's more powerful" argument AND tacked on an "it's much lower in price" and an "it's lower in TDP" claim as well.
 
People need to stop looking at core counts... it doesn't matter if it has 4 or 100; if they make a 100-core CPU that costs 2 times less than a 6-core Xeon and it performs better, who really cares what's under the hood? Price/performance is the deal, not cores/performance. For the last time, guys... stop counting the damn cores; they use different architectures. It took some time to get rid of the MHz-per-MHz comparison... remember, in the past, P4 vs. Athlon... those CPUs at the same speed performed differently... people slowly stopped comparing them head to head in the MHz race... now stop comparing CPUs by core count; compare them on power/performance/price... and here, in this particular example, AMD has the edge... you can buy two Opteron 6276s for the price of that Xeon and obliterate it... I don't care how many cores it takes AMD to beat Intel, but for that $1,500 price tag I get 2 CPUs that each have 80% more performance than I get from Intel.
 
So what if performance per core is way worse than Intel's? If the price is the same and the TDP is similar, then for server work, the 16-core that's 84% faster (if it actually is) is a winner.
 
Benchmarks? I'd like to see how the quad-channel memory affects performance.

 
[citation][nom]panders4[/nom]I'm glad they chose to make the FPU shared. Any truly floating point heavy job should be offloaded to the GPU anyway. At least in an ideal world. That seems to be the way GPUs are going anyway.[/citation]
I'm thinking this is their target. They are expecting servers to be specialized anyway, with vector-heavy work going to their GPUs.

The biggest threat is that they can have a motherboard with 64 cores on it.
 
Something lots of people are missing is that the server market is no longer one box = one server. Enterprise is going virtual, meaning one box with multiple sockets and large amounts of memory running a dozen or more servers. It's becoming about how much you can pack into the smallest space at the lowest power cost possible. The actual cost of the hardware is a very small concern, provided it's not astronomical. Support contracts and software licenses tend to be bigger than your HW costs anyway.

I've said this before: the BD architecture was designed to be a server chip. It's wide but not very deep. You can process 32~40+ instructions simultaneously (32 integer / +/- 8 FPU), just not super fast. When you're running dozens of virtual servers on a box, you ~need~ a high core count to reduce the number of context switches.
 
[citation][nom]bfstev[/nom]judging by the comments, not many people seem to know much about the bulldozer processor other than "it sucks!". Bulldozer should do very well in the server space as it does seem to be more aimed at it. They need to come out with a more gaming/consumer oriented part or just leave the consumer market completely (I really hope they don't, but their inability to compete except on price is not very reassuring).[/citation]
they should just leave the consumer market to their APU line
 

They would lose money :non: and then have the potential of going out of business, but would only be kept around because of the government.
 
In its own comparisons against an Intel Xeon X5670-based system, the Opteron 6276 scored an 84 percent higher performance in Linpack.
Is that good? A 16-core (16 integer and 8 floating point units, right?) vs. a 6-core? I know Linpack measures floating point performance, but still..
 
[citation][nom]shqtth[/nom]big reason why the bulldozer failed with the FX line is because they had a 6 core phenom processor, and the bulldozer only had a 4 module version. AMD should have come out with a 6 module version; that way it would have been on par with the 6 core phenom in terms of fpu power. As it stands, the 8 core FX is kind of on par with the 6 core phenom despite the fact the FX is clocked faster.[/citation]
Agreed. AMD could have saved a lot of face and marketed BD as a quad-core with superior hyperthreading, which is what it is. Then it would have been benchmarked properly and compared against other quad-core CPUs. Still wouldn't reach flagship speeds, but at least it wouldn't have been an embarrassment. Maybe that's why marketing got the axe?
 
Dear AMD

Please do a die-shrink of Thuban, clock it at 4+ GHz, and make it compatible with the AM3/AM3+ socket

That should give something for Intel to think about
 
I'm baffled... How come so many people here act as if this were a desktop part? It's a SERVER part. This means there's a completely different standard... If you don't know what you're talking about, please STFU!!!
 
[citation][nom]panders4[/nom]I'm glad they chose to make the FPU shared. Any truly floating point heavy job should be offloaded to the GPU anyway. At least in an ideal world. That seems to be the way GPUs are going anyway.[/citation]
I don't think that's the reason AMD made the FPUs "shared." For one, unlike the fetch/decode portions, they are not 100% shared, and the per-core quantity of circuits is not reduced; a single Bulldozer module (2 cores) has, as far as I've checked, the same number of FPU parts, and the same width, as two Phenom II cores; they're just semi-attached to each other.

The main reason for this is AMD's implementation of AVX, the successor to SSE. While SSE vector instructions work with 4x32 (128-bit) FP data, AVX increases the width to 256-bit without added overhead, for EITHER 8x32 or 4x64 FP vectors.

Neither Sandy Bridge (Intel's first CPU to implement AVX) nor Bulldozer simply widened the FPUs enough to handle it in one go; rather, they allow their existing FPUs to handle half an AVX instruction's workload at a time. In AMD's case, this is accomplished by having the FPUs so close and inter-connected that, when an AVX instruction is executed, it simply treats what would normally be two separate SSE FPUs as a single AVX FPU.
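Here's a minimal sketch of what that width difference looks like from software; the hardware detail (cracking one 256-bit AVX op into two 128-bit halves on the module's shared pipes) is invisible at this level, the point is simply that one AVX instruction covers the work of two SSE instructions:

[code]
/* compile with -mavx; plain C intrinsics, nothing vendor-specific */
#include <immintrin.h>

void add_sse(const float *a, const float *b, float *out)
{
    /* two 128-bit SSE adds: 4 floats each */
    __m128 lo = _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
    __m128 hi = _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
    _mm_storeu_ps(out,     lo);
    _mm_storeu_ps(out + 4, hi);
}

void add_avx(const float *a, const float *b, float *out)
{
    /* one 256-bit AVX add: all 8 floats at once */
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    _mm256_storeu_ps(out, v);
}
[/code]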

[citation][nom]palladin9479[/nom]It's becoming about how much you can pack into the smallest space at the lowest power cost possible.[/citation]
Well, you're mostly right, except that SPACE is often not that big a concern; even here, if they need more CPUs, they can just add more racks. Most large datacenters and server farms are restricted, in fact, by the power that the electrical grid can supply them, rather than floor space; they often have large, empty rooms that would normally fit plenty more racks, if only the grid could get them enough juice to run them.

[citation][nom]iceman1992[/nom]Is that good? A 16 core (16 integer and 8 floating point units, right?) vs a 6 core? I know Linpack measures floating point performance, but still..[/citation]
Um, no. The 16-core Interlagos does have 16 separate FPUs. It is capable of executing 16 separate SSE instructions simultaneously, much the same as you would if you had, say, 4 quad-cores, be they either Phenom-based or Sandy Bridge-based.
 
Well, you're mostly right, except that SPACE is often not that big a concern; even here, if they need more CPUs, they can just add more racks. Most large datacenters and server farms are restricted, in fact, by the power that the electrical grid can supply them, rather than floor space; they often have large, empty rooms that would normally fit plenty more racks, if only the grid could get them enough juice to run them.

Space is most definitely an issue, especially if you're anywhere outside of the USA. The term is rack density, or computational density: basically the amount of resources you can put in the same amount of space. You can't just "add racks"; you must also add the power and environmental controls that go with them, and that is more expensive than the racks themselves. Not to mention floor space is at a premium at most facilities.

My job happens to primarily be designing and engineering exactly the solutions described here. Our customers require that we fit as much as we can per RU, which is why we use Sun UltraSPARC T2s, and now T3s, combined with zones. Thus we can fit what used to be three racks into 20U worth of equipment.
 
Um, no. The 16-core Interlagos does have 16 separate FPUs. It is capable of executing 16 separate SSE instructions simultaneously, much the same as you would if you had, say, 4 quad-cores, be they either Phenom-based or Sandy Bridge-based.
Ah yes, thanks for correcting me. I was thinking about the desktop Bulldozer. So they're saying a 16-core is 84% faster than a 6-core? Hmmm
 