AMD Demo's Naples Server SoC, Launches Q2 2017

Guest · Mar 7, 2017

AMD provided more details on its Naples server SoCs at its Tech Day event, along with a few tests comparing it to Intel's Broadwell offerings.


redgarl · Mar 7, 2017

Well, I just hope gaming at 1080p benches are not going to worry investors... just cmon...

aldaia · Mar 7, 2017

Any data on CPU frequency and TDP?

Xeon E5-2699A v4 is a 145W part and runs at 2.4 GHz. 2x the performance with the same number of cores suggests that Naples is running at higher frequencies.

Infidel_2016 · Mar 7, 2017

The datacenter I just worked in had 5400 servers in it. Just over half were running dual and quad AMD 16 core chips. Old tech, but nice to see AMD in the farm.

aldaia · Mar 7, 2017

Answering myself: I just realized the 2x could also be due to the benchmark being memory bound. Naples has 2x the channels, and it gets to 2.5x the performance when the memory frequency is increased.
Cores seem to be relatively irrelevant in this particular benchmark.


spiketheaardvark · Mar 7, 2017

As glad as I am to see a decent AMD desktop chip, it is pretty clear AMD's real priority is servers. Those TDP numbers everyone glossed over in the Ryzen reviews are more important to AMD than I think most people gave credit to.

drajitsh · Mar 7, 2017

Conservatism of data center managers might well nullify any advantage that AMD has, like it did in the P4/Opteron days, when AMD did have a brief lead.

firefoxx04 · Mar 7, 2017

When AMD had their lead, massive cloud companies were not around. Google, Microsoft, IBM, Amazon are always adding servers to their cloud systems. I am sure there will be buyers.

Hand__ · Mar 7, 2017

Power consumption alone will be the reason data centers dump Intel.
An under-clocked Ryzen at 30W scores 850 in Cinebench; that is less power than Intel Atom processors draw under load at 33W.

That will be millions of dollars in power savings for companies with large data centers.
Intel is scared to death that people will figure this out. AMD's desktop Ryzen CPU is beating their low-power Atom processors.
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/

bit_user · Mar 7, 2017

According to this, Naples is just 4x desktop Ryzen dies, in a single package. This suggests there might be only 64-96 lanes of true PCIe bandwidth to the CPU cores, themselves.

http://www.anandtech.com/show/11183/amd-prepares-32-core-naples-cpus-for-1p-and-2p-servers-coming-in-q2

It'd be interesting to see this thing go head-to-head with a KNL Xeon Phi.

genz · Apr 14, 2017

aldaia :

Doubling memory channels equates to a less than 20% performance advantage even in the most biased applications. AMD would have had to write their own benchmark specifically to get the figures you claim. Also, with AMD still not having 3200 memory running on (already released) Ryzen at the time of this article, what makes you believe that they would have it running on Naples multisocket engineering samples that they don't even have finalized chips for? Especially when the competition is top-of-the-range 50+ core multisocket Intel hardware, which is so much more difficult to get fast memory stable on than a comparatively simple i7.

bit_user :

I would call false on that. PCI lanes are not that hard to test for and we would quite easily see if they lied on that point when the GPU cards start dropping to x8 speeds over 96 lanes. Without a major press release before launch (which would be major because of all the bad press it would generate, in combination with the precarious time it is right now for investors with the rating downgrade), AMD stands to be sued by business purchasers that expected 128 PCI lanes and got 96. With that kind of press release, AMD is still at threat of having major financial issues. What else could they have lied about is the next obvious question, and the recalls will be much stronger than with the competition because they have a new product and a bad track record going back nearly 10 years (performance wise). Look at the fallout with GTX 970s having slower access to one side of the RAM and you'll see. I RMA'd mine because I had it mated to a 4K display.

You just don't do that kind of stuff to server buyers that are purchasing trays full of CPUs at $2000 a unit. You can get sued for millions in losses, lose huge business customers, and then get sued for millions more in damages because your lack of PCI lanes caused Google servers to run slowly and had to be worked around by a God level $15000 an hour tech team and revision of the load balancing code for the entire Google system. Big firms can't wait to redeploy and software solutions to hardware problems are not only expensive, but clearly not the fault of the company that makes them.

In short, FUD is being spread.

bit_user · Apr 14, 2017

genz :

You're thinking of desktop apps - not server. And you're thinking of single-digit core counts - not 32 per package. They wouldn't have put so many channels, if they didn't have good reason to think they'd need 'em. Just look at Ryzen and tell me AMD is adding more memory than they need!

genz :

Based on what? Anandtech's analysis is a lot more thorough than yours.

While not specifically mentioned in the announcement today, we do know that Naples is not a single monolithic die on the order of 500mm2 or up. Naples uses four of AMD’s Zeppelin dies (the Ryzen dies) in a single package. With each Zeppelin die coming in at 195.2mm2, if it were a monolithic die, that means a total of 780mm2 of silicon, and around 19.2 billion transistors – which is far bigger than anything Global Foundries has ever produced, let alone tried at 14nm. During our interview with Dr. Su, we postulated that multi-die packages would be the way forward on future process nodes given the difficulty of creating these large imposing dies, and the response from Dr. Su indicated that this was a prominent direction to go in.
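The arithmetic in that quoted passage can be checked with a short sketch. The variable names are made up for illustration, and the per-die transistor count is only what falls out of dividing the quoted total by four, not a figure stated in the quote.

```python
# Figures taken from the quoted Anandtech passage.
DIES_PER_PACKAGE = 4
DIE_AREA_MM2 = 195.2            # one Zeppelin die
TOTAL_TRANSISTORS_BN = 19.2     # quoted total, in billions

# Area of a hypothetical monolithic die holding all four Zeppelins.
total_area_mm2 = DIES_PER_PACKAGE * DIE_AREA_MM2

# Derived, not quoted: transistors per die implied by the total.
per_die_transistors_bn = TOTAL_TRANSISTORS_BN / DIES_PER_PACKAGE

print(f"monolithic equivalent: {total_area_mm2:.1f} mm^2")
print(f"implied per die: {per_die_transistors_bn:.1f} billion transistors")
```

The product comes out a touch above the 780mm2 in the quote, which is just rounding.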

genz :

That was a silly thing to do. Nobody thought it was good at 4k, and I'm sure the cut down memory controller made almost no difference in that regard.

genz :

The architects of big systems using these chips, who work for the firms that you claim would sue them, are sophisticated enough to read more than just press releases and surely understand enough about what they're getting to know where any bottlenecks are.

genz :

Since when does Anandtech spread FUD? What do you have on the actual merits of the claim? Sounds to me like you didn't even read the article. Moreover, if you're going around labeling things as FUD, then you should disclose any material interest you have in the matter.

aldaia · Apr 15, 2017

genz :

That may be true for some desktop apps (usually constrained by single-thread performance). However, if your application is significantly bottlenecked by memory bandwidth, doubling the number of channels may double performance. Such applications exist in the HPC domain, and the seismic analysis app used by AMD is a good example.
Go here for a good read on: Basic Computer Architecture Stuff
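
A rough way to see this is a roofline-style bound, where attainable throughput is the smaller of the compute ceiling and bandwidth times arithmetic intensity. This is only a sketch: the function name and every number in it are hypothetical placeholders, not measurements of Naples or Broadwell.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style bound: capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical placeholder values, chosen only to illustrate the shape.
PEAK_GFLOPS = 1000.0       # assumed compute ceiling, same on both systems
BW_PER_CHANNEL_GBS = 15.0  # assumed bandwidth per memory channel

def speedup_from_doubling_channels(flops_per_byte):
    """Ratio of the 8-channel bound to the 4-channel bound."""
    four = attainable_gflops(PEAK_GFLOPS, 4 * BW_PER_CHANNEL_GBS, flops_per_byte)
    eight = attainable_gflops(PEAK_GFLOPS, 8 * BW_PER_CHANNEL_GBS, flops_per_byte)
    return eight / four

# Low intensity: memory bound, so the bound scales with channel count.
memory_bound_speedup = speedup_from_doubling_channels(0.25)
# High intensity: compute bound, so extra channels change nothing.
compute_bound_speedup = speedup_from_doubling_channels(100.0)

print(memory_bound_speedup, compute_bound_speedup)
```

Under these assumptions the memory-bound case scales with the channel count while the compute-bound case does not move, which is the distinction between the two kinds of workload.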

genz :

Who said anything of 3200?

The article says explicitly: "For the second test, AMD conducted the same test but brought all 64 cores to bear and bumped its memory speed up to 2,400MHz while the Intel system remained at 1,866MHz. Once again, AMD's carefully selected workload completed faster on the Naples system, yielding a 2.5X advantage."

1866 -> 2400 is a 28.6% improvement in memory speed, so technically a 25% (2x to 2.5x) improvement in overall performance is perfectly possible for a memory-bound application.
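
The percentages can be reproduced with a few lines. The variable names are just for illustration; the inputs are the figures from the article quote.

```python
# Memory speeds and speedups quoted from the article.
INTEL_MEM_MHZ = 1866
NAPLES_MEM_MHZ = 2400
SPEEDUP_FIRST_TEST = 2.0
SPEEDUP_SECOND_TEST = 2.5

# Relative gain in memory speed and in overall performance.
memory_gain = NAPLES_MEM_MHZ / INTEL_MEM_MHZ - 1
perf_gain = SPEEDUP_SECOND_TEST / SPEEDUP_FIRST_TEST - 1

print(f"memory speed gain: {memory_gain:.1%}")
print(f"performance gain:  {perf_gain:.1%}")

# For a purely memory-bound run, the performance gain should not
# exceed the memory speed gain.
plausible = perf_gain <= memory_gain
```

The performance gain stays under the memory speed gain, so nothing beyond the faster memory is needed to explain the second result.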

genz · Apr 17, 2017

bit_user :

Fair enough. I have never even heard of seismic workloads before. Thanks for the knowledge.

bit_user :

I would call false on AMD only having 64 to 96 lanes of PCIe, not on it being a multi-chip package. What am I, an AMD developer in disguise? Cmon.

bit_user :

It was the difference between 60 FPS GTA5 and 35 FPS GTA5 and that goes for every other game that used that last 512MB. I suppose if I was you this is where I'd tell you you're thinking of server apps and not desktop.

bit_user :

Firstly, let's address the litigation statement. Maybe you've heard of false advertising before. Maybe you've heard it's illegal before. Maybe, just maybe it doesn't matter who sues them... it matters that they can be sued.

You're trying to get your likes now so you want to take on a condescending tone. I understand that, but look up and you will realize the FUD is FUD. Saying the memory difference is down to channels as if it's a bad thing discredits the superior channel count of Naples. Saying it looks like there are only 64 to 96 lanes of PCIe bandwidth ignores this paragraph from your own link detailing 128 PCIe lanes in both single and dual socket configs:

"In dual processor mode, and thus a system with 64 cores and 128 threads, each processor will use 64 of its PCIe lanes as a communication bus between the processors as part of AMD’s Infinity Fabric. The Infinity Fabric uses a custom protocol over these lanes, but bandwidth is designed to be on the order of PCIe. As each core uses 64 PCIe lanes to talk to the other, this allows each of the CPUs to give 64 lanes to the rest of the system, totaling 128 PCIe 3.0 again"

Sorry but that is speculative FUD by definition.
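
For what it's worth, the lane accounting in that quoted paragraph can be written out as a quick sketch. The constant and function names are made up for illustration; the figures (128 lanes per socket, 64 per socket repurposed for Infinity Fabric in a dual socket system) are the ones from the quote.

```python
LANES_PER_SOCKET = 128         # external lanes on one Naples package
FABRIC_LANES_PER_SOCKET = 64   # lanes each socket gives up in a 2P system

def system_pcie_lanes(sockets):
    """PCIe lanes left for the rest of the system, per the quoted text."""
    if sockets == 1:
        return LANES_PER_SOCKET
    if sockets == 2:
        return sockets * (LANES_PER_SOCKET - FABRIC_LANES_PER_SOCKET)
    raise ValueError("the quoted paragraph only covers 1P and 2P systems")

for n in (1, 2):
    print(f"{n} socket(s): {system_pcie_lanes(n)} lanes")
```

Either way the system ends up with 128 lanes, which is the point of the quote.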

Upvoted aldaia for being polite whilst educational. Bravo. FYI the comment was based on 3200MHz RAM being easier to make work with a single chip than 2400MHz over dual socket working with 8 separate chips.

bit_user · Apr 18, 2017

genz :

That wasn't me. I didn't cite any examples. You can find them, if you care to look.

genz :

Ah, but that's not what I said. I don't dispute their claims of what's on the outside of their package, but either Ryzen doesn't expose all of Zeppelin's PCIe lanes, or there's a bottleneck inside the package. I was just highlighting that fact. Do the math. It doesn't add up.

genz :

Obviously not. Some people on here do trade stocks in these companies. That's more what I had in mind.

genz :

I was disputing the idea that it was a good 4k card, or that the memory controller crippling would've really changed that. If you care about gaming and framerates, then why buy a 4k monitor and pair it with a 970? Either get a lower-res monitor, a faster card, or accept that your framerates will suffer. To have a hope of decent 4k gaming, you should've had at least a 980 Ti. Otherwise, it tells me you weren't really serious about 4k gaming.

BTW, I did not state that their crippled memory controller never makes a difference. Just that it's not the only thing standing in the way of the 970 being a good 4k card.

genz :

Snarky comments earn downvotes.

genz :

I don't even know what the heck you're on about. I'm not here for mind games, trolling, etc. I like tech. I like to read about it, think about it, and exchange insights. I'm not really interested in arguing with someone who has a strong partisan opinion or who cares more about winning arguments than learning stuff. If you want to know why I said that about their PCIe bandwidth, I can explain it. But not if you're going to be so snarky and defensive.

genz · Apr 18, 2017

bit_user :

That was a thank you fyi. Take it how you want. I don't care to look because it's not my field and not worth my time. You seem very competitive.

bit_user :

I would then refer you to my original statement where I said that increasing memory channels (identical use case in this test) offers negligible increases in performance. You responded by saying that that did not apply in massively multicore chips. This is the argument that you are correct about Ryzen not exposing Zeppelin's lanes, right?

I would also refer you to original Ryzen documentation speaking on CCX dualcore lith having accelerated access to each other but then a slower access to cores in other CCX via L3 cache. I expect Naples to address this by providing lower latency faster Infinity Fabric and cache as the main feature over Ryzen with respect to upgrades.

https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises

Looking at the benches above, the main binning sacrifice I see being made in Ryzen is there. Access to L3 from a CCX is mostly balanced to exact percentages (as calculated by the rate at which access times to RAM increase at exactly 8MB, despite there potentially being plenty more L3 available on other CCX cores). Load balancing logic for this is almost certainly choosing RAM over another CCX's L3 to store its own data, but even if it sometimes isn't, the main Zep/Naples lith differences are going to be in the L3 and interconnects because of their provisioning for so many more cores. So it's only logical that the binning lith isn't there yet (at the release of Ryzen), hence the engineering sample's benchmarks needed faster RAM latency to simulate it.

I would finally refer you to a simple bit of logic on Infinity Fabric, best told from here:

https://www.reddit.com/r/Amd/comments/5zr8lv/i_asked_amd_a_followup_question_about_infinity/df0g8bq/

Zeppelin is an APU. Greenland supports HBM so is going to need direct connection to the IMC and conventional link to HT/IF, but has 4 GMI links which equate to 7 PCIe 3.0 lanes of performance roughly going to the . It's old info that was improved upon before even Ryzen, and in all likelihood could be the sacrifice we see in the future APU rollout but that's definitely not Naples. In fact that's probably not even server. That's where I started feeling FUD.

This is also why 1866MHz to 3200MHz RAM is a big deal here. Intel's Ring Bus is always at max speed regardless of RAM performance. AMD's solution clocks up and down with RAM speed. IMO that will probably allow us more overclocking headroom by reducing IMC speeds on both sides of the chip on the gaming side when needed (remember the overclocking hits we took when Intel moved the IMC on-chip with the first i7; this sidesteps the problem), and the gearing of Ryzen was toward that, with the benefit of easier binning meaning the product gets out sooner.

Literally speaking, that means that the update for 3200MHz Ryzen RAM is also a big boost in interconnect speeds, and the improvements to make that possible HAVE to happen in microcode. It would be on Ryzen now if it was done. It will be on Naples but I guess that's gonna have to come after these engineering samples.

I would finally say that it's very unlikely that Zeppelin is actually going to perform anything like Naples as Zeppelin is an APU, which assuming it took to server benchmarks would be crossfired on 2 socket servers. Whilst a perhaps revolutionary idea for marketing (extra two GPUs close on each CPU in a 6xCF config with load balancing + HBM extending the infinity fabric across the motherboard to distribute threads closer or further from CPU/4GPUs depending on needs. Would be a monster.) the logical leap simply isn't in AMD's reach right now. Zen is all about balance after all so I see it coming much later.

bit_user :

Oh do tell... it certainly isn't the shader performance. A small overclock puts it squarely in GTX980 territory. The 980ti wasn't out at the time I bought it and a card broke. DX12 was on the horizon with massive performance increases being shouted from the rooftops and I buy a lot of cards to pass down through systems. Do you hear yourself? I deserve a downvote? You need to calm down because I'm here to share and be shared with just like you. Also look up to see what a crippled memory controller does, then realize that GDDR has latency in the hundreds of clocks.

bit_user :

I have substantiated my facts now using logic and/or citations. At the moment, you feared or were uncertain... then defended your doubts that the bottleneck was in the chip or in Ryzen and I've explained. I also added to that by taking a moment to explain why the RAM speeds are likely different. Then that Zeppelin is an APU and thus needs to dedicate more onboard PCI resources and lose die to a GPU (actually I assumed you'd realize that), especially in a crossfire + HBM capable arrangement (and what else is the point of putting APUs in a 2 socket config). The thing we forget here is that the nature of a web over a ring is that you actually have access to MASSIVE bandwidth but with linearly increasing latency (as data has to take longer paths around saturated buses), thus max bandwidth available scales up as more bandwidth is needed by any one core. Every CCX has direct access to the IMC as well, so as the clocks increase through the Infinity Fabric (as a byproduct of RAM speed increases) it becomes logical for the latency times to shorten and thus more routes between cores to become viable over RAM, especially remembering that in real terms (as in adjusting for CAS etc. increases) RAM latency is on the increase and that will only get worse with DDR5 unless NVRAM shocks us.

Oh and your anandtech link qualifies the CCX lanes details.

bit_user · Apr 19, 2017

genz :

I wasn't sure, because you credited me with something I didn't say. Anyway, it's not competitiveness, but rather the fact that I'm only willing to go so far, in these sorts of debates. In the past, I've spent the time to dig up some good benchmarks that show what I'm claiming, here, so I'm confident they can be found. But I don't have them handy and I don't have the time to find them, right now.

So, if you care to educate yourself, you'll search them out yourself. If not, you can dismiss my point and we move on.

It's a bit like the reverse of your GTA V benchmark and the 970. What you say is generally true about memory bandwidth, I'll grant you that. But not for some workloads and for very high core counts. In the same way, what I said about the 970's memory controller is generally true, but not in all cases. If a gamer were a huge GTA V fan and had a 4k monitor, then the 970 would be an especially bad choice. Similarly, if someone were assembling a Naples-based server, then they should consider whether it's really a good idea to populate only half of their memory channels.

Anyway, I don't know why you're wasting time on these enormous posts. My original post was simply referring to this bit of the Anandtech article:

Each die provides two memory channels, which brings us up to eight channels in total. However, each die only has 16 PCIe 3.0 lanes (24 if you want to count PCH/NVMe), meaning that some form of mux/demux, PCIe switch, or accelerated interface is being used. This could be extra silicon on package, given AMD’s approach of a single die variant of its Zen design to this point.

Couldn't be much simpler. 4x16 = 64 lanes. 4x24 = 96 lanes. Not 128 lanes. ...unless Zeppelin has more lanes than Ryzen exposed. If you have information on that, do share. Otherwise I'm done.
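
Spelled out as a quick sketch, with names made up for illustration and the per-die figures taken from the Anandtech passage above:

```python
DIES_PER_PACKAGE = 4
LANES_PER_DIE = 16           # PCIe 3.0 lanes per die, per Anandtech
LANES_PER_DIE_WITH_PCH = 24  # counting the PCH/NVMe lanes as well
ADVERTISED_LANES = 128       # what the package is said to expose

low_estimate = DIES_PER_PACKAGE * LANES_PER_DIE
high_estimate = DIES_PER_PACKAGE * LANES_PER_DIE_WITH_PCH

# Lanes that cannot be accounted for from the desktop die figures alone.
unexplained_lanes = ADVERTISED_LANES - high_estimate

print(f"{low_estimate} to {high_estimate} lanes from four dies")
print(f"{unexplained_lanes} lanes unexplained")
```

Even the generous count leaves 32 lanes that the desktop figures do not explain.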

I don't see any sign of malice, in Anandtech's reading of the facts they cite in that article. But call it FUD, if you want. I'm past the point of caring.
