Everything Zen: AMD Presents New Microarchitecture At HotChips

bit_user · Aug 25, 2016

Nice article, Paul!

The only block that strikes me as truly novel is the stack engine. I'm curious to know more about it, but perhaps the article already conveys everything they disclosed? Stack is an interesting area for optimizations, but it seems that a lot of them would bend or break ABI compatibility.

For instance, you could use a dedicated cache, so as not to compete with instruction fetches & load/store. But, no doubt there's code which loads values previously pushed onto stack, and probably even some which uses push and pop interchangeably with load/store.
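To make that concern concrete, here is a toy Python model of a stack engine. It loosely follows the behaviour publicly described for stack engines in earlier x86 cores: push and pop adjust a small offset in the front end instead of issuing an ALU uop to update the stack pointer, and an explicit use of the stack pointer forces a synchronizing uop. The class and method names are hypothetical, and nothing here is taken from AMD's Zen disclosure.

```python
class StackEngineModel:
    """Toy model: the front end tracks a pending stack-pointer delta."""

    def __init__(self, rsp=0x1000):
        self.rsp = rsp        # stack pointer as seen by the back end
        self.delta = 0        # offset accumulated by push/pop in the front end
        self.alu_uops = 0     # stack-pointer update uops sent to the ALUs
        self.memory = {}      # one shared memory for push/pop AND load/store

    def push(self, value):
        self.delta -= 8
        self.memory[self.rsp + self.delta] = value   # store only, no ALU uop

    def pop(self):
        value = self.memory[self.rsp + self.delta]   # load only, no ALU uop
        self.delta += 8
        return value

    def read_rsp(self):
        """Explicit use of rsp (e.g. mov rax, [rsp+8]) needs a sync uop."""
        if self.delta != 0:
            self.rsp += self.delta
            self.delta = 0
            self.alu_uops += 1
        return self.rsp

    def load(self, offset):
        """Ordinary load relative to rsp; sees data written by push."""
        return self.memory[self.read_rsp() + offset]


def demo():
    se = StackEngineModel()
    se.push(1)
    se.push(2)
    se.push(3)
    top = se.load(0)       # plain load reads what push wrote
    second = se.load(8)
    popped = se.pop()
    return top, second, popped, se.alu_uops


print(demo())  # (3, 2, 3, 1)
```

Three pushes cost no stack-pointer ALU uops in this model, and only the first explicit rsp-relative load pays for one sync. Because push/pop and load/store share the same `memory` dictionary, the model also shows why a dedicated stack cache would have to stay coherent with ordinary loads and stores.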

BTW, I'd love to read a side-by-side comparison with Skylake, at this level of detail. I'm sure there are some slight differences in the way things are put together. So, even where there are similarities, some (possibly significant) variations still lurk.

Martell1977 · Aug 25, 2016

If the performance is as advertised, then the real make or break will be availability. Looking at the current generation of AMD and Nvidia GPU supplies, I hope they can keep the CPUs in stock so the prices don't inflate and totally screw AMD over.

As for Zen+....what's the big deal? Intel is selling Skylake, but told everyone about Kaby Lake and Cannon Lake on the roadmap. The fact that AMD has Zen+ in development shows that, like Intel, they are confident that Zen will be a success and are continuing forward.

Also, comparing Zen @ 3 GHz to an underclocked i7 is not a problem. They are showing that they are at least on par with, if not a little ahead of, Broadwell-E, and that is really good. In a comparison like this, comparing it to a CPU with a higher clock speed would skew the results. The only other option would have been to overclock Zen. They were trying for an apples-to-apples comparison.

InvalidError · Aug 25, 2016

bit_user :

While that would be nice, most of the really interesting details are hidden deeper inside the individual functional blocks and I doubt either company will be willing to provide that level of details on their schedulers, branch predictors and other key circuits.

Since much of that stuff is likely very similar between Intel, AMD and any other CPU architecture using a given concept, access to the actual HDL code may be necessary to properly analyze the meaningful implementation differences between these common concepts.

bit_user · Aug 25, 2016

photonboy :

Architecturally, I agree with you. I think each generation had all the architectural improvements they could reasonably fit.

This seems often overplayed, but one does get the sense that they could push clocks a little higher, maybe make caches a little bigger, or perhaps add a couple more cores. They have healthy margins, today. If pressured, they ought to be able to deliver slightly better value for money.

Where we might see the biggest difference is in their server products. Intel completely owns that market, today. With renewed pressure from AMD, as well as some heat from various ARM upstarts, we might start to see some significant price cuts in the E5, E7, and D-series Xeons.

IceMyth :

How long are you prepared to wait? One could say the same thing about Skylake and Kaby Lake, etc. They'll probably launch about a year apart, if I had to guess.

Brian_R170 :

Really? I could see VR eventually having a measurable impact in PC sales, but I don't see the general public drifting away from PCs because of any problem Zen might solve. If anything, it'll improve performance/$ a bit, but nothing earth shattering. After Zen, the low end won't be any cheaper, and the high end won't be much/any faster.

Brian_R170 :

The only things like that which Zen could usher in (and it really has nothing to do with the CPU core, itself) is a VR-capable APU. If you can bring VR down below the $1k threshold, then you might start to see adoption outside of the bleeding edge segment.

photonboy · Aug 25, 2016

BIT_USER
A VR-capable APU might be a game changer, I agree. I'd like to see how large of a GPU they could manage to fit into an APU (Zen CPU + Polaris GPU).

I guess it's just taking HALF the size of the 8-core Zen CPU, comparing that to the size of the RX-480 and estimating how much space you have.

The RX-480 is the minimum, so if they can manage that with a 4-core CPU that would be interesting.

bit_user · Aug 25, 2016

InvalidError :

I'm not asking for the moon and the stars! I just wonder, if you take the publicly disclosed information and compare the two side-by-side, how do they look?

I feel as if I can't get the big picture from this. I recognize and understand virtually all of the block names, but I want to know whether AMD did any clever re-arranging of things from how Intel did it, or whether AMD made anything massively bigger than Intel, etc. I know it has limited predictive value, if one is only concerned with power, performance, etc. But I had previously stated that this all looks straight out of the Intel playbook, and then I started wondering whether & to what extent that's correct.

If I had time to burn, I'd do it myself. But maybe Paul or someone else at Tom's wants to take a swing at it.

ElGruff · Aug 25, 2016

@bit_user
I agree that VR is going to be massive, the experience is incredibly immersive even with the solutions that use a smartphone.

There are a lot of people still using 2500Ks on outdated platforms, and Zen is going to be very attractive to them (platform features aside, the extra 4 cores / 12 threads are going to be very tempting).
An 8c, 16-20 CU APU with HBM should also be a game changer. So nice of Intel to create a market for ultrabooks xD

InvalidError · Aug 25, 2016

ElGruff :

I am still highly skeptical about the future of VR - I heavily doubt I am the only one who already occasionally has issues with motion sickness with regular 3D games and VR can only make that orders of magnitude worse if less than perfect.

As for Zen's benefits over Sandy Bridge, those are of no importance as long as there is next to no software and hardware able to make use of those extra resources. The updated IOs may be nice but most people have no use for them yet. An upgrade is not worth bothering with if it does not improve your immediate and near-future experience by a significant amount. I'm using an i5-3470 and I'm not going to upgrade it in the foreseeable future since it still handles everything I need to do more than well enough for me. And that's one of the big challenges for AMD: there is a horde of people like me who are still perfectly happy with their 3-5 years old PCs and won't be giving Zen more than a passing glance unless it offers terrific bang-per-buck.

IceMyth · Aug 26, 2016

@bit_user

Will depend on what the coming Zen will be vs how long until Zen+ will come out and, most importantly, the prices. As of now, I won't do any upgrades before next Aug., which will give enough time for PCIe 4.0 & AM4 & maybe Zen+ to come out.

Wisecracker · Aug 26, 2016

I guess any publicity is good for Zen and AMD, but the author could have saved his subjective snark for the comment section.

WFang · Aug 26, 2016

I'm still rocking my oooold phenom CPU I got off of Newegg more than 5 years ago.. Guess what, paired with a Sapphire Fury (1 x SAPPHIRE NITRO Radeon R9 Fury 100379NTOCSR 4GB 4096-Bit HBM PCI Express 3.0 x16 TRI-X OC (UEFI) Video Card) I snagged brand new at $299 fire-sale, I'm VR Gaming with the rest of them!

Do I wish I had a more recent CPU and MB? -YES, of course, but at the same time I'm not overly hampered by what I have.. When Zen comes out and stirs the pot price-wise, I'll likely update the system, and, all things considered, I'll be picking up a Zen if nothing else than to help keep AMD in the market.

bit_user · Aug 26, 2016

WFang :

Really? Which HMD? That CPU is well below spec, and I thought all the boards were only PCIe 2.0.

Paul Alcorn · Aug 27, 2016

bit_user :

I will do some digging to see if I can get like-for-like block diagrams of the different units, but I don't think that Intel has been as forthcoming with the various stages of the pipeline (usually pretty high level outlines). I was pretty surprised that AMD was so transparent, tbh. AMD gave us a lot to chew on, and as more SoC details come forth we will get a look from a higher level, then we can definitely dig down into direct comparisons of the larger picture.

Paul Alcorn · Aug 27, 2016

InvalidError :

I think most people would be shocked at how old my desktop proc is, as well. Software development just seems...stalled. It's odd, slow procs make for wonderful programmers, but big nasty procs make for lazy programmers (hides behind couch).

I, for one, will buy Zen just to play with it, but probably a lower end SKU. Something new, something different. It will probably make zero difference in what I do on a daily basis, but I am excited to kick the tires.

bit_user · Aug 27, 2016

Paul Alcorn :

I would appreciate that.

Intel did have a presentation that goes into some of this level of detail, at IDF '15. Just search for "IDF15 Skylake". I don't know if they released anything else on it. I think it's common to see this level of detail at Hot Chips, though I don't follow it closely enough to know whether Intel has been getting rather tight-lipped, in recent years.

<rant>
I've got to say, while searching for that, I visited another prominent hardware site's CPU section and was pretty surprised at the number of recent stories that I didn't see on Tom's: a certain OEM shipping Xeon Phi x200 systems, more from Hot Chips 2016, more on AMD's Naples platform, details on Intel's Broxton, Knights Mill announcement, Intel's Joule, Kaby Lake is shipping, Apollo Lake news, and Denverton. Just in the past 1.5 months.

Please don't take that personally, but I thought it was worth mentioning. I appreciate all of your thorough and well-written articles, but I could use more of those stories and less coverage of such things as flatulence simulation devices. Maybe I'm in the minority, here.
</rant>

InvalidError · Aug 27, 2016

Paul Alcorn :

Zen is a similar story for me: don't need it but might still get one out of curiosity if it delivers i7 performance at i5 prices.

COLGeek · Aug 28, 2016

For the sake of all, let us hope that Zen is truly competitive by some measure. Competition drives innovation and lowers cost for consumers. For too long, AMD's relative weakness, based on stagnant CPU evolution, has allowed Intel to dominate and dictate the terms of PC based computing.

bit_user · Aug 28, 2016

COLGeek :

I agree with all that, but don't forget the process advantage Intel has enjoyed for so long. I actually wonder how well AMD's current chips compare with Sandybridge, which is the last thing Intel built on 32 nm.

Even some of the IPC improvements that went into Zen might not have been viable, at 32/28 nm.

Without getting ideological about this, I think we could probably agree that we'd have had a much more vibrant CPU market, if Intel had been forced to spin-off its fabs, or at least manage them as a completely separate business that gives fair access to everyone. Whether or not you think that should have happened is a completely separate matter. Just take it as a hypothetical.

Anyway, AMD needs to make the most of this rare process near-parity. I've no doubt Intel will ship at 10 nm at least a whole generation before AMD can get a shot at it.

InvalidError · Aug 28, 2016

bit_user :

I suspect that the single main reason behind Zen's IPC gain is the P4/Core-like decoded uOP cache: it greatly reduces the latency penalties of branch mispredicts when the mispredicted branch code is still within the trace cache by bypassing the decode overhead.

bit_user · Aug 28, 2016

InvalidError :

Did you mean "the biggest single reason"? I'd probably agree with that, but I doubt it accounted for much more than half of the increase, given the number of other improvements made.

I've already said I couldn't believe they didn't have a uOP cache since long ago. It benefits not only the mis-predicted case you mentioned, but also reduces energy consumption and eliminates a potential bottleneck, for the predicted case. In fact, it might have more impact across the predicted cases, since those are (pretty much by definition) the most common.

InvalidError · Aug 28, 2016

bit_user :

In the case of a successful predict, the instructions are already being fetched and decoded but it does reduce the number of instances where instruction decoders become bottlenecks during small loops too. There wouldn't be much point in having 6+ ports wide dispatch (10 total in Zen: 4x ALU, 2x AGU, 4x FP vs 4x INT + 3x shared FP for Steamroller) if decode could only produce three or four instructions per cycle.
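A back-of-the-envelope Python sketch of that front-end argument follows. It estimates how many cycles a small loop needs when uops come either from the decoders or from a decoded uop cache, capped by the dispatch width. The 10-port figure and the "three or four per cycle" decode rate come from the post above; the uop cache width and capacity are made-up illustration values, and the model ignores dependencies, branch prediction and everything else a real core does.

```python
import math


def loop_cycles(body_uops, iterations, decode_width, cache_width,
                cache_capacity, dispatch_width):
    """Estimate front-end-bound cycles for a small loop (toy model)."""
    fits = body_uops <= cache_capacity       # does the loop body fit in the uop cache?
    cycles = 0
    for i in range(iterations):
        from_cache = fits and i > 0          # the first pass must go through decode
        supply = cache_width if from_cache else decode_width
        per_cycle = min(supply, dispatch_width)
        cycles += math.ceil(body_uops / per_cycle)
    return cycles


# Hypothetical loop: 24 uops per iteration, 100 iterations.
no_cache = loop_cycles(24, 100, decode_width=4, cache_width=8,
                       cache_capacity=0, dispatch_width=10)
with_cache = loop_cycles(24, 100, decode_width=4, cache_width=8,
                         cache_capacity=2048, dispatch_width=10)
print(no_cache, with_cache)  # 600 303
```

With these assumed numbers the decode-only case is stuck at four uops per cycle no matter how wide dispatch is, while the cached loop roughly halves the cycle count after the first pass.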

Having a decoded uOP cache has many benefits. I would be a little surprised if it was only responsible for 50% of Zen's gains.

I wonder if AMD will give up on having identical execution ports and begin dedicating execution ports to specific subsets of instructions like Intel does. They have already taken one small step in that direction by having two dedicated AGU ports.

WFang · Aug 29, 2016

bit_user :

Using HTC Vive VR. To be honest I wasn't sure at all, in fact I was nearly positive that the CPU wouldn't "cut it" based on the requirements.. but I decided to try. I'm pretty sure my CPU is throttling the Fury card in some instances, but surprisingly, EVERY title I have played so far has played great!

As far as PCIe 2.0 vs 3.0, it has been shown many times over that GPUs are almost never bandwidth-constrained on the bus, especially for single card setups, and since a 3.0 card is fully capable of communicating on a 2.0 or 1.0 bus, that's not a problem in itself.
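For reference, the theoretical per-direction link bandwidth of each generation can be worked out from the signalling rate and line encoding in the PCIe specifications (2.5 and 5 GT/s with 8b/10b encoding for 1.0 and 2.0, 8 GT/s with 128b/130b for 3.0). The short Python sketch below does that arithmetic for an x16 slot. The function and dictionary names are just for illustration, and packet and protocol overhead are ignored, so real throughput is lower.

```python
def pcie_bandwidth_gbs(gt_per_s, payload_bits, encoded_bits, lanes=16):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    usable_bits_per_s = gt_per_s * 1e9 * payload_bits / encoded_bits
    return usable_bits_per_s / 8 / 1e9 * lanes


# (transfer rate in GT/s, payload bits, encoded bits) per generation
PCIE_GENERATIONS = {
    "1.0": (2.5, 8, 10),
    "2.0": (5.0, 8, 10),
    "3.0": (8.0, 128, 130),
}

for gen, (rate, payload, encoded) in PCIE_GENERATIONS.items():
    bw = pcie_bandwidth_gbs(rate, payload, encoded)
    print(f"PCIe {gen} x16: {bw:.2f} GB/s per direction")
```

That puts a 2.0 x16 slot at about 8 GB/s against roughly 15.75 GB/s for 3.0, so the older slot has half the headroom. Whether a given game ever needs more than that is the empirical question the post is answering.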

Even the memory in my old faithful PC is shit. It's 4 sticks made of two mismatched pairs totalling up to 12GB.. lol. The EVO SSD and the Fury GPU are the only good stuff in that box.

Partial excerpts from CPU-Z report (I'm running stock speeds too) for my cpu, gpu and memory:

CPU-Z TXT Report
CPU-Z version 1.76.0.x64

Processors Information
-------------------------------------------------------------------------

Processor 1 ID = 0
Number of cores 4 (max 4)
Number of threads 4 (max 4)
Name AMD Phenom II X4 945
Codename Deneb
Specification AMD Phenom(tm) II X4 945 Processor
Package Socket AM3 (938)
CPUID F.4.3
Extended CPUID 10.4
Brand ID 13
Core Stepping RB-C3
Technology 45 nm
TDP Limit 95.9 Watts
Core Speed 3013.8 MHz
Multiplier x Bus Speed 15.0 x 200.9 MHz
HT Link speed 2009.2 MHz
Stock frequency 3000 MHz
Instructions sets MMX (+), 3DNow! (+), SSE, SSE2, SSE3, SSE4A, x86-64, AMD-V
L1 Data cache 4 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache 4 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache 4 x 512 KBytes, 16-way set associative, 64-byte line size
L3 cache 6 MBytes, 48-way set associative, 64-byte line size
FID/VID Control yes
FID range 4.0x - 15.0x
Max VID 1.375 V
# of P-States 4
P-State FID 0xE - VID 0x0E - IDD 19 (15.00x - 1.375 V)
P-State FID 0x7 - VID 0x16 - IDD 15 (11.50x - 1.275 V)
P-State FID 0x2 - VID 0x1E - IDD 13 (9.00x - 1.175 V)
P-State FID 0x100 - VID 0x2A - IDD 7 (4.00x - 1.025 V)

Package Type 0x1
Model 45
String 1 0x3
String 2 0x6
Page 0x0
CmpCap 4
ApicIdCoreSize 4
TDC Limit 76 Amps
Boosted P-States 0
Max non-turbo ratio 15.00x
Max turbo ratio 15.00x
Max CPU COF 30
Core Performance Boost no
P-State 0, FID 0xE - VID 0x0E (15.00x - 1.375 V)
P-State 1, FID 0x7 - VID 0x16 (11.50x - 1.275 V)
P-State 2, FID 0x2 - VID 0x1E (9.00x - 1.175 V)
P-State 3, FID 0x100 - VID 0x2A (4.00x - 1.025 V)
Attached device PCI device at bus 0, device 24, function 0
Attached device PCI device at bus 0, device 24, function 1
Attached device PCI device at bus 0, device 24, function 2
Attached device PCI device at bus 0, device 24, function 3
Attached device PCI device at bus 0, device 24, function 4
TSC 3013.6 MHz

Temperature 0 38°C (99°F) [0x12C] (Core #0)
Power 0 104.50 W (Package)

-------------------------------------------------------------------------
Display Adapters
-------------------------------------------------------------------------

Display adapter 0
Name AMD Radeon (TM) R9 Fury Series
Board Manufacturer PC Partner
Memory size 4095 MB
PCI device bus 1 (0x1), device 0 (0x0), function 0 (0x0)
Vendor ID 0x1002 (0x174B)
Model ID 0x7300 (0xE331)
Performance Level 0
Core clock 300.0 MHz
Memory clock 500.0 MHz
Performance Level 1
Core clock 1020.0 MHz
Memory clock 500.0 MHz

Chipset
-------------------------------------------------------------------------

Northbridge AMD 870 rev. 00
Southbridge AMD SB850 rev. 40
Graphic Interface PCI-Express
PCI-E Link Width x16
PCI-E Max Link Width x16
Memory Type DDR3
Memory Size 12 GBytes
Channels Dual, (Unganged)
Memory Frequency 669.7 MHz (3:10)
CAS# latency (CL) 9.0
RAS# to CAS# delay (tRCD) 9
RAS# Precharge (tRP) 9
Cycle Time (tRAS) 24
Bank Cycle Time (tRC) 33
Command Rate (CR) 2T
Uncore Frequency 2008.8 MHz

Memory SPD
-------------------------------------------------------------------------

DIMM # 1
SMBus address 0x50
Memory type DDR3
Module format UDIMM
Manufacturer (ID) G.Skill (7F7F7F7FCD0000000000)
Size 4096 MBytes
Max bandwidth PC3-12800 (800 MHz)
Part number F3-12800CL9-4GBXL
Number of banks 8
Nominal Voltage 1.50 Volts
EPP no
XMP yes
XMP revision 1.2
AMP no
JEDEC timings table CL-tRCD-tRP-tRAS-tRC @ frequency
JEDEC #1 6.0-6-6-16-22 @ 457 MHz
JEDEC #2 7.0-7-7-19-26 @ 533 MHz
JEDEC #3 8.0-8-8-22-30 @ 609 MHz
JEDEC #4 9.0-9-9-24-33 @ 685 MHz
JEDEC #5 10.0-10-10-27-37 @ 761 MHz
JEDEC #6 11.0-11-11-28-39 @ 800 MHz
XMP profile XMP-1600
Specification PC3-12800
Voltage level 1.500 Volts
Min Cycle time 1.250 ns (800 MHz)
Max CL 9.0
Min tRP 10.88 ns
Min tRCD 10.88 ns
Min tWR 15.00 ns
Min tRAS 29.63 ns
Min tRC 40.88 ns
Min tRFC 160.00 ns
Min tRTP 7.50 ns
Min tRRD 6.00 ns
Command Rate 2T
XMP timings table CL-tRCD-tRP-tRAS-tRC-CR @ frequency (voltage)
XMP #1 9.0-9-9-24-33-2T @ 800 MHz (1.500 Volts)

DIMM # 2
SMBus address 0x51
Memory type DDR3
Module format UDIMM
Manufacturer (ID) G.Skill (7F7F7F7FCD0000000000)
Size 4096 MBytes
Max bandwidth PC3-12800 (800 MHz)
Part number F3-12800CL9-4GBXL
Number of banks 8
Nominal Voltage 1.50 Volts
EPP no
XMP yes
XMP revision 1.2
AMP no
JEDEC timings table CL-tRCD-tRP-tRAS-tRC @ frequency
JEDEC #1 6.0-6-6-16-22 @ 457 MHz
JEDEC #2 7.0-7-7-19-26 @ 533 MHz
JEDEC #3 8.0-8-8-22-30 @ 609 MHz
JEDEC #4 9.0-9-9-24-33 @ 685 MHz
JEDEC #5 10.0-10-10-27-37 @ 761 MHz
JEDEC #6 11.0-11-11-28-39 @ 800 MHz
XMP profile XMP-1600
Specification PC3-12800
Voltage level 1.500 Volts
Min Cycle time 1.250 ns (800 MHz)
Max CL 9.0
Min tRP 10.88 ns
Min tRCD 10.88 ns
Min tWR 15.00 ns
Min tRAS 29.63 ns
Min tRC 40.88 ns
Min tRFC 160.00 ns
Min tRTP 7.50 ns
Min tRRD 6.00 ns
Command Rate 2T
XMP timings table CL-tRCD-tRP-tRAS-tRC-CR @ frequency (voltage)
XMP #1 9.0-9-9-24-33-2T @ 800 MHz (1.500 Volts)

DIMM # 3
SMBus address 0x52
Memory type DDR3
Module format UDIMM
Manufacturer (ID) A-Data Technology (7F7F7F7FCB0000000000)
Size 2048 MBytes
Max bandwidth PC3-10700 (667 MHz)
Part number DDR3 1600G
Number of banks 8
Nominal Voltage 1.50 Volts
EPP no
XMP no
AMP no
JEDEC timings table CL-tRCD-tRP-tRAS-tRC @ frequency
JEDEC #1 6.0-6-6-17-23 @ 457 MHz
JEDEC #2 7.0-7-7-20-27 @ 533 MHz
JEDEC #3 8.0-8-8-22-30 @ 609 MHz
JEDEC #4 9.0-9-9-24-33 @ 666 MHz

DIMM # 4
SMBus address 0x53
Memory type DDR3
Module format UDIMM
Manufacturer (ID) A-Data Technology (7F7F7F7FCB0000000000)
Size 2048 MBytes
Max bandwidth PC3-10700 (667 MHz)
Part number DDR3 1600G
Number of banks 8
Nominal Voltage 1.50 Volts
EPP no
XMP no
AMP no
JEDEC timings table CL-tRCD-tRP-tRAS-tRC @ frequency
JEDEC #1 6.0-6-6-17-23 @ 457 MHz
JEDEC #2 7.0-7-7-20-27 @ 533 MHz
JEDEC #3 8.0-8-8-22-30 @ 609 MHz
JEDEC #4 9.0-9-9-24-33 @ 666 MHz
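
As a rough cross-check of the report above, the small Python sketch below recomputes the core clock from the multiplier and bus speed, the memory clock from the 3:10 ratio, and the theoretical peak memory bandwidth for dual-channel DDR3. The input values are copied from the report; the function name is hypothetical, and the 64-bit (8-byte) channel width is the standard DDR3 figure rather than something the report states.

```python
def ddr_peak_bandwidth_gbs(memory_clock_mhz, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s for DDR memory."""
    transfers_per_s = memory_clock_mhz * 1e6 * 2    # two transfers per clock
    return transfers_per_s * bus_bytes * channels / 1e9


bus_mhz = 200.9                      # "Multiplier x Bus Speed 15.0 x 200.9 MHz"
core_mhz = 15.0 * bus_mhz            # close to the reported 3013.8 MHz
memory_mhz = bus_mhz * 10 / 3        # 3:10 ratio, close to the reported 669.7 MHz
peak = ddr_peak_bandwidth_gbs(669.7, channels=2)

print(f"core   {core_mhz:.1f} MHz")
print(f"memory {memory_mhz:.1f} MHz")
print(f"peak   {peak:.2f} GB/s")
```

The result is a peak of about 21.4 GB/s for this system, which is an upper bound and not a measured figure.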

srmojuze · Aug 29, 2016

InvalidError :

In the case of ARM and mobile GPUs I suspect that they have a lot more "headroom" than x86/x64 and particularly with Vulkan (and the lower level implementation in mobile GPUs) and whatever comes after that.

It's just a guess at this stage but it would appear bringing ARM+mobile GPU to "desktop class" is going to be a whole lot more achievable simply because they're going bottom-up, in a sense.

As for AMD, Intel and even Nvidia, I feel they are definitely close to the wall in terms of architecture. They can all probably weather the storm for another 10 years or so by going from 14nm down to, say, 2nm, but even at this point Intel has given up the ghost on mobile and supposedly they are going to cram x86/x64 into IoT (not promising IMO; they are probably going to switch to ARM in their IoT somewhere along the line, or maybe somehow transition to being a sub-10nm manufacturer for a lot of ARM chips?).

Red vs Green and Red vs Blue aside, I'm really interested to see how ARM and mobile GPU scales up especially at sub-10nm - I think there's going to be some real action there before we hit the electron wall and have to switch to optical computing or what not.

Don't get me wrong, "PC enthusiasts" will be in for some treats as x86/x64 and desktop graphics tech are pushed to their literal limits. But the end is nigh.

bit_user · Aug 29, 2016

WFang :

Wow, you're in for a treat, when you finally do upgrade. I had apprehensions about my Sandybridge-E, but after reading that, I'll probably give it a whirl. At least I have PCIe 3.0 and ~40 GB/sec of memory bandwidth.

srmojuze :

There's some truth to that, in the sense that x86 isn't the easiest ISA to decode. That's pretty much it, though. Just a bit of savings on decode is the main architectural benefit ARM has over x86-64, as far as I can see. ARM also has a weaker memory consistency model, though I don't know how much that really helps. On the flip side, Intel has an easier time adding extensions, whereas ARM has to think about their whole ecosystem and not wanting it to fragment and splinter.

I predict a return to VLIW and beyond. Where the hardware gets simpler (like no micro-ops, software-controlled caches, and only coarse-grained instruction reordering), and most of the burden gets migrated into sophisticated just-in-time compilers and optimizers. This is the sort of thing Nvidia is doing, in their Denver core. Can't wait to see how the 64-bit version turns out.

InvalidError · Aug 29, 2016

bit_user :

I doubt micro-ops are ever going to go away due to how much flexibility it affords for re-arranging the architecture behind a given instruction set.

For VLIW and EPIC, Intel's re-order buffer and scheduler have probably reached a point where there would be little to no benefit in having the compiler package instructions based on dependencies, other than reducing scheduling complexity. One major disadvantage of VLIW/EPIC architectures, though, is that optimal compiler output is heavily dependent on the target architecture. If the compiler model differs from the machine model, performance collapses, and this means next to no freedom to improve the hardware without having to recompile code with the correct machine model. Will JiT compilers be enough to make VLIW/EPIC viable? Maybe. But I wouldn't hold my breath, especially if you want an open multi-vendor ecosystem where every chip designer may want to add their own spin to the architectural design without having to write and maintain their own chip-specific VLIW/EPIC compilers.
