AMD CPU speculation... and expert conjecture


juanrga



Be careful with that review! First, they didn't test final silicon but a prototype in a laptop. Second, the rest of the hardware in the laptop was cheap, e.g. a slow HDD (Anand replaced it with an SSD for one of the tests). Third, they used software that favours Intel or is optimized for Intel, such as Cinebench or PCMark.

Fourth, they write misleading stuff such as: "In all of the lightly threaded cases, a 1.7GHz Ivy Bridge delivers...". They fail to mention that the supposed "1.7GHz" chip has a max turbo of 2.6GHz, whereas the A4-5000 runs at a fixed 1.5GHz.
 

blackkstar

Since we're on the subject of rendering and compiler optimizations, here you go:

Here is the Blender BMW benchmark on my rig at 4.7GHz.

Here are my CFLAGS: -march=bdver2 -Ofast -pipe -mprefer-avx128 -minline-all-stringops -fopenmp -fno-tree-pre -ftree-vectorize
Here are my results:
Coep0BC.jpg


http://blenderartists.org/forum/showthread.php?239480-2-61-Cycles-render-benchmark&p=2453609&viewfull=1#post2453609

That's the thread of folks running the benchmark on their own rigs. Notice how software optimizations can change the score immensely? Notice how I beat a 12-core Intel rig by 3 seconds? And I'm not even at my final overclock, because ambient temperatures are too high in summer.
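(To make the point concrete, here's a toy sketch of my own, not Blender code; the file name and loop are made up. The same C source compiled generically vs. compiled with tuned CFLAGS like the ones above ends up with a very different inner loop.)

/* toy_loop.c -- hypothetical example, not Blender code.
 *   generic: gcc -std=gnu99 -O2 toy_loop.c -o toy_generic
 *   tuned:   gcc -std=gnu99 -Ofast -march=bdver2 -mprefer-avx128 \
 *                -ftree-vectorize -fopenmp toy_loop.c -o toy_tuned
 */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The tuned build can vectorize this loop with AVX and split it across
     * cores via OpenMP; the generic -O2 build runs it scalar on one core. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.5f * x[i] + y[i];

    printf("%f\n", y[N - 1]);  /* keep the result live so it isn't optimized away */
    free(x);
    free(y);
    return 0;
}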

So basically, just stop with Cinebench, it's crooked. You can't even load it into IDA to disassemble it without it whining about wanting Intel DLLs like libguide40.dll. The only thing Cinebench is good for is measuring the performance gains from tweaking your own system.
 

8350rocks




True, however, that doesn't rule out the possibility that the PS4 could come out at 2.0-2.4 GHz and still hit the intended TDP without issue.

Also, if yields are good enough, I would anticipate this happening at some point during the hardware's life cycle anyway.
 

8350rocks



This is exactly what I have been pointing out for months about compiler optimizations... I dropped the torch a while back because it was falling on deaf ears. Either people didn't grasp it, or perhaps they just didn't care.
 

juanrga



No. hUMA provides Uniform Memory Access to processors; NUMA provides Non-Uniform Memory Access.

HT/HTX breaks hUMA, because memory on the main bus would be faster than memory accessed through the HTX slot.

But HT/HTX doesn't break NUMA. For that reason HyperTransport is used in NUMA architectures; AMD itself uses HyperTransport with a proprietary extension: ccNUMA.



If you follow the link that I gave you, you can see for yourself that it is AMD's developer site.



No, it is one DDR3 pool or one GDDR5 pool, but not both at once. Kaveri comes with DDR3 support (hUMA), but there are rumours that a future high-end APU will come with a GDDR5 pool, as in the PS4. As I said before, a mixture of DDR3 and GDDR5 makes no sense because it breaks hUMA.



You said that a decent processor offers 100 GFLOPS. The PS4 CPU has more than 100 GFLOPS.

I also explained to you that Intel was dropped from the race very early, because it lacks the technology.

I have explained in a recent reply to another poster why the AnandTech scores are invalid. Read it.
 

griptwister

lel, hafijur told a game dev he's delusional, then proceeded to say that the A4's competition is the Intel Atom. Meanwhile, benchmarks are showing the Atom getting completely smashed by the A4 processor.
 

hcl123



Easy: don't run CB11.5... which is optimized for Intel and cripples other uarches, AMD most of all (Jaguar not excepted)... As a matter of fact, it won't be possible to run any Windows app on the PS4 anyway; it's FreeBSD. And there the applications are tuned for "compute" (OpenCL, C++11 or 1y, Java/Aparapi, etc.), which will simply run circles around any CPU alone... not only Intel, but Intel very much included.

Worse, it is not a fair comparison AT ALL... even if you ran FreeBSD on Intel and could run the same apps, if those apps are coded for hUMA/HSA and avoid all those memory copies, it's possible they couldn't run on Intel at all, or would run noticeably slower, for that reason alone... Intel won't have hUMA or HSA ever, and has nothing similar now; that is probably why Sony would never have chosen Intel. The CPU argument is very weak and uninteresting for these multimedia systems. If they were meant to run Fritz chess or similar "integer" apps, Intel might have been a consideration, but why a top-power CPU when those applications are INTERACTIVE? Benchmarking them is pointless, because they depend on the user to provide constant input, and any user is orders of magnitude slower than any CPU.

So for interactive, integer-centred games, a weak CPU can be as fast as the fastest of the fast CPUs, because everything waits on the user. On this ground alone Intel could never win.

Just face it... the CPU argument and the integer single-thread argument are only good for biased, passion-blind fanboys doing FREE ADVERTISING wherever they go, with the help of all those biased ATs around.

Jaguar at 2.75GHz is not a high clock by any measure, even more so if it's a turbo clock... and development systems and ES chips usually run everything at lower speeds. I'm quite convinced the PS4 CPU frequency will be above the 1.6GHz you parrot.

[UPDATE: And BE SURE the "CPU for game geometry" argument is also overrated. Of the graphics work, the CPU only runs the drivers and the geometry, and rest assured that geometry could be done almost entirely on the GPU side. Also, HSA provides a clean separation for GPU scheduling (drivers). If that path is super bloated in the PC world and needs a fast CPU, I think we can thank Intel for most of it... but it doesn't have to be that way.

So rest assured that PS4 games will need a lot less CPU (though GCN is not yet a geometry monster)... and for the parts that do need more CPU there are 8 REAL cores; the drivers could multithread well, and in the end gain much more than fewer threads on a faster sequential CPU.]
 

juanrga



Yes. Evidently, when Sony and Microsoft ran benchmarks and simulations to compare the different hardware before choosing the final console hardware, they did not use Cinebench or similar useless stuff.

I have two questions: what version of GCC and what WM/DE are you using?
 

juanrga



There are some slides that I want to emphasize. This one clearly shows AMD's commitment to HSA APUs. The era of multi-core CPUs (8/12/16 cores) is gone:

6hc25_hsa.png


An illustration of performance (and energy efficiency) gain with a desktop/laptop workload

30hc25_hsa.png


The APU is 150% faster than the CPU alone. For the sake of comparison, a Steamroller CPU would be only 15--20% faster than a Piledriver CPU, and a Haswell CPU is about 5--10% faster than an Ivy Bridge CPU.

An illustration of performance (and energy efficiency) gain with a server workload

33hc25_hsa.png


The APU is 480% faster than the CPU alone. This explains very well why AMD is replacing the 8-core Opteron CPUs with the 4-core Berlin APU.
 

8350rocks



*For programs that are optimized to use HSA
 

Ags1



Yeah, yeah, yeah. I am a developer (Java, so I don't deal with the Windows specifics) and I have worked with multithreaded applications. Yes, I agree that as you scale to vast numbers of threads the law of diminishing returns kicks in. But essentially, if you can spawn 4 worker threads to do some broadly parallel task, you can just as easily spawn 20 or 30 such threads. The complexity, approached that way, does not increase.
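(The same idea sketched in C rather than Java, purely as an illustration with made-up names: the worker count is just a constant, so going from 4 to 20 or 30 workers doesn't change the structure of the code.)

/* workers.c -- hypothetical example; build with: gcc -O2 -pthread workers.c */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 20          /* could just as easily be 4 or 30 */
#define N 1000000

static double data[N];

typedef struct { int begin, end; } range_t;

static void *worker(void *arg) {
    range_t *r = arg;
    for (int i = r->begin; i < r->end; i++)
        data[i] = data[i] * 2.0 + 1.0;   /* some embarrassingly parallel work */
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    range_t ranges[NWORKERS];
    int chunk = N / NWORKERS;

    for (int t = 0; t < NWORKERS; t++) {
        ranges[t].begin = t * chunk;
        ranges[t].end   = (t == NWORKERS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NWORKERS; t++)
        pthread_join(tid[t], NULL);

    printf("done: data[0] = %f\n", data[0]);
    return 0;
}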
 

hcl123



I must have a lot of patience... this is child's play... you're getting tricked by words, LOL.

It's hU-MA and NU-MA... the MA, "memory access", let's set that part aside. What's left is *heterogeneous Uniform* versus *Non-Uniform*. If it were merely "uniform" as in plain UNIFORM, it would have been just UMA... not hUMA.

The NUMA you mention also provides a "uniform" view of system DRAM in the sense you imply, especially the cache-coherent extension. It's an integral view of all the attached DRAM (system memory), organized hierarchically (that's where the "N" for Non comes from). And it's kind of the opposite of UMA (not hUMA), which has no hierarchical view: a completely flat view of physical memory, and therefore necessarily a single system memory pool.

Let's consider a 4-socket AMD server system with 64GB per socket. With NUMA, each CPU in each socket sees the full 256GB of system DRAM in a hierarchical way; it's kind of an evolution of SMP (symmetric multiprocessing), only it would need expensive software mechanisms to ensure coherency and consistency across nodes/domains... With ccNUMA, each CPU in each socket sees the full 256GB of system DRAM in a hierarchical way, PLUS all the caches tied in by coherency protocols, dispensing with almost all of the software needed for consistency.

hUMA is the same, only replace "same physical memory" with "same VIRTUAL MEMORY" (there is a difference, if you didn't know), PLUS all the caches tied in by coherency protocols (for example, in ARM the L1 is not coherent, it's private, while in AMD the L1 is coherent).

Actually, ccNUMA is better and more straightforward for programming. The problem is that NOT ALL members of HSA use the same coherency schemes (as discussed), and AMD simply could not impose its coherency protocols on the others, so something different, "heterogeneous", had to be invented.

Actually, I'm quite convinced that a hUMA-ready program could run unchanged on a ccNUMA system, that is, an AMD CPU + GPU in a ccNUMA config... provided that system's TLBs and DMA understand the quirks of the IOMMU, which in AMD's case I think they already do.

So it's not "the opposite"... it's "very similar".



So do "several processing elements" with several MMUs/TLBs on a tightly integrated SoC. With several processing elements there must be arbitration/priorities, hence faster and slower accesses. But if there is arbitration/priorities anyway, it could easily be made to cope with an I/O link in the middle that is transparent to the processing elements' operations.

The only "proprietary extension" to HyperTransport for this is the MOESI coherency protocol that the HT PHYs must understand for ccNUMA with AMD CPUs. And since no other HyperTransport.org adopter makes CPUs, it has remained subject to royalties... and will remain so even if another IDM adopts HT and has its own coherency protocol adapted for those link PHYs. Then we would have two ccNUMA "proprietary extensions"... one for use with AMD CPUs, the other for use with "other" CPUs.

There are more "royalty-subjected proprietary" extensions to HT besides this... HNC (high node count), content-aware routing... and more, I think...



The same mistakes from the same lack of knowledge... but time's up... either you understand it now, or you never will, simply because you don't want to.

THESE ARE IMAGES FROM THE 2011 AFDS (Fusion Summit):
http://www.google.com/search?q=AFDS+2011&num=20&as_qdr=all&tbm=isch&tbo=u&source=univ&sa=X&ei=6rUbUp-pGYH27Aaz0IAY&ved=0CLIBELAE&biw=1784&bih=817

Back then there was no talk yet of cache coherency, but there was talk of... "UNIFIED MEMORY ADDRESSING"... "MMU AND IOMMU"... and "CLEAR SEPARATION OF GPU AND CPU, CPU WITH MMU, GPU WITH IOMMU".

http://www.google.com/imgres?q=AFDS+2011&sa=X&as_qdr=all&biw=1784&bih=817&tbm=isch&tbnid=ZGvfhnvtXNGd0M:&imgrefurl=http://adrenaline.uol.com.br/tecnologia/noticias/8878/arquitetura-com-coprocessador-escalar-sera-o-futuro-das-gpus-da-amd.html&docid=G1urLe9XA79dnM&imgurl=http://www.adrenaline.com.br/files/upload/noticias/2011/06/subzero/afds04.jpg&w=1035&h=609&ei=F7YbUqLLGYjQ7AbPxoDYAg&zoom=1&iact=rc&page=2&tbnh=172&tbnw=293&start=27&ndsp=29&ved=1t:429,r:39,s:0&tx=181&ty=82

A ROADMAP up to "TOTAL SYSTEM INTEGRATION" (preemption/context)... WITH A CLEAR INDICATION OF > "EXTEND *TO* DISCRETE GPU":

http://www.google.com/imgres?q=AFDS+2011&sa=X&as_qdr=all&biw=1784&bih=817&tbm=isch&tbnid=9o25jFXVYmt5vM:&imgrefurl=http://semiaccurate.com/2011/06/20/amd-talks-about-next-generation-software-and-fusion/&docid=jnE1kIZbX7FjCM&imgurl=http://semiaccurate.com/assets/uploads/2011/06/AFDS_Rogers_FSA_map.jpg&w=600&h=277&ei=F7YbUqLLGYjQ7AbPxoDYAg&zoom=1&iact=rc&page=2&tbnh=152&tbnw=311&start=27&ndsp=29&ved=1t:429,r:30,s:0&tx=147&ty=64

Now this makes me lose it... #"%"#$... how on earth do they intend to ***extend it to discrete GPUs***, if the hUMA they are implementing now supposedly BREAKS the moment they try to extend it to discrete GPUs, because of links and this pulled-out-of-nowhere, just-for-controversy "different memory access times" argument??? ... (EDIT)

Go bug someone else... please...

 

8350rocks

Obviously, it's easier to keep hUMA and HSA on die, though it could be extended to a discrete GPU via HSA; you would have to have a CPU and GPU that were both HSA-enabled and capable of hUMA, though. However, the issue lies in the VRAM and DRAM: things loaded into VRAM would necessarily be handled a bit differently, so I firmly believe the actual unified memory addresses would have to cover only the DRAM the system was using.

This is more complicated; however, a future SR FX series/replacement could be made to work in such a manner. This is perhaps the reason for the "delay"...
 

hcl123

You simply need a fast and furious interconnect; most pertinently, a point-to-point serial link is better... and the PHYs on both ends of that point-to-point link can easily be made to understand hUMA.

HT/HTX can do that. The IOMMU, as in IOMMU for the GPU and MMU for the CPU, is already based on some HyperTransport "semantics" (pardon the language), so making those HT link PHYs talk hUMA, when they already talk ccNUMA, is much easier.

Those combo PCIe+HTX slot patents are real (discussed before)... so it's all settled, I think... it only takes time.
 

hcl123



On this point in particular, it's so easy it hurts. Make the VRAM private; that is the point of the tweak to cache coherency. And not all cache on a GPU needs to be coherent, so it could expose only the L2 and keep the VRAM out of it.

OTOH, having the VRAM fully visible doesn't do any harm either; actually it's better, and kind of needed for the "virtual addressing"... when programming for these systems you only see a single pool of virtual memory; the physical memory side is only a consideration if you are doing kernel or other low-level (driver) stuff. And there will be a "runtime" (with the low-level stuff), even with automatic "profiling" that you only have to hint at in your code... it will put your data where it needs to be... so don't worry about your game data spilling out of the VRAM.

Remember, ALL other software will still run on an HSA system, and real HSA software will always need a *runtime*, but it could be made to run on other, non-HSA systems, because this runtime is highly portable.

That is the real beauty of it... it runs everywhere (some places better than others); it's JIT-centered like Java, but with a lower-level, closer-to-the-metal bytecode form (HSAIL), so it's more encompassing, less restrictive, and permits several high-level languages... otherwise no one would ever consider coding for HSA unless heavily subsidized.
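(A rough sketch of that single-pool-of-virtual-memory model, my own illustration rather than the HSA runtime itself, using OpenCL 2.0 shared virtual memory and assuming a device with fine-grained buffer SVM; error checking omitted. The CPU and GPU touch the same pointer, with no explicit copies.)

/* svm_demo.c -- hypothetical example; build with something like: gcc svm_demo.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *p) {"
    "    size_t i = get_global_id(0);"
    "    p[i] *= 2.0f;"
    "}";

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    size_t n = 1024;
    /* One allocation, visible to both CPU and GPU through the same address. */
    float *p = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                          n * sizeof(float), 0);
    for (size_t i = 0; i < n; i++) p[i] = (float)i;   /* CPU writes... */

    clSetKernelArgSVMPointer(k, 0, p);                /* ...GPU reads/writes the same pointer... */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);

    printf("p[10] = %f\n", p[10]);                    /* ...CPU reads the result, no copy-back. */
    clSVMFree(ctx, p);
    return 0;
}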
 

juanrga



No. It is h-UMA and NU-MA. I already explained what each one means and why they are different. I will only add that the former architecture can be considered an extension of UMA to heterogeneous processors.

If you don't want words then try this image :D

HUMA-3.png




No. The leaked documents clearly stated that usage of DDR3 and GDDR5 is mutually exclusive. I already explained to you why.
 

juanrga



It is not only easier, it is also more efficient. AMD could implement HSA in a discrete GPU, but there is no way they can implement hUMA.

My guess is that AMD could implement HSA on a dGPU using an extension of its ccNUMA; let me call this hypothetical extension hNUMA. The problem is that this loses the advantages of hUMA.
 
You can have hUMA and a discrete GPU. Memory accesses would just have to go over PCIe: for the CPU, the VRAM would only be used when you run out of main memory, and for the GPU it would be the reverse setup. Maintaining cache coherency would be a bit of a problem, but it wouldn't be impossible. RAM is RAM, it doesn't matter where it sits; the management of hUMA is all on the memory controller side. AMD can easily bring hUMA to discrete cards once they get it moving.
 

juanrga



Not completely optimized, because the above examples didn't use a fully enabled APU.
 

juanrga



It is not possible to replicate the new APUs' unified memory pool (hUMA), for the reasons I mentioned above. AMD's implementation of HSA on dGPUs uses a single address space split into two memory pools, and they are having problems with the non-uniform access.
 

juanrga

The Stilt's rumour: Kaveri (thus Steamroller) comes with quad-channel memory support.

If top Kaveri APUs come with DDR3-2133 support, then the bandwidth is double that of Richland. Quad-channel would give Kaveri more bandwidth than future dual-channel DDR4 at 2400/3200/4000 MHz, and about the same bandwidth as the top 4266 MHz modules.
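(Back-of-the-envelope, assuming the usual 64-bit channels: dual-channel DDR3-2133 is 2 x 2133 MT/s x 8 bytes ≈ 34 GB/s, so quad-channel would be ≈ 68 GB/s. Dual-channel DDR4-2400/3200/4000 works out to ≈ 38/51/64 GB/s, and dual-channel DDR4-4266 to ≈ 68 GB/s, i.e. about the same.)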
 


Complexity-wise? No, assuming all you are doing is data processing on a single structure. It's when you attempt to thread different pieces of functionality that all talk to each other that management becomes a mess. Hence why game engines tend to be single-threaded: you can thread WITHIN a specific part of a game engine without too much trouble, but trying to thread each separate piece of the engine, have them all run independently, and keep your data access sane without compromising performance is damn near impossible.
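(A toy sketch of the "thread WITHIN a part" case, my own made-up example in C: a per-frame physics update parallelizes cleanly because each body is independent; it's the subsystems talking to each other every frame that gets messy.)

/* toy_physics.c -- hypothetical example; build with: gcc -O2 -fopenmp toy_physics.c */
#include <stdio.h>

#define NBODIES 100000

typedef struct { float x, y, vx, vy; } body_t;
static body_t bodies[NBODIES];

/* Threading WITHIN one subsystem: each body's update is independent,
 * so a parallel loop works with no locking at all. */
static void physics_step(float dt) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < NBODIES; i++) {
        bodies[i].vy -= 9.8f * dt;        /* gravity */
        bodies[i].x  += bodies[i].vx * dt;
        bodies[i].y  += bodies[i].vy * dt;
    }
}

int main(void) {
    for (int frame = 0; frame < 60; frame++) {
        physics_step(1.0f / 60.0f);
        /* Threading ACROSS subsystems is the hard part: AI, physics, rendering
         * and audio would all need to read/write this shared state every frame,
         * which means locks or hand-offs rather than a simple parallel loop. */
    }
    printf("body[0].y = %f\n", bodies[0].y);
    return 0;
}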

But remember the applications we mainly talk about here: Games. And in games, most of the parallel stuff is ALREADY offloaded to the GPU (rendering). The stuff left is, for the most part, non-parallel, with the notable exception of physics, which coincidentally, AMD and NVIDIA are both trying to push to the GPU as well.

The issue is this: GPUs are better at just about any parallel workload than any number of CPU cores will be. As a result, most of that type of work is being offloaded to the GPU. That leaves CPU performance governed by single-core performance rather than scaling. Hence why the i3-2320 generally matches the FX-6300 in the overwhelming majority of games.

Now, given how games tend to be GPU-limited anyway, that is why I think we are going to see APU + GPU in a system, where the integrated GPU (or whatever we end up calling it) handles things like physics processing, the discrete GPU handles rendering, and the CPU handles everything else.

It is not only easier, it is also more efficient. AMD could implement HSA in a discrete GPU, but there is no way them can implement hUMA.

You could, though things start to get messy, given you'd have a CPU, APU, and discrete GPU all accessing the same data. How the discrete GPU works is another issue: do you get rid of VRAM (which would mean system memory bandwidth needs a massive boost), or keep it as essentially the GPU's L1/L2 cache? Lots of things to consider, but it's doable. Not sure it's worth it to move the discrete GPU into a hUMA framework, but it's doable.
 

Cazalan



That would be a significant advantage going forward for all APUs. Unfortunately they would need another socket, FM3 perhaps, to handle the extra memory signals.
 