AMD CPU speculation... and expert conjecture

Page 272
Status
Not open for further replies.

hcl123

Honorable
Mar 18, 2013
425
0
10,780
Also, the XBone (Durango) is only half hUMA, with half the coherency set of the PS4; more precisely, it is I/O coherent

http://www.vgleaks.com/durango-memory-system-overview/

System coherency

There are two types of coherency in the Durango memory system:

Fully hardware coherent
I/O coherent

The two CPU modules are fully coherent. The term fully coherent means that the CPUs do not need to explicitly flush in order for the latest copy of modified data to be available (except when using Write Combined access).

The rest of the Durango infrastructure (the GPU, and I/O devices such as Audio and the Kinect Sensor) is I/O coherent. The term I/O coherent means that those clients can access data in the CPU caches, but that their own caches cannot be probed.

When the CPU produces data, other system clients can choose to consume that data without any extra synchronization work from the CPU.

The total coherent bandwidth through the north bridge is limited to about 30 GB/s.

Yet... I get the feeling that none of them, not even the PS4 with its Garlic and Onion buses for GPU memory operations, implements the full set of hUMA possibilities!..
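For illustration only, the two coherency types in that excerpt can be sketched in a few lines of Python (a toy model I'm making up, not the actual Durango protocol): a fully coherent client's cache can be probed by other clients, while an I/O coherent client can probe the CPU caches but cannot itself be probed, so its dirty data stays invisible until it flushes.

```python
class Cache:
    """Toy write-back cache; 'probeable' = other clients may snoop its dirty lines."""
    def __init__(self, probeable):
        self.probeable = probeable
        self.lines = {}                      # addr -> dirty value

    def write(self, addr, val):
        self.lines[addr] = val               # dirty, not yet in DRAM

    def flush(self, dram):
        dram.update(self.lines)
        self.lines.clear()

def read(addr, own, others, dram):
    """A read checks its own cache, probes any probeable peer cache, then DRAM."""
    if addr in own.lines:
        return own.lines[addr]
    for cache in others:
        if cache.probeable and addr in cache.lines:
            return cache.lines[addr]         # coherency probe hit
    return dram.get(addr, 0)

dram = {}
cpu = Cache(probeable=True)        # fully coherent: probeable by everyone
gpu = Cache(probeable=False)       # I/O coherent: can probe, can't be probed

cpu.write(0x100, 42)
print(read(0x100, gpu, [cpu], dram))   # 42: GPU sees CPU data with no flush

gpu.write(0x200, 7)
print(read(0x200, cpu, [gpu], dram))   # 0: CPU can't probe the GPU cache
gpu.flush(dram)
print(read(0x200, cpu, [gpu], dram))   # 7: visible only after an explicit flush
```

The asymmetry is the whole point of the vgleaks description: the GPU consumes CPU-produced data with no extra synchronization, but not the other way around.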

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


heck!!.. where does it say anything about size?

OTOH, though not represented in the SoC diagram, the XBone has an IOMMU... it could work together, in seamless CrossFire, with a Tahiti HD 7970 for example (can't hint at the same for the PS4; perhaps the "High Performance Bus & Memory" controller includes it http://www.vgleaks.com/world-exclusive-orbis-unveiled-2/ )... if it had a PCIe link available...

Slide 3 from the top, "GPU and GPU MMU clients", bottom right corner: "Audio DMA (direct memory access) CPU-Cache-Coherent DRAM access via IO MMU"
 

strange... maybe you missed the first slide, so your count is off by one. I meant this one:
http://semiaccurate.com/2013/08/26/xbox-one-details-in-pictures/xbox_one_soc/
main SoC: 363 mm^2
fab: 28 nm TSMC HPM
transistor count: 5 billion
total 47 MB of storage on the chip
etc.

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


I think you are seeing it in reverse: those abstractions and languages don't really need to be rewritten, only the compiler toolchain that uses them has to be able to compile to HSAIL (AMDIL)... and some magic can be included in runtimes, like the Sumatra-enabled JVM, or a DX runtime for example... which could be finalized or not...

What is really needed is "application code" that takes advantage of all these possible different methods... a toolchain to HSA/HSAIL (AMDIL) being the most straightforward of all... and that is exactly "the" major project of the HSA Foundation: a full-blown HSA toolchain and the proper runtimes (it's mostly software, not hardware, definitions).
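A minimal sketch of that idea (all names hypothetical, and HSAIL itself is far richer): the front end lowers source to a portable IR once, and a runtime "finalizer" runs the same IR on whatever device is present, so the source language itself never changes.

```python
def compile_to_ir(expr):
    """Toy front end: lower 'a + b' / 'a * b' source to a portable, ISA-neutral IR."""
    a, op, b = expr.split()
    return [("push", a), ("push", b), ("add" if op == "+" else "mul",)]

def finalize(ir, env):
    """Toy finalizer: a real one would emit vendor GPU ISA; here we just interpret."""
    stack = []
    for ins in ir:
        if ins[0] == "push":
            stack.append(env[ins[1]])        # load a named value
        elif ins[0] == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif ins[0] == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

ir = compile_to_ir("x * y")                 # compiled once, device-independent
print(finalize(ir, {"x": 6, "y": 7}))       # 42, on whatever "device" finalizes it
```

The point of the split is exactly what the post says: only the back end of each toolchain has to target the IR; the languages on top stay as they are.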


 

8350rocks

Distinguished


I don't see a way around them having to be optimized or altered at least partially. Unless the HSAIL middleware is going to interface entirely with the OS and everything else... which is, in my mind, not very likely.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Yes, you are absolutely right; damn Firefox didn't load the first image (must be all the NoScript and Ghostery lol)... after a page reload it appears... meaning you were always right about the slide order. My bad, sorry for that.

 

i tell you, it's that guy's hair! :pt1cable:
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Better... it is going to interface directly with the hardware; what is going to interface entirely with the OS, and it can be any OS, are the runtimes.

Any piece of code compiled into HSAIL is going to need a runtime... see HSA as a low-level JIT paradigm, kind of like functions in a virtual machine, like Java, but this VM is the lowest level possible while maintaining flexibility, opening the possibility of having other VMs on top of it or being integrated into current ones. That is what Sumatra is about: an extension to Java that works with the finalizer part of the HSA definition. Only, none of the actual Java apps will go to the GPGPU if developers don't code them accordingly.

It will have, and be most compatible with, an OpenCL 2 "runtime"... and other languages can also benefit if changed accordingly... there are a lot of possibilities... it will go piecemeal, I think, until all the compiler and runtime definitions are set in stone.

So HSA is a low-level kind of VM that allows many "escapades" directly to hardware (user-space-visible hardware task queues being a pertinent example), and this HSAIL representation is also quite low level. Games are also a VM/JIT paradigm: the same game runs on AMD, Nvidia or Intel unchanged, all of them with quite different GPU ISAs and architectures... and so HSA could go below, or be added to, those "driver" environments. But none of it will give any additional benefit, or next to none, if the code is not tailored to take advantage of it.

That is why I think "Never Settle" has gone "forever"... any HSA game will run on Intel, for example; the missing features of the platform, like hUMA, could be emulated in software (after all, it is a kind of VM environment), only it will run notoriously slower... and that fear is why Broadwell will be "closed"...

Any non-HSA game, or one without HSA features, will run on HSA hardware as it does now, no difference at all; perhaps with some little improvements where the runtime environment can pick up the new features to give it a little boost. "Compute" inside games can benefit for sure... the rest will be pretty much the same with HSA features or without them.

So give us HSA games lol..
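To make the "emulated in software, only slower" point above concrete, here is a toy Python sketch (hypothetical API, invented names): with a 'huma' feature both sides alias one allocation; without it, the runtime falls back to two pools and explicit copies.

```python
def make_alloc(features):
    """Toy allocator: with 'huma' both devices alias one buffer (zero-copy);
    without it, sharing is emulated with two pools and explicit copies."""
    def alloc(n):
        if "huma" in features:
            buf = bytearray(n)
            return buf, buf                  # CPU view and GPU view are the same memory
        return bytearray(n), bytearray(n)    # emulated: separate pools
    return alloc

def sync(cpu_view, gpu_view):
    """Explicit copy step, needed only on the emulated path."""
    if cpu_view is not gpu_view:
        gpu_view[:] = cpu_view

cpu, gpu = make_alloc({"huma"})(4)
cpu[0] = 9
print(gpu[0])        # 9: the GPU view sees the CPU write immediately

cpu2, gpu2 = make_alloc(set())(4)
cpu2[0] = 9
print(gpu2[0])       # 0: stale until the (slower) emulated copy runs
sync(cpu2, gpu2)
print(gpu2[0])       # 9
```

Both paths give the same answer in the end, which is the post's claim: the game still runs on non-HSA hardware, it just pays for the copies.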

 

8350rocks

Distinguished


This is exactly my point: things will have to be coded to use it... you are saying by game developers, not API makers. That's actually a bit more comforting, as game developers are always looking for a leg up to improve performance. Even so, it doesn't solve the problem of having to cater to the lowest common denominator of your intended target audience. That means a great many of the "less demanding" games will not see any benefit from HSA-type improvements, because the majority of 2-3 year old dual-core systems won't benefit at all.

HSA games are at best 2 years out after kaveri launches on any sort of wide scale. By then more GPUs and CPUs/APUs will be better able to accommodate these systems. Now, does that mean that AMD won't have some "optimized" games for HSA? They probably will, most likely they'll be blockbuster console ports to show off the hardware...

Though, what impact that will have right away is limited at best, and may be ambiguous until it becomes widespread. The review sites will start to chime in about it when the first AAA title that supports HSA fully comes about...but when will that be? 1-2 years from now?

I am in the industry, and I can't predict that far into the future. Right now, more threading is where it's going...what impact HSA will have on that, even I cannot say.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780
But what are you complaining about?... after all, both the XBone and the PS4 are hUMA, with plenty of other HSA features...

Yes, it seems confirmed: the XBone is hUMA and has an IOMMU, and so is HSA (and the ESRAM doesn't break anything... juanrga...); quite revealing articles:
http://wccftech.com/amd-huma-playstation-4-graphics-boost-xbox-analysis-huma-architecture/ (funny the back-and-forth with AMD, if true)
http://wccftech.com/microsoft-xbox-one-architecture-detailed-hot-chip-2013-features-5-billion-transistors/

So I'm quite convinced some (not all) of the game titles for these consoles will spill over to the PC world... there is almost a guarantee we will see HSA games (of sorts)... and I'm quite convinced it will be sooner, and in larger numbers, than you anticipate. In 2 years we may be talking ray tracing, if HSA sets a definition that is now entirely lacking even for games (right now it's only about general-purpose accelerators, like the general-purpose GPU).

Boy!.. now it starts to stink of something rotten (dead) over on Nvidia's side; the best they could do now is join the HSA Foundation, and they will, I'm convinced, when the losses start to become greater than the fear of Intel, or the hatred for AMD lol

[update: if MSFT can make VS HSA-aware, nothing prevents Nvidia from making CUDA HSA-aware]
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790

That article doesn't change anything of what I said, but confirms the latency issues for HSA dGPUs that I mentioned.

As a final comment, note that they label the slide that you mention as "AMD Kaveri hUMA benefits".
 

Speculating: doesn't Nvidia get easy access to HSA and hUMA, since ARM is a member and Nvidia happens to have an ARM license (I dunno if it's an architecture license; assuming that's the one they have)? I think Intel is the only one really left out of HSA. However, I suspect they might have implemented something like HSA (but proprietary, as usual) in their Xeon + Xeon Phi combo.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


There is no mention of hUMA in the HSA specification, although hUMA is compatible with the HSA memory model.

To get a fully HSA APU you need both an HSA CPU and an HSA GPU. ARM is enabling HSA in both. Nvidia could only get access to future HSA ARM CPUs, because Nvidia uses its own CUDA/OpenCL GPUs in Tegra. Moreover, Nvidia is moving to its own ARM CPU design (non-HSA) with Tegra 5.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


"" * Probe filters and directories (coherency directories) will maintain power efficiency ""

Latency, yes (to which the GPU is kind of immune)... but how does any of the rest of what you say follow? LOL

(Impossible discrete with hUMA, impossible probe filters and coherency directories, according to what you say... and it's not me saying it, it's not personal; it's all those "official slides" and the other analyses.)
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Yes, but ARM licenses multiple pieces of "hardware" IP, not software. And besides, it may be prepared for HSA, but with Mali, not with Nvidia GPUs... those will have to obey the rules of HSA or face extensive license litigation. It's not as if anyone can come along and, just because it is an Open Standard, adopt it, use it, twist it, without saying "excuse me" to anyone.

If that were possible, Intel would adopt it tomorrow and twist it to the bone, exactly and especially to make it incompatible or very difficult for anyone else.

Intel, with the "Xeon + Xeon Phi combo", has a different opinion... make something very similar to HSA, in some respects even easier to program... but use x86 as the enabling factor; that is, for the Intel "HSA", every accelerator must be centered around x86...

LOL and LOL... nice play, only x86 is hardly the best ISA for a CPU, and for an accelerator kind of thing it is worse... which doesn't seem to bother Intel at all; they are positioning themselves to be the sole supplier of accelerators (including GPUs lol), one possible coherent explanation, and that is why the savage push of AVX. One has to love their arrogance... and laugh at it LOL...

More Itanics coming!... lol
 
^^ So... as long as Nvidia follows the HSA rules, they can adopt HSA in their GPUs and SoCs? I ask this because Nvidia is planning to put the Kepler uarch in their future ARM SoCs (disregarding the impact and performance for now).

As for Xeon + Xeon Phi, yeah, basing everything around x86 is the major catch with that combo. Intel's argument is that x86's ubiquity makes it easier for coders. I wonder how this will play out in the future: Intel's 'HSA' vs real HSA. IMHO I don't see it going Intel's way in the long run. But Intel may have a better chance, i.e. a head start, at pushing it into the HPC sector.

I didn't understand the last bit about the AVX push, and what are Itanics?

edit: moar on xbone(R)'s hardware and stuff:
http://semiaccurate.com/2013/08/29/a-deep-dive-into-microsofts-xbox-ones-architecture/
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


It doesn't need to be... hUMA is a kind of ccNUMA, with any CPU-side MMU able to inter-operate with the IOMMU on the GPU/accelerator side. And the IOMMU is an official part of the standard; all anyone has to do is adapt their MMUs and coherency protocols to work with the IOMMU, and since ARM is the only other CPU implementer, that applies only to ARM. For all the other accelerators, the only thing they have to do is put an IOMMU on those accelerators and make the pertinent caches they want probe-able by the coherency protocols... then they will have hUMA.

... that is how easily Nvidia could do hUMA... and HSA in addition, by having user-space hardware task queues... and supporting the official *runtimes*. (Neither of which they will do, as long as CUDA has the hope of leading... dreams and illusions are hard to get over...)
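The "adapt their MMUs to work with the IOMMU" idea amounts to the CPU MMU and the device IOMMU walking the same page tables, so one virtual pointer is valid on both sides. A toy sketch, assuming 4 KB pages (everything here is hypothetical):

```python
PAGE = 4096

class MMU:
    """Toy MMU/IOMMU: translates virtual addresses through a shared page table."""
    def __init__(self, page_table, phys_mem):
        self.pt, self.mem = page_table, phys_mem

    def load(self, vaddr):
        vpage, off = divmod(vaddr, PAGE)
        return self.mem[self.pt[vpage] * PAGE + off]

    def store(self, vaddr, val):
        vpage, off = divmod(vaddr, PAGE)
        self.mem[self.pt[vpage] * PAGE + off] = val

phys = bytearray(4 * PAGE)
page_table = {0: 2}                 # virtual page 0 -> physical page 2, shared by both

cpu_mmu = MMU(page_table, phys)     # CPU-side MMU
iommu = MMU(page_table, phys)       # device-side IOMMU, same tables

cpu_mmu.store(0x10, 0xAB)
print(hex(iommu.load(0x10)))        # 0xab: same pointer, same byte, no copy
```

Because both translators share one mapping, there is no separate "device address space" to copy into; the coherency-protocol side of the post (probe-able caches) is the other half of the picture.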

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


I think the rule is they have to join the HSA Foundation and submit their hardware + software to "certification" testing procedures in order to carry the HSA label. Unless they go ARM + PowerVR, and then all of that will have been done for them; they will only have to license.

There are small "fees" involved in most of this (except for universities, I think)... but no royalties (certification is only once for any combination)... that is why, in any case, with all propriety we can say HSA is an Open Standard (it's open to anyone).

It's not money, it's not the small fees, that drives Nvidia away. It's that adopting the tech, and with it the "official endorsement" of a tech and implementation, is the worst enemy of their CUDA initiative. Besides, they hope to ride the Intel wave to success; if one goes down, the worst falls on the smaller one. And Intel, by closing Broadwell, is helping them look past the consoles being all AMD and HSA, and thus games being HSA...

... in the end I think Nvidia will be forced to join HSA to survive: ARM will be HSA, most pertinent DSPs will be HSA, mobile will be HSA... and games will be HSA... Nvidia, to sell, must forcibly be HSA...

UPDATE:
but, intel may have a better chance i.e. a headstart to push it into hpc sector.

They have already had that head start since Wannabee... err, Larrabee lol... and it's been almost 2 years since the debut of the Xeon Phi kind, and besides themselves no one else has stated interest... none... licensing x86 might have a lot to do with it (if not everything), besides x86 not being the best approach to anything lol.

UPDATE:
i didn't understand the last bit with avx push and what is itanics ?

AVX is exactly intended to be the ISA of the potential "accelerators"; with this Intel intends to skip the *need* (and possibility) for specialized and fixed-function processing elements... but worst of all, technically and commercially, it is a logical extension of x86 (the good part is seamless integration with the CPU, and so it is much easier to program; but all the shortcomings mean it must keep ballooning indefinitely for a long time -> adding more instructions)...

Seems to me like an Itanic (the Itanium IA-64 'Titanic' CPU of Intel)... worse, one that is already leaking before leaving harbor lol
 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780


The second one is more than likely rendering with the Titan, unless the render settings were changed. The FX 8350 is clock-for-clock faster than the 3930K when the FX is on Gentoo and the 3930K is on Windows.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Well, I didn't mention stuff such as coherency. I said something else. I repeat: a discrete GPU breaks uniformity. Discrete introduces two memory pools, and memory access is broken into local and remote. Access to local memory will be faster than access to remote memory due to latency and bandwidth.

Note: the comment about latency refers to the CPU accessing the dGPU's local memory.

There is no way that AMD (or Nvidia or ARM or Intel) can implement Uniform Memory Access with discrete.
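That local-vs-remote penalty can be put into rough numbers with a simple time = latency + size/bandwidth model; the latency and bandwidth figures below are illustrative assumptions, not measurements:

```python
def access_time_us(nbytes, latency_us, gbytes_per_s):
    """Time to move nbytes: fixed latency plus size over bandwidth."""
    return latency_us + nbytes / (gbytes_per_s * 1e3)   # 1 GB/s = 1e3 bytes/us

KB = 1024
local = access_time_us(4 * KB, latency_us=0.1, gbytes_per_s=200)   # on-package pool
remote = access_time_us(4 * KB, latency_us=1.5, gbytes_per_s=16)   # over PCIe 3.0 x16

print(f"local:  {local:.3f} us")
print(f"remote: {remote:.3f} us")   # an order of magnitude worse for small transfers
```

For small transfers the fixed latency dominates, which is why a discrete link breaks uniformity even when its raw bandwidth looks respectable.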
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Charlie is incorrect.

It's not just "The blue arrows above are for coherent memory accesses, yellow for non-coherent traffic"... If he had paid attention, and not been so eager to sell more fish out of the same basket, he would have included the slide "GPU and GPU MMU clients" (http://cdn2.wccftech.com/wp-content/uploads/2013/08/Xbox-One-GPU-Architecture-635x481.png ), as seen here http://wccftech.com/microsoft-xbox-one-architecture-detailed-hot-chip-2013-features-5-billion-transistors/. There, the second yellow/(green) line from the top, for the "Host Guest GPU MMU" (fully exposed in the "SoC Components" slide he analyses, but without the *comments*), says: "DRAM Access Non-CPU-Cache-Coherent 68GB/sec peak (incl coherent BW)", and that link goes exclusively to the DRAM controller with its 68 GB/s peak. Meaning all DRAM (system DRAM) is coherent to the GPU caches at the maximum bandwidth possible; it's the CPU block that is restricted to 30 GB/s (~half)... so Durango is hUMA...

And Durango is not a 1+ year old design. As a matter of fact, compared with the PS4 design that uses Garlic and Onion, as seen here http://www.vgleaks.com/playstation-4-architecture-evolution-over-time/, it seems more advanced than Orbis (I'm not very fond of M$, like Charlie, but skip the phobia)... though Orbis will perform better at games (but probably not by that much)...

As a matter of fact, the design is pretty much identical to the "Onion + Onion+" of Orbis, including that remarkable piece of Northbridge in the "SoC Components" slide (CPU-Cache-Coherent Memory Access) that Charlie analyses, saying: "Microsoft probably had to beef up the L2 to NB links to almost match that of the L1 to L2 links". It was most probably AMD, NOT MSFT (perhaps when it comes to praising AMD, Charlie's tongue just sticks lol), since exactly the same thing is strongly suspected in the "Liverpool" SoC. Just compare http://www.vgleaks.com/world-exclusive-orbis-unveiled-2/ with http://www.vgleaks.com/world-exclusive-durango-unveiled-2/ (a very similar *unified* Northbridge)... which *may* lead one to conclude, going by the particularities, that both SoCs will operate at ~2 GHz for the CPU...

Also on that "SoC Components" slide, the line from the ESRAM at the bottom to the CPU block means that the 32 MB ESRAM is visible from the CPU. Probably, as VGleaks suggested, it is I/O coherent, meaning the CPU can read it but cannot write it in a coherent way (for that there is another link directly to the GPU caches).

So Durango has a kind of large dedicated cache for the GPU (the ESRAM) that is visible from the CPU, and it is hUMA at the total possible DRAM bandwidth... it is more advanced in some ways than Orbis, and could be a good preview of Kaveri.
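Reading those slide numbers together, a back-of-the-envelope sketch (the 68 GB/s and 30 GB/s figures come from the slides quoted above; treating them as a single pool with a coherent cap is my simplifying assumption):

```python
DRAM_PEAK = 68.0      # GB/s, total DRAM bandwidth (slide: "incl coherent BW")
COHERENT_CAP = 30.0   # GB/s, ceiling for coherent traffic through the north bridge

def noncoherent_left(coherent_gb_s):
    """Coherent traffic comes out of the same single DRAM peak, capped at 30 GB/s."""
    used = min(coherent_gb_s, COHERENT_CAP)
    return DRAM_PEAK - used

print(noncoherent_left(10))   # 58.0 GB/s left for non-coherent clients
print(noncoherent_left(40))   # 38.0: coherent demand above 30 GB/s just can't happen
```

Under that reading, the GPU can still reach the full 68 GB/s when traffic is non-coherent, which is the post's argument for calling Durango hUMA despite the CPU-side cap.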

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Perhaps you should get a job at AMD... they and their HSA friends, supposedly experts from top companies, are embarking on an impossibility LOL

http://t1.gstatic.com/images?q=tbn:ANd9GcRvGiOpiosFpbI_dQ0Fhf-2w6RanklsgXec8ZHRWDd0vWt-JvHlSQ

The last thing that is supposed to complete the total specification, foreseen for the middle of next year (mid 2014, http://www.theregister.co.uk/2013/08/25/heterogeneous_system_architecture_deep_dive/?page=3 ), is exactly "Extend to discrete GPU", for this last push of "system integration"... just see that image link, damn it lol... (gee! could it be... HSA dumb, juanrga smart!? LOL)

UPDATE:
And according to some analyses (yes, for one thing I would also like more details) http://mygaming.co.za/news/hardware/53750-amds-plan-for-the-future-huma-fully-detailed.html ... it's not only for the future: hUMA of sorts already works with discrete adapters of the HD 7000 family.

For bonus points, this works with discrete AMD graphics cards from the HD7000 family as well, although there will be some latency issues when accessing GDDR5 memory. It's possible to have the CPU, integrated GPU and the discrete GPU all working on the same thing.

Though it could be, even over PCIe (with coherency emulated), I think none of the driver versions so far really supports this.
 