News AMD CTO Mark Papermaster: More Cores Coming in the 'Era of a Slowed Moore's Law'

The only issue I have is software takes a very long time to catch up. Core utilization in the mainstream consumer market is not going to scale as fast as HPC markets do.

I still don't see an advantage to 16 cores for the vast majority, and very little advantage for enthusiasts outside of gamers who also want to stream.

What's needed is a major boost to IPC until software - right down at the core, like the OS - can actually utilize multiple cores efficiently enough.
 
The only issue I have is software takes a very long time to catch up. Core utilization in the mainstream consumer market is not going to scale as fast as HPC markets do.
You need a decent amount of data to process per core for it to make a difference, and that will just never happen; OS operations do not operate on a lot of data.
Anything that needs a lot of calculations is already being pushed to the iGPU, AVX and so on.
 

JayNor

Honorable
May 31, 2019
Both Norrod at AMD and Keller at Intel have stated that 3D manufacturing will be required for continued performance gains. Papermaster did not offer an opinion on that topic ...
 

InvalidError

Titan
Moderator
Anything that needs a lot of calculations is already being pushed to the iGPU, AVX and so on.
AVX runs on the CPU, so "pushing stuff to AVX" is synonymous with beefier CPU cores and more of 'em.
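To make that concrete, here's a tiny illustrative sketch (not from the post above; just the standard intrinsics): "pushing work to AVX" is still the CPU core executing instructions, only wider ones.

```cpp
// Illustrative only: AVX is just wider CPU instructions, e.g. via intrinsics.
// Build with something like: g++ -O2 -mavx avx_add.cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);      // load 8 floats into one 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  // one instruction adds all 8 lanes
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%g ", x);
    std::printf("\n");
}
```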

The big problem with more cores is that tons of everyday algorithms don't lend themselves to being split into multiple threads cleanly or efficiently and those will always favor CPUs with high single-thread performance (ex.: core control logic for games) regardless of how much more AVX stuff (like AI and physics) gets tacked on top of it.

We're almost back to the MMX days where Intel thought it could make GPUs obsolete with ISA extensions. At some point, extending AVX any further will become detrimental to performance in what general-purpose CPUs are meant to excel at (general-purpose stuff) and bulk math will have to be delegated to accelerator hardware like iGPGPU.
 
Aug 29, 2019
I think it is a major mistake for AMD not to support Intel's OneAPI. This could be the most significant development change in computers in at least a decade: a single development API across CPU/GPU and across multiple levels of demand.

I would hope one day this could lead to modular platforms - if you need more performance, instead of replacing the computer, you add a new module. I am also impressed with the board with dual CPUs and six GPUs on it.

Just adding more cores to the CPU is not a solution but a band-aid.

The following is a very interesting read on the future of what is coming and will be standard for the next decade.

 

Dijky

Prominent
Dec 6, 2019
The only issue I have is software takes a very long time to catch up. Core utilization in the mainstream consumer market is not going to scale as fast as HPC markets do.

I still don't see an advantage to 16 cores for the vast majority, and very little advantage for enthusiasts outside of gamers who also want to stream.

What's needed is a major boost to IPC until software - right down at the core, like the OS - can actually utilize multiple cores efficiently enough.
I think the mainstream market is actually doing pretty well within its technical limitations (Amdahl's Law).
Operating systems are actually using multiple threads, in that most kernels do handle interrupts and system calls on every core/hardware thread without a global lock (for most operations), and auxiliary services are split into multiple processes that spread out to all available cores.
But an OS is just an enabler for the actually desired software to run on, and most mainstream OSes do that just fine.

On the gaming front, I think we are seeing a pretty fast adoption of higher core counts. So much so, that the common recommendation for a minimum decent gaming system has gone from "quad core i5 with no HT is fine" to "at least 6c/6t but take 6c/12t if you can spare the cash" in just two years.

On the desktop front, we've seen all major browsers adopting multithreaded DOM rendering and Javascript execution, and offloaded auxiliary tasks (IO etc.) in recent years (usually with no more than one thread per page because that's good enough for now).
Besides that, most mainstream workloads barely need one core, so there's really no reason to scale out.

The ecosystem will work with what it gets. Moaning about the absence of 16-core optimization not even half a year after the first 16-core mainstream CPU ever launched is not fair.

The big problem with more cores is that tons of everyday algorithms don't lend themselves to being split into multiple threads cleanly or efficiently and those will always favor CPUs with high single-thread performance (ex.: core control logic for games) regardless of how much more AVX stuff (like AI and physics) gets tacked on top of it.
Tons of everyday workloads actually can be multithreaded, because most of them are made up of more than just one instance of one sequential algorithm.
Even a single hardware thread is exploiting opportunities for parallelization in "singlethreaded" code. We see this in vector instructions, instr. reordering, pipelining and superscalarity.

The biggest hurdles I see there are tools to make multithreaded development easily accessible, and efficient communication between threads.

The former is being addressed over time.
For example, Go has introduced an easy, race-free approach to multithreading with goroutines.
C++ has taken until 2011 to finally provide a standardized threading library and has just laid the groundwork for "easy" multithreading with execution policies in C++17, with more on the way for C++20 and 23.
Unity's relatively new ECS and Job System provide the foundation to reorganize monolithic code into discrete jobs that can be executed by multiple threads.
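For a sense of what the C++11 baseline already buys you, here is a minimal sketch (mine, not from the post) that splits a reduction across two tasks with nothing but the standard library:

```cpp
// Minimal C++11 sketch: portable concurrency without pthreads/Win32 code.
#include <future>
#include <numeric>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> data(1000000, 1);
    auto middle = data.begin() + data.size() / 2;

    // Sum the front half on another thread while this thread sums the back half.
    auto front = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), middle, 0LL);
    });
    long long back = std::accumulate(middle, data.end(), 0LL);

    std::cout << "total = " << front.get() + back << "\n";
}
```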

The latter is something to keep in mind while developing. Communication between cores is relatively slow (in terms of CPU cycles), so there is a tradeoff between execution resources (cores) and communication overhead.
With more and more cores that require more complex on-chip networks and have longer wires, this is only going to get worse.

And then of course, there is necessity. Nothing gets done at scale unless it's beneficial in some way. A lot of everyday workloads are too simple to even bother with multithreading.
Now that even Intel embraces core count again because they are hitting a brick wall, we'll see more investment into multithreading because there is just no other way.
And with availability, we'll see more adoption as well in industries that just target whatever is there (like gaming).

I think it is a major mistake for AMD not to support Intel's OneAPI. This could be the most significant development change in computers in at least a decade: a single development API across CPU/GPU and across multiple levels of demand.

Papermaster didn't specifically deny oneAPI support in the future. He just said that AMD is already well into this topic and they are doing basically the same with their ROCm work to assure the readers that AMD is no stranger to heterogeneous compute.
But oneAPI is a competitor's initiative, so AMD will want to see whether it takes off and then jump on the bandwagon if necessary like they did with CXL.
oneAPI is a high-level programming model and API specification, agnostic to the underlying implementation stack, so it can most likely be bolted on top of the ROCm stack (in parallel to or on top of HIP).
The biggest problem with ROCm is that the entire stack is atrociously immature. For example, the entire HCC path was just dropped, Navi and APUs are still not supported, a SYCL wrapper is only provided by a third-party developer, and don't even think about Windows support.
It often feels like there is maybe one developer at AMD working on it part-time.
 

TJ Hooker

Titan
Ambassador
I think it is a major mistake for AMD not to support Intel's OneAPI. This could be the most significant development change in computers in at least a decade: a single development API across CPU/GPU and across multiple levels of demand.

I would hope one day this could lead to modular platforms - if you need more performance, instead of replacing the computer, you add a new module. I am also impressed with the board with dual CPUs and six GPUs on it.

Just adding more cores to the CPU is not a solution but a band-aid.

The following is a very interesting read on the future of what is coming and will be standard for the next decade.

I'm curious why you consider OneAPI to be potentially the "most significant development change in computers in at least a decade", and yet seemingly don't think that OpenMP (which has been out for a year and seems to target the same use cases, and which AMD is a member of) is worth mentioning?
 

TJ Hooker

Titan
Ambassador
Tons of everyday workloads actually can be multithreaded, because most of them are made up of more than just one instance of one sequential algorithm.
Even a single hardware thread is exploiting opportunities for parallelization in "singlethreaded" code. We see this in vector instructions, instr. reordering, pipelining and superscalarity.

The biggest hurdles I see there are tools to make multithreaded development easily accessible, and efficient communication between threads.
Sure, but there are practical limits to instruction level parallelism. You need increasingly complicated branch prediction and scheduling, increasingly wide execution backends, etc. all of which consume die space and power. And of course the deeper you go the larger the penalty for a branch misprediction. You reach a point of diminishing returns, and I don't know how improved coding resources can change that.
 
I think the mainstream market is actually doing pretty well within its technical limitations (Amdahl's Law).
Operating systems are actually using multiple threads, in that most kernels do handle interrupts and system calls on every core/hardware thread without a global lock (for most operations), and auxiliary services are split into multiple processes that spread out to all available cores.
But an OS is just an enabler for the actually desired software to run on, and most mainstream OSes do that just fine.

On the gaming front, I think we are seeing a pretty fast adoption of higher core counts. So much so, that the common recommendation for a minimum decent gaming system has gone from "quad core i5 with no HT is fine" to "at least 6c/6t but take 6c/12t if you can spare the cash" in just two years.

On the desktop front, we've seen all major browsers adopting multithreaded DOM rendering and Javascript execution, and offloaded auxiliary tasks (IO etc.) in recent years (usually with no more than one thread per page because that's good enough for now).
Besides that, most mainstream workloads barely need one core, so there's really no reason to scale out.

The ecosystem will work with what it gets. Moaning about the absence of 16-core optimization not even half a year after the first 16-core mainstream CPU ever launched is not fair.


Tons of everyday workloads actually can be multithreaded, because most of them are made up of more than just one instance of one sequential algorithm.
Even a single hardware thread is exploiting opportunities for parallelization in "singlethreaded" code. We see this in vector instructions, instr. reordering, pipelining and superscalarity.

The biggest hurdles I see there are tools to make multithreaded development easily accessible, and efficient communication between threads.

The former is being addressed over time.
For example, Go has introduced an easy, race-free approach to multithreading with goroutines.
C++ has taken until 2011 to finally provide a standardized threading library and has just laid the groundwork for "easy" multithreading with execution policies in C++17, with more on the way for C++20 and 23.
Unity's relatively new ECS and Job System provide the foundation to reorganize monolithic code into discrete jobs that can be executed by multiple threads.

The latter is something to keep in mind while developing. Communication between cores is relatively slow (in terms of CPU cycles), so there is a tradeoff between execution resources (cores) and communication overhead.
With more and more cores that require more complex on-chip networks and have longer wires, this is only going to get worse.

And then of course, there is necessity. Nothing gets done at scale unless it's beneficial in some way. A lot of everyday workloads are too simple to even bother with multithreading.
Now that even Intel embraces core count again because they are hitting a brick wall, we'll see more investment into multithreading because there is just no other way.
And with availability, we'll see more adoption as well in industries that just target whatever is there (like gaming).

Yes, an OS knows how to handle threads and assign them to other cores as needed. But in terms of using 8 cores for a single program? That's not as heavily done. The vast majority do not need more than 8 or 4 cores. Hell, even dual cores are enough for the largest slice of people.
 
Aug 29, 2019
I'm curious why you consider OneAPI to be potentially the "most significant development change in computers in at least a decade", and yet seemingly don't think that OpenMP (which has been out for a year and seems to target the same use cases, and which AMD is a member of) is worth mentioning?

It's just my opinion, but I think the goal of hardware abstraction would be significant in today's world, especially when you consider it works across both CPUs and GPUs and on multiple levels of performance - between notebooks and supercomputers.

From what it looks like, OpenMP is at a lower level than OneAPI, and actually could and probably should be used to implement the parallel tasking in a OneAPI implementation. Intel, of course, is also a member of OpenMP.

OpenMP appears to me to be more related to compiler technology for parallel programming, while OneAPI is a system API for future access to multiple architectures.

Time will tell where it actually goes, but it would be nice for software to take advantage of new hardware with minimal change.
 

InvalidError

Titan
Moderator
Even a single hardware thread is exploiting opportunities for parallelization in "singlethreaded" code. We see this in vector instructions, instr. reordering, pipelining and superscalarity.
Pipelining, superscaling, out-of-order, speculative and SIMD execution aren't multi-threading; they are opportunistic techniques to extract instruction-level parallelism from a single instruction flow to improve single-thread performance and make better use of available execution resources.

The biggest hurdles I see there are tools to make multithreaded development easily accessible, and efficient communication between threads.
For the most part, easy and efficient tend to be mutually exclusive as being efficient typically requires extreme care and meticulous planning. Any hobo can write multi-threaded code that will get stuck on IPC sync 90% of the time and never achieve any meaningful performance scaling. Only a highly skilled developer will get that down to the sub-0.1% time wasted on inter-process communications required to achieve scaling to thousands of cores. For that reason, most multi-threaded gains will be limited to stuff that is made available as neatly packaged and engineered libraries like Intel's Math Kernel Library. However, most everyday software contains heaps of proprietary code that will never get anywhere near that degree of care, which means that most software's multi-core scaling will ultimately be limited by how much of total execution time will be spent in heavily threaded libraries vs proprietary code.
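As a toy illustration of that gap (my own sketch, not InvalidError's code): the first loop below serializes every iteration on one lock, while the second keeps threads independent and merges once at the end.

```cpp
// Toy sketch: identical work, very different scaling behavior.
#include <algorithm>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const long long per_thread = 5000000;

    // Naive version: every increment fights over one mutex, so threads mostly wait.
    long long naive_sum = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&] {
            for (long long i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++naive_sum;
            }
        });
    for (auto& th : pool) th.join();
    pool.clear();

    // Careful version: each thread works on private state, one merge at the end.
    std::vector<long long> partial(nthreads, 0);
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            long long local = 0;
            for (long long i = 0; i < per_thread; ++i) ++local;
            partial[t] = local;
        });
    for (auto& th : pool) th.join();
    long long good_sum = 0;
    for (long long p : partial) good_sum += p;

    std::cout << naive_sum << " " << good_sum << "\n";
}
```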

Sure, you can do "easy multi-threading" by doing rough things like giving each browser, text editor, spreadsheet, etc. tab its own thread, but you could already do that before by simply spawning completely independent browser/whatever instances anyway, albeit at the expense of extra RAM, so anything that trivial (running a bunch of fundamentally independent processes) does not really count.
 

Chung Leong

Reputable
Dec 6, 2019
I think it is a major mistake for AMD not to support Intel's OneAPI. This could be the most significant development change in computers in at least a decade: a single development API across CPU/GPU and across multiple levels of demand.

The major mistake was not creating something like OneAPI. Heterogeneous computing was the point of the ATI acquisition. AMD didn't buy the company in order to get into the game console business.
 

Dijky

Prominent
Dec 6, 2019
Sure, but there are practical limits to instruction level parallelism. You need increasingly complicated branch prediction and scheduling, increasingly wide execution backends, etc. all of which consume die space and power. And of course the deeper you go the larger the penalty for a branch misprediction. You reach a point of diminishing returns, and I don't know how improved coding resources can change that.
Of course there are.
But the great lengths to which CPU designers have gone to exploit ILP are a testament to how much parallelism there is in singlethreaded code.
Pipelining, superscaling, out-of-order, speculative and SIMD execution aren't multi-threading; they are opportunistic techniques to extract instruction-level parallelism from a single instruction flow to improve single-thread performance and make better use of available execution resources.
These techniques exploit the opportunity to do stuff in parallel that doesn't need to run sequentially.
They do it at a smaller scale than multithreading, and my mention of it was really just a tangent to illustrate that not everything that seems inherently sequential actually needs to be.
For the most part, easy and efficient tend to be mutually exclusive as being efficient typically requires extreme care and meticulous planning. Any hobo can write multi-threaded code that will get stuck on IPC sync 90% of the time and never achieve any meaningful performance scaling. Only a highly skilled developer will get that down to the sub-0.1% time wasted on inter-process communications required to achieve scaling to thousands of cores. For that reason, most multi-threaded gains will be limited to stuff that is made available as neatly packaged and engineered libraries like Intel's Math Kernel Library. However, most everyday software contains heaps of proprietary code that will never get anywhere near that degree of care, which means that most software's multi-core scaling will ultimately be limited by how much of total execution time will be spent in heavily threaded libraries vs proprietary code.
There are plenty of opportunities to get a benefit from multithreading without meticulous care. It won't be the best possible result, but often still better than singlethreading.
If the difference between a singlethreaded, sequential std::for_each and a parallel std::for_each amounts to a single function parameter, one might be more inclined to just try it and benchmark it than if the entire supporting code for a manual (and platform-dependent before C++11) implementation needs to be written first.
Then, there is nuance. Taking a slight bit of care to not spawn a thousand threads for a parallelized oneliner is still easier than handrolling synchronization correctly.
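For example, under C++17 the difference really is one argument (a sketch under the assumption of a compiler and standard library with execution-policy support; with libstdc++ you typically also link against TBB):

```cpp
// C++17 sketch: the only difference between the two calls is the execution policy.
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v(10000000, 2.0);

    // Sequential version.
    std::for_each(v.begin(), v.end(), [](double& x) { x = std::sqrt(x) * x; });

    // Parallel version: same call plus a policy; the library splits the range
    // across worker threads for you.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](double& x) { x = std::sqrt(x) * x; });
}
```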

And even if the benefits are only reaped in third-party libraries, that's still better than nothing. And on that front, having a standardized threading library is helpful because it's one less added dependency to manage.
Sure, you can do "easy multi-threading" by doing rough things like giving each browser, text editor, spreadsheet, etc. tab its own thread, but you could already do that before by simply spawning completely independent browser/whatever instances anyway, albeit at the expense of extra RAM, so anything that trivial (running a bunch of fundamentally independent processes) does not really count.
Well sure you could, but most people don't.
And transforming e.g. Firefox to a multi-process model was quite a lot of work and involved deprecating the old extensions/add-on system because it was too much of a burden to migrate.
There is actually a decent amount of shared state between tabs/windows that had to be reorganized to enable content processes.
Keeping a low memory footprint was actually a design goal as well, so that required some engineering work.

Yes, an OS knows how to handle threads and assign them to other cores as needed. But in terms of using 8 cores for a single program? That's not as heavily done. The vast majority do not need more than 8 or 4 cores. Hell, even dual cores are enough for the largest slice of people.
Well, these users don't need more performance, so they are irrelevant in a discussion about more performance.
 

Arbie

Distinguished
Oct 8, 2007
Paul Alcorn: To "beg the question" does NOT mean "raises the question" - almost the opposite! Instead of damaging our language further please use the latter phrase, which everyone understands, rather than one you don't.
 

TJ Hooker

Titan
Ambassador
Paul Alcorn: To "beg the question" does NOT mean "raises the question" - almost the opposite! Instead of damaging our language further please use the latter phrase, which everyone understands, rather than one you don't.
Err...
"If a statement or situation begs the question, it causes you to ask a particular question"

"Begging the question means 'to elicit a specific question as a reaction or response'"

https://dictionary.cambridge.org/amp/english/beg-the-question
 

InvalidError

Titan
Moderator
If the difference between a singlethreaded, sequential std::for_each and a parallel std::for_each amounts to a single function parameter, one might be more inclined to just try it and benchmark it than if the entire supporting code for a manual (and platform-dependent before C++11) implementation needs to be written first.
Your handling function is still going to be proprietary code, and unless the data is either entirely self-contained or can be considered constant at least for the duration of the multi-threaded portion, you will still need to write thread-aware code with many of the same pitfalls. Attempts at implementing language-level automatic threading haven't been stuck at a snail's pace for 10+ years for no reason.
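A small sketch of the kind of pitfall he means (mine, not his): the parallel policy does nothing to make the handler itself thread-safe.

```cpp
// C++17 sketch: the execution policy doesn't protect shared state in your lambda.
#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>

int main() {
    std::vector<int> v(1000000, 1);

    long long racy_sum = 0;
    // BUG: data race - many threads update racy_sum with no synchronization.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&](int x) { racy_sum += x; });

    // One correct option: an atomic accumulator (better still: std::reduce with
    // a parallel policy, which avoids shared mutable state entirely).
    std::atomic<long long> safe_sum{0};
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&](int x) { safe_sum += x; });
}
```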

Also, having "some" low-quality threading is not necessarily better than having no threading whatsoever since improper or unnecessary use of threads means that many more things that can possibly go wrong and cause random lock-ups. General wisdom is to use the fewest threads necessary to accomplish a goal, give or take a few if it may contribute to neatness such as following power-of-two alignment.
 

bit_user

Titan
Ambassador
At some point, extending AVX any further will become detrimental to performance in what general-purpose CPUs are meant to excel at (general-purpose stuff) and bulk math will have to be delegated to accelerator hardware like iGPGPU.
I think AVX-512 is already a step too far. Just look at how badly segmented it is, right out of the gate:


By my count, that's 19 different features! There's no CPU that supports all of them, or probably even the majority of them!
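You can see the fragmentation from the software side, too: code generally has to probe the subsets one by one at runtime. A rough GCC/Clang-specific sketch (feature strings as documented for __builtin_cpu_supports):

```cpp
// GCC/Clang sketch: AVX-512 is a family of separate feature flags, not one switch.
#include <cstdio>

int main() {
    std::printf("avx512f  %s\n", __builtin_cpu_supports("avx512f")  ? "yes" : "no");
    std::printf("avx512cd %s\n", __builtin_cpu_supports("avx512cd") ? "yes" : "no");
    std::printf("avx512bw %s\n", __builtin_cpu_supports("avx512bw") ? "yes" : "no");
    std::printf("avx512dq %s\n", __builtin_cpu_supports("avx512dq") ? "yes" : "no");
    std::printf("avx512vl %s\n", __builtin_cpu_supports("avx512vl") ? "yes" : "no");
}
```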

Worse, AVX-512 can throttle CPU clocks so badly that it can have a net-detrimental impact on code which uses it in only a few hot spots. For instance:

By using ChaCha20-Poly1305 over AES-128-GCM, the server that uses OpenSSL serves 10% fewer requests per second. And that is a huge number! It is equivalent to giving up on two cores, for nothing.
the AVX-512 workload consumes only 2.5% of the CPU time when all the requests are ChaCha20-Poly1305, and less then 0.3% when doing ChaCha20-Poly1305 for 10% of the requests. Irregardless the CPU throttles down, because that what it does when it sees AVX-512 running on all cores.

Even if we assume this clock throttling behavior is substantially improved in future generations, there's a more troubling hazard that's intrinsic to the ISA:

the processor is keeping track of whether the most significant bits of the vector registers are “clean” (initialized to zero) or dirty (potentially containing data). When the registers are “clean”, then the processor can just treat the 128-bit registers as genuine 128-bit registers. However, if the most significant bits potentially contain data, then the processor actually has to treat them as 512-bit registers. And thus we get a significant (10%) slowdown of the program even if we are not, ourselves, using AVX-512 instructions or registers.

And that's not only a slowdown, but surely more power utilization (which matters a lot to cloud operators), as the full 512-bit register contents are being copied around for every floating point operation!

I doubt AMD will go beyond AVX2.
 

bit_user

Titan
Ambassador
I think it is a major mistake for AMD not to support Intel's OneAPI. This could be the most significant development change in computers in at least a decade: a single development API across CPU/GPU and across multiple levels of demand.
That describes OpenCL, which AMD has been backing since its inception, more than 10 years ago.

In fact, oneAPI is built on OpenCL, via SYCL, which is a wrapper layer around it. As was pointed out above, potentially Intel's Data Parallel C++ could be used atop AMD's backend.

However, I doubt the oneAPI toolkit libraries will work on AMD hardware, as they probably have various dependencies on the Level 0 API.

(image: oneAPI software stack diagram)

That said, an independent developer is working to provide support for oneAPI on Nvidia GPUs. So, maybe full AMD support would be possible.


The major mistake was not creating something like OneAPI. Heterogeneous computing was the point of the ATI acquisition.
ATI/AMD originally had a proprietary GPU programming language, which they ditched in favor of OpenCL. Later, they pushed HSA, which tried to move a step beyond where OpenCL could readily go, but I guess it died from lack of customer interest.

AMD also worked with Microsoft on C++ AMP (probably similar to Intel's Data Parallel C++) and to support DirectCompute.

Lately, they've been offering a translation layer from Nvidia's CUDA, called HIP. It's not binary compatibility though.
 

bit_user

Titan
Ambassador
The biggest hurdles I see there are tools to make multithreaded development easily accessible, and efficient communication between threads.

The former is being addressed over time.
For example, Go has introduced an easy, race-free approach to multithreading with goroutines.
C++ has taken until 2011 to finally provide a standardized threading library and has just laid the groundwork for "easy" multithreading with execution policies in C++17, with more on the way for C++20 and 23.
Oh, C++ developers had plenty of options for multithreading before 2011. It just wasn't included in the standard library, and the language took no formal stance on memory consistency.

One of the best frameworks for multithreading, TBB (Threading Building Blocks), was originally developed by Intel (it's now open source), at least as far back as 2006 - basically when they started making a big push for dual-core and even quad-core CPUs.
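For flavor, the classic TBB loop looks roughly like this (a from-memory sketch; newer oneTBB releases move the headers under oneapi/tbb/ but keep the same idea):

```cpp
// Classic TBB sketch: the scheduler chops the range up across its worker threads.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

int main() {
    std::vector<float> v(1000000, 1.0f);

    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, v.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              v[i] *= 2.0f;  // each chunk handled by some worker thread
                      });
}
```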



Communication between cores is relatively slow (in terms of CPU cycles), so there is a tradeoff between execution resources (cores) and communication overhead.
Communication is actually fast - it's the synchronization that's typically the slow part. You have to be careful about when and how you synchronize, or you could quickly lose most of the benefits of multithreading. Also, data sharing needs to be done with some care to keep the CPU cache hierarchy working efficiently.
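A classic example of the cache-hierarchy caveat is false sharing (a quick sketch of mine): two counters that are logically independent but share a cache line will ping-pong that line between cores.

```cpp
// Sketch: false sharing - independent data, same cache line, cores fight over it.
#include <atomic>
#include <thread>

struct Packed {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // very likely in the same 64-byte line as 'a'
};

struct Padded {
    alignas(64) std::atomic<long> a{0};  // each counter gets its own cache line
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}

int main() {
    Packed packed;   // typically measurably slower to hammer...
    Padded padded;   // ...than this, even though the work is identical
    hammer(packed);
    hammer(padded);
}
```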

With more and more cores that require more complex on-chip networks and have longer wires, this is only going to get worse.
It's more the switching and routing that gets to be a challenge.
 

bit_user

Titan
Ambassador
I'm curious why you consider OneAPI to be potentially the "most significant development change in computers in at least a decade", and yet seemingly don't think that OpenMP (which has been out for a year and seems to target the same use cases, and which AMD is a member of) is worth mentioning?
Um, OpenMP dates back all the way to 1997. I guess you're probably remembering the announcement of version 5.0, which was released about a year ago.


Anyway, I'm not a big fan of OpenMP. It's only well-suited to a certain class of problems, but it's probably the easiest way to harness multi-core or GPU compute power, if it fits your use case.

In my opinion, OpenCL is much more interesting. Of course, I'm more of a nuts & bolts guy, so I'm a bit biased towards the framework that gives me the most control (and yet is still generic and portable).

 

bit_user

Titan
Ambassador
Yes, an OS knows how to handle threads and assign them to other cores as needed. But in terms of using 8 cores for a single program? That's not as heavily done.
A lot of frameworks and libraries spin up one software thread per hardware thread, and dispatch work to one or more queues that are serviced by those "worker" threads. So, it doesn't really depend on the developer explicitly deciding how many threads to spin up.

See: https://en.wikipedia.org/wiki/Work_stealing
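A bare-bones sketch of that worker-pool pattern (my own simplification; real schedulers like TBB use per-thread deques and work stealing rather than one shared queue):

```cpp
// Minimal sketch: one worker per hardware thread, all fed from a shared task queue.
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Pool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

public:
    Pool() {
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return done || !tasks.empty(); });
                        if (done && tasks.empty()) return;
                        job = std::move(tasks.front());
                        tasks.pop();
                    }
                    job();  // run the task outside the lock
                }
            });
    }

    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(job)); }
        cv.notify_one();
    }

    ~Pool() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();  // workers drain the queue, then exit
    }
};

int main() {
    Pool pool;
    for (int i = 0; i < 16; ++i)
        pool.submit([i] { std::cout << "task " << i << "\n"; });
}
```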

The vast majority do not need more than 8 or 4 cores. Hell, even dual cores are enough for the largest slice of people.
This is just a function of what you're trying to run.

In the case of game developers, they are going to tune the game to target the bulk of the market. However, increasing certain settings or using a really fast GPU might increase the CPU workload enough to load up more cores. If you use a slow GPU, then it's probably going to be the bottleneck, thus reducing the demand on your CPU. So, maybe the majority don't need a beefy CPU, because they're using it with a weak GPU (or use the iGPU).
 

bit_user

Titan
Ambassador
For the most part, easy and efficient tend to be mutually exclusive as being efficient typically requires extreme care and meticulous planning. Any hobo can write multi-threaded code that will get stuck on IPC sync 90% of the time and never achieve any meaningful performance scaling. Only a highly skilled developer will get that down to the sub-0.1% time wasted on inter-process communications required to achieve scaling to thousands of cores. For that reason, most multi-threaded gains will be limited to stuff that is made available as neatly packaged and engineered libraries like Intel's Math Kernel Library.
That's been the status quo, but @Dijky is right that the industry appreciates the need to make it both easier and safer for developers to efficiently expose the concurrency in programs.
 

bit_user

Titan
Ambassador
@PaulAlcorn: To "beg the question" does NOT mean "raises the question" - almost the opposite!
Yup.


As @Arbie suggested, instead of "begs the question", just say "raises the question". It's a trivial substitution and mollifies your more punctilious readers.

Instead of damaging our language further
Related:



Even though I'm surprised there ever was an "Apostrophe Protection Society", I can't help but feel a little disheartened by their surrender. Of course, there's a note of hypocrisy in that, as I've long ago resigned myself to the reality of people misusing the above phrase.