News AMD CTO Mark Papermaster: More Cores Coming in the 'Era of a Slowed Moore's Law'

bit_user · Dec 7, 2019

TJ Hooker said:
Err...
"If a statement or situation begs the question, it causes you to ask a particular question"

https://dictionary.cambridge.org/amp/english/beg-the-question

You only quoted their first definition.

The second is:

to talk about something as if it were true, even though it may not be

TJ Hooker said:
https://www.merriam-webster.com/words-at-play/beg-the-question

Right at the top, it says:

Begging the question means "to elicit a specific question as a reaction or response," and can often be replaced with "a question that begs to be answered." However, a lesser used and more formal definition is "to ignore a question under the assumption it has already been answered." The phrase itself comes from a translation of an Aristotelian phrase rendered as "beg the question" but meaning "assume the conclusion."

(emphasis added)

With two definitions so very much at odds, I'd think it advisable to simply avoid the phrase.

bit_user · Dec 7, 2019

InvalidError said:
Your handling function is still going to be proprietary code and unless the data is either entirely self-contained or can be considered constant at least for the duration of the multi-threaded portion, you will still need to write thread-aware code with many of the same pitfalls. Attempts at implementing language-level automatic threading haven't been progressing at a snail's pace for 10+ years for no reason.

I think the example given of a parallel-for is not materially different than OpenMP - it's just a different syntax and formally included in the core language. As such, I don't consider it a very interesting or meaningful step towards the sort of intrinsic and ubiquitous concurrency we need. That's not to say there's not real progress, elsewhere.

InvalidError said:
Also, having "some" low-quality threading is not necessarily better than having no threading whatsoever since improper or unnecessary use of threads means that many more things that can possibly go wrong and cause random lock-ups.

Well, lockups and crashes tend to get fixed, however poor scaling tends not to. You might have an inexperienced, overeager programmer tying up 8 cores for only a 2-3x speedup and thinking that's a good idea. Meanwhile, the poor user's CPU fan spins up and the lights dim, yet they have no say in the matter.

The advice I would give is not to use concurrency if you don't need it. If you do, use a widely-accepted language or framework and follow the best practices for that particular technology. Obviously, your example of internally-threaded libraries is valid, but those are usually freebies and not available for the majority of needs.

Dijky · Dec 7, 2019

bit_user said:
Oh, C++ developers had plenty of options for multithreading, before 2011. It just wasn't included in the standard library, and the language took no formal stance on memory consistency.

One of the best frameworks for multithreading, TBB (Thread Building Blocks) was originally developed by Intel (now it's opensource), at least as far back as 2006 - basically when they started making a big push for dual-core and even quad-core CPUs.

https://en.wikipedia.org/wiki/Threading_Building_Blocks

I'm aware of the options that existed, but welcome the standardization nonetheless.
The current gcc/libstdc++ implementation of the Parallelism TS actually uses TBB as the backend, and clang/libc++ doesn't implement it yet.

bit_user said:
Communication is actually fast - it's the synchronization that's typically the slow part. You have to be careful about when and how you synchronize, or you could quickly lose most of the benefits of multithreading. Also, data sharing needs to be done with some care to keep the CPU cache hierarchy working efficiency.

Synchronous communication, I should have said. It takes a relatively long time to make a write visible to a different core. The only way around that is to not do it, or to find other useful things to do to mask the delay.
This is the prime reason why naive parallelization attempts (like indiscriminately using std::atomic concurrently) perform subpar (apart from senselessly spawning tons of threads).

Christopher1 · Dec 7, 2019

jimmysmitty said:
The only issue I have is software takes a very long time to catch up. Core utilization in the mainstream consumer market is not going to scale as fast as HPC markets do.

I still don't see an advantage to 16 cores for the mass majority and very little for enthusiasts outside of gamers who also want to stream.

Apparently you have never tried to run more than one program at once. Yes, for the average person even, more cores = more performance because background tasks can use cores that are under-utilized while allowing programs that are single-thread bound to use the full or near full capabilities of their core.

TJ Hooker · Dec 7, 2019

bit_user said:
You only quoted their first definition.

The second is:

Right at the top, it says:

(emphasis added)

With two definitions so very much at odds, I'd think it advisable to simply avoid the phrase.

My point was that the meaning of the phrase as used in this article is a widely accepted meaning of the phrase. Your own source (wikipedia) agrees:
"In modern vernacular usage, however, begging the question is often used to mean 'raising the question' or 'suggesting the question'"

The fact that there are additional (original) meanings doesn't invalidate that.

As an anecdote, I typically see "begs the question" being used as it is in this article, while "begging the question" seems to be more common when referring to the logical fallacy in debate.

Sure, I agree that it would probably be better to use "raises the question" in this case. But I'm not about to call someone wrong for using "begs the question".

bit_user · Dec 7, 2019

Dijky said:
The current gcc/libstdc++ implementation of the Parallelism TS actually uses TBB as the backend, and clang/libc++ doesn't implement it yet.

TBB does a lot, though I've only used it in limited capacities.

Anyway, if you're only using it as a back end, then you're probably missing out on a lot of its capabilities.

Dijky said:
It takes a relatively long time to make a write visible to a different core.

Not really. Atomic primitives and lock-free datastructures are fast, so long as you're not simultaneously hammering on them from multiple cores.

Synchronization is the slow part.

Dijky said:
The only way around that is to not do it, or to find other useful things to do to mask the delay.

Yeah, lock contention is one of the main issues with code that scales poorly. Though, I rarely ever bother with try-locks.

Dijky said:
This is the prime reason why naive parallelization attempts (like indiscriminately using std::atomic concurrently) perform subpar (apart from senselessly spawning tons of threads).

I guess it's another case where it helps to have done everything yourself. Then, you can look at a modern API, knowing what you want it to do underneath, and work out the best way to use it.

If you don't actually know what it's doing , and don't follow best practices when using it, then it can be easy to stumble into pitfalls. That pretty well describes C++ as a whole, in fact.

bit_user · Dec 7, 2019

TJ Hooker said:
Sure, I agree that it would probably be better to use "raises the question" in this case. But I'm not about to call someone wrong for using "begs the question".

Because that usage is so common, I'm never one to raise the matter. But I'm always thinking it, and figured I'd take the opportunity to chime in, as I've noticed @PaulAlcorn is a repeat "offender".

Anyway, I thought it was uncharacteristically sloppy or disingenuous of you to be so selective, when quoting sources to rebut @Arbie. If you have a good case to make, then it should easily withstand acknowledging other positions.

Chung Leong · Dec 7, 2019

bit_user said:
Communication is actually fast - it's the synchronization that's typically the slow part.

Yup, lock contention is a serious barrier to scaling. A lot of performance is sacrificed in order to prevent corruption of the application state.

bit_user · Dec 7, 2019

Christopher1 said:
Apparently you have never tried to run more than one program at once. Yes, for the average person even, more cores = more performance because background tasks can use cores that are under-utilized while allowing programs that are single-thread bound to use the full or near full capabilities of their core.

At work, it was night-and-day when my desktop got upgraded from dual-core + HDD to a quad-core + SSD.

Before, the main issue I had was the virus scanner and other automated tasks. My PC became virtually unusable at 5 PM, every day, when some sadistic corporate sysadmin configured my PC's virus scanner to run. But, with 4 cores (still no HT) and a SSD, I can barely even tell when anything is running in the background. I wouldn't even know, except I always keep an eye on the CPU load.

Obviously, most of that is HDD -> SSD. But I do think quad core helps immensely. Especially with browsers all being heavily multithreaded. As I'm about to go to 6 cores, we'll see if I notice any further improvement.

bit_user · Dec 7, 2019

Chung Leong said:
Yup, lock contention is a serious barrier to scaling. A lot of performance is sacrificed in order to prevent corruption of the application state.

Obviously, use locks when you need them. But, the trick is to try to partition your work and pick data structures to minimize read/write access to shared state.

BTW, barriers are one thing GPUs do really well. Much better than CPUs, because they basically implement synchronization and context switching directly in hardware. On CPUs, the OS intervenes whenever you actually block on something.

InvalidError · Dec 7, 2019

bit_user said:
Obviously, use locks when you need them. But, the trick is to try to partition your work and pick data structures to minimize read/write access to shared state.

And situations where a data set can be neatly split into multiple independent parts that can be dumped on individual threads is where HPC gets its massive-scale scalability from. The simulation has billions of individual elements to simulate, those billions of elements can be divided into however many chunks as there are available threads, each thread can work on its chunk without having to communicate with anything else until neighbors are done with their respective chunks and need to solve boundaries before the next simulation step can begin.

nofanneeded · Dec 7, 2019

I Think we need a new Operating System and/or a New Programming languge that automatically split the work between cores without the programmer even noticing.

we are stuck in old ways OS/Languages and we need something big and huge. we are stuck in the old ways of coding ..

It is silly that we are in almost 2020 and programmers still have to cod etheir software to take advantage of more cores manually.

InvalidError · Dec 7, 2019

nofanneeded said:
I Think we need a new Operating System and/or a New Programming languge that automatically split the work between cores without the programmer even noticing.

New programming languages and new OSes will never change the fact that algorithms are a collection of operations that have to happen in a relatively sequential order to achieve a given end-goal and the burden of describing said algorithm rests solely on the programmer. The only way to relieve programmers of having to explicitly multi-thread performance-critical code would be to develop a AI sufficiently advanced to figure out the most effective way of splitting the algorithm and its data set for thread-level parallelism.

bit_user · Dec 8, 2019

nofanneeded said:
I Think we need a new Operating System and/or a New Programming languge that automatically split the work between cores without the programmer even noticing.

we are stuck in old ways OS/Languages and we need something big and huge. we are stuck in the old ways of coding ..

Relevant:

Soft Machines' 'Virtual Cores' Promise 2-4x Performance/Watt Advantage Over Competing CPUs

Soft Machines' Virtual Instruction Set Architecture uses the concept of "virtual cores" to better distribute threads between two or more physical cores. The company claimed this offers better performance per watt than competing CPU architectures.

www.tomshardware.com

Haven't heard anything from them, but I think they might've at least been heading in the right direction.

nofanneeded said:
It is silly that we are in almost 2020 and programmers still have to cod etheir software to take advantage of more cores manually.

The programming language & OS communities are aware of these issues and I'm sure there've been many developments, over the past decade. If you're interested in the subject, I'm sure it's not hard to find & pick up on some and learn more.

Lately, C/C++/C#/Java are starting to give way to a new generation of languages. Since I'm not very familiar with any, I'll leave you with a few suggestions that I believe address concurrency as a fundamental part of the language.

I'll reserve judgement on Intel's Data Parallel C++, until I know more. The fact that it's built atop SYCL/OpenCL is at least promising, although this also suggests a more restrictive concurrency model than you might like for specifically multicore CPU programming.

I should also add Scala, except it's been around for about 15 years and (AFAIK) doesn't seem to be gaining much traction:

https://en.wikipedia.org/wiki/Scala_(programming_language)#Concurrency

I know Scala heavily influenced Kotlin, but I'm not sure if they adopted many of its concurrency features. Google is backing Kotlin in a pretty big way. Both Scala and Kotlin compile to Java bytecode, run on the JVM, and can use Java libraries. So, they're a natural fit for Android development.

bit_user · Dec 8, 2019

InvalidError said:
New programming languages and new OSes will never change the fact that algorithms are a collection of operations that have to happen in a relatively sequential order to achieve a given end-goal and the burden of describing said algorithm rests solely on the programmer. The only way to relieve programmers of having to explicitly multi-thread performance-critical code would be to develop a AI sufficiently advanced to figure out the most effective way of splitting the algorithm and its data set for thread-level parallelism.

I don't have a specific answer to this point, other than to say that it's one area where declarative languages rule. Unfortunately, most usage of declarative languages is relegated to domain-specific applications, and I don't foresee any stepping up to tackle general-purpose computing problems.

Generally speaking, I think I'm more optimistic on the matter.

nofanneeded · Dec 8, 2019

InvalidError said:
New programming languages and new OSes will never change the fact that algorithms are a collection of operations that have to happen in a relatively sequential order to achieve a given end-goal and the burden of describing said algorithm rests solely on the programmer. The only way to relieve programmers of having to explicitly multi-thread performance-critical code would be to develop a AI sufficiently advanced to figure out the most effective way of splitting the algorithm and its data set for thread-level parallelism.

Algorithms today include dividing work between threads . but the problem is , it has to be done manually.

We need an Operating sysytem/Laguage Combo to allow calculations to be automatically divided between cores without the need of assigning it manually inside an Algorithms.

That is to take advantage of all cores available in the background. Like a layer between you and the CPU that takes commands from your Language and AI knows how to distribute them between cores.

This is not an easy AI task but we must move on from the old ways.

InvalidError · Dec 8, 2019

nofanneeded said:
We need an Operating sysytem/Laguage Combo to allow calculations to be automatically divided between cores without the need of assigning it manually inside an Algorithms.

Except that dividing work into threads isn't an OS or even programming language issue, it is the simple reality of how software writing works. There is no simple way for developers to tell the compiler how to parallelize stuff in their stead since there are so many different ways individual pieces of code may need to interact with each other and only the developer knows the intent.

For multi-threading to become an automatic compiler or run-time environment thing, you will need an AI sufficiently advanced to reverse-engineer the software developer's intent and re-factor it. This is not going to happen any time soon if ever.

jimmysmitty · Dec 8, 2019

bit_user said:
A lot of frameworks and libraries spin up one software thread per hardware thread, and dispatch work to one or more queues that are serviced by those "worker" threads. So, it doesn't really depend on the developer explicitly deciding how many threads to spin up.

See: https://en.wikipedia.org/wiki/Work_stealing

This is just a function of what your trying to run.

In the case of game developers, they are going to tune the game to target the bulk of the market. However, increasing certain settings or using a really fast GPU might increase the CPU workload enough to load up more cores. If you use a slow GPU, then it's probably going to be the bottleneck, thus reducing the demand on your CPU. So, maybe the majority don't need a beefy CPU, because they're using it with a weak GPU (or use the iGPU).

A game can be programmed to use more cores, yes. Crysis was one of the first games that needed a dual core. Would run on a single core but without sound as sound was offloaded to a secondary core/thread. I wont argue that.

But the primary process for a game or program still cannot be split between multiple cores. A program can have multiple processes and they can be assigned to different cores/threads to keep the primary cores load lighter and Windows will also assign them based on core load.

In the enthusiast market we have levels. We have the regulars and we have the extreme. Both have uses but I still find it hard to see a direct use for more than 16 cores in most enthusiast uses, even on the extreme end.

Until they find a way to split a single process over those 16 cores the only benefit to more and more cores is being able to run more processes before the CPU bogs down.

Gurg · Dec 8, 2019

As I've read and re-read this article and comments a number of time I'm left with a question: Is this an AMD problem with getting higher stable frequencies out of its CPUs or a more universal problem? TH just got 6.9ghz out of an Intel 8 core CPU on all 8 cores, which with upcoming die shrinks demonstrates a 40% potential for future consumer CPUs frequency increases. Yet AMD can't seem to get all its shrunk cores to match Intel's 5,0 ghz all core 9900K, 9700K and 9600Ks. Yes with its new technology AMD can add additional cores that few other than servers or heavy use workstations can even begin to utilize.

75% of Steam users have 4 cores or less and 95% have 6 cores or less. Are game developers to incur massive costs to develop game programs specialized for 8+ core CPUs that only 5% of gamers now own? Are consumers to pay big bucks for CPUs that have many core that are siting idle and not being anywhere near fully utilized,

InvalidError · Dec 8, 2019

Gurg said:
TH just got 6.9ghz out of an Intel 8 core CPU on all 8 cores, which with upcoming die shrinks demonstrates a 40% potential for future consumer CPUs frequency increases.

How well a chip is able to overclock under liquid nitrogen or helium is not representative of any sort of clock frequencies ever likely to become available for mainstream use. For a clock frequency to become attainable without cryogenic cooking, first you need to reduce power draw by 2-3X to make it manageable using conventional cooking, the process also has to become about twice as fast in the normal operating temperature range and then 100% of that process performance gain needs to be dedicated to achieving higher clock frequencies instead of improving IPC or adding other features.

Process shrinks deliver relatively modest clock frequency gains because the bulk of speed gains gets dedicated to extra features and IPC instead. As both AMD (FX) and Intel (Netburst) have discovered at their own expense, pursuing clock frequency leads to horrible power-efficiency and poor overall performance.

bit_user · Dec 8, 2019

InvalidError said:
There is no simple way for developers to tell the compiler how to parallelize stuff in their stead since there are so many different ways individual pieces of code may need to interact with each other and only the developer knows the intent.

Compilers are very good at analyzing data dependencies. Sometimes, they're hampered by limitations of existing languages, which where programming language technology enters the picture.

https://en.wikipedia.org/wiki/Alias_analysis

What's hard is that compilers lack the runtime context to know how big certain data sets or data structures will be, or how likely it is that certain branches will be taken. However, this can largely be addressed through profile-driven optimization, where profiling data is collected from one run and used to re-optimize subsequent runs. It's not perfect, but you can get a lot of mileage from a straight-forward implementation of it. In fact, I wonder if JIT-compiled languages like Javascript don't already do some amount of it.

You can then take it another step and start tracking correlations between different branch decisions, and compile/optimize multiple versions, dynamically switching based on which pattern seems to be the best for the current execution pattern.

InvalidError said:
For multi-threading to become an automatic compiler or run-time environment thing, you will need an AI sufficiently advanced to reverse-engineer the software developer's intent and re-factor it. This is not going to happen any time soon if ever.

I don't see AI getting involved at that level. At least, not any time soon. On that, we can agree.

The problem I think AI can solve, in the near term, is task scheduling. Given a set of concurrent tasks, you need to balance their prioritization in such a way that the computational resources are best utilized. To do this, you need to make sure that the longest paths in the data dependency graph receive proportionately higher priority, so that they finish at about the same time as the shorter paths. What AI could do is learn patterns to help it predict which tasks are on the critical path and prioritize accordingly.

nofanneeded said:
That is to take advantage of all cores available in the background. Like a layer between you and the CPU that takes commands from your Language and AI knows how to distribute them between cores.

That actually not a bad way of summarizing it.

bit_user · Dec 8, 2019

jimmysmitty said:
But the primary process for a game or program still cannot be split between multiple cores.

That's not even true, though.

Think of a program as an execution graph - a set of dependencies on tasks and data. You can absolutely break up the main loop of a game into pieces that can be concurrently dispatched to separate cores. I don't know that it's yet the most common way to do it, but there's really no reason it can't be done.

The library I mentioned, TBB, has facilities for doing this. But it's not too hard to roll your own and there are other frameworks that can be used. Roughly speaking, this is known as the Proactor Pattern.

https://en.wikipedia.org/wiki/Proactor_pattern

jimmysmitty said:
In the enthusiast market we have levels. We have the regulars and we have the extreme. Both have uses but I still find it hard to see a direct use for more than 16 cores in most enthusiast uses, even on the extreme end.

Well, if you're designing a game, you need to make it playable on mass market hardware. So, if it can scale down well to a Pentium, then scaling framerate alone (i.e. by using a high-end GPU) won't create enough extra load to saturate a 16-core CPU.

And I'm just imagining there probably aren't too many CPU-based features that you could use in a multi-player context that wouldn't somehow be an unfair advantage or put you at a disadvantage. It's not like a 16-core user can run with a more sophisticated physics engine - their view of the game world must match everyone else's.

jimmysmitty said:
Until they find a way to split a single process over those 16 cores the only benefit to more and more cores is being able to run more processes before the CPU bogs down.

I'm not sure that multithreading technology is the real limitation, here. I think it's just about having enough work to do, and that's hard when you're talking about a game spanning CPUs with > an order of magnitude different capabilities.

bit_user · Dec 8, 2019

Gurg said:
Yet AMD can't seem to get all its shrunk cores to match Intel's 5,0 ghz all core 9900K, 9700K and 9600Ks. Yes with its new technology AMD can add additional cores that few other than servers or heavy use workstations can even begin to utilize.

Clock speed isn't the whole story. Back when AMD had less efficient CPUs, they tried to compensate by clocking them faster and they burnt a ton of power. Intel is basically now doing the same thing - juicing up its 14 nm to try and stay competitive.

However, Intel is having some problems reaching higher clock speeds with its 10 nm CPUs and AMD seems to have faltered here, as well, with its 7 nm generation. So, I'm not sure we're going to continue seeing clock speed gains as node sizes drop.

Gurg said:
75% of Steam users have 4 cores or less and 95% have 6 cores or less. Are game developers to incur massive costs to develop game programs specialized for 8+ core CPUs that only 5% of gamers now own? Are consumers to pay big bucks for CPUs that have many core that are siting idle and not being anywhere near fully utilized,

Core counts are going up for all market segments. Look back a few years and most Steam users will have had only dual-core. So, it does make sense for developers to invest in utilizing more cores.

I wouldn't tell someone they need 16 cores, but they might find increasing utilization of them, over the lifespan of that CPU. If you just buy a CPU to meet today's needs, you might be regretting it, tomorrow.

InvalidError · Dec 9, 2019

bit_user said:
Core counts are going up for all market segments. Look back a few years and most Steam users will have had only dual-core.

And some Renoir rumors say those things will go up to 8c16t, wonder what the prices on those are going to be compared to non-APUs. Could be a game-changer for "budget" systems.

TerryLaze · Dec 9, 2019

bit_user said:
The problem I think AI can solve, in the near term, is task scheduling. Given a set of concurrent tasks, you need to balance their prioritization in such a way that the computational resources are best utilized. To do this, you need to make sure that the longest paths in the data dependency graph receive proportionately higher priority, so that they finish at about the same time as the shorter paths. What AI could do is learn patterns to help it predict which tasks are on the critical path and prioritize accordingly.

Windows already has that implemented,it still needs developers to do it right.
This works fine for finite things but for infinite things like game loops it's a plague that has caused almost all of the problems modern games have at running stutter free.
If you look at the video you can see the small PiP in the top right corner is the way it runs normally with "the longest path in the data dependency graph receive proportionately higher priority" while in the main window the priority is lowered so that the supporting threads can provide the data the main thread needs to go on.
You don't even need AI for the task manager to recognise a game .exe name and reducing priority on it's own.
And you wouldn't need this at all if developers would just put in a condition in the main thread like "wait while thread_xx not responds"

View: https://youtu.be/J8DQl_VRiPw

The division is another one that was made pretty public about how bad it is running on dual cores,again it's all about the main thread running with a high priority.

View: https://youtu.be/qVrKusAxp60

bit_user said:
That's not even true, though.

Think of a program as an execution graph - a set of dependencies on tasks and data. You can absolutely break up the main loop of a game into pieces that can be concurrently dispatched to separate cores. I don't know that it's yet the most common way to do it, but there's really no reason it can't be done.

You don't understand, anything that is left over after what you explain that is the main loop,basically in a perfect world it would be

get input
calculate response(s)
draw graphics accordingly

and that's what you can't split up anymore,the faster you are able to run this loop the more FPS you will get even if you are on a dual core if that part runs faster everything will run faster.

News AMD CTO Mark Papermaster: More Cores Coming in the 'Era of a Slowed Moore's Law'

Titan

Titan

Prominent

Distinguished

Titan

Titan

Titan

Reputable

Titan

Titan

Titan

Respectable

Titan

Titan

Titan

Respectable

Titan

Champion

Distinguished

Titan

Titan

Titan

Titan

Titan

Titan

Share this page