Question: Transition to 3nm and beyond?

omar80747326

Dec 16, 2017
Since FinFET will be dead soon and GAAFET is coming:

What does it offer for future generations of electronics (CPUs, GPUs, storage, RAM, SoCs, etc.) in terms of TDP, heat, efficiency, and performance?

How does GAAFET solve FinFET's issues, and what potential problems come with it and with smaller transistor nodes from now on?
 
Laws of diminishing returns are going to apply. You can only go so far on certain structures, like silicon, before the gaps between become too small to sustain power/thermal transmission.

AMD's answer was to separate the cores, which added size to the die but offers better protection from radiated heat. Intel just went to a larger die, but I don't see that lasting much longer: too much concentrated heat in the center cores.

Cooling is also a major factor. Right now Intel is hitting limits that are not of its own making. Only the very largest air coolers can deal with a 13900K, and not very well at that, and with the distrust many have of liquid cooling, Intel will push itself out of mainstream use if it goes for more than 300W from a CPU.

Which brings up GAAFET, which offers the ability to channel more current than FinFET. So unless Intel figures out how to keep or gain performance while drastically dropping vcore, the increase in wattage should mean custom loops for the flagship CPUs.

As far as I can see anyways.
 
The problem FinFET and GAAFET solve is that at smaller sizes, a planar gate has a harder time controlling current flow between the source and drain of the transistor. FinFET is likely running into the same problem planar gates were having at 22nm and beyond.

However, the real problem comes from the fact that FETs rely on an insulating layer between the gate and the channel. At some point, if that layer gets thin enough, quantum tunneling starts to become a problem.

But otherwise, TDP, power consumption, etc. isn't going to go down per se. That is, we're not going to see the rise of 35W 64-core CPUs that can leave EPYC in the dust. That's not ever going to happen. Instead, designers work around a target power limit and figure out what they can get out of that.
 
Light, according to some. I'm not really sure, and nobody is. How does anyone know what's going to happen next? It hasn't been invented yet, the same way nobody could have predicted the use of 3.3nm GAAFETs back when they were still using 22+nm planar gates.

It could also be that dedicated multiple cores become a thing of the past, and CPUs develop back into a single core with enough bandwidth to split into multiple threads: instead of the 2 used by HT, 1 core running 64 threads with a Super HT. Or even an unlimited variable thread count, depending on the size of the used bandwidth compared to the size of available bandwidth.

Oh, they already sorta invented that, it's how a gpu works.
 
So the question is:
What kind of technology will be used once transistor downscaling stops?
Is it graphene, photonic chips, GaN, or what?
As long as we use electrons as the control medium, then the problem of quantum tunneling (ignoring other effects like how electric fields will interact with each other at that scale) will continue to linger.

I don't think photons will really work outside of information buses, because we can't directly manipulate them while in flight.
 
There's also the software to consider. It's one thing to change platforms, like going from 3rd-gen Intel to Zen 4 AMD. The software is still compatible and nothing really has changed. But changing exactly how a CPU 'thinks' by use of photons, GPU-type architecture, etc. will have a strong effect on how software, whether it's the OS or apps, will interact.

The general public would have an absolute fit, and AMD would see a popularity it never did before, if Intel decided that only Win12 and only apps written in GPU-style code would work on 14th Gen, with all prior apps being incompatible.

Intel and AMD both have put themselves in a corner they can't get out of. Parallel instead of Serial won't be a thing until someone like ARM gets off its duff and becomes the third Major Player in the game, offering enough of a difference that there's actually a transition, instead of a cutoff, in how CPUs work.
 
Parallel instead of Serial won't be a thing...
Intel has already tried this. It failed miserably for them. Transmeta also tried this, and they only lasted two product generations. Elbrus seems to be trudging around still, but they're not exactly winning any performance awards. There's also the issue that not every software algorithm benefits from being parallel or can be made parallel. No matter what, you can't make these statements run at the same time:
C:
x = input + 1;
y = x + 2;
z = x + y;
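To put that dependency chain in more concrete terms, here's a rough sketch that works out the earliest step each statement could run at if you had unlimited cores. The `statements` table and the `scheduleLevels` helper are made up for illustration; this is not how any real compiler or CPU tracks dependencies.

```javascript
// Each statement lists the variable it writes and the variables it reads.
// This mirrors: x = input + 1; y = x + 2; z = x + y;
const statements = [
  { writes: "x", reads: ["input"] },
  { writes: "y", reads: ["x"] },
  { writes: "z", reads: ["x", "y"] },
];

// Earliest step at which each statement can run, assuming unlimited cores.
// A statement has to wait until every value it reads has been produced.
function scheduleLevels(stmts) {
  const readyAt = new Map(); // variable -> step at which its value exists
  return stmts.map((s) => {
    // Values nobody in the list writes (like "input") are ready at step 0.
    const lastInput = Math.max(0, ...s.reads.map((v) => readyAt.get(v) ?? 0));
    const level = lastInput + 1;
    readyAt.set(s.writes, level);
    return level;
  });
}

const levels = scheduleLevels(statements);
console.log(levels); // [ 1, 2, 3 ]
```

Three statements, three steps, no matter how many cores are available. If the statements didn't read each other's results, they'd all land on step 1.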
 
That's what I said lol, you just explained it better. I mean a gpu already uses parallel, very well as it goes, rendering by gpu is somewhat faster than by cpu.
DLSS is the difference. It actually 'thinks', which is what parallel has issues with. But that's all dependent on the algorithms used.

Code today is written serial. The trick would be to use multiple serial, in parallel, which is hyperthreading. But that would require BF4 thinking that's open ended to rollover, not locked like CSGO.

I can't see why HT/SMT is locked at 2 threads a core, if the bandwidth is large enough in a core it should be possible to run multiple threads parallel.

A single core at 1GHz running 5 threads simultaneously will get the same amount of work done as a single core at 5GHz pushing 1 thread.

I'm thinking that's where Intel is headed with its scheduler in Windows, not just splitting workloads between E/P cores but also to include threads eventually and pushing towards cpu utilization at a constant 100% by not wasting bandwidth.
 
That's what I said lol, you just explained it better. I mean a gpu already uses parallel, very well as it goes, rendering by gpu is somewhat faster than by cpu.
DLSS is the difference. It actually 'thinks', which is what parallel has issues with. But that's all dependent on the algorithms used.
DLSS doesn't look at the code stream, it looks at the data. DLSS is nothing more than a real-time implementation of how video codecs work.

EDIT: Also, what you're suggesting here doesn't really make sense, because whatever the developer wrote tends to be optimized to hell (first by the compiler, then by whatever hardware mechanisms like OOO are available to the processor) even before it has a chance to run on the execution units. ML algorithms are being used in branch prediction, but that's really the only place they can be used, since everything else can be deterministically... determined.

Code today is written serial. The trick would be to use multiple serial, in parallel, which is hyperthreading. But that would require BF4 thinking that's open ended to rollover, not locked like CSGO.
But again, you can't make some sequences of instructions run at the same time no matter what, because mathematically it doesn't make sense. Imagine trying to bake a cake, for instance. You can't bake a cake until you have batter, unless you're somehow a time lord that just happens to have it right there. And there are cases where doing something before something else is ready isn't ideal. For games, the graphics rendering shouldn't happen until after the game logic's been processed, because graphics give the player a visual snapshot of the game world at that point. If the graphics get rendered before the game logic completes, you'll be getting an outdated representation of that world.

I can't see why HT/SMT is locked at 2 threads a core, if the bandwidth is large enough in a core it should be possible to run multiple threads parallel.
There are plenty of CPUs that have more than 2 threads per core. UltraSPARC had a few, and POWER8/9 had some implementations. I'm working with a CPU at my current company that uses hardware threads for 1-cycle context switching. But the thing is, none of these implementations have all of those threads running at once, because there aren't enough resources in the backend to support them all. ILP is a thing, and if one thread can take up 90% of the backend, that doesn't leave a whole lot for the rest.

The only reason why these CPUs have so many hardware threads is to reduce latency. However, reduced latency doesn't necessarily mean increased throughput.

A single core at 1GHz running 5 threads simultaneously will get the same amount of work done as a single core at 5GHz pushing 1 thread.
Only if those 5 threads have the same runtime, assuming the only thing different about the CPUs is the clock speed and simultaneous thread handling capabilities.

For instance, if all 5 of those threads took 1 second on the 1GHz CPU to complete, yes, it'll take 1 second on the 5GHz CPU to complete. But if they took, say, 1, 2, 3, 4, and 5 seconds to complete on the 1GHz processor, the completion time would be 5 seconds. The 5GHz processor would get it done in (1 + 2 + 3 + 4 + 5) / 5 = 3 seconds.
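Here's that arithmetic as a quick sketch. It assumes the idealized setup from the example: the slow core runs every thread at full speed simultaneously, and the fast core is exactly `speedup` times quicker per thread but runs them one after another. The `completionTimes` function is just a name I picked for illustration.

```javascript
// durations: seconds each thread needs on the slow (1GHz) core.
// speedup: how many times faster the single-threaded core is (5 for 5GHz vs 1GHz).
function completionTimes(durations, speedup) {
  // Slow core, all threads at once: finished when the longest thread finishes.
  const parallel = Math.max(...durations);
  // Fast core, one thread at a time: total work divided by the speedup.
  const serial = durations.reduce((sum, d) => sum + d, 0) / speedup;
  return { parallel, serial };
}

console.log(completionTimes([1, 1, 1, 1, 1], 5)); // { parallel: 1, serial: 1 }
console.log(completionTimes([1, 2, 3, 4, 5], 5)); // { parallel: 5, serial: 3 }
```

The two only tie when every thread has the same runtime. As soon as the runtimes are uneven, the parallel case is stuck waiting on its longest thread.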
 
For instance, if all 5 of those threads took 1 second on the 1GHz CPU to complete, yes, it'll take 1 second on the 5GHz CPU to complete. But if they took, say, 1, 2, 3, 4, and 5 seconds to complete on the 1GHz processor, the completion time would be 5 seconds. The 5GHz processor would get it done in (1 + 2 + 3 + 4 + 5) / 5 = 3 seconds.
True, but a good scheduler would also be able to shove a 4+3+2+1 in right after the 1,2,3,4,5.
With every thread kept busy for 5 seconds, that's 5 seconds in parallel vs ((1+2+3+4+5)/5) + ((1+2+3+4)/5) = 3 + 2 = 5 seconds. Same 5 seconds, same amount of work. The difference is that the 1GHz core would require less power, as it wouldn't need to maintain such high speeds, which means less heat. That's essentially how the PlayStation platform works, using higher core count CPUs to spread the load wider, in parallel, instead of concentrating the workload in fewer cores at faster speeds like the PC did.

Games like CSGO were written back when 1- and 2-core Pentiums were popular, pushing 3.4GHz etc., but written on the PS they would have used all 8 of its 1.8GHz AMD Jaguar cores.

Which is why I said Intel/AMD is stuck in a rut they can't get out of. It'd cost them far too much to flip to the PS way of thinking, because it'd make most current apps and software obsolete. Who'd buy an Intel CPU that didn't run any games because they weren't yet written to run that way, or needed an emulator to transpose the older game code into something the CPU could use?

It's why the FX bombed: software was written for Intel, highly serial, and didn't make use of multiple cores, just 4 at most. It's also why years later Intel finally got off its duff and started making higher than 4/8 CPUs, because software was finally complex enough to be written to take advantage of multiple cores using more than 8 threads.
 
True, but a good scheduler would also be able to shove a 4+3+2+1 in right after the 1,2,3,4,5.
With every thread kept busy for 5 seconds, that's 5 seconds in parallel vs ((1+2+3+4+5)/5) + ((1+2+3+4)/5) = 3 + 2 = 5 seconds. Same 5 seconds, same amount of work. The difference is that the 1GHz core would require less power, as it wouldn't need to maintain such high speeds, which means less heat. That's essentially how the PlayStation platform works, using higher core count CPUs to spread the load wider, in parallel, instead of concentrating the workload in fewer cores at faster speeds like the PC did.
This is on the assumption that an ideal situation like this will happen more often than not. But the reality is, the ideal situation is just that: ideal.

Also, power consumption is an instantaneous measurement, which isn't really that useful on its own. Energy consumption (power used over time) needs to be looked at. For example, let's say I ran a task for one hour at 75W. If I dial the processor down to 45W, the performance loss is such that the task needs 72 minutes to complete. Now from an energy consumption standpoint, the 45W scenario wins (75Wh vs 54Wh). But if we add 12 minutes of idle time onto the 75W scenario so both cover the same 72 minutes, and assume the idle chip draws next to nothing, the average power for the 75W scenario becomes 62.5W. Sure, the 45W scenario still wins overall, but if your goal is energy efficiency, it really depends on what you're after. For sustained loads, yes, the lower power spec would be better. But for periodic bursty loads, a higher power spec can be better, because the work gets done sooner and the chip goes to sleep for longer.
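If you want to play with the numbers yourself, here's a small sketch of the energy arithmetic in watt-hours (watts times hours). The helper names and the `idleW` parameter are my own for illustration, and the idle draw of 0W is an assumption, not a measured figure for any real chip.

```javascript
// Energy in watt-hours for a run at a constant power draw.
function energyWh(powerW, minutes) {
  return (powerW * minutes) / 60;
}

// Energy over a fixed window: active for activeMin, then idle for the rest.
// idleW is an assumed idle draw (0 here), not a measurement.
function windowEnergyWh(activeW, activeMin, idleW, windowMin) {
  return energyWh(activeW, activeMin) + energyWh(idleW, windowMin - activeMin);
}

// Both scenarios compared over the same 72-minute window.
const fast = windowEnergyWh(75, 60, 0, 72); // 60 min at 75W, then 12 min idle
const slow = windowEnergyWh(45, 72, 0, 72); // 72 min at 45W, no idle time
console.log(fast, slow); // 75 54
```

Change the power levels and runtimes to see where the crossover sits for a given workload. How much performance you lose per watt you give up is what decides which side wins.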

Games like CSGO were written back when 1- and 2-core Pentiums were popular, pushing 3.4GHz etc., but written on the PS they would have used all 8 of its 1.8GHz AMD Jaguar cores.
It would've by necessity to extract all of the power out of what was basically a netbook CPU core. Even if CS:GO was made with the PS4/XB1 in mind and used all those cores, the faster cores of Intel Sandy Bridge would've more than made up for the lack of cores.

Which is why I said Intel/AMD is stuck in a rut they can't get out of. It'd cost them far too much to flip to the PS way of thinking, because it'd make most current apps and software obsolete. Who'd buy an Intel CPU that didn't run any games because they weren't yet written to run that way, or needed an emulator to transpose the older game code into something the CPU could use?
With regards to what most people use computers for (web browsing, watching videos, typing up documents, etc.), most of these tasks are I/O bound (and one of them doesn't even run on the CPU anymore), not compute bound. Adding more cores does nothing for I/O bound scenarios.

It's why the FX bombed: software was written for Intel, highly serial, and didn't make use of multiple cores, just 4 at most. It's also why years later Intel finally got off its duff and started making higher than 4/8 CPUs, because software was finally complex enough to be written to take advantage of multiple cores using more than 8 threads.
The FX bombed because of a bad design.

Also continuously repeating "software is serial" only tells me that you don't work in software. But if you want to try and convince me that you do know something, here's a code snippet:
JavaScript:
    function changeTab(e) {
        if (e.which == 37  && e.ctrlKey) {
            let prevTab = $('ul.chat-tabs>li.active').prev();
            if (prevTab.hasClass('thumb') === true) {
                prevTab = $('ul.chat-tabs>li.active').parent().prev();
            
                if (prevTab.length > 0) {
                    $(prevTab[0].children[1]).click();
                }
            }
            else {
                prevTab.click();
            }
            setTimeout(() => {$('textarea.active').focus();}, 100);
            
            return false;
        }
        else if (e.which == 39 && e.ctrlKey) {
            let nextTab = $('ul.chat-tabs>li.active').parent().next();
            if (nextTab.length === 0) {
                nextTab = $('ul.chat-tabs>li.active').next();
            
                if (nextTab.length > 0) {
                    nextTab.click();
                }
            }
            else {
                $(nextTab[0].children[1]).click();
            }
            setTimeout(() => {$('textarea.active').focus();}, 100);
            return false;
        }
    }
If you want the summarized version, this is an event handler for a key press that checks to see if CTRL + Left or CTRL + Right was pressed to switch tabs on a UI left or right. But it also has to make sure that there's another tab to actually go to. There's some other stuff to check based on how the GUI was designed. So tell me, how can you break this up so that it can run on multiple cores at the same time? Or in an extreme case, how can every instruction run at the same time?

Oh, and the funny thing about GPUs and graphics rendering, since that's a commonly brought up example of "embarrassingly parallel" operations: the actual process of determining a pixel's color is highly serialized. The only thing that makes it "embarrassingly parallel" is that you can perform the same operation for each pixel at the same time. So theoretically it scales with n pixels.
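A toy sketch of what I mean, with a made-up `shadePixel` function standing in for a real shader (it assumes width and height are at least 2). Inside the function every line needs the result of the one before it. The loop in `render` is the only part where the work is independent, and that's the part a GPU spreads across its cores.

```javascript
// Hypothetical per-pixel "shader": each step depends on the previous one,
// so the work for a single pixel is a serial chain.
function shadePixel(x, y, width, height) {
  const u = x / (width - 1);           // horizontal position, 0..1
  const v = y / (height - 1);          // vertical position, 0..1
  const brightness = (u + v) / 2;      // needs u and v
  const gamma = Math.sqrt(brightness); // needs brightness
  return Math.round(gamma * 255);      // needs gamma
}

// No iteration reads another pixel's result, so every iteration of this
// loop could run at the same time on hardware with enough cores.
function render(width, height) {
  const pixels = new Uint8Array(width * height);
  for (let i = 0; i < pixels.length; i++) {
    pixels[i] = shadePixel(i % width, Math.floor(i / width), width, height);
  }
  return pixels;
}

console.log(Array.from(render(2, 2))); // [ 0, 180, 180, 255 ]
```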
 
Right. So that snippet takes up a thread. And the snippet doing the same thing for the shift key takes up a thread, and the one for the alt key takes up a thread. I'm not suggesting splitting code to run one task on multiple threads, but multiple tasks run on multiple threads, 5 tasks run in 5 threads in parallel simultaneously, instead of those 5 tasks run in serial on 1 thread.

PCs right now are still thinking like Pentiums, shoving tasks through 1 after the other, regardless of size, instead of 'scaling with n pixels' and shoving the tasks through 'with' the others. It would most likely require larger cache, and a Master core to put all the results in correct order.

To use your analogy, you get 4 or 5 prep cooks, each working on 1 item separately, and the chef takes the results, combines the batter and shoves the cake in the oven, instead of 1 or 2 cooks doing 2-3 separate tasks in order and the chef making the batter and throwing the cake in the oven.
 
Right. So that snippet takes up a thread. And the snippet doing the same thing for the shift key takes up a thread, and the one for the alt key takes up a thread. I'm not suggesting splitting code to run one task on multiple threads, but multiple tasks run on multiple threads, 5 tasks run in 5 threads in parallel simultaneously, instead of those 5 tasks run in serial on 1 thread.
Except you don't need to do that for those keys. If I wanted to check for Shift + Left/Right or Alt + Left/Right, all I would do is:
JavaScript:
function changeTab(e) {
    if (e.which == 37) {
        if (e.ctrlKey) {
            // Do Ctrl + Left stuff
        }
        else if (e.altKey) {
            // Do Alt + left stuff
        }
        else if (e.shiftKey) {
            // Do Shift + left stuff
        }
        return false;
    }
}
There's no point in spawning a thread to handle them or even spawn a thread to look to see if those key combinations are being pressed. If you did, then you'd have a lot of threads running that don't do anything useful most of the time. And considering each thread still has the overhead of thread state to maintain and scheduling overhead, those threads may not get to be run at the time you'd like.
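If the list of combinations grows, the same idea can be written as a lookup instead of a chain of ifs. This is only a sketch: the action functions are placeholders I made up in place of the "Do ... stuff" comments, and `handleKey` is a hypothetical name. It's still one handler that only does work when a key event actually arrives.

```javascript
const LEFT_ARROW = 37;

// Placeholder actions, checked in priority order (Ctrl, then Alt, then Shift).
const leftArrowActions = [
  ["ctrlKey", () => "ctrl+left"],
  ["altKey", () => "alt+left"],
  ["shiftKey", () => "shift+left"],
];

// One handler for every combination: no extra threads, no polling.
function handleKey(e) {
  if (e.which !== LEFT_ARROW) return undefined; // not our key, ignore it
  for (const [modifier, action] of leftArrowActions) {
    if (e[modifier]) return action();
  }
  return undefined; // plain Left with no modifier held
}

console.log(handleKey({ which: 37, altKey: true })); // alt+left
```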

PCs right now are still thinking like Pentiums, shoving tasks through 1 after the other, regardless of size, instead of 'scaling with n pixels' and shoving the tasks through 'with' the others.
No, CPUs can run things in parallel for software that's designed that way and scale accordingly. How do you think Cinebench works? It's basically rendering graphics on a CPU.

It would most likely require larger cache, and a Master core to put all the results in correct order.
CPUs already do this. It's part of the out-of-order execution feature.

To use your analogy, you get 4 or 5 prep cooks, each working on 1 item separately, and the chef takes the results, combines the batter and shoves the cake in the oven, instead of 1 or 2 cooks doing 2-3 separate tasks in order and the chef making the batter and throwing the cake in the oven.
So 1 cook gets the flour, 1 cook gets the eggs, 1 cook gets the butter, 1 cook gets the sugar, 1 cook mixes the ingredients, and 1 cook puts the batter in the oven? So let me poke at why this isn't really a good idea:
  • Two of the cooks are waiting around because they're dependent on the steps before them to complete
  • Four of the cooks are useless after a certain point, because they'll be waiting around for the subsequent steps to complete (or get to a point where they can be useful again)
  • You're paying 6 times the price to save maybe 10 minutes of prep time.
I mean, if absolute performance is your goal, then sure. But these are engineered products. And engineers have to work with constraints and tradeoffs.
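One common way to put rough numbers on this tradeoff is Amdahl's law: if only a fraction p of the job can be split up, n workers give a speedup of 1 / ((1 - p) + p / n). The sketch below uses p = 0.5 purely as a made-up figure for the cake example. It's not measured from anything.

```javascript
// Amdahl's law: speedup from `workers` workers when only
// `parallelFraction` of the job can be done in parallel.
function amdahlSpeedup(parallelFraction, workers) {
  return 1 / ((1 - parallelFraction) + parallelFraction / workers);
}

// Assumed: half of the cake-making can be split between cooks.
console.log(amdahlSpeedup(0.5, 6).toFixed(2)); // 1.71
```

So under that assumption, 6 cooks get you well under 2x the speed for 6x the pay. The serial part (mixing, baking) puts a ceiling on what extra cooks can buy.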
 
The first 2 cooks are grabbing dishes, the second four cooks are going back to the fridge for more ingredients....

Keep all the cooks busy doing Something productive and your only limitation will be on waiting for the cakes to cook, but even then, if using a pizza roller oven, you can prep cakes, ice cakes, do dishes and keep everything running with no downtime. Yes it's 6x the pay, but the result is 6x the cakes or more, which is more profit overall.

Having just 2 people make batter, ice cakes, do dishes, and get ingredients is going to hit a brick wall quick. A person can only work so fast, forcing work to pile up. Latency. It's why any CPU faster than about 3.2GHz hits 100% on 2 cores in CSGO. The only fps gains are from IPC. If that code had been written with rollover, you'd be looking at a lot more than 500fps from a 13900K. Latency would be basically nonexistent.
No, CPUs can run things in parallel for software that's designed that way and scale accordingly. How do you think Cinebench works? It's basically rendering graphics on a CPU.
Exactly my point. Games and most apps were/are written to take advantage of Intel: low core count, high speed, high IPC. If software would change and be written like BF4 was (which landed the FX-8350 as 2nd highest fps, beaten only by Intel i7s), taking advantage of thread/core counts instead of GHz, basically parallel instead of serial and more akin to CB, you'd see things like Epyc, TR, and Intel's high core count enthusiast-class CPUs go back to prominence over mainstream.

Spend $4-500 for a PS, or $1500 for a PC that could beat it in fps and graphics. All due to software and how it was coded to use the cpu, because as far as hardware went, the PS was little better if at all than a laptop 1 generation behind.