The only real difference today, as far as software design goes, is that the primary rendering thread is no longer the only thread that can do rendering work; as of DX12/Vulkan you can offload that to multiple threads. (Note: DX11 allowed this under very constrained circumstances.)
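To make that concrete, here's a rough C++ sketch of the pattern those APIs enable, with a stand-in CommandList type instead of real ID3D12GraphicsCommandList/VkCommandBuffer plumbing (device and queue setup omitted, the names are mine): each worker thread records its own command list in parallel, and submission still happens in one place.

```cpp
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a real command list (ID3D12GraphicsCommandList / VkCommandBuffer).
struct CommandList {
    std::vector<std::string> commands;
    void draw(int objectId) { commands.push_back("draw object " + std::to_string(objectId)); }
};

int main() {
    const int numThreads = 4;
    const int objectsPerThread = 8;
    std::vector<CommandList> lists(numThreads);   // one command list per worker thread
    std::vector<std::thread> workers;

    // Each worker records into its own list; no shared state, no locks.
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < objectsPerThread; ++i)
                lists[t].draw(t * objectsPerThread + i);
        });
    }
    for (auto& w : workers) w.join();

    // The submit itself is still serialized on one thread/queue, like a real GPU queue.
    int submitted = 0;
    for (const auto& cl : lists) submitted += static_cast<int>(cl.commands.size());
    std::printf("submitted %d draw commands recorded on %d threads\n", submitted, numThreads);
    return 0;
}
```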
And as I've long noted, your maximum parallelism is limited by Amdahl's Law. There's only so much you can do by adding more threads, and there's a point where the necessary thread scheduling/memory management starts to become an issue for non-trivial tasks.
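For the numbers-minded: Amdahl's Law says speedup = 1 / ((1 - p) + p/n), where p is the fraction of the work that can actually run in parallel and n is the thread count. A quick toy calculation (the example fractions are mine):

```cpp
#include <cstdio>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
// p = fraction of the work that can run in parallel, n = number of threads.
double amdahl(double p, int n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double fractions[] = {0.50, 0.90, 0.99};  // example parallel fractions
    for (double p : fractions)
        std::printf("p=%.2f: 8 threads -> %.2fx, 64 threads -> %.2fx (ceiling %.0fx)\n",
                    p, amdahl(p, 8), amdahl(p, 64), 1.0 / (1.0 - p));
    return 0;
}
```

Even at p = 0.99 you top out at 100x no matter how many threads you throw at it.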
I doubt we'll get smooth 4k, simply because the GPUs won't be able to push it reliably. 4k/60 is probably out of reach for anything that's computationally intensive. 4k/30 is doable.
Also remember that software on consoles can make a lot of assumptions and optimizations that PCs can't, due to having one HW spec. That's why no one really expected the PS4 Pro/XB1 X, but those were necessary due to their CPUs being crippled (as I continue to note: the PS3 has a more powerful CPU than either current gen console). This generation was really hamstrung by using early-gen APUs as their main processors.
While this is somewhat true, it doesn't paint the entire picture. There are some tasks that scale extremely well across a large number of parallel workers, or we wouldn't have the monsters that Los Alamos and the like are building. It all depends on how atomic your task is. Some tasks are extremely atomic with large data sets. (SETI@home and Folding@home are two more examples.)
Now, as to next-gen consoles, there will be several tricks played to scale to 4k. Some will be hardware, some will be software.
The biggest crusher of APU performance is memory bandwidth. We've seen the benefits of eDRAM (Iris Pro and the original Xbox One) in the speed improvements it brings. Iris Pro actually did quite well over standard Intel graphics despite being a crap architecture compared to AMD's APUs.
Heat generation is a real concern too, so a low-energy-cost memory solution is needed. I'm guessing there will be either a 2 GB or 4 GB HBM module on the chip package. This is especially true if reports of Navi heat issues are accurate.
There are also several types of memory in play. AMD has been playing with the idea of heterogeneous memory for years now: different memory types all laid out as if they're one big memory pool.
Why would you want to do this? It's simple: switching memory models and copying data from one buffer to another wastes efficiency. So what do you do? You let every compute resource access that memory like one big pool. Each memory type has its benefits and drawbacks: system memory is cheap, slow, and high density; GDDR is moderately fast, generates heat, and is less dense; local eDRAM or SRAM is fast but eats up a ton of die area.
Then there is cache and cache coherency, which the CCXs deal with.
So basically AMD's future APUs are going to have both HBM (on package, maybe stacked) and SRAM (on the I/O die). The SRAM on the I/O die is just big enough for draw calls. What's a draw call? A draw call is a set of directions set up on the CPU side to tell the GPU what to draw and how (i.e., draw a race car). In terms of memory they are relatively small, but a past bottleneck was the transfer of a system-memory draw call buffer over the PCIe bus to the GPU, which had to copy and process it. Using a shared memory approach eliminates a lot of that overhead.
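To put rough numbers on "relatively small", here's a toy sketch (the struct layout is mine, nothing from AMD): a draw call is basically a little packet of "which mesh, which material, what transform", and in a shared-pool design the GPU reads that buffer in place instead of waiting for it to be copied across PCIe.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// A hypothetical draw-call record: which mesh, which material/shader,
// and a 4x4 transform. Tens of bytes each, not megabytes.
struct DrawCall {
    uint32_t meshId;
    uint32_t materialId;
    float    transform[16];
};

int main() {
    std::vector<DrawCall> drawBuffer(10000);   // a heavy frame: 10k draw calls
    const size_t bytes = drawBuffer.size() * sizeof(DrawCall);
    std::printf("draw-call buffer: %zu KiB per frame\n", bytes / 1024);

    // Discrete-GPU model: this buffer is copied over PCIe every frame
    // before the GPU can touch it.
    // Shared-pool model: the CPU writes it into the common pool (e.g. that
    // SRAM region) and the GPU reads it in place -- the copy simply goes away.
    return 0;
}
```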
Another bottleneck of the past was that draw calls were pretty much linear in nature: render orc 1, then render orc 2, then render Azeroth player 1. These draw calls are then composited into a final scene and the page flip is called. Ashes of the Singularity and the Vulkan/Mantle APIs showed us what we can accomplish with mGPU and asymmetric multithreaded draw calls; they no longer have to be that linear. The problem was that no programming house really wanted to deal with the overhead of mGPU.

I think I've figured out AMD's secret sauce here, and deep mGPU-specific programming might no longer be necessary. For years people said chiplet rendering just wasn't possible because SLI/CrossFire proved it was inefficient. Every time I heard this counterargument I wanted to slap someone silly. It's not CrossFire/SLI. With some clever tile-based rendering and parallel draw calls, chiplets do make sense (see the sketch below). I worked through the functional blocks and pseudocode, with possible penalties for memory thrash/conflict. AMD's Infinity Fabric, unified memory, and caching address a lot of the memory thrash issue; clever algorithms address the rest.
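Here's roughly how I picture that tile split, as a toy C++ sketch (the partitioning scheme is mine, not anything AMD has published): cut the frame into screen tiles, hand each tile to a chiplet, have each chiplet walk the shared draw list and only process the draws that touch its tile, then stitch the tiles back into one frame.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Screen-space bounding box for one draw call (toy model).
struct Draw { int x0, y0, x1, y1; };
// One screen tile, assigned to one "chiplet".
struct Tile { int x0, y0, x1, y1; int drawsProcessed = 0; };

static bool overlaps(const Draw& d, const Tile& t) {
    return d.x1 > t.x0 && d.x0 < t.x1 && d.y1 > t.y0 && d.y0 < t.y1;
}

int main() {
    const int width = 3840, height = 2160, chiplets = 4;

    // A fake frame's worth of draws scattered around the screen.
    std::vector<Draw> draws;
    for (int i = 0; i < 1000; ++i) {
        int x = (i * 97) % (width - 200), y = (i * 57) % (height - 200);
        draws.push_back({x, y, x + 200, y + 200});
    }

    // Cut the 4k frame into vertical strips, one per chiplet.
    std::vector<Tile> tiles;
    for (int c = 0; c < chiplets; ++c)
        tiles.push_back({c * width / chiplets, 0, (c + 1) * width / chiplets, height});

    // Each chiplet (thread here) walks the same shared draw list in parallel and
    // only rasterizes what lands in its own tile -- unlike SLI/CrossFire, nothing
    // needs a full duplicate of the other GPU's frame data.
    std::vector<std::thread> workers;
    for (auto& tile : tiles) {
        workers.emplace_back([&draws, &tile] {
            for (const auto& d : draws)
                if (overlaps(d, tile)) ++tile.drawsProcessed;
        });
    }
    for (auto& w : workers) w.join();

    // Composite step: in hardware this is just stitching the tiles into one frame.
    for (size_t i = 0; i < tiles.size(); ++i)
        std::printf("chiplet %zu handled %d draws\n", i, tiles[i].drawsProcessed);
    return 0;
}
```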
On the software side of things there will be tricks thrown in, similar to the Xbox One X: different objects will be rendered at different resolutions and then scaled. This is just one of a few tricks that make 4k/60 possible.
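As a rough illustration of that family of tricks (the heuristic is mine, not what any console actually ships, and it's shown per-frame rather than per-object to keep it short): drop the internal render resolution when the last frame ran over budget, claw it back when there's headroom, and upscale to 4k either way.

```cpp
#include <algorithm>
#include <cstdio>

// Toy dynamic-resolution heuristic: if the last frame missed its budget,
// shrink the internal render scale; if it had headroom, grow it back.
double nextRenderScale(double scale, double lastFrameMs, double budgetMs) {
    if (lastFrameMs > budgetMs)              scale *= 0.95;  // over budget: render fewer pixels
    else if (lastFrameMs < budgetMs * 0.85)  scale *= 1.02;  // headroom: claw quality back
    return std::clamp(scale, 0.6, 1.0);      // never below 60% of native, never above native
}

int main() {
    double scale = 1.0;
    const double frameTimes[] = {18.2, 17.5, 16.9, 15.1, 13.8, 13.9};  // made-up ms samples
    for (double ms : frameTimes) {
        scale = nextRenderScale(scale, ms, 16.7);  // 16.7 ms budget = 60 fps
        std::printf("frame %.1f ms -> render at %d x %d, upscale to 3840 x 2160\n",
                    ms, int(3840 * scale), int(2160 * scale));
    }
    return 0;
}
```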
Anyway, I'm ranting, so I'll just shut up.