AMD's Mysterious Fenghuang SoC Spotted in Chinese Gaming Console



I'm sure that if you propose a compelling financial case to AMD, they'll jump right on that.
 


AMD read the thread and was ready to tear up its entire product line and restart with his plans; they even had a new APU socket already planned, with one extra pin just to be different enough.

😀
 

Completely the wrong reasons - putting DDR4 on DIMMs makes DDR4 more expensive too. PCs use DIMMs because PCs need to cover a wide range of applications with a wide range of memory size requirements that wouldn't be practical to cover otherwise.

The single biggest reason why GDDRx is only ever used soldered-on is simply that managing signal integrity at 8+ GT/s through a DIMM connector is not practical. Even support for two DIMMs per channel barely made it into the DDR4 spec, and the JEDEC spec for that configuration only goes up to 2133 MT/s.

Put simply: wide busses, high frequencies and connectors don't mix well.

Edit: also, using connectors means longer traces between the CPU/GPU and DRAM chips, which is also bad for signal integrity on wide high frequency busses. That's why HBM is on-package to eliminate as much trace length and as many intermediate joints as possible.
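
To put rough numbers on it (back-of-envelope figures of my own, not from any spec sheet): the time budget per bit shrinks fast as the transfer rate climbs, and every reflection from a connector or stub has to settle inside that budget.

```python
# Rough unit-interval comparison. The "unit interval" is the time one bit occupies
# on the wire; reflections from connectors and stubs must settle within it.
# Rates below are commonly cited figures, used here purely for illustration.

def unit_interval_ps(transfer_rate_mt_s):
    """Time per bit in picoseconds for a given transfer rate in MT/s."""
    return 1e12 / (transfer_rate_mt_s * 1e6)

for name, rate in [("DDR4-2133 (JEDEC 2DPC limit)", 2133),
                   ("DDR4-3200", 3200),
                   ("GDDR5 @ 8 GT/s", 8000),
                   ("GDDR6 @ 14 GT/s", 14000)]:
    print(f"{name:30s} {unit_interval_ps(rate):7.1f} ps per bit")
```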
 
Signal reflections from empty sockets are always a fun addition for any engineer, especially when you have to wonder whether those sockets will ever even get used.

And of course, if you give the general populace an opportunity, somebody will be mixing modules. "I saved slots so I could upgrade later, why doesn't my new GDDR work with my old?" This is bad enough with more tolerant DDR memory.

And how about the extra power capacity that needs to be built into the supply circuitry for that rare occasion somebody populates all four slots with high-wattage memory? We're talking a 5-10x increase in power consumption for the memory.
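
A quick illustration of why that worst case matters (all wattage figures below are made-up placeholders, not measurements): the VRMs and power planes have to be sized for a fully populated board even if almost nobody ever builds one.

```python
# Hypothetical power-budget sketch. The wattages are illustrative guesses only.
baseline_dram_power_w = 10   # assumed: a single soldered memory configuration
per_module_power_w    = 15   # assumed: one hypothetical high-speed socketed module
slots                 = 4

worst_case_w = per_module_power_w * slots
print(f"Baseline: {baseline_dram_power_w} W, worst case: {worst_case_w} W "
      f"({worst_case_w / baseline_dram_power_w:.0f}x)")
# The supply circuitry must be built for the worst case, not the typical one.
```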
 
In addition to your point, it would be an impractical feat to route all of the traces for HBM on a standard PCB. I doubt we would have seen anything close to Fury X, with 4096 traces for the memory data bus alone, not to mention all of the power and the rest of the signaling.
 

Thank you. I had a sense this must be true, but not the authority to assert it.
 

HBM trades frequency for width: a much wider datapath running at lower clock speeds. If you don't have such a wide datapath, then you need to run higher frequencies, which burns more power.

AFAIK, GPUs have maxed out at a 512-bit datapath to off-package memory, while CPUs are currently at 384-bit (6-channel) in Intel's latest server socket and 512-bit (8-channel) in AMD's EPYC. I feel like that has to be close to the practical upper limit.
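
For rough context, peak theoretical bandwidth is just bus width times data rate. The configurations below are well-known public examples, used here only as a sanity check:

```python
# Peak theoretical bandwidth = bus width (bits) * data rate (GT/s) / 8 bytes.
def peak_bw_gb_s(bus_bits, gt_s):
    return bus_bits * gt_s / 8

configs = [
    ("Fury X HBM1: 4 stacks x 1024-bit @ 1 GT/s", 4096, 1.0),
    ("512-bit GDDR5 @ 6 GT/s (R9 390X class)",     512, 6.0),
    ("8-channel DDR4-3200 CPU (8 x 64-bit)",       512, 3.2),
]
for name, bits, rate in configs:
    print(f"{name:44s} ~{peak_bw_gb_s(bits, rate):6.0f} GB/s")
```

Fury X is the trade in a nutshell: roughly the bandwidth of a fast 512-bit GDDR5 card, but from signals running at a fraction of the speed.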
 

The 'practical' limit is mainly dictated by socket size and how many board layers can fit in your budget at the projected sales volume. If enough people/companies are willing to pay big bucks to go bigger and make a decent profit out of it, they'll make it happen, even if getting there takes a 10 mm-thick, 50-layer ceramic board with buried vias that costs over $10k a pop (bare PCB).
 

Yeah, but I meant practical (i.e. economical) for the markets they're serving.

I could believe it, if somebody chimed in about some mainframe or other specialty chip with > 512-bit off-chip data bus.

Anyway, I just saw that Intel's Whitley platform (LGA 4189) will take them to 512-bit (8-channel) as well. I honestly wonder if we'll ever see more, since the core counts of these server chips seem to be growing past the point of diminishing returns.
 

Depends on the workloads. For HPC-style applications, where algorithms are finely tuned to the underlying system architecture and scale beyond 100k cores for the current record-holders, there would no doubt be many cases where you could put 64+ full-blown cores on a CPU with a 256-bit memory architecture and still have no meaningful bottleneck, because the algorithms are designed to keep the bulk of their working set within the CPU's caches, with the rest flowing smoothly and on time within the limits of available bandwidth and latency.

Not something you are going to see in server or consumer software, but there is a market for it, with a small army of genius programmers and engineers dedicated to making it work.
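
A roofline-style back-of-envelope check of that claim (every number below is an assumption picked for illustration): a kernel is only bandwidth-bound if its arithmetic intensity falls below the machine's flops-per-byte ratio.

```python
# Roofline-style sanity check. All figures are assumptions for illustration.
cores          = 64
flops_per_core = 32e9    # assumed: ~2 GHz * 16 FLOPs/cycle per core
mem_bus_bits   = 256
data_rate_gts  = 3.2     # assumed: DDR4-3200-class memory

peak_flops = cores * flops_per_core
peak_bytes = mem_bus_bits / 8 * data_rate_gts * 1e9     # bytes/s from DRAM

machine_balance = peak_flops / peak_bytes   # FLOPs needed per byte to stay compute-bound
print(f"Machine balance: ~{machine_balance:.0f} FLOPs per byte of DRAM traffic")
# A cache-blocked kernel with higher arithmetic intensity than this stays
# compute-bound even on the narrow bus; a streaming kernel does not.
```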
 

I specifically said "core counts of these server chips". I don't know why you would broaden the conversation to core counts of clusters, but it's not relevant to my point.


We've been down this rat hole before. My premise is that cache coherency ain't free. Even if your workload is not bottlenecked by memory bandwidth, your energy efficiency will drop, by virtue of more cache-coherency overhead and more on-chip interconnect links to traverse in the process.

Workloads that are highly scalable can already be run across multiple machines, so the benefit of ultra-high-core-count chips is negligible. It's really a question of when the TCO of adding another cluster node becomes less than the efficiency loss of adding more cores per node.

This ties into memory, in the sense that additional memory channels are only needed as long as core counts keep climbing. If/when core counts plateau, then memory just needs to keep pace with core clock increases. Packing HBM2/HMC cache in-package, as Intel did in the Xeon Phi 7200-series, might even enable them to roll back some of these DDR channel increases.
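
As a toy illustration of that trade-off (the numbers and the efficiency model are invented purely for the sake of the example): assume each extra core on a node shaves a small fraction off per-core efficiency due to coherency and interconnect overhead, then compare one big node against two smaller ones.

```python
# Toy scale-up vs. scale-out comparison. The 0.2%-per-core penalty is an invented
# placeholder, not a measured figure.
def useful_throughput(cores, loss_per_core=0.002):
    efficiency = max(0.0, 1.0 - loss_per_core * cores)
    return cores * efficiency

one_big_node    = useful_throughput(128)
two_small_nodes = 2 * useful_throughput(64)
print(f"1 x 128-core node : {one_big_node:.1f} core-equivalents")
print(f"2 x  64-core nodes: {two_small_nodes:.1f} core-equivalents")
# Whether the gap justifies a second chassis is then a plain TCO comparison.
```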
 

You only need hardware-based cache coherence when you don't trust the software developers to handle it themselves in a far more efficient manner, which is why we have non-CC NUMA for situations where performance outweighs idiot-proofing the hardware: the programmer says he knows what he's doing, so let him hang from his own rope if he doesn't. The benefits of cramming more cores into fewer systems are lower operating costs, higher density, lower latency, higher bandwidth between cores, etc.
 



No, they cannot. The SoC has no pinout; it is soldered to the motherboard.
 

Cool story. So, where do we have this?

The only models that seem to have gained traction are cache-coherent and hard-partitioned. The value proposition of hard-partitioning ever-bigger chips is unclear to me. At some point, it's got to be cheaper just to plunk down more, smaller CPUs on separate buses.
 

Non-CC memory access was part of the AGP/GART interface and lives on today as part of the PCIe spec. In everyday PCs, it is used by GPUs/IGPs to grant them unfettered access to memory pages managed by the GPU drivers, since it doesn't make sense to incur the overhead of CC-NUMA when the IGP/GPU has fundamentally exclusive access to those pages and the drivers manage any interaction with the OS/user-land.
 

Hmmm... not really a massively multi-core CPU, then? So, not relevant.

For my part, I offer you Xeon D. Why more cores isn't always better...

https://www.nextplatform.com/2015/03/09/intel-crafts-broadwell-xeon-d-for-hyperscale/
 

Again, not relevant to my point. I was clearly talking about CPUs, when I pondered whether they've maxed out at 512-bit.
 

As with any bottleneck question, the answer is heavily dependent on workload. In most modern pro/prosumer software, though, the embarrassingly parallel stuff gets delegated to GPUs when available.
 