InvalidError :
bit_user :
the core counts of these server chips seem to be growing past the point of diminishing returns.
Depends on the workloads. For HPC-style applications where algorithms are finely tuned to the underlying system architecture and scale beyond 100k cores for current record-holders, ...
I specifically said "core counts of these server chips". I don't know why you would broaden the conversation to core counts of clusters, but it's not relevant to my point.
InvalidError :
there would no doubt be many cases where you could put 64+ full-blown cores on a CPU with a 256-bit memory architecture and still have no meaningful bottleneck, because the algorithms are designed to keep the bulk of their working set within the CPU's caches, with the rest flowing smoothly and timely within the limits of available bandwidth and latency.
We've been down this rat hole before. My premise is that cache coherency ain't free. Even if your workload is not bottlenecked by memory bandwidth, your energy efficiency will drop, by virtue of more cache-coherency overhead and more on-chip interconnect links to traverse.
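To make that concrete, here's a minimal C sketch (pthreads, C11; the iteration count, struct layout, and 64-byte cache-line assumption are all mine, chosen for illustration). False sharing is just the easiest slice of coherency overhead to demonstrate: two counters sharing one line force that line to ping-pong between cores, while padded counters don't, and the first run typically comes out several times slower. Build with something like gcc -O2 -pthread.

/* false sharing vs. padded counters: the coherency tax made visible */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL /* arbitrary; just big enough to measure */

/* two counters that land in the same 64-byte cache line */
static struct { volatile unsigned long a, b; } same_line;

/* two counters padded onto separate cache lines */
static struct { _Alignas(64) volatile unsigned long v; char pad[56]; } padded[2];

static void *bump(void *arg)
{
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++; /* each write may force a coherency transfer of the line */
    return NULL;
}

static double timed_pair(volatile unsigned long *x, volatile unsigned long *y)
{
    pthread_t t0, t1;
    struct timespec beg, end;
    clock_gettime(CLOCK_MONOTONIC, &beg);
    pthread_create(&t0, NULL, bump, (void *)x);
    pthread_create(&t1, NULL, bump, (void *)y);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - beg.tv_sec) + (end.tv_nsec - beg.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line: %.2f s\n", timed_pair(&same_line.a, &same_line.b));
    printf("separate lines:  %.2f s\n", timed_pair(&padded[0].v, &padded[1].v));
    return 0;
}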
Workloads that are highly scalable can already be run across multiple machines, so the benefit of ultra-high core-count chips is negligible. It's really a question of when the TCO of adding another cluster node is less than the cost of the efficiency loss from adding more cores per node.
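That crossover can be sketched with back-of-envelope numbers. Everything in this little C program (node cost, per-core cost, the exponential efficiency decay) is a made-up assumption, not data; the point is only the shape of the curve. With these particular numbers, cost per unit of effective throughput bottoms out around 64 cores per node, after which another node is the cheaper way to buy throughput. Link with -lm.

/* sketch: cost per "effective core" as cores-per-node grows,
 * assuming each added core loses a bit to coherency/interconnect tax */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double node_cost = 10000.0; /* assumed TCO of one more node    */
    const double core_cost = 100.0;   /* assumed cost of one more core   */
    const double decay     = 0.01;    /* assumed per-core efficiency tax */

    for (int n = 16; n <= 256; n *= 2) {
        double eff = n * exp(-decay * n); /* effective cores after the tax */
        printf("%3d cores/node: %5.1f effective, $%.0f per effective core\n",
               n, eff, (node_cost + n * core_cost) / eff);
    }
    return 0;
}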
This ties into memory, in the sense that additional memory channels are needed only as long as core counts keep climbing. If/when core counts plateau, then memory just needs to keep pace with core clock increases. Packing HBM2/HMC-style cache in-package, as Intel did with the MCDRAM on the Xeon Phi 7200-series, might even enable them to roll back some of these DDR channel increases.
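The per-core bandwidth arithmetic is easy to spell out: a DDR4-3200 channel is good for 25.6 GB/s (3200 MT/s x 8 bytes). The channel/core combinations in this sketch are just illustrative picks of mine, but they show why channel counts had to climb alongside core counts:

/* aggregate DDR bandwidth split across cores, for a few configurations */
#include <stdio.h>

int main(void)
{
    const double chan_bw = 25.6; /* GB/s per DDR4-3200 channel (3200 MT/s * 8 B) */
    const int channels[] = { 4, 6, 8, 12 };
    const int cores[]    = { 16, 32, 64 };

    for (size_t i = 0; i < sizeof channels / sizeof channels[0]; i++)
        for (size_t j = 0; j < sizeof cores / sizeof cores[0]; j++)
            printf("%2d ch x %2d cores: %5.2f GB/s per core\n",
                   channels[i], cores[j], channels[i] * chan_bw / cores[j]);
    return 0;
}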