That the GPU is soldered, the memory is soldered, and the differences in chip count are all insignificant to my point. They prove that it can be done, and in the past we've had FB-DIMMs with huge transfer rates
You do realize that FB-DIMMs are actually SERIAL, using point-to-point differential signaling, right? This is exactly what I said: serial interfaces have a much easier time getting through sockets and slots at very high speeds because every lane can be locked on individually, with no need to worry about clock skew between bits the way wide, high-speed parallel interfaces do.
DDR uses strobe signals to break the 64-bit DIMM bus into separately "clocked" 8-bit groups, but even this starts getting touchy beyond 2GT/s: at that rate each bit lasts only 500ps, so with 100ps of setup and 100ps of hold time on the bus, you have less than 300ps of wiggle-room left for everything else.
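To put rough numbers on that timing budget, here is a minimal sketch (the 100ps setup/hold values are just the ballpark figures from this post, not any particular JEDEC spec):

```python
# Rough data-valid window for a source-synchronous parallel bus.
# Illustrative only: setup/hold values are the ballpark figures from the post.

def valid_window_ps(transfer_rate_gtps, setup_ps, hold_ps):
    """Picoseconds left per bit after setup and hold are subtracted."""
    bit_period_ps = 1000.0 / transfer_rate_gtps  # 2 GT/s -> 500 ps per transfer
    return bit_period_ps - setup_ps - hold_ps

# At 2 GT/s: 500 ps - 100 ps setup - 100 ps hold = 300 ps left for skew,
# jitter, crosstalk and routing mismatch across the whole byte lane.
print(valid_window_ps(2.0, 100, 100))  # -> 300.0
```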
Again, no serious loss for DDR3. Furthermore, many newer high-frequency DDR3 modules with decent timings can be underclocked and run at tighter timings than even DDR2 modules achieved.
But cycle count times clock period yields roughly the same effective latency in nanoseconds, and gimping your clock rate achieves nothing beyond sacrificing bandwidth. In benchmarks, higher bandwidth with lower effective latency practically always wins even if the latency cycle count is numerically higher.
Most DDR2-800 (and even DDR2-1066) RAM I can see on Newegg is 5-5-5, not 6 or 7, while DDR3-1066 is overwhelmingly 7-7-7.
Let's crunch some numbers (effective latency in ns = CAS cycles x 2000 / transfer rate in MT/s, since the DDR bus clock runs at half the transfer rate)...
- DDR2-800 CL5 = 5 x 2000 / 800 = 12.5ns effective latency
- DDR3-1066 CL7 = 7 x 2000 / 1066 = 13.1ns
- DDR3-1600 CL9 = 9 x 2000 / 1600 = 11.25ns
- DDR4-2133 CL14 = 14 x 2000 / 2133 = 13.1ns (based on photos of pre-launch packaging)
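The same arithmetic in one place, as a minimal sketch (the kits and CAS numbers are just the examples above; the formula assumes the usual convention that CAS latency is counted in bus-clock cycles, i.e. at half the transfer rate):

```python
# First-word latency in nanoseconds: CAS cycles x bus-clock period.
# For DDR the bus clock is half the transfer rate, so:
#   latency_ns = cas_cycles * 2000 / transfer_rate_mtps

kits = {
    "DDR2-800 CL5":   (800,  5),
    "DDR3-1066 CL7":  (1066, 7),
    "DDR3-1600 CL9":  (1600, 9),
    "DDR4-2133 CL14": (2133, 14),
}

for name, (mtps, cas) in kits.items():
    print(f"{name}: {cas * 2000 / mtps:.2f} ns")

# DDR2-800 CL5: 12.50 ns
# DDR3-1066 CL7: 13.13 ns
# DDR3-1600 CL9: 11.25 ns
# DDR4-2133 CL14: 13.13 ns
```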
The ~50% jump in latency cycle count at the transition boundary between technologies usually favors the older stuff in benchmarks. DDR3-1600 may carry a higher latency cycle count than DDR2 or lower-speed DDR3, but it still ends up with lower effective latency, and that's without paying a premium for low-latency bins at the higher frequency.
If the main objective is feeding an IGP, it makes no sense to sacrifice bandwidth for lower latency, and in most CPU benchmarks it makes little to no sense either once the clock rises high enough to offset the bump in latency cycles. For mainstream RAM, whatever the current definition of mainstream may be, effective latency has hovered around the same 12-13ns mark for most of the past 10 years.
As for "chip count not being a problem", interface width and JEDEC are.
Chips with a wider data bus draw more current from their IOB power plane and need their internal architecture to be that much wider, and likely slower. That means wider chips run significantly hotter (2/4/8X as much switching going on inside, possibly enough to require a heat spreader), and it also becomes more difficult to adequately bypass/filter the power supply to support higher speeds.
The other thing is that JEDEC defined the DIMM interface as having one strobe signal per 8-bit data group, so having one 8-bit chip per strobe signal (or a pair for double-sided DIMMs) is practically dictated by the standard itself - try finding a 1GB DDR3 DIMM that isn't an 8x128MB configuration, even though today's 1GB ICs would make a single-chip DIMM theoretically possible.
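As a concrete illustration of how the one-strobe-per-byte-lane rule pins down the chip count (the 1GB/x8 figures are simply the example above):

```python
# A non-ECC DIMM has a 64-bit data bus, and JEDEC pairs each 8-bit lane
# with its own DQS strobe, so the natural layout is one x8 chip per lane
# per rank.

DIMM_BUS_BITS = 64
CHIP_WIDTH_BITS = 8        # x8 DRAM devices, one per strobe group
DIMM_CAPACITY_MB = 1024    # the 1GB example from the post

chips_per_rank = DIMM_BUS_BITS // CHIP_WIDTH_BITS           # -> 8 chips
capacity_per_chip_mb = DIMM_CAPACITY_MB // chips_per_rank   # -> 128 MB each

print(chips_per_rank, capacity_per_chip_mb)  # -> 8 128
```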
Many things sound nice in theory but hit brick walls in practice.