Don't worry. People will state "this will never happen" or "that will never happen". Then it happens.
I am sure Intel is looking at stacked RAM. In fact they are; it's just a matter of whether the benefits outweigh the risks, much like why Intel went gate-last for HK/MG (which, as I said before, AMD and the rest of the industry will follow).
I never said stacked memory wouldn't go on a CPU die, I said it wouldn't replace main memory. If the die has a low enough TDP and you know you'll be able to deal with the thermal issues, then there is no reason not to put some fast local memory on there. It won't replace desktop main memory, because the same stacking tech makes system memory just as cheap and dense. How dense will expansion memory sticks become if they can pack 4GB into the space of a single CPU die? DIMMs have four to sixteen modules on them. Assuming each of those modules contains the same amount of memory as is stacked on the CPU, you're looking at 4~16x more memory per stick, or 8~32x more memory in the typical dual-stick setup.
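To make the ratio concrete, here's a quick back-of-the-envelope calculation. The 4GB per stack, the module counts, and the dual-stick setup are just the assumptions from the paragraph above, not figures from any roadmap:

```python
# Back-of-the-envelope math for the stick-vs-stack comparison.
GB_PER_STACK = 4            # assumed capacity of one 3D stack (on-die or on a DIMM module)
MODULES_PER_DIMM = (4, 16)  # typical range of DRAM packages on a stick
STICKS = 2                  # typical dual-stick setup

for modules in MODULES_PER_DIMM:
    per_stick = modules * GB_PER_STACK
    system = per_stick * STICKS
    print(f"{modules} modules/stick: {per_stick} GB per stick "
          f"({per_stick // GB_PER_STACK}x the on-die stack), "
          f"{system} GB across {STICKS} sticks "
          f"({system // GB_PER_STACK}x the on-die stack)")
```

That works out to 16~64GB per stick (4~16x the on-die stack) and 32~128GB in a dual-stick system (8~32x).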
http://www.marketwatch.com/story/jedec-publishes-breakthrough-standard-for-wide-io-mobile-dram-2012-01-05
If you're looking at SoCs then it makes perfect sense; cell phones, handhelds, and even gaming consoles would all benefit tremendously from this. Pico-ITX solutions would also benefit from having the memory soldered directly onto the board or the chip.
Intel / AMD will take their time and only fully implement this in low-end solutions. Memory fragmentation problems go beyond what NUMA introduces. You end up with two separate memory buses, one for the fast local memory and the other for the slow(er) external memory. A similar situation happens with hybrid HDDs (the ones with SSD caches inside them): you have two locations where you can put something, and only the OS knows where stuff is.
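Here's a toy sketch of why that bookkeeping matters, assuming a hypothetical 4GB fast pool and 32GB slow pool; the names and the naive placement policy are made up for illustration, this is not how any real OS allocator works:

```python
# Two pools (fast on-package DRAM and slower external DRAM); every
# allocation has to be tracked so the OS knows which bus it lives on.
class TwoTierAllocator:
    def __init__(self, fast_mb, slow_mb):
        self.free = {"fast": fast_mb, "slow": slow_mb}
        self.where = {}  # allocation name -> which pool it landed in

    def alloc(self, name, size_mb):
        # Naive policy: prefer the fast pool, spill to the slow one.
        pool = "fast" if self.free["fast"] >= size_mb else "slow"
        if self.free[pool] < size_mb:
            raise MemoryError(name)
        self.free[pool] -= size_mb
        self.where[name] = pool
        return pool

mem = TwoTierAllocator(fast_mb=4096, slow_mb=32768)
for name, size in [("game_textures", 3000), ("browser", 2000), ("os_cache", 1500)]:
    print(name, "->", mem.alloc(name, size))
# Once the fast pool fills, later (possibly hotter) data is stuck on the
# slow bus unless something migrates it, which is the fragmentation headache.
```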
There are two solutions. One is to run the local memory at the same speed as the external memory and just treat it like you would any other DIMM. That solves the fragmented memory map issue, but you're wasting the potential of having such fast local memory, especially when you consider that by the time you get 4GB of memory on an Intel chip, the system will have 32-64GB of main memory installed. The other solution involves turning the fast local memory into a fast cache, similar to the older cache sticks from the early '90s. In this case the faster memory holds frequently accessed data and serves as a buffer between the CPU and system memory.
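A minimal sketch of that second option, treating the on-package memory as a managed cache in front of system RAM. The page granularity, capacity, and LRU policy here are illustrative assumptions, not how Intel or AMD would actually build it:

```python
from collections import OrderedDict

class FastMemoryCache:
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page number -> resident in fast memory

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)    # hit: served from fast local memory
            return "hit"
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page] = True             # fill from slower system memory
        return "miss"

cache = FastMemoryCache(capacity_pages=3)
for page in [1, 2, 3, 1, 4, 1, 2]:
    print(f"page {page}: {cache.access(page)}")
# Frequently touched pages stay in the fast stack; everything else keeps
# living in ordinary system memory, so the memory map stays flat.
```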
In either case, 4GB looks big ~now~, but by the time you see this you'll have 32~64GB of memory made from the same stuff in your computer. You'll also have a slower / lower-power CPU than you would have had otherwise, due to the increased thermal insulation. That is the trade-off you pay.
Here is a picture of what it looks like:
http://blog.stericsson.com/wp-content/uploads/2011/12/WideIO.png
You have a package substrate with a CPU on top of it. On top of the CPU a heat spreader is plated to provide a thermal interface with the cooling solution. When a 3D memory stack is implemented, it's put on top of that chip, between the CPU and its thermal plate. Silicon has worse thermal transfer properties than metal, so in this design the memory stack acts as a thermal insulator between the CPU and its thermal plate. Not only that, but 100% of the heat generated by the CPU must pass through the 3D memory stack, which heats up the memory stack before the heat is dissipated by the thermal transfer plate. This means whatever thermal envelope you had for the CPU just shrank: not only must you deal with the added heat load of the memory stack, but also the decreased thermal transfer efficiency. This will manifest itself as lower clock speeds.
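To see why the envelope shrinks, here is a rough first-order series-resistance model. Every number in it (resistances, temperatures, memory power) is an illustrative assumption, not a measured value; the point is the shape of the result, not the exact figures:

```python
T_AMBIENT  = 35.0   # deg C at the cooler
T_JUNCTION = 95.0   # max allowed CPU junction temperature
R_PACKAGE  = 0.40   # deg C per watt, die -> heat spreader -> cooler (baseline path)
R_STACK    = 0.20   # extra deg C per watt added by the DRAM stack in that path
P_MEMORY   = 3.0    # watts dissipated by the stack itself

headroom = T_JUNCTION - T_AMBIENT

# Baseline: all CPU heat goes straight through the package resistance.
p_cpu_without_stack = headroom / R_PACKAGE

# With the stack: CPU heat crosses both resistances, and the shared path
# above the stack also has to carry the memory's own heat.
p_cpu_with_stack = (headroom - P_MEMORY * R_PACKAGE) / (R_PACKAGE + R_STACK)

print(f"Max CPU power without stack: {p_cpu_without_stack:.0f} W")
print(f"Max CPU power with stack:    {p_cpu_with_stack:.0f} W")
```

With those made-up numbers the sustainable CPU power drops from about 150 W to under 100 W, which is exactly the kind of headroom loss that shows up as lower clock speeds.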