Here's a thought experiment, showing where frame extrapolation breaks down. Imagine you're playing a first-person game of some sort. You're standing near a corner or some kind of large obstacle that someone could hide behind. If they step out from behind it, then the algorithm isn't going to know what to do in those trailing-edge pixels that are newly revealed in each successive, extrapolated frame.
You might be right that they try to do some sort of AI in-painting, but models like what Adobe uses for that are probably huge and complex, nowhere near realtime. More likely, they just smear the object or reuse the previous frame's pixels at that location. Basically, you'd see a ghosting effect along that trailing edge. Worse yet, it'd probably flicker as each real frame corrects it, drawing even more attention to the artifact. At high native frame rates, the effect might be subtle enough that you wouldn't really notice, but when the native frame rate is low, it'd be very pronounced.
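To make that concrete, here's a rough numpy sketch (my guess at the cheap approach, definitely not DLSS internals): warp the last rendered frame forward along its motion vectors, and wherever nothing lands, just reuse the previous frame's pixels. Those fallback pixels are exactly the ghosting along the trailing edge.

```python
# Hedged sketch, not any vendor's actual algorithm: extrapolate a frame by
# forward-warping the last rendered frame along per-pixel motion vectors, then
# fall back to the previous frame's pixels wherever nothing lands.
import numpy as np

def extrapolate_with_fallback(last_frame, prev_frame, motion):
    """last_frame, prev_frame: (H, W, 3) float arrays.
    motion: (H, W, 2) per-pixel motion in pixels (dy, dx) per frame."""
    h, w, _ = last_frame.shape
    out = np.zeros_like(last_frame)
    covered = np.zeros((h, w), dtype=bool)

    ys, xs = np.mgrid[0:h, 0:w]
    # Project each pixel one frame forward along its motion vector.
    # Collisions just overwrite each other here (no depth test) -- a simplification.
    ty = np.clip(np.round(ys + motion[..., 0]).astype(int), 0, h - 1)
    tx = np.clip(np.round(xs + motion[..., 1]).astype(int), 0, w - 1)
    out[ty, tx] = last_frame[ys, xs]
    covered[ty, tx] = True

    # Disoccluded "trailing edge" pixels: nothing warped into them.
    holes = ~covered
    out[holes] = prev_frame[holes]   # stale data -> visible ghosting right here
    return out, holes
```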
Similar to what I said before, I think the proper solution to this problem is just to natively rasterize and shade those areas. Assuming Nvidia still uses tile-based rasterization, they could actually do this without a ton of overhead, though you would need to reprocess the geometry for that frame. With ray tracing, it's even easier to just shoot some rays where and when you need them, although there's again the problem of not only needing to do the geometry transforms, but also building or updating the BVH.
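And a sketch of that "just shade the holes" idea, again hypothetical: given a disocclusion mask like the one from the warp above, you only re-shade those few pixels. `trace_primary_ray` is a made-up stand-in for whatever the engine's ray or tile re-render path would actually be; the point is that the work scales with the hole area, not the whole frame.

```python
# Hedged sketch: re-shade only the disoccluded pixels. trace_primary_ray(y, x)
# is a hypothetical callback standing in for a real ray dispatch or tile pass.
import numpy as np

def patch_holes_by_tracing(frame, holes, trace_primary_ray):
    """frame: (H, W, 3); holes: (H, W) bool mask; trace_primary_ray(y, x) -> RGB."""
    patched = frame.copy()
    for y, x in zip(*np.nonzero(holes)):
        patched[y, x] = trace_primary_ray(y, x)  # only a handful of pixels per frame
    return patched
```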
This is a contrived example that doesn't really happen in practical use. Think about this: how many frames are required for a person to step out from behind an obstacle? Particularly if a game is rendering at 30+ FPS, it's not like they pop out for a single frame and then disappear one frame later. There will be dozens of frames where the other person is stepping out and then ducking back.
If there's really fast camera motion, things will break down somewhat. But that's always the case, and it doesn't matter as much because when you spin the camera really fast, everything blurs together and looks ugly regardless. (And your monitor's pixel persistence will contribute to this as well.)
What will really happen is that things shift and, in most situations, you'll have edges of maybe a few pixels where the correct data is missing. Those would get filled in by a fast in-painting algorithm and would be visible for maybe 10–20 ms at most. Then a fully rendered frame would come along and you'd get the correct pixels everywhere.
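For stripes that thin, the fast in-painting could be as dumb as pulling colors in from the nearest valid neighbors for a few iterations. A toy numpy version of that general shape (not anyone's actual implementation):

```python
# Hedged sketch of a cheap in-paint pass: repeatedly copy colors from valid
# neighbors into hole pixels (a dilation-style fill). For stripes only a few
# pixels wide, a handful of iterations is enough. Edge wrap-around from
# np.roll is ignored for brevity.
import numpy as np

def dilate_fill(frame, holes, iterations=4):
    """frame: (H, W, 3); holes: (H, W) bool mask of missing pixels."""
    out = frame.copy()
    valid = ~holes
    for _ in range(iterations):
        if valid.all():
            break
        for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            shifted_valid = np.roll(valid, (dy, dx), axis=(0, 1))
            shifted_color = np.roll(out, (dy, dx), axis=(0, 1))
            take = ~valid & shifted_valid      # hole pixels with a filled neighbor
            out[take] = shifted_color[take]    # copy that neighbor's color in
            valid |= take
    return out
```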
This is all for a hypothetical framegen projection future, of course, and only for projecting one frame, not multiple frames. But if you're doing multiple frames, all you need is faster hardware. Like if you're running at 100 FPS native, you have 10 ms between real frames. Now if you want to project one frame into that space, it needs to be done in 5 ms. If you want to do two projected frames, each has to be ready within 3.3 ms. Three frames? 2.5 ms per frame. And you'd probably want to shave off a few tenths from each of those so the frames are ready with some room to spare... and at higher FPS, you'd either need faster projection or fewer projected frames.
But again, the more I look at that scenario, the more sure I am that it's precisely what Intel, Nvidia, and AMD are working on right now. And there are other, more reasonable scenarios. What if the base framerate is only 60 FPS? Now you have ~8 ms to project a single frame, ~5 ms each for two frames, or ~4 ms each for three frames, and you'd get generated output at 120, 180, or 240 FPS. Is the hardware fast enough to do that, with in-painting, right now? On a 5090 you could probably at least do one or two frames that way. On a future 5060, or with an RTX 4060, maybe only a single frame is possible.
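The budget arithmetic from the last two paragraphs in a few lines, if you want to play with other base framerates (the 0.3 ms margin is just my arbitrary "few tenths" of headroom):

```python
# Per-frame budget for extrapolated frames: each projected frame has to be
# ready within one display interval, i.e. frame_time / (generated + 1),
# minus a little margin.
def projection_budget_ms(native_fps, generated_frames, margin_ms=0.3):
    frame_time = 1000.0 / native_fps
    return frame_time / (generated_frames + 1) - margin_ms

for fps in (100, 60):
    for n in (1, 2, 3):
        out_fps = fps * (n + 1)
        print(f"{fps} FPS native, {n} projected -> {out_fps} FPS out, "
              f"~{projection_budget_ms(fps, n):.1f} ms per projected frame")
```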
And in-painting in some ways becomes easier if you're projecting multiple frames. Like suppose there's a projected shift in camera position of ~6 pixels to the right. With a single frame projection, the algorithm has to fill in a whole six-pixel-wide stripe along the left side of the screen. With three frames, it only has to do a two-pixel-wide stripe at each step.
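Same idea as a trivial calculation, assuming the shift gets spread evenly across the projected steps:

```python
# If each projected frame only fills the newly revealed sliver, the stripe to
# in-paint per step shrinks as more intermediate frames are projected.
def stripe_width_px(total_shift_px, projected_frames):
    return total_shift_px / projected_frames

for n in (1, 2, 3):
    print(f"{n} projected frame(s): ~{stripe_width_px(6, n):.0f} px to in-paint per step")
```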
Given what we know about Jensen (i.e. he's always thinking of the next step, planning ahead, working on the future), I suspect what he told me wasn't exactly wrong... it's just not what DLSS 4 will be doing with the RTX 50-series when they launch. Instead, he was probably talking about what the next-generation DLSS 5 or whatever is going to do in a year, or maybe for the RTX 60-series. Because the 50-series shipping right now means the hardware has been done for six months or more, and a lot of the key people are already working on the next and next-next generation GPUs and software solutions!