@TheGreatGrapeApe: Seven corrected Vista's RAM hungriness by removing the 'shadow' buffer on WDDM 1.1 drivers: now, it reads back from VRAM. Depending on your bus interface, this can be VERY slow on some operations (forget about AGP, for example). It does mean, however, that on very large screens with compositing (Aero) enabled, the amount of video RAM becomes paramount; otherwise you get constant 'swapping' between system and video RAM.
A problem that often gets mentioned on Xorg mailing lists is how to optimize a driver so that it doesn't need to read back from VRAM too often.
The easy way (Vista) is to keep a shadow copy in system RAM; this way, you don't care where your data is in VRAM, you just overwrite it. Fast, safe, easy, and it eats RAM for breakfast - especially on composited multi-screen setups.
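A minimal sketch of what that shadow-copy path boils down to; every name here (upload_to_vram, redraw, the window struct) is made up for illustration, not any real driver's API:

    /* Shadow-copy ("easy") approach: each window keeps a full pixel buffer in
     * system RAM, and the driver only ever WRITES to VRAM. All names are
     * hypothetical. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct window {
        int width, height;
        uint32_t *shadow;       /* per-window copy kept in system RAM */
        uint32_t  vram_handle;  /* opaque handle to the on-card copy */
    };

    /* Stand-in for the real upload path: a pure streaming write to VRAM,
     * no readback ever needed. */
    static void upload_to_vram(uint32_t handle, const uint32_t *px, size_t bytes)
    {
        (void)handle; (void)px; (void)bytes;   /* hardware-specific in reality */
    }

    static void redraw(struct window *w)
    {
        size_t bytes = (size_t)w->width * w->height * sizeof(uint32_t);

        /* 1. Render entirely in system RAM, where reads AND writes are cheap. */
        memset(w->shadow, 0xFF, bytes);                 /* placeholder "drawing" */

        /* 2. Blit the finished frame to the card: one big write, done. */
        upload_to_vram(w->vram_handle, w->shadow, bytes);
    }

    int main(void)
    {
        struct window w = { .width = 640, .height = 480, .vram_handle = 1 };
        w.shadow = malloc((size_t)w.width * w.height * sizeof(uint32_t));
        if (w.shadow)
            redraw(&w);
        free(w.shadow);
        return 0;
    }

The cost is visible right in the struct: one full-size shadow buffer per window, which is exactly what adds up on a big composited desktop.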
The 'hard' way (Seven, Linux) is to manage your video RAM as a dedicated memory pool: you store objects in it and retrieve them when you need them. However, you can't manage it like system RAM, because it behaves the opposite way: writing to it is fast and easy, but reading from it is expensive. Moreover, two processors now deal with its content, the CPU and the GPU, so you have to handle concurrent access. It also needs to be able to 'spill' into system RAM when free video RAM gets low.
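To make that concrete, here is a toy sketch of a buffer object with a placement and an eviction path; every name is invented for illustration, and real managers (TTM in particular) are far more involved:

    /* Toy "hard way" sketch: VRAM as a managed pool of buffer objects that can
     * be spilled to system RAM under pressure. All names are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    enum placement { PLACE_VRAM, PLACE_SYSTEM };

    struct bo {                     /* one "buffer object" */
        size_t         size;
        enum placement where;       /* which memory currently holds the pixels */
        void          *sys_copy;    /* valid while spilled to system RAM */
        uint64_t       vram_offset; /* valid while resident in VRAM */
        bool           gpu_busy;    /* the GPU may still be using it */
    };

    static size_t vram_free = 256 << 20;   /* pretend we have 256 MiB of VRAM */

    /* Stand-ins for the DMA engine and the command-stream fence. */
    static void dma_vram_to_sys(struct bo *b) { b->sys_copy = malloc(b->size); }
    static void wait_for_gpu(struct bo *b)    { b->gpu_busy = false; }

    /* Make room in VRAM by spilling a victim object to system RAM. */
    static void evict(struct bo *victim)
    {
        if (victim->gpu_busy)
            wait_for_gpu(victim);   /* two processors touch this memory: never
                                       move a buffer out from under the GPU */
        dma_vram_to_sys(victim);    /* the one expensive readback we tolerate */
        victim->where = PLACE_SYSTEM;
        vram_free += victim->size;
    }

    int main(void)
    {
        struct bo pixmap = { .size = 4 << 20, .where = PLACE_VRAM, .gpu_busy = true };
        evict(&pixmap);             /* VRAM got tight: spill this object */
        free(pixmap.sys_copy);
        return 0;
    }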
On Linux (and the other FOSS systems the work is being ported to), this was solved by Tungsten Graphics and Intel, with TG providing a VRAM manager (TTM) and Intel providing a simpler memory-management interface (GEM), both developed in the open together with the other driver writers.
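For a feel of what the GEM side looks like from user space, this is roughly how a buffer object gets allocated and filled through libdrm_intel (Intel's GEM wrapper); treat it as a sketch with error handling omitted, and the exact header and signature details may vary between libdrm versions:

    /* Rough GEM usage sketch via libdrm_intel; build with -ldrm_intel. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <intel_bufmgr.h>

    int main(void)
    {
        int fd = open("/dev/dri/card0", O_RDWR);
        if (fd < 0)
            return 1;

        /* The GEM buffer manager hands out kernel-managed buffer objects;
         * the kernel, not the application, decides where they live. */
        drm_intel_bufmgr *mgr = drm_intel_bufmgr_gem_init(fd, 4096);
        drm_intel_bo *bo = drm_intel_bo_alloc(mgr, "pixmap", 256 * 256 * 4, 4096);

        /* CPU writes are the cheap direction; we never read this back. */
        drm_intel_bo_map(bo, 1 /* writable */);
        memset(bo->virtual, 0, 256 * 256 * 4);
        drm_intel_bo_unmap(bo);

        drm_intel_bo_unreference(bo);
        drm_intel_bufmgr_destroy(mgr);
        close(fd);
        return 0;
    }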
Now, let's take this case: you have opened a window, it is in VRAM. You want to resize it. That entails:
- modifying its contents, as in rewrapping text, drawing new icons, pulling newer content (if it gets bigger)
- modifying its borders
If you have a shadow copy, you simply re-draw the window in RAM and blit it. A single write, it's fast.
If you have a well-written driver, you merely send the card commands to move the existing glyphs, add new glyphs, add new graphics elements, and let the card do the work.
If your driver is incomplete, however (one operation involved in the resize isn't accelerated), you need to let it process what it can, read back the results, perform the remaining operations on them in system RAM, and send the modifications back to the card. And that for every pixel of movement.
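Here is a hypothetical sketch of that partially-accelerated case (all function names are invented, loosely in the spirit of an EXA-style fallback path): the gpu_* calls only queue commands, while read_back() and upload() are the bus transfers that hurt.

    #include <stdbool.h>
    #include <stdint.h>

    struct surface { int w, h; uint32_t gpu_handle; };

    /* Stand-ins. The gpu_* calls just queue commands: no pixel data crosses
     * the bus. read_back() and upload() are the expensive transfers. */
    static void gpu_copy_area(struct surface *s, int dx, int dy) { (void)s; (void)dx; (void)dy; }
    static void gpu_fill_rect(struct surface *s, uint32_t argb)  { (void)s; (void)argb; }
    static bool gpu_can_composite(void)                          { return false; }
    static void read_back(struct surface *s, uint32_t *sys)      { (void)s; (void)sys; }
    static void upload(struct surface *s, const uint32_t *sys)   { (void)s; (void)sys; }
    static void sw_composite(uint32_t *px, int w, int h)         { (void)px; (void)w; (void)h; }

    void resize_step(struct surface *s, uint32_t *scratch)
    {
        /* Accelerated parts: only commands go to the card. */
        gpu_copy_area(s, 4, 0);            /* shift the existing glyphs */
        gpu_fill_rect(s, 0xFF202020);      /* redraw the freshly exposed border */

        if (gpu_can_composite())
            return;                        /* 6/6 case: everything stays on the card */

        /* 5/6 case: one missing hook forces a full round trip per resize step. */
        read_back(s, scratch);             /* slow: VRAM reads crawl over the bus */
        sw_composite(scratch, s->w, s->h); /* do the missing operation on the CPU */
        upload(s, scratch);                /* ...and ship everything back again */
    }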
Add to that constraints such as refresh rate and forced redraws (blitting video), and you get jerky, slow operations with heavy CPU use: in fact, said CPU use comes mostly from constant wait states.
And this is why a driver can suddenly get very fast: a very frequently performed operation that wasn't accelerated, and that slowed the system down well below zero acceleration (zero acceleration pretty much means drawing everything in system RAM and sending the complete result to the card, so no back-and-forth moves), finally gets accelerated: no more round trips, and the system gets snappy. Say there is a set of six frequently used operations, and Nvidia accelerates six of them on NV40 but five on G80, while AMD accelerates five on R600/700 and four on R800/Evergreen: you then hit the system bus bottleneck (say, 7,000 reads and writes per second) much sooner. At 6/6 you never hit it, since you only perform writes; at 5/6 every redraw costs one read/write round trip for the missing operation, so the bus caps you at 7,000 redraws per second; at 4/6 each redraw costs two round trips, so you are capped at half that - meaning Evergreen renders half as many windows per second as R700 or G80, and all of them pale next to NV40, which never hits the bottleneck.
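Putting rough numbers on that (the 7,000/s figure and the one-round-trip-per-missing-operation model are the simplification used above, not measured data):

    /* Back-of-the-envelope version of the bottleneck argument: the bus tops
     * out at 7,000 readback+upload round trips per second, and each
     * UNaccelerated operation costs one round trip per window redraw. */
    #include <stdio.h>

    int main(void)
    {
        const double bus_limit = 7000.0;           /* round trips per second */
        const int unaccelerated[] = { 0, 1, 2 };   /* the 6/6, 5/6, 4/6 cases */

        for (int i = 0; i < 3; i++) {
            if (unaccelerated[i] == 0)
                printf("6/6 accelerated: writes only, the bus is not the limit\n");
            else
                printf("%d missing op(s): at most %.0f window redraws/s\n",
                       unaccelerated[i], bus_limit / unaccelerated[i]);
        }
        return 0;
    }
    /* Prints 7000/s with one missing hook and 3500/s with two: Evergreen at
     * 4/6 redraws half as many windows as R700 or G80 at 5/6. */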
Add acceleration for those few operations, and suddenly the card flies.
So, why don't they do it by default? Because 2D acceleration is HARD! Harder than 3D acceleration, even - a mistake in a 2D operation means on-screen corruption, which would require a complete redraw to remove; that is too expensive to do automatically (and no reliable detection is possible anyway), and it looks very bad, which is much more obvious to the user than mere slowness. And it happens much more easily than in 3D mode, because you may have several concurrent operations in progress (at least one per window) from unrelated programs that still interact: window over window (transparency, shadows etc.), a TrueType change (all glyphs need redrawing), a theme change (all windows need redrawing and repositioning), the fast switcher (all windows must be rendered as previews and stored), fast preview (same thing) - and now add VSYNC, video blitting or overlays (they have been re-enabled in Windows 7) into the mix...
So they'd rather write drivers that render stuff correctly, if slowly, than drivers that render stuff extremely fast but may corrupt your screen.
In that respect, FOSS drivers are interesting. Since you can watch them in development, you see what a missed operation means: magenta splotches, noise on screen, misaligned glyphs... And you have daring users who try those very experimental builds in extreme but real use cases, then report directly to the programmer, who can more often than not reproduce the problem (the software is freely available), issue a patch, have the daring user try it and confirm that it works - and then the acceleration lands in the next driver release. If the problem comes from the application asking wonky stuff of the driver that for some reason wasn't visible in unaccelerated mode, even the app (or toolkit) programmer may pitch in.
You may have, for example (this is not a real case), noise on curved KDE window borders: Qt4 (KDE's toolkit), when accelerated, may emit an extra alpha channel instead of plain RGB (so ARGB gets sent instead of RGB); the driver doesn't expect that, cuts each pixel at the third byte (treating it as ARG instead of RGB), and you get noise. You then get a fix in Qt4 (it now announces what data type it uses) and a function in the driver that checks that the data it receives matches the expected layout - or logs an error and falls back to the unaccelerated path.
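Driver-side, that defensive check could look something like this; the scenario is made up above, and so is every name below:

    /* Hypothetical sketch: verify the pixel format the toolkit announces
     * actually matches the buffer it sent, and fall back to the software path
     * rather than corrupt the screen. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    enum pix_format { FMT_RGB24, FMT_ARGB32 };

    struct upload_req {
        enum pix_format declared;   /* what the toolkit says it is sending */
        size_t          stride;     /* bytes per row it actually sent */
        int             width;
    };

    static size_t expected_stride(enum pix_format f, int width)
    {
        return (size_t)width * (f == FMT_ARGB32 ? 4 : 3);
    }

    bool accel_upload(const struct upload_req *req)
    {
        if (req->stride != expected_stride(req->declared, req->width)) {
            /* Mismatch (e.g. ARGB data announced as RGB): don't guess and chop
             * bytes - log it and let the unaccelerated path handle it. */
            fprintf(stderr, "format/stride mismatch, falling back to software\n");
            return false;           /* caller takes the unaccelerated route */
        }
        /* ... queue the accelerated upload here ... */
        return true;
    }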
Which means that for 2D, FOSS drivers eat proprietary ones for breakfast: the proprietary ones simply don't get such extensive, fast, direct feedback.