TruForm is an excellent technology, and you're right, it's too bad it died out. Hopefully OpenGL will bring it back with Shader 3.0 support, but who knows.
OK, on the 96-bit vs 128-bit comparison: another developer and I were discussing it, and this is what we came up with.
His thoughts:
___________________________________________________________
OK, well, I'll break down roughly what ATi are doing, because although it's very clever it is also extremely questionable.
ATi Radeon 9-Series cards have 128bit Colour; all of them do.
Microsoft's minimum precision for Pixel Shader 1.x and 2.0 is 24bit FP; you can also use 32bit FP, and NVIDIA cards are capable of using 16bit half precision... however, when NVIDIA do it they're simply lowering the precision so calculations take less time.
Now, although visually the difference between 24bit and 32bit per pixel is negligible, at 16bit you get obvious, noticeable banding.
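To make the banding point concrete, here's a rough NumPy sketch. It has nothing to do with the actual hardware path; it just fakes each precision on a smooth ramp, and the 24bit line assumes (as far as I know correctly) that ATi's FP24 keeps a 16-bit mantissa:

```python
import numpy as np

def truncate_mantissa(x, keep_bits):
    """Crudely mimic a lower-precision float by zeroing low float32 mantissa bits."""
    drop = 23 - keep_bits                                # float32 carries a 23-bit mantissa
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & mask).view(np.float32)

# A smooth 0..1 ramp, standing in for a soft lighting gradient.
ramp = np.linspace(0.0, 1.0, 1 << 20, dtype=np.float32)

quantised = {
    "16bit (10-bit mantissa)": ramp.astype(np.float16).astype(np.float32),
    "24bit (16-bit mantissa)": truncate_mantissa(ramp, 16),
    "32bit (23-bit mantissa)": ramp,
}
for name, q in quantised.items():
    print(f"{name}: {len(np.unique(q))} distinct shades out of {ramp.size}")
```

The 16bit ramp collapses to far fewer distinct shades than the other two, which is exactly the banding you see on screen.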
Effectively, the Radeons work by using the lower 24bit precision pipeline in DirectX (they do in OpenGL as well, but there they can't actually achieve what they do with DirectX; I'll explain why later).
So firstly they've lightened their pipeline by roughly 8 bits.
However, 24bit calculation is actually slower than 32bit...
16bit calculations use a half-register, as you'll most probably know; to do 24bit you need 3/4 of a register, which means you end up using either a single full register or two half-registers for the calculation.
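A quick sketch of the fit, using ordinary CPU-side Python as a stand-in for the shader registers (the values are made up, it's only to show how the widths line up):

```python
import numpy as np

# Two half-precision values drop exactly into one 32-bit register-sized word.
pair = np.array([0.5, -2.0], dtype=np.float16)
word = pair.view(np.uint32)[0]
print(f"2 x 16bit -> one 32-bit word: 0x{word:08X}")

# A 24-bit quantity has no such fit: parked in a 32-bit word it only uses 3/4 of
# the register, and packing it densely instead means masking and shifting on
# every access.
value24 = 0xABCDEF
print(f"24-bit value in a 32-bit word: 0x{value24:08X}  (top byte wasted)")
```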
The extra encoding and decoding required actually makes the process slower than 32bit colour. So from step one ATi would appear to be setting up their colour calculations in a way that slows them down, right?
Wrong. What they're doing is accessing a lower palette.
24bit colour gives you around 16 million colours (2^24) to play with; it's easier and quicker to map a 16bit floating point value into that than into the roughly 4 billion (2^32) of 32bit.
So basically what goes on behind the scenes is they take a texture
-> convert it into 16bit half-precision floating point colour, run it through via a half-register, and then convert to 24bit FP colour when rendering and handing DirectX its data.
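In sketch form (again just a simulation of that chain, not anything pulled from a driver):

```python
import numpy as np

# Hypothetical texture channel values in 0..1.
texture = np.linspace(0.0, 1.0, 1001, dtype=np.float32)

as_fp16   = texture.astype(np.float16)     # squeeze into half precision first
presented = as_fp16.astype(np.float32)     # then hand it back at a wider precision

# Widening restores the storage size, not the lost detail: the rounding error
# introduced at the 16bit step is still there when DirectX gets the data.
print("largest error vs the original texture:", np.max(np.abs(presented - texture)))
```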
This keeps everything in your system relatively happy, and suddenly something which should be performing slower is actually performing faster.
If only this was actually the end of the 'optimization'...
Radeons go one step further by converting everything in their memory to 16bit; this is why they take longer even though their memory interface should run at almost exactly the same speed as an equivalent GeForce FX's.
This way the card tells DirectX it's doing 32bit colour, which lets it use the entire 128bit floating point range; DirectX operations are told they're using 24bit precision; and because the driver uploads the textures at 16bit precision, the card drops into half-register mode and believes that's what it is using.
Obviously, in order to trick DirectX into believing this you need to rework some aspects of it, such as texture access and pipeline access...
This is now 'underground' general knowledge simply because, when Half-Life 2 was stolen, the source included among other things a build for DirectX. That build happened to have comments laced through it from a guy (can't remember his name right now) who 'was' based at ATi Canada; I very much doubt he still has his job. Basically it was an outline explanation for the Valve staff on how to optimise their access of DirectX 8.1 and 9.0 specifically for what Radeon cards were doing.
Don't get me wrong, it is an extremely clever system to achieve a lot of speed without a lot of colour loss, and ATi are slowly compensating for that loss with each driver release using a form of 'digital vibrance'; kind of why Catalyst no longer has the colour compensation controls that ForceWare does.
I am still learning a lot about the Radeon pipeline, and I don't remember hearing anything about the integer pipeline. From what I've seen the integer pipelines actually aren't that far apart in speed, so my first guess would be that it is left relatively untouched, because you can't really achieve the same level of optimisation on it.
___________________________________________________________
My thoughts:
___________________________________________________________
The GeForce FX should be running FP16 quite a bit quicker; that was the whole point of the format. But even so, NVIDIA still use a full register for each floating point value;
ATi Radeons however don't... they use a half-register. And the reason they don't lose quality like the FX does is that they're calculating 16bit within a 24bit space; so unless you are using extreme ranges of light you won't notice the banding, because it isn't actually producing 16bit color output.
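A quick NumPy sketch of what I mean by extreme ranges: it just measures how far apart adjacent 16bit values sit at different brightnesses (nothing card-specific about it):

```python
import numpy as np

def fp16_step(value):
    """Gap between `value` and the next representable half-precision number above it."""
    v = np.array([value], dtype=np.float16)
    nxt = (v.view(np.uint16) + np.uint16(1)).view(np.float16)
    return float(nxt[0]) - float(v[0])

# Near 1.0 the half-float grid is fine; at big lighting values it gets coarse,
# which is where the banding starts to show.
for value in (1.0, 16.0, 1000.0):
    print(f"around {value:g} adjacent 16bit values differ by {fp16_step(value):g}")
```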
An easier way to understand all of this is to think about CPUs and how they work...
Particularly the CISC-Based X86 and RISC-Based PowerPC.
Now, as most will understand, the x86 works on the basis of registers that can be broken down for data.
8 -> 16 -> 32bit all within the same register; this allows the processor to double or even quadruple up information and process it all at once.
(Hence why they're CISC processors: complex instructions that can handle multiple operations at once.)
The PowerPC, however, has a 32bit register, and 32bit is what it will process.
This is exactly how the GeForce FX and Radeon 9-Series work: the FX treats its registers like the fixed-width PowerPC, while the Radeon splits them up like the x86.
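If it helps, here's the analogy in throwaway Python, with masks standing in for real register aliasing (the AL/AX/EAX names are only labels here):

```python
eax = 0x12345678              # a full 32-bit register

ax = eax & 0xFFFF             # its low 16 bits, addressable on their own (like AX)
ah = (eax >> 8) & 0xFF        # the next 8 bits (like AH)
al = eax & 0xFF               # the low 8 bits (like AL)
print(f"EAX=0x{eax:08X}  AX=0x{ax:04X}  AH=0x{ah:02X}  AL=0x{al:02X}")

# A fixed-width design has no such sub-views: however small the value, it still
# occupies, and is processed as, the full 32-bit register.
small_value = 0x7F
fixed_register = small_value & 0xFFFFFFFF
print(f"fixed 32-bit register: 0x{fixed_register:08X}")
```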

So basically, the Radeon gains such a large speed advantage over the FX partly through the architecture of the VPU, but also through drivers that adapt things to take advantage of that architecture.
Radeons can stack 2x 16bit operations within a texture pass; the GeForce can stack 1x 32bit operation within a texture pass. Both cards have a maximum of 16 passes per pipeline, and between 4 and 8 pipelines depending on the class of card.
This effectively allows the Radeons to process double the data in a single pass.
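A tiny NumPy sketch of the packing (made-up values, just to show the byte arithmetic):

```python
import numpy as np

n = 8  # a handful of shader values

halves = np.arange(n, dtype=np.float16)
fulls  = np.arange(n, dtype=np.float32)

print(f"{n} x 16bit values take {halves.nbytes} bytes")   # n * 2
print(f"{n} x 32bit values take {fulls.nbytes} bytes")    # n * 4

# Two half-precision values occupy exactly one 32-bit slot, so the same register
# and bandwidth budget moves twice as many of them per pass.
words = halves.view(np.uint32)
print(f"{n} x 16bit values fit in {words.size} x 32-bit words")
```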
FP24's speed depends entirely on what is going on, but it is slower... because the decoding for the format is done via the CPU. The CPU can use one register for 16bit and one register for 32bit, but it doesn't have a 24bit register, so it has to do 3x 8bit register operations.
Or it can widen everything to 32bit, but then the palette is larger and it has to compact it back down once finished. Everything to do with 24bit operations really is a waste, considering no one uses 24bit registers.
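For example, in throwaway CPU-side Python (nothing to do with real drivers, just the byte shuffling involved):

```python
import struct

buf = bytes([0x78, 0x56, 0x34, 0x12])   # four bytes sitting in memory

# 32bit: one native-sized read, straight into a register-sized value.
(value32,) = struct.unpack("<I", buf)
print(hex(value32))                      # 0x12345678

# 24bit: there is no native width for it, so it takes three byte reads plus
# shifts to rebuild a single value (the 3x 8bit operations mentioned above).
b0, b1, b2 = buf[0], buf[1], buf[2]
value24 = b0 | (b1 << 8) | (b2 << 16)
print(hex(value24))                      # 0x345678
```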
ATI's site clearly states that the Radeon has 128bit Color; to have that it requires a 128bit (4x 32bit) register for the color.
It is impossible to use 96bit (4x 24bit) registers and end up with 128bit Color. There's a document, which I'll have to find again, that specifically states ATi cards calculate using a 96bit register but output up to 128bit Color (which technically just means it is using a 128bit palette, not 128bit color).
Any way you look at these cards, a lot doesn't add up about them at face value. Especially if you look at the color modes they can achieve - 24bit / 32bit / 48bit / 64bit / 128bit - that makes no sense to me! ... if they really are using 24bit registers, why is 96bit, the one color mode that would actually be native to them, not in that list?!
I think if you do some tests you will find that they're not using 24bit like ATi claims.
It's using a 24bit Palette, yes; but not 24bit Color...
Very similar to what the PlayStation 2 does to get so much of its speed.
A good demo to show this off is actually the Light Accentuation DirectX 9 demo, as it perfectly shows the banding that occurs.
The GeForce FX has banding problems, but those are exclusive to the FX range; test it using PS 1.1 on a Radeon 9-Series against a GeForce 4 Ti, open the results in Photoshop, and you'll see the differences.
As I've said, I think it is an extremely clever method to achieve a lot more speed than the card should be pushing, especially as the color difference isn't really perceptible in real time, particularly in games.
(however Carmack has commented on the difference before which he found strange)
___________________________________________________________