55 Watts, probably because they aren't compute bound but limited by DRAM bandwidth. My RTX 4090 stops its fans when running any model that won't fit inside 24GB, because there is so little to do but wait.
I try to simplify things by converting a token to a syllable and making words 2-4 syllables (I'm German, so my words are longer).
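Back-of-the-envelope (a sketch with assumed numbers, not measurements): a dense model has to stream essentially all of its weights for every generated token, so the token rate is roughly memory bandwidth divided by model size, and the word rate is that divided by about three tokens per word.

    // All figures below are assumptions for illustration, not measurements.
    #include <cstdio>

    int main() {
        double model_bytes     = 13e9;  // assumed: ~13B params at ~8 bit, spilling into system RAM
        double dram_bandwidth  = 50e9;  // assumed: ~50 GB/s dual-channel system DRAM
        double tokens_per_word = 3.0;   // ~1 token per syllable, 2-4 syllables per word

        double tokens_per_s = dram_bandwidth / model_bytes;  // memory-bound ceiling
        double words_per_s  = tokens_per_s / tokens_per_word;
        printf("~%.1f tokens/s, ~%.1f words/s\n", tokens_per_s, words_per_s);
        return 0;
    }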
A little more than one word per second? Try that on your wife and see what that does to her patience!
As a benchmark it's perhaps not bad, but in practical terms I'd consider it unusable.
Have a closer look at NVLink: not all versions and variants are created equal. For the 3090 it's a little over 100 GByte/s, pretty much DRAM speed these days, and two cards max for anything PC-class.
CUDA code is designed to exploit the terabytes-per-second aggregate bandwidth of massive register files. VRAM access is already falling off a cliff by comparison, so much so that common subexpression elimination, a staple of compiler optimization on CPUs, is essentially reversed: keeping the shared value around is often slower than recomputing it inside registers.
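To make that concrete, here is a toy kernel (made up for illustration, not from any real codebase): the repeated sinf(x)*cosf(x) is simply recomputed at its second use instead of being hoisted into a long-lived temporary, because a few extra ALU instructions are cheaper than the register pressure that could spill the value to local memory in VRAM. Whether nvcc actually makes that trade depends on the real kernel, architecture and flags.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void recompute_kernel(const float* __restrict__ in,
                                     float* __restrict__ out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = in[i];
        float a = sinf(x) * cosf(x) + 1.0f;  // first use of the subexpression
        // ... imagine a long, register-hungry stretch of unrelated work here ...
        float b = sinf(x) * cosf(x) - 1.0f;  // second use: recomputed, not kept live
        out[i] = a * b;
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in,  n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = i * 0.001f;

        recompute_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[42] = %f\n", out[42]);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }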
And even with the latest and greatest NVLink switches (7200 GByte/s on Hopper), that's not counting latency.
HPC and AI hardware is a little more complex than just putting Lego bricks together. And yeah, I hoped it was much simpler, too. But then I had the opportunity to test things and dug a bit deeper.
And now I understand better why prices are the way they are.