It depends on what you want to achieve. If all you want is an AI that can recognize a small set of things in a controlled environment, you can optimize the heck out of it.
Again, how do you know? How many models have you trained? How many academic papers on deep learning have you even read in your entire life? Did you ever implement one from first principles? I think you have no idea of their networks' architecture, how they operate, and therefore how much room for optimization might exist.
Don't BS about stuff you know so little about. That has a high potential for spreading misinformation.
If you want to replace VAs with AI-generated voices that don't give you an urge to ALT-F4 and downvote the crap out of the game the way Nvidia's demo does, you need a model that can generate dozens of unique voices with human-like quirks, context-appropriate pacing and tone, and the ability to go anywhere from a whisper to a shout in any voice that calls for it. A model that can convincingly and consistently achieve all of that won't be small.
By putting all of that complexity into the TTS model, you're taking a very narrow view of the problem. The TTS model needs to be able to make a voice sound stressed, excited, angry, hesitant, etc., but it doesn't need to be the thing which decides when to do it. It doesn't even need to be the thing which decides where to inject pauses or stammers.
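To make the split concrete, here's a rough sketch in Python. Every name in it (DialogueDirector-style logic, the emotion tags, the fake render() call) is hypothetical and not any real engine's or TTS library's API - it's just to show a dialogue/director layer deciding *what* the delivery should be, and a dumb TTS layer only having to realize it:

```python
from dataclasses import dataclass

# Hypothetical performance cue decided *outside* the TTS model,
# e.g. by the game's dialogue system reacting to world state.
@dataclass
class Cue:
    text: str            # the raw line from the script
    emotion: str         # "stressed", "angry", "whisper", ...
    pauses: list[int]    # character offsets where a beat/pause goes
    voice_id: str        # which of the game's voices renders it

def direct_line(raw_line: str, world_state: dict) -> Cue:
    """Toy 'director': picks delivery based on game context, not the TTS model."""
    emotion = "stressed" if world_state.get("in_combat") else "neutral"
    pauses = [len(raw_line) // 2] if emotion == "stressed" else []
    return Cue(text=raw_line, emotion=emotion, pauses=pauses, voice_id="guard_03")

def render(cue: Cue) -> bytes:
    """Stand-in for the actual TTS call: it only realizes the cue, never decides it."""
    annotated = f"[{cue.emotion}] {cue.text}"   # however the real model takes hints
    return annotated.encode()                   # placeholder for synthesized audio

audio = render(direct_line("Drop the weapon, now!", {"in_combat": True}))
```

In that layout the TTS model only has to be good at rendering a tagged line, which is a much smaller problem than also understanding the scene.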
Given how long you've been involved in technology, you don't seem to be allowing much room for improvement. Their demo is alpha-grade. It's a tech demo - the equivalent of a concept car. I don't expect this tech to be integrated into mainstream AAA titles for probably 3-5 years, maybe more. A heck of a lot can change in that time.