News Nvidia ACE Brings AI to Game Characters, Allows Lifelike Conversations

Anyway, tech generally improves. So, let's see how it matures. I think it would be silly to make sweeping pronouncements about it, so soon. AI is a rapidly-evolving field, not least text-to-speech and "deep fake" audio.
I'm not disputing the possibility that it will improve, only questioning the size of the AI model required to produce a diverse set of fully fleshed-out TTS voices, using Microsoft's as a reference for how such models can run to hundreds of GBs.

Just now, I accidentally clicked an AI dub on YT using the Honest Movie Trailer guy's voice and went: "WTF? He does news now?" for a second until I remembered how good some AI voices are these days. I doubt that level of output quality comes cheap.
 
I'm not disputing the possibility that it will improve, only questioning the size of the AI model required to produce a diverse set of fully fleshed-out TTS voices, using Microsoft's as a reference for how such models can run to hundreds of GBs.
When I started dabbling with object detection in 2016, we used a model that was a couple hundred MBs and barely managed 2 streams of video on an Intel iGPU. Five years later, we deployed an optimized model from Intel, using their OpenVINO framework, and it managed 8 streams of video on the same iGPU, with much better accuracy and a model less than 1/10th the size. It just goes to show how much potential exists for optimizing these sorts of things.
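For what it's worth, the deployment side of that looks roughly like the sketch below (the model file name, input shape, and performance hint are placeholders for illustration, not the actual pipeline we shipped):

```python
# Minimal sketch of running an optimized IR model on an Intel iGPU with OpenVINO.
# "detector.xml" and the 416x416 input shape are illustrative placeholders.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("detector.xml")        # optimized IR produced by OpenVINO's converter
compiled = core.compile_model(model, "GPU",    # "GPU" here means the integrated GPU plugin
                              config={"PERFORMANCE_HINT": "THROUGHPUT"})  # favour many parallel streams

frame = np.zeros((1, 3, 416, 416), dtype=np.float32)  # one preprocessed frame from a video stream
results = compiled([frame])                            # mapping of model outputs to numpy arrays
```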

When tackling a new problem domain, the first step is always to improve accuracy at the expense of all else. Only when the accuracy problem has been "solved" is there usually much focus on tuning and optimization. I also have to wonder whether that MS group is sufficiently incentivized and experienced in scaling down their solutions. Researchers are typically focused on innovation, while engineers are the ones who do most of the performance tuning and optimization. My knowledge of MS is dated, but most of the innovation typically happens in MS Research, which is a completely separate organization from the product development parts of the company, and mostly staffed by post docs and profs.

Perhaps its developers tried training smaller models on the same data, but that might've been a perfunctory exercise, or at least secondary to their quest to find the best-quality solution. Without knowing much more about that group, I wouldn't read much into what they say about scaling. I also don't know that they're even the foremost experts on TTS.
 
When I started dabbling with object detection in 2016, we used a model that was a couple hundred MBs and barely managed 2 streams of video on an Intel iGPU. Five years later, we deployed an optimized model from Intel, using their OpenVINO framework, and it managed 8 streams of video on the same iGPU, with much better accuracy and a model less than 1/10th the size. It just goes to show how much potential exists for optimizing these sorts of things.
If you make the model 10X smaller, compute requirement goes down ~10X too, not much of a surprise there. 1/20th of 1TB for a nearly perfect generalized TTS model would still be 50GB.
 
If you make the model 10X smaller, compute requirement goes down ~10X too, not much of a surprise there. 1/20th of 1TB for a nearly perfect generalized TTS model would still be 50GB.
A problem with your analogy is that the model we started with (in 2016) was already one of the smaller models available for the task. If you were to compare it with the largest/best model from 2016, then there would be closer to a 100x reduction, yet the new model still offered better accuracy.
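Just to put the thread's numbers side by side (the ~1 TB figure and the reduction factors are the ones quoted above, not measurements of any real model):

```python
# Size of a hypothetical ~1 TB "nearly perfect" TTS model after various reductions
# (10x/20x are from the post above, 100x from my 2016 detection comparison).
reference_size_gb = 1000
for reduction in (10, 20, 100):
    print(f"{reduction:>3}x smaller -> ~{reference_size_gb / reduction:.0f} GB")
# 10x -> 100 GB, 20x -> 50 GB, 100x -> 10 GB
```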

BTW, I wouldn't claim my experience exactly maps to whatever MS is doing. My point was really just to illustrate why it's worth being a bit humble in the assumptions you make about this stuff. Only an expert practitioner who's familiar with their research would be qualified to make any sort of estimate about how much potential might exist for optimizing their models.

That said, it's well worth noting that models don't only get bigger, and that sometimes better accuracy and smaller size aren't mutually exclusive.
 
That said, it's well worth noting that models don't only get bigger, and that sometimes better accuracy and smaller size aren't mutually exclusive.
It depends on what you want to achieve. If all you want is an AI that can recognize a small set of things in a controlled environment, you can optimize the heck out of it. If you want a generalized AI that can recognize thousands of different objects in any environment, from any perspective, at any reasonable distance for the input resolution, and in any lighting, the model will inevitably get much larger even after optimization.

If you want to replace VAs with AI-generated voices that won't give you an urge to ALT-F4 and downvote the crap out of the game like Nvidia's demo does, you need a model that can generate dozens of unique voices with human-like quirks, context-appropriate pacing, tone, etc., and the ability to go anywhere from whispers to shouting in every voice that calls for it. A model that can convincingly and consistently achieve all of that won't be small.
 
It depends on what you want to achieve. If all you want is an AI that can recognize a small set of things in a controlled environment, you can optimize the heck out of it.
Again, how do you know? How many models have you trained? How many academic papers on deep learning have you even read in your entire life? Have you ever implemented one from first principles?

I think you have no idea of their networks' architecture, how they operate, and therefore how much room for optimization might exist.

Don't BS about stuff you know so little about. That has a high potential for spreading misinformation.

If you want to replace VAs with AI-generated voices that won't give you an urge to ALT-F4 and downvote the crap out of the game like Nvidia's demo does, you need a model that can generate dozens of unique voices with human-like quirks, context-appropriate pacing, tone, etc., and the ability to go anywhere from whispers to shouting in every voice that calls for it. A model that can convincingly and consistently achieve all of that won't be small.
By putting all of that complexity into the TTS model, you're taking a very narrow view of the problem. The TTS model needs to be able to make a voice sound stressed, excited, angry, hesitant, etc. - but, it doesn't need to be the thing which decides when to do it. It doesn't even need to be the thing which decides where to inject pauses or stammers.
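To make that division of labor concrete, here's a rough sketch. The markup is plain SSML-style tagging, and the parsing stub just stands in for whatever engine would actually render it; none of this is a real ACE or vendor API.

```python
# Sketch of the split: an upstream layer (writer, dialogue system, or LLM) decides *when*
# a line is hesitant or urgent and encodes that as markup; the TTS engine only executes it.
# plan_synthesis() is a stand-in that just shows what a synthesizer would be handed.
import xml.etree.ElementTree as ET

line = """<speak>
  <prosody rate="slow" pitch="low">I... I wasn't expecting anyone down here.</prosody>
  <break time="400ms"/>
  <prosody rate="fast" volume="loud">Wait, behind you!</prosody>
</speak>"""

def plan_synthesis(ssml: str, voice: str) -> None:
    """Turn marked-up dialogue into per-segment synthesis requests."""
    for node in ET.fromstring(ssml):
        if node.tag == "break":
            print(f"[{voice}] pause {node.get('time')}")
        elif node.tag == "prosody":
            print(f"[{voice}] say {node.text.strip()!r} with {dict(node.attrib)}")

plan_synthesis(line, voice="guard_03")   # "guard_03" is an illustrative voice id
```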

Given how long you've been involved in technology, you don't seem to be allowing much room for improvement. Their demo is alpha-grade. It's a tech demo - like the equivalent of a concept car. I don't expect this tech to be integrated into mainstream AAA titles for probably 3-5 years. Maybe more. A heck of a lot can change in that time.
 
By putting all of that complexity into the TTS model, you're taking a very narrow view of the problem. The TTS model needs to be able to make a voice sound stressed, excited, angry, hesitant, etc. - but, it doesn't need to be the thing which decides when to do it. It doesn't even need to be the thing which decides where to inject pauses or stammers.
While the TTS model may not need to decide when to do what, it does need to know how to naturally blend it all together under all circumstances, including mid-word mood transitions (various degrees of surprise, from pleasant to panic/horror), across a broad enough voice catalogue, with enough variations and blends to avoid immersion-breaking obvious reuse and mechanical transitions that remind you the speech was computer-generated.

It seems to me like you are the one greatly underestimating the complexity of TTS sufficiently advanced and diverse to send mid-grade VAs into retirement, at least as far as models small enough to pack into a game and run on a consumer-grade GPU alongside game graphics are concerned.
 
While the TTS model may not need to decide when to do what, it does need to know how to naturally blend it all together under all circumstances, including mid-word mood transitions (various degrees of surprise, from pleasant to panic/horror), across a broad enough voice catalogue, with enough variations and blends to avoid immersion-breaking obvious reuse and mechanical transitions that remind you the speech was computer-generated.

It seems to me like you are the one greatly underestimating the complexity of TTS sufficiently advanced and diverse to send mid-grade VAs into retirement, at least as far as models small enough to pack into a game and run on a consumer-grade GPU alongside game graphics are concerned.
You can describe vocal transitions in the input stream. Playwrights do that sort of thing, all the time.

As for my thinking about model complexity, it's entirely possible that I'm being too complacent about the challenges of doing it well. However, what I see, and what you perhaps don't appreciate, is the power of higher-order features, which is something deep learning does very well. That avoids the need for a network with combinatorial complexity to deal with all the various cases, characters, and scenarios.

For instance, ChatGPT isn't just a jumble of words. It actually models higher-order concepts. The same can be true for a TTS model. Once it knows how to do a vocal fry, it doesn't need a separate representation of how to do that for each voice it can model. There would be some additional complexity needed for it to adapt the quirk to its various voices, but there's real economy in having one model that can do multiple voices.
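A toy version of what I mean, just to show the shape of it (the architecture and sizes are made up for illustration, not how Microsoft's or Nvidia's models are actually built):

```python
# Toy sketch of shared higher-order features: one decoder learns style effects
# (vocal fry, whisper, shouting) once, and a small per-voice embedding tells it
# whose timbre to apply them to. Dimensions are illustrative only.
import torch
import torch.nn as nn

class MultiVoiceTTS(nn.Module):
    def __init__(self, n_voices=64, n_styles=16, text_dim=256, mel_bins=80):
        super().__init__()
        self.voice_emb = nn.Embedding(n_voices, 64)   # per-voice timbre vector
        self.style_emb = nn.Embedding(n_styles, 32)   # style vectors shared by every voice
        self.decoder = nn.GRU(text_dim + 64 + 32, 512, batch_first=True)  # one decoder for all voices
        self.to_mel = nn.Linear(512, mel_bins)

    def forward(self, text_feats, voice_id, style_id):
        b, t, _ = text_feats.shape                     # (batch, time, text_dim) phoneme/linguistic features
        cond = torch.cat([self.voice_emb(voice_id), self.style_emb(style_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(b, t, -1)      # broadcast conditioning over time
        hidden, _ = self.decoder(torch.cat([text_feats, cond], dim=-1))
        return self.to_mel(hidden)                     # mel frames for a vocoder to turn into audio

model = MultiVoiceTTS()
mel = model(torch.randn(1, 100, 256), torch.tensor([3]), torch.tensor([7]))  # voice 3, style 7
print(mel.shape)                                       # torch.Size([1, 100, 80])
```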
 
You can describe vocal transitions in the input stream. Playwrights do that sort of thing, all the time.
Description and execution/delivery are two considerably different things, which is why some lines can take dozens of takes before everyone is either satisfied with how it came through, settles for good enough, rearranges the line into something that better suits the actor's delivery, or gives up.
 
Description and execution/delivery are two considerably different things, which is why some lines can take dozens of takes before everyone is either satisfied with how it came through, settles for good enough, rearranges the line into something that better suits the actor's delivery, or gives up.
Training and rehearsal are two sides of the same coin.
 
Might get better, but right now that was bad. It sounded like a 1980s Speak & Spell, with all the emotion of an Action Man doll. They need to learn the old cartoon-animation trick where you have to "over-react" to things in terms of body language to make it interesting.