News: Nvidia ACE Brings AI to Game Characters, Allows Lifelike Conversations

If their LLM is any good, I'd expect it's going to chew up a lot of VRAM. That's not good, when people are already complaining about insufficient VRAM in recent graphics card models.

The situational awareness is another aspect that really intrigues me. I'm guessing it's quite restricted in what sorts of things the NPC is aware of.
 
In game, why would I want to talk to a software routine?
Usually, you talk to NPCs to get information needed to play the game (e.g. find items, reveal story line, complete quests, etc.). AI could enable more free-form queries, rather than being limited to multiple-choice (which is sort of "cheating", in a way) or having to phrase things very particularly.

Also, the speech-to-text aspect is nice, since gaming with a headset means you can just speak it at them.
 
If their LLM is any good, I'd expect it's going to chew up a lot of VRAM. That's not good, when people are already complaining about insufficient VRAM in recent models.
The way Nvidia pitches it as "offering high-speed access to three components that already exist" makes me think this may be an AI-as-a-service thing rather than something intended to run locally.

If you want NPCs to feel relatively unique, you need to have dozens of AI models so you don't feel like you are running into the same AI every 3rd NPC you encounter. If each "personality" has an 8GB model like the smallest portable GPT variant, your games would end up having 100+GB of AI models if this stuff ran locally.
 
The way Nvidia pitches it as "offering high-speed access to three components that already exist" makes me think this may be an AI-as-a-service thing rather than something intended to run locally.
Yeah, that thought crossed my mind, but I thought maybe not, considering the scalability challenges of hosting server resources for it + the online dependency. I guess I'm old-school in my thinking about the latter, but still... the costs of centralized processing would seem to be an obstacle.

If you want NPCs to feel relatively unique, you need to have dozens of AI models so you don't feel like you are running into the same AI every 3rd NPC you encounter.
I'd imagine that can be controlled via implicit prompting. Since the vast majority of the LLM would be the same between different NPCs, you'd rather have a single instance of the model loaded and just prompt it to behave differently.
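Something like this, as a very rough sketch; the chat() stub and the example personas below are invented for illustration and have nothing to do with the actual ACE APIs:

```python
# One shared LLM instance, steered per NPC by a system prompt.
# chat() is a stand-in for whatever local or cloud model the game would use.
NPC_PERSONAS = {
    "jin": ("You are Jin, a weary ramen-shop owner. You only know about your shop, "
            "the crime wave in the district, and the one quest you can hand out."),
    "nova": ("You are Nova, a smug dockside mechanic. You only know about ship "
             "upgrades and local gossip."),
}

def chat(prompt: str) -> str:
    # Placeholder for the real model call.
    return "(model reply goes here)"

def npc_reply(npc_id: str, history: list[str], player_line: str) -> str:
    # Same weights for every NPC; only the prompt differs.
    prompt = NPC_PERSONAS[npc_id] + "\n" + "\n".join(history) + f"\nPlayer: {player_line}\nNPC:"
    return chat(prompt)

print(npc_reply("jin", [], "Heard anything about the crime wave?"))
```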
 
If the AI learns from scouring the internet:
PC: "ok, I'll go and see this drug lord"
NPC: "wait, PC. You must see the terminal by door for a better ship before you go. Try something at StarCitizen.Nvidia.DLC. Now is not the time to be cheap! You can get 1950coins for discount in the next 5 minutes! I used to be a detective like you, but then I took a bullet to the knee!"
 
One of my main pet peeves when playing a game with a storyline is that it's mostly canned, including all of the convos. It's not interactive.

The Nvidia ACE (Avatar Cloud Engine) piqued my interest in that respect. But watching the demo, it seems that it would be very expensive (in HW resources) to implement, and not much of an improvement over the canned stuff.

The main problem is that it still looks like a mannequin piping out the audio. The delivery is very mechanical, with no inflection. It's not enough that the mouth/face move in sync with the words. Normal people gesticulate when they talk; they emote. This doesn't feel real; I don't get any extra immersion from it.

But the extra "conversational personality" for NPCs (using an LLM) would be a win in making the storyline less canned, less on-rails. I think we can get that just with a text interface, like with ChatGPT, or perhaps with speech recognition so you can talk to NPCs w/o typing.
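As a toy sketch of the "talk to an NPC without typing" idea: this uses the generic SpeechRecognition Python package plus an invented generate_npc_reply() placeholder, and is not how ACE actually wires things up.

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

def generate_npc_reply(player_text: str) -> str:
    # Placeholder for whatever LLM backend the game would call.
    return f"(NPC considers '{player_text}' and answers in character)"

recognizer = sr.Recognizer()
with sr.Microphone() as mic:              # needs PyAudio for microphone access
    print("Say something to the NPC...")
    audio = recognizer.listen(mic)

try:
    player_text = recognizer.recognize_google(audio)  # free web recognizer, needs internet
    print("You said:", player_text)
    print("NPC:", generate_npc_reply(player_text))
except sr.UnknownValueError:
    print("NPC: Sorry, I didn't catch that.")
```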

BTW, for those who prefer text rather than trudge thru the keynote video, you can read all the doodads on the Nvidia blog.

 
>If their LLM is any good, I'd expect it's going to chew up a lot of VRAM. That's not good.

Not necessarily; not if the LLM is not on your PC but in the vendor's cloud. Look at it from the vendor's perspective. It's a good excuse to require an online connection. If nothing else, it will obviate the problem of piracy. And if the quality of dialog is good enough and the scope open-ended enough, a subscription wouldn't be unthinkable.

That's why I'll probably disable most of the "copilots" that Microsoft will push into future Windows iterations. The obvious caveat to their use would be that you would have to link your Win PC to a Microsoft acct 24/7.
 
I'd imagine that can be controlled via implicit prompting. Since the vast majority of the LLM would be the same between different NPCs, you'd rather have a single instance of the model loaded and just prompt it to behave differently.
Then your one model would need to be trained on all of those alternate traits you'd prompt for, plus variants of those traits for flavor, in both the LLM chat and the text-to-speech generator, and it would still end up substantially larger; definitely too large to fit in VRAM on top of graphics.

From what little I've read on the topic, high-quality generalized AI TTS systems like Microsoft's VALL-E use models ranging from 16GB to over 500GB in size, with 1TB planned. Good luck running the higher-quality variants locally.

Imagine needing 16-24GB of VRAM for graphics and another 200+GB for AI between high-quality TTS, GPT-like natural text prompting, personality, etc. models.
 
What I want from AI conversation tech: more interactions with companion characters without needing expensive additional voice acting that has to be hard-coded into the game.

What I will get from developers: NPCs I don't give a fk about describing what I just saw or did in the game before giving me the next fetch quest, and probably trying to sell me some new micro-transactions.

I would rather developers use the compute budget to give NPCs more animations, or the better physics that were promised five, ten, twenty years ago.
 
"Presumably, this would stop NPCs from answering inappropriate, off-topic prompts from users."

Darn!! I was just wondering (after watching the demo) if I could say something like "yo, Hiroshi, stop bellyaching about crime and make me a Sushi omelette with extra Miso, pronto!!"
 
In game, why would I want to talk to a software routine?
More generally, why do I want to talk to my computer? You can download a GPT so you can talk to documents on your PC... um... why?

The next feature Nvidia can make people want their cards for... The VRAM question is interesting.
It does sound like it runs in the cloud, so there's no way to play these games offline unless the game is massive... expect offline mode to be limited in speech patterns.

I don't need companions in games that talk about real-world events, etc., as games are meant to be an escape.
 
Usually, you talk to NPCs to get information needed to play the game (e.g. find items, reveal story line, complete quests, etc.). AI could enable more free-form queries, rather than being limited to multiple-choice (which is sort of "cheating", in a way) or having to phrase things very particularly.

Also, the speech-to-text aspect is nice, since gaming with a headset means you can just speak it at them.
The free conversation thing would be very nice.
It seems like a stretch that an AI routine would be able to stay in character, personality-wise, and only know what the character is supposed to know.
Badly written characters can ruin a game.
 
Badly written characters can ruin a game.
Then it's a good thing that with generative AI chatbots, there is no writing!

Keeping what each AI knows or learns partitioned between NPCs, game saves, players for multi-player games, etc. could be a challenge. You either end up with gigantic save files or having to retrain the starting AI on load to match the save state.

If you bake everything into the original AI and freeze it as-is, you have basically created regular scripted NPCs and are using AI to run the scripts instead of manually hard-coding them.
 
Then your one model would need to be trained on all of those alternate traits you'd prompt for, plus variants of those traits for flavor, in both the LLM chat and the text-to-speech generator, and it would still end up substantially larger; definitely too large to fit in VRAM on top of graphics.
So, you're an expert on deep learning, now?

Like I said, I think good LLMs are much more similar than they are different. It'd be more efficient just to make one versatile enough to handle the various characters via prompts, than to have distinct ones for each character.

From what little I've read on the topic, high-quality generalized AI TTS systems like Microsoft's VALL-E use models ranging from 16GB to over 500GB in size, with 1TB planned. Good luck running the higher-quality variants locally.
Well, guess what? The article didn't say they used Microsoft's VALL-E, it said they use their own Riva SDK. It does both Automatic Speech Recognition and Text-to-Speech. Furthermore, it's an "SDK for building and deploying fully customizable, real-time AI pipelines that deliver world-class accuracy in all clouds, on premises, at the edge, and on embedded devices." The release notes reference issues with fitting certain languages on an 8 GB embedded platform. Since those are unified memory devices, it's unclear whether they mean the model is larger than 8 GB or just that it won't fit whatever portion of it would be available to the GPU.

A quick web search is all you'd have had to do, if you wanted to actually have some relevant knowledge, instead of just BS'ing.

Similarly, Nvidia Omniverse Audio2Face lists a GPU with 8 GB as the system requirements. That's presumably for film & video production-quality results. Perhaps their model for games could be much smaller.

Imagine needing 16-24GB of VRAM for graphics and another 200+GB for AI between high-quality TTS, GPT-like natural text prompting, personality, etc. models.
You don't need the degree of encyclopedic knowledge that ChatGPT has, so I think your estimate is off by at least an order of magnitude.
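As a rough back-of-the-envelope check (weights only, ignoring activations and KV cache, and saying nothing about whatever model ACE actually uses):

```python
# Approximate LLM weight footprint: parameter count x bytes per parameter.
def weight_size_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * (bits_per_param / 8) / 1024**3

for params, bits in [(7, 16), (7, 4), (70, 16), (70, 4)]:
    print(f"{params}B params @ {bits}-bit ~= {weight_size_gb(params, bits):.1f} GB")
# A 7B-parameter model is ~13 GB at fp16 but ~3.3 GB with 4-bit quantization;
# you only get into the hundreds of GB with ChatGPT-scale models.
```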

Keeping what each AI knows or learns partitioned between NPCs, game saves, players for multi-player games, etc. could be a challenge. You either end up with gigantic save files or having to retrain the starting AI on load to match the save state.
You don't have to persist the entire state of the transformer - just the sequence of prompts which brought it to that state. Much, much smaller.
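A minimal sketch of that, assuming the save file just carries each NPC's dialogue history as text (the file layout and NPC ids are made up for illustration):

```python
import json

# Persist only the per-NPC prompt/dialogue history, not any model weights or
# activations; on load, replay it as context for the NPC's next reply.
def save_npc_memory(path: str, npc_histories: dict[str, list[str]]) -> None:
    with open(path, "w") as f:
        json.dump(npc_histories, f)  # a few KB of text, not gigabytes

def load_npc_memory(path: str) -> dict[str, list[str]]:
    with open(path) as f:
        return json.load(f)

save_npc_memory("slot1_npcs.json", {"jin": ["Player: Any news?", "Jin: Crime is up again."]})
print(load_npc_memory("slot1_npcs.json"))
```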
 
Well, guess what? The article didn't say they used Microsoft's VALL-E, it said they use their own Riva SDK. It does both Automatic Speech Recognition and Text-to-Speech. Furthermore, it's an "SDK for building and deploying fully customizable, real-time AI pipelines that deliver world-class accuracy in all clouds, on premises, at the edge, and on embedded devices." The release notes reference issues with fitting certain languages on an 8 GB embedded platform. Since those are unified memory devices, it's unclear whether they mean the model is larger than 8 GB or just that it won't fit whatever portion of it would be available to the GPU.
The VALL-E article I read said there were massive quality improvements going from 16GB to 160GB to 570GB model sizes. I'd imagine an 8GB model would have pretty crappy TTS quality and flexibility, the kind of AI voices that immediately give themselves away.

If I was to play games with entirely AI-generated voices, I'd want them to be nearly indistinguishable from voice actors, which means high-ish quality models would be required.

BTW,
AI pipelines that deliver world-class accuracy in all clouds
sounds very cloud-based. I doubt the 8GB embedded version is for anything that goes much beyond a fancy IVR.
 
All this means is more of the buggy, cash-grab, unfinished, overpriced trash games we've been getting this year. Need I mention which games? No, everyone knows.
 
The VALL-E article I read said there were massive quality improvements going from 16GB to 160GB to 570GB model sizes. I'd imagine an 8GB model would have pretty crappy TTS quality and flexibility, the kind of AI voices that immediately give themselves away.
The link I posted has a sample web app that you can use to try it, right from your web browser. The sample I tried sounds natural enough, but I wouldn't have trouble distinguishing it from a professional announcer reading the same text.

If I was to play games with entirely AI-generated voices, I'd want them to be nearly indistinguishable from voice actors, which means high-ish quality models would be required.
I think voice actors in games can be pretty cringe. I wouldn't compare this with the best of the best, but with mid-quality human voice actors and below. If you just use it in those contexts, it would probably be an improvement.

Anyway, tech generally improves. So, let's see how it matures. I think it would be silly to make sweeping pronouncements about it, so soon. AI is a rapidly-evolving field, not least text-to-speech and "deep fake" audio.

sounds very cloud-based. I doubt the 8GB embedded version is for anything that goes much beyond a fancy IVR.
Where does it say there are different models for cloud vs. embedded?