What Makes Conversational AI Human-Like?

By Yujian Tang · 5 min read · September 27, 2024

How long do you wait after your friend finishes talking before you begin talking? Experts say just about 200 milliseconds. This holds not just for English, but across every language and culture researchers have surveyed.

We call this “utterance-to-utterance” time. In quick conversations with friends this gap is roughly 200 milliseconds, but in conversations requiring more thought it can stretch to 900 milliseconds. In rapid back-and-forth exchanges it can even drop to negative 1,000 milliseconds, because humans interrupt each other. If this sounds fast, think about your last really good conversation with a friend. Then think about the last time you chatted with ChatGPT.
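To make the metric concrete, here is a minimal sketch of how you might compute the utterance-to-utterance gap from turn timestamps. The function name and the timestamps are hypothetical, purely for illustration:

```python
# Utterance-to-utterance gap: time from the end of one speaker's turn
# to the start of the next speaker's turn. A negative value means the
# next speaker started before the previous one finished (an interruption).
# Timestamps are hypothetical, in milliseconds.

def utterance_gap_ms(prev_end_ms: int, next_start_ms: int) -> int:
    """Return the turn-taking gap in milliseconds."""
    return next_start_ms - prev_end_ms

# A quick friendly reply: a 200 ms gap.
print(utterance_gap_ms(3000, 3200))   # 200

# An interruption: the reply starts a full second before the turn ends.
print(utterance_gap_ms(5000, 4000))   # -1000
```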


The good news is that LLMs like ChatGPT can think pretty fast. The bad news is that the LLM chatting with you is only a small piece of the puzzle. To have a video AI chat with you, you also need: speech detection, transcription, text-to-speech, audio-to-video, sending the video frames, sending the audio, and computer vision to observe the background. So how do you put all of this together?

The Tavus Engineering team uses four pillars:

  • Streaming
  • Parallelism
  • Resource Management
  • Memory Management

Streaming

The main reason we stream the data is so we can get to the first utterance faster and therefore provide a more human-like experience. 

There are two ways to move data from one place to another: streaming and batch. Moving data in batches means moving more data at once; streaming lets us move smaller amounts of data at a time. With video data, this translates into moving data frame by frame.

The Tavus Conversational Video Interface renders at approximately 25 frames per second, so to stay ahead of the curve we need to be able to render well above 25 frames per second. When you stream data, there is not just the data itself but also per-message overhead. So as you send fewer frames at a time, and send them more frequently, that overhead starts to cost you latency again.

We’ve found that streaming a few frames at a time (three to five) gives us the best performance so far.
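The batch-size tradeoff above can be sketched with a toy model. The per-message overhead figure and the function names here are illustrative assumptions, not Tavus internals:

```python
# Sketch of frame batching for streaming. We assume a fixed per-message
# overhead; the 5 ms figure is hypothetical, chosen only to illustrate
# why tiny batches and huge batches both hurt.

FRAME_MS = 1000 / 25          # one frame of a 25 fps stream covers 40 ms
OVERHEAD_MS = 5               # assumed fixed cost per network message

def batches(frames, batch_size):
    """Yield frames in small batches instead of one big payload."""
    for i in range(0, len(frames), batch_size):
        yield frames[i:i + batch_size]

def overhead_share(batch_size: int) -> float:
    """Fraction of each message spent on overhead rather than video."""
    return OVERHEAD_MS / (OVERHEAD_MS + batch_size * FRAME_MS)

# Tiny batches pay the overhead often; huge batches delay the first frame.
for size in (1, 4, 25):
    print(size, round(overhead_share(size), 3))
```

A batch of one frame spends over 11% of each message on overhead, while a few frames per batch push that below 3%, without delaying the first frame the way a full one-second batch would.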

Parallelism

When it comes to a conversational video interface, many things happen at once, so many of them can be parallelized to increase speed. We can conceptualize how a conversational video interface works by separating it into 10 processes:

  • Capturing Streaming Audio/Video (input)
  • Speech Detection (input)
  • Audio (input)
  • Speech-to-text (input)
  • Visual Question Answering (input)
  • LLM as a Chatbot (input) and (output)
  • Sentence Endpoint (output)
  • Text-to-Speech (output)
  • Replica Inference (output)
  • Publish Streaming Audio/Video (output)

Since the conversation happens in real time, the input and output pieces can be parallelized, and some of the input itself can be parallelized as well. For example, visual question answering can happen at the same time as the speech input.
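Here is a minimal sketch of that idea using a thread pool. The stage functions are stand-ins with fake delays, not Tavus's real pipeline:

```python
# Sketch: run two independent input stages concurrently so their
# latencies overlap instead of adding up. The stage bodies are fakes.
from concurrent.futures import ThreadPoolExecutor
import time

def speech_to_text(audio: bytes) -> str:
    time.sleep(0.05)               # pretend transcription takes 50 ms
    return "transcript"

def visual_question_answering(frame: bytes) -> str:
    time.sleep(0.05)               # pretend VQA also takes 50 ms
    return "scene description"

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    # The stages consume different inputs, so they can run at once.
    t = pool.submit(speech_to_text, b"audio-bytes")
    v = pool.submit(visual_question_answering, b"frame-bytes")
    transcript, scene = t.result(), v.result()
elapsed = time.perf_counter() - start

# Total wall time is roughly 50 ms, not the 100 ms a sequential
# pipeline would take.
print(transcript, scene)
```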

Resource Management

When it comes to resource management, there are five resources we think about:

  • CPU
  • GPU
  • RAM
  • Disk
  • Network

The CPU and GPU both serve as compute, and both can be used for inference. RAM keeps the data you need across multiple computations. Disk holds your Replica data and other persistent data. The network passes information around. Each of these has to be managed and optimized for the best possible performance.
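One simple way to reason about managing these resources is a placement policy that routes inference to whichever compute has headroom. Everything here, the thresholds, field names, and the policy itself, is a hypothetical sketch, not how Tavus schedules work:

```python
# Toy resource-placement policy: prefer the GPU for inference, spill to
# the CPU when the GPU is saturated, and queue when both are busy.
# Thresholds and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Resources:
    cpu_util: float      # 0.0 - 1.0
    gpu_util: float      # 0.0 - 1.0
    free_ram_gb: float
    free_disk_gb: float
    net_mbps: float

def place_inference(r: Resources) -> str:
    """Pick a home for the next inference job."""
    if r.gpu_util < 0.8:
        return "gpu"
    if r.cpu_util < 0.8:
        return "cpu"
    return "queue"       # both saturated: wait rather than thrash

print(place_inference(Resources(0.3, 0.50, 16.0, 100.0, 900.0)))   # gpu
print(place_inference(Resources(0.3, 0.95, 16.0, 100.0, 900.0)))   # cpu
```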

Memory Management

When we first started building this, we started with the data stored in S3. However, moving data from S3 to memory is inefficient. When you’re trying to be under 1 second in latency, every millisecond counts.

The first thing we did was stop moving the data. Instead, we just moved the pointers to the data. This means much less data going around, and much faster transport. This move alone shaved 1.5 seconds off our latency. 
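The pointer-passing idea can be sketched in a few lines: producers register a large payload once in a shared store and pass only a small handle downstream. The dict-backed store here is an assumption for illustration; a real system might use shared memory or GPU buffers:

```python
# Sketch of "move the pointer, not the data": stages pass a small key
# while the large payload stays put in a shared in-process store.
import uuid

STORE: dict[str, bytes] = {}

def put(payload: bytes) -> str:
    """Stash the payload once; return a lightweight handle to it."""
    key = uuid.uuid4().hex
    STORE[key] = payload
    return key

def get(key: str) -> bytes:
    """Downstream stages dereference the handle instead of copying."""
    return STORE[key]

frame = b"\x00" * 1_000_000          # a ~1 MB video frame
handle = put(frame)                  # a 32-character key, not a megabyte
assert len(handle) == 32
assert get(handle) is frame          # same object: nothing was copied
```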

Remember: the closer your data is to the compute, the faster it can be acted on. Moving pointers optimized CPU memory, but what if we could move the data even closer, into GPU memory? That is exactly what we did, and we saw another 4x improvement in latency.

Summary of What Makes Conversational AI Human-Like

To summarize, we walked through four key techniques that the Tavus team used to create a human-like conversational video interface. We started with streaming data. Instead of sending data in large chunks, we broke it down to send it nearly frame by frame. 

Next, we looked at parallelism. What kinds of processes could we parallelize? We found 10 tasks going on behind the scenes to make real-time conversational AI work, and that we could parallelize both the input and output tasks. Third, we looked at resource management. We didn't dive deeply into how this was done, but we covered the five different resources we have to manage while building this kind of application.

Fourth, we looked at memory management, on both the CPU and GPU side. We saw that a CPU-side optimization, passing pointers instead of the data itself, gave us a 1.5-second speedup over our original setup. Then using GPU memory gave us a 4x speedup on top of that. Combining all of these techniques ultimately got us under 1 second of utterance-to-utterance latency.

Watch this in talk form from Brian Johnson here.

