How long do you wait after your friend finishes talking before you begin talking? Experts say just about 200 milliseconds. This is true not just for English, but across all languages and cultures that have been surveyed.
We call this “utterance-to-utterance” time. In quick conversations with friends it is roughly 200 milliseconds, but in conversations requiring more thought it can stretch to 900 milliseconds. In conversations with a lot of rapid back and forth, it can even go negative, down to around -1,000 milliseconds, because humans interrupt each other. If this sounds fast to you, think about your last really good conversation with a friend. Then think about the last time you chatted with ChatGPT.
The good news is that LLMs like ChatGPT can think pretty fast. The bad news is that having the LLM chat with you is only a small piece of the puzzle. To have a video AI chat with you, you also need: speech detection, transcription, text-to-speech, audio-to-video, sending the video frames, sending the audio, and computer vision to observe the background. So, how do you put all of this together?
The Tavus Engineering team uses four pillars:
- Streaming
- Parallelism
- Resource Management
- Memory Management
Streaming
The main reason we stream the data is so we can get to the first utterance faster and therefore provide a more human-like experience.
There are two ways to move data from one place to another: streaming and batch. Moving data in batches means moving more data at once; streaming lets us move smaller amounts of data more frequently. With video data, this translates into moving data in frames.
The Tavus Conversational Video Interface renders at approximately 25 frames per second, so to stay ahead of the curve we need to be able to render well above 25 frames per second. Streaming isn’t free, though: on top of the data itself, every send carries overhead. So as you stream fewer frames at a time and send them more frequently, you once again run into latency issues.
We’ve found that streaming a few frames at a time (three to five) gives us the best performance so far.
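To make the idea concrete, here is a minimal Python sketch of chunked streaming under our own naming assumptions: `Frame`, `render_frames`, and `transport` are placeholders, not Tavus APIs. The point is simply that grouping a handful of frames per send amortizes the per-message overhead without making the receiver wait long.

```python
from typing import Iterable, Iterator, List

# Hypothetical alias for a single rendered video frame's bytes.
Frame = bytes

def stream_in_chunks(frames: Iterable[Frame], chunk_size: int = 4) -> Iterator[List[Frame]]:
    """Group a frame stream into small chunks (three to five works well)
    so each network send amortizes its overhead without adding a long wait."""
    chunk: List[Frame] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk          # ship the chunk as soon as it is full
            chunk = []
    if chunk:
        yield chunk              # flush any trailing frames at the end

# Usage sketch: send each small chunk as soon as it is ready instead of
# buffering one large batch (`render_frames` and `transport` are placeholders).
# for chunk in stream_in_chunks(render_frames(), chunk_size=4):
#     transport.send(chunk)
```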
Parallelism
When it comes to a conversational video interface, there are many things happening at once, so many of them can be parallelized to increase speed. As you can see from the image above, we can begin to conceptualize how the conversational video interface works by separating it into 10 processes:
- Capturing Streaming Audio/Video (input)
- Speech Detection (input)
- Audio (input)
- Speech-to-text (input)
- Visual Question Answering (input)
- LLM as a Chatbot (input) and (output)
- Sentence Endpoint (output)
- Text-to-Speech (output)
- Replica Inference (output)
- Publish Streaming Audio/Video (output)
Since the conversation is happening in real time, the input and output pieces can be parallelized, and some of the input itself can even be parallelized. As you can see from the image, visual question answering can happen at the same time as the speech input.
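As a rough illustration of that input-side parallelism, the sketch below uses Python’s asyncio to run a speech-to-text step and a visual question answering step concurrently, so the slower of the two sets the latency rather than their sum. Both coroutines are stand-ins that simulate model latency; they are not the real Tavus services.

```python
import asyncio

async def speech_to_text(audio_chunk: bytes) -> str:
    await asyncio.sleep(0.05)                # placeholder for real model latency
    return "transcribed text"

async def visual_question_answering(video_frame: bytes) -> str:
    await asyncio.sleep(0.05)                # placeholder for real model latency
    return "description of what the camera sees"

async def process_input(audio_chunk: bytes, video_frame: bytes) -> tuple[str, str]:
    # Run both input-side tasks at the same time instead of one after the other.
    transcript, scene = await asyncio.gather(
        speech_to_text(audio_chunk),
        visual_question_answering(video_frame),
    )
    return transcript, scene

if __name__ == "__main__":
    print(asyncio.run(process_input(b"audio", b"frame")))
```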
Resource Management
When it comes to resource management, there’s five resources we think about:
- CPU
- GPU
- RAM
- Disk
- Network
The CPU and GPU both serve as compute, and both can be used for inference. RAM keeps the data you need across multiple computations. Disk stores your Replica data and other data. The network passes information around. Each of these has to be managed and optimized for the best possible performance.
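As one small, hypothetical example of a resource-management decision, the sketch below (using PyTorch purely for illustration; the model is a stand-in) routes inference to the GPU when one is available and falls back to the CPU, keeping inputs on the same device as the model so compute and data agree.

```python
import torch

# Prefer the GPU for inference when it exists; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(512, 512).to(device).eval()   # stand-in for a real model

def infer(features: torch.Tensor) -> torch.Tensor:
    # Move inputs to the model's device, and skip gradient tracking to save memory.
    with torch.no_grad():
        return model(features.to(device))

print(infer(torch.randn(1, 512)).shape)
```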
Memory Management
When we first started building this, the data was stored in S3. However, moving data from S3 into memory is inefficient, and when you’re trying to stay under 1 second of latency, every millisecond counts.
The first thing we did was stop moving the data itself. Instead, we moved only pointers to the data. This means much less data going around and much faster transport. This change alone shaved 1.5 seconds off our latency.
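A minimal way to picture this, assuming Python’s standard `multiprocessing.shared_memory` rather than whatever transport Tavus actually uses: write the frame bytes into a shared block once, and pass only the block’s short name (the “pointer”) downstream.

```python
from multiprocessing import shared_memory

def produce(frame_bytes: bytes) -> shared_memory.SharedMemory:
    """Write the frame into shared memory once; only shm.name needs to travel."""
    shm = shared_memory.SharedMemory(create=True, size=len(frame_bytes))
    shm.buf[:len(frame_bytes)] = frame_bytes
    return shm

def consume(name: str, size: int) -> bytes:
    """Attach to the block by name; no copy of the payload crossed the pipeline."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:size])
    finally:
        shm.close()

frame = b"\x00" * 1024                              # stand-in for a frame's bytes
block = produce(frame)
assert consume(block.name, len(frame)) == frame     # only the short name was "sent"
block.close()
block.unlink()                                      # release the block when done
```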
Remember, the closer your data sits to the compute that needs it, the faster it can be acted on. Moving pointers was a CPU-memory optimization, but what if we could move the data even closer? What if we could move it into GPU memory? That’s exactly what we did, and we saw another 4x improvement in latency.
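Here is a hedged PyTorch sketch of what GPU residency can look like: load the replica assets once at startup, move them onto the GPU, and keep the per-frame path free of disk reads and host-to-device copies. `load_replica_frames` is a hypothetical placeholder for however the assets are actually read.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_replica_frames() -> torch.Tensor:
    # Placeholder: random data standing in for decoded replica frames.
    return torch.randn(100, 3, 256, 256)

# One-time transfer at startup; the tensor stays resident on the GPU.
replica_frames = load_replica_frames().to(device)

def render_step(frame_index: int) -> torch.Tensor:
    # The per-frame path only indexes GPU-resident memory; no disk read or
    # host-to-device copy sits on the latency-critical path.
    return replica_frames[frame_index]
```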
Summary of What Makes Conversational AI Human-Like
To summarize, we walked through four key techniques that the Tavus team used to create a human-like conversational video interface. We started with streaming data. Instead of sending data in large chunks, we broke it down to send it nearly frame by frame.
Next, we looked at parallelism. What kinds of processes could we parallelize? We found that there were 10 tasks going on behind the scenes to make real-time conversational AI work, and that we could parallelize both the input and output tasks. Third, we looked at resource management. We didn’t dive deeply into how this was done, but we covered the five different resources we have to manage while building this kind of application.
Fourth, we looked at memory management, on both the CPU and GPU sides. We saw that the CPU-side optimization of passing pointers instead of the data itself gave us a 1.5-second speedup over our original setup, and moving data into GPU memory gave us a 4x speedup on top of that. Combining all of these techniques ultimately got us under 1 second of utterance-to-utterance latency.
Watch this in talk form from Brian Johnson here.