
Phoenix-2: Advanced Techniques in Talking Head Generation — 3D Gaussian Splatting

Christian Safka & Keyu Che
July 24, 2024

This paper covers the past, present, and future of the talking-head generation research field. Specifically, we dive deep into trending 3D scene representations (NeRF → 3DGS) and the benefits of employing 3DGS in avatar applications.

Outline

  1. Development history of talking-head model structure
  2. Tradeoffs in architectures with pipeline examples
  3. Technical deep dive of 3DGS model
  4. From NeRF to 3DGS, why did we switch?
  5. What’s next?

Development history of talking-head model structure

Talking Head model architectures have varied significantly over recent years, from fully two-dimensional approaches utilizing Generative Adversarial Networks (GANs) [0], to 3D rendering pipelines such as audio-driven Neural Radiance Fields (NeRFs) [1] or 3D Gaussian Splatting (3DGS) [2].
Traditional GAN models leveraged large datasets of facial images to produce realistic facial animations, but often struggled with temporal consistency and coherence across longer sequences.

Tradeoffs in architectures with pipeline examples

The transition from image GANs to NeRFs has brought notable improvements in training time, render speed, and video quality. GANs by nature require vast datasets and expensive compute for training, and often result in slower inference and lower video quality due to their purely two-dimensional nature and temporal-consistency issues. By using 3D intermediates, we are able to take advantage of fast rendering techniques that exceed 100 FPS, as well as a higher degree of controllability and generalizability thanks to physics-aware constraints around expression animation. The figures below illustrate the difference between 2D and 3D talking-head pipelines.

Figure 1. Inference pipeline example for 2D talking head generation (Image taken from [3])
Figure 2. NeRF-based talking head video generation pipeline example. (Image taken from [1])

Technical Explanation of 3DGS Model

3D Gaussian Splatting is a cutting-edge rasterization technique used in the field of 3D scene representation. Unlike previous methods, 3DGS employs a novel mechanism that leverages Gaussian splats — essentially small, localized, Gaussian-distributed elements — to represent 3D scenes.
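
To make this concrete, here is a minimal sketch, assuming NumPy, of the parameters a single splat typically stores (a center, an anisotropic covariance factored into scale and rotation, an opacity, and a color) and how its density is evaluated at a 3D point. The class and field names are illustrative only; the actual 3DGS implementation also stores spherical-harmonic color coefficients and runs on the GPU.

```python
# A minimal sketch (not Tavus' implementation) of the parameters stored by a
# single 3D Gaussian splat and how its density is evaluated at a point.
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

class GaussianSplat:
    """One small, localized, Gaussian-distributed scene element (names are illustrative)."""
    def __init__(self, mean, scale, rotation_quat, opacity, color):
        self.mean = np.asarray(mean, dtype=float)      # 3D center position
        self.scale = np.asarray(scale, dtype=float)    # per-axis standard deviations
        self.R = quat_to_rotmat(np.asarray(rotation_quat, dtype=float))
        self.opacity = float(opacity)                  # base alpha in [0, 1]
        self.color = np.asarray(color, dtype=float)    # RGB (real 3DGS uses SH coefficients)

    def covariance(self):
        # Sigma = R S S^T R^T: anisotropic covariance built from scale and rotation
        S = np.diag(self.scale)
        return self.R @ S @ S.T @ self.R.T

    def density(self, x):
        # G(x) = exp(-1/2 (x - mu)^T Sigma^-1 (x - mu))
        d = np.asarray(x, dtype=float) - self.mean
        return np.exp(-0.5 * d @ np.linalg.inv(self.covariance()) @ d)

# Example: a splat centered at the origin, elongated along the x axis
splat = GaussianSplat(mean=[0, 0, 0], scale=[0.5, 0.1, 0.1],
                      rotation_quat=[1, 0, 0, 0], opacity=0.8, color=[0.9, 0.7, 0.6])
print(splat.density([0.2, 0.0, 0.0]))  # density falls off with distance from the center
```

During training, many thousands of such splats are optimized jointly, and the adaptive densification mechanism shown in Figure 3 below clones or splits splats in regions that are under-reconstructed.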

Figure 3. The adaptive densification mechanism proposed with the 3DGS model. (Image taken from [2])

One of the most impactful improvements we made from the Phoenix model to the Phoenix-2 model was doing a drop-in replacement of the NeRF backbone of the original Phoenix model. The Phoenix-2 model now uses 3DGS to learn how audio deforms faces in 3D space, and uses that information to render novel views from unseen audio.
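
Phoenix-2's exact architecture is not detailed here, but the general recipe shared by audio-driven 3DGS methods is to keep a canonical set of Gaussians and predict per-splat deformations from audio features at each frame. The PyTorch sketch below is a hypothetical illustration of that idea only; the module names, dimensions, and the embedding-plus-MLP design are assumptions, not the Phoenix-2 model.

```python
# Hypothetical sketch of audio-driven Gaussian deformation. This is NOT the
# Phoenix-2 architecture, just the generic idea of predicting per-splat offsets
# from an audio feature. All names and dimensions are invented for illustration.
import torch
import torch.nn as nn

class AudioDeformationField(nn.Module):
    def __init__(self, audio_dim=128, num_splats=50_000, hidden=256):
        super().__init__()
        # Learnable per-splat identity embedding (which splat are we deforming?)
        self.splat_embed = nn.Embedding(num_splats, 32)
        # Small MLP mapping (audio feature, splat embedding) -> 3D offset
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + 32, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, audio_feat, splat_means):
        # audio_feat: (audio_dim,) feature for the current frame
        # splat_means: (num_splats, 3) canonical Gaussian centers
        n = splat_means.shape[0]
        ids = torch.arange(n, device=splat_means.device)
        cond = torch.cat([audio_feat.expand(n, -1), self.splat_embed(ids)], dim=-1)
        offsets = self.mlp(cond)          # per-splat displacement for this frame
        return splat_means + offsets      # deformed centers, ready for rasterization

# Usage: deform the canonical splats for one audio frame, then rasterize as usual
# field = AudioDeformationField()
# deformed_means = field(audio_feature_for_frame, canonical_means)
```

Real systems of this kind typically also predict rotation, scale, and opacity offsets, and may condition on additional signals beyond audio, before passing the deformed splats to the differentiable rasterizer.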

From NeRF to 3DGS, why did we switch?

The advantages of Gaussian Splatting over ray-marched NeRFs are seen across several categories:

1. Data Representation

  • NeRF represents a scene using a continuous volumetric representation encoded by a neural network, which predicts color and density at any 3D point in the scene
  • 3DGS uses a collection of Gaussian Splats, where each splat represents a small, localized part of the scene with a simple mathematical form

2. Memory Usage

  • NeRFs require a significant amount of memory to store the neural network weights and the intermediate values computed during volume rendering
  • 3DGS uses less memory because it directly manipulates a sparse set of parameters (the positions, opacities, covariances, and colors of the Gaussian splats)

3. Computational Complexity

  • NeRFs require a large number of points in the scene to be evaluated by a neural network for each training iteration, leading to high computational costs.
  • 3DGS computational load is significantly reduced due to the simplicity of the Gaussian functions

4. Training Process

  • NeRF training can be slower due to the need to optimize a neural network to fit the scene, and the need for backpropagation through many layers
  • 3DGS training requires optimizing the parameters of Gaussian Splats to fit a scene. This optimization can be more efficient because it deals with simpler mathematical entities

5. Rendering Efficiency

  • NeRF rendering requires sampling many points along rays cast through a scene and integrating the results
  • 3DGS rendering is more straightforward, as it only needs to project Gaussian splats onto the image plane and alpha-blend them (a simplified per-pixel sketch of both approaches follows Figure 5)

Figure 4. Visual explanation of different kinds of scene representations. From left to right: an implicit model uses a neural network to approximate a radiance field; an explicit model performs volume rendering directly on learned spatial features (voxels, hash grids, etc.) without neural components; and a hybrid model combines learned spatial features with neural networks. (Image taken from [4])
Figure 5. The conceptual illustration of differences between 3DGS and NeRF. NeRF (left) queries an MLP along the casted rays, while 3DGS (right) blends Gaussians for the given ray. (Image taken from [5])
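
To make the rendering-efficiency contrast concrete, here is a simplified, per-pixel sketch of the two loops: NeRF marches along the ray and queries a network at every sample, while 3DGS alpha-blends the depth-sorted Gaussians that cover the pixel. The `mlp` argument and the Gaussian methods (`overlaps`, `depth`, `screen_space_weight`) are placeholders, not a real library API; actual implementations batch this over all pixels on the GPU.

```python
# Simplified contrast between NeRF-style ray marching and 3DGS-style splat
# blending for a single pixel. Objects passed in are placeholders.
import numpy as np

def render_pixel_nerf(ray_origin, ray_dir, mlp, num_samples=128, near=0.1, far=4.0):
    """Sample many 3D points along the ray and query a neural network at each one."""
    ts = np.linspace(near, far, num_samples)
    color, transmittance = np.zeros(3), 1.0
    for i, t in enumerate(ts[:-1]):
        point = ray_origin + t * ray_dir
        rgb, sigma = mlp(point, ray_dir)                  # expensive: one network query per sample
        alpha = 1.0 - np.exp(-sigma * (ts[i + 1] - t))    # opacity of this segment
        color += transmittance * alpha * rgb              # front-to-back compositing
        transmittance *= (1.0 - alpha)
    return color

def render_pixel_3dgs(pixel_xy, gaussians, camera):
    """Alpha-blend the Gaussians that overlap this pixel, sorted front to back."""
    visible = sorted((g for g in gaussians if g.overlaps(pixel_xy, camera)),
                     key=lambda g: g.depth(camera))
    color, transmittance = np.zeros(3), 1.0
    for g in visible:                                     # cheap: closed-form 2D Gaussian weight
        alpha = g.opacity * g.screen_space_weight(pixel_xy, camera)
        color += transmittance * alpha * g.color
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-3:                          # early termination once nearly opaque
            break
    return color
```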

Our Phoenix-2 pipeline based on 3DGS is able to train new replicas 70% faster, render at 60+ FPS, and allow for more explicit control over avatar movements, because it works directly with the primitive Gaussian splat scene representation.
This transition makes it far more practical to deploy these models in real-time applications, such as interactive avatars and virtual assistants.

What’s the Next Model?

In the past few months, several concurrent research papers have been published or open-sourced in this area. For example, GSTalker (Chen et al.) [6], GaussianTalker (Cho et al.) [7], GaussianTalker (Yu et al.) [9], and TalkingGaussian (Li et al.) [8], to name just a few, all indicate that applying 3D Gaussian Splatting to the talking-head generation task is a promising direction.

Figure 6. Pipeline illustration of the GSTalker paper from Chen et al. (Image taken from [6])
Figure 7. Pipeline illustration of the GaussianTalker paper from Cho et al. (Image taken from [7])
Figure 8. Pipeline illustration of the TalkingGaussian paper from Li et al. (Image taken from [8])
Figure 9. Pipeline illustration of the GaussianTalker from Yu et al. (Image taken from [9])

While the direction is promising, 3DGS-based methods still have some known limitations. First, they can suffer from render-quality issues, especially on in-the-wild training videos. Second, the training time required by the methods above is still too high for real-world applications.
With Phoenix-2, we were able to build off of existing methods and combine them with our in-house novel advancements to tackle these limitations. If this is something that sounds interesting to you, come check us out! We’re hiring: https://tavus.io/careers

References

[0] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).

[1] Guo, Yudong, et al. “AD-NeRF: Audio driven neural radiance fields for talking head synthesis.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[2] Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Trans. Graph. 42.4 (2023): 139–1.

[3] Gupta, Anchit, et al. “Towards generating ultra-high resolution talking-face videos with lip synchronization.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.

[4] Tosi, Fabio, et al. “How NeRFs and 3D Gaussian Splatting are reshaping SLAM: a survey.” arXiv preprint arXiv:2402.13255 (2024).

[5] Tosi, Fabio, et al. “How NeRFs and 3D Gaussian Splatting are reshaping SLAM: a survey.” arXiv preprint arXiv:2402.13255 (2024).

[6] Chen, Bo, et al. “GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting.” arXiv preprint arXiv:2404.19040 (2024).

[7] Cho, Kyusun, et al. “GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting.” arXiv preprint arXiv:2404.16012 (2024).

[8] Li, Jiahe, et al. “TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting.” arXiv preprint arXiv:2404.15264 (2024).

[9] Yu, Hongyun, et al. “GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting.” arXiv preprint arXiv:2404.14037 (2024).
