This post covers the past, present, and future of the talking-head generation research field. Specifically, we dive deep into the trending 3D scene representations (NeRF -> 3DGS) and the benefits of employing 3DGS in avatar applications.
Outline
- Development history of talking-head model structure
- Tradeoffs in architectures with pipeline examples
- Technical deep dive of 3DGS model
- From NeRF to 3DGS, why did we switch?
- What’s next?
Development history of talking-head model structure
Talking-head model architectures have varied significantly over recent years, from fully two-dimensional approaches utilizing Generative Adversarial Networks (GANs) [0], to 3D rendering pipelines such as audio-driven Neural Radiance Fields (NeRFs) [1] or 3D Gaussian Splatting (3DGS) [2].
Traditional GAN models leveraged large datasets of facial images to produce realistic facial animations, but often struggled with temporal consistency and coherence across longer sequences.
Tradeoffs in architectures with pipeline examples
The transition from image-space GANs to NeRFs brought notable improvements in training time, render speed, and video quality. GANs by nature require vast datasets and expensive compute to train, and often suffer from slower inference and lower video quality due to their two-dimensional nature and temporal-consistency issues. By using a 3D intermediate representation, we can take advantage of rendering techniques that run at over 100 FPS, and gain a higher degree of controllability and generalizability thanks to physics-aware constraints on expression animation. A visual comparison illustrates the difference between 2D and 3D talking-head models.
Technical Explanation of 3DGS Model
3D Gaussian Splatting is a cutting-edge rasterization technique used in the field of 3D scene representation. Unlike previous methods, 3DGS employs a novel mechanism that leverages Gaussian splats — essentially small, localized, Gaussian-distributed elements — to represent 3D scenes.
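To make that concrete, here is a minimal sketch (not any particular library’s API) of the per-splat parameters and how one splat’s contribution falls off around its center. In practice the covariance is stored factored into a rotation and per-axis scales, and color is a set of spherical-harmonic coefficients rather than a single RGB value; the names below are illustrative.

```python
# Minimal sketch of a 3D Gaussian splat and its density falloff at a 3D point.
# Illustrative only; real 3DGS stores covariance as rotation + scale and uses SH colors.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    mean: np.ndarray      # (3,) center of the Gaussian in world space
    cov: np.ndarray       # (3, 3) covariance describing the splat's shape and orientation
    opacity: float        # learned alpha in [0, 1]
    color: np.ndarray     # (3,) RGB (spherical harmonics in the real method)

def density_at(splat: GaussianSplat, x: np.ndarray) -> float:
    """Unnormalized Gaussian falloff of one splat evaluated at point x."""
    d = x - splat.mean
    return float(splat.opacity * np.exp(-0.5 * d @ np.linalg.inv(splat.cov) @ d))

splat = GaussianSplat(mean=np.zeros(3), cov=0.01 * np.eye(3),
                      opacity=0.8, color=np.array([0.9, 0.7, 0.6]))
print(density_at(splat, np.array([0.05, 0.0, 0.0])))  # decays smoothly away from the center
```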
One of the most impactful improvements from the Phoenix model to the Phoenix-2 model was a drop-in replacement of the original Phoenix NeRF backbone with 3DGS. Phoenix-2 uses 3DGS to learn how audio deforms faces in 3D space, and uses that information to render novel views from unseen audio.
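As a purely illustrative sketch of what "learning how audio deforms faces" can look like with a splat-based representation, the snippet below maps a per-frame audio embedding to per-Gaussian position offsets. The module name, network shape, and audio feature dimension are assumptions for illustration, not the actual Phoenix-2 architecture.

```python
# Hypothetical sketch: an MLP predicts a 3D offset per Gaussian from an audio embedding,
# deforming a static set of splats before rendering. Shapes and names are illustrative.
import torch
import torch.nn as nn

class AudioDeformer(nn.Module):
    def __init__(self, num_splats: int, audio_dim: int = 256):
        super().__init__()
        self.num_splats = num_splats
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, num_splats * 3),  # one 3D offset per Gaussian
        )

    def forward(self, audio_feat: torch.Tensor, means: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim) per-frame audio embedding
        # means:      (num_splats, 3) canonical Gaussian centers
        offsets = self.net(audio_feat).view(-1, self.num_splats, 3)
        return means.unsqueeze(0) + offsets  # deformed centers, one set per frame

# Example: 10k Gaussians deformed by a batch of 2 audio frames -> (2, 10000, 3)
deformer = AudioDeformer(num_splats=10_000)
deformed = deformer(torch.randn(2, 256), torch.randn(10_000, 3))
```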
From NeRF to 3DGS, why did we switch?
The advantages of Gaussian Splatting over ray-marching NeRFs span several categories:
1. Data Representation
- NeRF represents a scene using a continuous volumetric representation encoded by a neural network, which predicts color and density at any 3D point in the scene
- 3DGS uses a collection of Gaussian Splats, where each splat represents a small, localized part of the scene with a simple mathematical form
2. Memory Usage
- NeRFs require a significant amount of memory to store the neural network weights and the intermediate values computed during the volume rendering process
- 3DGS uses less memory because it directly manipulates a sparse, explicit set of parameters (the positions, opacities, covariances, and colors of the Gaussian splats)
3. Computational Complexity
- NeRFs require a large number of points in the scene to be evaluated by a neural network for each training iteration, leading to high computational costs.
- 3DGS's computational load is significantly reduced due to the simplicity of evaluating the Gaussian functions
4. Training Process
- NeRF training can be slower due to the need to optimize a neural network to fit the scene, and the need for backpropagation through many layers
- 3DGS training requires optimizing the parameters of Gaussian Splats to fit a scene. This optimization can be more efficient because it deals with simpler mathematical entities
5. Rendering Efficiency
- NeRF rendering requires sampling many points along rays cast through a scene and integrating the results
- 3DGS rendering is more direct: it projects the Gaussian splats onto the image plane and alpha-blends them in depth order (see the toy sketch below)
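To make the rendering contrast in point 5 concrete, here is a toy, self-contained sketch: the NeRF-style path marches a ray through a single-Gaussian "scene" with many field evaluations, while the splatting path projects that same Gaussian once and weights the pixel by its 2D footprint. The camera model, constants, and function names are illustrative assumptions, not a real renderer or the Phoenix-2 code.

```python
# Toy contrast: per-ray volume rendering (many samples) vs. one-shot splat projection.
import numpy as np

MEAN, SIGMA, OPACITY, COLOR = np.array([0.0, 0.0, 2.0]), 0.2, 0.9, np.array([1.0, 0.5, 0.2])

def field(x):
    """Stand-in for a NeRF MLP: returns (rgb, density) at a 3D point."""
    d = np.sum((x - MEAN) ** 2) / SIGMA ** 2
    return COLOR, 10.0 * np.exp(-0.5 * d)

def render_pixel_nerf(ray_o, ray_d, n_samples=128):
    """Volume rendering: many field evaluations per ray, integrated front to back."""
    ts = np.linspace(0.5, 4.0, n_samples)
    dt = ts[1] - ts[0]
    color, transmittance = np.zeros(3), 1.0
    for t in ts:
        rgb, sigma = field(ray_o + t * ray_d)
        alpha = 1.0 - np.exp(-sigma * dt)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color

def render_pixel_splat(pixel_xy, focal=1.0):
    """Splatting: project the Gaussian once, weight the pixel by its 2D footprint."""
    proj = MEAN[:2] / MEAN[2] * focal      # perspective projection of the splat center
    sigma_2d = SIGMA / MEAN[2] * focal     # projected (isotropic) spread
    w = np.exp(-0.5 * np.sum((pixel_xy - proj) ** 2) / sigma_2d ** 2)
    return OPACITY * w * COLOR             # single blend step for one splat

print(render_pixel_nerf(np.zeros(3), np.array([0.0, 0.0, 1.0])))
print(render_pixel_splat(np.array([0.0, 0.0])))
```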
Our Phoenix-2 pipeline, built on 3DGS, trains new replicas 70% faster, renders at 60+ FPS, and allows for more explicit controllability because we work directly with the primitive Gaussian splat scene representation.
This transition enhances the practicality of deploying these models in real-time applications, such as interactive avatars and virtual assistants. It also allows for more explicit control over avatar movements.
What’s the Next Model?
In the past few months, several concurrent research papers have been published or open-sourced in this area. For example, GSTalker (Chen et al.) [6], GaussianTalker (Cho et al.) [7], GaussianTalker (Yu et al.) [9], and TalkingGaussian (Li et al.) [8], to name just a few, all indicate that applying 3D Gaussian Splatting to the talking-head generation task is a promising direction.
While the direction is promising, 3DGS-based methods still have some known limitations. First, these methods can suffer from rendering-quality issues, especially on in-the-wild training videos. Second, the training time required by the methods above is still too high for real-world applications.
With Phoenix-2, we were able to build on existing methods and combine them with our own novel advancements to tackle these limitations. If this sounds interesting to you, come check us out! We’re hiring: https://tavus.io/careers
References
[0] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).
[1] Guo, Yudong, et al. “AD-NeRF: Audio driven neural radiance fields for talking head synthesis.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[2] Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Trans. Graph. 42.4 (2023): 139–1.
[3] Gupta, Anchit, et al. “Towards generating ultra-high resolution talking-face videos with lip synchronization.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.
[4] Tosi, Fabio, et al. “How NeRFs and 3D Gaussian Splatting are reshaping SLAM: a survey.” arXiv preprint arXiv:2402.13255 (2024).
[6] Chen, Bo, et al. “GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting.” arXiv preprint arXiv:2404.19040 (2024).
[7] Cho, Kyusun, et al. “GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting.” arXiv preprint arXiv:2404.16012 (2024).
[8] Li, Jiahe, et al. “TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting.” arXiv preprint arXiv:2404.15264 (2024).
[9] Yu, Hongyun, et al. “GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting.” arXiv preprint arXiv:2404.14037 (2024).