This post covers the past, present, and future of the talking-head generation research field. Specifically, we dive deep into the trending 3D scene representations (NeRF -> 3DGS) and the benefits of employing 3DGS in avatar applications.
Outline
- Development history of talking-head model structure
- Tradeoffs in architectures with pipeline examples
- Technical deep dive of 3DGS model
- From NeRF to 3DGS, why did we switch?
- What’s next?
Development history of talking-head model structure
Talking-head model architectures have varied significantly over recent years, from fully two-dimensional approaches utilizing Generative Adversarial Networks (GANs) [0], to 3D rendering pipelines such as audio-driven Neural Radiance Fields (NeRFs) [1] or 3D Gaussian Splatting (3DGS) [2].
Traditional GAN models leveraged large datasets of facial images to produce realistic facial animations, but often struggled with temporal consistency and coherence across longer sequences.
Tradeoffs in architectures with pipeline examples
The transition from image-space GANs to NeRFs brought notable improvements in training time, render speed, and video quality. GANs by nature require vast datasets and expensive compute to train, and often suffer from slower inference and lower video quality due to their two-dimensional nature and temporal-consistency issues. By using a 3D intermediate representation, we can take advantage of rendering techniques that run at over 100 FPS, and gain a higher degree of controllability and generalizability thanks to physics-aware constraints on expression animation. A visual comparison illustrates the difference between 2D and 3D talking-head models.
Technical Explanation of 3DGS Model
3D Gaussian Splatting is a cutting-edge rasterization technique used in the field of 3D scene representation. Unlike previous methods, 3DGS employs a novel mechanism that leverages Gaussian splats — essentially small, localized, Gaussian-distributed elements — to represent 3D scenes.
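To make that concrete, here is a minimal sketch (not any particular library’s API) of the per-splat parameters and how one splat’s contribution falls off around its center. In practice the covariance is stored factored into a rotation and per-axis scales, and color is a set of spherical-harmonic coefficients rather than a single RGB value; the names below are illustrative.

```python
# Minimal sketch of a 3D Gaussian splat and its density falloff at a 3D point.
# Illustrative only; real 3DGS stores covariance as rotation + scale and uses SH colors.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    mean: np.ndarray      # (3,) center of the Gaussian in world space
    cov: np.ndarray       # (3, 3) covariance describing the splat's shape and orientation
    opacity: float        # learned alpha in [0, 1]
    color: np.ndarray     # (3,) RGB (spherical harmonics in the real method)

def density_at(splat: GaussianSplat, x: np.ndarray) -> float:
    """Unnormalized Gaussian falloff of one splat evaluated at point x."""
    d = x - splat.mean
    return float(splat.opacity * np.exp(-0.5 * d @ np.linalg.inv(splat.cov) @ d))

splat = GaussianSplat(mean=np.zeros(3), cov=0.01 * np.eye(3),
                      opacity=0.8, color=np.array([0.9, 0.7, 0.6]))
print(density_at(splat, np.array([0.05, 0.0, 0.0])))  # decays smoothly away from the center
```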
One of the most impactful improvements from the Phoenix model to the Phoenix-2 model was a drop-in replacement of the original Phoenix NeRF backbone with 3DGS. Phoenix-2 uses 3DGS to learn how audio deforms faces in 3D space, and uses that information to render novel views from unseen audio.
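As a purely illustrative sketch of what "learning how audio deforms faces" can look like with a splat-based representation, the snippet below maps a per-frame audio embedding to per-Gaussian position offsets. The module name, network shape, and audio feature dimension are assumptions for illustration, not the actual Phoenix-2 architecture.

```python
# Hypothetical sketch: an MLP predicts a 3D offset per Gaussian from an audio embedding,
# deforming a static set of splats before rendering. Shapes and names are illustrative.
import torch
import torch.nn as nn

class AudioDeformer(nn.Module):
    def __init__(self, num_splats: int, audio_dim: int = 256):
        super().__init__()
        self.num_splats = num_splats
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, num_splats * 3),  # one 3D offset per Gaussian
        )

    def forward(self, audio_feat: torch.Tensor, means: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim) per-frame audio embedding
        # means:      (num_splats, 3) canonical Gaussian centers
        offsets = self.net(audio_feat).view(-1, self.num_splats, 3)
        return means.unsqueeze(0) + offsets  # deformed centers, one set per frame

# Example: 10k Gaussians deformed by a batch of 2 audio frames -> (2, 10000, 3)
deformer = AudioDeformer(num_splats=10_000)
deformed = deformer(torch.randn(2, 256), torch.randn(10_000, 3))
```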
From NeRF to 3DGS, why did we switch?
The advantages of Gaussian Splatting over ray-marching NeRFs span several categories:
1. Data Representation
- NeRF represents a scene using a continuous volumetric representation encoded by a neural network, which predicts color and density at any 3D point in the scene
- 3DGS uses a collection of Gaussian Splats, where each splat represents a small, localized part of the scene with a simple mathematical form
2. Memory Usage
- NeRFs require a significant amount of memory to store the neural network weights and the intermediate values computed during the volume rendering process
- 3DGS uses less memory because it directly manipulates a sparse, explicit set of parameters (the positions, opacities, covariances, and colors of the Gaussian splats)
3. Computational Complexity
- NeRFs require a large number of points in the scene to be evaluated by a neural network for each training iteration, leading to high computational costs.
- 3DGS's computational load is significantly reduced due to the simplicity of evaluating the Gaussian functions
4. Training Process
- NeRF training can be slower due to the need to optimize a neural network to fit the scene, and the need for backpropagation through many layers
- 3DGS training requires optimizing the parameters of Gaussian Splats to fit a scene. This optimization can be more efficient because it deals with simpler mathematical entities
5. Rendering Efficiency
- NeRF rendering requires sampling many points along rays cast through a scene and integrating the results
- 3DGS rendering is more direct: it projects the Gaussian splats onto the image plane and alpha-blends them in depth order (see the toy sketch below)
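To make the rendering contrast in point 5 concrete, here is a toy, self-contained sketch: the NeRF-style path marches a ray through a single-Gaussian "scene" with many field evaluations, while the splatting path projects that same Gaussian once and weights the pixel by its 2D footprint. The camera model, constants, and function names are illustrative assumptions, not a real renderer or the Phoenix-2 code.

```python
# Toy contrast: per-ray volume rendering (many samples) vs. one-shot splat projection.
import numpy as np

MEAN, SIGMA, OPACITY, COLOR = np.array([0.0, 0.0, 2.0]), 0.2, 0.9, np.array([1.0, 0.5, 0.2])

def field(x):
    """Stand-in for a NeRF MLP: returns (rgb, density) at a 3D point."""
    d = np.sum((x - MEAN) ** 2) / SIGMA ** 2
    return COLOR, 10.0 * np.exp(-0.5 * d)

def render_pixel_nerf(ray_o, ray_d, n_samples=128):
    """Volume rendering: many field evaluations per ray, integrated front to back."""
    ts = np.linspace(0.5, 4.0, n_samples)
    dt = ts[1] - ts[0]
    color, transmittance = np.zeros(3), 1.0
    for t in ts:
        rgb, sigma = field(ray_o + t * ray_d)
        alpha = 1.0 - np.exp(-sigma * dt)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color

def render_pixel_splat(pixel_xy, focal=1.0):
    """Splatting: project the Gaussian once, weight the pixel by its 2D footprint."""
    proj = MEAN[:2] / MEAN[2] * focal      # perspective projection of the splat center
    sigma_2d = SIGMA / MEAN[2] * focal     # projected (isotropic) spread
    w = np.exp(-0.5 * np.sum((pixel_xy - proj) ** 2) / sigma_2d ** 2)
    return OPACITY * w * COLOR             # single blend step for one splat

print(render_pixel_nerf(np.zeros(3), np.array([0.0, 0.0, 1.0])))
print(render_pixel_splat(np.array([0.0, 0.0])))
```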
Our Phoenix-2 pipeline, built on 3DGS, trains new replicas 70% faster, renders at 60+ FPS, and allows for more explicit controllability because we work directly with the primitive Gaussian splat scene representation.
This transition enhances the practicality of deploying these models in real-time applications, such as interactive avatars and virtual assistants. It also allows for more explicit control over avatar movements.
What’s the Next Model?
In the past few months, several concurrent research papers have been published or open-sourced in this area. For example, GSTalker (Chen et al.) [6], GaussianTalker (Cho et al.) [7], GaussianTalker (Yu et al.) [9], and TalkingGaussian (Li et al.) [8], to name just a few, all indicate that applying 3D Gaussian Splatting to the talking-head generation task is a promising direction.
While the direction is promising, 3DGS-based methods still have some known limitations. First, these methods can suffer from rendering-quality issues, especially on in-the-wild training videos. Second, the training time required by the methods above is still too high for real-world applications.
With Phoenix-2, we were able to build on existing methods and combine them with our own novel advancements to tackle these limitations. If this sounds interesting to you, come check us out! We’re hiring: https://tavus.io/careers
References
[0] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).
[1] Guo, Yudong, et al. “AD-NeRF: Audio driven neural radiance fields for talking head synthesis.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[2] Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Trans. Graph. 42.4 (2023): 139–1.
[3] Gupta, Anchit, et al. “Towards generating ultra-high resolution talking-face videos with lip synchronization.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.
[4] Tosi, Fabio, et al. “How NeRFs and 3D Gaussian Splatting are reshaping SLAM: a survey.” arXiv preprint arXiv:2402.13255 (2024).
[6] Chen, Bo, et al. “GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting.” arXiv preprint arXiv:2404.19040 (2024).
[7] Cho, Kyusun, et al. “GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting.” arXiv preprint arXiv:2404.16012 (2024).
[8] Li, Jiahe, et al. “TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting.” arXiv preprint arXiv:2404.15264 (2024).
[9] Yu, Hongyun, et al. “GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting.” arXiv preprint arXiv:2404.14037 (2024).