Introducing Hummingbird-0: A Leap in Lip Sync

Alex Behrens

April 24, 2025

Today, we're releasing Hummingbird-0, a photorealistic, zero-shot lip sync model that emerged as a research artifact during the development of Phoenix-3, Tavus’ full-face rendering model.

The model’s accuracy, speed, and consistency made it clear: Hummingbird is the most reliable human lip sync model on the market. So we broke it out.

Now available via the Tavus Developer Platform and API, and integrated into the FAL model gallery, Hummingbird is launching as a research preview for developers, creators, and video teams.

What is Hummingbird-0?

Hummingbird-0 is a zero-shot model that generates high-quality lip sync from audio and a short reference video, with no training, cloning, or fine-tuning required. It delivers natural, identity-preserving sync with consistent results, even across varied input styles.

Most lip sync models still miss the mark, with mismatched timing, warped mouths, or an uncanny feel. Hummingbird closes that gap. It leads in lip accuracy, realism, and identity preservation, while remaining cost-effective and simple to run: just upload your audio and video, and it works.

What You Can Build

While Hummingbird is purpose-built for lip syncing, it’s flexible enough to support a wide range of creative and production use cases. Whether you're editing existing footage, localizing content, or building an AI-native media stack, Hummingbird fits cleanly into API-first workflows.

Just a few examples of how early users have been building with Hummingbird-0:

AI Film Studios: Use Hummingbird with Veo or Runway to generate rich, lip-synced cinematic dialogue in seconds.
UGC & Influencer Ads: Replace voiceovers, localize content, and re-version videos without reshooting.
B2B & Enterprise Content: Make one great training or marketing video, then localize and personalize it for every audience.
Translation & Localization: Dub any language and match lips to voice naturally, with no weird timing and no uncanny mouth movement.
AI Toolchains & Workflow: Hummingbird integrates seamlessly into AI pipelines with ElevenLabs, Veo, and more.
Video Editing in Post: Change a word in a podcast, film, or even a CGI cutscene: just replace the audio, and Hummingbird fixes the lips.

Research Benchmarking: State-of-the-Art Evals

We benchmarked Hummingbird-0 against the leading lip sync models using over 30 videos representing diverse speakers, environments, and recording styles. Across the board, Hummingbird delivered state-of-the-art performance in every area that matters:

Lip Sync Accuracy: Hummingbird aligned mouth movements to speech more precisely than anything else we tested. No lag, no missed syllables, just natural, on-beat sync.
Identity Preservation: Where others drift or distort, Hummingbird keeps facial features stable and consistent throughout.
Visual Quality: Outputs were noticeably cleaner, with fewer visual glitches or frame artifacts. Ready to use with little to no post work (and a little foley).
Pricing: Hummingbird usage is included in all paid Tavus plans. Pricing begins at just $1.50 per minute, making it not just the most accurate, but one of the most cost-effective options available.

These results are from controlled, side-by-side comparisons on diverse inputs. If you're building with lip sync APIs, Hummingbird is the new baseline. For full methodology, metric definitions, and sample comparisons, check out the Hummingbird Research post.

Ethics & Guardrails

As with all emerging AI capabilities, we take our responsibilities seriously.

We’ve launched Hummingbird in a research preview mode to observe how it's used in the wild. So far, the community has been creative and respectful. We're proud of that, and want to protect the freedom of our users.

Here’s how we’re keeping things safe:

Internal abuse detection for repeat misuse
Watermarking (visible + invisible) in progress
Policy filters and output moderation being actively developed

Our philosophy: support creativity, respond to bad actors, and evolve guardrails in step with our community.

How to Use Hummingbird

Hummingbird is available now via the Tavus API and in the FAL Model Gallery for quick testing. The API accepts a single MP4 video and an MP3 audio file, and returns a lip-synced MP4 output.
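
Before uploading, it can help to catch the wrong file type locally. The sketch below is not part of the Tavus API. `check_inputs` is a hypothetical helper that only inspects file extensions (for local paths or URLs) against the MP4 and MP3 requirement above. It does not verify the actual container or codec.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def check_inputs(video: str, audio: str) -> list[str]:
    """Return a list of problems; an empty list means the pair looks acceptable.

    Hypothetical pre-flight helper: it checks extensions only, not file contents.
    """
    problems = []
    for label, source, expected in (("video", video, ".mp4"), ("audio", audio, ".mp3")):
        # Drop any URL query string before reading the extension.
        suffix = PurePosixPath(urlparse(source).path).suffix.lower()
        if suffix != expected:
            problems.append(f"{label} should be {expected}, got {suffix or 'no extension'}")
    return problems


print(check_inputs("talk.mp4", "voice.mp3"))
print(check_inputs("talk.mov", "https://example.com/voice.MP3?token=abc"))
```

The first call reports no problems. The second flags only the `.mov` video, since the audio extension is compared case-insensitively.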

Hummingbird-0 performs best with front-facing, talking-head style videos where a single person is clearly visible and speaking directly to the camera. This setup provides the most accurate and stable lip sync. While it's optimized for these conditions, the model can still handle a variety of formats. That said, certain inputs may impact quality:

Multi-person shots may reduce sync precision.
Low-resolution or poorly lit footage can introduce artifacts.
Highly stylized animation (like anime or manga) typically performs poorly.
Lightly stylized or realistic cartoon content may still work well.

All output is resampled to 25 fps. If the input video has a higher resolution than 1080p, it is downscaled; otherwise, resolution is preserved. The aspect ratio always matches the original. If the audio is longer than the video, the video loops from the beginning. This can look unnatural, so it's best to keep durations closely matched.
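
To make the output rules concrete, here is a rough sketch that predicts what a given input would come back as. It is an illustration, not Tavus code. `plan_output` and `OutputPlan` are hypothetical names, and the sketch assumes "1080p" means a frame height of 1080 pixels, which the post does not spell out.

```python
from dataclasses import dataclass

OUTPUT_FPS = 25    # stated above: all output is resampled to 25 fps
MAX_HEIGHT = 1080  # assumption: "1080p" read as a frame height of 1080 pixels


@dataclass
class OutputPlan:
    width: int
    height: int
    fps: int
    video_passes: float  # how many times the source video plays to cover the audio


def plan_output(width: int, height: int, video_s: float, audio_s: float) -> OutputPlan:
    """Predict output size, frame rate, and looping for one video/audio pair."""
    if video_s <= 0 or audio_s <= 0:
        raise ValueError("durations must be positive")
    if height > MAX_HEIGHT:
        # Downscale, keeping the original aspect ratio.
        scale = MAX_HEIGHT / height
        width, height = round(width * scale), MAX_HEIGHT
    # The video loops from the beginning only when the audio runs longer.
    passes = audio_s / video_s if audio_s > video_s else 1.0
    return OutputPlan(width, height, OUTPUT_FPS, passes)


plan = plan_output(3840, 2160, video_s=10.0, audio_s=25.0)
print(plan)
if plan.video_passes > 1:
    print("Audio is longer than the video; expect visible looping.")
```

Under these assumptions, a 10-second 3840x2160 clip paired with 25 seconds of audio comes back at 1920x1080 and 25 fps, with the video playing 2.5 times. That is the looping case worth avoiding.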

Clips up to five minutes are supported. Generation takes about one minute per 10 seconds of video. You can find full documentation, examples, and integration guidance at docs.tavus.io.
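
For planning a batch, the limits above translate into simple arithmetic. The sketch below is a back-of-the-envelope estimate, not an official calculator. `estimate` is a hypothetical helper, the cost uses the $1.50 per minute starting price mentioned earlier, and it assumes billing is proportional to clip length, which may not match your plan.

```python
MAX_CLIP_SECONDS = 5 * 60         # clips up to five minutes are supported
GENERATION_RATIO = 60 / 10        # about one minute of processing per 10 seconds of video
STARTING_PRICE_PER_MINUTE = 1.50  # USD, the stated starting price


def estimate(clip_seconds: float) -> tuple[float, float]:
    """Return (generation_minutes, cost_usd) for a clip of the given length."""
    if not 0 < clip_seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"clip must be between 0 and {MAX_CLIP_SECONDS} seconds")
    generation_minutes = clip_seconds * GENERATION_RATIO / 60
    # Assumption: cost scales linearly with duration, with no rounding to whole minutes.
    cost_usd = clip_seconds / 60 * STARTING_PRICE_PER_MINUTE
    return generation_minutes, cost_usd


minutes, cost = estimate(90)
print(f"90 s clip: about {minutes:.0f} min to generate, roughly ${cost:.2f}")
```

By this estimate, a 90-second clip takes about 9 minutes to generate and costs roughly $2.25, while a maximum-length five-minute clip takes about half an hour.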