Overview

Vozo provides two modes for lip syncing: Standard Mode and Precision Mode.
Each mode is designed for different types of videos and output quality requirements.

Only members can choose between modes. Free users are automatically assigned Standard Mode.

Key Differences

| | Standard Mode | Precision Mode |
| ⏱️ Speed | Fast: ~10 minutes for queue and processing | Slower: ~2 hours for queue and processing |
| ✅ Best for | Most front-facing videos with clear facial visibility | Videos with complex visual details (e.g., beards, wrinkles, side profiles) |
| ⚠️ Limitations | Less accurate on side profiles or detailed facial textures | Not good if the original video has static faces or minimal mouth movement |

Standard Mode

Standard Mode is optimized for speed and works well in general cases, including:

  • Videos with clear, front-facing speakers
  • Projects where a quick turnaround matters more than fine detail

Example of original video suitable for Standard Mode

Precision Mode

Precision Mode provides greater accuracy and attention to detail. It’s ideal for:

  • Videos with side profiles or complex facial details, such as facial hair or distinguishing features
  • Professional content requiring high-quality lip syncing

Precision Mode learns the mouth movements from the original video, so it needs natural mouth movement to work from. It’s not suitable for videos with static or minimal mouth movement, such as AI-generated videos where the speaker’s mouth doesn’t move naturally.

Example of original video suitable for Precision Mode
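
For readers who prefer a checklist, the guidance in this article can be condensed into a small decision rule. The Python sketch below is illustrative only: `VideoTraits` and `suggest_mode` are hypothetical names, not part of any Vozo API, and the order of the checks is an assumption based on the limitations described above.

```python
from dataclasses import dataclass


@dataclass
class VideoTraits:
    """Hypothetical description of a source video (not a Vozo API type)."""
    is_member: bool               # only members can choose a mode
    natural_mouth_movement: bool  # False for static faces or minimal mouth movement
    complex_facial_detail: bool   # e.g., beards, wrinkles, side profiles
    speed_over_detail: bool       # quick turnaround matters more than fine detail


def suggest_mode(video: VideoTraits) -> str:
    """Return "standard" or "precision" following the guidance in this article."""
    if not video.is_member:
        # Free users are assigned Standard Mode.
        return "standard"
    if not video.natural_mouth_movement:
        # Precision Mode is not suitable here; Standard Mode is the only
        # remaining option (assumption: the article does not rate it for this case).
        return "standard"
    if video.speed_over_detail:
        # ~10 minutes versus ~2 hours for queue and processing.
        return "standard"
    if video.complex_facial_detail:
        return "precision"
    return "standard"


# A member with a bearded, side-profile speaker and no deadline pressure.
print(suggest_mode(VideoTraits(True, True, True, False)))  # prints "precision"
```

The membership and mouth-movement checks come first because they rule Precision Mode out regardless of how detailed the video is.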