Prime Video automatically detects audio-video synchronization defects in dubbed media at scale
Prime Video achieves a 99.4% F1 score in synchronizing dubbed audio to non-dubbed audio using an innovative, fast, and memory-efficient approach.
At Prime Video, an important aspect of video quality is ensuring time synchronization between multiple modalities (for example, visual and audio). The ITU J.248 recommendation for operational monitoring of video-to-audio delay in television programs suggests that people can detect desynchronizations as low as 140 ms. Any drift beyond this perceptible tolerance limit reduces enjoyability and increases the likelihood of customer disengagement.
The audio-visual (A/V) sync problem splits into two distinct sub-problems based on the type of audio content: non-dubbed A/V sync and dubbed A/V sync. The majority of quality defects occur in dubbed media. Sync issues in dubbed audio tracks can come from improper overlay of the dubbed audio track or from poor-quality dubbing.
Audio-to-visual synchronization on non-dubbed media (video and audio in the same language) can be solved by looking at the correspondence between lip movements and speech. For dubbed content, there is an inherent disconnect between lip movements and speech because the lips move according to the language of the original audio. Direct audio-to-visual synchronization of dubbed media is therefore inaccurate, so we solved the issue by posing it as an audio-to-audio synchronization problem. The central idea is that the non-dubbed original-language audio track is always available alongside the corresponding dubbed audio tracks. We can then break down A/V sync of the dubbed audio track into the following two sub-problems:
- Synchronizing the original language audio-to-video.
- Ensuring dubbed and non-dubbed audio tracks are in sync.
The following diagram shows an overview of problem simplification for the A/V sync.
The Problem: Synchronizing dubbed audio to original audio
Solving the dubbed A/V sync problem first required us to verify that the original-language audio track is in sync with the visual stream. There is substantial prior research in this space, with language-agnostic lip-synchronization approaches achieving more than 98% accuracy. At a high level, these approaches compute representations of short segments (around 200 ms) of face image frames and the corresponding speech audio, and then compute a distance between them.
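As a minimal sketch of that idea: given an embedding of a short visual segment and embeddings of candidate audio windows at different offsets, the best offset is the one minimizing the distance. The embedding networks themselves are out of scope here; the vectors, dimensions, and cosine distance below are illustrative assumptions, not the published models.

```python
import numpy as np

def cosine_distance(a, b):
    """Distance between a visual embedding and an audio embedding."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def best_offset(visual_emb, audio_embs):
    """Pick the candidate audio window closest to the visual segment.

    visual_emb: embedding of ~200 ms of face image frames
    audio_embs: embeddings of ~200 ms audio windows at candidate offsets
    """
    dists = [cosine_distance(visual_emb, a) for a in audio_embs]
    return int(np.argmin(dists))

# Toy example: candidate 2 is a near-duplicate of the visual embedding,
# simulating the audio window that is actually in sync.
rng = np.random.default_rng(0)
v = rng.standard_normal(128)
candidates = [rng.standard_normal(128) for _ in range(5)]
candidates[2] = v + 0.01 * rng.standard_normal(128)
```

In practice the embeddings come from trained audio and visual encoders; the toy random vectors above merely show the offset-selection step.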
The second part of solving the dubbed A/V sync problem is ensuring synchronization between audio tracks in two different languages, a problem that has been largely untouched in academic literature. It's easy to see why lip-sync approaches don't apply to dubbed audio. For example, if a viewer is watching an English-language movie with a French-language soundtrack, it's unlikely that the actor's lip movements match the French words in the audio.
The approach we proposed was inspired by the fact that only certain portions (speech or vocals in music) are actually dubbed over. All other sounds remain the same, regardless of language. For example, the sound of a door slamming is the same in English as it is in German. Therefore, we needed a fast and efficient method to find all matching segments in two audio tracks.
The following diagram shows an overview of audio-to-audio sync detection.
The Solution: Efficient audio synchronization as a compression problem
The following diagram shows efficient audio representations for downstream audio-to-audio matching. The dimensionality of the spectrograms is reduced both in time (X < T) and in frequency (M << F).
An accurate approach to finding matching segments in two audio tracks is to compare frame-by-frame representations of each. We can first represent each audio track as a magnitude spectrogram (a time × frequency grid of amplitudes) using the short-time Fourier transform (STFT). We can then compare the distance between representations of audio frames from the source and target audio. This brute-force approach achieves more than a 99% F1 score but is prohibitively expensive and time-consuming. For reference, a magnitude spectrogram of one hour of audio has more than 150 million data points, requires more than 500 MB of storage, and takes in excess of 40 minutes to match against another spectrogram of the same size. For a catalog the size of Prime Video's, this is unscalable.
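The brute-force baseline can be sketched as follows. This is a toy illustration under assumed parameters (sample rate, STFT window, Euclidean distance), not the production system: it builds magnitude spectrograms and, for each source frame, finds the nearest target frame by exhaustive comparison.

```python
import numpy as np
from scipy.signal import chirp, stft

def magnitude_spectrogram(audio, sr, nperseg=1024):
    """Magnitude spectrogram (frequency x time) via the short-time Fourier transform."""
    _, _, Z = stft(audio, fs=sr, nperseg=nperseg)
    return np.abs(Z)

def match_frames(src, tgt):
    """Brute-force matching: for each source frame, the index of the nearest
    target frame. Cost is O(T_src * T_tgt * F), which is why this is
    prohibitively slow at full resolution."""
    # pairwise Euclidean distances between all (source frame, target frame) pairs
    d = np.linalg.norm(src[:, :, None] - tgt[:, None, :], axis=0)
    return d.argmin(axis=1)

# Toy example: a one-second chirp so that every frame is spectrally distinct;
# the "target" track is the same audio with the first two frames cut off,
# simulating a constant sync offset.
sr = 8000
t = np.arange(sr) / sr
audio = chirp(t, f0=100, t1=1.0, f1=3000)
src = magnitude_spectrogram(audio, sr)
tgt = src[:, 2:]
```

With this setup, source frame i matches target frame i - 2, recovering the simulated two-frame offset.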
To make the problem of finding matching segments both accurate and efficient, we applied two optimizations. First, we considered only audio frames that have enough loudness, by filtering in the time domain. This yields a spectrogram that is truncated in length. Filtering in the time domain alone reduces the memory footprint by 88%, from more than 520 MB to 68 MB per hour of audio.
Second, to further reduce the dimensionality of the spectrogram in the frequency domain, we applied compressive sensing. This is a signal processing method for acquiring and reconstructing a signal that is sparse in some domain from far fewer samples than the Nyquist–Shannon sampling theorem requires. Although compressive sensing has been applied in a wide range of domains (for example, image compression, video compression, and song identification), we believe this is the first attempt to use it to solve audio synchronization problems. To accommodate the entire human hearing range of 20 Hz to 20 kHz, audio in cinematic media is sampled at 44.1 kHz or 48 kHz. However, the energy of most sound events is concentrated at frequencies below 8 kHz. This makes cinematic audio very sparse in the frequency domain and conducive to compressive sensing.
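One standard way to realize compressive sensing is to multiply each F-dimensional spectrogram frame by a random M × F measurement matrix with M << F. The dimensions and Gaussian matrix below are illustrative assumptions, not the production parameters; the point is that for sparse frames, distances between the compressed fingerprints track distances between the originals, so matching can run on the small representation.

```python
import numpy as np

rng = np.random.default_rng(42)
F, M = 513, 32  # original vs. compressed number of frequency bins (illustrative)
phi = rng.standard_normal((M, F)) / np.sqrt(M)  # random measurement matrix

def compress(spec):
    """Map an (F x T) magnitude spectrogram to an (M x T) fingerprint."""
    return phi @ spec

# Toy sparse frames: each has its energy in a single frequency bin,
# mimicking spectra whose energy sits well below 8 kHz.
x1 = np.zeros(F); x1[10] = 1.0
x2 = np.zeros(F); x2[200] = 1.0
y1, y2 = phi @ x1, phi @ x2
```

The 32-dimensional measurements of the two frames remain clearly distinguishable, and their distance stays close to the distance between the original 513-dimensional frames.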
With these efficient representations, we achieve around a 99.6% relative reduction in memory footprint, from more than 520 MB to less than 3 MB per hour of audio. The time to match is consequently reduced from more than 40 minutes to less than six seconds. The total time to fingerprint is around 50 seconds on a single-CPU machine.
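Once frames are matched, the sync offset itself still has to be estimated. A simple, robust way to do this (a sketch under our own assumptions, not necessarily the production estimator) is to take the median of the frame-index differences over matched pairs and convert it to milliseconds using the STFT hop size and sample rate.

```python
import numpy as np

def estimate_offset_ms(src_idx, tgt_idx, hop, sr):
    """Estimate a constant sync offset from matched frame-index pairs.

    src_idx, tgt_idx: time indices of matched frames in the two tracks
    hop: STFT hop size in samples; sr: sample rate in Hz
    The median makes the estimate robust to occasional spurious matches.
    """
    frame_offsets = np.asarray(tgt_idx) - np.asarray(src_idx)
    return float(np.median(frame_offsets)) * hop / sr * 1000.0

# Toy matches: the dubbed track lags the original by 2 frames,
# with one spurious match (40) that the median ignores.
src = [10, 11, 12, 13, 14]
tgt = [12, 13, 14, 40, 16]
offset = estimate_offset_ms(src, tgt, hop=512, sr=48000)
```

With a 512-sample hop at 48 kHz, a two-frame offset corresponds to about 21.3 ms, comfortably inside the precision needed to flag human-perceptible drift.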
The impact for customers
Despite operating on a compressed representation, sync-defect detection performance remains equally high at a 99.4% F1 score, and the median error in predicting the exact sync offset is about 20 ms, significantly better than the human-perceptible tolerance of 100-200 ms.
Our approach allows Prime Video to scale up by ensuring sync-defect-free videos for our customers across the world who watch content in many dubbed languages.
Stay tuned for more from us!