Machine Learning

Prime Video automatically detects audio-video synchronization defects in dubbed media at scale

Prime Video achieves a 99.4% F1 score in synchronizing dubbed audio to non-dubbed audio using an innovative, fast, and memory-efficient approach.

Avijit Vajpayee,

Zhikang Zhang,

Vimal Bhat

Mar 29, 2023

At Prime Video, an important aspect of video quality is ensuring time synchronization between multiple modalities (for example, visual and audio). The ITU J.248 recommendation for operational monitoring of video-to-audio delay in television programs suggests that people can detect desynchronizations as small as 140 ms. Any drift beyond this perceptible tolerance limit makes content less enjoyable and increases the likelihood of customer disengagement.

The audio-visual (A/V) sync problem has the following two distinct sub-problems based on the type of audio content: non-dubbed A/V sync and dubbed A/V sync. The majority of the quality defects occur on dubbed media. These sync issues in dubbed audio tracks can come from improper overlay of the dubbed audio track or poor-quality dubbing.

Audio-to-visual synchronization on non-dubbed media (video and audio of the same language) can be solved by looking at the correspondence of lip movements with speech. For dubbed content, there is an inherent disconnect between lip movements and speech because the lips move according to the language of the original audio. Direct audio-to-visual synchronization of dubbed media is therefore not accurate, so we posed the problem as audio-to-audio synchronization instead. The central idea is that the non-dubbed original language versions are always available for the corresponding dubbed audio tracks. We can then break down A/V sync of the dubbed audio track into the following two sub-problems:

Synchronizing the original language audio-to-video.
Ensuring dubbed and non-dubbed audio tracks are in sync.

The following diagram shows an overview of problem simplification for the A/V sync.

Components for Audio-Video Sync Detection on Dubbed videos. We first synchronize the video to the original audio using lip sync and then synchronize the dubbed audio to the corrected original audio.

The Problem: Synchronizing dubbed audio to original audio

Solving the dubbed A/V sync problem first required us to verify that the original language audio track is in sync with the visual stream. There’s sufficient past research in this space with language-agnostic lip synchronization approaches achieving more than 98% accuracy. At a high level, these approaches work by representing short segments (around 200 ms) of face image frames and corresponding speech audio, and then computing a distance between them.

The second part of solving the dubbed A/V sync problem is ensuring synchronization between audio tracks of two different languages, which has been largely untouched in academic literature. It’s easy to imagine why lip sync approaches don’t apply to dubbed audio. For example, if a viewer is watching an English language movie with a French language soundtrack, it’s unlikely that the actor’s lip movements match the French words in the audio.

The approach we proposed was inspired by the fact that only certain portions (speech or vocals in music) are actually dubbed over. All other sounds remain the same, regardless of language. For example, the sound of a door slamming is the same in English as it is in German. Therefore, we needed a fast and efficient method to find all matching segments in two audio tracks.

The following diagram shows an overview of audio-to-audio sync detection.

Pipeline to determine whether dubbed audio is in sync with original audio. Using acoustic fingerprinting, we convert each audio track to a signature. We then identify all matching frames between the two signature files. Using a statistical model on information from all matching frames, we identify whether the dubbed audio is in sync with the original language audio.

The Solution: Efficient audio synchronization as a compression problem

The following diagram shows efficient audio representations for downstream audio-to-audio matching. The dimensionality of the spectrograms is reduced both in time with X < T and frequency with M << F.

Steps for computing audio signature for a given audio track. We first convert the audio from time domain to frequency domain. Following this we compress the frequency domain spectrogram to get a small and robust signature.

An accurate approach to finding matching segments in two audio tracks is to compare frame-by-frame representations of each. We can first represent each audio track as a magnitude spectrogram, that is, a time x frequency x amplitude graph computed with the short-time Fourier transform (STFT). We can then compute the distance between representations of audio frames from the source and target audio. This brute-force approach achieves an F1 score of more than 99% but is prohibitively expensive and time-consuming. For reference, a magnitude spectrogram of one hour of audio has more than 150 million data points, requires more than 500 MB of storage, and takes in excess of 40 minutes to match with another spectrogram of the same size. For a catalog the size of Prime Video’s, this does not scale.
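The brute-force idea can be illustrated with a small NumPy sketch. This is not the production implementation: the function names, the frame length of 1024 samples, the hop of 512 samples, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram.

    frame_len and hop are illustrative values, not the article's settings.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Real-input FFT of each windowed frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1))

def brute_force_match(source, target):
    """For every source frame, return the index of the nearest target frame.

    Builds the full pairwise squared-distance matrix, so time and memory
    grow with (source frames) x (target frames). That cost is what makes
    this approach impractical for hour-long tracks.
    """
    d2 = (
        np.sum(source ** 2, axis=1)[:, None]
        - 2.0 * source @ target.T
        + np.sum(target ** 2, axis=1)[None, :]
    )
    return np.argmin(d2, axis=1)
```

For example, if the target track is the source track with its first three hops removed, source frame i should be matched to target frame i - 3. The pairwise distance matrix is the expensive part, which motivates the smaller representations described next.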

To make finding matching segments both accurate and efficient, we applied two optimizations. First, we considered only audio frames that are loud enough, by filtering in the time domain. This yields a spectrogram that is truncated in length. This time-domain filtering alone reduces the memory footprint by 88%, from more than 520 MB to 68 MB per hour of audio.
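A minimal sketch of loudness-based frame filtering follows. The article does not give the loudness criterion, so the relative threshold of -40 dB, the use of per-frame spectral energy as the loudness measure, and the function name are assumptions.

```python
import numpy as np

def keep_loud_frames(spectrogram, threshold_db=-40.0):
    """Drop quiet frames from a (num_frames, num_bins) magnitude spectrogram.

    A frame is kept when its energy is within threshold_db of the loudest
    frame. Returns the filtered spectrogram and the original indices of
    the kept frames (assumed criterion, for illustration only).
    """
    energy = np.sum(spectrogram ** 2, axis=1)
    # Energy relative to the loudest frame, in decibels; the small
    # constants avoid division by zero and log of zero.
    energy_db = 10.0 * np.log10(energy / (np.max(energy) + 1e-12) + 1e-12)
    kept = np.flatnonzero(energy_db >= threshold_db)
    return spectrogram[kept], kept
```

Returning the original frame indices alongside the truncated spectrogram matters: once quiet frames are dropped, row position no longer equals time, and any later offset estimate needs the true timestamp of each surviving frame.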

Second, to further reduce the dimensionality of the spectrogram in the frequency domain, we applied compressive sensing. This is a signal processing method for acquiring and reconstructing a signal that is sparse in some domain from far fewer samples than the Nyquist–Shannon sampling theorem requires. Although compressive sensing has been applied in a wide range of domains (for example, image compression, video compression, and song identification), we believe this is the first attempt to use compression technology to solve audio synchronization problems. To accommodate the entire human hearing range of 20 Hz to 20 kHz, audio in cinematic media is sampled at 44.1 kHz or 48 kHz. However, the energy of sound events is mostly concentrated at frequencies below 8 kHz. This makes cinematic audio very sparse in the frequency domain and conducive to compressive sensing.
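One common way to take compressive measurements is to multiply each frame by a fixed random matrix, reducing F frequency bins to M << F values. The sketch below uses a Gaussian measurement matrix. This choice, the value M = 32, and the function name are assumptions, since the article does not specify the measurement scheme.

```python
import numpy as np

def compress_frequency(spectrogram, num_measurements=32, seed=0):
    """Project each frame from F frequency bins down to num_measurements values.

    Uses a seeded Gaussian random matrix (an assumed choice) so that two
    tracks compressed with the same seed land in the same measurement space.
    """
    num_bins = spectrogram.shape[1]
    rng = np.random.default_rng(seed)
    # Scaling by 1 / sqrt(M) keeps squared frame norms the same in expectation.
    phi = rng.standard_normal((num_bins, num_measurements))
    phi /= np.sqrt(num_measurements)
    return spectrogram @ phi
```

Both audio tracks must be projected with the same matrix (here, the same seed), otherwise their signatures are not comparable. Because the projection is linear, identical frames in the two tracks still map to identical signature rows.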

With these efficient representations, we achieve a relative reduction in memory footprint of around 99.6%, from more than 520 MB to less than 3 MB per hour of audio. The time to match is consequently reduced from more than 40 minutes to less than six seconds. The total time to fingerprint is around 50 seconds on a single-CPU machine.
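To make the matching step concrete, here is a hedged sketch of estimating the time offset between two compact signatures. It assumes each signature row comes with a timestamp in milliseconds, uses nearest-neighbor search with a distance cutoff, and takes the median offset over accepted matches. The function name, the cutoff `max_dist`, and the median are illustrative choices, not details from the article.

```python
import numpy as np

def estimate_offset_ms(sig_a, times_a_ms, sig_b, times_b_ms, max_dist=1e-6):
    """Estimate how far track B lags track A, in milliseconds.

    sig_a, sig_b: (frames, M) signature arrays.
    times_a_ms, times_b_ms: timestamp of each signature row.
    Frames of A whose nearest neighbor in B is farther than max_dist
    (squared Euclidean) are treated as unmatched, for example dubbed speech.
    Returns None when no frame matches.
    """
    d2 = (
        np.sum(sig_a ** 2, axis=1)[:, None]
        - 2.0 * sig_a @ sig_b.T
        + np.sum(sig_b ** 2, axis=1)[None, :]
    )
    nearest = np.argmin(d2, axis=1)
    best = d2[np.arange(len(sig_a)), nearest]
    good = best <= max_dist
    if not np.any(good):
        return None
    offsets = times_b_ms[nearest[good]] - times_a_ms[good]
    return float(np.median(offsets))
```

Only sounds shared by both tracks, such as effects and background music, are expected to produce matches. Replaced dialogue simply fails the distance cutoff and does not contribute to the estimate.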

The impact for customers

Despite operating on a compressed representation, sync-defect detection performance remains high at a 99.4% F1 score, and the median error in predicting the exact sync offset is approximately 20 ms, well below the human-perceptible tolerance of 100-200 ms.
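The final decision step can be sketched as a simple rule over the per-match offsets. The article mentions a statistical model without describing it, so the median-based rule below is only a stand-in. The 140 ms tolerance is taken from the ITU J.248 figure cited earlier, and `min_matches` is a hypothetical safeguard.

```python
import numpy as np

def classify_sync(offsets_ms, tolerance_ms=140.0, min_matches=10):
    """Label a dubbed track from the offsets of its matched frames.

    Returns (label, median_offset_ms). The label is "undetermined" when
    there are too few matches to trust the estimate (assumed rule).
    """
    offsets = np.asarray(offsets_ms, dtype=float)
    if offsets.size < min_matches:
        return "undetermined", None
    median = float(np.median(offsets))
    label = "in_sync" if abs(median) <= tolerance_ms else "out_of_sync"
    return label, median
```

Using the median rather than the mean keeps a handful of spurious matches from dominating the estimate. A real system would likely also look at how tightly the offsets cluster before committing to a label.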

Our approach allows Prime Video to ensure sync-defect-free videos at scale for customers across the world who watch content in many dubbed languages.

Stay tuned for more from us!


Avijit Vajpayee

Applied Scientist II – Prime Video

Zhikang Zhang

Applied Scientist II – Prime Video Tech

Vimal Bhat

Senior Manager Applied Science – Prime Video
