Prime Video uses ML-driven subtitle synchronization to ensure a smooth viewing experience
Prime Video developed a language-agnostic system to flag and automatically synchronize out-of-sync subtitles.
Prime Video strives to ensure that customers can enjoy content in their preferred local language. Subtitles are an industry-standard and accessible way to convey spoken dialogue and provide a localized experience for customers. Every subtitle file consists of multiple subtitle blocks, each of which contains a piece of dialogue and its timing.
Importantly, subtitles must be in sync with audio to provide the best possible viewing experience. Our customer obsession drove us to ensure that Prime Video has high-quality content, including in-sync subtitles across our catalog. Subtitles that are out of sync with the audio stream, a problem known as “subtitle drift,” disrupt the viewing experience. To prevent out-of-sync subtitles from reaching customers, we built a system that identifies and corrects them. Our system works irrespective of the subtitle and audio languages.
We set out to have a drift detection, localization, and correction system that 1) is agnostic of the language of the subtitle and content audio stream, 2) localizes the drift at minute-level granularity, and 3) automatically corrects the drift. The system takes in the timings of the subtitles from the subtitle file for a small minute-level segment, uses a language-agnostic and noise-robust voice activity detection (VAD) system to identify speech durations in the segment, identifies if there’s subtitle drift, and calculates its amount. The subtitles are then adjusted by the given amount for the drifted segment.
Identifying multiple technical challenges
We identified and addressed four technical challenges to develop our automated subtitle drift correction system. First, the subtitles and the audio stream for a piece of content can be in different languages. Second, the text of subtitles in a language different from the audio stream is not a direct translation of the spoken audio. Third, the speech segments in the audio stream typically coexist with background or foreground music and sound effects. Fourth, we had to decide the granularity of drift localization and correction, which affects the system’s performance.
Our solution to detect subtitle drift was to match the timing of the subtitle blocks with the corresponding speech timings from the audio. As a first step, we used automatic speech recognition (ASR) and a traditional VAD model to generate the dialogue timings from the audio streams. However, these systems are limited in their language coverage and audio noise robustness. To overcome the language and audio noise barrier, we built an in-house language-agnostic and noise-robust VAD system. Our VAD model is based on a bidirectional gated recurrent unit (Bi-GRU), which identifies speech timings across multiple languages.
The VAD model is trained on an in-house corpus of 450 hours of content across more than 10 languages and five genres. Training on this corpus enables the model to generalize across languages and multiple noise categories. The VAD model achieves state-of-the-art performance across multiple in-house and open-source datasets.
In a conversational speech setting (for example, movies and TV shows), automatically demarcating dialogue boundaries is a challenging task. VAD systems typically produce longer speech segments than the subtitle block timings, which are manually identified based on scene understanding. Therefore, we built a machine learning (ML) model that takes in subtitle blocks and VAD timings for a small segment and identifies the drift amount. The system first finds the drifted segments and then intelligently reprocesses the segments to identify the drift amount.
Solving our technical challenges
We approached solving our four technical challenges in five steps: 1) audio and subtitle chunking at minute-level granularity, 2) identification of speech timings using the VAD model, 3) identification of statistical features for the time differences between the subtitle blocks and VAD timings, 4) identification of drift occurrence using the statistical features, and 5) reprocessing drifted segments to identify the drift amount. After the drift amount is calculated for each segment, the subtitles are corrected by shifting the blocks by the drift amount. The pipeline to identify drifted segments is outlined in the following diagram.
The diagram shows the following workflow:
1. Chunk audio and subtitles into small minute-level segments.
2. Compute the VAD-based speech timings.
3. Compute the statistical features based on the timings from VAD and subtitle blocks.
4. Use the drift detection model to identify whether the segment has drift.
5. Reprocess the drifted segment to identify the drift amount.
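As an illustration of the first step, here is a minimal sketch of chunking subtitle blocks into segments. The `SubtitleBlock` type, the `chunk_subtitles` function, the 60-second segment length, and the rule of assigning each block to the segment that contains its start time are all assumptions made for this example; the system described here only specifies minute-level granularity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleBlock:
    """Hypothetical container for one subtitle block."""
    start: float  # seconds from the start of the content
    end: float    # seconds from the start of the content
    text: str


def chunk_subtitles(blocks, segment_seconds=60.0):
    """Group blocks by the index of the segment that contains their start time.

    Assumption: a block that straddles a segment boundary is assigned to
    the segment in which it starts.
    """
    segments = defaultdict(list)
    for block in blocks:
        index = int(block.start // segment_seconds)
        segments[index].append(block)
    return dict(segments)


blocks = [
    SubtitleBlock(1.0, 3.5, "Hello."),
    SubtitleBlock(58.0, 61.0, "Wait."),   # straddles the 60-second boundary
    SubtitleBlock(62.5, 64.0, "What?"),
]
segments = chunk_subtitles(blocks)
for index in sorted(segments):
    print(index, [block.text for block in segments[index]])
```

The audio stream would be cut at the same boundaries so that each audio segment can be compared against the subtitle blocks that share its index.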
After we break the audio stream and subtitles into small segments, we pass the audio segments to our neural VAD model. The VAD model generates the probabilities for speech and non-speech. These probabilities are converted to predicted speech blocks using a threshold followed by a smoothing process.
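The thresholding and smoothing step can be sketched as follows. This is one plausible approach and not necessarily the one used in production: the function name, the 0.5 threshold, the frame length, and the two smoothing rules (merge speech runs separated by short gaps, then drop very short runs) are assumptions chosen for illustration.

```python
def probabilities_to_speech_blocks(probs, frame_seconds=0.02, threshold=0.5,
                                   max_gap_frames=15, min_frames=10):
    """Convert per-frame speech probabilities into (start, end) blocks in seconds.

    All parameter defaults are illustrative assumptions.
    """
    # 1. Threshold: collect runs of consecutive frames classified as speech.
    runs = []
    run_start = None
    for i, p in enumerate(probs):
        is_speech = p >= threshold
        if is_speech and run_start is None:
            run_start = i
        elif not is_speech and run_start is not None:
            runs.append((run_start, i))  # end index is exclusive
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(probs)))

    # 2. Smooth: merge runs separated by a gap of at most max_gap_frames.
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap_frames:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # 3. Smooth: drop runs shorter than min_frames, then convert to seconds.
    return [(start * frame_seconds, end * frame_seconds)
            for start, end in merged if end - start >= min_frames]


# Toy input with coarse 0.5-second frames so the result is easy to follow.
probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1]
speech_blocks = probabilities_to_speech_blocks(
    probs, frame_seconds=0.5, max_gap_frames=1, min_frames=2)
print(speech_blocks)
```

In the toy input, the one-frame dip at index 3 is bridged by the merge step, and the isolated speech frame at index 10 is discarded as too short, leaving a single predicted speech block.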
After this step, we compute a many-to-many mapping between the subtitle blocks and predicted speech blocks in the given segment. Next, we compute several features based on the timing differences between the mapped subtitle blocks and the predicted speech timings. These features include the difference in start and end timings, the signed values that identify whether a subtitle block starts or ends earlier than the predicted speech timings, and the overlap percentage between the subtitle block timings and the predicted speech block timings. Further, we compute several statistics of the features such as mean, standard deviation, quantiles, and other higher-order statistics. Finally, we pass these statistical features for a segment to a single-layer feed-forward neural architecture, which generates the probability of drift or no drift.
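To make the feature computation concrete, here is a minimal sketch using only the Python standard library. The mapping rule (pair a subtitle block with every speech block it overlaps), the function names, and the choice of mean, standard deviation, and median as summary statistics are assumptions for illustration; the actual feature set is larger, and the trained classifier that consumes these features is not shown.

```python
from statistics import mean, median, pstdev


def overlap_seconds(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_blocks(subtitles, speech):
    """Many-to-many mapping: pair each subtitle block with each speech block
    it overlaps. Assumption: overlap is the mapping criterion."""
    return [(sub, sp) for sub in subtitles for sp in speech
            if overlap_seconds(sub, sp) > 0.0]


def segment_features(subtitles, speech):
    """Summary statistics of timing differences for one segment.

    Returns None when no subtitle block overlaps any speech block.
    Assumes every subtitle block has a positive duration.
    """
    pairs = map_blocks(subtitles, speech)
    if not pairs:
        return None
    raw = {
        # Signed: negative means the subtitle starts (or ends) earlier.
        "start_diff": [sub[0] - sp[0] for sub, sp in pairs],
        "end_diff": [sub[1] - sp[1] for sub, sp in pairs],
        # Fraction of the subtitle block covered by the speech block.
        "overlap": [overlap_seconds(sub, sp) / (sub[1] - sub[0])
                    for sub, sp in pairs],
    }
    features = {}
    for name, values in raw.items():
        features[name + "_mean"] = mean(values)
        features[name + "_std"] = pstdev(values)
        features[name + "_median"] = median(values)
    return features


subtitles = [(1.0, 3.0), (5.0, 7.0)]
speech = [(0.5, 3.5), (4.5, 7.5)]
features = segment_features(subtitles, speech)
for name, value in features.items():
    print(name, value)
```

In this toy segment each subtitle block sits fully inside a slightly longer speech block, which mirrors the observation that VAD output tends to be longer than manually timed subtitle blocks.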
If a segment is found to be drifted, we intelligently reprocess the inputs to identify the amount of drift. For a predicted drifted segment, we artificially introduce shifts in the subtitle blocks in the forward and backward directions at certain granularities and recompute the probability of drift. The shift for which the system predicts no-drift is considered to be the drift amount. Finally, all the subtitles belonging to the segment are shifted by the drift amount, resulting in a correct alignment with the audio.
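The reprocessing step can be sketched as a search over candidate shifts. Everything named here is hypothetical: `estimate_drift`, the 0.5-second step, the 5-second search range, and the order in which candidates are tried are assumptions, and `toy_drift_probability` is a simple stand-in for the trained drift detection model, used only so the example runs end to end.

```python
def estimate_drift(subtitles, speech, drift_probability,
                   max_shift=5.0, step=0.5, threshold=0.5):
    """Return the first candidate shift (in seconds) for which the classifier
    predicts no drift, or None if no candidate qualifies.

    Candidates are tried in order of increasing magnitude, forward (+) and
    then backward (-). Range, step, and ordering are assumptions.
    """
    steps = int(max_shift / step)
    for k in range(1, steps + 1):
        for shift in (k * step, -k * step):
            shifted = [(s + shift, e + shift) for s, e in subtitles]
            if drift_probability(shifted, speech) < threshold:
                return shift
    return None


def toy_drift_probability(subtitles, speech):
    """Stand-in for the trained model: reports drift when subtitle starts are,
    on average, more than 0.25 s away from the nearest speech start."""
    gaps = [min(abs(s - sp_start) for sp_start, _ in speech)
            for s, _ in subtitles]
    return 1.0 if sum(gaps) / len(gaps) > 0.25 else 0.0


speech = [(10.0, 12.0), (20.0, 23.0)]
subtitles = [(12.0, 14.0), (22.0, 25.0)]  # every block is 2 seconds late

shift = estimate_drift(subtitles, speech, toy_drift_probability)
corrected = [(s + shift, e + shift) for s, e in subtitles]
print(shift, corrected)
```

A coarse step keeps the number of classifier calls small; a finer step would localize the drift more precisely at the cost of more evaluations per drifted segment.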
We developed an automated language-agnostic and noise-robust drift localization and correction system. Prior to this system, drifts were manually identified and corrected. This required linear watching of the content and at least bilingual expertise, which made the process inefficient. With this innovation, we can process a file in minutes and localize the drifted regions.
This innovation not only helps to assess the existing catalog quality, but is also used to detect and self-heal drift in both existing and newly created content. This ensures that all subtitles ingested into the Prime Video catalog are error-free, which is crucial to providing the best possible viewing experience for Prime Video customers.
Stay tuned for more!