Automatically detecting recaps, introductions, and credits in content at scale
Prime Video uses computer vision and video understanding techniques to detect different video content segments, such as introductions, recaps, and opening or ending credits.
In this article, we explore the techniques Prime Video uses to enrich its vast catalog with these segment annotations, providing a seamless and personalized viewing experience for customers.
The Prime Video content segmentation journey began in Amazonian fashion with Customer Obsession and a desire to improve customers' control over their viewing experience. We set out to segment videos into different parts, enabling Prime Video customers to skip repetitive segments (for example, skipping recaps when watching multiple episodes of the same TV series in one sitting).
We observed that many customers used our Skip Intro, Skip Recap, and Skip Credits buttons when they were available. The popularity of this feature made widespread availability of the Skip Intro option across Prime Video's catalog one of our biggest motivations to scale the annotation pipeline, moving from a slow manual process to an automated, scalable one.
Identifying the key customer requirements
Before we began work on our automated solution, we evaluated the following two customer requirements to ensure the best possible customer experience.
1) The Skip Intro option must be seamless
Our first challenge in automating the process was detection accuracy. Detecting incorrect segments has a negative impact on the customer's experience. For example, if a customer wants to skip the introduction segment of a TV show but the player skips more or less than they intended, the customer must spend extra time searching for the correct starting point. This experience is even more painful on devices that don't have a straightforward navigation system for finding the correct point.
2) Complete the process or don’t do it at all
Building a system that is aware of its own shortcomings was another challenge we faced during the design phase. A fully automated system can detect the existence of an introduction segment (some content doesn't contain one) and precisely locate the start and end timestamps of the detected segment. If the system detects an introduction segment but cannot precisely locate its start or end point, it should route the title to a manual annotation path. Conversely, if the system determines that the content contains no introduction segment, the content should be published without manual review. We had to ensure that the system minimized the total amount of content that needed to be manually annotated or verified; otherwise, we could not scale it efficiently.
Our journey toward finding a solution
We began collecting and annotating data to train a supervised system to automatically detect the introduction segment. We expected our powerful machines, in combination with our deep learning (DL) model, to find a common pattern with high precision and recall. However, we realized that there was no magic pattern to be extracted from the training data.
Introduction segments of episodic content are incredibly diverse. Some have music, some are just text over a black screen, and some contain no visual aspects at all, limited to audio playing over the main content. Our effort to tackle discrete parts of the problem didn't show promising results. We also tried simplifying the problem into two parts: 1) classification, to detect the existence of an introduction segment, and 2) localization, to locate its start and end points. This attempt didn't produce better results either, due to the diversity of the input data. The more creative the content producer becomes, the more challenging the problem is to solve, and we were put in the impossible situation of classifying creativity.
Despite mixed results, our findings guided us toward the right technology to segment content. Rather than collecting more data and spending more time annotating, we decided to look for a different approach that could better handle diverse data.
Even though we couldn't find a distinctive introduction pattern, we realized we might not need one. During our deep dive sessions, we found that many introduction segments in a season have highly similar visual or audio aspects, which shifted our focus from finding a universal pattern to finding repetitive segments within a season of each TV series. The following diagram visualizes common segments in a TV series season.
To identify similar segments among episodes in a season, we took two independent approaches: first, we checked visual similarity, and second, we checked audio similarity. To check visual similarity, we split the video into frames and used a deep learning image classification model to convert each frame into a vector. Converting visual data from RGB format into a vector lets us check the similarity of content effectively. Our DL model was trained on large open-source and internal data sets. We took the output of the layer before the classifier, which has a size of 4096 and therefore enough dimensions to compare two images along several aspects.
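As a rough sketch of this frame-to-vector step, the snippet below fakes the embedder with a fixed random projection; the real system uses a trained network, and only the 4096-dimensional output size comes from the description above:

```python
import numpy as np

EMBED_DIM = 4096  # size of the pre-classifier layer described above

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the DL embedder: the real system runs each RGB frame
    through a trained image-classification network and takes the
    4096-dimensional pre-classifier activations. Here a fixed random
    projection of the flattened pixels just makes the shape and the
    unit-normalization contract concrete."""
    rng = np.random.default_rng(0)                      # fixed, not learned
    proj = rng.standard_normal((EMBED_DIM, frame.size))
    vec = proj @ frame.ravel().astype(np.float64)
    return vec / np.linalg.norm(vec)                    # unit norm for L2 checks

frame = np.zeros((8, 8, 3))        # tiny dummy "RGB frame"
frame[2:6, 2:6, 0] = 1.0
vec = embed_frame(frame)
```

Unit-normalizing the vectors makes the later L2-distance comparisons behave like cosine similarity, which is a common choice for embedding comparison.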
After converting each frame into a vector, we used a clustering algorithm that measures the L2 distance between vectors: frames whose similarity is above a certain threshold are grouped into the same cluster. The output of these steps is visualized in the following diagram.
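The clustering step can be sketched as a greedy, threshold-based grouping on L2 distance; the exact algorithm and threshold used in production are not specified, so this is only illustrative:

```python
import numpy as np

def cluster_frames(vectors, threshold):
    """Greedy single-pass clustering: a frame joins the first cluster whose
    centroid is within `threshold` L2 distance; otherwise it starts a new
    cluster. The centroid is kept as a running mean of its members."""
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if np.linalg.norm(v - c["centroid"]) < threshold:
                c["members"].append(i)
                c["centroid"] += (v - c["centroid"]) / len(c["members"])
                break
        else:
            clusters.append({"centroid": np.asarray(v, dtype=float).copy(),
                             "members": [i]})
    return clusters

# Two near-identical "frame vectors" end up together; the distant one is separate.
vectors = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
clusters = cluster_frames(vectors, threshold=1.0)
```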
The following image shows how introduction frames are repeated across different episodes of Tom Clancy’s Jack Ryan: Season 2.
The next step is differentiating between consecutive and non-consecutive (noisy) clusters. There are many noisy clusters, such as all-black frames, recurring locations within a season, and establishing shots. We used several filtering mechanisms to eliminate noisy clusters.
First, we limited the location of the cluster: because most introduction segments appear in the first X% of the content, we automatically removed all clusters that appear in the last (1-X)% of the content. Second, we removed clusters that are non-consecutive or whose consecutive run is shorter than a predefined length. This value can change based on the content type and other factors, to filter out noisy repetitive frames and achieve maximum accuracy.
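A minimal sketch of these first two filters, with illustrative values for the intro window (X) and the minimum consecutive-run length:

```python
def longest_run(indices):
    """Length of the longest run of consecutive frame indices."""
    indices = sorted(indices)
    best = run = 1
    for a, b in zip(indices, indices[1:]):
        run = run + 1 if b == a + 1 else 1
        best = max(best, run)
    return best

def keep_cluster(frames, total_frames, intro_window=0.3, min_run=5):
    """frames: list of (episode_id, frame_index) pairs in one cluster.
    Keep the cluster only if every frame lies in the first `intro_window`
    fraction of the episode AND some episode contributes at least `min_run`
    consecutive frames. Both parameter values are illustrative, not
    Prime Video's actual settings."""
    if any(idx > intro_window * total_frames for _, idx in frames):
        return False                      # location filter: intros appear early
    by_episode = {}
    for ep, idx in frames:
        by_episode.setdefault(ep, []).append(idx)
    return any(longest_run(ix) >= min_run for ix in by_episode.values())
```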
The third filter, which significantly reduced the total number of noisy frames, is based on the distribution of frames across episodes in each cluster. Since each episode has at most one introduction, we automatically removed clusters in which the majority of frames come from one or a few episodes. The distribution score (D-Score) threshold depends on the total number of episodes in a season and has a lower and an upper bound. The D-Score and its thresholds can change dynamically based on other factors in the content. The following diagram shows this approach.
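The D-Score formula itself isn't published, so the following is one plausible sketch with the same intent: reward clusters that draw frames from many episodes, and penalize clusters dominated by a single episode:

```python
from collections import Counter

def distribution_score(frame_episodes, num_episodes):
    """frame_episodes: the episode ID of every frame in the cluster.
    An assumed D-Score form: high when the cluster covers many episodes
    evenly, low when one episode contributes most of the frames (e.g. a
    single long black scene)."""
    counts = Counter(frame_episodes)
    coverage = len(counts) / num_episodes                    # episodes represented
    dominance = max(counts.values()) / len(frame_episodes)   # largest episode share
    return coverage * (1.0 - dominance)

even = [ep for ep in range(8) for _ in range(5)]   # 5 frames from each of 8 episodes
noisy = [0] * 40                                   # 40 frames, all from episode 0
```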
Despite promising results, this solution didn't fully solve our problem and required auxiliary systems to find other forms of introduction. The approach above succeeds at detecting highly repetitive visual introductions, but other forms of introduction vary visually among episodes of the same season. We approached this problem by adding audio signals to our system, repeating the same strategy with acoustic fingerprinting techniques. Acoustic fingerprinting allowed us to identify highly repetitive audio segments and apply the same filters to remove noisy repetitive audio, such as sound effects (footsteps or a closing door).
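A toy illustration of the idea behind acoustic fingerprinting, reducing each audio window to its dominant frequency so that repeated audio produces matching runs of values, the same way repeated frames produce visual clusters:

```python
import numpy as np

def audio_fingerprint(samples, sr, win=1024):
    """Toy fingerprint: reduce each non-overlapping window to its dominant
    frequency in Hz. Runs of matching values across episodes then play the
    same role as repeated visual frames. Production fingerprints (e.g.
    spectral-peak constellations) are far more robust than this sketch."""
    peaks = []
    for h in range(len(samples) // win):
        spectrum = np.abs(np.fft.rfft(samples[h * win:(h + 1) * win]))
        peaks.append(spectrum.argmax() * sr / win)   # bin index -> Hz
    return peaks

# A pure 1 kHz tone fingerprints as a constant 1000 Hz in every window.
sr = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(8 * 1024) / sr)
fingerprint = audio_fingerprint(tone, sr)
```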
To precisely locate the start and end introduction timecodes, we take the timecodes of the first and last frames of the consecutive cluster for each episode and annotate the video. The combination of audio and visual signals, followed by post-processing techniques such as filtering and scoring, allowed us to effectively remove noisy clusters and identify introduction segments with a maximum error of 41 ms, several times better than our previous best of 300 ms.
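Deriving the timecodes from the cluster is straightforward; a sketch:

```python
def intro_bounds(cluster_frame_indices, fps):
    """Predicted intro span for one episode: from the first frame of its
    consecutive cluster to the end of the last frame, in seconds."""
    start = min(cluster_frame_indices)
    end = max(cluster_frame_indices)
    return start / fps, (end + 1) / fps

# Frames 120-239 at 24 fps -> the intro runs from 5.0 s to 10.0 s.
bounds = intro_bounds(range(120, 240), fps=24)
```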
Computing a confidence score for each prediction helps us flag predictions with lower confidence for manual annotation or verification. During our process, we easily found repetitive segments with frame-level granularity, which is an excellent indicator of the existence of an introduction in a season. We used a set of labeled data to train a classification model that uses cluster metadata, such as 1) the number of clusters, 2) the quality (similarity) of each cluster, 3) the D-Score, and 4) consecutiveness, to distinguish introduction clusters from noisy clusters. The higher the model's output, the higher the likelihood that the season contains an introduction. When there are no clusters, or only a few clusters with low D-Scores and no consecutive order, our system can confidently predict that the content contains no intro and automatically publish the result without any manual intervention.
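One common way to realize such a classifier is a logistic score over the four metadata features; the weights below are invented for illustration, not the trained model's:

```python
import numpy as np

# Feature order: [num_clusters, mean_similarity, d_score, consecutiveness].
# These weights are made up for illustration; the real model was trained
# on labeled seasons.
weights = np.array([0.2, 1.5, 2.0, 1.0])
bias = -2.5

def intro_confidence(features):
    """Logistic score over cluster metadata: close to 1 means the season
    almost certainly has an intro; close to 0 means it can be published
    with no manual review."""
    z = float(np.dot(weights, features) + bias)
    return 1.0 / (1.0 + np.exp(-z))

strong = np.array([3.0, 0.9, 0.9, 0.8])   # many good, consecutive clusters
weak = np.array([0.0, 0.0, 0.0, 0.0])     # no clusters found
```

Scores near either extreme can be published automatically, while mid-range scores are the natural candidates for manual review.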
Furthermore, for annotated episodes, we use the average weight of consecutive clusters in combination with the coverage of the detected part to indicate the level of repetitiveness of each prediction. A higher value indicates a higher level of repetitiveness and, accordingly, a higher level of confidence.
The coverage of repetitiveness indicates what percentage of an introduction segment has no repetitive counterpart in other episodes. To better understand this concept, consider a situation in which the system finds two or more highly repetitive segments across all episodes in a season, all longer than a minimum acceptable duration (for example, 10 seconds) and all likely to be part of the introduction. We found this pattern when content creators use the introduction segment to introduce new actors or new locations in later episodes, and it leads to fragmentation in our detection. To solve this challenge, we used a post-processing technique that uses the lengths of the repetitive segments and the duration of the gap between them either to merge the repetitive parts or to select the part with the higher repetitiveness score. The following diagram shows the discontinuity and gap between repetitive parts in episodic content.
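A sketch of this merge-or-select post-processing, with an illustrative maximum gap:

```python
def merge_segments(segments, max_gap=15.0):
    """segments: (start_s, end_s, repetitiveness_score) tuples sorted by
    start. Merge neighbors whose gap is at most `max_gap` seconds; among
    whatever fragments remain, keep the one with the highest score.
    `max_gap` is an illustrative value, not Prime Video's setting."""
    merged = [segments[0]]
    for start, end, score in segments[1:]:
        prev_start, prev_end, prev_score = merged[-1]
        if start - prev_end <= max_gap:   # small gap: treat as one intro
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:                             # large gap: keep fragments apart
            merged.append((start, end, score))
    return max(merged, key=lambda seg: seg[2])
```

With a 5-second gap the two fragments merge into one intro; with a 100-second gap the higher-scoring fragment wins.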
During the process, we found that the frame extraction rate can have a significant impact on finding similar or highly similar content, especially because introduction content is short and content creators pack a lot of information into visual effects and drastic transition shots, which can make two consecutive frames look very different.
Extracting every frame would slow down the process and make the detection module expensive, since every single frame must be fed through a DL model to be converted into a vector. After several experiments measuring the impact of the frame extraction rate, we used the following chart to find the best operating point for our production system. We used a dynamic mapping system that selects different frame extraction rates based on content metadata (such as action versus drama content) and preprocessing information that detects visual effects.
Although a higher frame rate produces better results, it drastically increases the processing time and cost per episode, because every extra frame must run through the deep learning models to be converted into a vector. Finding the right balance between precision, recall, and cost required extensive research and experimentation. We used a dynamic mechanism that selects the frame rate based on other factors in the season to achieve the highest level of accuracy.
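A dynamic frame-rate mapping of this kind might look like the sketch below; the genres, conditions, and multipliers are invented for illustration:

```python
def pick_frame_rate(genre, has_fast_transitions, base_fps=1.0):
    """Map content metadata to a frame extraction rate: sample more densely
    when rapid cuts make consecutive frames dissimilar. The genres,
    conditions, and multipliers here are hypothetical, not the production
    mapping."""
    rate = base_fps
    if genre == "action":
        rate *= 2          # action content tends to cut faster
    if has_fast_transitions:
        rate *= 2          # preprocessing flagged heavy visual effects
    return rate
```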
Prime Video’s work on detecting and localizing different video segments shows how simple techniques, along with unsupervised clustering methods, can produce high-quality output with minimal data labeling. Furthermore, the process showed how unsupervised clustering techniques can provide transparency into the decision-making process and produce explainable output.
Stay tuned for more from us!