
How Prime Video uses contrastive learning to accelerate automatic video-understanding at scale

Prime Video invents new state-of-the-art weakly and self-supervised contrastive learning algorithms to reduce its dependence on large amounts of labeled training data.

At Prime Video, we’re always striving to invent best-in-class experiences for our customers that help them watch videos in a more engaged and informed way. Some of these customer experiences include maturity ratings and content descriptors to enable trust and safety, actor information provided through X-Ray to improve our customers’ engagement, and the ability to skip intros, recaps, and end-credits to offer a more laid-back viewing experience.

Given the large size of our video catalog, it’s not practical to manually find all the content attributes that enable these novel and engaging customer experiences. So, we need to use computer vision and machine learning (CV/ML) to maximally automate the understanding of our video content at scale.

Many previous works on automatic video understanding used fully supervised learning approaches that require large amounts of labeled data for model training. However, acquiring and managing these labels is one of the most time-consuming and costly steps in the overall model development and maintenance lifecycle. To address this bottleneck, Prime Video technologists invented state-of-the-art weakly and self-supervised contrastive learning approaches to train video-understanding models efficiently and effectively in limited-label settings. Several of these works have been accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), the premier international venue for cutting-edge research in computer vision.

Automatic scene boundary detection

One example of Prime Video's state-of-the-art work on contrastive learning was published at CVPR 2021. Our paper addresses the problem of automatically detecting the timestamps at which scenes in movies and TV episodes begin and end. Scenes play a crucial role in breaking the story line of movies and TV episodes into semantically cohesive parts, and they inform multiple downstream tasks, including video classification, search, and summarization. However, given their complex temporal structure, finding scene boundaries can be a challenging task that requires large amounts of labeled training data.

To address this challenge, we proposed a novel contrastive learning approach called ShotCoL. This approach naturally makes use of the underlying production process for movies and TV episodes in which directors and editors carefully arrange different shots and scenes to communicate the story in a smooth and believable manner.

This underlying process provides a simple yet effective invariance: nearby shots tend to have the same set of actors enacting a semantically cohesive story arc, and are therefore, in expectation, more similar to each other than a set of randomly selected shots. This invariance enables us to treat nearby shots as augmented versions of each other, where the augmentation function implicitly captures the local scene structure significantly better than previously used augmentation schemes (for example, various image preprocessing techniques).

Specifically, given a shot, we try to: 1) maximize its similarity with its most similar neighboring shot, and 2) minimize its similarity with randomly selected shots. The following image shows representative frames of ten shots from two different scenes in the “Stuart Little” movie.

The diagram illustrates representative frames of 10 shots from two contiguous scenes of the movie Stuart Little. The story arc of each scene is distinguishable and semantically coherent. The first scene is about a boy coming down the stairs of a home while his family waits for him on the ground floor. The second scene is about a race between toy sailing boats. Shots 3, 4, and 5 from scene 1, and shots 1 and 2 from scene 2, form the current shot's neighborhood. Neighborhood shots are considered augmented versions of each other. Given a current shot, also called a query shot, a similar shot from its neighborhood is selected as the positive key shot, and a set of negative key shots is randomly selected. These keys are used to maximize the similarity between the query and the positive key, and to minimize the similarity of the query with the randomly selected negative keys.

The story arc of each scene is distinguishable and semantically coherent. We consider similar nearby shots (for example, shot 5 and shot 3) as augmented versions of each other. This augmentation approach can capitalize on the underlying film-production process and encode the scene structure better than the existing augmentation methods.

Given a current shot (the query), we find a similar shot (the key) within its neighborhood, and then 1) maximize the similarity between the query and the key, and 2) minimize the similarity of the query with randomly selected shots.
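To make this query/key formulation concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss that treats the most similar neighboring shot as the positive key and randomly sampled shots as negatives. It assumes shot embeddings have already been produced by an encoder; the function name, tensor shapes, and temperature value are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def shot_contrastive_loss(query, neighbor_embs, negative_embs, temperature=0.07):
    """query: (D,) embedding of the current shot.
    neighbor_embs: (N, D) embeddings of shots in the query's temporal neighborhood.
    negative_embs: (K, D) embeddings of randomly sampled shots."""
    query = F.normalize(query, dim=0)
    neighbors = F.normalize(neighbor_embs, dim=1)
    negatives = F.normalize(negative_embs, dim=1)

    # Pick the most similar neighboring shot as the positive key.
    positive_sim = (neighbors @ query).max()

    # InfoNCE: pull the query toward the positive key, push it away from random shots.
    logits = torch.cat([positive_sim.unsqueeze(0), negatives @ query]) / temperature
    target = torch.zeros(1, dtype=torch.long)  # the positive key sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Example usage with random features standing in for shot embeddings.
loss = shot_contrastive_loss(torch.randn(128), torch.randn(4, 128), torch.randn(16, 128))
```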

Using this approach, our learned shot representation for scene boundary detection offers state-of-the-art performance on the public MovieNet benchmark while requiring only around 25% of the training labels, using nine times fewer model parameters, and offering seven times faster runtime than previous state-of-the-art approaches. For more information about our work on automatic scene boundary detection, see the full paper, Shot contrastive self-supervised learning for scene boundary detection, on the Amazon Science website.

Learning scene representation using movie similarities

Just as finding "where" scenes start and end is important, understanding "what" is happening in the detected scenes is crucial for enabling a variety of downstream use cases related to video moderation, search, and recommendation.

As with scene boundaries, labeling the content of movie scenes is a time-consuming process, which makes applying end-to-end supervised methods to scene understanding challenging. Moreover, directly using image-based visual representations for scene-understanding tasks is not effective, given the large gap between the image domain and the movie-scene domain. To address these challenges, we proposed a novel contrastive learning algorithm to learn a general-purpose scene representation that works well for a diverse set of scene-understanding tasks.

Our key intuition, illustrated in the following figure, is that commonly available sources of movie-level information (for example, genre, synopsis, and co-watch information) can effectively guide the process of learning a generalizable scene representation. Specifically, we used movie-level information to define a measure of movie similarity, and used that measure during contrastive learning to limit our search for positive scene pairs to movies that are considered similar to each other.

The figure illustrates two similar movies with the same genre along with four non-similar movies with different genres. Thematically similar scenes from the similar movies are automatically selected and used in a contrastive learning setting, along with randomly selected scenes from the non-similar movies. The general-purpose scene representation learned as a result of this process is shown to support multiple downstream tasks, including the classification of scene, actor, director, and writer.

This approach allows us to find positive scene pairs that are not only visually similar but also semantically relevant, providing a much richer set of geometric and thematic data augmentations than previously employed augmentation schemes. Furthermore, unlike previous contrastive learning approaches that mostly focus on images or shots, our approach builds on recent developments in vision transformers to allow variable-length, multi-shot inputs. This enables our method to seamlessly incorporate the interplay among multiple shots, resulting in a more informative representation.
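As a rough illustration of how movie similarity can constrain the search for positives, the following sketch samples anchor and positive scenes only from movies deemed similar, choosing the candidate scene closest to the anchor in embedding space. The data structures, the source of the similarity relation, and the sampling scheme are illustrative assumptions rather than the paper's implementation.

```python
import random
import torch
import torch.nn.functional as F

def sample_scene_pairs(scenes_by_movie, similar_movies, num_pairs):
    """scenes_by_movie: dict of movie_id -> list of (D,) scene-embedding tensors.
    similar_movies: dict of movie_id -> list of movie_ids judged similar
    (for example, via shared genre, synopsis, or co-watch signals).
    Returns a list of (anchor, positive) embedding pairs drawn only from similar movies."""
    pairs = []
    movie_ids = list(scenes_by_movie)
    for _ in range(num_pairs):
        anchor_movie = random.choice(movie_ids)
        candidate_movies = similar_movies.get(anchor_movie, [])
        if not candidate_movies:
            continue  # no similar movie available; skip this draw
        positive_movie = random.choice(candidate_movies)
        anchor = random.choice(scenes_by_movie[anchor_movie])
        # Choose the scene from the similar movie that is closest to the anchor,
        # so positives are thematically related rather than arbitrary.
        candidates = torch.stack(scenes_by_movie[positive_movie])
        sims = F.normalize(candidates, dim=1) @ F.normalize(anchor, dim=0)
        pairs.append((anchor, candidates[sims.argmax()]))
    return pairs
```

The resulting pairs can then be fed to a standard contrastive objective, with scenes from non-similar movies serving as negatives.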

Using a newly collected internal dataset of 30,340 movies to learn our scene representation, we demonstrated the flexibility of our approach to handle both individual shots and multi-shot scenes as inputs, and we outperformed existing state-of-the-art results on 11 downstream tasks across multiple public benchmark datasets.

Automatic frame captioning via cross-modal representation learning

Contrastive learning has also been quite successful in weakly supervised settings, where large-scale, weakly correlated multimodal data (for example, image-text pairs) is used to learn cross-modal representations. In particular, the recently proposed CLIP model has garnered a lot of attention due to its impressive zero-shot recognition ability and excellent transfer performance on downstream tasks.

However, methods like CLIP are data- and compute-inefficient, largely because of the assumptions they make about the web-harvested data used for training: the caption for each image is modeled as accurately and exclusively describing only that image. Moreover, with larger batch sizes (32K for CLIP), the likelihood of sampling negatives with high semantic similarity increases, which can further degrade the learned representations, especially those associated with the semantics shared by these false negatives. Prime Video addressed these challenges in a paper accepted at CVPR 2022 as an oral presentation, in which we proposed a novel method that more accurately models the many-to-many relationships between the images of web-harvested datasets and their corresponding captions, using soft probabilities rather than hard pairing labels.

This intuition is illustrated in the following figure. CLIP learns a joint vision-language embedding space by bringing corresponding image-text representations together (the green links) while repelling unpaired instances away from each other (the red links). This formulation doesn't account for potential semantic similarity between negative samples. In contrast, we propose learning to predict a distribution of soft-alignment targets (the dotted blue edges) within a given minibatch, which allows our model to learn more robust representations. This robustness is evident when comparing predicted distributions on an out-of-distribution image from the ImageNet-R dataset, where, unlike CLIP, our method correctly classifies a goldfish rendered in a stained-glass pane.

The figure illustrates a comparison of the proposed approach with the previous state-of-the-art approach, CLIP, which learns a joint vision-language embedding space by bringing corresponding image-text representations together while repelling unpaired instances away from each other. Green links represent corresponding image-text instances, while red links represent unpaired image-text instances. The proposed approach instead learns to predict a distribution of soft-alignment targets, represented by dotted blue edges. The robustness of the proposed approach is illustrated by comparing its predicted distributions on an out-of-distribution image of a goldfish rendered in a stained-glass pane; unlike CLIP, the proposed method correctly classifies this test image.

Specifically, our language-image pretraining approach uses progressive self-distillation and soft image-text alignment targets to more efficiently learn from noisy data. Instead of explicitly finding, correcting, or even pruning noisy correspondences, our joint student-teacher model dynamically generates a new set of soft-alignments for a random subset of images and captions in every minibatch. This enables our method to model many-to-many relationships while simultaneously re-calibrating potentially poorly matched instances without needing to identify them. Over the course of training, our network generates soft alignments for increasingly larger subsets of a minibatch, effectively becoming its own teacher. We identified several key elements that allow the student network to predict its targets without representation collapse or reinforcing its mistakes.
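The following sketch illustrates the general idea of replacing hard one-to-one CLIP targets with teacher-predicted soft alignments for a random subset of the minibatch. The tensor shapes, the fraction of swapped targets, and the temperature are illustrative assumptions; the teacher design and training schedule of the actual method are described in the paper.

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(img_emb, txt_emb, teacher_img, teacher_txt,
                        soft_fraction=0.5, temperature=0.07):
    """img_emb, txt_emb: (B, D) student embeddings for a minibatch of image-text pairs.
    teacher_img, teacher_txt: (B, D) embeddings from the teacher model."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature            # student image-to-text logits

    # Hard CLIP-style targets: each image matches only its own caption.
    batch = img.size(0)
    targets = torch.eye(batch, device=img.device)

    # Teacher-predicted soft alignment distribution over captions for each image.
    with torch.no_grad():
        t_logits = F.normalize(teacher_img, dim=1) @ F.normalize(teacher_txt, dim=1).t()
        soft_targets = F.softmax(t_logits / temperature, dim=1)

    # Swap in soft targets for a random subset of the batch; growing this subset
    # over training corresponds to the progressive self-distillation schedule.
    swap = torch.rand(batch, device=img.device) < soft_fraction
    targets[swap] = soft_targets[swap]

    # Cross-entropy against the (possibly soft) target distributions.
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```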

We used multiple pretraining datasets to extensively compare our approach to CLIP on 14 benchmark datasets, where our approach consistently outperforms CLIP under multiple settings. Analysis using an ImageNet-based robustness test-bed shows that our method offers better effective robustness to natural distribution shifts than both ImageNet-trained models and CLIP. Pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with the number of training examples. Finally, the simplicity of our approach means that it can be readily incorporated into existing and future methods. Our approach can be applied to accurately predict textual captions of video frames, enabling our technologists to quickly retrieve large subsets of frames and shots that are relevant to a user-provided set of textual queries and, after some lightweight curation, use the retrieved frames as labeled data. For more information about this work, see our Robust cross-modal representation learning with progressive self-distillation paper on the Amazon Science website.

Conclusions

Prime Video's work on contrastive learning in self-supervised and weakly supervised settings has shown that it is an effective way to address an important bottleneck: acquiring and managing the large sets of labeled data needed to train a variety of models for long-form video understanding.

Going forward, we plan to further improve the efficiency of our contrastive learning approaches. So, stay tuned for more from us.

Senior Principal Scientist – Prime Video