Automatically identifying scene boundaries in movies and TV shows
Prime Video beat previous state-of-the-art work on the MovieNet dataset by 13% with a new model that is 90% smaller and 84% faster.
In June 2021 at the conference on Computer Vision and Pattern Recognition (CVPR 2021), Prime Video presented ShotCoL, a state-of-the-art self-supervised algorithm that we developed for scene boundary detection.
Scene boundary detection is the task of identifying where scenes in a movie begin and end. At Prime Video, this foundational capability is used to insert advertisements in advertising-based video-on-demand (AVOD) content at the least disruptive moments in the content’s timeline, and in applications based on cinematic content understanding, such as scene classification, video retrieval, and video summarization.
ShotCoL achieves 13% higher average precision than previous work on the publicly available MovieNet dataset, while running 84% faster and using 90% fewer model parameters. ShotCoL is also more data efficient than previous approaches: it requires 75% less labeled data during downstream evaluation to match the previous state-of-the-art performance.
ShotCoL leverages contrastive learning, a form of self-supervised learning that became popular in 2019 for image-based tasks. Contrastive learning teaches a model to distinguish between similar and dissimilar examples that are defined without human labels. While previous works rely on image-based contrastive learning, which forms a positive pair by augmenting an image (for example, flipping or rotating it), we are among the first to apply contrastive learning to videos, and specifically to movies.
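To make the contrastive objective concrete, here is a minimal NumPy sketch of the standard InfoNCE loss that most contrastive learning methods optimize: each anchor embedding should score highest against its own positive and lower against every other example in the batch. This is a generic illustration, not ShotCoL's exact implementation; the function name, batch, and temperature value are our assumptions.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss (illustrative sketch).

    anchors, positives: (N, D) arrays of L2-normalized embeddings, where
    positives[i] is the positive example for anchors[i].
    """
    # Pairwise cosine similarities, scaled by temperature.
    logits = anchors @ positives.T / temperature  # (N, N)
    # Row-wise log-softmax; the correct "class" for anchor i is positive i,
    # which sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

loss_matched = info_nce_loss(emb, emb)         # positives line up with anchors
loss_shuffled = info_nce_loss(emb, emb[::-1])  # positives deliberately misaligned
```

When positives line up with their anchors, the loss is low; when they are misaligned, it grows, which is exactly the pressure that pulls similar examples together in embedding space.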
Image augmentation doesn’t work well for videos because it ignores temporal context, so we designed a new contrastive learning algorithm. The key intuition behind ShotCoL is that nearby movie shots are more likely to be similar to each other than to shots farther away. We define positive pairs by searching a local neighborhood of shots and picking the most similar one; the model then learns to contrast these similar shots against randomly selected, more distant shots. This leads to a representation that effectively clusters similar shots and thereby localizes scene boundaries.
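The positive-pair selection described above can be sketched as a nearest-neighbor search over a local temporal window. The snippet below is a hypothetical illustration, not the production code: the function name `mine_positive`, the window size, and the toy two-scene embeddings are all our assumptions (in ShotCoL the embeddings come from the encoder being trained).

```python
import numpy as np

def mine_positive(shot_embs, anchor_idx, window=3):
    """Pick the positive for a shot: the most similar shot within
    +/- `window` positions on the timeline (excluding the anchor itself).

    shot_embs: (T, D) L2-normalized shot embeddings in timeline order.
    Returns the index of the chosen positive shot.
    """
    lo = max(0, anchor_idx - window)
    hi = min(len(shot_embs), anchor_idx + window + 1)
    neighbors = [i for i in range(lo, hi) if i != anchor_idx]
    sims = shot_embs[neighbors] @ shot_embs[anchor_idx]  # cosine similarities
    return neighbors[int(np.argmax(sims))]

# Toy timeline: shots 0-3 share one visual "scene", shots 4-7 another.
rng = np.random.default_rng(1)
shots = np.vstack([rng.normal([5.0, 0.0], 0.1, size=(4, 2)),
                   rng.normal([0.0, 5.0], 0.1, size=(4, 2))])
shots /= np.linalg.norm(shots, axis=1, keepdims=True)

pos = mine_positive(shots, anchor_idx=2)  # neighborhood spans both scenes
```

Even though the window around shot 2 reaches into the second scene, the mined positive comes from the same scene, because same-scene shots are the most visually similar neighbors.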
After publishing the paper at CVPR 2021, we deployed the model to drive a 43% reduction in operator handling time for advertisement cue point insertion in AVOD content. The model is one of the foundational components that break a video into its logical sub-parts to drive applications based on cinematic content understanding. You can read more about the model in our Automatically identifying scene boundaries in movies and TV shows article on Amazon Science.