
Prime Video uses automatic field registration to create immersive viewing experiences for live sports

Prime Video used computer vision technology to reinvent sports-field tracking for monocular broadcasting videos.

At Prime Video, we’re constantly expanding our live sports offerings, which include exclusive coverage of NFL Thursday Night Football (TNF). Providing contextual information about player actions creates an engaging and immersive viewing experience for our customers. To do this, Prime Video developed sports-field tracking technology that enables customer experiences such as highlighting key players during a play, displaying player speed and distance covered, and using visual aids to provide more insight into fouls (for example, offside fouls in soccer).

An important customer experience factor is mapping the playing field of a particular sport to a standardized field image of that sport. The following image provides an example of this.

An image of a football game is shown. Prime Video renders the playing field of an American football game to a standardized field image, maps the players to their locations on the field, and shows corresponding player positions on the diagram next to it.

A standardized field image of a football game.

We use a homography transformation computed from manually labeled key points to perform this 2D-to-2D mapping for each video frame. These key points are marked both on the field as it appears in the video and on the field’s standardized image. However, manual labeling is time-consuming, so we needed to automate the process of finding the homography transformation for each video frame.
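To make the 2D-to-2D mapping concrete, here is a minimal sketch of how a homography can be recovered from four labeled point pairs and then applied to a pixel. It is not Prime Video's implementation: the function names, the pixel coordinates, and the template coordinates are all hypothetical, and a production system would typically use more than four points with a robust estimator.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Estimate the 3x3 matrix H (with H[2][2] fixed to 1) mapping src to dst.

    Uses exactly four point pairs; no three points may be collinear.
    """
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("this sketch needs exactly four point pairs")
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each pair contributes two linear equations in the eight unknowns.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a 2D point through H, dividing by the homogeneous coordinate."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


# Hypothetical labeled key points: pixels in the frame and the matching
# positions on the standardized field image (units are illustrative).
image_pts = [(100, 400), (1180, 420), (900, 150), (300, 140)]
field_pts = [(0.0, 0.0), (100.0, 0.0), (100.0, 53.3), (0.0, 53.3)]

H = homography_from_points(image_pts, field_pts)
player_on_field = apply_homography(H, (640, 300))  # a pixel inside the frame
```

Once H is known for a frame, any pixel, such as a player's foot position, can be placed on the standardized field image with a single call to `apply_homography`.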

To automate the underlying process, we propose an innovative approach for automatic sports field registration that can run in real time and is based on computer vision and deep learning technologies. Our approach uses the RGB screen image as the only input to train a deep neural network to simultaneously localize both sparse key points and dense landmarks on the sports field. Then, these automatically detected key points and landmarks are jointly used to compute the homography transformation and perform the field registration for each frame of a sports video.

Automatic field registration framework

A key challenge for reliably and accurately performing sports field registration is the lack of sufficiently distinct field features (for example, insufficient landmarks or line intersections on the image). This challenge is prevalent in the top five sports in the United States (football, soccer, basketball, ice hockey, and tennis) and is caused by the following three factors:

  • Uniform field appearance – The sports field’s appearance has a uniform texture (for example, the middle area of a soccer field is mostly occupied by green grass).
  • Narrow field-of-view – The camera only captures a small portion of the field.
  • Field occlusion – The players occlude the sports field’s landmarks (for example, multiple players in the same area occluding landmarks).

The following three images show the impact of these three factors in real life.

A series of three images. The first shows a soccer game where the field has a uniform appearance. The second shows an American football game where the view of the field is very narrow. The third shows how players in a basketball game might block the view of the court markings around the hoop.

The impact of uniform field appearance, narrow field-of-view, and field occlusion can be significant.

To address the lack of distinct features, Prime Video developed a new field registration framework that takes a monocular broadcast video and incrementally computes the homography for each video frame. Our framework provides the following two key technical innovations.

The first innovation is the detection of a uniformly distributed grid of key points. Typically, key points used in homography computation are defined only at the corners and line intersections of the field. Such sparse points cannot cover the entire field, and the uncovered parts can degrade overall registration accuracy. In contrast, we defined a grid of uniformly distributed key points that covers the entire field, so that every part of it has sufficient key points. This produces a more accurate field registration. The following image shows traditional key points and our key-points grid.

A soccer field with a grid of uniformly distributed key points that covers the entire field, so that every part of it has sufficient key points.

A soccer field both without and with a grid of uniformly distributed key points.
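As an illustration of the idea, a uniform grid of key points on the field template can be generated as follows. This is a sketch under assumptions: the function name, the grid density, and the 105 by 68 template size (a common soccer pitch size in meters) are chosen for the example and are not taken from the original work.

```python
def make_keypoint_grid(field_length, field_width, cols, rows):
    """Return rows * cols key points evenly spaced over the field template.

    Points are listed row by row and include the field boundary.
    """
    if cols < 2 or rows < 2:
        raise ValueError("need at least two columns and two rows")
    xs = [field_length * i / (cols - 1) for i in range(cols)]
    ys = [field_width * j / (rows - 1) for j in range(rows)]
    return [(x, y) for y in ys for x in xs]


# Hypothetical template size and grid density.
grid = make_keypoint_grid(105.0, 68.0, cols=8, rows=5)
```

Because the grid is spread over the whole template, a frame that shows only a small, featureless part of the field still contains several grid points that can serve as correspondences.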

Our second innovation relates to the detection of dense field-features to further improve our registration accuracy. These dense features are defined as the distance map from each pixel in the field template to its nearest field landmarks (for example, yard-line numbers in a football field in the following image).

Five examples that show how dense features are defined as the distance map from each pixel in the field template to its nearest field landmarks.

The application of dense field-features detection to yard-line numbers on a football field.
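The distance-map definition above can be sketched in a few lines. This brute-force version assumes landmarks are given as template pixel coordinates; the function name and the toy template are hypothetical, and how the original work normalizes or truncates the map is not stated here.

```python
import math


def distance_map(width, height, landmarks):
    """For each template pixel, return the Euclidean distance to the
    nearest landmark pixel. The result is indexed as dmap[y][x]."""
    if not landmarks:
        raise ValueError("at least one landmark is required")
    return [
        [min(math.hypot(x - lx, y - ly) for lx, ly in landmarks)
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 10 x 5 template with two landmark pixels given as (x, y).
landmarks = [(2, 1), (7, 3)]
dmap = distance_map(10, 5, landmarks)
```

The cost grows with the number of pixels times the number of landmarks, so a larger template would call for a dedicated distance-transform routine. Unlike sparse key points, the map carries a value at every pixel, which is what makes the feature dense.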

We train a deep neural network to simultaneously detect both a key-points grid and dense features for efficiency. Our network consists of a single encoder to extract image-features, and two detection heads which simultaneously detect a key-points grid and dense features.

Comparisons across three important metrics

To have a comprehensive evaluation of our approach for sports-field registration, we collected a diverse dataset with videos from five US sports (football, basketball, soccer, ice hockey, and tennis). The ground truth homography transformation is carefully annotated and verified by a human annotator. We use the following three evaluation metrics to measure the accuracy of our approach:

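  • Intersection over union (IoU) – The IoU between the field projected using the ground truth transformation and the field projected using the estimated transformation.
  • Projection error – The average distance, measured on the actual field, between points projected using the ground truth transformation and the same points projected using the estimated transformation.
  • Back-projection error – The average distance, measured on the screen image, between points back-projected using the ground truth transformation and the same points back-projected using the estimated transformation.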
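As a rough illustration of one of these metrics, the projection error can be computed by mapping the same image points through both transformations and averaging the distances between the results. The function names and the two example matrices are hypothetical; the exact point sampling and units used in the evaluation are not specified here.

```python
import math


def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


def mean_projection_error(h_gt, h_est, image_points):
    """Average field-space distance between the ground truth projection
    and the estimated projection of the same image points."""
    if not image_points:
        raise ValueError("at least one point is required")
    total = 0.0
    for p in image_points:
        gx, gy = apply_homography(h_gt, p)
        ex, ey = apply_homography(h_est, p)
        total += math.hypot(gx - ex, gy - ey)
    return total / len(image_points)


# Hypothetical transformations: the estimate is shifted by 0.5 field
# units along x relative to the ground truth.
h_gt = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
h_est = [[0.1, 0.0, 0.5], [0.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
points = [(100, 50), (640, 360), (1200, 700)]

error = mean_projection_error(h_gt, h_est, points)
```

In this toy case every point is displaced by the same offset, so the mean error equals that offset (about 0.5 field units).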

This evaluation demonstrates that our approach significantly outperforms existing approaches across all three metrics. The improvement is even greater on football videos, which are more challenging than those of the other sports because of the uniform field and large camera motion. Full details of our comparisons are available in the paper “A robust and efficient framework for sports-field registration,” which we published at IEEE WACV in 2021.

Using our sports field registration approach, we can accurately map each pixel from the screen image to the location on the actual field. This mapping enables a set of sport visualization use-cases which provide the next-generation sports watching experience for our customers at Prime Video.

Looking ahead

Prime Video’s work on sports field registration is just the beginning. There are many challenges ahead, including extremely small camera fields of view, uncommon camera angles, and uncommon field appearances. We are actively working to address these challenges to bring the robustness and accuracy of field registration to the next level.

Senior Principal Scientist – Prime Video
Senior Applied Scientist – Amazon Devices