At Prime Video, we’re constantly expanding our live sports offerings, including exclusive coverage of NFL Thursday Night Football (TNF). Providing contextual information about player actions creates an engaging and immersive viewing experience for our customers. To do this, Prime Video developed sports-field tracking technology to create customer experiences such as highlighting key players during a play, displaying player speed and covered distances, and using visual aids to provide more insights about fouls (for example, offside fouls in soccer).
An important customer experience factor is mapping the playing field of a particular sport to a standardized field image of that sport. The following image provides an example of this.
We use a homography transformation computed from manually labeled key points to perform this 2D-to-2D mapping for each video frame. Each key point is labeled twice: once on the field as it appears in the video and once on the field’s standardized image. However, this manual labeling process is time-consuming, so we needed to automate the process of finding the homography transformation for each video frame.
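To make the 2D-to-2D mapping concrete, the following is a minimal sketch of how a homography can be computed from labeled key points. It is not Prime Video's implementation: the function names and coordinates are hypothetical, and it handles only the exact four-point case by fixing the last matrix entry to 1 and solving the resulting 8x8 linear system. A production pipeline would typically use more correspondences with a least-squares or robust fit.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(src, dst):
    """Return the 3x3 matrix H (with H[2][2] = 1) mapping src points to dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one (x, y) point through H, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical labels: four key points in the video frame (pixels) and the
# same four points on the standardized field image (field units).
frame_points = [(100.0, 200.0), (500.0, 180.0), (620.0, 400.0), (40.0, 420.0)]
field_points = [(0.0, 0.0), (50.0, 0.0), (50.0, 30.0), (0.0, 30.0)]
H = homography_from_4_points(frame_points, field_points)
```

With `H` in hand, `apply_homography(H, pixel)` maps any pixel in that frame to the standardized field image. Because the camera moves, a new `H` is needed for every frame, which is why manual labeling does not scale.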
To automate the underlying process, we propose an innovative approach for automatic sports field registration that can run in real time and is based on computer vision and deep learning technologies. Our approach uses the RGB screen image as the only input to train a deep neural network to simultaneously localize both sparse key points and dense landmarks on the sports field. Then, these automatically detected key points and landmarks are jointly used to compute the homography transformation and perform the field registration for each frame of a sports video.
Automatic field registration framework
A key challenge for reliably and accurately performing sports field registration is the lack of sufficiently distinct field features (for example, insufficient landmarks or line intersections on the image). This challenge is prevalent in the top five sports in the United States (football, soccer, basketball, ice hockey, and tennis) and is caused by the following three factors:
- Uniform field appearance – The sports field’s appearance has a uniform texture (for example, the middle area of a soccer field is mostly occupied by green grass).
- Narrow field-of-view – The camera only captures a small portion of the field.
- Field occlusion – The players occlude the sports field’s landmarks (for example, multiple players in the same area occluding landmarks).
The following three images show the impact of these three factors in real life.
To address the lack of distinct features, Prime Video developed a new field registration framework that takes a monocular broadcast video and incrementally computes the homography for each video frame. Our framework provides the following two key technical innovations.
The first innovation is in the detection of a uniformly distributed grid of key points. Typically, key points in homography computation are only defined at the corners and line intersections of the field. Such sparse points cannot cover the entire field, and a frame that shows only an uncovered part of the field has too few key points for accurate registration. In contrast, we defined a grid of uniformly distributed key points that covers the entire field so that every part of it has sufficient key points. This creates a more accurate field registration. The following image shows traditional key points and our key points grid.
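The idea of a uniform grid can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the published method: the grid density, the field dimensions (a 120 x 53.3 yard football field including end zones), and the helper names are chosen only for the example. It builds a grid on the standardized field and, given a field-to-image homography, lists which grid points land inside a frame.

```python
def uniform_grid(field_width, field_height, cols, rows):
    """Evenly spaced key points covering the whole standardized field."""
    xs = [field_width * i / (cols - 1) for i in range(cols)]
    ys = [field_height * j / (rows - 1) for j in range(rows)]
    return [(x, y) for y in ys for x in xs]


def visible_key_points(grid, h_field_to_image, image_width, image_height):
    """Project grid points into the frame and keep those inside the image.

    Returns (grid_index, (u, v)) pairs so each visible point keeps its identity.
    """
    h = h_field_to_image
    visible = []
    for index, (x, y) in enumerate(grid):
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        if w <= 0:  # point is behind the camera plane
            continue
        u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
        v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
        if 0 <= u < image_width and 0 <= v < image_height:
            visible.append((index, (u, v)))
    return visible


# Hypothetical setup: one key point every 10 yards along the field, 5 rows
# across it, and a toy homography that scales yards to pixels (10 px per yard).
grid = uniform_grid(120.0, 53.3, cols=13, rows=5)
toy_h = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 1.0]]
points_in_frame = visible_key_points(grid, toy_h, image_width=640, image_height=360)
```

Even though this toy frame shows only part of the field, many grid points remain visible, whereas corner and line-intersection key points could be missing from such a view entirely.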
Our second innovation relates to the detection of dense field-features to further improve our registration accuracy. These dense features are defined as the distance map from each pixel in the field template to its nearest field landmarks (for example, yard-line numbers in a football field in the following image).
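As a rough illustration of such a distance map, the sketch below computes, for every pixel of a small template, the Euclidean distance to its nearest landmark pixel. The template size and landmark positions are hypothetical, and the brute-force loop is chosen for readability only; for real template resolutions a distance transform such as `scipy.ndimage.distance_transform_edt` would be a more practical choice.

```python
import math


def distance_map(width, height, landmarks):
    """Return rows of distances from each template pixel to the nearest landmark.

    landmarks is a list of (x, y) pixel positions on the field template,
    for example pixels that belong to yard-line numbers.
    """
    return [
        [min(math.hypot(x - lx, y - ly) for lx, ly in landmarks)
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 5 x 4 template with two landmark pixels.
dense_features = distance_map(5, 4, landmarks=[(0, 0), (4, 3)])
```

Unlike key points, this map is defined at every pixel, so it still carries a usable signal in regions where no key point or line intersection is visible.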
For efficiency, we train a single deep neural network to detect both the key-points grid and the dense features simultaneously. The network consists of one shared encoder that extracts image features and two detection heads, one for the key-points grid and one for the dense features.
Comparisons across three important metrics
To have a comprehensive evaluation of our approach for sports-field registration, we collected a diverse dataset with videos from five US sports (football, basketball, soccer, ice hockey, and tennis). The ground truth homography transformation is carefully annotated and verified by a human annotator. We use the following three evaluation metrics to measure the accuracy of our approach:
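- Intersection over union (IoU) – The IoU between the field projected with the ground truth transformation and the field projected with the estimated transformation.
- Projection error – The average distance, measured on the actual field, between points projected with the ground truth transformation and the same points projected with the estimated transformation.
- Back-projection error – The average distance, measured on the screen image, between points back-projected with the ground truth transformation and the same points back-projected with the estimated transformation.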
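The three metrics above can be sketched as follows. This is one plausible reading of the definitions rather than the exact evaluation code: the IoU here is approximated by sampling image pixels and checking whether each lands inside the field rectangle under each transformation, and all names and parameters are hypothetical.

```python
import math


def project(h, point):
    """Map one (x, y) point through a 3x3 homography."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def mean_distance(h_true, h_est, points):
    """Average distance between the same points mapped by two homographies.

    With image-to-field homographies and image points this is the projection
    error; with field-to-image homographies and field points it is the
    back-projection error.
    """
    total = 0.0
    for p in points:
        (ax, ay), (bx, by) = project(h_true, p), project(h_est, p)
        total += math.hypot(ax - bx, ay - by)
    return total / len(points)


def inside_field(point, field_size):
    return 0 <= point[0] <= field_size[0] and 0 <= point[1] <= field_size[1]


def field_iou(h_true, h_est, image_size, field_size, step=4):
    """Sampled IoU of the visible field area under two image-to-field homographies."""
    inter = union = 0
    for v in range(0, image_size[1], step):
        for u in range(0, image_size[0], step):
            a = inside_field(project(h_true, (u, v)), field_size)
            b = inside_field(project(h_est, (u, v)), field_size)
            inter += a and b
            union += a or b
    return inter / union if union else 0.0
```

A perfect estimate gives an IoU of 1 and both errors of 0. The projection error is expressed in field units (for example, yards), while the back-projection error is expressed in pixels.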
This evaluation demonstrates that our approach significantly outperforms other existing approaches across all three metrics. The improvement is even greater on football videos. Football is more challenging than the other sports because of its uniform field and large camera motion. Full details of our comparisons are available in the paper "A robust and efficient framework for sports-field registration", which we published at IEEE WACV in 2021.
Using our sports field registration approach, we can accurately map each pixel from the screen image to its location on the actual field. This mapping enables a set of sports visualization use cases that provide the next-generation sports-watching experience for our customers at Prime Video.
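As an example of what this pixel-to-field mapping makes possible, the sketch below estimates a player's covered distance and average speed from a tracked pixel position in each frame. It assumes a per-frame image-to-field homography and a player track are already available; the function names, the input layout, and the frame rate are hypothetical.

```python
import math


def to_field(h, pixel):
    """Map a screen pixel to field coordinates with an image-to-field homography."""
    x, y = pixel
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def distance_and_speed(frames, fps):
    """Return (covered distance, average speed) in field units.

    frames is a list of (homography, player_pixel) pairs, one per video frame,
    because each frame has its own homography.
    """
    track = [to_field(h, pixel) for h, pixel in frames]
    distance = sum(
        math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(track, track[1:])
    )
    elapsed = (len(frames) - 1) / fps  # seconds between first and last frame
    speed = distance / elapsed if elapsed > 0 else 0.0
    return distance, speed
```

If the field coordinates are in yards, the distance is in yards and the speed in yards per second. In practice, a tracked position would likely need smoothing before differencing, because small registration errors in consecutive frames add up as spurious movement.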
Looking ahead
Prime Video’s work on sports field registration is just the beginning. There are many challenges ahead, including extremely small camera fields of view, uncommon camera angles, and uncommon field appearances. We are actively working to address these challenges to bring the robustness and accuracy of field registration to the next level.