Prime Video presents on using spatio-temporal learning for video quality assessment

Research by Prime Video demonstrates that it’s vital to consider both spatial and temporal features when developing a model to estimate the visual quality of production-related content.

In January 2023, Prime Video presented a paper entitled On the importance of spatio-temporal learning for video quality assessment at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

This research was conducted by Dario Fontanel (Applied Scientist Intern – Prime Video) during his internship on the Video Quality Analysis team at Prime Video, with support from Benoit Vallade (Applied Scientist II – Prime Video) and David Higham (Computer Vision Scientist – Prime Video). Video quality analysis ensures the high audiovisual quality of Prime Video’s video-on-demand (VOD) and live content by applying low-latency and low-cost machine learning (ML) at scale.

Our paper explores automated visual quality assurance for videos ingested into the Prime Video catalog. Specifically, we investigated the application of VSFA for estimating the visual quality of production-related content, such as that found in the Prime Video catalog. VSFA is a state-of-the-art deep learning model for visual quality prediction on user-generated content, such as videos uploaded to YouTube. Through a series of experiments that manipulated individual components of the model, we demonstrated that modeling both spatial and temporal features is vital when predicting the visual quality of production-related content.
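At a high level, a VSFA-style model extracts spatial features from each frame, reduces their dimensionality, and then models their temporal evolution before pooling them into a single quality score. The sketch below shows only this two-stage structure; every component (the random "frame features", the mean-based dimension reduction, the smoothing-based sequence model, and the min-over-window pooling) is a simplified stand-in, not the CNN, GRU, or pooling used in the actual paper:

```python
import random

random.seed(0)

def frame_features(num_frames, dim=8):
    # Stand-in for per-frame spatial features from a pretrained CNN.
    return [[random.random() for _ in range(dim)] for _ in range(num_frames)]

def reduce_dim(feat):
    # Stand-in for the dimension-reduction module: mean of the feature vector.
    return sum(feat) / len(feat)

def recurrent_scores(xs, alpha=0.7):
    # Stand-in sequence model: exponential smoothing over frame scores.
    out, h = [], xs[0]
    for x in xs:
        h = alpha * h + (1 - alpha) * x
        out.append(h)
    return out

def hysteresis_pool(scores, window=3):
    # Simplified hysteresis effect: a frame's perceived quality is pulled
    # toward the worst quality in a short window of recent frames.
    pooled = [min(scores[max(0, t - window + 1):t + 1])
              for t in range(len(scores))]
    return sum(pooled) / len(pooled)

feats = frame_features(30)
per_frame = [reduce_dim(f) for f in feats]
video_score = hysteresis_pool(recurrent_scores(per_frame))
print(0.0 <= video_score <= 1.0)  # the toy score stays in the feature range
```

The point of the structure is that spatial modeling (feature extraction and reduction) and temporal modeling (the sequence model and pooling) are separate stages, which is what allowed our experiments to manipulate them independently.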

In our first experiment, we found that a more complex model extracts more expressive spatial features, which increases the correlation between the predictions and the true values. However, increasing the complexity further yields no additional gain in correlation. We attribute this to a dimension-reduction module within the subsequent spatial learning stage discarding the supplementary information that the most complex model can extract. We also show that the choice of temporal modeling is crucial: using a sequence model followed by a module that handles the hysteresis effect can boost the Spearman rank-order correlation coefficient (SROCC) by 0.15. Finally, we show that it is important to use a consistent frame rate at train and test time. This simple change resulted in a 0.01 boost to the SROCC.
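SROCC, the metric behind these gains, measures how well the model's predictions rank videos in the same order as subjective quality scores, regardless of the absolute values. A minimal stdlib-only computation, using illustrative scores rather than values from the paper (the helper names are our own):

```python
def ranks(xs):
    # Rank values 1..n (this toy example has no ties).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def srocc(preds, targets):
    # Spearman rank-order correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    n = len(preds)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(preds), ranks(targets)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

predicted = [3.1, 4.2, 2.5, 3.8, 4.9]   # model's quality predictions
subjective = [3.0, 4.5, 2.2, 4.6, 3.9]  # ground-truth subjective scores
print(srocc(predicted, subjective))      # 0.6
```

Because SROCC depends only on rank agreement, it ranges from -1 to 1, and a 0.15 improvement represents a substantial shift in how reliably the model orders videos by quality.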

In summary, our work shows that it is vital to consider both spatial and temporal aspects when predicting the visual quality of production-related content. While we found that VSFA is inferior to the existing state of the art for visual quality prediction on production-related content (for more information, see the No-reference video quality assessment using space-time chips paper), we believe that it shows promising results and should be explored further. We will continue to investigate state-of-the-art methods to improve content quality for all of Prime Video's customers.

Stay tuned for more from us!
