Skip to main content


Prime Video is a great place to build and innovate at scale, but that’s only one part of the story. Our technologists also publish, teach, and engage with the worldwide research community.

In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification.
To overcome the drawbacks of prior MCTF design, we propose an encoder-aware MCTF (EA-MCTF) that resides within the encoder.
In this paper, we present a large-scale HDR video quality dataset for sports content that includes the above mentioned important issues in live streaming, and a method of merging multiple datasets using anchor videos.
In this paper, we consider using a multiscale approach to reduce complexity while maintaining coding efficiency.
In this work, we describe the various factors which affect the suitability of a face image for recognition by humans. We propose efficient solutions which can solve the problem without the use of ground truth data. We train a regression model using weak supervision provided by heuristics based on features which affect face quality. Finally, we use professional photography techniques to create standardized and aesthetically pleasing profile images.
In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio only SER system based on the multi-lingual pre-trained wav2vec 2.0 model.
In this work, we pose intro and recap detection as a supervised sequence labeling problem and propose a novel end-to-end deep learning framework to this end.
This work presents a No-Reference model to detect audio artifacts in video. The model, based upon a Pretrained Audio Neural Network, classifies a 1-second audio segment as either No Defect, Audio Hum, Audio Hiss, Audio Distortion or Audio Clicks. The model achieves a balanced accuracy of 0.986 on our proprietary simulated dataset.
In this work, we develop a data collection pipeline to address long sequence of texts and integrate this pipeline with a multi-head self-attention model.
We show that, (a) audio based approach results in superior performance compared to other baselines, (b) benefit due to audio model is more pronounced on global multi-lingual data compared to English data and (c) the multi-modal model results in 63% rating accuracy and provides the ability to backfill top 90% Stream Weighted Coverage titles in PV catalog with 88% coverage at 91% accuracy.
In this work, we propose LipNeRF, a lip-syncing NeRF that bridges the gap between the accurate lip synchronization of GAN-based methods and the accurate 3D face modeling of NeRFs.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.