Publications

Multiscale audio spectrogram transformer for efficient audio classification

In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification.

Wentao Zhu, Mohamed Omar

Mar 27, 2023

Computer Vision

Improving compression efficiency using an encoder-aware motion compensated temporal filter

To overcome the drawbacks of prior motion compensated temporal filter (MCTF) designs, we propose an encoder-aware MCTF (EA-MCTF) that resides within the encoder.

Rahul Vanam, Sriram Sethuraman

Mar 17, 2023

Computer Vision

Subjective and objective video quality assessment of high dynamic range sports content

In this paper, we present a large-scale HDR video quality dataset for sports content that includes the above-mentioned important issues in live streaming, and a method of merging multiple datasets using anchor videos.

Yixu Chen, Yongjun Wu, Hai Wei, Sriram Sethuraman

Mar 10, 2023

Computer Vision

Multiscale convolutional neural networks for in-loop video restoration

In this paper, we consider using a multiscale approach to reduce complexity while maintaining coding efficiency.

Kiran Misra, Andrew Segall, Byeongdoo Choi

Jan 27, 2023

Computer Vision

Face image quality for actor profile image curation

In this work, we describe the various factors that affect the suitability of a face image for recognition by humans. We propose efficient solutions that solve the problem without the use of ground-truth data. We train a regression model using weak supervision provided by heuristics based on features that affect face quality. Finally, we use professional photography techniques to create standardized and aesthetically pleasing profile images.

Yash Pandya, Abhinav Aggarwal, Manivel Sethu, Laxmi Shivaji Ahire, Kaustav Nandy

Jan 02, 2023

Machine Learning

Multi-lingual multi-task speech emotion recognition using wav2vec 2.0

In this work, we present a Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio-only speech emotion recognition (SER) system based on the multi-lingual pre-trained wav2vec 2.0 model.

Mayank Sharma

Jan 02, 2023

Computer Vision

Intro and recap detection for movies and TV series

In this work, we pose intro and recap detection as a supervised sequence labeling problem and propose a novel end-to-end deep learning framework to solve it.

Xiang Hao, Ben Cheung, Raffay Hamid

Jan 02, 2023

Computer Vision

A no-reference model for detecting audio artifacts using pretrained audio neural networks

This work presents a No-Reference model to detect audio artifacts in video. The model, based upon a Pretrained Audio Neural Network, classifies a 1-second audio segment as either No Defect, Audio Hum, Audio Hiss, Audio Distortion or Audio Clicks. The model achieves a balanced accuracy of 0.986 on our proprietary simulated dataset.

David Higham, Ayush Bagla, Veneta Haralampieva

Jan 02, 2023

Machine Learning

Detect profane language in streaming services to protect young audiences

In this work, we develop a data collection pipeline to address long sequences of text and integrate this pipeline with a multi-head self-attention model.

Kai Wei, Xiang Hao

Jan 02, 2023

Computer Vision

CNN-based audio event recognition for automated violence classification and rating for Prime Video content

We show that (a) the audio-based approach outperforms the other baselines, (b) the benefit of the audio model is more pronounced on global multi-lingual data than on English data, and (c) the multi-modal model achieves 63% rating accuracy and provides the ability to backfill the top 90% Stream Weighted Coverage titles in the Prime Video catalog with 88% coverage at 91% accuracy.

Mayank Sharma, Xiang Hao, Raffay Hamid

Jan 02, 2023

Computer Vision

LipNeRF: What is the right feature space to lip-sync a NeRF?

In this work, we propose LipNeRF, a lip-syncing NeRF that bridges the gap between the accurate lip synchronization of GAN-based methods and the accurate 3D face modeling of NeRFs.

Abhinav Jain, Rohith Mysore Vijaya Kumar, Vimal Bhat

Jan 02, 2023

Computer Vision

Robust cross-modal representation learning with progressive self-distillation

We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.

Shixing Chen, Raffay Hamid

Jan 02, 2023
