Robust cross-modal representation learning with progressive self-distillation

Shixing Chen,

Jan 02, 2023

By Alex Andonian, Shixing Chen, and Raffay Hamid

The learning objective of vision-language approach of CLIP [63] does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To address this challenge, we introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data. Our model distills its own knowledge to dynamically generate soft-alignment targets for a subset of images and captions in every minibatch, which are then used to update its parameters. Extensive evaluation across 14 benchmark datasets shows that our method consistently outperforms its CLIP counterpart in multiple settings, including: (a) zero-shot classification, (b) linear probe transfer, and (c) image-text retrieval, without incurring extra computational cost. Analysis using an ImageNet-based robustness test-bed [70] reveals that our method offers better effective robustness to natural distribution shifts compared to both ImageNet-trained models and CLIP itself. Lastly, pretraining with datasets spanning two orders of magnitude in size shows that our improvements over CLIP tend to scale with number of training examples.

For the full paper, see Robust cross-modal representation learning with progressive self-distillation on the Amazon Science website.

Twitter

Shixing Chen

Applied Scientist II – Prime Video

Raffay Hamid

Senior Principal Scientist – Prime Video

Most popular

Video Streaming

“We’re just beginning to build the future of live sports streaming”

At the European Women in Tech conference 2022, Filippa Hasselstrom, head of low-latency streaming at Prime Video, explained how her team builds the future of live sports streaming using UDP.

Filippa Hasselstrom

Feb 07, 2023

Our Innovation

Prime Video announces Amazon Research Awards recipients for fall 2022

Prime Video announces ARA awards in the fields of anomaly detection and insights, automated reasoning, personalization and discovery, and video quality analysis.

Staff Writer

Apr 17, 2023

Our People

Empathetic by design: How Amélie Werner prioritizes her team to drive innovation for customers

As head of Design Ops, UX Research, and Global Commerce Design at Prime Video, Amélie helped oversee the redesign of the user experience – a journey that’s allowed her to embrace Amazon’s Leadership Principles while empowering her colleagues.

Amélie Werner

Apr 05, 2023

Video Streaming

Innovating live video streaming for a VOD-only world

Here’s how Prime Video delivers live video streaming on customer devices that only support video-on-demand (VOD) playback.

Parminder Singh

Apr 13, 2023