Building resilient services at Prime Video using chaos engineering

Prime Video used chaos testing to discover and prevent a customer-impacting failure, and then released an open-source library to help the developer community.

Varun Jewalikar, Adrian Hornsby

Feb 13, 2023

Large-scale distributed software systems are composed of several individual sub-systems (for example, content delivery networks (CDNs), load balancers, and databases) and their interactions. These interactions occasionally have unpredictable outcomes caused by unforeseen turbulent events (such as a network failure). These events can lead to system-wide failures.

Chaos engineering is the process of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent events. The key to chaos engineering is injecting failure in a controlled manner. In 2020, a Prime Video software engineering team built and then open-sourced a lightweight failure-injection library called AWSSSMChaosRunner. You can use this library to inject failures into systems running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS). The library orchestrates failure injection through AWS Systems Manager, whose agent (SSM Agent) is typically installed and running by default on these underlying hosts.
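To make the idea of controlled failure injection through AWS Systems Manager more concrete, here is a minimal Python sketch. It is not the AWSSSMChaosRunner API. It only builds the arguments that could be passed to the Systems Manager `SendCommand` operation with the `AWS-RunShellScript` document. The function name, the `eth0` interface, and the choice of `tc netem` to add latency are assumptions made for illustration. The "controlled" part is that the same script removes the fault after a fixed duration.

```python
def build_latency_injection_request(instance_ids, delay_ms, duration_s, interface="eth0"):
    """Build SendCommand arguments that add network latency, wait, then revert.

    Hypothetical helper for illustration; the interface name and the use of
    `tc netem` are assumptions, not details taken from the article.
    """
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    if delay_ms <= 0 or duration_s <= 0:
        raise ValueError("delay_ms and duration_s must be positive")

    commands = [
        # Inject: delay all outgoing packets on the interface.
        f"sudo tc qdisc add dev {interface} root netem delay {delay_ms}ms",
        # Hold the fault for a bounded time.
        f"sleep {duration_s}",
        # Revert: remove the delay so the host returns to normal.
        f"sudo tc qdisc del dev {interface} root netem",
    ]
    return {
        "InstanceIds": list(instance_ids),
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
            "commands": commands,
            # Give the script some headroom beyond the fault duration.
            "executionTimeout": [str(duration_s + 60)],
        },
    }
```

With boto3, a dictionary like this could be passed as `boto3.client("ssm").send_command(**request)`, assuming the target instances are managed by Systems Manager and the caller has the required permissions.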

The team ran failure-injection tests using this library to validate service-to-cache configurations such as timeouts, retries, and circuit breakers. These tests uncovered bugs in the timeout logic. The team fixed these bugs before the service launched, thereby preventing a customer-impacting failure.
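As a rough illustration of what such a test exercises, the following Python sketch models a service-to-cache read path with a timeout, retries, and a very simple circuit breaker, plus a fake cache into which latency can be injected. All class names, parameters, and thresholds here are hypothetical and are not taken from the Prime Video service. The circuit breaker is deliberately simplified: it opens after a number of consecutive failed reads and has no half-open state.

```python
class FlakyCache:
    """Stand-in for a cache client; injected_latency_s simulates a slow network."""

    def __init__(self, data):
        self.data = data
        self.injected_latency_s = 0.0
        self.calls = 0

    def get(self, key, timeout_s):
        self.calls += 1
        # Simulate the call timing out instead of actually sleeping.
        if self.injected_latency_s > timeout_s:
            raise TimeoutError(f"cache read for {key!r} exceeded {timeout_s}s")
        return self.data.get(key)


class CacheReader:
    """Reads through the cache with a timeout, retries, and a circuit breaker."""

    def __init__(self, cache, timeout_s=0.05, max_attempts=3, failure_threshold=2):
        self.cache = cache
        self.timeout_s = timeout_s
        self.max_attempts = max_attempts
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def circuit_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def read(self, key, fallback):
        # When the circuit is open, skip the cache entirely.
        if self.circuit_open:
            return fallback(key)
        for _ in range(self.max_attempts):
            try:
                value = self.cache.get(key, self.timeout_s)
            except TimeoutError:
                continue  # retry until attempts are exhausted
            self.consecutive_failures = 0
            return value
        # Every attempt timed out: count one failed read and fall back.
        self.consecutive_failures += 1
        return fallback(key)
```

A failure-injection test against this sketch would set `injected_latency_s` above the timeout and then check that each read falls back after exactly `max_attempts` cache calls, and that the cache is no longer called once the circuit opens. A bug in the timeout logic would show up as a mismatch in one of those checks.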

For more detail, see the article "Building resilient services at Prime Video with chaos engineering" on the AWS Open Source Blog.

Varun Jewalikar

Software Development Engineer – Prime Video

Adrian Hornsby

Principal System Development Engineer – Amazon Web Services (AWS)
