Building resilient services at Prime Video using chaos engineering
Prime Video used chaos testing to discover and prevent a customer-impacting failure, and then released an open-source library to help the developer community.
Large-scale distributed software systems are composed of several individual subsystems, such as content delivery networks (CDNs), load balancers, and databases, together with the interactions among them. These interactions occasionally have unpredictable outcomes when an unforeseen turbulent event, such as a network failure, occurs. Such events can lead to system-wide failures.
Chaos engineering is the practice of experimenting on a distributed system to build confidence in the system's ability to withstand turbulent events. The key to chaos engineering is injecting failure in a controlled manner. In 2020, a Prime Video software engineering team built and then open-sourced AWSSSMChaosRunner, a lightweight library for failure injection. You can use this library to inject failures into systems running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS). The library orchestrates failure injection through AWS Systems Manager, whose agent (SSM Agent) is typically installed and running by default on the underlying hosts.
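To make "injecting failure in a controlled manner" concrete, here is a minimal sketch of how a latency injection could be described as a Systems Manager command. It is not the AWSSSMChaosRunner API or its implementation. The helper name build_latency_injection_request, the instance ID, the interface name eth0, and the choice of the Linux tc/netem tool are all assumptions made for illustration. The sketch only builds the request; it does not call AWS.

```python
def build_latency_injection_request(instance_ids, delay_ms, duration_s, interface="eth0"):
    """Build keyword arguments for an SSM SendCommand call (hypothetical helper).

    The failure is bounded in time: inject, hold for duration_s, then revert.
    """
    if delay_ms <= 0 or duration_s <= 0:
        raise ValueError("delay_ms and duration_s must be positive")
    commands = [
        # Inject: delay all outbound packets on the interface.
        f"sudo tc qdisc add dev {interface} root netem delay {delay_ms}ms",
        # Hold the failure for a fixed window so the experiment stays controlled.
        f"sleep {duration_s}",
        # Revert: remove the netem rule so the host returns to normal.
        f"sudo tc qdisc del dev {interface} root netem",
    ]
    return {
        "InstanceIds": list(instance_ids),
        # AWS-RunShellScript is the AWS-managed document for running shell commands.
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
            "commands": commands,
            # Give the command a little longer than the failure window (assumed margin).
            "executionTimeout": [str(duration_s + 60)],
        },
    }


# Placeholder instance ID, for illustration only.
request = build_latency_injection_request(
    ["i-0123456789abcdef0"], delay_ms=200, duration_s=60
)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as boto3.client("ssm").send_command(**request). One limitation of this sketch is that the revert step is skipped if the script is interrupted during the sleep, so a real experiment would need a more robust rollback.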
The team used this library to run failure-injection tests that validated service-to-cache configurations such as timeouts, retries, and circuit breakers. These tests uncovered bugs in the timeout logic. The bugs were fixed before the service launched, which prevented a customer-impacting failure.
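The kind of check described above can be sketched in a self-contained way. The example below is not Prime Video's service or test code. SlowCache, CacheClient, the fallback function, and every threshold are hypothetical, and the latency is injected in-process rather than through Systems Manager. It shows a client with a per-call timeout, a bounded number of retries, and a deliberately simple circuit breaker that stays open once tripped, exercised against a cache that has been made slow.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class SlowCache:
    """Stand-in for a cache whose latency can be injected (hypothetical)."""

    def __init__(self):
        self.injected_delay_s = 0.0
        self.data = {"title:1": "cached"}

    def get(self, key):
        time.sleep(self.injected_delay_s)  # simulated network latency
        return self.data.get(key)


class CacheClient:
    """Cache client with timeout, retries, and a simple circuit breaker."""

    def __init__(self, cache, timeout_s=0.1, max_retries=2, failure_threshold=3):
        self.cache = cache
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.pool = ThreadPoolExecutor(max_workers=4)

    @property
    def circuit_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def get(self, key, fallback):
        for _ in range(1 + self.max_retries):
            if self.circuit_open:
                break  # stop calling the cache once the breaker has tripped
            future = self.pool.submit(self.cache.get, key)
            try:
                value = future.result(timeout=self.timeout_s)
                self.consecutive_failures = 0
                return value
            except FutureTimeout:
                self.consecutive_failures += 1
        return fallback(key)


def fallback(key):
    """Pretend to read from the slower source of truth."""
    return "origin"


cache = SlowCache()
client = CacheClient(cache)

healthy_value = client.get("title:1", fallback)

# Inject the failure: every cache call now takes far longer than the timeout.
cache.injected_delay_s = 0.5
start = time.monotonic()
degraded_value = client.get("title:1", fallback)
degraded_elapsed = time.monotonic() - start

# With the breaker open, the next call should skip the cache entirely.
start = time.monotonic()
open_value = client.get("title:1", fallback)
open_elapsed = time.monotonic() - start
```

If the timeout logic were broken (for example, if the timeout were never applied), the degraded call would take at least as long as the injected delay, and a timing assertion on degraded_elapsed would catch it before launch.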
For more detail, see the article "Building resilient services at Prime Video with chaos engineering" on the AWS Open Source Blog.