How Prime Video troubleshoots quickly and cost-effectively at scale
Troubleshooting across Prime Video services using a near-real time log ingestion and query system that combines zero-knowledge observability with user and product insights.
In a decoupled environment with hundreds of services and client paths involving multiple hops, it can be cumbersome to find the source of a glitch and solve a problem before it impacts customers. Prime Video teams need to detect, debug, and solve problems, and thereby delight customers with an uninterrupted viewing experience. And they must do it quickly, accurately, reliably, and cost-effectively.
The Prime Video Federated Diagnostics team’s mission is to provide Prime Video teams with real-time system observability that enables informed decision-making about the viewer experience. Prime Video works at massive scale, supporting millions of customers every day. That scale results in our systems producing petabytes of data. With continued steep viewer growth, rapid expansion, and ever-growing delivery options, making sure that we deliver 100% uptime for our viewers involves proactively detecting problems and taking rapid corrective actions. We were inspired by this unique challenge and set out to provide our teams with the data that they need to swiftly troubleshoot, triage, and mitigate any viewer-impacting problems.
Building a three-layer solution
Prime Video service teams told us they had good visibility into metrics and alarms but were struggling to access contextual data from petabytes of log files in real time. We solved this by providing access to log files via a three-layer solution, which is based on industry-standard features that are scalable, accelerate root cause analysis, and help mitigate problems before they impact our viewers.
The base layer allows users to analyze logs from individual services. Service teams told us that they could get value from their existing logs, so we architected a solution that accepted this existing unstructured log data. This allowed teams to quickly onboard in less than two hours. We also wanted teams to use powerful, industry-rich query tooling and chose SQL as the query language. To bring it all together, we provided programmatic access to allow users to integrate their own tools for querying and post-processing data, in addition to a fully-featured UI that included dashboards, pivot tables, and saved queries. We chose Amazon Web Services (AWS) technologies, such as Amazon Elastic Container Service (Amazon ECS) and AWS Lambda, that auto scale with demand so that teams don’t have to worry about tool maintenance, which helps them further focus on supporting the viewer experience.
We received feedback that providing teams with faster access to existing logs helped mitigate most glitches within a single service. However, Prime Video’s large and complex decoupled service environment meant that cross-service analysis using the existing log files was time-consuming for many use cases, resulting in longer resolution times. Service teams needed functionality to quickly correlate log data between services to help speed up cross-service analysis. To do this, we developed the second distributed tracing layer.
We introduced a new Flare log that contained structured log statements conforming to a defined schema and that encouraged discoverability of semantically understandable data. Users could stitch together requests between different services because we automatically appended a trace identifier to each log statement. During ingestion, we store each service’s Flare log statements in separate files but we logically bundle them into one consolidated queryable data source. This design is resilient to delays or outages in a single service’s Flare log delivery because the logical view automatically updates as each service’s data arrives. It allows users to one-click query against a trace identifier and view log statements from all services that were called. The user now doesn’t need to know the call path through services while debugging. These logs contain semantically clear information such as TitleDroppedException (rather than a NullPointerException) that help teams find the “problem” service. They can further deep dive with one click into existing service logs to retrieve additional context. Onboarding takes minutes because users only need to add a single line of configuration to automatically log each API call to the Flare log.
Our distributed tracing layer successfully solved the cross-team analysis use case but teams challenged us to provide easier and faster access to the overall health of the Prime Video service environment. We solved this by building an observability layer on top of the distributed tracing layer. We introduced visualizations that built on aggregated data from Flare logs to map connectivity, latencies, and error rates, giving users a live view of the state of Prime Video’s decoupled service environment, without having to write any queries. We also allow analysis of a single trace through services via a trace timeline feature showing service calls, latency, and service errors. This functionality has allowed our teams to more quickly triage and root cause problems.
By putting all these layers together, we now have a critical system for Prime Video teams to support their daily operations and high-velocity events.
Laser-focused on determining the root cause
We were laser-focused on helping Prime Video teams quickly determine root cause and problem ownership, which in a decoupled services environment can be intricate. Self-service tools were critical to allow users to find answers quickly. We wanted to empower users to onboard their logs and independently query the data they need. We had to solve cross-service data analysis as we had data that ticket “table tennis” between services was delaying resolution.
The service that triggered an alarm might not be the source of the problem but merely an indicator of an upstream or downstream service’s anomaly. The existing processes had users dive into glitches and, if they couldn’t fix it, pass it to their next upstream or downstream team. We needed to reduce these interactions and have teams only engage with each other if the problem is outside of their knowledge. We solved three challenges so that teams could make informed decisions around cross-service analysis:
- Difficulty querying and understanding logs from different services;
- No consistent trace identifier to correlate or join logs between services;
- No visibility of services that participated in the fulfillment of a viewer’s request.
We also had to make sure that we could scale to support planned and unexpected traffic volumes. High-velocity events at Prime Video can be at different times, in different territories, and hosted in different data centers. Viral trends mean that we might not be aware of upcoming traffic spikes, so our services had to have zero-touch auto-scaling to ensure operations continue. We could scale to peak traffic all the time but this would not be a frugal use of AWS resources.
Prime Video teams own and are responsible for all data within log files, so we made a conscious choice that each team should store their logs in Amazon Simple Storage Service (Amazon S3) buckets in an AWS account that they own. We provide each service team with an AWS Cloud Development Kit (CDK) script that allows them to quickly and simply configure their infrastructure in areas such as permissions. Using an AWS CDK script also meant an easier support model for us as each service team has an identical configuration.
We ingest and vend data from existing application, request, service, and bespoke logs that are generated using standard log utilities such as Apache Log4j. These are data sources that allow teams to analyze logs from their service, which relates to the base layer of our three-layer solution. Most teams also create Flare logs to seamlessly analyze log files across service boundaries. These are the data sources for the second distributed tracing layer.
Flare logs contain highly valuable data, but we knew we had to provide a low barrier to entry to encourage engagement. To create Flare logs, service teams set a configuration to automatically write each API call to the services Flare log. Teams can add additional log statements in their service code if they want to supplement details. We used Disco, the Amazon open-source library, to automatically create and add a trace identifier to each log statement so that users can trace requests across services. Disco handles the in-process propagation (between threads) and between-process propagation (via HTTP headers), which means that developers don’t need to actively think about their Flare log statements and can focus on delivering value for our viewers, instead. We ingest logs for each Prime Video service through separate ingestion infrastructure allowing each service to scale independently. This multi-tenant setup also adds resiliency by preventing any ingestion problems bleeding between services.
The first step in our ingestion process is to copy existing logs from each service host to an S3 bucket at a rate of once a minute. Once written to Amazon S3 an Amazon Simple Notification Service (Amazon SNS) notification is sent. We have an Amazon Simple Queue Service (Amazon SQS) queue listening for this notification and it adds the new log file to this queue for processing. Every minute, we use Amazon CloudWatch Events to trigger a Lambda function to check the size of the queue.
From this, we calculate how many Amazon ECS processing tasks are required to clear the queue. As the queue gets larger, more processing tasks come online to clear it. As the queue gets smaller, the number of tasks reduces. We have an aggressive scaling algorithm to ensure that the SQS queue is never too large when there are spikes or peaks in log volume.
In fact, we alarm on AgeOfOldestMessage, not SQS queue depth, because we know at times the queue might grow in size but we are confident our scaling algorithm will launch enough tasks to clear it in a timely manner. The Amazon ECS tasks efficiently and cost-effectively convert the text-formatted log files into binary Apache Parquet format. Apache Parquet is a columnar data format that allows us to efficiently query log data.
When onboarding a service, teams can define which data is extracted into which columns for easier SQL querying. For example, while onboarding, teams can configure the ingestion process to split log statements into several columns using string extraction techniques such as regular expressions. Often the column extraction reverses the formatted strings that are used to create the log statement within the code.
The following two log statement examples could be ingested with a configuration that splits the text into seven columns.
Sample Log Message 1
09 May 2022 11:30:58,700 [INFO] (Executor-7) com.amazon.pv.ServiceName: Request: [CustomerId ABCDE] [Asin ASIN-123] Log Message here
Sample Log Message 2
09 May 2022 11:30:58,800 [ERROR] (Executor-2) com.amazon.pv.ServiceName: Request: [CustomerId ABCDE] Incorrect Login Permissions
|Column Name||Log Message 1||Log Message 2|
|Log Message||Log Message here||Incorrect Login Permissions|
These files are written back into the S3 bucket in the team-owned AWS accounts. The data indices are based on hour partitions and this is implemented through our S3 bucket structure of year/month/day/hour. We use Amazon Athena as our query engine to read these Parquet formatted files and use its partition projection feature to automatically update the partition listing.
The following diagram shows our ingestion architecture that converts raw text log files to parquet files that can be efficiently queried. Our P99 ingestion latency is less than three minutes.
The UI allows users to query all onboarded log files via Amazon Athena. This UI is permissioned to ensure compliance with various worldwide regulations. Users write SQL to query data and create dashboards, share queries and results, and set up integrations with third-party tools. This UI also contains features to support the third Observability layer of our product. It is in this UI that we present features such as service connectivity graphs, trace timelines, and service health data to users.
We also provide programmatic access to our logs by which users can configure queries to run against a cron schedule and publish results to an SNS topic that integrates with Slack, email, or Amazon CloudWatch. Being on a cron schedule, data is pushed to users rather than having them pull it. As each team owns the S3 buckets where this data resides, some have fed this data into data lakes and other processes they use to support their service.
Since launching in 2018, tools from the Federated Diagnostics team are integral to all Prime Video services and are onboarding services in other Amazon organizations. We have users from across multiple job families, from developers to product managers, who run tens of thousands of queries in support of a seamless viewer experience.
Stay tuned for more from us!