How Prime Video ingests, processes, and distributes live TV to millions of customers around the world, while reducing costs
At Prime Video, cost optimization doesn’t mean compromising on quality or reliability for customers.
Prime Video began its live streaming journey in October 2015, with the launch of subscriptions for Showtime and Starz channels in the United States, which included eight linear stations from each channel partner. Since then, in partnership with broadcasters and content owners, we have launched over 1,000 TV stations, bringing a mix of simulcast subscription, traditional free linear, and free ad-supported TV (FAST) services spanning multiple genres such as entertainment, news, and live sports to a global audience.
This article explains how we built, scaled, and operate our platform globally, and how we think about reliability, availability, cost, security, and sustainability as we grow the content selection and customer base. It discusses multiple aspects of our linear TV services and provides an introduction to how we do live streaming at a global scale.
Werner Vogels, traditional TV, and Prime Video
Let’s begin with some numbers that show our scale and why the dimensions we described earlier are important to us. We operate in partnership with over 30 broadcasters and broadcast service providers. Those partners deliver live video into Prime Video 24 hours a day, seven days a week, 365 days a year.
My colleague Andrew Collins, who leads a team at Prime Video that teaches computers to watch TV, wouldn’t let me write this article without at least once quoting Amazon CTO Werner Vogels. So, here’s my favorite (and probably the most well-known) quote from Werner: “Everything fails all the time.”
This is a truism in the world of the internet, which relies on communication between disparate software, servers, and networks. Its strength is in this heterogeneity but it can also be a weakness. When was the last time you turned on your regular, terrestrial, satellite, or cable TV and the screen was blank? Or, when was the last time that you were watching a live sports event and the picture froze? Not often…right?
Traditional broadcast television has 100 years of innovation in reliability, and viewers all over the world expect this when they switch on their TVs. So, the most important thing for us at Prime Video is that this experience of the TV always being on is the same for Prime Video customers watching live TV services delivered through our applications.
Requiring 5 9s availability
Like all good Amazonian narratives, this one starts with a metric: 99.999% (5 9s). This is the availability that we build our services to meet and a metric that translates as an “acceptable” unplanned downtime of 26 seconds per month. We hold ourselves accountable to that metric and we hold our partners accountable too.
How do we know our system can meet this uptime? Well, that’s just maths. Any component in any system has a mean time between failures (MTBF), so the combined MTBF of all the components in a system can be calculated to provide the overall availability.
The core of Prime Video’s live streaming platform consists of AWS Elemental MediaConnect, AWS Elemental MediaLive and AWS Elemental MediaPackage. The following diagram shows how we deploy the AWS Elemental and networking services in a 1+1 architecture, with four source signals from the partner, two (primary and backup) available to each leg of infrastructure, enabling failover between the two if one of the sources goes down. Each AWS Elemental component has an individual service-level agreement (SLA) of 99.9% and by combining them in different configurations, we can design a service with the SLA we require.
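For components in series, every component must be up for the system to be up, so the overall availability is the product of the individual availabilities: Overall availability = (availability of component X) × (availability of component Y). For identical components in parallel, the system is down only when every copy is down at the same time, so, assuming the copies fail independently: Overall availability = 1 − (1 − availability of component X)^n, where n is the number of parallelized components.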
1) Components in series
2) Components in parallel
In this case, we know that Amazon Web Services (AWS) provides an SLA of 99.9% for each of these components and, therefore, that running the three of them in series should result in an overall SLA of 99.7%, or 131 minutes of downtime every month. However, parallelizing two such systems, each with a combined SLA of 99.7%, results in an expected uptime of 99.999%, or 5 9s.
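The arithmetic above can be checked with a short Python sketch. It is illustrative only: the function names are made up for this example, failures are assumed to be independent, and a month is assumed to be 730 hours on average (an assumption chosen because it reproduces the 131-minute figure).

```python
MINUTES_PER_MONTH = 730 * 60  # assumption: an average month of 730 hours


def series(*availabilities: float) -> float:
    """Availability of components chained in series (all must be up)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def parallel(availability: float, copies: int) -> float:
    """Availability of identical redundant copies (at least one must be up).

    Assumes the copies fail independently of each other.
    """
    return 1.0 - (1.0 - availability) ** copies


def downtime_minutes_per_month(availability: float) -> float:
    """Expected unplanned downtime, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_MONTH


component_sla = 0.999  # the per-component SLA quoted in the text

# One leg: three components in series.
one_leg = series(component_sla, component_sla, component_sla)

# Two identical legs in parallel (the 1+1 architecture).
two_legs = parallel(one_leg, copies=2)

print(f"one leg:  {one_leg:.4%}, {downtime_minutes_per_month(one_leg):.0f} min/month")
print(f"two legs: {two_legs:.4%}, {downtime_minutes_per_month(two_legs) * 60:.0f} s/month")
```

Under these assumptions, a single leg comes out at roughly 99.70% (about 131 minutes of downtime a month), and two legs in parallel come out just above 99.999%, which is within the 26-second monthly budget.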
After we have a system designed, we need to ensure it meets the expected availability and performance: not only whether the system is operational, but also how well it is operating.
Fully observing and monitoring the system
You might be familiar with the term “observability.” It means how well we can understand how our system is performing. Again, we turn to our metrics and consider which ones we need to observe to ensure that the uptime is being achieved and that customers are getting a highly reliable experience.
Video is delivered over a network in a sequence of packets. These packets describe video frames (each of which contains a description of every pixel in relation to its neighboring pixels) and each frame in relation to its neighboring frames. Frames are grouped into self-contained groups of pictures (GOPs), made up of I (intra) and P/B (inter) frames, and when video is delivered over HTTP, these GOPs are themselves grouped into fragments. Every packet, frame, and fragment has to be delivered from the camera, via each video processing system, to the customer device in the correct sequence and in real time. If this doesn’t happen, it causes a poor customer experience, such as corruption or buffering, as shown in the following image.
This kind of degradation can happen as a result of networking or processing issues outside of Prime Video’s direct control; the pixelation in the image above, for instance, was a result of a lack of capacity within one component between a broadcaster’s source and Prime Video. Prime Video therefore needs to monitor as many points in the signal chain as possible and mitigate these types of issues before they impact customers.
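One simple check that a monitoring point could run is to verify that what arrives is contiguous and in order. The Python sketch below is a generic illustration, not a description of Prime Video’s tooling: it assumes each packet or fragment carries a monotonically increasing sequence number with no wraparound, and the function name and sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_sequence_gaps(sequence_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (expected, received) pairs wherever the stream is not contiguous.

    A pair is recorded for both missing items (received > expected) and
    out-of-order arrivals (received < expected).
    """
    gaps: List[Tuple[int, int]] = []
    expected = None
    for received in sequence_numbers:
        if expected is not None and received != expected:
            gaps.append((expected, received))
        expected = received + 1
    return gaps


# Hypothetical sequence numbers seen at one monitoring point:
# items 103 and 104 never arrived.
observed = [100, 101, 102, 105, 106, 107]
print(find_sequence_gaps(observed))
```

Running the same check at several points in the chain would show where a gap first appears, which helps separate source problems from problems further downstream.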
To fully observe the system from end to end, we need to define, for every component, the metrics that can indicate a degradation in viewer experience. We then set up alerts and reporting for when those metrics breach their thresholds so that we can respond quickly to issues and observe overall trends.
By identifying the important customer-impacting metrics, defining the acceptable thresholds for their performance, monitoring those thresholds, and putting mechanisms and technologies in place that enable us to recognize degradation, mitigate, and fix issues, we ensure that our SLAs are met and that customers receive the best possible experience at all times.
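The pattern of defining metrics, attaching thresholds, and alerting on breaches can be sketched in a few lines of Python. Everything named here is hypothetical: the metric names, limits, and sample values are placeholders for illustration and are not Prime Video’s actual metrics or thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics that must stay above a floor

    def breached(self, value: float) -> bool:
        if self.higher_is_worse:
            return value > self.limit
        return value < self.limit


# Hypothetical metric names and limits, for illustration only.
THRESHOLDS = [
    Threshold("dropped_packets_per_min", 10),
    Threshold("fragment_latency_ms", 4000),
    Threshold("video_bitrate_kbps", 1500, higher_is_worse=False),
]


def evaluate(sample: Dict[str, float]) -> List[str]:
    """Return one alert message for every threshold the sample breaches."""
    alerts = []
    for threshold in THRESHOLDS:
        value = sample.get(threshold.metric)
        if value is not None and threshold.breached(value):
            alerts.append(f"{threshold.metric}={value} breached limit {threshold.limit}")
    return alerts


# A made-up sample from one component in the pipeline.
sample = {
    "dropped_packets_per_min": 42,
    "fragment_latency_ms": 1800,
    "video_bitrate_kbps": 900,
}
for alert in evaluate(sample):
    print(alert)
```

In this made-up sample, the dropped-packet count and the bitrate floor both trigger alerts while latency stays within its limit. In practice, the same breaches would also feed reporting so that trends can be observed over time.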
Optimizing for cost, without sacrificing quality
At our operational scale, optimizing for cost is critical, but it’s also simple when starting with performance and availability as a baseline. Before diving in, let’s understand the following levers that we have for managing cost:
- Which AWS Regions we choose to use.
- How much infrastructure we use in each Region.
- The characteristics of the video signals we receive and deliver. Broadly speaking, the higher the complexity of the signal, the higher the cost.
- Whether we choose, or have, to work with third parties to receive the TV signals, or whether partners can deliver those signals directly to us.
Let’s tackle these one at a time. Optimizing for cost doesn’t mean compromising on quality or reliability. Choosing the appropriate AWS Region is a good example of this. Because our partners and customers are globally distributed, the most reliable location from which to process and originate signal for customers should be the closest Region to both the signal source and the customer. If there isn’t a “local” Region, then AWS technologies such as AWS Direct Connect, AWS Global Accelerator, AWS Transit Gateway, and AWS Elemental MediaConnect can help bring signals into and across the AWS backbone network into the optimal location for our customers. Meanwhile, Amazon CloudFront can help deliver live TV stations back to them in as reliable a way as possible.
To meet the 5 9s uptime, which is important for a reliable customer experience, we avoid building systems with single points of failure (SPOFs). This means that we deploy systems in at least two AWS Regions, as active-active, with each one being redundant for the other. AWS Elemental also provides an in-Region redundancy model called Standard Pipeline, which deploys dual redundant Elemental MediaLive systems at a reduced cost. This enables seamless failover between signals and protects the viewer from buffering by managing the failover at the MediaPackage input layer.
Our data shows that the AWS services we operate are the most reliable part of our signal pipeline, so our challenge is to ensure the signal coming into the encoding pipelines is as reliable as possible. To achieve this, we ask partners to deliver two feeds to each AWS Elemental MediaConnect (four in total). We use the input failover capability supported by MediaConnect so that if one of the inputs fails, the source immediately switches to the backup. This doesn’t disrupt the video pipeline and only marginally impacts the customer experience, with less than 1s of picture degradation. We are working to remove even that brief degradation using EMX seamless merge. However, this requires a more complex configuration on the partner side, so it will take some time and may not be possible for all our partners to deliver.
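The primary and backup behaviour described above can be pictured as a small selection rule. The Python sketch below is a conceptual illustration only: it does not use or represent the MediaConnect API, the function name and health samples are hypothetical, and the choice to stay on the backup after the primary recovers is an assumption made for this example.

```python
from typing import Optional, Sequence


def select_active(healthy: Sequence[bool], current: Optional[int]) -> Optional[int]:
    """Pick which input to use on one leg.

    Stay on the current input while it is healthy; otherwise fail over to
    the first healthy input. Returns None when no input is healthy.
    """
    if current is not None and healthy[current]:
        return current
    for index, is_healthy in enumerate(healthy):
        if is_healthy:
            return index
    return None  # no healthy input: this leg is down


# Hypothetical health samples for (primary, backup), one tuple per tick.
timeline = [(True, True), (False, True), (True, True), (False, False)]

active: Optional[int] = 0  # start on the primary
history = []
for health in timeline:
    active = select_active(health, active)
    history.append(active)

print(history)
```

With these samples, the leg starts on the primary, switches to the backup when the primary fails, stays there, and reports no usable input on the last tick. That final case is the one the second leg of a 1+1 architecture exists to cover.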
Video delivery over the internet (or any network) requires some art and science to achieve the best picture quality with the fewest bits. By reducing the number of bits delivered, we improve the quality of experience for viewers by reducing the chance of buffering, and we reduce the cost of data delivery. Our video scientists spend a lot of time optimizing our adaptive bitrate (ABR) encoding ladders to achieve the best results for customers. We work with partners to ensure that the signals coming in meet our quality bar. In addition, we further optimize our encoding by ensuring that the maximum quality encode matches the maximum quality source as far as possible, so that we don’t use more bits than necessary to encode our video and, in doing so, put the customer experience at risk.
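To make the idea of matching the top encode to the source concrete, here is a minimal Python sketch. The ladder values, class, and function names are hypothetical placeholders, not Prime Video’s actual encoding ladder, and resolution is used as a stand-in for source quality.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rung:
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # target video bitrate


# Hypothetical ladder; real ladders are tuned per codec and content type.
LADDER = [
    Rung(2160, 15000),
    Rung(1080, 6000),
    Rung(720, 3500),
    Rung(480, 1500),
    Rung(360, 800),
]


def cap_ladder(ladder: List[Rung], source_height: int) -> List[Rung]:
    """Drop rungs above the source resolution, so no bits are spent upscaling."""
    return [rung for rung in ladder if rung.height <= source_height]


def pick_rung(ladder: List[Rung], throughput_kbps: float) -> Rung:
    """Choose the highest-bitrate rung that fits the measured throughput.

    Falls back to the lowest rung when nothing fits.
    """
    fitting = [rung for rung in ladder if rung.bitrate_kbps <= throughput_kbps]
    if fitting:
        return max(fitting, key=lambda rung: rung.bitrate_kbps)
    return min(ladder, key=lambda rung: rung.bitrate_kbps)


# A 1080p source never needs the 2160p rung.
ladder = cap_ladder(LADDER, source_height=1080)
print([rung.height for rung in ladder])
print(pick_rung(ladder, throughput_kbps=4000))
```

Capping the ladder removes a rung that would cost encoding and delivery bits without adding picture quality, while the player-side selection keeps the stream within the viewer’s available bandwidth.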
The great thing about optimizing to get the best picture quality at the lowest bitrate is that everyone benefits. Prime Video reduces costs for encoding and data transfer, and customers get high-quality images with no buffering.
The final lever is how we choose to acquire signals. TV signals are available from many sources: over the air by satellite, over cables in data centers, and, increasingly, from within AWS. If we need to acquire a signal from satellite, then we have to partner with an operator who can downlink those signals in at least two locations (to mitigate localized weather issues). If the signal arrives via cable in one or more data centers, we may need a partner, or we may be able to use AWS Direct Connect.
The optimal option for us, from a cost perspective, is to pick up the signal within AWS, and we’re at a point in our technical evolution where more than half of our signals are contributed from within AWS, either direct from the partner or through their third party in AWS. In addition to cost, contribution from within AWS provides greater flexibility for Prime Video and our partners, enabling them and us to choose the optimal Regions and workloads to meet requirements and improve operations. This is because we can benefit from standardized services and metrics across the entire pipeline, which makes support and communications more streamlined, helps troubleshoot issues faster, and therefore improves the overall customer experience.
Securing Prime Video and partner content
Security in the context of this article primarily means content security. We need to ensure that Prime Video and our partners’ content is protected from camera to customer. Live Playback teams are responsible for ensuring that we employ network, transport, and content protection mechanisms that ensure 1) the integrity of the content throughout its lifecycle, 2) content is only accessible by authorized users and systems internally, 3) content is not exposed externally, except at the point of distribution, 4) only eligible customers can watch our content, and 5) our content is resilient to piracy.
We are constantly assessing our content protection measures and evaluating what technologies or products we can build or integrate to further improve content security.
Continuing our sustainability journey
The Sustainability pillar is the newest addition to the AWS Well-Architected Framework, and it’s something Prime Video has been closely investigating.
In addition to actively making sustainable choices, we are also able to take advantage of prevailing winds. We can ask some of our third-party data centers to share how much of their energy supply comes from sustainable sources. Additionally, Amazon plans to power its operations with 100% renewable energy by 2030, a goal that we are on a path to achieve five years ahead of schedule, by 2025. By moving more processes to AWS, we are helping support this goal through energy-efficient data centers.
Finally, when we have to host equipment in data centers (for example, complex live video workflows), we can choose high density, low power hardware that reduces power and cooling requirements and the overall energy consumption footprint.
In 2022, we took part in one of the first Well-Architected sustainability reviews at AWS for media and entertainment services. Several of the recommendations that came out of that review will be implemented this year. In addition, we now include sustainability-focused FAQs in our architecture documents to ensure we stay focused on this.
This is a really exciting, challenging, and fast-evolving space. My time at Prime Video has been, and continues to be, a period of constant innovation: increasing quality by working with partners to source signals that have higher resolutions and frame rates, working with technology partners to improve reliability by implementing improved forward error correction and redundancy technologies, and working alongside AWS Cloud Economists and Solution Architects to scale as reliably and cost-effectively as possible. We are pushing the boundaries of how technology can deliver amazing live TV experiences to our customers all over the world.