Prime Video invents an algorithm that optimizes content encoding and manifest stitching at scale
Prime Video’s algorithm optimizes encoding for secondary content and manifest generation for content stitching, achieving the best possible performance and compatibility on customer devices.
At Prime Video, we regularly stitch primary and secondary video content together for video on-demand (VOD), live linear channels, and live events. For instance, we might stitch secondary content (for example, a video advert) into primary content such as live sports, a movie, or an episode in a TV show.
However, video playback artifacts (such as video freezes, glitches, silent video, and playback stalls) can appear on different devices depending on the secondary content’s properties, how the manifest is generated during content stitching, and device limitations on media platforms (for example, smart TVs and casting devices). These artifacts must be avoided because they significantly affect the customer’s content experience and enjoyment.
Prime Video tech teams evaluated and assessed manifest generation and audio/video (A/V) encoding of secondary content on services and video playback on devices. We then invented an algorithm to optimize encoding for the secondary content and manifest generation for content stitching. By doing this, we consistently avoid any video playback artifacts and ensure the best possible performance and compatibility for our customers’ devices.
Audio/video misalignment causes gaps and overlap in media timelines
Secondary content often has an audio stream that is longer or shorter than the video stream by tens or hundreds of milliseconds. This comes from the difference between the audio frame duration and the video frame duration used in A/V compression, together with the mismatch between the video frame rate and the audio sampling frequency. Such a misalignment can also come from the content production and capture process.
The following diagram shows that if the duration of secondary content in an MPEG-DASH manifest is equal to the audio stream duration (AD), then there is an overlap in the video stream timeline when the video stream duration (VD) is longer than the AD, and a gap when the VD is shorter.
Similarly, if the duration of secondary content in an MPEG-DASH manifest is equal to the VD, then there’s a gap in the audio stream timeline when the AD is shorter than the VD, or an overlap with the next period when the AD is longer. Occasionally, audio and video streams have the same duration and there are no gaps or overlap. But typically, this is just a happy coincidence.
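As a quick illustration of this mismatch, the sketch below computes the gap or overlap for each choice of period duration. The frame durations (AAC-LC audio with 1024-sample frames at 48 kHz, 25 fps video) and the helper name are illustrative, not part of the production system:

```python
# Illustrative frame durations for secondary content.
AUDIO_FRAME_S = 1024 / 48000   # one AAC-LC frame at 48 kHz, ~21.33 ms
VIDEO_FRAME_S = 1 / 25         # one video frame at 25 fps, 40 ms

def timeline_mismatch(period_duration_s, stream_duration_s):
    """Positive result: the stream overlaps the next period.
    Negative result: the stream leaves a gap in the media timeline."""
    return stream_duration_s - period_duration_s

# 30 s of video against audio that runs ~37 ms longer.
vd = 750 * VIDEO_FRAME_S       # 30.0 s of video
ad = 1408 * AUDIO_FRAME_S      # ~30.037 s of audio

# Period duration set to the AD: the shorter video leaves a gap.
print(timeline_mismatch(ad, vd))   # negative -> gap in the video timeline
# Period duration set to the VD: the longer audio overlaps the next period.
print(timeline_mismatch(vd, ad))   # positive -> audio overlap
```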
Although the DASH Industry Forum (DASH-IF) recommends providing enough media segments to cover the entire time span, there is still ambiguity in how best to achieve this with secondary content input. Even after enough media segments are provided to cover the entire time span, there can still be overlap with the next period, which often causes artifacts in video playback depending on the amount of overlap and the media platform’s limitations.
Optimizing A/V encoding and manifest generation
We extensively analyzed audio and video compression algorithms for standard specifications, media platforms on devices, and human perception (whether human eyes and ears can perceive any artifacts in video playback). The following list is a summary of our analysis:
- According to audio compression algorithms in standard specifications (for example, AAC-LC and Dolby Digital Plus), leading or trailing audio frames can be dropped independently in the compressed domain, rather than from raw pulse-code modulation (PCM) samples, without going through computationally intensive audio decoding. Media platforms on devices can do this after compressed audio frames are demuxed from a file container.
- In contrast, video compression algorithms in standard specifications (for example, H.264/AVC and HEVC/H.265) have a built-in serial dependency that comes from inter-frame motion compensation. Dropping video frames in the trailing part of a video segment is challenging because a device often has to decode all frames and then selectively drop some frames in the trailing portion to compensate for the overlap. In some cases, media platforms might not even have APIs available to drop video frames in the trailing part of a segment.
- When there is incremental or accumulated overlap in one stream, we have to delay that stream’s playback relative to the other to resolve the overlap, which introduces A/V drift and desynchronization. According to Rec. ITU-R BT.1359-1, people are typically more tolerant of late audio than of early audio. Therefore, audio stream overlap is preferred over video stream overlap, because we can delay audio playback much further than video playback without a perceivable loss of A/V sync.
Given these constraints from audio and video compression algorithms and media platforms, together with the recommendations from DASH-IF and Rec. ITU-R BT.1359-1, we recommend the following algorithm for secondary content. The duration of the secondary content in a DASH manifest must match the VD, which avoids video stream overlap and trailing video frame dropping. The AD must be longer than or equal to the VD. To minimize audio frame dropping and the amount of audio overlap while leaving no gaps in the audio stream, the AD must also stay within one audio frame duration (AFD) of the VD:

VD ≤ AD < VD + AFD
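The recommended relationship — the period duration equals the VD, and the AD covers the VD with less than one audio frame of overlap — can be expressed as a small predicate. This is a sketch; the AAC-LC frame duration (1024 samples at 48 kHz) and the function name are illustrative:

```python
# Illustrative audio frame duration: one AAC-LC frame at 48 kHz.
AUDIO_FRAME_S = 1024 / 48000  # ~21.33 ms

def period_constraint_ok(ad_s, vd_s, audio_frame_s=AUDIO_FRAME_S):
    """True when VD <= AD < VD + one audio frame duration: the audio
    fully covers the period with less than one frame of overlap."""
    return vd_s <= ad_s < vd_s + audio_frame_s

print(period_constraint_ok(30.016, 30.0))   # True: minimal audio overlap
print(period_constraint_ok(29.990, 30.0))   # False: gap in the audio stream
print(period_constraint_ok(30.030, 30.0))   # False: more than one frame over
```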
How can we satisfy this constraint when handling secondary content whose AD is longer or VD is shorter by tens or hundreds of milliseconds? In secondary content encoding, we define the audio and video encoding algorithm with silent audio frame and black video frame padding, and with no trimming or loss of any trailing audio or video frames. For example, if the AD is shorter than the VD, we keep padding silent audio frames until the equation is satisfied. Conversely, if the AD is longer than the VD, we keep padding black video frames at the secondary content’s frame rate.
The following diagram shows how we pad silent audio frames until the AD is longer than the VD and the equation is satisfied. Black video frame padding would also be similar.
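That padding loop can be sketched as follows. This is a minimal illustration, not the production encoder: the frame durations (AAC-LC at 48 kHz, 25 fps video) and the function name are assumptions, and durations are kept as exact rationals so the stopping condition isn’t affected by float rounding:

```python
from fractions import Fraction

# Illustrative frame durations.
AFD = Fraction(1024, 48000)  # one AAC-LC audio frame, ~21.33 ms
VFD = Fraction(1, 25)        # one video frame at 25 fps, 40 ms

def pad_to_alignment(ad, vd, afd=AFD, vfd=VFD):
    """Pad silent audio frames and/or black video frames (never trim)
    until VD <= AD < VD + one audio frame duration."""
    audio_pad = video_pad = 0
    while not (vd <= ad < vd + afd):
        if ad < vd:
            ad += afd          # pad one silent audio frame
            audio_pad += 1
        else:
            vd += vfd          # pad one black video frame
            video_pad += 1
    return audio_pad, video_pad, ad, vd

# Audio runs ~27 ms short of 30 s of video: two silent frames close the gap.
audio_pad, video_pad, ad, vd = pad_to_alignment(1405 * AFD, 750 * VFD)
print(audio_pad, video_pad)  # 2 0
```

Because padding is always whole frames, the loop may alternate between the two streams when one stream overshoots the other, but it always terminates with the AD inside the one-audio-frame window above the VD.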
Our algorithm works on the understanding that the AD and VD can’t align perfectly, due to the audio sampling rate and frame size, and align only occasionally over a long cycle. The equation doesn’t generate any gaps in the audio and video streams of secondary content, which satisfies DASH-IF’s recommendation. Finally, the equation doesn’t allow any overlap between the secondary content period and the next period in the video stream. It always has a minimal overlap in the audio stream, which media platforms or video players can handle to ensure tolerable A/V synchronization and drift.
Achieving smooth playback at scale on all devices
Since 2020, Prime Video has deployed this algorithm for audio and video encoding of secondary content and for manifest generation during content stitching, with the overall period duration taken from the VD. Specifically, we encode the secondary content either offline or in real time, using silent audio frame and black video frame padding when necessary. We always set the period duration of the secondary content to the VD during dynamic manifest generation in content stitching at scale. Because of this, Prime Video has achieved smooth video playback without any perceivable artifacts across all eligible devices.
Our work on this required a deep understanding of audio and video compression algorithms, the DASH standard specifications and DASH-IF guidelines, the typical limitations of media platforms across devices, and an awareness of how people perceive artifacts in video playback. It went through multiple discussions, prototype experiments, and simplifications before finally being pushed to production. But what truly made all this possible was an excellent team of incredible engineers and technologists at Prime Video.
Stay tuned for more!