Prime Video invents a new way to represent audio media timelines

Prime Video used closed-form equations and solutions to represent and compress audio media timelines in a pattern template, which ensures a second-order, lossless compression for audio stream media timelines.

Feb 13, 2023

You can stream many live events on Prime Video, such as concerts, Major League Baseball, the English Premier League, or NFL Thursday Night Football (TNF). During a live-streaming event, Prime Video must refresh the live manifest every few seconds to update it with the most recent media information. However, a live streaming manifest’s size grows linearly over time and also grows linearly with the number of supported audio streams. This is caused by the existing audio timeline representation method in streaming technologies, such as HTTP Live Streaming (HLS) and MPEG-DASH.

For example, in 2020, TNF on Prime Video had three audio streams in stereo Advanced Audio Coding-Low Complexity (AAC-LC) codec format, which resulted in around 1,154 KB for a six-hour live DASH manifest. When the full six-hour DASH manifests were refreshed every two seconds, a bandwidth of around 4,616 Kbps was required to download the manifest (the bit rate for 720p video streams is around 2-6 Mbps). During manifest parsing, the huge manifest size adds computational overhead on the service and device side, particularly if devices have limited CPU power. It also impacts the bandwidth used to download audio and video segment data and causes a lower-quality experience. Finally, the manifest uses more storage on both the service and device side.
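The bandwidth figure above follows directly from the manifest size and the refresh interval. The following is a minimal sketch of that arithmetic; the function name `manifest_refresh_kbps` is hypothetical and only the 1,154 KB and two-second values come from the example above.

```python
def manifest_refresh_kbps(manifest_kb: float, refresh_interval_s: float) -> float:
    """Bandwidth (in Kbps) needed to re-download a manifest of
    manifest_kb kilobytes once every refresh_interval_s seconds."""
    bits_per_byte = 8
    return manifest_kb * bits_per_byte / refresh_interval_s


# TNF 2020 example: a 1,154 KB six-hour manifest refreshed every 2 seconds.
tnf_kbps = manifest_refresh_kbps(manifest_kb=1154, refresh_interval_s=2)
print(tnf_kbps)  # 4616.0
```

Because the manifest size itself grows with the event duration, this cost keeps rising for as long as the live event runs.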

To solve this, Prime Video invented closed-form equations and solutions to represent and compress audio media timelines using pattern templates. These templates provide a second-order, lossless compression for audio stream media timelines, build on fundamentals of audio and video processing and compression, and are applicable to all streaming technologies. Our algorithms and equations are flexible and apply to any audio and video stream configuration, irrespective of video frame rate, video codec format, audio sampling frequency, or audio coding format.

Growing live manifests in streaming technologies

In live streaming, it is preferable to maintain the same number of video and audio segments, and keep the corresponding audio or video stream segments temporally aligned. This could be for easy trimming, content editing, or playback. Many video encoders are configured to generate fixed duration video segments. A fixed duration is useful for efficiently generating a manifest because a list of segments and their locations can be used by client devices to acquire the segments of a media presentation. It allows for the collapse of a sequence of multiple video segments into a single entry in one live manifest.

However, the duration of an audio frame is fixed by the audio sampling rate and by the number of samples per frame that the audio codec defines. As a result, the fixed duration of the video segments often doesn’t correspond to an integer multiple of audio frames.

To maintain temporal alignment of the video and audio segments, audio segments are generated with different durations. Some are slightly longer than the fixed video segment duration and others are slightly shorter. The following illustration shows a typical example of this.

The diagram shows a video stream in H.264/AVC at 30 frames per second (fps) with two-second segments and an audio stream in AAC-LC at a 48 kHz sampling rate with segments of around two seconds. The audio and video segments are perfectly aligned every eight seconds. The misalignment between corresponding audio and video segments is minimal and follows a pattern of [0, 5.3, 10.7, 16] milliseconds.
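The millisecond offsets can be reproduced with exact rational arithmetic. The sketch below assumes that a packager ends each audio segment at the first audio frame boundary at or after the matching video segment boundary; that rule, and the names `AUDIO_FRAME_S`, `VIDEO_SEGMENT_S`, and `boundary_offsets_ms`, are illustrative assumptions rather than Prime Video's implementation.

```python
from fractions import Fraction
from math import ceil

AUDIO_FRAME_S = Fraction(1024, 48000)  # AAC-LC: 1024 samples per frame at 48 kHz
VIDEO_SEGMENT_S = 2                    # fixed video segment duration in seconds


def boundary_offsets_ms(num_segments: int) -> list[float]:
    """Audio-minus-video offset, in milliseconds, at the end of each segment."""
    offsets = []
    for k in range(1, num_segments + 1):
        video_end = k * VIDEO_SEGMENT_S
        # Assumed rule: cut audio at the first frame boundary at or after video_end.
        frames_so_far = ceil(video_end / AUDIO_FRAME_S)
        audio_end = frames_so_far * AUDIO_FRAME_S
        offsets.append(round(float((audio_end - video_end) * 1000), 1))
    return offsets


print(boundary_offsets_ms(4))  # [5.3, 10.7, 16.0, 0.0]
```

The offset returns to zero after the fourth segment, which is the eight-second realignment described above; the leading 0 in the article's pattern is the shared starting point of both streams.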

While this means temporal alignment can be maintained, the irregularity in audio segment durations means that audio segment sequences cannot be represented in the manifest as efficiently as video segment sequences. In fact, while the video portion of the manifest might remain constant in size over the duration of a media presentation, the audio portion increases linearly with the number of audio segments. This is problematic for devices and cloud services because the manifest consumes memory, processing resources, and internet bandwidth.

When using DASH streaming technology, a DASH manifest includes lists of video segments and audio segments representing the audio and video components of the media presentation. Every entry in each list represents one or more segments in relation to the media timeline for the media presentation and includes a starting timestamp and a segment duration. The DASH standard includes a repeat attribute (r) that, when it’s included in a segment entry, indicates consecutive repetitions of segments having a specified duration.

For example, a single video timeline element in a live manifest can indicate that, beginning at a given timestamp, there is a run of consecutive video segments that all have the same duration. This means that thousands of video segments can be represented by a single element because the video segments have the same duration.

However, because of the audio fragmentation issues, far fewer consecutive audio segments share the same duration, which results in a much longer list of elements.

A typical excerpt of such a list shows that, beginning at a given timestamp, there are three consecutive segments that share one duration, followed by a single segment that has a slightly shorter duration. In this example, the use of the DASH repeat syntax only compresses four entries into two. Therefore, despite the use of the repeat syntax, the audio segment list will continue to grow linearly with the total duration of the media presentation.
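To make the linear growth concrete, the sketch below run-length encodes segment durations the way the DASH repeat attribute does, where r counts the additional repetitions after the first segment. The timescale of 48,000 ticks per second and the names `audio_durations` and `run_length_entries` are assumptions for illustration; the durations are 94 and 93 AAC-LC frames of 1,024 samples each.

```python
from itertools import groupby

TIMESCALE = 48000                   # assumed: ticks per second match the sampling rate
LONG, SHORT = 94 * 1024, 93 * 1024  # 96256 and 95232 ticks


def audio_durations(num_segments: int) -> list[int]:
    """Durations of consecutive audio segments: three long, then one short."""
    cycle = (LONG, LONG, LONG, SHORT)
    return [cycle[i % 4] for i in range(num_segments)]


def run_length_entries(durations: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of equal durations into (duration, r) pairs, where r is
    the number of repetitions after the first segment, as in DASH."""
    return [(d, len(list(run)) - 1) for d, run in groupby(durations)]


print(run_length_entries(audio_durations(4)))           # [(96256, 2), (95232, 0)]
print(len(run_length_entries(audio_durations(4244))))   # 2122 entries for audio
print(len(run_length_entries([2 * TIMESCALE] * 4244)))  # 1 entry for video
```

First-order run-length compression halves the audio entry count, but the count still scales with the number of segments, whereas the fixed-duration video list stays at one entry.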

Pattern template manifest

We investigated where and why this behavior arises from audio and video segment durations. We used a typical case study to reveal the underlying fundamentals, with a video frame rate of 30 fps and an audio sampling frequency of 48 kHz. The analysis is generic and extends to all other video frame rates and to audio codec formats at different audio sampling rates.

With a video stream at 30 fps and an AAC-LC audio stream at a 48 kHz sampling rate, we use two equations to perfectly align audio and video segments. The following equation shows what X (the number of video frames, each 1/30 second long) and Y (the number of audio frames, each 1024/48000 seconds long) should be for audio segments and video segments to perfectly align:

X \times \frac{1}{30} = Y \times \frac{1024}{48000}

The following equation indicates that the minimum value of X is 16 and the minimum value of Y is 25, which means that audio and video segments can perfectly align at about 0.53 seconds (16/30 seconds) or at integer multiples of that duration:

\frac{X}{Y} = \frac{16}{25}
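The minimum values of X and Y are simply the ratio of the two frame durations reduced to lowest terms. As a minimal sketch, Python's `fractions.Fraction` performs that reduction automatically; the helper name `smallest_alignment` is hypothetical.

```python
from fractions import Fraction


def smallest_alignment(unit_a: Fraction, unit_b: Fraction) -> tuple[int, int]:
    """Smallest positive integers (a, b) such that a * unit_a == b * unit_b."""
    ratio = unit_b / unit_a  # equals a / b; Fraction keeps it in lowest terms
    return ratio.numerator, ratio.denominator


video_frame = Fraction(1, 30)        # seconds per video frame at 30 fps
audio_frame = Fraction(1024, 48000)  # seconds per AAC-LC frame at 48 kHz

X, Y = smallest_alignment(video_frame, audio_frame)
print(X, Y)                    # 16 25
print(float(X * video_frame))  # about 0.533 seconds
```

Using exact fractions instead of floating-point values avoids rounding errors that could hide or invent an alignment point.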

A video segment duration of 0.53 seconds, or an integer multiple of 0.53 seconds, requires us to carefully set the IDR interval in the video encoding configuration and to segment the audio to match the video segment duration.

For adaptive bit-rate switching purposes, we typically set the video segment duration to a fixed two seconds for integer video frame rates. We must therefore use two slightly different equations. The following equation shows what F (the number of two-second video segments) and Y’ (the number of audio frames, each 1024/48000 seconds long) must be so that audio segments, measured in whole frames, and two-second video segments can perfectly align:

F \times 2 = Y' \times \frac{1024}{48000}

The following equation indicates that the minimum value of F is four and the minimum value of Y’ is 375:

\frac{F}{Y'} = \frac{4}{375}

This means that audio and video segments can perfectly align every eight seconds or at integer multiples of eight seconds. Eight seconds consists of four two-second video segments and four audio segments of (94, 94, 94, 93) audio frames, which have the duration pattern of (2.0053, 2.0053, 2.0053, 1.984) seconds. This is where the typical pattern in most media packagers comes from.
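The frame counts and durations in this pattern can be derived in a few lines. The sketch below reduces the ratio to get F and Y’, then spreads the 375 frames over the four segments with `divmod`; placing the longer segments first is an assumption chosen to match the (94, 94, 94, 93) ordering, and the variable names are illustrative.

```python
from fractions import Fraction

audio_frame = Fraction(1024, 48000)  # seconds per AAC-LC frame at 48 kHz
video_segment = Fraction(2)          # seconds per video segment

# F / Y' in lowest terms gives the smallest aligned cycle.
ratio = audio_frame / video_segment
F, Y_prime = ratio.numerator, ratio.denominator
print(F, Y_prime)  # 4 375

# Spread Y' audio frames over F audio segments, longer segments first (assumed).
base, extra = divmod(Y_prime, F)  # 93 frames each, with 3 frames left over
frames = [base + 1] * extra + [base] * (F - extra)
durations = [round(float(n * audio_frame), 4) for n in frames]

print(frames)     # [94, 94, 94, 93]
print(durations)  # [2.0053, 2.0053, 2.0053, 1.984]
```

The four durations sum to exactly eight seconds, so the same four-segment cycle repeats for the rest of the stream.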

Based on this mathematical analysis of how and why audio segment patterns are generated, it’s natural to define a new “Pattern” syntax for the DASH standard as follows:

“Element Pattern to allow for the grouping of a set of S elements into a pattern. A Pattern element might exist in conjunction with S elements at the SegmentTimeline level and contains child S elements, which represent a repeating pattern of multiple segments.”

The pattern element represents repeating patterns for segment durations and provides a second-order compression for manifests beyond what is possible with the first-order compression syntax in DASH. In the previous example, the same audio segment list can be represented by a single pattern element.

That element indicates that there are 1,061 instances of the pattern. In this example, each pattern has three consecutive segments of one duration followed by one segment of a slightly shorter duration. The pattern template syntax therefore allows for the representation of 4,244 audio segments in a single entry, so the audio media timeline remains constant in size instead of growing linearly with the duration of the presentation.
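The idea of second-order compression can be sketched as finding the shortest block of durations that repeats to cover the whole list. The function below is an illustration only, not the manifest syntax or Prime Video's algorithm: `compress_to_pattern` is a hypothetical name, and it assumes the list contains a whole number of pattern cycles, which a real live manifest would not always guarantee.

```python
def compress_to_pattern(durations: list[int]) -> tuple[list[int], int]:
    """Return (pattern, count) for the shortest pattern such that
    pattern repeated count times equals durations."""
    n = len(durations)
    for size in range(1, n + 1):
        if n % size == 0 and durations == durations[:size] * (n // size):
            return durations[:size], n // size
    return durations, 1  # only reached for an empty list


# 4,244 audio segments in ticks of 1/48000 s (94 and 93 frames of 1024 samples).
segments = [96256, 96256, 96256, 95232] * 1061
pattern, count = compress_to_pattern(segments)
print(pattern, count)  # [96256, 96256, 96256, 95232] 1061
```

However long the stream runs, only the repeat count changes, so the representation stays the same size.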

As we learnt at Prime Video, using this pattern template reduces the size of a DASH manifest for a 2.3-hour audio stream in AAC-LC codec format from 121 KB to 24 KB. As the duration or the number of audio streams increases, these savings become even more significant.

Improving live-streaming experiences for Prime Video customers

This pattern template’s algorithm and technology are also applicable to HLS and Microsoft Smooth Streaming because the segment pattern comes from audio and video compression properties rather than from any one streaming format. These properties are the video frame rate and segment duration, together with the audio frame duration, which is determined by the sampling rate and the number of audio samples in one audio frame as defined in the audio compression standard.
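Because the pattern depends only on these codec properties, the same calculation can be parameterized. The sketch below is a hedged generalization with a hypothetical helper, `alignment_cycle`; the 44.1 kHz and 1,536-samples-per-frame (AC-3) inputs are illustrative configurations that are not taken from the article.

```python
from fractions import Fraction


def alignment_cycle(samples_per_frame: int, sampling_rate_hz: int, video_segment_s):
    """Return (video_segments, audio_frames, cycle_seconds) for the smallest
    cycle in which fixed-duration video segments and whole audio frames align.
    Pass video_segment_s as an int or Fraction to keep the arithmetic exact."""
    segment = Fraction(video_segment_s)
    audio_frame = Fraction(samples_per_frame, sampling_rate_hz)
    ratio = audio_frame / segment  # video_segments / audio_frames, lowest terms
    return ratio.numerator, ratio.denominator, ratio.numerator * segment


print(alignment_cycle(1024, 48000, 2))  # AAC-LC at 48 kHz: 4 segments, 375 frames, 8 s
print(alignment_cycle(1024, 44100, 2))  # AAC-LC at 44.1 kHz: 128 segments, 11025 frames, 256 s
print(alignment_cycle(1536, 48000, 2))  # AC-3 at 48 kHz: 2 segments, 125 frames, 4 s
```

The cycle length varies widely between configurations, but it is always finite for rational frame and segment durations, so a repeating pattern always exists.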

Since 2019, Prime Video has enabled and deployed pattern template manifests in both HLS and DASH for our live streaming on services and devices. By doing this, we have achieved significant benefits in video playback experiences for our customers, including better video quality and reduced buffering and spinning rates during video playback.

Yongjun Wu

Senior Principal Engineer – Prime Video
