How Modern Video Players Work


In our last post, we went through the history of video on the internet and how it has changed since it first appeared in the web ecosystem.

In recent years, with the increased demand for multi-device streaming, adaptive bitrate has risen to the forefront, forcing web and mobile developers to entirely rethink the logic of their video technology. At first, the companies that released HLS, HDS and Smooth Streaming kept the player logic hidden in their proprietary SDKs. Developers did not have the freedom to customize the media engine logic: you couldn’t change adaptive bitrate rules, your buffer size, or even the length of your segments. This made players easy to build, as all of the decisions were made by the Powers that Be.

As expected, the need for customization arose quickly. With different use cases, needs varied greatly. Just between Live and VOD, buffer management, ABR rules and caching needs are completely different. This need for emancipation led to a set of low-level media APIs: NetStream in Flash, Media Source Extensions in HTML5, and the MediaCodec API in Android. At the same time emerged a new open HTTP streaming format, MPEG-DASH. These many advances combined gave the power back to developers, allowing them to build their own modern video players and media engines that could fit their specific needs.

Today we’ll talk about how modern video players are built, and the essential components to know and understand if you decide to build your own. A typical player today can be divided into three layers: the UI, the media engine, and the decoder.

    • User Interface (UI): This is the most high-level part of the player. It defines your end-user experience according to three distinct functionalities: the skin (or the design of your player), the UI logic (all of the custom features such as playlists, social sharing, etc.) and the business logic (specific business features such as advertising, device specificities, authentication management, etc.).
    • Media Engine: This handles all media playback logic, such as manifest parsing, segment retrieval, adaptive bitrate rules and switching, and a lot more as we’ll see next. As media engines are often tied to a platform/format, you’ll probably need to use several media engines to cover all devices.
    • Decoder & DRM Manager: This most low-level layer is usually provided by the platform (often at OS level), and is exposed as an API to developers. The main function of the decoder is to decode and render the video segments on the screen, while the DRM manager allows you to decrypt the segments if you have the right key.
Figure 1: Simplified architecture of a modern video player

In the following sections, we’ll explain the different roles of each layer using simple examples.

I. User Interface

The UI layer is the most high-level layer of a video player. It is what your users see and interact with, and can be personalized with your brand’s identity and features that will make your user experience unique. This layer is the closest to what we call front-end development. Within the UI, we also include the business logic component that makes your streaming service unique, even if the user does not directly “interact” with this aspect of the interface.

The three main components of the UI are:


a. Skin

Skin is the common name for the graphic design of the main parts of your player: control bar, buttons, animation icons, etc. As with most design components, it is generally done in CSS, and is easily customizable by a designer/front-end developer (or even yourself using turnkey solutions like JW Player or Bitdash).

Figure 2: Video.js custom skins


b. UI logic

The UI logic is what defines all direct visible interaction with the user on top of video playback: playlist bar, thumbnails, channels selection, social media sharing, etc. An infinite number of additional features may be added to your streaming experience. Many already exist as open-source plugins for various video players, which can be a source of inspiration.

Instead of discussing in detail each of the different possible features, let’s look at the example of the Eurosport player UI:

Figure 3: Eurosport’s user interface, designed for a flexible viewing experience.

In addition to more classic UI elements, we can see here a very interesting feature that lets users see what is happening in the live broadcast while they are catching up with the DVR stream, giving them the possibility to go back to the live stream at any time. Because the layout/UI and the media engine are completely separate, this feature can be built in HTML5 using dash.js in only a few lines of code.

For UI, it is best to have each feature added as a plugin/module to the core UI base.


c. Business Logic

These are what we call the “invisible” features that are unique to your business: authentication & payment, channels & playlists fetching, advertising, etc. It also includes more technical elements like A/B testing modules and device specific configurations that choose between several different media engines depending on the best options for the device.

Figure 4: Business Logic flow diagram

To shed some light on the hidden complexity, we’ll detail a few of these modules here:

Device detection and configuration logic: This is one of the most important features, as it offers the possibility of separating playback & rendering from the UI. For example, depending on the version of your browser, you may be able to use an HTML5 MSE-based media engine like hls.js, or a Flash-based engine such as flashls to stream HLS. The big advantage here is that in either case you still use the same JavaScript & CSS code for all UI and business logic features.

Detecting the user’s device also allows you to configure your user experience accordingly: you will want to start with a lower bitrate on mobile than on a 4K screen!
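This selection logic can be sketched as a simple capability check. The flags, engine names and bitrate values below are illustrative choices, not taken from any particular player:

```javascript
// Hypothetical engine picker: chooses a media engine and a startup bitrate
// from detected device capabilities. Flags and values are illustrative.
function pickPlaybackConfig(device) {
  const engine = device.supportsMSE
    ? "hls.js (MSE-based)"
    : device.supportsFlash
      ? "flashls (Flash-based)"
      : "native <video> fallback";

  // Start conservatively on mobile, more aggressively on large screens.
  const startBitrateKbps = device.isMobile ? 400 : 3000;

  return { engine, startBitrateKbps };
}

// An MSE-capable desktop browser:
console.log(pickPlaybackConfig({ supportsMSE: true, supportsFlash: true, isMobile: false }));
// An older mobile browser without MSE:
console.log(pickPlaybackConfig({ supportsMSE: false, supportsFlash: true, isMobile: true }));
```

Whatever the outcome, the UI and business logic code above this switch stays identical.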

A/B testing logic: A/B testing offers the opportunity to add and test new features in production with a limited number of users. For example, you can try out a new hover button or a new media engine with a subset of Chrome users to ensure everything works as intended. At Streamroot, we often use A/B testing to prove the effectiveness of our P2P acceleration module on a subset of users and browsers before pushing it into full production.

Advertising (optional): Handling advertisements on the client side is one of the most complex business logic features. As you can see in the videojs-contrib-ads module diagram, the workflow to get an advertisement has a number of steps. As with HTTP video playback, you must use more or less well-defined formats such as VAST, VPAID or Google IMA, be able to fetch the video from the ad server (often in an outdated, non-adaptive format), and play it as an unskippable pre/mid/post-roll in your video.


In short:

Depending on your customization needs, you may choose to use all-in-one players like JW Player that offer many of the classic UI functionalities (and also allow you to use your own), or to develop your own features based on open-source projects like Video.js, or from scratch. To create a unified user experience across browsers and native apps, you might want to take a look at tools like React Native for the UI/skin, or Haxe for the business logic, which offer a reusable code base for different devices.

II. Media Engine

The media engine emerged only recently as a truly independent component of the player architecture. In the days of MP4, the platform handled all playback logic and left only a handful of media features open to the developer (basically play, pause, seek, fullscreen mode… and that’s it).

However, new HTTP streaming formats required a distinct component for handling and controlling the new complexity: parsing the manifests, downloading the segments, adaptive bitrate monitoring and decision-making, and more. At first, this ABR complexity was handled by the platform & device providers. However, as broadcasters increasingly sought to control and customize their players, low-level APIs (Media Source Extensions on the web, NetStream in Flash, and MediaCodec on Android) began to appear, and quickly led to the creation of powerful and robust media engines built on top of them.

Figure 5: Data flow diagram of Shaka Player, a DASH media engine built by Google

Here we detail the essential components found in a modern media engine:

Manifest Parser & Interpreter

In HTTP streaming, everything starts with a manifest. This manifest contains all the metadata needed to understand what the media server has to offer: how many and which qualities, languages, subtitles, etc. The parser gets the manifest data in the form of an XML file (or a proprietary playlist format in the case of HLS), and follows the format’s specs to extract the right information. There are dozens of media servers and not all of them implement the specification correctly, so the parser must also be able to deal with minor implementation mistakes.

Once it has extracted the information, the parser interprets the data to build its own vision of the stream, and understands how to get the different tracks and segments. In some media engines, this vision takes the form of an abstract media map, which plots out the specificities of the various HTTP streaming formats onto one representation.

In the case of live streaming, the parser must also re-fetch the manifest periodically to have information on the most recent segments.
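As a toy illustration of the parsing step, here is a minimal sketch that extracts the variant streams from an HLS master playlist. Real parsers such as hls.js handle far more tags and tolerate malformed manifests; this only covers the happy path:

```javascript
// Minimal, illustrative HLS master-playlist parser: collect the variant
// streams (bitrate, resolution, URI) and sort them by bandwidth so an
// ABR controller can index qualities in ascending order.
function parseMasterPlaylist(text) {
  const lines = text.trim().split("\n").map((l) => l.trim());
  const variants = [];
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].startsWith("#EXT-X-STREAM-INF:")) {
      const attrs = lines[i].slice("#EXT-X-STREAM-INF:".length);
      const bw = /BANDWIDTH=(\d+)/.exec(attrs);
      const res = /RESOLUTION=(\d+x\d+)/.exec(attrs);
      variants.push({
        bandwidth: bw ? parseInt(bw[1], 10) : 0,
        resolution: res ? res[1] : null,
        uri: lines[i + 1], // the variant URI follows the tag on the next line
      });
    }
  }
  return variants.sort((a, b) => a.bandwidth - b.bandwidth);
}

const manifest = `#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=1280000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
high/index.m3u8`;

console.log(parseMasterPlaylist(manifest));
```

The interpreter step would then map this list onto the engine’s abstract media map, alongside the equivalent representation parsed from a DASH MPD.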

Downloader (manifests, media segments, keys)

The downloader is the module wrapping the HTTP request native API. It is used not only to download the media segments, but also the manifest and DRM keys if needed. The downloader plays the important role of handling network errors and retries, and also collects stats on the available bandwidth.

Note: HTTP or other delivery protocols may be used to obtain the media segments. For example, this is where we plug our Streamroot peer-accelerated streaming module, which delivers segments via the WebRTC protocol.
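A minimal sketch of the retry and bandwidth-accounting logic might look like the following. The request function is injected (e.g. a wrapper around fetch or XHR) so the logic stays testable outside a browser; all names here are illustrative:

```javascript
// Illustrative downloader wrapper: retries failed requests and reports
// the measured throughput of each successful download so the ABR
// estimators can consume it.
async function download(request, url, maxRetries = 3) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const start = Date.now();
    try {
      const bytes = await request(url); // resolves to a byte array
      const seconds = Math.max((Date.now() - start) / 1000, 0.001);
      return { bytes, bitsPerSecond: (bytes.length * 8) / seconds };
    } catch (err) {
      lastError = err; // network error: fall through and retry
    }
  }
  throw lastError; // all attempts exhausted
}

// Usage with a fake request that fails once, then succeeds:
const flaky = (() => {
  let calls = 0;
  return async () =>
    ++calls < 2 ? Promise.reject(new Error("timeout")) : new Uint8Array(4096);
})();
download(flaky, "https://example.com/seg1.ts").then((r) =>
  console.log(r.bytes.length) // → 4096
);
```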

Streaming Engine

The streaming engine is the central module that interacts with the decoder API, pushing the different media segments into the decoder, and handling quality-switching and playback specificities (differences in timestamp between the manifest and the segment, automatic seeking if the video stalls, etc.).
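The buffer-management side of this loop can be sketched as a simple decision function: keep the forward buffer near a target without overfilling it. The threshold values below are illustrative defaults, not taken from any specific engine:

```javascript
// Illustrative scheduling decision for the streaming engine's main loop:
// given how many seconds of media are buffered ahead of the playhead,
// decide whether to fetch the next segment now or wait.
function nextAction(bufferAheadSec, targetBufferSec = 30, lowWaterSec = 5) {
  if (bufferAheadSec < lowWaterSec) {
    return "load-urgent"; // risk of stall: fetch the next segment immediately
  }
  if (bufferAheadSec < targetBufferSec) {
    return "load";        // normal operation: keep filling the buffer
  }
  return "wait";          // buffer full enough: save bandwidth, re-check later
}

console.log(nextAction(2));  // "load-urgent"
console.log(nextAction(12)); // "load"
console.log(nextAction(45)); // "wait"
```

In a real engine this decision also accounts for pending quality switches, seeks, and the size limits of the decoder’s native buffer.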

Quality metrics estimators (bandwidth, CPU, frames, etc.)

The estimators aggregate data from the different metrics (chunk size, download time per segment, number of dropped frames, etc.) to estimate the user’s available bandwidth and CPU capabilities.

These outputs are then used by the ABR (Adaptive Bitrate) switching controller.
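A common approach for the bandwidth estimator is an exponentially weighted moving average (EWMA) over per-segment throughput samples. Here is a minimal sketch; the smoothing factor is an illustrative choice, not a value from any particular player:

```javascript
// Illustrative EWMA bandwidth estimator: each new throughput sample is
// blended with the running estimate, so transient spikes are smoothed out.
class BandwidthEstimator {
  constructor(alpha = 0.3) {
    this.alpha = alpha;   // weight given to the newest sample
    this.estimate = null; // current estimate, in bits per second
  }
  addSample(bytes, durationMs) {
    const bps = (bytes * 8) / (durationMs / 1000);
    this.estimate = this.estimate === null
      ? bps
      : this.alpha * bps + (1 - this.alpha) * this.estimate;
  }
  getEstimate() {
    return this.estimate;
  }
}

const est = new BandwidthEstimator();
est.addSample(500000, 1000); // 500 kB in 1 s → 4 Mbps sample
est.addSample(250000, 1000); // 250 kB in 1 s → 2 Mbps sample
console.log(est.getEstimate()); // → 3400000 (3.4 Mbps, smoothed)
```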

ABR switching controller

The ABR switching controller is probably the most crucial component of the media engine – and the most often neglected! The controller uses custom algorithms, taking the estimators’ metrics as inputs (bandwidth, dropped frames), and tells the streaming engine if it should change the video and audio qualities.

Much research has been done in this field; most of the difficulty lies in striking a balance between re-buffering risk and switch frequency (too many quality switches can lead to a poor user experience). Edge cases also present significant difficulties.
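To make the trade-off concrete, here is a very simplified ABR rule with a safety margin and a basic form of hysteresis. The 0.8 safety factor and the switching logic are illustrative, not any production algorithm:

```javascript
// Illustrative ABR rule: switch up only when the bandwidth estimate
// comfortably covers a higher bitrate, and switch down only when the
// current level is no longer sustainable (to limit switch frequency).
function chooseLevel(bitrates, currentLevel, estimatedBps, safety = 0.8) {
  const budget = estimatedBps * safety; // keep headroom against estimate error

  // Current level still sustainable: only consider switching up.
  if (bitrates[currentLevel] <= budget) {
    let best = currentLevel;
    for (let i = currentLevel + 1; i < bitrates.length; i++) {
      if (bitrates[i] <= budget) best = i;
    }
    return best;
  }
  // Otherwise switch down to the highest level that fits (or the lowest).
  for (let i = currentLevel - 1; i >= 0; i--) {
    if (bitrates[i] <= budget) return i;
  }
  return 0;
}

const levels = [500000, 1500000, 3000000, 6000000]; // bps, sorted ascending
console.log(chooseLevel(levels, 1, 5000000)); // budget 4 Mbps → level 2
console.log(chooseLevel(levels, 2, 2000000)); // budget 1.6 Mbps → level 1
```

Real controllers also factor in buffer level, dropped frames and segment duration before committing to a switch.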

Note: Having explored and contributed to most of the ABR algorithms available on the market, we could do a separate post on the different approaches and best practices when it comes to ABR algorithms. Tell us if you’d be interested in the comments below!

DRM manager (optional component) 

All premium services require DRM management today. The DRM system largely depends on the platform/device, as we’ll see later when exploring the decoder layer. The DRM manager in the media engine is a wrapper for the API of the content decryption module in the lower-level decoder.

Where possible, it abstracts the different browser/OS implementations (like in the case of CENC and Encrypted Media Extensions). This component is often closely linked to the streaming engine, as it interacts with the decoder layer.

Transmuxer (optional component)

As we’ll see a bit later, each platform has its own restrictions in terms of packaging and codecs (Flash reads FLV with h264/aac, MSE reads ISOBMFF with h264/aac, etc.). This has led to the need to transmux video segments before appending them to the decoder. For instance, with an MPEG2-TS to ISOBMFF transmuxer, hls.js makes it possible to play HLS with MSE. Transmuxing within the media engine layer was once a cause of concern; however, with today’s high-performance JavaScript and Flash, it induces negligible overhead and has no impact on the user experience.

In short:

There are many different components and features that can go into a media engine, from subtitles and captioning to ad insertion, etc. As a follow-up to this introduction, our next post will offer tips on choosing the right media engine for your needs, with a benchmark of the most popular open-source HTML5 media engines on the market.

It is important to understand, however, that when choosing or building a video player, having several media engines is most often necessary to reach your entire audience. This is because the decoder layer is linked to the user’s platform, as we’ll see in the next section.

III. Decoder & DRM manager

The decoder and DRM manager are closely tied to the OS, due to performance (decoder) and security (DRM) considerations.

Figure 6: Decoder, Renderer and DRM workflow diagram


a. Decoder

The decoder handles the most low-level playback logic. It demuxes and decodes the video segment, and passes each frame to the OS renderer, which shows it on the user’s screen.

Because compression algorithms have become increasingly sophisticated, decoding is very calculation-intensive and is intrinsically linked to the OS and hardware to ensure good performance and smooth playback. Today most decoding is done with the help of GPU acceleration (one of the reasons the free and more powerful VP9 has not won out over h264 today). Without GPU acceleration, decoding a 1080p stream on a modern PC can take up to 70% of the CPU, with a significant number of dropped frames.

On top of decoding and rendering the frames, the manager also often provides a native media buffer. The media engine can interact with this buffer to know its size and flush it if needed.

Each platform has its own rendering engine and APIs, as we mentioned before: NetStream with Flash, the MediaCodec API on Android, and last but not least, the standardized Media Source Extensions on the web. MSE is gaining traction, and will probably become the de facto standard on other platforms in addition to browsers.

Note: We are currently working on a Media Source Extensions API polyfill for Flash, which we open-sourced last month so broadcasters can use a single HTML5 media engine on top of a Flash renderer.

b. DRM Manager

Today, DRMs are necessary to deliver premium content approved by production studios. As they are designed to protect content from theft, DRM code and the way DRMs work are hidden from users and developers. Decrypted data never leaves the decoder layer, so it cannot be intercepted.

In an effort to standardize DRM and promote a certain degree of interoperability, several web giants have created Common Encryption (CENC) and Encrypted Media Extensions (EME), which help build a common API that can be used with different DRM providers (for instance, EME can be used with PlayReady on Edge and Widevine on Chrome). This API handles the way content keys are retrieved from DRM licences and used to decrypt the content.

CENC specifies standard encryption and key mapping methods that can be used by different DRM systems to enable decryption of the same content, provided the same key has been used to encrypt the content.

Within the browser, EME handles CENC content by identifying which DRM system(s) is associated with the content, based on the content metadata, and calling the corresponding Content Decryption Module (CDM). The CDM, when present, will then process the content license to obtain the content key and decrypt the content.

The details of licence acquisition, licence formatting and storage, usage rules and rights mapping, etc. are not specified by CENC and remain under the responsibility of the DRM providers.

Figure 7: DRM workflow


Final Thoughts

Today we took an in-depth look at how modern video players work with three distinct layers. The greatest advantage of this modern structure is that the UX is entirely separate from the media engine logic; broadcasters can therefore build a seamless user experience across devices while using several different media engines to ensure playback for different formats and older systems.

On the web, MSE and EME are becoming standard, with the help of solid media engines such as dash.js, Shaka Player and hls.js that are close to maturity and are already used in production by many leading broadcasters. Recently, this traction has expanded to STBs and connected TVs, and we are seeing more and more of these devices using MSE as the basis for their media stack. We here at Streamroot hope that perhaps one day, in the next few years, MSE will become the de facto media decoder API, and that users will wake up to a world in which they can use dash.js indifferently on their browser, in their native apps, and on their smart TVs!

Until that day comes, you will need to use several media engines to reach your entire audience. In our next post, we’ll therefore focus on what to look for in a media engine, and offer a benchmark of some of the open-source solutions available today!

13 thoughts on “How Modern Video Players Work”

  1. Thanks for an excellent post. I would be interested in an elaboration of where/how in the flow transmuxing would occur (diagrams 6 & 7) and how that fits into the Decoder & DRM flow. For example: decryption/decode of a CENC (Widevine) encrypted HLS stream (MPEG2-TS) transmuxed to ISOBMFF for native platform decryption on Android.

    1. The transmuxing on the client side is mostly needed when the platform doesn’t support the stream format. The best example is the hls.js use case: the Media Source Extensions API can decode ISOBMFF streams, but not MPEG2-TS, so hls.js transmuxes the .ts segments into fragmented mp4 segments on the fly. And as the decryption of the segments is also done via JavaScript, it also works with AES-encrypted streams (1. decryption of the segment, 2. transmuxing ts=>mp4 and 3. decoding with the MSE API).

      With DRMs, it gets a little more complicated, because you would need to decrypt the segment before transmuxing it, which is not possible because the decryption is done by the CDM, and the decrypted content is never accessible to the JavaScript player (due to obvious security restrictions).

  2. Excellent post, especially the writing style: not too complex but certainly not oversimplified. I definitely would be interested in hearing more about how ABR is handled if you’re willing to write it and there’s enough other interest. Cheers, Brad

    1. Hi Brad! Thanks so much for the comment. An ABR post is definitely on the roadmap. We have a couple more media engine articles to come (including a benchmark of some of the open-source options out there) but check back in the next couple months! We’re also doing a presentation on optimizing ABR algorithms at Streaming Media East if you’ll be attending the conference.

