WebGL: The New Frontier (for some)

Not too long ago there was a demo of the Epic Citadel OpenGL demo translated to JavaScript and played inside a web browser. The code was ported from C++ to ASM, a subset of JavaScript optimized for performance using and then fed into a web browser for rendering and display.

Even if we leave the conversion from C++ to JavaScript aside for the moment. You’re running a high end game inside a web browser… without plugins!

Think about it. We can now do 3D animations that take advantage of your computer’s GPU to render high quality 3D objects, animations and transitions that rival work done with professional 3D software (and you can even import files from Maya, Blender and others into a WebGL environment).

What is WebGL

WebGL is a cross-platform, royalty-free web standard for a low-level 3D graphics API based on OpenGL ES 2.0, exposed through the HTML5 Canvas element as Document Object Model interfaces. Developers familiar with OpenGL ES 2.0 will recognize WebGL as a Shader-based API using GLSL, with constructs that are semantically similar to those of the underlying OpenGL ES 2.0 API. It stays very close to the OpenGL ES 2.0 specification, with some concessions made for what developers expect out of memory-managed languages such as JavaScript.
From: http://www.khronos.org/webgl/

WebGL is a JavaScript API that allows us to implement interactive 3D graphics, straight in the browser. For an example of what WebGL can do, take a look at this WebGL demo video (viewable in all browsers!)
WebGL runs as a specific context for the HTML <canvas> element, which gives you access to hardware-accelerated 3D rendering in JavaScript. Because it runs in the <canvas> element, WebGL also has full integration with all DOM interfaces. The API is based on OpenGL ES 2.0, which means that it is possible to run WebGL on many different devices, such as desktop computers, mobile phones and TVs.
From: http://dev.opera.com/articles/view/an-introduction-to-webgl/

Why should we care?

Besides the obvious implications of having 3D graphics online without having to use plugins imagine how much more expressive we can make our content if we can mix 2d and 3d content. WebGL means that we’re no longer tied to plugins to create AAA-game type experiences for the web platform (look at the Epic Citadel for a refresher), incorporate 3D rendered content into your 2D web content (product configurators like the +360 Car Visualizer come to mind),
web native architectural or anatomical visualizations or just awesome animation work like thisway.js a WebGL remale of an older flash based demo or Mr. Doob’s extensive libraries of demos and examples.

As the WebGL techology stack matures we’ll see more and more examples and creative uses of the technology. I hope we’ll see more editors and tools that will make it easuer for non-experts to create interactive content.

Before we start

Before you try creating your own WebGL It’s a good idea to go to http://analyticalgraphicsinc.github.io/webglreport/ to test whether your browser and computer support WebGL.

Building WebGL content

There are two ways to create WebGL content. They each hve their advantages and drawbacks that will be pointed out as we progress through the diffrent creation processes.

The hard way

The hard way is to create content using WebGL primitives and coding everything ourselves as show in https://developer.mozilla.org/en-US/docs/Web/WebGL/Adding_2D_content_to_a_WebGL_context and https://developer.mozilla.org/en-US/docs/Web/WebGL/Animating_objects_with_WebGL. You can see the ammount of code and specialized items we need to build in order to create the content.

Take a look at the Mozilla WebGL tutorial as an illustration of the complexities involved in creating WebGL with the raw libraries and interfaces.

If interested in the full complexities of writing raw WebGL code, look at the source code for https://developer.mozilla.org/samples/webgl/sample7/index.html to see WebGL code and the additional complexities. We’ll revisit part of the examples above when we talk about custom CSS filters.

The pen below also illustrates the complexities of writing bare WebGL

See the Pen HLxbB by Carlos Araya (@caraya) on CodePen.

For a more complete look at the workings of WebGL look at Mike Bostock’s WebGL Tutorial series:

If you are a game programmer or a game developer this will be the level you’ll be working at and other platforms, libraries and frameworks will not be necessary for your work. If you’re someone who is not familiar with coding to the bare metal or who is not familiar with matrix algebra there are solutions for us.

The easy way

Fortunately there are libraries that hide the complexities of WebGL and give you a more user friendly interface to work with. I’ve chosen to work with Three.JS, a library developed by Ricardo Cabello also known as Mr Doob.

The following code uses Three.js to create a wireframe animated cube:

[codepen_embed height=”516″ theme_id=”2039″ slug_hash=”JlEfG” default_tab=”js”]
var camera, scene, renderer;
var geometry, material, mesh;


function init() {

camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 1, 1000);
camera.position.z = 800;

scene = new THREE.Scene();

geometry = new THREE.CubeGeometry(200, 200, 200);
// We have to use MeshBasicMaterial so we can see the color
// Otherwise it's all black and waits for the light to hit it
material = new THREE.MeshBasicMaterial({
color: 0xAA00AA,
wireframe: true

mesh = new THREE.Mesh(geometry, material);

renderer = new THREE.CanvasRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);



function animate() {
// note: three.js includes requestAnimationFrame shim

// Play with the rotation values
mesh.rotation.x += 0.01;
mesh.rotation.y -= 0.002;

renderer.render(scene, camera);

See the Pen Basic WebGL example using Three.js by Carlos Araya (@caraya) on CodePen.

As you can see, Three.js abstracts all the matrix/linear algebra and all the shader programming giving you just the elements that ou need to create content. This has both advantages and drawbacks.

Mike Bostock writes in his introduction to WebGL tutorial:

Every introductory WebGL tutorial I could find uses utility libraries (e.g., Sylvester) in efforts to make the examples easier to understand. While encapsulating common patterns is desirable when writing production code, I find this practice highly counterproductive to learning because it forces me to learn the utility library first (another layer of abstraction) before I can understand the underlying standard. Even worse when accompanying examples use minified JavaScript.

On the other hand many people, myself included, just want something that works. I am less interested in the underlying structure (that may come in time as I become more proficient with the library way of doing WebGL) than I am in seeing results happen.

See the section WebGL resources for a partial list of other WebGL libraries.

Building a 3D scene

For this example we will use Three.js to build an interactive scene where we will be able to use the keyboard to move a cube around the screen. We will break the scripts in sections to make it easier to explain and work with.

The script is at the end of the <body> tag so we don’t ensure that the body is loaded.

Basic setup

The first thing we do is define the element where our WebGL content is going to live. We follow that with linking to the scripts we will use in the project:

  • three.min.r58.js is the minimized version of Three.js. In developmnet I normally use the uncompressed version
  • OrbitControls.js controls the movement of the camera used in the scene
  • Detector.js is the WebGL detection library
  • threex.keyboardstate.js will help with key down event detection
  • threex.windowresize.js automatically redraws the rendering when the window is resized
<!-- This is the element that will host our WebGL content -->
  <div id="display"></div>

<!-- Script links -->
<script src="lib/three.min.r58.js"></script>
<script src="lib/OrbitControls.js"></script>
<script src="lib/Detector.js"></script>
<script src="lib/threex.keyboardstate.js"></script>
<script src="lib/threex.windowresize.js"></script>

Variables, scene and camera setup

Now tht we setup the links to the scripts that we’ll use in this application we can declare variables and call the functions that will actually do the heavy lifting for us: init() and animate

// General three.js variables
var camera, scene, renderer;
// Model specific variables
// Floor
var floor, floorGeometry, floorTexture;
// light sources
var light, light2;
// Space for additional material
var movingCubeTexture, movingCubeGeometry, movingCubeTexture;
// Window measurement
var SCREEN_WIDTH = window.innerWidth;
var SCREEN_HEIGHT = window.innerHeight;
// additional variables
var controls = new THREE.OrbitControls();
var clock = new THREE.Clock();
var keyboard = new THREEx.KeyboardState();


Initializing our scene

The first step in rendering our scene is to set everything up. In the code below:

Step 1 does two things. It sets up the scene, the root element for our model and it sets the camera, the object that will act as our eyes into the rendered model.

For this particular example I’ve chosen a perspective camera which is the one used most often. This tutorial covers perspective as used in WebGL and Three.js; it is a technical article but I believe it provides a good overview of what is a complex subject.

Once the camera is created, we set its position (camera.position.set)and where the camera is pointing to (camera.lookAt)

function init()
// Step 1: Add Scene
scene = new THREE.Scene();
// Now we can set the camera using the elements defined above
// We can also reduce the number of variables by putting the values directly into the camera variable below
camera = new THREE.PerspectiveCamera( 45, window.innerWidth / window.innerHeight, 0.1, 10000);
// The position.set value for the camera will setup the initial field of vision for the model
camera.position.set(0, -450, 150);

Step 2 sets up the renderer for our model. Using Detector.js we test if the browser supports WebGL. If it does then we setup Three.js CanvasRenderer, one of the many renderers available through Three.js.

Also note that even a modern browser can fail to support WebGL if it’s running on older hardware or if the browser is not configured properly (older versions of Chrome required you to enable WebGL support through flags in order to display WebGL content and versions of Internet Explorer before 11 did not support WebGL at all).

At this point we also set up the THREEx.WindowResize using both the renderer and the camera. Behind the scenes, WindowResize will recompute values to resize our rendered model without us having to write code.

// Step 2: Set up the renderer
// We use the WebGL detector library we referenced in our scripts  to test if WebGL is supported or not
// If it is supported then we use the WebGL renderer with antialiasing
// If there is no WebGL support we fall back to the canvas renderer
if ( Detector.webgl ) {
  renderer = new THREE.WebGLRenderer( {antialias:true} );
 }  else {
     renderer = new THREE.CanvasRenderer();
// Set the size of the renderer to the full size of the screen
// Set up the contaier element in Javascript
var container = document.getElementById( 'display' );
THREEx.WindowResize(renderer, camera);
controls = new THREE.OrbitControls( camera, renderer.domElement );
// Add the child renderer as the container's child element
container.appendChild( renderer.domElement );

Step 3 is where we actually add the objects to our rendering process. They all follow the same basic process:

  1. We define a material (texture) for the object
  2. We Define a geometry (shape) it will take
  3. We combine the material and the geometry to create a mesh
  4. We add the mesh to the scene

The texture floor has some additional characteristics:

  1. We load an image to use as the texture
  2. We set the texture to repeat 10 times on each plane.

The sphere mesh on step 3b is positioned on the Z coordinates.

// Step 3 : Add  objects
// Step 3a: Add the textured floor
floorTexture = new THREE.ImageUtils.loadTexture( 'images/checkerboard.jpg' );
floorTexture.wrapS = THREE.RepeatWrapping;
floorTexture.wrapT = THREE.RepeatWrapping;
floorTexture.repeat.set( 10, 10 );
floorMaterial = new THREE.MeshPhongMaterial( { map: floorTexture } );
floorGeometry = new THREE.PlaneGeometry(1000, 1000, 10, 10);
floor = new THREE.Mesh(floorGeometry, floorMaterial);

// Step 3b: Add model specific objects
movingCubeGeometry = new THREE.CubeGeometry( 50, 50, 50, 50);
movingCubeMaterial = new THREE.MeshBasicMaterial( {color: 0xff00ff} );
movingCube = new THREE.Mesh(movingCubeGeometry, movingCubeMaterial);
movingCube.position.z = 35;
movingCube.position.x = 0


In step 4, we add lights to the model in a way very similar to how we added the objects in step 3.

  1. Create the light and assign a color upon creation
  2. Set the position of the light using an X, Y, Z coordinate system
  3. Add the lights to the scene
// Step 4: Add light sources
// Add 2 light sources.
light = new THREE.DirectionalLight( 0xffffff );
light.position.set( 0, 90 ,450 ).normalize();
light2 = new THREE.DirectionalLight( 0xffffff);
light2.position.set( 0, 90, 450 ).normalize();

Keyboard Interactivity

Right now we’ve built a cube over a checkered background plane… but we can’t move it. That’s what the animate function will do… it will setup the keyboard bindings.

Animate is a wrapper around 2 additional functions; render and update. The render() function just call our renderer to draw the frame of our application. The update function is where the heavy duty work happens.

function animate() {
  requestAnimationFrame( animate );

function render() {
  renderer.render( scene, camera );

The update() function sets up 2 variables: delta is a time-based value that will be used to calculate the distance of any movement that happens later in the function. The variable moveDistance uses delta to calculate the exact distance of each movement (200 * the delta value)

Now that we’ve set up the value for the distance moved we test which key was pressed and change the coordinate values (x, y or z coordinates) and then prepare for the next update cycle.

The reset value is simpler. We reset all coordinates to 0.

function update() {
  var delta = clock.getDelta(); // seconds.
  var moveDistance = 200 * delta; // 200 pixels per second
  //var rotateAngle = Math.PI / 2 * delta;   // pi/2 radians (90 degrees) per second

  // global coordinates
  if ( keyboard.pressed("left") )
    movingCube.position.x -= moveDistance;
  if ( keyboard.pressed("right") )
    movingCube.position.x += moveDistance;
  if ( keyboard.pressed("up") )
    movingCube.position.y += moveDistance;
  if ( keyboard.pressed("down") )
    movingCube.position.y -= moveDistance;
  if ( keyboard.pressed ("w"))
    movingCube.position.z += moveDistance;
  if ( keyboard.pressed ("s"))
    movingCube.position.z -= moveDistance;
  // reset (let's see if this works)
  if ( keyboard.pressed ("z")) {
    movingCube.position.x = 0;
    movingCube.position.y = 0;
    movingCube.position.z = 50;

Adding audio

In a future project I want to experiment with adding audio to our models. The best audio API I’ve found is the WebAudio API submitted by Google to the World Wide Web Consortium

Getting started with the WebAudio API discusses my experiments with the WebAudio API.

Playing with editors

Three.js editors and authoring tools are just starting to appear. From my perspective some of the most promising are:


WebGL Libraries and Frameworks

From: Dev.Opera’s Introduction to WebGL article and Tony Parisi’s Introduction to WebGL at SFHTML5 Meetup

Three.js (Three Github repo)
Three.js is a lightweight 3D engine with a very low level of complexity — in a good way. The engine can render using <canvas>, <svg> and WebGL. This is some info on how to get started, which has a nice description of the elements in a scene. And here is the Three.js API documentation. Three.js is also the most popular WebGL library in terms of number of users, so you can count on an enthusiastic community (#three.js on irc.freenode.net) to help you out if you get stuck with something.
PhiloGL (PhiloGL Github repo)
PhiloGL is built with a focus on JavaScript good practices and idioms. Its modules cover a number of categories from program and shader management to XHR, JSONP, effects, web workers and much more. There is an extensive set of PhiloGL lessons that you can go through to get started. And the PhiloGL documentation is pretty thorough too.
GLGE (GLGE Github repo)
GLGE has some more complex features, like skeletal animation and animated materials. You can find a list of GLGE features on their project website. And here is a link to the GLGE API documentation.
J3D (J3D Github repo)
J3D allows you not only to create your own scenes but also to export scenes from Unity to WebGL. The J3D "Hello cube" tutorial can help you get started. Also have a look at this tutorial on how to export from Unity to J3D.
tQuery (Github repo)
tQuery ( is a library that sits on top of Three.js and is meant as a lightweight content and extension creation tool. The author, Jerome Etienne has created several extensions to Three.js that are available on his Github repository (https://github.com/jeromeetienne)
SceneJS (Github repo
This library is an extensible WebGL-based engine for high-detail 3D visualisation using JavaScript.

BabylonJS (Github repo)
The library produces high quality animations. If your comfortable being limited to windows machines to create your libraries and customizations this may be the library for you.
VoodooJS (Github repo)
Voodoo is designed to make it easier to intergrate 2D and 3D content in the same page.
Vizi (Github repo)
The framework that got me interested in WebGL frameworks. Originally developed by Tony Parisi in conjunction with his book Programming web 3D web applications with HTML5 and WebGL it presents a component based view of WebGL development. It suffers from a lack of documentation (it has yet to hit a 1.0 release) but otherwise it’s worth the learning curve.

HTML5 video captioning using VTT

Introducing VTT

WebVTT (Web Video Text Tracks), formerly known as WebSRT, is a W3C community proposal for synchronized video caption playback. It is a time-indexed file format and it is referenced by HTML5 video and audio elements.

As with many assistive technologies, it would be a mistake to assume that they are only meant as a way to provide for accessibility accomodations. We can enable captions when the ambient noise is too loud to listen to a recorded presentation, we can use chapters to navigate through a long lecture video just like DVD or Blue Ray movies.

Captions can also improve our movies’ discoverability. Google indexes the content of our captions. Both YouTube and Google search can report results based on the video captions available for a given file.

WebVTT files provide open captions, independent of the audio or video files they are attached to, they are not “hard coded” into pixels. This also means that creating VTT files requires nothing more than a text editor; although there are more specialized tools to create the captions.

Browser support

Based on Silvia Pfeiffer’s post to the VTT community group dated August, 2012, and updated with new information about Firefox, the following browsers support VTT tracks for video and audio:

Browser Version First Supported Format Supported Notes
Internet Explorer IE 10 Developer Preview 4 VTT and TTML
Google Chrome Version 18 VTT
  • Basic tutorial hosted at HTML5 Rocks
  • Based on Webkit’s implementation
Apple Safari Version 6 VTT
  • Based on Webkit’s implementation
Opera Since August, 2012 VTT
Firefox Nightly VTT
  • Tested with version 29.0a1 (12/14/2013)
  • Feature enabled by default
  • See the Mozilla Developer Documentation for more information
  • If the size of the video doesn’t match the size attributes of the video tag, the video will display on white/gray background

Polyfills and alternatives

I will use one of the many polyfils available for HTML5 Video Tracks. Playr seems to be the most feature complete polyfill for HTML5 video tracks. The downside is 2 more files (one CSS and one JavaScript) to download for the video page but until VTT is widely supported the extra files are worth the effort to create accessible content.

One way to ensure that we only load our polyfill if the browser doesn’t support tracks natively is to use Modernizr.load to conditionally load Playr’s CSS and JavaScript when the browser does not support HTML5 video tag natively.

    // test whether we support video
    test : Modernizr.video,
    // Load the corresponding assetts for the polyfill you want to use
    // in this case we are using the playr polyfill
    nope : ['playr.js', 'playr.css']

The code below uses plain JavaScript to test if a browser supports HTML5 video by creating an empty video element and testing for the video’s canPlayType property. It will not load the code for a polyfill like the Modernizr example.

var canPlay = false;
  var h, plink, pscript;

  // Create an empty video element
  var v = document.createElement('video');
  // If the video can playType and can play MP4 video
  if(v.canPlayType) {
    // Set canPlay to true
    canPlay = true;
    // Display an alert telling them so
    alert('Can Play HTML5 video')
  else {
    // Append Playr CSS and JS to the head of the page to
    // provide a fallback
    h = document.getElementsByTagName('head')[0];
    plink = document.createElement('link');
    plink.setAttribute('href', 'css/playr.css');
    plink.setAttribute('media', 'screen');
    pscript = document.createElement('script');
    pscript.setAttribute('src', 'js/playr.js');

This is the simplest test for video support; a more elaborate version can include support for specific formats and write the <source> tags only for the supported formats. The example below makes the following assumptions:

  • You have encoded a video in all three formats (webm, mp4 and ogg)
  • You are testing for support for HTML5 video in general and specific formats
  • If HTML5 video is not supported you have a flash-based fallback solution
var canPlay = false;
  // Get the video by selecting the video tag
  var v = document.getElementsByTagName('video');
  // Optionally add video attribtues as needed
  // At a minimum set height, width and controls
  // as shown below
  v.setAttribute('height', '640');
  v.setAttribute('width', '480');
  v.setAttribute('control', 'control');

  // If the video can playType and can play MP4 video
  if (v.canPlayType && v.canPlayType('video/webm'; codecs="vp8, vorbis"').replace(/no/, '')){
    // append the appropriate source track
    var webm = v.appendChild(source);
    webm.setAttribute("source", "myvideo.webm");
    webm.setAttribute("type", "video/webm");
  else if (v.canPlayType && v.canPlayType('video/mp4; codecs="avc1.42E01E, mp4a.40.2"').replace(/no/, '')){
    // append the appropriate source track
    var mp4 = v.appendChild(source);
    mp4.setAttribute("source", "myvideo.mp4");
    mp4.setAttribute("type", "'video/mp4; codecs="avc1.42E01E, mp4a.40.2"'");

Also note that we’re testing for specific audio and video codec combinations. WebM supports a single combination of video and audio codecs but MP4 supports multiple profiles, not all of which are supported in HTML5 video. See http://mpeg.chiariglione.org/faq/what-are-different-profiles-supported-mpeg-4-video for an introduction to the different profiles supported by MPEG4.

Players and Polyfills

Playr is by no means the only polyfil or the only player that supports VTT. It is the one that I found the most feature complete for what I needed. The selection below represents a set of players and polyfills available.

Different types of VTT tracks and their structures

Captioning Tracks

Captioning is text that appears on a video, which contains dialogue and audio cues such as music or sound effects that occur off-screen. The purpose of captioning is to make video content accessible to those who are deaf or hard of hearing, and for other situations in which the audio cannot be heard due to noise or a need for silence.

Captions can be either open (always visible, aka “burned in”) or closed, but closed is more common because it lets each viewer decide whether they want the captions to be turned on or off.

From http://www.cpcweb.com/faq/

The simplest and most often used type of text track, captions provide alternative text content for people with visual dissabilities, for people who choose to play the video without audio, and others.

Depending on the player you may have open captions, where the captions are always visible on screen, or closed captions where you have to manually activate the display of captions; Either open or closed, the captions are independent of the content they are attached to.


railroad (2)
00:00:10.000 --> 00:00:12.500 (3) [Optional Settings] (4)
Left uninspired by the crust of railroad earth (5)

00:00:13.200 --> 00:00:16.900
that touched the lead to the pages of your manuscript.

Explanation of the cue above:

  1. WEBVTT must be the first item on the file, on the first line and in a line of its own. Optionally there may be lines of metadata. This section must be followed by a blank line
  2. The name of the cue. This is also optional
  3. Immediately below the name of the cue come the beginning and end time for the cue expressed in hours:minutes:seconds:milliseconds format. Hours, Minutes and Seconds must have 2 digits and be padded with zeros if necessary. Miliseconds must have 3 digits and be zero padded if not long enough
  4. Optional Cue Settings separated from the time one or more SPACE or TAB characters
  5. The text for the cue

Subtitles Tracks

Subtitle Tracks are similar to Caption Tracks but are not meant to address accessibility issues as Captions are. Subtitle tracks are used primarily to convey the dialogue in a language other than the one being spoken in the video. Take, for example a Japanese movie where the subtitles translate the content to English.

Subtitles are not expected to convey additional non-verbal cues. Once again, subtitles are only meant to provide a translation of the words being spoken although some delivery formats such as Blue Ray do not follow this recommendation.

What’s the difference between captions and subtitles?

The main difference is that subtitles usually only transcribe the spoken dialog, and are mainly aimed at people who are not hearing impaired, but lack fluency in the spoken language. Closed captions are aimed at the deaf and hearing impaired, who need additional non-verbal audio cues (such as “[GUN SHOT]” or “[SPOOKY MUSIC]”) to be transcribed in the text. Closed captions are also useful for situations in which video is being shown but the sound is muted or difficult to hear, such as for a noisy bar, convention floor, video signage & billboards, etc.

From http://www.cpcweb.com/faq/

Other than the content for each type of track, HTML5 video structures the track element the same way. In the example below, the only difference are the kind attributes for each track.

<!-- This is the captions track -->
<track kind="captions" lang="en" srclang="en" label="English" src="sintel.vtt" />
<!-- This is the subtitles track for Spanish -->
<track kind="subtitles" lang="es" srclang="es" label="Español" src="sintel-es.vtt" />

Chapter Tracks

Chapter tracks help you navigate through the video by associating certain “chapters” with time codes. This will let you navigate to different sections of your video using some sort of visual cue.

In the example below, chapter 1 is 10 second long and titled Introduction to HTML5.


Chapter 1 (2)
00:00:01.000 --> 00:00:10.000 (3)
Introduction to HTML5(4)

Chapter 2
00:00:10.001 --> 00:00:15.000
Introduction to HTML5

Explanation of the chapter above:

  1. WEBVTT must be the first item on the file, on the first line and in a line of its own. It must be followed by a blank line
  2. The name of the chapter
  3. Immediately below the name of the chapter come the beginning and end time expressed in hours:minutes:seconds:milliseconds format. Hours, Minutes and Seconds must have 2 digits and be padded with zeros if necessary. Miliseconds must have 3 digits and be zero padded if not long enough
  4. The title of the chapter

Description Tracks

Description tracks are used primarily as an assistive technology helper, these tracks will be read by assistive technology devices for people with visual disabilities (blind or low vission). The cues can be arbitrarily long as long as they don’t contain empty lines (they would signal the beginning of a new cue)

VTT - Description for Sintel trailer

Sintel's Search -- beginning of the search
00:00:01.000 --> 00:00:52.000
Woman walks up a mountain
Fights an unknown man
Smoking man (covering full frame) speaks
Little dragon flies towards the woman before a larger dragon snatches it and flies away. The woman screams trying to grab the smaller flying creature

Metadata Tracks

Metadata Tracks are used to convey any additional information (such as base64 encoded images, JSON, additional text or any additional text-based file format) the developer needs to include in the page based on time indexes. A web app can listen for cue events, extract the text of each cue as it fires, parse the data and then use the results to make DOM changes (or perform other JavaScript or CSS tasks) synchronised with media playback.

WEBVTT - Example metadata track containing JSON payload

00:01:15.200 --> 00:02:18.800
"title": "Multi-celled organisms",
"description": "Multi-celled organisms have different types of cells that perform specialised functions.
  Most life that can be seen with the naked eye is multi-cellular. These organisms are though to have evolved around 1 billion years ago with plants, animals and fungi having independent evolutionary paths.",
"src": "multiCell.jpg",
"href": "http://en.wikipedia.org/wiki/Multicellular"

00:02:18.800 --> 00:03:01.600
"title": "Insects",
"description": "Insects are the most diverse group of animals on the planet with estimates for the total
  number of current species range from two million to 50 million. The first insects appeared around
  400 million years ago, identifiable by a hard exoskeleton, three-part body, six legs, compound eyes
  and antennae.",
"src": "insects.jpg",
"href": "http://en.wikipedia.org/wiki/Insects"

We can then use Javascript to parse the track content and do something with the track’s content.

textTrack.oncuechange = function (){
  // "this" is a textTrack
  var cue = this.activeCues[0]; // assuming there is only one active cue
  var obj = JSON.parse(cue.text);
  // do something

Building the tracks

We can build our caption file using the text above as an example, and this is the most common way to caption a video for accessibility.

We can also build multiple caption tracks as well as a variety of other tracks. Most polyfills will support a subset of the full VTT specification, Playr, the polyfill I’ve selected for these examples, supports captions, descriptions and chapter tracks.

Getting the captions to work

Building the tracks

(built with information from http://demosthenes.info/blog/584/Creating-And-Validating-WebVTT-Subtitles)

There are no programs that support VTT as a native captioning format. However there are plenty of programs that will create SRT captions, which is very similar to VTT (we’ll discuss the differences later in this section).

Choose whatever tool will work best for you to generate the SRT file; then follow the instructions below to convert them to VTT files.

Converting SRT to VTT

Due to their close relationship, conversion from .srt into .vtt is very simple. A typical .srt file will look something like this:

00:01:21,700 --> 00:01:24,675
Life on the road is something
I was raised to embrace.

The process is little more than a find-and-replace:

  • Add WEBVTT to the first line of the file
  • Convert the comma before the millisecond mark in every timestamp to a decimal point
  • Add styling markup to the subtitle text if needed
    • Special characters must be escaped as in HTML (&amp;, &gt;, &lt;)
    • You can use CSS classes defined in your CSS file by using &gt;c.XXX&lt;
    • See the section Cue Payload Tags for more information about the specific tags you can use to style your content

The resulting VTT file will look like this:


01:21.700 --> 01:24.675
Life on the road is something
I was <i>raised</i> to embrace.

Save the file with a .vtt extension and link to it from a <track> element in your video.

Validating A VTT File

It is not hard to make mistakes when creating a VTT track fille. Fortunately there is an online validator to help with authoring.

VTT Validator

It is essentially a two step process:

  • Paste the text of your VTT file
  • Select the type of track you’re working on

The results will display automatically.

Optional Cue Settings

Cues can also be styled and moved around the screen relative to the borders of the video. The table below summarizes the settings avalable for cues.

Vertical Alignment

Name: vertical

Values: rl (right to left) – lr (left to right)

What is used for: Vertical text alignment for languages like Japanese that can be read from top to bottom

Example: vertical:lr (makes the cue display vertically from left to right)

Line Placement / Top Alignment

Name: line

Value [-][0 or larger] (negative or possitive number) or [0-100]%

What is used for: Absolute references to a particular line number the cue is to be displayed on.
What is used for: Percentage value indicating the position relative to the top of the frame (when using percentages)

  • Line numbers are based on the size of the first line of the cue.
  • A negative number counts from the bottom of the frame

  • Positive numbers from the top

Cue Box Size

Name: size

Value: [0-100]%

What it’s used for: Indicates the size of the cue box. The value is given as a percentage of the width of the frame

Text Align

Name: align

Values: start | middle | end

What it’s used for: Specifies the alignment of the text within the cue. The keywords are relative to the text direction and are the same alignment keywords used in SVG

The alignment values are similar to those used in SVG. For users of CSS that uses a different terminology, the equivalency is:

  • Start alignment: The cue box’s left side (for horizontal cues) or top side (otherwise) is aligned at the text position.
  • Middle alignment: The cue box is centered at the text position.
  • End alignment: The cue box’s right side (for horizontal cues) or bottom side (otherwise) is aligned at the text position.

Note: if no cue settings are set, the positioning default to the middle, at the bottom of the frame.

Cue positioning

Name: position

Value [0-100]%

What is used for:
Percentage value indicating the horizontal alignment relative to the edge of the frame where the text begins (e.g. the left edge in English)
The value is dependent on the alignment of the cue:
* For left aligned or start aligned cues: 0%.
* For middle aligned cues: 50%.
* For right aligned or end aligned cues: 100%.

Note: Since the default value of the text track cue text alignment is middle, if there is no text track cue text alignment setting for a cue, the text track cue text position defaults to 50%.

Note: Even for horizontal cues with right-to-left paragraph direction text, the cue box is positioned from the left edge of the video frame. This allows defining a rendering space template which can be filled with either left-to-right or right-to-left paragraph direction text. If you define such a cue box template with start or end aligned text, make sure to control its size unless you want text to flip from one side of the video frame to the other.

Cue Payload Tags

These are additional tracks that will allow you to customize the appearance of your tracks. ** You cannot use payload tags with chapter tracks**

Timestamp Tags (Karaoke Style and Paint On Caption Text)

Using timestamp tags can build Karaoke Style tracks. You build the track by inserting the correct time stamp where you want the text to change, subject to the following restrictions:

  • The timestamp must be greater that the cue’s start timestamp, greater than any previous timestamp in the cue payload, and less than the cue’s end timestamp.
VTT - Example Karaoke Style Track

00:16.500 --> 00:18.500
When the moon <00:17.500>hits your eye

00:00:18.500 --> 00:00:20.500
Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie

00:00:20.500 --> 00:00:21.500
That's <00:00:21.000>amore

In the example above:

  • The active text is the text between the timestamp and the next timestamp or to the end of the payload if there is not another timestamp in the payload.
  • Any text before the active text in the payload is previous text .
  • Any text beyond the active text is future text. We can use the previous and future tracks to create the Kraoke experience.

A possible CSS rule to style the content looks like this.

::cue:past {

::cue:future {
  text-shadow: black 0 0 1px;

Timestamp tags can also be used for Paint On captions, which placed independently from each other and don’t erase what was already on the screen. They are written one letter at a time and they appear to ‘paint on’ the screen.

Speaker Semantics

You can use a combination of cue positioning and specific markup on individual cues to further emphazise who is speaking in a given caption or subtitle where appropriate.

WEBVTT - Sintel Caption File With Speaker Semantics

00:00:12.000 --> 00:00:15.000 A:middle T:10%
<v.gatekeeper>What brings you to the land
of the gatekeepers?

00:00:18.500 --> 00:00:20.500 A:middle T:80%
<v.sintel>I'm searching for someone.

We can style the speaker semantic classes using CSS. For example we can add a different color for each speaker, something like the example below:

video::cue(v.gatekeeper) {

video::cue(v.sintel) {
  color: #ff00ff;
Addtional Style tags

The following tags require opening and closing tags.

Class tag

Style the contained text using a CSS class.

Cue 14 - Class tag example

Italics tag

Italicize the contained text.

Example 15 - Italics tag

Bold tag

Bold the contained text.

Example 16 - Bold tag

Underline tag

Underline the contained text.

Example 17 - Underline tag

Ruby tag / Ruby text tag

Used together to display ruby characters (i.e. small annotative characters above other characters). Ruby annotations are primarily used in languages with logographic alphabets (Japanese, Chinese, Korean) where a single character may represent a complete word and where the meaning of the character may not be familiar to the reader.

Ruby characters are small, annotative glosses that can be placed above or to the right of a Chinese character when writing languages with logographic characters such as Chinese or Japanese to show the pronunciation. Typically called just ruby or rubi, such annotations are used as pronunciation guides for characters that are likely to be unfamiliar to the reader.

From Wikipedia

Example 18 - Ruby tag and Ruby text tag
<ruby>WWW<rt>World Wide Web</rt>oui<rt>yes</rt></ruby>

Adding the tracks to the video

Either in a supported browser or using one of the polyfills available (like we’ve chosen to do with Playr) we add <track> elements, one for each language of captions that we make available.

There is one non-standard attribute we will add to the video to make it work with Playr. The code below shows what a video track looks with associated an associated caption track for English.

<!DOCTYPE html>
  <title>Sample Captioned Video</title>
  <script src="playr.js"></script>
  <link rel="stylesheet" href="playr.js"></head>
  width="640" height="480"
  These are the three sources. This should cover most of our
  deployed player base
<source src="//media.w3.org/2010/05/sintel/trailer.mp4" type="video/mp4" />
<source src="//media.w3.org/2010/05/sintel/trailer.webm" type="video/webm" />
<source src="//media.w3.org/2010/05/sintel/trailer.ogv" type="video/ogg" />
  This is the captions track
<track kind="captions" lang="en" srclang="en" label="English" src="sintel.vtt" />

The working example is located at http://labs.rivendellweb.net/vtt-demo/basic.html and an example without a polyfill (meant to test native browser support) is located at http://labs.rivendellweb.net/vtt-demo/basic-plain.html

The same example without polyfill support and supporting captions in English and Spanish with the English caption being the default. The default attribute will also display the captions automatically

<!DOCTYPE html>
  <title>Sample Captioned Video</title>
  width="640" height="480"
  These are the three sources. This should cover most of our
  deployed player base
<source src="//media.w3.org/2010/05/sintel/trailer.mp4" type="video/mp4" />
<source src="//media.w3.org/2010/05/sintel/trailer.webm" type="video/webm" />
<source src="//media.w3.org/2010/05/sintel/trailer.ogv" type="video/ogg" />
  This is the captions track
<track kind="captions" lang="en" srclang="en" label="English" src="sintel-en.vtt" default />
<track kind="captions" lang="es" srclang="es" label="Spanish" src="sintel-es.vtt" />

The final example contains multiple caption tracks, subtitles in Spanish and descriptions for the video.

<!DOCTYPE html>
  <title>Sample Captioned Video</title>
  width="640" height="480"
  These are the three sources. This should cover most of our
  deployed player base
<source src="//media.w3.org/2010/05/sintel/trailer.mp4" type="video/mp4" />
<source src="//media.w3.org/2010/05/sintel/trailer.webm" type="video/webm" />
<source src="//media.w3.org/2010/05/sintel/trailer.ogv" type="video/ogg" />
  These are the captions track
<track kind="captions" lang="en" srclang="en" label="English" src="sintel-en.vtt" default />
<track kind="captions" lang="es" srclang="es" label="Spanish" src="sintel-es.vtt" />
<track kind="captions" lang="de" srclang="de" label="Spanish" src="sintel-de.vtt" /></video>
  This is the subtitles track
<track kind="subtitles" lang="es" srclang="es" label="Subtitulos en Español" src="sintel-es-subtitles.vtt" />
  These are the description tracks
<track kind="captions" lang="en" srclang="en" label="English" src="sintel-en.vtt" default />

Text tracks and audio

Text tracks are not limited to working with just video. They work just the same with audio. The example below (taken from http://mattcrouch.net/experiments/music-sync/) provides synchronized captions to an audio track.

Using jQuery, an extract of the audio for the Sintel video and the same captions that we used for the video examples, we change the cues programatically using the video API to display the cues at the matching time.

As you can see, description tracks would be particularly useful in this case as they would provide a more complete context to the audio.

jQuery(document).ready(function() {
    // Step below is optional. I don't like taking
    // the option from the user and autoplay the video
    var audio = document.querySelector("audio");
    // log the name of the track we're working with

    audio.textTracks[0].oncuechange = function (){
      $("#output").html(""); // Clear the content of our output region
        if(this.activeCues !== null) {
          for(var i=0;i");

Additional Tutorials And Tools