Globalizing web content: Globalize.js

Globalize is the heavy gun of the i18n world. It automates most of the i18n work and integrates ICU message formatting and CLDR data into one application.

npm install --save globalize 

or build it from the GitHub development branch:

git clone https://github.com/globalizejs/globalize.git
bower install && npm install
grunt

Globalize makes heavy use of the CLDR data set; in the examples below, installing through npm takes care of pulling in the CLDR data as well. To illustrate how this works we’ll look at two examples from the Globalize repository: one that runs under Node and writes text output to the console, and one that uses npm and webpack to generate bundles for each of our target languages.

NPM

The first example we’ll look at is a Node-based app that outputs all of its results to the console. The package.json file, as with any Node-based project, tells npm what packages and versions to install.

{
  "name": "globalize-hello-world-node-npm",
  "private": true,
  "dependencies": {
    "cldr-data": "latest",
    "globalize": "^1.3.0",
    "iana-tz-data": ">=2017.0.0"
  },
  "cldr-data-urls-filter": "(core|dates|numbers|units)"
}

main.js has all the code that will load and run Globalize tools. I’ve commented the code to illustrate what it does.

const Globalize = require( "globalize" );

let like;

// Before we can use Globalize, we need to feed it
// the appropriate i18n content (Unicode CLDR).
Globalize.load(
    require( "cldr-data/main/en/ca-gregorian" ),
    require( "cldr-data/main/en/currencies" ),
    require( "cldr-data/main/en/dateFields" ),
    require( "cldr-data/main/en/numbers" ),
    require( "cldr-data/main/en/timeZoneNames" ),
    require( "cldr-data/main/en/units" ),
    require( "cldr-data/supplemental/currencyData" ),
    require( "cldr-data/supplemental/likelySubtags" ),
    require( "cldr-data/supplemental/metaZones" ),
    require( "cldr-data/supplemental/plurals" ),
    require( "cldr-data/supplemental/timeData" ),
    require( "cldr-data/supplemental/weekData" )
);
// Load messages for our default language
Globalize.loadMessages( require( "./messages/en" ) );
// Load time zone data
Globalize.loadTimeZone( require( "iana-tz-data" ) );

// Set "en" as our default locale.
Globalize.locale( "en" );

// Use Globalize to format dates.
console.log( Globalize.formatDate( new Date(), { datetime: "medium" } ) );

// Use Globalize to format dates in specific time zones.
console.log( Globalize.formatDate( new Date(), {
    datetime: "full",
    timeZone: "America/Sao_Paulo"
}));

// Use Globalize to format dates to parts.
console.log( Globalize.formatDateToParts( new Date(), { datetime: "medium" } ) );

// Use Globalize to format numbers.
console.log( Globalize.formatNumber( 12345.6789 ) );

// Use Globalize to format currencies.
console.log( Globalize.formatCurrency( 69900, "USD" ) );

// Use Globalize to get the plural form of a numeric value.
console.log( Globalize.plural( 12345.6789 ) );

// Use Globalize to format a message with plural inflection.
like = Globalize.messageFormatter( "like" );
console.log( like( 0 ) ); // Be the first to like this
console.log( like( 1 ) ); // You liked this
console.log( like( 2 ) ); // You and someone else liked this
console.log( like( 3 ) ); // You and 2 others liked this

// Use Globalize to format relative time.
console.log( Globalize.formatRelativeTime( -35, "second" ) );

// Use Globalize to format unit.
console.log( Globalize.formatUnit( 60, "mile/hour", { form: "short" } ) );

Run the program from a terminal with node main.js.

Globalizing Web Content: Basic Strategies

Quick and Dirty: JS Template String Literals

Andrea Giammarchi wrote an article, “Easy i18n in 10 lines of JavaScript (PoC)”, that provides an idea of how to do translation using template literals. The code has been further developed in a GitHub repo.

See the article in my blog and Andrea’s post for more information.

Messages: Gender- and plural-capable messages

The next step is to use MessageFormat. The library separates your code from your text formatting while enabling much more human-sounding messages. It will eliminate strings like the following from your UI:

  • There are 1 results.
  • There are 1 result(s).
  • Number of results: 5.

The installation process is the same as for any other npm package:

npm install --save messageformat

Once it’s installed, we require it like any other module in a Node application.

const MessageFormat = require('messageformat');

We then build the message we want to display to our users. In this case, the message has three rules:

  • A gender (GENDER) select rule with values for male, female and other
  • A results (RES) plural rule with values for no results, exactly 1 result and more than one result
  • A category (CAT) selectordinal rule with values for the one, two, few and other ordinal categories

const msg =
  '{GENDER, select, male{He} female{She} other{They} }' +
  ' found ' +
  '{RES, plural, =0{no results} one{1 result} other{# results} }' +
  ' in the ' +
  '{CAT, selectordinal, one{#st} two{#nd} few{#rd} other{#th} }' +
  ' category.';

The last step is to compile the message and use the resulting function as needed. Compilation makes it possible to use any combination of the values defined in our message variables.

Using the compiled message, we call mfunc with values for the placeholders we defined in the message. The examples below show some of the different combinations we can generate.

// Compiles the messages and formats.
const mfunc = new MessageFormat('en').compile(msg);

mfunc({ GENDER: 'male', RES: 1, CAT: 2 })
// 'He found 1 result in the 2nd category.'

mfunc({ GENDER: 'female', RES: 1, CAT: 2 })
// 'She found 1 result in the 2nd category.'

mfunc({ GENDER: 'male', RES: 2, CAT: 1 })
// 'He found 2 results in the 1st category.'

mfunc({ RES: 2, CAT: 2 })
// 'They found 2 results in the 2nd category.'
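
Because the message is just a string, translating it is a matter of compiling a localized version of the same pattern for each locale. The sketch below is illustrative (the Spanish wording is my own, not part of the original example) and only uses plural categories that are valid for Spanish:

const msgEs =
  '{GENDER, select, male{Él encontró} female{Ella encontró} other{Encontraron} }' +
  ' {RES, plural, =0{ningún resultado} one{1 resultado} other{# resultados} }' +
  ' en la categoría {CAT}.';

// Compile the Spanish version of the message with the matching locale.
const mfuncEs = new MessageFormat('es').compile(msgEs);

mfuncEs({ GENDER: 'female', RES: 2, CAT: 2 })
// 'Ella encontró 2 resultados en la categoría 2.'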

For more information on how to use the formatting capabilities of MessageFormat, check the Format Guide, particularly the sections that explain which values to use in which situations.

Globalizing Web Content

I’ve always been interested in internationalization (i18n) and localization (l10n) and how they relate to the web. My interest was piqued again when I started wondering how much extra work it would be to keep a site or app in multiple languages, and how much time and how many additional resources I’d need to get it done.

This goes beyond translating the content; tools like Google Translate make that part easier than it used to be. It also covers the changes and modifications we need to make to our code to accommodate the different languages and cultures we want to deploy our application in.

Difference between l10n and i18n

Before we can jump in and talk about localizing web content and the challenges involved, we need to understand the difference between localization (l10n, for the 10 letters between l and n in the English word localization) and internationalization (i18n, for the 18 letters between i and n in the English word internationalization).

Localization

Localization refers to the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a locale).

Often thought of as a synonym for translating the user interface and documentation, localization is usually a much more complex issue. It can entail customization related to:

  • Numeric, date and time formats
  • Use of currency
  • Keyboard usage
  • Collation and sorting of content
  • Symbols, icons, and colors
  • Text and graphics containing references to objects, actions or ideas which, in a given culture, may be subject to misinterpretation or viewed as insensitive
  • Varying legal requirements

and potentially other aspects of our applications.

Localization may even require a comprehensive rethinking of logic, visual design, or presentation if the way of doing business (e.g., accounting) or the accepted paradigm for learning (e.g., focus on individual vs. group) in a given locale differs substantially from the originating culture.

Internationalization

Definitions of internationalization vary. This is a high-level working definition for use with W3C Internationalization Activity material. Some people use other terms, such as globalization to refer to the same concept.

Internationalization is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.

Internationalization typically entails:

  1. Designing and developing in a way that removes barriers to localization or international deployment. This includes activities like enabling the use of Unicode, ensuring the proper handling of legacy character encodings (where appropriate), taking care over the concatenation of strings, decoupling back-end code from UI text, etc.
  2. Providing support for features that may not be used until localization occurs. For example, adding markup or CSS code to support bidirectional text, or for identifying the language, or adding CSS support for vertical text or other non-Latin typographic features.
  3. Enabling code to support local, regional, language, or culturally related preferences. Typically this involves incorporating predefined localization data and features derived from existing libraries or user preferences. Examples include date and time formats, local calendars, number formats and numeral systems, sorting and presentation of lists, handling of personal names and forms of address, etc.
  4. Separating localizable elements from source code or content, such that localized alternatives can be loaded or selected based on the user’s international preferences as needed (see the sketch below).

Notice that these items do not necessarily include the localization of the content, application, or product into another language; they are design and development practices which allow such a migration to take place easily in the future but which may have significant utility even if no localization ever takes place.
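
As a minimal illustration of point 4, the localizable strings can live in per-locale resource objects (or files) and be looked up by key at runtime. The layout and helper below are hypothetical, not part of any particular library:

// Hypothetical per-locale resources; in a real project these would
// typically live in separate files such as messages/en.js and messages/es.js.
const messages = {
  en: { greeting: 'Welcome back', signOut: 'Sign out' },
  es: { greeting: 'Bienvenido de nuevo', signOut: 'Cerrar sesión' }
};

// The UI code only ever asks for a key; switching locales never touches it.
function t(locale, key) {
  return (messages[locale] || messages.en)[key];
}

console.log(t('es', 'greeting')); // 'Bienvenido de nuevo'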

What type of internationalization can we automate?

For the most part, we automate UI internationalization and localize only those aspects of the user interface that change when we change languages. Note that this works in applications and sites that generate their HTML from JavaScript.

Depending on the tools we use we may be further limited in what we can and cannot localize programmatically, particularly content other than UI.

JS Template Literals

If you’ve worked with JavaScript for a while, you’ve probably hit the nightmare of string concatenation: how error prone the process is and how hard it is to troubleshoot if you’re not careful.

var sentence = 'This is a very long sentence that we want to '
 + 'concatenate together.'

In ES6/ES2015 there is a new way to create interpolated strings: Template String Literals.

In this post we’ll discuss three areas:

  • How to build template string literals and multi-line string literals
  • Variable interpolation
  • Possible ways to use template string literals for localization

None of these things are new. You’ve always been able to do them in JavaScript; with template literals it’s now easier and more efficient to do so.

Building Template Literals

To build a Template Literal, use the backtick (`) character to open and close the string. In essence it’s no different from concatenating strings, but it lets you create string literals without worrying about whether you used the correct type of quotation mark (' or ") or where to put the + signs when creating multi-line strings.

At its simplest, an ES6 Template String Literal is written like this:

let demoString = `Hello World.`
console.log(demoString);

We can also create longer Template Literal Strings using the same system. Note how we’ve been able to include angle brackets for opening and closing tags, as well as single quotation marks, in the longer example without any escaping.

let longerString = `The second, greyed out example shown here shows the faux subscripts the browser creates by scaling whatever you put in the <code></sub></code> tag; by using the sinf class from Utility OpenType instead, you’ll use the optically correct glyphs included in your web font. In browsers that don’t support the OpenType features, the browser default will still be used.`
console.log(longerString);

Variable interpolation

Where Template String Literals really show their strength is when we interpolate variables into the string. Going back to our first example, we’ll add a variable to hold the name of the person we’re greeting:

var userName2 = "carlos"
var greeting = `Hello, ${userName2}!`
console.log(greeting);
// Returns Hello, carlos!

In this example, the interpolation is the ${userName2} placeholder, which takes the value of the corresponding variable and puts it in place.

We can also work with objects as the source of interpolation data, something like the example below where we interpolate the values in the userData object:

var userData = {
  "name": "Carlos",
  "home": "Chile",
};

var greeting2 = `Hello ${userData.name}!.

The weather in ${userData.home} is...`;
console.log(greeting2);

Using that last bit of code we can explore an interesting idea: using template string literals when doing translation.

var userData = {
  "en": {
    "chapter": "Chapter",
    "section": "Section",
  },
  "es": {
    "chapter": "Capítulo",
    "section": "Sección",
  },
};

var chapterHeading = `${userData.en.chapter} 1, ${userData.en.section} 1.`;
console.log(chapterHeading);
// Produces: Chapter 1, Section 1.

var chapterHeadingEs = `${userData.es.chapter} 1, ${userData.es.section} 1.`;
console.log(chapterHeadingEs);
// Produces: Capítulo 1, Sección 1.

Using the code above we can also insert the headings into our HTML. The interpolation has to happen in JavaScript, so we build the markup as a template literal and add it to the document, looking something like this:

document.body.innerHTML += `
  <h1>${chapterHeading} (English)</h1>
  <h1>${chapterHeadingEs} (Spanish)</h1>
`;

The challenge is to dynamically set the current language and use the corresponding entries from the language database. I did a naive pass at this before finding an external solution that works better.
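
A naive pass might look something like the sketch below: pick the locale from the browser’s navigator.language, fall back to English when we don’t have a translation, and index into the data object with the result. The variable names are my own and the snippet assumes the userData object from the previous example:

// Hypothetical sketch for picking the UI language at runtime.
const supported = ['en', 'es'];
const requested = (navigator.language || 'en').slice(0, 2);
const lang = supported.includes(requested) ? requested : 'en';

const strings = userData[lang];
const chapterTitle = `${strings.chapter} 1, ${strings.section} 1.`;
document.querySelector('h1').textContent = chapterTitle;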

A complete example

While it’s tempting to try and reinvent the wheel (and fail miserably, like I did), it’s good to look around and see what’s already out there.

Andrea Giammarchi’s Easy i18n in 10 lines of JavaScript (PoC) provides a more robust idea of how to do translation using template literals. The code has been further developed in a GitHub repo. I will stay with the original idea from the post and leave it to you to decide whether you want to use the library.

The first part of this process is to define how we’ll handle the i18n templates. The function below queries the database and, based on the locale key, returns the string for the matching language.

It will also set up the default language (en) and an empty internationalization database (i18n.db).

function i18n(template) {
  for (var
    info = i18n.db[i18n.locale][template.join('\x01')],
    out = [info.t[0]],
    i = 1, length = info.t.length; i < length; i++
  ) out[i] = arguments[1 + info.v[i - 1]] + info.t[i];
  return out.join('');
}
i18n.locale = 'en';
i18n.db = {};

The next function populates the translation database. It registers a template for a given locale and returns a config object whose .for() method lets us register translations of the same template for other locales.

i18n.set = locale => (tCurrent, ...rCurrent) => {
  const key = tCurrent.join('\x01');
  let db = i18n.db[locale] || (i18n.db[locale] = {});
  db[key] = {
    t: tCurrent.slice(),
    v: rCurrent.map((value, i) => i)
  };
  const config = {
    for: other => (tOther, ...rOther) => {
      db = i18n.db[other] || (i18n.db[other] = {});
      db[key] = {
        t: tOther.slice(),
        v: rOther.map((value, i) => rCurrent.indexOf(value))
      };
      return config;
    }
  };
  return config;
};

Andrea provides multiple ways to populate the database. For this example, I will populate it using the set method. The example below sets a group of entries using English as the default language and then uses .for to identify additional languages and their translations.

i18n.set('en') `Hello ${'name'}`
  .for('de') `Hallo ${'name'}`
  .for('it') `Ciao ${'name'}`
  .for('sp') `Hola ${'name'}`;

To create a database containing translation information for our books, we could do something like this:

i18n.set('en') `Chapter ${'number'}`
  .for('es') `Capítulo ${'number'}`
  .for('de') `Kapitel ${'number'}`
  .for('fr') `Chapitre ${'number'}`;

We can then use the default language and type the data in English.

// default
i18n`Chapter ${'73'}`;
// "Chapter 73"

We also have the option of switching languages at runtime, continuing to write our text in English and seeing it translated using the database content.

// we switch to German but still write in English
i18n.locale = 'de';
i18n`Chapter ${'73'}`;
// "Kapitel 73"

i18n.locale = 'es';
i18n`Chapter ${'73'}`;
// "Capítulo 73"

This code presents a basic engine that will cover most of our needs if we’re willing to do the data entry ourselves or use the libraries and utilities Andrea presents on GitHub.

This project is not meant to replace libraries like Globalize, ICU, or the Unicode CLDR.

More on Font Subsetting

Idea from Bram Stein’s Webfont Handbook

I’ve discussed font subsetting in the context of ebooks before. This post will review how to load multiple fonts and how to subset them.

Loading multiple font-faces

This is what my normal font-loading CSS looks like. I use five @font-face rules to load the fonts: one for the monospaced font used in code block examples and four for the content font, covering regular, bold, italic and bold italic.

/* Monospaced font for code samples */
@font-face {
  font-family: "notomono";
  src:  url("../fonts/notomono-regular.woff2") format("woff2"),
        url("../fonts/notomono-regular.woff") format("woff"),
        url("../fonts/notomono-regular.ttf") format("truetype");
  font-weight: normal;
  font-style: normal;
}
/* Regular font */
@font-face {
  font-family: "notosans";
  src:  url("../fonts/notosans-regular.woff2") format("woff2"),
        url("../fonts/notosans-regular.woff") format("woff"),
        url("../fonts/notosans-regular.ttf") format("truetype");
  font-weight: normal;
  font-style: normal;
}
/* Bold font */
@font-face {
  font-family: "notosans";
  src:  url("../fonts/notosans-bold.woff2") format("woff2"),
        url("../fonts/notosans-bold.woff") format("woff"),
        url("../fonts/notosans-bold.ttf") format("truetype");
  font-weight: 700;
  font-style: normal;
}
/* Italic Font */
@font-face {
  font-family: "notosans";
  src:  url("../fonts/notosans-italic.woff2") format("woff2"),
        url("../fonts/notosans-italic.woff") format("woff"),
        url("../fonts/notosans-italic.ttf") format("truetype");
  font-weight: normal;
  font-style: italic;
}
/* bold-italic font */
@font-face {
  font-family: "notosans";
  src:  url("../fonts/notosans-bolditalic.woff2") format("woff2"),
        url("../fonts/notosans-bolditalic.woff") format("woff"),
        url("../fonts/notosans-bolditalic.ttf") format("truetype");
  font-weight: 700;
  font-style: italic;
}

I use Font Face Observer to load the fonts asynchronously. A newer API that I hadn’t heard of before is the CSS Font Loading API, currently an Editor’s Draft, which performs a similar task natively.
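
As a rough sketch of what both approaches look like (the fonts-loaded class name is my own; the family names match the @font-face rules above):

// Font Face Observer: assumes fontfaceobserver.js has already been loaded.
const sans = new FontFaceObserver('notosans');
const mono = new FontFaceObserver('notomono');

Promise.all([sans.load(), mono.load()])
  .then(() => document.documentElement.classList.add('fonts-loaded'))
  .catch(() => console.log('Fonts took too long to load'));

// Roughly the same idea with the native CSS Font Loading API.
document.fonts.load('1em notosans').then(() => {
  document.documentElement.classList.add('fonts-loaded');
});

Either way, the stylesheet can key off the fonts-loaded class to swap from fallback fonts to the web fonts once they’re ready.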

Creating a subset

Using the full fonts can be costly in terms of file size. Even at an average of 40KB per file, the five files above add up to roughly 200KB of fonts containing glyphs that may never be used.

The simplest way to create a font subset remains Font Squirrel’s Web Font Generator, which has a section for subsetting, shown below.

Font Squirrel’s Web Font Generator Subsetting Section

If we know the Unicode ranges that we want to subset, we can use fonttools, a Python library for font manipulation.

Install it with pip. If you’ve installed Python using Homebrew, your pip installer may be pip2 or pip3, depending on the Python version you have installed.

pip install fonttools

The tool we use to subset a font is pyftsubset, which takes the name of the font we want to subset, one or more Unicode code points (characters) that we want to include in the subset, and the format of the output file.

pyftsubset font.otf --unicodes="U+a,U+20,U+2e,U+44,U+45,U+4d,U+54,U+59,U+61,U+62,U+63,U+64,U+65,U+66,U+67,U+68,U+69,U+6b,U+6c,U+6d,U+6e,U+6f,U+70,U+72,U+73,U+74,U+75,U+76,U+77,U+78,U+79" --flavor=woff

In the next section, we’ll explore a way to generate the subset based on the characters used in the page.

How do you know what characters to use in a subset?

pyftsubset assumes that you know the Unicode code points of the characters you want to keep. This is not always the case.

So we’ll look at it a different way: instead of entering the Unicode characters to subset by hand, we’ll use a separate tool to query a page, see what characters the page actually uses, and feed that data to pyftsubset to create the subsets.

Glyphhanger is a Node utility that generates a combined list of every glyph used in a list of sample files or URLs. I prefer to use a global installation of glyphhanger since it isn’t tied to a single project and benefits from being in my path.

npm install -g glyphhanger

The first thing we do is get a list of all the Unicode code points used on a given site. We do this by querying the site and adding the --unicodes flag.

glyphhanger http://www.example.com/ --unicodes

We can then feed the result of this command to pyftsubset to create the subset font. Or we can subset directly from glyphhanger by adding the --subset parameter pointing to the fonts we want to subset.

glyphhanger https://www.example.com/ --unicodes  --subset=fonts/*.ttf
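
If you’d rather keep the two steps separate, copy the ranges glyphhanger prints and pass them to pyftsubset yourself. The font name and ranges below are only placeholders for whatever your page actually uses:

pyftsubset fonts/notosans-regular.ttf --unicodes="U+20-7E,U+A0-FF" --flavor=woff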

Can you use multiple subsets of the same font?

Now that we’ve figured out how to subset fonts, there is one more question: can we use more than one subset of the same font? In the example below, we use two subsets. The first covers English using the standard Latin range, and the second uses the Latin Extended ranges to incorporate accents and other marks used in Latin scripts other than English.

@font-face {
  font-family: Source Sans Pro;
  src: url(latin.woff) format("woff");
  unicode-range: U+0000-00FF, U+0131, U+0152-0153, U+02C6, U+02DA, U+02DC, U+2000-206F, U+2074, U+20AC, U+2212, U+2215, U+E0FF, U+EFFD, U+F000;
}

@font-face {
  font-family: Source Sans Pro;
  src: url(latin-extended.woff) format("woff");
  unicode-range: U+0100-024F, U+1E00-1EFF, U+20A0-20AB, U+20AD-20CF, U+2C60-2C7F, U+A720-A7FF;
}

The idea is that browsers that support the unicode-range descriptor will use only the first subset when displaying English content. Whenever we use glyphs from the Latin Extended range, the browser will also load the second subset and use the two together when laying out the text. You may think this is unnecessary; remember, though, that English borrows many words from other languages, many of which use accents and other marks that are not part of basic English.
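
To use the two subsets, the page’s CSS only references the shared family name; the browser decides which file to download based on the characters it actually needs to render. A minimal sketch:

body {
  font-family: "Source Sans Pro", sans-serif;
}

Because both @font-face rules declare the same font-family, a page with only English text downloads just latin.woff, while a name like Dvořák (ř is U+0159, inside the Latin Extended range) also pulls in latin-extended.woff.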