Subsetting Fonts

When creating ebooks we need to pay attention to file size. Modern fonts have many more characters than we use in normal books. Subsetting a font lets us decrease its size by reducing the number of glyphs and the number of languages it supports.

Although my primary reason for subsetting fonts is to reduce the overall book size, there are other reasons for doing so. Some foundries require subsetting and font obfuscation as part of their license for eBook embedding. This makes the font less appealing to would-be thieves: a subset doesn’t necessarily have all the characters they need for their project, and an obfuscated file isn’t stored in the clear, so it can’t be used directly.

We will not, for now, deal with Adobe’s Typekit fonts, as the goals of that service are different. It doesn’t provide the granularity we need to bring the fonts down to a more reasonable size. It does, however, provide basic subsetting for each web font kit you create with the service. Desktop fonts must go through the process below to be properly subset.

Before we start

Before doing any work with fonts you need to make sure that the font’s license allows you to use it in eBooks and digital publishing. I found this out the hard way when my favorite font turned out not to be available for use in eBooks under the license I had purchased, and the cost of the license for eBook and web publishing proved prohibitive.

For the purpose of this and other eBook research projects I will use open source fonts or fonts released under the SIL Open Font License, which is permissive and specifically allows embedding and subsetting.

Choosing the tools

I found three tools that did what I wanted: make font files smaller by subsetting them to only the characters and Unicode ranges I need for my book.

Font Squirrel

Font Squirrel’s webfont generator has been the go-to tool for web font work since web fonts were reintroduced along with the rest of the HTML5 specification. It does everything you need to get fonts displayed on your website: it generates the formats needed for web deployment (TTF, EOT, WOFF and SVG), provides the CSS needed to include the font in your page and, most important for the purpose of this article, gives you expert settings where you can subset the font as needed.

We’ll take a look at the different features of the generator, paying particular attention to subsetting.

Font Squirrel WebFont Generator Upload and Basic selections

Before we can start working with the font subsetting tools, we need to tell Font Squirrel which font we want to work with. The only way to do this is to upload the font.

Once the font is uploaded you need to indicate which level of customization you want. For subsetting we want the expert level. Basic and optimal hide the settings we need to subset the font.

If you’re working on the web you will want to generate TrueType, WOFF and EOT Compressed. SVG is only necessary if you’re supporting older iOS devices; newer versions of iOS support TrueType and/or WOFF.

I leave the following options at their default settings:

  • Truetype Hinting
  • Rendering
  • Fix Missing Glyphs
  • X-height Matching
  • Protection

Under Subsetting, I choose custom subsetting to get the setting shown below.

Font Squirrel Subsetting Options

You can choose to subset based on character types, languages, Unicode tables, single characters or Unicode ranges, either individually or combined.

I normally select the following Unicode tables:

  • Basic Latin
  • Punctuation
  • Currency Symbols
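
If the subset is headed for the web, the same three tables can also be declared in CSS so the browser knows which characters the file covers. This is only a sketch: the family name BookSerif and the file paths are placeholders, and the ranges are the printable part of the Basic Latin block plus the General Punctuation and Currency Symbols blocks, which I am assuming correspond to the Font Squirrel tables of the same names.

```css
/* Hypothetical family name and file paths; replace with your own subset files */
@font-face {
  font-family: "BookSerif";
  src: url("fonts/bookserif-subset.woff") format("woff"),
       url("fonts/bookserif-subset.ttf") format("truetype");
  /* Basic Latin (printable), General Punctuation, Currency Symbols */
  unicode-range: U+0020-007F, U+2000-206F, U+20A0-20CF;
}

body {
  /* Characters outside the declared ranges fall back to the next family */
  font-family: "BookSerif", serif;
}
```

Declaring the ranges does not subset anything by itself; it only documents what the subset file contains and lets the browser skip the font for characters it cannot provide.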

If I’m only using a few characters to create a title, I may subset the font using the single characters option. You can check what your subset will look like under Subset Preview.

These characters will be available after subsetting

We now move into the final settings before saving our font subset.

Unless you know what you’re doing, you can leave these settings as they are:

  • OpenType Features
  • OpenType Flattening
  • CSS
  • Advanced Options
  • Shortcuts
Additional option and permission

You must check the box to acknowledge that the fonts you’re uploading are legal to embed. Some foundries will not allow you to embed their fonts directly, preferring instead that you use their online font service (their version of TypeKit).

If you choose to lie and check anyways, Font Squirrel may still disallow you from using the font if the creator has so requested and regardless of you having a license for the desktop font.

As with many things in web development land: test the resulting fonts. Make sure that they have all the glyphs you will need, and redo the subset if they don’t or if the subset font is larger than your original.

FontFont Subsetter

FontFont Subsetter is an online service that supports subsetting fonts. I tried uploading Roboto, a TTF font from Google, and received the result shown below.

FontFont Subsetter result when uploading Roboto font

According to the FAQ, only certain flavors of TTF fonts are supported by the service. Specifically it states:

Make sure that you upload a FontFont with the file extension .ttf, .eot or .woff. Subsetter supports TrueType-flavoured Offc FontFonts and Web FontFonts in WOFF or EOT format.

Since I can’t be sure that the subsetter will work with the fonts I choose, I have to look for something else.

FontPress 3

FontPress has recently been open sourced and is available as a download on GitHub.

It has a drag and drop interface but after you add the fonts to subset it’s more complicated than it needs to be. The resulting font subsets were larger than the original files. I’m not 100% sure if it was the way I did it or if the software is not giving me the results I want.

TODO: Research the tool. It may do what I want but right now the subset is larger than the original font.

CSS Support and namespaces

There are two new @rules in CSS (well, they may not be new but they are new to me) that open an awesome set of possibilities for CSS development with or without a pre-processor.

@namespace

CSS namespaces are the CSS implementation of XML namespaces, the technology that allows elements from different XML vocabularies to live in the same document.

In the case of CSS, namespaces allow us to style elements with the same name from different vocabularies differently. For example, let’s look at the a element in both the XHTML and SVG vocabularies:

@namespace url(http://www.w3.org/1999/xhtml);
@namespace svg url(http://www.w3.org/2000/svg);

/* This matches all XHTML <a> elements, as XHTML is the default namespace */
a {}

/* This matches all SVG <a> elements (no spaces are allowed around the |) */
svg|a {}

/* This matches both XHTML and SVG <a> elements */
*|a {}

In this example we define a default namespace @namespace url(http://www.w3.org/1999/xhtml); and a namespace for the SVG vocabulary @namespace svg url(http://www.w3.org/2000/svg);. This will allow us to style the elements based on whether it’s an HTML link or an SVG one.

Rather than have to build separate stylesheets or different selectors for each of our vocabularies we can now create one stylesheet and prefix our selectors based on the language they work with thus making them match only if both the namespace prefix and the selector match.
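
To make the difference concrete, here is a small sketch with actual declarations in the two rules. The color values are arbitrary placeholders; the point is only that the XHTML link is styled with color while the SVG link is painted with fill.

```css
@namespace url(http://www.w3.org/1999/xhtml);
@namespace svg url(http://www.w3.org/2000/svg);

/* XHTML links (default namespace): styled with color */
a {
  color: #036;
  text-decoration: underline;
}

/* SVG links: text and shapes inside the link inherit fill */
svg|a {
  fill: #036;
}

svg|a:hover {
  fill: #c30;
}
```

Keep in mind that @namespace rules must appear before any style rules in the stylesheet, otherwise they are ignored.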

@supports (also known as feature queries)

Feature queries using the @supports at-rule make it possible to write feature detection in CSS. In principle, feature queries are similar to media queries (described in this CSS Tricks article) but with a different emphasis.

Where media queries concentrate on the device capabilities (as seen in the example below for a screen wider than 1024 pixels)

@media screen and (min-width: 1025px) {
/* content for devices matching the query goes here */
}

Feature queries work by testing for a CSS capability, similar to how we should be doing feature detection in JavaScript, as shown below:

@supports (display: flex) {
/* content for browsers that support the condition goes here */
}

Bear Travis, from Adobe, presents a more realistic example in his blog post about feature queries, copied below:

@supports (background-blend-mode: multiply) {
body {
background-blend-mode: multiply;
background: linear-gradient(rgb(59, 89, 106), rgb(63, 154, 130)),
url('https://s3-us-west-2.amazonaws.com/s.cdpn.io/28727/tree_bark.png');
}
}

Furthermore, we can do more complex detections using the and, not, and or operators. For example, we can test for multiple display features by using something like this:

@supports (display: table-cell)
and (display: list-item){
/* code goes here for browsers that support
table-cell and list-item display */
}

One of my favorite uses of feature detection in CSS is to test for prefixed properties using the or operator. This makes the code less brittle because, as vendors drop prefixes for a property, the style will still match the unprefixed version of the rule and, at the same time, we provide backwards compatibility for those browsers that still need the prefixed properties.

This technique does not eliminate the need for Prefix Free or Autoprefixer, but it allows designers to code defensively without having to worry about which browser dropped which prefix when.

@supports (
(perspective: 10px)
or (-moz-perspective: 10px)
or (-webkit-perspective: 10px)
or (-ms-perspective: 10px)
or (-o-perspective: 10px)
) {
/* specific CSS applied when 3D transforms, eventually prefixed, are supported */
}

The final trick to add to the CSS feature detection arsenal is the not operator which negates the test being performed. For example, we can test for lack of support for text-align-last or its Mozilla prefixed counterpart.

@supports ( not ( (text-align-last: justify) or (-moz-text-align-last: justify) ) ) {
/* selectors and rules for browsers that don't
support text-align-last */
}

Note the parentheses. They are required when using compound expressions; otherwise the CSS parser will not know how to process the query.
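
As a rough illustration of the grouping rule, the sketch below mixes or, and, and not in one query. The .layout class is a made-up name used only for this example.

```css
/* Invalid: or and and mixed at the same level without grouping, e.g.
   @supports (display: flex) or (display: -webkit-flex) and (not (display: grid)) */

/* Valid: the inner parentheses make the grouping explicit */
@supports ((display: flex) or (display: -webkit-flex)) and (not (display: grid)) {
  /* .layout is a hypothetical class name */
  .layout {
    display: -webkit-flex;
    display: flex;
  }
}
```

A query that mixes operators without grouping is treated as invalid, so the rules inside the block are never applied.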

@supports allows progressive enhancement on the CSS side of the design equation (or it will once all browsers fully support the specification). Using the CSS cascade we might be able to do something like this:

/* First we do a plain color body background for older browsers */
body {
background-color: #ff8;
}

/* If the browser supports rgb colors we use that */
@supports (background-color: rgb(255, 255, 255)) {
body {
background-color: rgb(255, 255, 128);
}
}

/* test for hsla color space, if supported, use it */
@supports (background-color: hsla(50, 33%, 25%, 0.75)) {
body {
background-color: hsla(50, 33%, 25%, 0.75);
}
}

/* finally we try a blended background and use it if supported. */
@supports (background-blend-mode: multiply) {
body {
  background-blend-mode: multiply;
  background: linear-gradient(rgb(59, 89, 106), rgb(63, 154, 130)),
  url('https://s3-us-west-2.amazonaws.com/s.cdpn.io/28727/tree_bark.png');
  }
}

I created a CodePen with the code above to test in different browsers. So far it does what I expected. It tested the rules in order and the last one the browser supports (and assuming that it supports the @supports at-rule) will be the one displayed.

Only the last supported rule will be used so we can set up for as many capabilities of the browser as we need to. However, the results are inconsistent in the different Macintosh browsers I’ve tested this with. As you can see below, the support is not uniform across browsers or complete in the browsers that do support the specification. Still it’s a great starting point.

Specification status and browser support.

CSS Conditional Rules Module Level 3 (the recommendation that contains feature queries) is at the Candidate Recommendation stage. I’m concerned that the @supports rule is at risk, but the Blink implementation is not included in the test case suite.

As for support, the matrix looks like this:

Desktop

  • Chrome: 28.0
  • Firefox (Gecko): 22
  • Internet Explorer: Not Supported
  • Opera: 12.1
  • Safari: Not Supported

Mobile

  • Android: Not supported
  • Firefox Mobile (Gecko): 22
  • IE Mobile: Not supported
  • Opera Mobile: 12.1
  • Safari Mobile: Not supported

Quick and dirty ebook creation script

After creating all the content for an ebook there is still more work to do and I can’t always remember the exact commands to run to finish the book. I do remember that I have to do the following:

  • Delete the existing version of the book (if any) to make sure that changes are picked up in the final product
  • Delete all .DS_Store files created on my Mac. This may not always be necessary but it avoids epubcheck errors if one of these files ends up in a directory being compressed
  • Zip all files into the epub container
  • Run epubcheck on the resulting epub file

#!/bin/sh

#1. removes the existing epub file (if any)
rm -rf mybook.epub
echo "book file deleted"

#2. Remove .DS_Store file
find . -type f -name '*.DS_Store' -ls -delete
echo "deleted mac specific files"

#3. Zip the necessary files and directories.
#   The mimetype file must be the first entry and stored uncompressed (-0)
zip -X -0 mybook.epub mimetype
zip -r -X -9 mybook.epub META-INF OEBPS

#4. Run epubcheck on the resulting file from step 3
java -jar /usr/local/java/epubcheck/epubcheck-3.0/epubcheck-3.0.jar mybook.epub

#5. All Done :-) 
echo "All Done"

#open mybook.epub

CSS Paged Media Update

Ever since I wrote my original research on paged media the specs have changed considerably. Here’s an update based on the following specifications:

I’ve also tailored the project to work with Antenna House Formatter and Prince XML. Some of their idiosyncrasies will come up while developing the stylesheet for this project.

HTML to be used in these examples

The basic HTML file that will be used throughout these examples is below. It’s not a complete example by any stretch of the imagination but it will be enough to get us started.

<html>
<head>
<title>My Awesome Book</title>
<meta charset="utf-8">
</head>
<body data-type='book'>
<section data-type='titlepage'>
<h1 class='bookTitle'>Lorem Ipsum</h1>
<h2 class='author'>Carlos Araya</h2>
</section>
<nav data-type="toc">
  <!-- TO BE POPULATED LATER-->
</nav>
<section data-type='chapter'>
  <p class="rh">Introduction</p>
  <h1>Introduction</h1>
  <p>Lórem ípsum dolór sít amet, vix graeco minímum no. Iudicó atomorum praesent cum éi. Méa quem accumsan adversárium ño, ut mea illum corpora. Vídit aperiri partieñdo iñ duo, vel dicta antiopam médiocrem ád. Et omnés dolorúm perpetua eúm, est át ália labore adversaríum.
  </p>
  <p>
 Usu et adhuc phaedrum philosophiá, nec posidónium mediocritatem et, dólorum euripidis mediocritátem per et. Vís harum adipiscing ei. Et eos qúañdo primis quodsi. Nullám accusata expetenda mel et. Facilisi deseruísse qui at, módo tritáñi legéndos id ius. Denique splendide dispútando ad sit, néc ex tale bonorum consulatu.
  </p>
  </section>
  </body>
  </html>

There are a few things to notice:

  • This is not a complete document. It lacks many of the features from the stylesheet
  • Each chapter title is set up twice
    • First as a paragraph with the rh class, which we’ll take out of the regular flow of text and use as our running header
    • The second one, wrapped in h1 tags, will be shown as part of the flow of text
  • Instead of classes or ID attributes, we use data-type attributes to model after epub and the epub:type attributes
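
For comparison, this is roughly how the same selector would look against EPUB markup, using the @namespace rule covered earlier. It is a sketch that assumes the EPUB content document puts epub:type="chapter" on the same section elements and is parsed as XHTML; the declaration inside the rules is only there to give them a body.

```css
/* Namespace used by the epub:type attribute */
@namespace epub url(http://www.idpf.org/2007/ops);

/* Print/PDF source used in this article: plain data-type attribute */
section[data-type="chapter"] {
  page-break-before: always;
}

/* EPUB content document: namespaced epub:type attribute.
   ~= matches one value in a space-separated list of values. */
section[epub|type~="chapter"] {
  page-break-before: always;
}
```

The data-type version needs no namespace declaration, which keeps the print stylesheet simpler.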

Defining the base page

To define the base page we’ve used the following three elements:

The first one defines our default page and its attributes. We reset the footnote counter on every page and lay the footnotes out at the bottom of every page, spanning all potential columns and letting the footnote area grow as tall as it needs to in order to fit its content.

@page {
  size: 8.5in 11in;
  margin: 0.5in 1in;
  /* Footnote related attributes */
  counter-reset: footnote;
  @footnote {
    counter-increment: footnote;
    float: bottom;
    column-span: all;
    height: auto;
    }
  }

For the chapter page we set up the placement of a running footer. The rule only says where the footer goes and that its content comes from a running element named heading; it doesn’t say what that content is.

@page chapter {
  @bottom-center {
    vertical-align: middle;
    content: element(heading);
  }
}

For the body of our document (the body tag whose data-type is book) we set up a CMYK color rather than RGB, as CMYK is what printers use. We also set up automatic hyphenation for the entire document so we don’t have to worry about it later.

body[data-type="book"] {
  color: cmyk(0%,0%,100%,100%);
  hyphens: auto;
}

Counters

/* page counters  */
body[data-type="book"] > div[data-type="part"]:first-of-type,
body[data-type="book"] > section[data-type="chapter"]:first-of-type { counter-reset: page 1 }
body[data-type="book"] > section[data-type="chapter"]+div[data-type="part"] { counter-reset: none }

We set up the page counters so that they reset only when we want them to. The page counter resets to 1 in either of these cases:

  • There is a part element that is the first of its type among the direct children of book: body[data-type="book"] > div[data-type="part"]:first-of-type
  • There is a chapter element that is the first of its type among the direct children of book: body[data-type="book"] > section[data-type="chapter"]:first-of-type

The page counter is not reset when a part immediately follows a chapter that is a direct child of book: body[data-type="book"] > section[data-type="chapter"]+div[data-type="part"] { counter-reset: none }

Title Page

/* Title Page*/
section[data-type="titlepage"] { page: titlepage }
section[data-type="titlepage"] * { text-align: center }

For the title page we made minimal customizations; we could definitely do more. We have chosen to center all the content.

Front Matter

We define a series of pages to handle our front matter. We could define fewer pages, but then we’d have to create the others as we need them, and that’s work we’d rather not leave for later.

Copyright

/* Copyright page */
section[data-type="copyright"] { page: copyright }

We define a page for copyright and other legal information. We are leaving it empty by default.

Dedication

/* Dedication */
section[data-type="dedication"] { page: dedication }
section[data-type="dedication"] p { font-style: italic }
section[data-type="dedication"] * { text-align: center }

For the dedication element we center all the content and we make all paragraphs italic.

Table of Contents

/* TOC */
nav[data-type="toc"] { page: toc }
nav[data-type="toc"] ol { list-style-type: none }

Make the nav containing our TOC have an ordered list without numbers. This is the best semantics for the TOC I’ve found.

Foreword and Preface

/* Foreword  */
section[data-type="foreword"] { page: foreword }

/* Preface*/
section[data-type="preface"] { page: preface }

We mark both of these sections up but we don’t do any particular styling for them yet.

Front Matter Page Definition

/* Common Front Matter Page Numbering in lowercase Roman numerals */
/* Right Side */
@page toc:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

@page foreword:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

@page preface:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

/* Left Side*/
@page toc:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

@page foreword:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

@page preface:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

We define each set of pages (left and right) so that we can set up different elements on each facing page. On left-side pages we place the page number in the bottom left corner, and on right-side pages we place it in the bottom right corner. The page number is in addition to the running footer we set earlier.
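
The six front matter rules above repeat the same margin boxes. The CSS Paged Media grammar allows a comma-separated list of page selectors, so in principle they could be collapsed into two rules as sketched below. I have not verified this form in Antenna House Formatter or Prince, so treat it as something to test rather than a drop-in replacement.

```css
/* Right-hand front matter pages: number in the bottom right corner */
@page toc:right, foreword:right, preface:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

/* Left-hand front matter pages: number in the bottom left corner */
@page toc:left, foreword:left, preface:left {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}
```

If a formatter rejects the selector list, the longer one-rule-per-page version remains the safe choice.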

Parts, Chapters and Appendices

/* Part */
div[data-type="part"] { page: part }

Parts are the largest containers for our books. Right now they have no other definition but they can be extended further.

/* Chapter */
section[data-type="chapter"] {
  page: chapter;
  page-break-before: always;
}

/* Appendix */
section[data-type="appendix"] {
  page: appendix;
  page-break-before: always;
}

Chapters and Appendices always start at the top of a page because we set the page-break-before property to always.
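
The CSS Fragmentation specification defines break-before as the newer equivalent of page-break-before. A hedged sketch that keeps the older property as a fallback could look like this; whether a given formatter honors the newer property is something to check in its documentation.

```css
section[data-type="chapter"],
section[data-type="appendix"] {
  page-break-before: always; /* older property, kept as a fallback */
  break-before: page;        /* CSS Fragmentation equivalent */
}
```

Processors that don’t recognize break-before simply ignore that declaration and use the older one.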

Back Matter

/* Glossary */
section[data-type="glossary"] { page: glossary }

/* Bibliography */
section[data-type="bibliography"] { page: bibliography }

/* Index */
section[data-type="index"] { page: index }

/* Colophon */
section[data-type="colophon"] { page: colophon }

The glossary, bibliography, index and colophon (which I refer to as back matter) are set up with their own pages, which we can style later.

Content Sections and Page Numbering

/* Common Content Page Numbering  in Arabic numerals 1... 199 */
@page titlepage{ /* Need this to clean up page numbers in titlepage in Prince*/
  @bottom-right-corner { content: normal }
  @bottom-left-corner { content: normal }
}

/* Right Side*/
@page chapter:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page appendix:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page glossary:right {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page bibliography:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page index:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

/* Left Side */
@page chapter:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page appendix:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page glossary:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page bibliography:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page index:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

As we did with the front matter page numbering in Roman numerals, we number our content and back matter pages, this time using Arabic numerals.

Element Definitions

The following definitions are meant for the content.

Headings

/*  Block Elements*/
h1, h2, h3, h4, h5, h6 {
  hyphens: none;
  text-align: left;
}

h1.bookTitle { font-size: 200%; }
h2.author {
  font-size: 150%;
  font-style: italic;
}

All our headings are aligned left and will not be hyphenated. If a word would be hyphenated it will be moved to the next line instead.

We also set up specific styles for the headings in our title page. We make the title (h1.bookTitle) twice as big as our standard text, and the name of the author (h2.author) italic and 1.5 times larger than the standard text size.

Paragraphs

p {
  orphans:4; /* min number of lines of a paragraph left at bottom of a page */
  widows:2; /* min number of lines of a paragraph left at top of a page */
}

p.rh {
  position: running(heading);
  text-align: center;
  font-style: italic;

}

We set up orphans and widows for our paragraphs. Orphans and widows are typographic terms that refer to lines left hanging at the beginning or end of a paragraph.

Widows refer to:

  • A paragraph-ending line that falls at the beginning of the following page/column, thus separated from the rest of the text.

Orphans refer to:

  • A paragraph-opening line that appears by itself at the bottom of a page/column.
  • A word, part of a word, or very short line that appears by itself at the end of a paragraph. Orphans result in too much white space between paragraphs or at the bottom of a page.

In our setup a page break inside a paragraph must leave at least 4 lines at the bottom of the page (orphans) and carry at least 2 lines over to the top of the next page (widows). If a break can’t satisfy both conditions, the break is moved, which can push the entire paragraph to the next page.

We also use a paragraph style to supply the text of the running element placed earlier in the @page chapter rule. We take the paragraph with class rh (p.rh) and take it out of the normal flow with position: running(heading). We then center it and make it italic to distinguish it from any surrounding text.

img { max-width: 100% }

code { font-family: monospace }

We make sure that images never grow wider than the space available to them and that code is laid out in a monospaced font.

Footnotes

/*
  Footnotes
*/
span.footnote {
  float: footnote;
}

The paged media and generated content for paged media specifications define a new value for the float property to be used with footnotes.

::footnote-marker {
  content: counter(footnote);
  list-style-position: inside;
}

::footnote-marker::after {
  content: '. ';
}

::footnote-call {
  content: counter(footnote);
  vertical-align: super;
  font-size: 65%;
}

We define three rules for the footnote pseudo-elements. The ::footnote-marker rule uses the current value of the footnote counter as its content (content: counter(footnote);) and sets the list-style-position property. We then add a period (.) after the marker by using the ::after pseudo-element (::footnote-marker::after).

To call the footnotes we use the footnote-call pseudo-element (::footnote-call) and style it as a smaller superscript for the footnote number.

PDF Bookmarks

/*
  Bookmarks
*/
section[data-type="chapter"]  h1 {
  -ah-bookmark-level: 1;
  -ah-bookmark-state: open;
  -ah-bookmark-label: content();
  prince-bookmark-level: 1;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-level: 1;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"]  h2 {
  -ah-bookmark-level: 2;
  -ah-bookmark-state: closed;
  -ah-bookmark-label: content();
  prince-bookmark-level: 2;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-level: 2;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"]  h3 {
  -ah-bookmark-level: 3;
  -ah-bookmark-state: closed;
  -ah-bookmark-label: content();
  prince-bookmark-level: 3;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-level: 3;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"] h4 {
  -ah-bookmark-level: 4;
  prince-bookmark-level: 4;
  bookmark-level: 4;
}

section[data-type="chapter"] h5 {
  -ah-bookmark-level: 5;
  prince-bookmark-level: 5;
  bookmark-level: 5;
}

section[data-type="chapter"] h6 {
  -ah-bookmark-level: 6;
  prince-bookmark-level: 6;
  bookmark-level: 6;
}

One of the best features of the PDF generated from HTML/CSS is the ability to generate PDF bookmarks for the document content. Antenna

In this particular case, we’ll create bookmarks for chapters only.

Level 1 bookmarks are based on the h1 element and are created as level 1 PDF bookmarks that are open by default. The label for the bookmark is the content of the associated tag.

Level 2 and level 3 are based on h2 and h3 respectively. They are linked to level 2 and level 3 bookmark levels and are closed by default to make the tree narrower.

Level 4 through 6 are only associated with a bookmark level, nothing else.

Note how we repeat the content for each bookmark level 3 times: once with the Antenna House prefix, once with the Prince prefix and once unprefixed (although I don’t know if any vendor supports the unprefixed properties).

Creating the PDF

Now that we have seen the CSS code needed to create the PDF, let’s see how to use the tools to create it. I’ve tested the code with both Antenna House Formatter and Prince XML.

Antenna House Formatter

/usr/local/AHFormatterV62/run.sh \
        -d paged-media.html \
        -s paged-media.css \
        -o ahf-test.pdf
/usr/local/AHFormatterV62/bin/AHFCmd 
        -d paged-media.html 
        -s paged-media.css 
        -o ahf-test.pdf
AHFCmd : AH Formatter V6.2 MR3 Evaluation for MacOSX (x86) : 6.2.5.18171 (2014/08/04 16:28JST)
         Copyright (c) 1999-2014 Antenna House, Inc.
AHFCmd :Formatting finished normally :total 16 pages

So far Antenna House has provided the best solution for creating paged content from HTML.

The main drawback of Antenna House is cost. Their evaluation version puts a page-sized watermark on every page of the output PDF, and the watermark cuts across the text, sometimes making it look like the text itself was not set or printed correctly.

The price ranges from $400 for a standard XSL or CSS processor to $560 for both CSS and XSL processors as a single-user standalone version. This includes support.

Prince XML

prince -s paged-media.css \
        --no-author-style \
        paged-media.html \
        -o prince-test.pdf

Prince XML is another commercial solution that provides a fairly decent level of support. With the current stylesheet it prints a page number on the blank page before the first chapter and ignores the page numbering for the table of contents.

Cost is also a consideration with Prince although less so than with Antenna House. The $495 price tag includes all formats supported by the processor.

Other solutions

In earlier documents I mentioned open source solutions. I’ve tested the solutions mentioned in the earlier article against the new stylesheet. The results are shown below.

wkhtmltopdf

This product produced a bookmarked PDF but the result was less than optimal:

  • It moved the running footer to the header
  • It skipped page numbers altogether
  • It ignored our orphans and widows setting

Even with all these shortcomings this is the best option so far for creating paged media (PDF) using open source tools.

xhtml2pdf

This program can capture HTML+CSS output but seems to have a problem with the CSS in our stylesheet. I ran the command below and got the error shown below it. There seems to be an issue with the CSS parser this application uses.

xhtml2pdf --css paged-media.css paged-media.html xhtml2pdft-test.pdf
Converting paged-media.html to /Users/carlos/code/css-paged-media/xhtml2pdft-test.pdf...
Traceback (most recent call last):
  File "/usr/local/bin/xhtml2pdf", line 9, in <module>
    load_entry_point('xhtml2pdf==0.0.5', 'console_scripts', 'xhtml2pdf')()
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/pisa.py", line 146, in command
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/pisa.py", line 363, in execute
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/document.py", line 89, in pisaDocument
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/document.py", line 57, in pisaStory
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/parser.py", line 673, in pisaParser
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/context.py", line 486, in parseCSS
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 434, in parse
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 533, in _parseStylesheet
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 653, in _parseAtKeyword
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 751, in _parseAtPage
TypeError: 'NotImplementedType' object is not iterable

Phantom JS

Phantom completed the capture but it had the following issues:

  • Ignored page breaks
  • It used the font size specified for one element for all the text in the document
  • No page numbers
  • No running footers (or headers)
  • It did not create all the pages in the document

Phantom is not suited to this task. It’ll capture basic pages and process them into PDF but the nature and structure of this particular document/style sheet combination make it ill suited for processing by Phantom.

HTML is the final product, not the initial source

HTML is the final product

In researching the technologies and tools that I use when developing digital content I’ve come across multiple discussions about the best way to create HTML for a given application (ebooks, the web, transformation into other formats and any number of other ideas). Some people think that HTML is perfect for everyone to write, regardless of experience and comfort with the technology. We forget that HTML today is very different from HTML as it was originally created.

HTML —which is short for HyperText Markup Language— is the official language of the World Wide Web and was first conceived in 1990. HTML is a product of SGML (Standard Generalized Markup Language) which is a complex, technical specification describing markup languages, especially those used in electronic document exchange, document management, and document publishing. HTML was originally created to allow those who were not specialized in SGML to publish and exchange scientific and other technical documents. HTML especially facilitated this exchange by incorporating the ability to link documents electronically using hyperlinks.

From http://www.ironspider.ca/webdesign101/htmlhistory.htm

The biggest issue, in my opinion, is that HTML has become a lot more complicated than the initial design. Creating HTML content (particularly when used in conjunction with CSS frameworks like Bootstrap or Zurb Foundation, or with applications that use additional semantic elements, like ePub) takes a lot more than just knowing markup to code it correctly. It takes knowledge of the document structure, the semantics needed for the content or the applications we are creating, and the restrictions and schemas that we need to use so that the content will pass validation.

This article presents 4 different approaches to creating HTML. Two of them use HTML directly but target it as the final output for transformations and templating engines; the other two use markup like HTML without requiring strict HTML conformance. I’ve made these selections for two reasons:

  • People who are not professionals should not have to learn all the details of creating an ePub3 table of contents or know the classes to add to elements to create a Bootstrap or Foundation layout grid
  • It makes it easier for developers and designers to build the layout for the content without having to worry about the content itself; we can play with layout and content organization in parallel with content creation and, if we need to make any further changes, we just run our compilation process again

Markdown

Perhaps the simplest solution when moving content from text to HTML is Markdown.

Markdown is a text to (X)HTML conversion tool designed for writers. It refers both to the syntax used in the Markdown text files and the applications used to perform the conversion.

The Markdown language was created in 2004 by John Gruber with the goal of allowing people “to write using an easy-to-read, easy-to-write plain text format, and optionally convert it to structurally valid XHTML (or HTML)” (http://daringfireball.net/projects/markdown/).

The language was designed to be readable as-is, without the additional tags and attributes that markup languages like SGML, XML and HTML require. Markdown is a formatting syntax for text that can be read by humans and can be easily converted to HTML.

The original implementation of Markdown is Markdown.pl, a Perl script, and it has since been reimplemented in several other languages (as Ruby gems, Node.js modules and Python packages). These implementations are distributed under open source licenses and are included in, or available as plugins for, several content-management systems and text editors.

Sites such as GitHub, Reddit, Diaspora, Stack Overflow, OpenStreetMap, and SourceForge use variants of Markdown to facilitate content creation and discussion between users.

The biggest weakness of Markdown is the lack of a unified standard. The original Markdown language hasn’t really been maintained since it was released in 2004, and every new version of Markdown, in both parser and language specification, has introduced changes that are not wholly compatible with the original. The lack of a standard is also Markdown’s biggest strength: it means you can, like GitHub did, implement your own extensions to the Markdown syntax to accommodate your needs.

Markdown is not easy to learn but once your fingers get used to the way we type the different elements it becomes much easier to work with as it is nothing more than inserting specific characters in a specific order to obtain the desired effect. Once you train yourself, it is also easy to read without having to convert it to HTML or any other language.
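To make the idea of "specific characters in a specific order" concrete, here is a minimal sketch of a toy converter. It is not a real Markdown implementation and does not follow any particular Markdown flavor; it only handles `#` headings, `*emphasis*`, `**strong**` and paragraphs separated by blank lines. The function name `toy_markdown_to_html` is made up for this illustration.

```python
import html
import re


def toy_markdown_to_html(text: str) -> str:
    """Convert a tiny, illustrative subset of Markdown to HTML."""
    blocks = []
    # Blank lines separate blocks (headings or paragraphs).
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse line breaks inside a block and escape HTML characters.
        block = html.escape(" ".join(block.split()), quote=False)
        # Inline markers: **strong** first, then *emphasis*.
        block = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", block)
        block = re.sub(r"\*(.+?)\*", r"<em>\1</em>", block)
        # One to six leading '#' characters mark a heading.
        heading = re.match(r"(#{1,6}) (.*)", block)
        if heading:
            level = len(heading.group(1))
            blocks.append(f"<h{level}>{heading.group(2)}</h{level}>")
        else:
            blocks.append(f"<p>{block}</p>")
    return "\n".join(blocks)


sample = "# Subsetting fonts\n\nSmaller fonts make *smaller* books.\n\nThis is **important**."
print(toy_markdown_to_html(sample))
```

A real project would use one of the existing Markdown implementations instead; the point of the sketch is only that the source stays readable as plain text while the conversion is mechanical.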

Most modern text editors have support for Markdown either as part of the default installation or through plugins.

Example Markdown document

Markdown example from Daring Fireball

Asciidoctor

I only discovered AsciiDoc recently, while researching O’Reilly Media’s publishing toolchains. It caught my attention because of its structure, the expressiveness of the markup (without being HTML, as HTMLBook is) and the extensibility of the templating system that it uses behind the scenes.

Asciidoctor has both a command line interface (CLI) and an API. The CLI is a drop-in replacement for the asciidoc command from the original Python implementation of AsciiDoc. This means that you have a command line tool, asciidoctor, that will allow you to convert your marked-up documents without having to resort to a full-blown application.

Syntax-wise, Asciidoctor becomes progressively more complex as you use more advanced features. The first example below uses no tables, for instance, while the second and third examples use tables both for data and for layout.

The documentation provides more detailed instructions for the desired markup.

Example Asciidoc documents

HTMLBook

O’Reilly Media has developed several new tools to get content from authors to readers. Atlas is their authoring tool, a web-based application that allows you to create content. Alongside it they developed HTMLBook, a subset of HTML geared towards authoring and multi-format publishing.

Given O’Reilly’s history and association with open source publishing tools (they were an early adopter and promoter of DocBook and still use it for some of their publications) I found HTMLBook intriguing but not something to look at right away. As with many things you leave for later, it fell off my radar.

It wasn’t until I saw Sanders Kleinfeld’s (O’Reilly Media Director of Publishing Technologies) presentation at IDPF Book World conference (embedded below) that I decided to take a second look at HTMLbook and its ecosystem.

Conceptually HTMLBook is very simple: it combines a subset of HTML5, the semantic structure of ePub documents and other IDPF specifications to create a flavor of HTML5 that is designed specifically for publishing. There are also tools that will allow you to convert Markdown and other text formats into HTMLBook; see Markdown to HTMLBook (https://github.com/oreillymedia/htmlbook.js) and AsciiDoc to HTMLBook (via Asciidoctor).
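As a rough illustration of what "HTML plus publishing semantics" means, the sketch below builds a single chapter in Python. It assumes, based on my reading of the HTMLBook specification, that a chapter is a section element carrying a data-type="chapter" attribute with an h1 title as its first child; check the specification before relying on it. The helper name `htmlbook_chapter` is made up for this example.

```python
import xml.etree.ElementTree as ET


def htmlbook_chapter(title, paragraphs):
    """Build one HTMLBook-style chapter and return it as a string."""
    # Assumption: chapters are <section data-type="chapter"> elements.
    section = ET.Element("section", {"data-type": "chapter"})
    # The chapter title is the first child of the section.
    ET.SubElement(section, "h1").text = title
    for paragraph in paragraphs:
        ET.SubElement(section, "p").text = paragraph
    return ET.tostring(section, encoding="unicode")


print(htmlbook_chapter("Subsetting Fonts", ["Fonts can be large.", "Subsetting helps."]))
```

Generating the markup from a script like this keeps the data-type bookkeeping away from the author, which is the same argument made for the other approaches in this article.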

If you use Atlas (O’Reilly’s authoring and publishing platform) you don’t have to worry about markup as the content is created visually. The challenges begin when implementing this vocabulary outside the Atlas environment.

The project comes with a set of stylesheets to convert HTMLbook content to ePub, MOBI and PDF. The intriguing thing about the stylesheets is that they use CSS Paged Media stylesheets in conjunction with third party tools such as AntennaHouse, PrinceXML or their open source counterparts like XHTML2PDF or wkhtmltopdf.

The open source solutions offer permissive licenses that allow modification and integration into other products without requiring you to release your project under the same license, as the GPL and LGPL do.

As with any solution that advocates creating HTML directly I have my reservations. With HTML formatting in general, and specialized formats like HTMLBook in particular, the learning curve may be too steep for independent authors to use for creating content.

The user must learn not only the required HTML5 syntax but also the details of ePub semantic structure attributes and the other standards needed to create ePub books. I understand that technologies such as this are not meant for independent authors or for people who are not comfortable or familiar with HTML, but even so the learning curve may still be too steep for most users.

Example HTMLbook document

XML / XSLT

Perhaps the oldest solutions in the book to create HTML without actually creating HTML are XML-based. Docbook, TEI and DITA all have stylesheets that will take the XML content and convert it to HTML, PDF, ePub and other more esoteric formats.

In addition to the stylesheets already available, developers can create their own to address specific needs.

Furthermore, tools like OxygenXML Author (and, I would assume, other tools in the same category) have a visual mode that allows users to write XML content, validated against a schema, in a way that is more familiar to people not used to creating content with raw XML tools.

The issues with XML are similar to those involved in creating HTML. The markup vocabulary requires brackets, attributes have to be enclosed in quotation marks and, generally, the syntax is as complicated as you make it. Tools like Oxygen and similar editors help alleviate this problem but don’t resolve it completely.

The screenshot below shows OxygenXML Author working in a Docbook 5 document using visual mode.

Visual editing using OxygenXML Author

The positive side is that using XSLT there is no limit to what we can do with our XML content.
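The transformation idea can be sketched without an XSLT processor. The example below is not XSLT: it uses Python’s standard xml.etree.ElementTree module as a stand-in, with a small tag mapping playing the role that XSLT templates would. The input is a simplified, DocBook-like fragment without namespaces, and the names `SOURCE`, `TAG_MAP` and `to_html` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, DocBook-like source (no namespace, for illustration only).
SOURCE = """<article>
  <title>Subsetting Fonts</title>
  <para>Modern fonts have many glyphs.</para>
  <para>Subsetting removes the ones we do not use.</para>
</article>"""

# Source element -> HTML element, the job XSLT templates would do.
TAG_MAP = {"title": "h1", "para": "p"}


def to_html(xml_text):
    """Map each known child of the root element to its HTML counterpart."""
    root = ET.fromstring(xml_text)
    body = ET.Element("body")
    for child in root:
        target = TAG_MAP.get(child.tag)
        if target is None:
            continue  # skip elements this sketch does not know about
        ET.SubElement(body, target).text = child.text
    return ET.tostring(body, encoding="unicode")


print(to_html(SOURCE))
```

A real stylesheet handles nesting, inline elements, attributes and multiple output formats, which is where XSLT earns its keep; the sketch only shows the shape of the mapping.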

XML examples

Conclusion

After exploring a selection of HTML conversion options the question becomes: which one is best?

The answer is: it depends.

The best way to see how these text-based tools can be incorporated is to ask yourself how much work you want to do in the backend versus how much work you want your authors to do when creating the content. This is where specialists in digital formats and publishing become essential: we can work with clients to provide the best solution for their needs.

Keep in mind who your audience is and what target vocabulary you’re working towards; together they will dictate your best strategy. Are these all the solutions? Definitely not. Other solutions may appear that fit your needs better than those presented here; I would love to hear if that is the case.

Striking the balance between author and publisher is delicate. I tend to fall on the side of making things easier for authors… The tools can be made to translate basic markup into the desired result with minimal requirements for authors to mark up the content; the same can’t necessarily be said about the publisher-first strategy.

Theory and Practice of Digital Publishing
