Quick and dirty ebook creation script

After creating all the content for an ebook there is still more work to do and I can’t always remember the exact commands to run to finish the book. I do remember that I have to do the following:

  • Delete the existing version of the book (if any) to make sure that changes are picked up in the final product
  • Delete all .DS_Store directories created in my Mac. This may not always be necessary but avoids epubcheck errors if you forget to remove the directory from one of the files being compressed
  • Zip all files to the zipped epub container
  • Runs epubcheck on the resulting epub file
#!/bin/sh

#1. removes the existing epub file (if any)
rm -rf mybook.epub
echo "book file deleted"

#2. Remove .DS_Store file
find . -type f -name '*.DS_Store' -ls -delete
echo "deleted mac specific files"

#3. Zip the necessary files and directories
zip -r -X mybook.epub mimetype META-INF OEBPS

#4. Run epubcheck on the resulting file from step 3
java -jar /usr/local/java/epubcheck/epubcheck-3.0/epubcheck-3.0.jar mybook.epub

#5. All Done :-)  
echo "All Done"

#open mybook.epub

CSS Paged Media Update

Ever since I wrote my original research on paged media the specs have changed considerably. Here’s an update based on the following specifications:

I’ve also tailored the project to work with Antenna House Formatter and Prince XML. Some of the idiosyncracies will come up while developing the stylesheet for this project.

HTML to be used in these examples

The basic HTML file that will be used throughout these examples is below. It’s not a complete example by any stretch of the imagination but it will be enough to get us started.

<html>
<head>
<title>My Awesome Book</title>
<meta charset="utf-8">
</head>
<body data-type='book'>
<section data-type='titlepage'>
<h1 class='title'>Lorem Ipsum</h1>
<h2 class='author'>Carlos Araya</h2>
</section>
<section data-type="toc">
  <!-- TO BE POPULATED LATER-->
</section>
<section data-type='chapter'>
  <p class="rh">Introduction</p>
  <h1>Introduction</h1>
  <p>Lórem ípsum dolór sít amet, vix graeco minímum no. Iudicó atomorum praesent cum éi. Méa quem accumsan adversárium ño, ut mea illum corpora. Vídit aperiri partieñdo iñ duo, vel dicta antiopam médiocrem ád. Et omnés dolorúm perpetua eúm, est át ália labore adversaríum.
  </p>
  <p>
 Usu et adhuc phaedrum philosophiá, nec posidónium mediocritatem et, dólorum euripidis mediocritátem per et. Vís harum adipiscing ei. Et eos qúañdo primis quodsi. Nullám accusata expetenda mel et. Facilisi deseruísse qui at, módo tritáñi legéndos id ius. Denique splendide dispútando ad sit, néc ex tale bonorum consulatu.
  </p>
  </section>
  </body>
  </html>

There are a few things to notice:

  • This is not a complete document. It lacks many of the features from the stylesheet
  • Each chapter title is set up twice
    • First as a paragraph with rh class that we’ll take out of the regular flow of text and use as our running headder
    • The second one, wrapped on h1 tags, will be shown as part of the flow of text
  • Instead of classes or ID attributes, we use data-type attributes to model after epub and the epub:type attributes

Defining the base page

To define the base page we’ve used the following three elements

The first one defines our default page and attributes. We reset the counter for every page and lay the footnotes at the bottom of every page spanning all pontential columns and allowing the height to take as much as it need to in order to fill the content.

@page {
  size: 8.5in 11in;
  margin: 0.5in 1in;
  /* Footnote related attributes */
  counter-reset: footnote;
  @footnote {
    counter-increment: footnote;
    float: bottom;
    column-span: all;
    height: auto;
    }
  }

For the chapter page we set up the layout of a running footer but doesn’t tell the page what the content of the footer is, just placement and content

@page chapter {
  @bottom-center {
    vertical-align: middle;
    content: element(heading);
  }
}

For the body of our css, body tag where the data-type is book, we set up a CMYK color rather than RGB as CMYK is what printers use. We also setup automatic hyphenation for the entire document so we don’t have to worry about it later.

body[data-type="book"] {
  color: cmyk(0%,0%,100%,100%);
  hyphens: auto;
}

Counters

/* page counters  */
body[data-type="book"] > div[data-type="part"]:first-of-type,
body[data-type="book"] > section[data-type="chapter"]:first-of-type { counter-reset: page 1 }
body[data-type="book"] > section[data-type="chapter"]+div[data-type="part"] { counter-reset: none }

We are setting page counters up so that they’ll reset when we want them to. We set the following scenarios:

  • When there is a part element that is the first of type direct child of book body[data-type="book"] > div[data-type="part"]:first-of-type or
  • There is a chapter child that is the first of type direct descendant of book body[data-type="book"] > div[data-type="part"]:first-of-type

Then reset the counters for page to 1.

  • if the first direct child of a book is a chapter that has a part siblibling body[data-type="book"] > section[data-type="chapter"]+div[data-type="part"] { counter-reset: none }

Do not reset the page counter

Title Page

/* Title Page*/
section[data-type="titlepage"] { page: titlepage }
section[data-type="titlepage"] * { text-align: center }

For the title page we made minimal customizations, we could definitely do more. We have chosen to align all the content

Front Matter

We define a series of pages to handle our front matter. We could define fewer pages but then we’d have to create them as we need them and that’s work we don’t need to do unless we really need to

Copyright

/* Copyright page */
section[data-type="copyright"] { page: copyright }

We define a page for copyright and other legal information. We are leaving it empty by default.

Dedication

/* Dedication */
section[data-type="dedication"] { page: dedication }
section[data-type="dedication"] p { font-style: italic }
section[data-type="dedication"] * { text-align: center }

For the dedication element we center all the content and we make all paragraphs italic.

Table of Content

/* TOC */
nav[data-type="toc"] { page: toc }
nav[data-type="toc"] ol { list-style-type: none }

Make the nav containing our TOC have an ordered list without numbers. This is the best semantics for the TOC I’ve found.

Foreword and Preface

/* Foreword  */
section[data-type="foreword"] { page: foreword }

/* Preface*/
section[data-type="preface"] { page: preface }

We mark both of these sections up but we don’t do any particular styling for them, not yet

Front Matter Page Definition

/* Comon Front Mater Page Numbering in lowercase ROMAN numerals*/
/* Right Side */
@page toc:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

@page foreword:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

@page preface:right {
  @bottom-right-corner { content: counter(page, lower-roman) }
  @bottom-left-corner { content: normal }
}

/* Left Side*/
@page toc:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

@page foreword:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

@page preface:left  {
  @bottom-left-corner { content: counter(page, lower-roman) }
  @bottom-right-corner { content: normal }
}

We define each set of pages (left and right) to allow setup of different elements on each facing page. On the left side pages we place the page number on the bottom left corner and we set it in the opposite corner for the right side. The page number is in addition to the running footer we set earlier

Parts, Chapters and Appendices

/* Part */
div[data-type="part"] { page: part }

Parts are the largest containers for our books, Right now they have no other definition but can be further extended

/* Chapter */
section[data-type="chapter"] {
  page: chapter;
  page-break-before: always;
}

/* Appendix */
section[data-type="appendix"] {
  page: appendix;
  page-break-before: always;
}

Chapters and Appendices always start at the top of a page by using the page-break-before selector set to always.

Back Matter

/* Glossary */
section[data-type="glossary"] { page: glossary }

/* Bibliography */
section[data-type="bibliography"] { page: bibliography }

/* Index */
section[data-type="index"] { page: index }

/* Colophon */
section[data-type="colophon"] { page: colophon }

The glossary, bibliography, index and colophon (to which I refer to as back matter) are set up with their own pages which we can style later.

Content Sections and Page Numbering

/* Common Content Page Numbering  in Arabic numerals 1... 199 */
@page titlepage{ /* Need this to clean up page numbers in titlepage in Prince*/
  @bottom-right-corner { content: normal }
  @bottom-left-corner { content: normal }
}

/* Right Side*/
@page chapter:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page appendix:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page glossary:right,  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page bibliography:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

@page index:right  {
  @bottom-right-corner { content: counter(page) }
  @bottom-left-corner { content: normal }
}

/* Left Side */
@page chapter:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page appendix:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page glossary:left, {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page bibliography:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

@page index:left {
  @bottom-left-corner { content: counter(page) }
  @bottom-right-corner { content: normal }
}

Like what we did with the front matter page numbering in roman numerals we do with our content and back matter pages using Arabic numerals

#Element Definitions

The following definitions are meant for the content.

##

/*  Block Elements*/
h1, h2, h3, h4, h5, h6 {
  hyphens: none;
  text-align: left;
}

h1.bookTitle { font-size: 200%; }
h2.author {
  font-size: 150%;
  font-style: italic;
}

All our headings are aligned left and will not be hyphenated. If a word would be hyphenated it will be moved to the next line instead.

We also setup specific styles for headings in our title page. We make the title h1.bookTitle 2 times bigger than our standard text and the name of the author h2.author italics and 1.5 times larger than the standard text size.

Paragraphs

p {
  orphans:4; /* min number of lines of a paragraph left at bottom of a page */
  widows:2; /* min number of lines of a paragraph that left at top of a page.*/
}

p.rh {
  position: running(heading);
  text-align: center;
  font-style: italic;

}

We set up orphans and widows for our paragraphs. Orphans and Widows are typographic terms that refer to hanging lines at the beginning or end of a paragraph

Widows refer to:

  • A paragraph-ending line that falls at the beginning of the following page/column, thus separated from the rest of the text.

Orphans refer to:

  • A paragraph-opening line that appears by itself at the bottom of a page/column.
  • A word, part of a word, or very short line that appears by itself at the end of a paragraph. Orphans result in too much white space between paragraphs or at the bottom of a page.

In our setup we don’t want paragraphs shorter than 2 lines to appear at the top of the page or paragraphs shorter than 4 lines to appear at the bottom. If either of those conditions are met move the entire paragraph to the next page.

We also use a paragraph style to set the content of our running header. We take the paragraph with class rh (p.rh and make it the text of our running header defined earlier. We then center it and make it italic to distinguish it from any surrounding text.

img { max-width: 100% }

code { font-family: monospace }

We make sure that images will take the full width available to them and that code will be laid out in a monospaced font.

Footnotes

/*
  Footnotes
*/
span.footnote {
  float: footnote;
}

The paged media and generated content for paged media specifications define a new value for the float attribute to be used with footnotes.

::footnote-marker {
  content: counter(footnote);
  list-style-position: inside;
}

::footnote-marker::after {
  content: '. ';
}

::footnote-call {
  content: counter(footnote);
  vertical-align: super;
  font-size: 65%;
}

We define three pseudoclasses for footnotes. We create a footnote-marker with the footnote counter’s current value as the content (content: counter(footnote);) and with a list stye position attribute. We then add a period (.) after it by using the after pseudo class ::footnote-marker::after

To call the footnotes we use the footnote-call pseudoclass (::footnote-call) and style it as a smaller superscript for the footnote number.

PDF Bookmarks

/*
  Bookmarks
*/
section[data-type="chapter"]  h1 {
  -ah-bookmark-level: 1;
  -ah-bookmark-state: open;
  -ah-bookmark-label: content();
  prince-bookmark-level: 1;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-level: 1;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"]  h2 {
  -ah-bookmark-level: 2;
  -ah-bookmark-state: closed;
  -ah-bookmark-label: content();
  prince-bookmark-level: 2;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-level: 2;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"]  h3 {
  -ah-bookmark-level: 3;
  -ah-bookmark-state: closed;
  -ah-bookmark-label: content();
  prince-bookmark-level: 3;
  prince-bookmark-state: closed;
  prince-bookmark-label: content();
  bookmark-leve: 3;
  bookmark-state: closed;
  bookmark-label: content();}

section[data-type="chapter"] h4 {
  -ah-bookmark-level: 4;
  prince-bookmark-level: 4;
  bookmark-level: 4;
}

section[data-type="chapter"] h5 {
  -ah-bookmark-level: 5;
  prince-bookmark-level: 5;
  bookmark-level: 5;
}

section[data-type="chapter"] h6 {
  -ah-bookmark-level: 6;
  prince-bookmark-level: 6;
  bookmark-level: 6;
}

One of the best features of the PDF generated from HTML/CSS is the ability to generate PDF bookmarks for the document content. Antenna

In this particular case, we’ll create bookmarks for chapters only.

Level 1 boomarks are based on the h1 element and it’s created as level 1 PDF header that is open by default. The label for the bookmark is the content of the associated tag.

Level 2 and level 3 are based on h2 and h3 respectively. They are linked to level 2 and level 3 bookmark levels and are closed by default to make the tree narrowers.

Level 4 through 6 are only associated with a bookmark level, nothing else.

Note how we repeat the content for each bookmark level 3 times, once with the Antenna House prefix, once with Prince and one unprefixed (although I don’t know if there is any vendor supporting the unprefixed properties)

Creating the PDF

Now that we have seen the CSS code needed to create the PDF, let’s see how to use the tools to create the PDF. I’ve tested the code with both Antenna House Formatter and Prince XML

Antenna House Formatter

/usr/local/AHFormatterV62/run.sh -d 
        paged-media.html -s 
        paged-media.css 
        -o ahf-test.pdf
/usr/local/AHFormatterV62/bin/AHFCmd 
        -d paged-media.html 
        -s paged-media.css 
        -o ahf-test.pdf
AHFCmd : AH Formatter V6.2 MR3 Evaluation for MacOSX (x86) : 6.2.5.18171 (2014/08/04 16:28JST)
         Copyright (c) 1999-2014 Antenna House, Inc.
AHFCmd :Formatting finished normally :total 16 pages

So far Antenna House has provided the best solution for creating paged content from HTML.

The main drawback of Antenna House is cost. Their evaluation version puts a page-sized watermark on every page of the output PDF and the watermark curs oer text, sometimes making it look like the text itself was not set or printed correctly.

The price starts at $400 for a standard XSL or CSS processor to $560 for both CSS and XSL processors as a single-user standalone version. This includes support.

Prince XML

prince -s paged-media.css 
        --no-author-style 
        paged-media.html 
        -o prince-test.pdf

Prince XML is another commercial solution that provides a fairly decent level of support. The current stylesheet prints page number in the blank page before the first chapter and ignores the page numbering for the table of content.

Cost is also a consideration with Prince although less so than with Antenna House. The $495 price tag includes all formats supported by the processor.

Other solutions

In earlier documents I mentioned open source solutions. I’ve tested the solutions mentioned in the earlier article against the new stylesheet. The results are show below

wkhtml2pdf

This product produced a bookmarked PDF but the result was less than optimal:

  • It moved the running footer to the header
  • It skipped page number altogether
  • It ignored our orphans and widows setting

Even with all these shortcomings this is the best option so far for creating paged media (PDF) using open source tools.

xhtml2pdf

This program can capture HTML+CSS output but seems to have a problem with the CSS in our stylesheet. I ran the command below with the error show below it. There seems to be an issue with the CSS parsers this application uses.

xhtml2pdf --css paged-media.css paged-media.html xhtml2pdft-test.pdf
Converting paged-media.html to /Users/carlos/code/css-paged-media/xhtml2pdft-test.pdf...
Traceback (most recent call last):
  File "/usr/local/bin/xhtml2pdf", line 9, in <module>
    load_entry_point('xhtml2pdf==0.0.5', 'console_scripts', 'xhtml2pdf')()
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/pisa.py", line 146, in command
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/pisa.py", line 363, in execute
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/document.py", line 89, in pisaDocument
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/document.py", line 57, in pisaStory
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/parser.py", line 673, in pisaParser
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/context.py", line 486, in parseCSS
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 434, in parse
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 533, in _parseStylesheet
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 653, in _parseAtKeyword
  File "build/bdist.macosx-10.9-x86_64/egg/xhtml2pdf/w3c/cssParser.py", line 751, in _parseAtPage
TypeError: 'NotImplementedType' object is not iterable

Phantom JS

Phantom completed the capture but it had the following issues:

  • Ignored page breaks
  • It used the font size specified for one element for all the text in the document
  • No page numbers
  • No running footers (or headers)
  • It did not create all the pages in the document

Phantom is not suited to this task. It’ll capture basic pages and process them into PDF but the nature and structure of this particular document/style sheet combination make it ill suited for processing by Phantom.

HTML is the final product, not the initial source

HTML is the final product

In researching the technologies and tools that I use when developing digital content I’ve come across multiple discussions about what’s the best way to create HTML for X application (ebooks, web, transforming into other formats and any number of ideas. Some people think that HTML is perfect for everyone to write, regardless of experience and comfort with the technology. We forget that HTML now is very different to HTML as it was originally created.

HTML —which is short for HyperText Markup Language— is the official language of the World Wide Web and was first conceived in 1990. HTML is a product of SGML (Standard Generalized Markup Language) which is a complex, technical specification describing markup languages, especially those used in electronic document exchange, document management, and document publishing. HTML was originally created to allow those who were not specialized in SGML to publish and exchange scientific and other technical documents. HTML especially facilitated this exchange by incorporating the ability to link documents electronically using hyperlinks.

From http://www.ironspider.ca/webdesign101/htmlhistory.htm

The biggest issue, in my opinion, is that HTML has become a lot more complicated than the initial design. Creating HTML content (particularly when used in conjunction with CSS frameworks like Bootstrap or Zurb or with applications that use additional semantic elements like ePub) takes a lot more than just knowing markup to code them correctly. It takes knowledge of the document structure, the semantics needed for the content or the applications we are creating and the restrictions and schemas that we need to use so that the content will pass validation.

This article presents 4 different approaches to creating HTML. Two of them use HTML directly but target it as the final output for transformations and templating engines; the other two use markup like HTML without requiring strict HTML conformance. I’ve made these selections for two reasons:

  • People who are not profesionals should not have to learn all the details of creating an ePub3 table of content or know the classes to add to elements to create a Bootsrap or Foundation layout grid
  • It makes it easier for developers and designers to build the layout for the content without having to worry about the content itself; we can play with layout and content organization in parallel with content creation and, if we need to make any further changes, we just run our compilation process again

Markdown

Perhaps the simplest solution when moving content from text to HTML is Markdown.

Markdown is a text to (X)HTML conversion tool designed for writers. It refers both to the syntax used in the Markdown text files and the applications used to perform the conversion.

Markdown language was created in 2004 by John Gruber with the goal of allowing people “to write using an easy-to-read, easy-to-write plain text format, and optionally convert it to structurally valid XHTML (or HTML)” (http://daringfireball.net/projects/markdown/)

The language was designed to be readable as-is, without all the additional tags and attributes that makes it possible to covert markdown to languages like SGML, XML and HTML. Markdown is a formatting syntax for text that can be read by humans and can be easily converted to HTML.

The original implementation of Markdown is markdown.pl and has been implemented in several other languages as applications (Ruby Gems, NodeJS modules and Python packages). All versions of Markdown are distributed under open source licenses and are included or available as a plugin for, several content-management systems and text editors.

Sites such as GitHub, Reddit, Diaspora, Stack Overflow, OpenStreetMap, and SourceForge use variants of Markdown to facilitate content creation and discussion between users.

The biggest weakness of Markdown is the lack of a unified standard. The original Markdown language hasn’t been really supported since it was released in 2004 and all new version of Markdown, both parser and language specification have introduced not wholy compatible changes to Markdown. The lack of standard is also Markdown’s biggest strength. It means you can, like Github did, implement your own extensions to the Markdown syntax to acommodate your needs.

Markdown is not easy to learn but once your fingers get used to the way we type the different elements it becomes much easier to work with as it is nothing more than inserting specific characters in a specific order to obtain the desired effect. Once you train yourself, it is also easy to read without having to convert it to HTML or any other language.

Most modern text editors have support for Markdown either as part of the default installation or through plugins.

Example Markdown document

Markdown example form daringfireball

Asciidoctor

I only discovered Asciidocs recently, while researching O’Reilly Media’s publishing toolchains. It caught my attention because of it’s structure, the expresiveness of the markup without being HTML like HTMLbook and the extensibility of the templating system that it uses behind the scenes.

Asciidoctor has both a command line interface (CLI) and an API. The CLI is a drop-in replacement for the asciidoc command from the Standard python distribution. This means that you have a command line tool asciidoctor that will allow you to convert your marked documents without having to resort to a full blown application.

Syntax-wise, Asciidoctor is progressively more complex as you implement more advanced features. In the first example below no tables are used, for example. Tables are used in the second and thirs examples both as data tables and for layout.

The documentation provides more detailed instructions for the desired markup.

Example Asciidoc documents

HTMLBook

O’Reilly Media has developed several new tools to get content from authors to readers. Atlas is their authoring tool, a web based application that allows you to create content they developed HTMLbook, a subset of HTML geared towards authoring and multi format publishing.

Given O’Reilly’s history and association with open source publishing tools (they were an early adopter and promoter of Docbook and still use it for some of their publications) I found HTMLbook intriguing but not something to look at right away, as with many things you leave for later it fell off my radar.

It wasn’t until I saw Sanders Kleinfeld’s (O’Reilly Media Director of Publishing Technologies) presentation at IDPF Book World conference (embedded below) that I decided to take a second look at HTMLbook and its ecosystem.

Conceptually HTMLbook is very simple; it combines a subset of HTML5, the semantic structure of ePub documents and other IDPF specifications to create a flavor of HTML 5 that is designed specifically for publishing. There are also stylesheets that will allow you to convert Markdown and other text formats into HTMLbook (see [Markdown to HTMLBook(https://github.com/oreillymedia/htmlbook.js) and AsciiDoc to HTMLBook (via AsciiDoctor).)

If you use Atlas (O'Reilly's authoring and publishing platform) you don't have to worry about markup as the content is created visually. The challenges begin when implementing this vocabulary outside the Atlas environment.

The project comes with a set of stylesheets to convert HTMLbook content to ePub, MOBI and PDF. The intriguing thing about the stylesheets is that they use CSS Paged Media stylesheets in conjunction with third party tools such as AntennaHouse, PrinceXML or their open source counterparts like XHTML2PDF or wkhtmltopdf.

The open source solutions offer permisive licenses that allow modification and integration into other products without requiring you to release your project under the same license like GPL and LGPL.

As with any solution that advocates creating HTML directly I have my reservations. In HTML formating in general and specialized formats like HTMLbooks in particular, the learning curve may be too steep for independent authors to use for creating content.

The user must learn not only the required HTML5 syntax but also the details regarding ePub semantic structure attributes and the other standards needed to create ePub books. While I understand that technologies such as this are not meant for independent authors or for poeple who are not comfortable or familiar with HTML but the learning curve may still be too steep for most users.

Example HTMLbook document

XML / XSLT

Perhaps the oldest solutions in the book to create HTML without actually creating HTML are XML-based. Docbook, TEI and DITA all have stylesheets that will take the XML content and convert it to HTML, PDF, ePub and other more esoteric formats.

In addition to stylesheets already available developers can create their own to adress specific needs.

Furthermore, tools like OxygenXML Author (and I would assume other tools in the same category have a visual mode that allow users to write XML content, validated against a schema in a way that is more familiar to people not used to creating content with raw XML tools.

The issues with xml are similar to those involved in creating HTML. The markup vocabulary requires brackets, attributes have to be enclosed in quotation marks and generall the syntax is as complicated as you make it. However, tools like Oxygen and smilar help alleviate this problem but don't resolve it completely.

The screenshot below shows OxygenXML Author working in a Docbook 5 document using visual mode.

Visual editing using OxygenXML Author
Visual editing using OxygenXML Author

The positive side is that using XSLT there is no limit to what we can do with our XML content.

XML examples

Conclusion

After exploring a selection of HTML conversion options the question becomes which one is best?

The answer is it depends.

The best way to see how can these text-based tools can be incorporated is to ask yourself how much work you want to do in the backend versus how much work do you want you authors to do when creating the content. This is where the value of specialists in digital formats and publishing becomes essential, we can work with clients in providing the best solution to meet their needs.

Keep in mind who your audience and what the target vocabulary you’re working towards, it will dictate what your best strategy is. Are these all the solutions; definitely not. Other solutions may appear that fit your needs better than those presented here; I would love to hear if that is the case.

Striking the balance between author and publisher is a delicate one. I tend to fall on the side of making things easier for authors… The tools can be made to translate basic markup into the desired result with minimal requirements for authors to mark up the content; the same can’t necessarily be said about the publisher-first strategy

Epubcheck 4 is in the horizon

Epubcheck 4 is in the horizon, you should take another look at your books.

ePubcheck 4, the IDPF validator is currently in Alpha and available for download if you want test it. I downloaded it and made a first pass to what I thought was a valid book and was, not so pleasantly, surprised at how much it has changed.

I say not so pleasantly surprised because of the additonal breadth of checks the tools performs and the cryptic nature of the messages has prompted me to rethink the way in which my documents are laid out, particularly considering that I use Docbook to generate my epub content and it’s not laid out as standard HTML (uses tables to layout content wich means the table doesn’t need the same structure needed for data tables… I know… so 1990s but I did not create the stylesheets).

I’m concerned that publishers will use the new litmus test of epubcheck 4 to reject books taht would have normally passed the epubcheck 3.0.1 validation test.

The positive side of all these warnings is that documents will be more accessible because of the additional checks, assuming that authors (and publishers) bother with implementing any of the rules their books are flagged for.

Visualizing CSS properties

One of my earliest experiments in data visualization was to create visualization of the CSS properties as they are documented in the Web Platform Documentation project. Now that it’s in a shape where I’m willing to let it out in the wild it’s time to write about it and explain the rationale and the technology.

Rationale

I’m a visual person. Rather than search for something that I may or may not know exactly what it is, I’d rather look at something that, I hope, will make it easier for me to find what I’m looking for.

I’m also lazy. Instead of looking for a property in one place and then manually typing the full URL of the Web Platform Documentation project I’d rather add the URLs for all properties directly to the visualization so that, when the I find the property he’s looking for, I can go directly to it from the visualization tree.

Building the visualization

The data

The first thing I did was to pull the data from the Web Platform Documentation project using their API to generate an initial JSON file. I then had to edit the file manually to produce something closer to the JSON format that I was looking for:

{
    "name": "CSS",
    "children": [
        {
            "name": "Alignment",
            "children": [
                {
                    "size": 1000,
                    "name": "align-content",
                    "url": "http://docs.webplatform.org/wiki/css/properties/align-content"
                },
                {
                    "size": 1000,
                    "name": "align-items",
                    "url": "http://docs.webplatform.org/wiki/css/properties/align-items"
                },
                {
                    "size": 1000,
                    "name": "align-self",
                    "url": "http://docs.webplatform.org/wiki/css/properties/align-self"
                },
                {
                    "size": 1000,
                    "name": "alignment-adjust",
                    "url": "http://docs.webplatform.org/wiki/css/properties/alignment-adjust"
                },
                {
                    "size": 1000,
                    "name": "alignment-baseline",
                    "url": "http://docs.webplatform.org/wiki/css/properties/alignment-baseline"
                }
            ]
        }
    ]
}

Every time I edited or made a change to the JSON file (the resulting full JSON file is about 2500 lines of code) I ran it through JSON lint to make sure that the resulting content was valid JSON. I haven’t always done this and it has been a constant source of problems: The page appears blank, only part of the content is displayed and other anoyances that took forever to correct.

Once we have the JSON file working, we can move into the D3 code.

The technology

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
From http://d3js.org

What this means is that we can build visual content based on data have collected or arbitrary data we have available. In this case we are visualizing an arbitrary grouping of CSS properties from the Web Platform Documentation project; all properties are listed but the grouping may change as the group comes to a consensus regarding the groups.

D3 follows a fairly straightforward process. We start by defining all our variables at the top of the script to prevent Javascrp variable hoisting. The code looks like this:

// Starting values were:
// width: 2140 - margin.right - margin.left
// height : 1640 - margin.top - margin.bottom
var margin = {top: 20, right: 120, bottom: 20, left: 120},
    width = 1070 - margin.right - margin.left,
    height = 820 - margin.top - margin.bottom;

var i = 0,
    duration = 750,
    root

var tree = d3.layout.tree()
    .size([height, width]);

var diagonal = d3.svg.diagonal()
    .projection(function(d) { return [d.y, d.x]; });

We create the SVG-related elements that we need in order to display the visualization data. The steps in the code below are:

  • Select the body of the document
  • Append the svg element
  • Set up width and heigh attributes with default values
  • Create a SVG group (indicated by the <g> tag) and translate it (move it by the ammount indicated by the top and left margin)
var svg = d3.select("body")
  .append("svg")
    .attr("width", width + margin.right + margin.left)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");

We load the JSON file using D3′s JSON loader and set up the root element and its position.

The function collapse makes sure tha elements without children are collapsed when we first open the visualization page. I wanted to make sure that users would not be overwhelmed with all the information available in the visualization and had a choice as to what items they would click and what information they’d access.

Preventing children from automatically displaying also prevents the clutter of the tree. If there are too many children open the vertical space gets reduced and it becomes hard to distinguish which item we are clicking in.

I’ve also set a default height for all elements… 100px sounds good at this stage.

d3.json("json/css-data.json", function(error, css) {
  root = css;
  root.x0 = height / 2;
  root.y0 = 0;

function collapse(d) {
    if (d.children) {
      d._children = d.children;
      d._children.forEach(collapse);
      d.children = null;
    }
  }  root.children.forEach(collapse);
  update(root);
});

d3.select(self.frameElement).style("height", "100px");

Because the click will change the nature of the layout and the number of visible elements we need to update the layout everytime the user clicks on a valid element. This wil involve hidding old elements,showing showing new nodes in the tree.

function update(source) {

// Compute the new tree layout.
  var nodes = tree.nodes(root).reverse(),
      links = tree.links(nodes);

// Normalize for fixed-depth.
  nodes.forEach(function(d) { d.y = d.depth * 200; });

// Update the nodes…
  var node = svg.selectAll("g.node")
      .data(nodes, function(d) { return d.id || (d.id = ++i); });

// Enter any new nodes at the parent's previous position.
  var nodeEnter = node.enter().append("g")
      .attr("class", "node")
      .attr("class", function(d) {
        if (d.children) {
          return "inner node"
        } else {
          return "leaf node"
        }
      })
      .attr("transform", function(d) { return "translate(" + source.y0 + "," + source.x0 + ")"; })
      .on("click", click);

Using D3′s Enter/Append/Exist system we go back into the nodes we created, we append a new circle and set its radius and color (lines 1 – 3 in the code below).

Next I add the text for each node, set up their X (line 5) and Y (line 6) coordinates for the text node. I’ve aligned the text usin a D3 trick where setting the Y value to .35em centers the text vertically.

For each leaf node I set up a link as only elements without children have URL attributes. We do this in two steps:

  • Append a SVG anchor element (svg:a) which is different than our regular HTML anchor (line 7)
  • Add an Xlink, the XML vocabulary for defining links between resources (line 8) using the xlink:href syntax

Finally, we setup and place the linkend attribute for each node in such a way that nodes with children will display their text to the left of the assigned circle and nodes without children will display the text to the right of the circle (line 11)

  nodeEnter.append("circle")
    .attr("r", 1e-6)
    .style("fill", function(d) { return d._children ? "lightsteelblue" : "#fff"; });
    nodeEnter.append("text")
      .attr("x", function(d) { return d.children || d._children ? -10 : 10; })
      .attr("dy", ".35em")
      .append("svg:a")
      .attr("xlink:href", function(d){return d.url;})
      .style("fill-opacity", 1)
      .text(function(d) { return d.name; })
    .attr("text-anchor", function(d) { return d.children || d._children ? "end" : "start"; });

Most of the remaining work is to transition elements to and from their current position. This would be so much easier if we were using a library such as jQuery or Dojo but the result is worth the additional code.

The duration for all transitions is hardcoded to 750 miliseconds. Whether duration affects the user experiecne is an area to look further into.

// Transition nodes to their new position.
  var nodeUpdate = node.transition()
      .duration(duration)
      .attr("transform", function(d) { return "translate(" + d.y + "," + d.x + ")"; });

nodeUpdate.select("circle")
      .attr("r", 4.5)
      .style("fill", function(d) { return d._children ? "lightsteelblue" : "#fff"; });

nodeUpdate.select("text")
      .style("fill-opacity", 1);

// Transition exiting nodes to the parent's new position.
  var nodeExit = node.exit().transition()
      .duration(duration)
      .attr("transform", function(d) { return "translate(" + source.y + "," + source.x + ")"; })
      .remove();

nodeExit.select("circle")
      .attr("r", 1e-6);

nodeExit.select("text")
      .style("fill-opacity", 1e-6);

// Update the links…
  var link = svg.selectAll("path.link")
      .data(links, function(d) { return d.target.id; });

// Enter any new links at the parent's previous position.
  link.enter().insert("path", "g")
      .attr("class", "link")
      .attr("d", function(d) {
        var o = {x: source.x0, y: source.y0};
        return diagonal({source: o, target: o});
      });

// Transition links to their new position.
  link.transition()
      .duration(duration)
      .attr("d", diagonal);

// Transition exiting nodes to the parent's new position.
  link.exit().transition()
      .duration(duration)
      .attr("d", function(d) {
        var o = {x: source.x, y: source.y};
        return diagonal({source: o, target: o});
      })
      .remove();

// Stash the old positions for transition.
  nodes.forEach(function(d) {
    d.x0 = d.x;
    d.y0 = d.y;
  });

The final bit of magic is to use D3′s onClick event to toggle the display of our content.

var svg
// Toggle children on click.
function click(d) {
  if (d.children) {
    d._children = d.children;
    d.children = null;
  } else {
    d.children = d._children;
    d._children = null;
  }
  update(d);
}

Where to go next?

There are some areas I want to further explore as I move forward with the visualization and learn more about how to visualize data:

  • Does the length of the transitions change the way people react to the data?
  • How can we control the space between items when they are too many open?

I will post the answers to these questions as I find the answers :-)

Theory and Practice of Digital Publishing