Experiences with Pandoc

Software tools

For generating EPUB 3 from HTML

Converting HTML to EPUB

We use Pandoc to convert HTML to EPUB with the following command:

pandoc 01-velocity.html -o 01-velocity.epub -w epub3 -f html -R

You can list multiple input files, separated by spaces, and Pandoc writes them to a single EPUB file.
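
As a rough illustration of the multi-file form, the sketch below builds the same command for several inputs from Python. The second chapter name and book.epub are hypothetical, and the flags are simply copied from the command above, so they may need adjusting for the Pandoc version in use.

```python
import subprocess
from typing import List, Sequence


def build_pandoc_command(inputs: Sequence[str], output: str) -> List[str]:
    """Return the argument list for one Pandoc run over several HTML files."""
    if not inputs:
        raise ValueError("at least one input file is required")
    # Flags copied from the command above; they are not validated here.
    return ["pandoc", *inputs, "-o", output, "-w", "epub3", "-f", "html", "-R"]


def convert(inputs: Sequence[str], output: str) -> None:
    """Run Pandoc. Requires the pandoc executable to be on PATH."""
    subprocess.run(build_pandoc_command(inputs, output), check=True)


# Hypothetical chapter files, listed in reading order.
command = build_pandoc_command(
    ["01-velocity.html", "02-acceleration.html"], "book.epub"
)
print(" ".join(command))
```

Calling convert() with the same arguments would run the command; it is kept separate so the argument list can be inspected first.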


  • Self-closing HTML void elements (e.g. <source>) should include the optional trailing slash ("/"). Omitting it may cause certain EPUB readers (such as Readium and iBooks) to misinterpret the markup and insert spurious closing </source> tags.
  • The controls attribute of the <video> element must include a value; otherwise the reading system may report an error. Use <video controls="controls">, not <video controls>.
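
Both fixes can be applied mechanically before running Pandoc. The following is a minimal regex-based sketch, not a full HTML parser; the function name normalize_media_markup and the VOID_ELEMENTS tuple are our own illustrative choices, and it assumes attribute values do not contain "<" or ">".

```python
import re

# Assumption: only <source> needs fixing, as in the note above; extend as needed.
VOID_ELEMENTS = ("source",)

# Matches a void element start tag, with or without a trailing slash.
_VOID_TAG = re.compile(
    r"<(" + "|".join(VOID_ELEMENTS) + r")\b([^<>]*?)\s*/?>", re.IGNORECASE
)
# Matches a <video> start tag up to a controls attribute that has no value.
_BARE_CONTROLS = re.compile(
    r"(<video\b[^<>]*?\scontrols)(?!\s*=)(?=[\s/>])", re.IGNORECASE
)


def normalize_media_markup(html: str) -> str:
    """Self-close void elements and give bare controls attributes a value."""
    html = _VOID_TAG.sub(
        lambda m: "<{}{} />".format(m.group(1), m.group(2)), html
    )
    return _BARE_CONTROLS.sub(r'\1="controls"', html)


sample = '<video controls><source src="clip.mp4" type="video/mp4"></video>'
print(normalize_media_markup(sample))
```

Running the function twice gives the same result, so it should be safe to apply to files that are already partly fixed.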

Media Overlays

Pandoc cannot currently include media overlays in an EPUB archive. The general workaround is to use Pandoc to create the EPUB file without the media overlay and then add the overlay to the archive manually. This section describes how we did that.

  1. Decide what level of granularity you want the highlighting to happen at: word, sentence, paragraph, etc.

  2. Ensure there's an ID attribute on any HTML element you want highlighted.
    1. NOTE: Pandoc currently moves and removes IDs inappropriately. See below for a workaround for this.

  3. Record an audio narration of the text. We used the free tool Audacity (http://audacity.sourceforge.net/).
  4. Identify start and end timecodes for the blocks of audio corresponding to the granularity level you chose:
    1. In Audacity, select the wave segment for the audio in question
    2. Insert a label using Add Label at Selection (Ctrl+B, or Cmd+B on macOS). The first time you do this, Audacity will automatically create a Label track.
    3. Name the label using the exact ID of the associated HTML element.

  5. Export Audacity's label file. The output will look something like this:

    0.185760	9.102222	c01p0002
    9.380862	11.702857	c01h02
    11.702857	15.185850	c01list001item001

    where each line consists of <start timecode> <end timecode> <label>, separated by tabs, with timecodes in seconds.

  6. Convert the timecodes into SMIL <par> elements, as per the EPUB Media Overlays specification, using the included awk program:

    > awk -f convert.awk -v htmlFile=01-velocity.html -v audioFile=audio/01-velocity.mp3 01-velocity-timecodes.txt > 01-velocity.smil
  7. Add the appropriate SMIL header and footer to the output of the awk script, as well as any desired <seq> elements.

    <smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
      <body>
        ... paste the output of the awk script here ...
      </body>
    </smil>
  8. Use pandoc to create EPUB from HTML, etc. (see Converting HTML to EPUB above).

  9. Unzip the EPUB to access the manifest file, etc.

    > unzip velocity.epub
  10. Edit the manifest file content.opf as necessary:
    1. add duration metadata to the top of the document, inside the <metadata> element:

      <meta property="media:duration">0:00:59.000</meta>
      <meta property="media:duration" refines="#ch001_overlay">0:00:59.000</meta>
    2. add <item> elements for the new files, making sure to use the correct media type:
      1. the SMIL file
      2. the audio recording(s)
      <item id="ch001_overlay" href="01-velocity.smil" media-type="application/smil+xml"/>
      <item id="ch001_overlay_mp3" href="audio/01-velocity.mp3" media-type="audio/mpeg" />
    3. add a media-overlay attribute to <item>s for the html file(s), referencing the ID of the relevant SMIL file:

      <item id="ch001_xhtml" href="ch001.xhtml" media-type="application/xhtml+xml" properties="mathml" media-overlay="ch001_overlay" />
  11. Add the overlay-related resources and the edited manifest back into the EPUB archive using zip on the command line:

    > zip -X9Dr velocity.epub content.opf 01-velocity.smil audio/01-velocity.mp3
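
Steps 5 to 7 and the duration value in step 10 can be sketched in Python as follows. This is a hypothetical stand-in for the awk program mentioned in step 6, not a copy of it: the function names, the par id scheme, and the use of the last end timecode as the overall duration are assumptions. The SMIL wrapper is the minimal one, without any <seq> elements.

```python
from typing import Iterable, List, Tuple
from xml.sax.saxutils import quoteattr

Label = Tuple[float, float, str]  # (start seconds, end seconds, element ID)


def parse_labels(lines: Iterable[str]) -> List[Label]:
    """Parse Audacity label lines of the form: start, end, label."""
    labels = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        labels.append((float(fields[0]), float(fields[1]), fields[2]))
    return labels


def par_element(index: int, label: Label, html_file: str, audio_file: str) -> str:
    """Build one <par> pairing an HTML element with its audio clip."""
    start, end, element_id = label
    text_src = quoteattr("{}#{}".format(html_file, element_id))
    return (
        '    <par id="par{:03d}">\n'
        "      <text src={}/>\n"
        '      <audio src={} clipBegin="{:.6f}s" clipEnd="{:.6f}s"/>\n'
        "    </par>"
    ).format(index, text_src, quoteattr(audio_file), start, end)


def build_smil(labels: List[Label], html_file: str, audio_file: str) -> str:
    """Wrap all <par> elements in a minimal SMIL header and footer."""
    pars = [
        par_element(i, label, html_file, audio_file)
        for i, label in enumerate(labels, 1)
    ]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">',
        "  <body>",
        *pars,
        "  </body>",
        "</smil>",
    ]
    return "\n".join(lines) + "\n"


def clock_value(seconds: float) -> str:
    """Format seconds as h:mm:ss.mmm, the form used for media:duration above."""
    millis = int(round(seconds * 1000))
    hours, rest = divmod(millis, 3600000)
    minutes, rest = divmod(rest, 60000)
    return "{}:{:02d}:{:02d}.{:03d}".format(hours, minutes, rest // 1000, rest % 1000)


sample = [
    "0.185760\t9.102222\tc01p0002",
    "9.380862\t11.702857\tc01h02",
    "11.702857\t15.185850\tc01list001item001",
]
labels = parse_labels(sample)
smil = build_smil(labels, "ch001.xhtml", "audio/01-velocity.mp3")
print(smil)
# Assumption: the last end timecode approximates the overlay duration.
print(clock_value(labels[-1][1]))
```

Note that the <text> src has to point at the XHTML file name inside the EPUB (ch001.xhtml in the manifest example above), which may differ from the name of the HTML source file.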

Pandoc Text Level Block Bug Testing

Input:  <span aria-label="This is an aria label" id="label-01">Some text</span>
Output: <p><span aria-label="This is an aria label" id="label-01">Some text</span></p>

Input:  <div aria-label="This is an aria label" id="label-02">Some text</div>
Output: <div aria-label="This is an aria label" id="label-02">Some text</div>

Input:  <p aria-label="This is an aria label" id="label-03">Some text</p>
Output: <p>Some text</p>

Input:  <h2 aria-label="This is an aria label" id="label-04">A header</h2>
Output: <section id="label-04" class="level2" aria-label="This is an aria label"><h2>A header</h2></section> 

Input:  <h2><span aria-label="This is an aria label" id="label-05"> A header</span></h2>
Output: <section id="a-header" class="level2"><h2><span id="label-05" aria-label="This is an aria label"> A header</span></h2></section>

This issue has been posted to the pandoc-discuss list: https://groups.google.com/forum/#!topic/pandoc-discuss/ofzgm8LxoG4
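
Because overlays depend on IDs staying on the intended elements, it may help to compare IDs before and after conversion. The sketch below is one possible check using only the Python standard library; IdCollector and compare_ids are illustrative names, and it only looks at which element name carries each ID.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class IdCollector(HTMLParser):
    """Record which element name carries each id attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.id_tags: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.id_tags[value] = tag


def id_tags(markup: str) -> Dict[str, str]:
    """Map each ID found in the markup to the name of its element."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    return collector.id_tags


def compare_ids(source_html: str, converted_html: str) -> Tuple[List[str], List[str]]:
    """Return (removed, moved) IDs, where moved means a different element name."""
    before, after = id_tags(source_html), id_tags(converted_html)
    removed = sorted(i for i in before if i not in after)
    moved = sorted(i for i in before if i in after and before[i] != after[i])
    return removed, moved


# Two of the cases listed above, used as sample data.
removed, moved = compare_ids(
    '<p aria-label="This is an aria label" id="label-03">Some text</p>'
    '<h2 aria-label="This is an aria label" id="label-04">A header</h2>',
    "<p>Some text</p>"
    '<section id="label-04" class="level2" aria-label="This is an aria label">'
    "<h2>A header</h2></section>",
)
print("removed:", removed)
print("moved:", moved)
```

Any ID reported as removed or moved would need attention before it is referenced from a SMIL file.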