Decapod PDF Generation Documentation

High-Level Description

This module implements the export feature of Decapod. It takes dewarped, camera-captured document images as input and can transform them into the following types of PDF:

  • PDF containing only the original images
  • PDF containing the original images with underlying text
  • PDF/A containing the OCRed version of the documents including font information

The PDF generation uses OCRopus as its OCR system.

Current State

Currently operational features:

  • PDF containing only the original images (this is the only supported output format for release 0.3)
  • PDF containing the original images with underlying text
  • Tokenized PDF: a proof of concept for the final PDF/A output. It uses token-based compression similar to JBIG2.
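
To illustrate the idea behind token-based compression, here is a minimal sketch in Python. It is not taken from the Decapod code base: the names (tokenize, max_diff), the bitmap representation, and the Hamming-distance matching rule are all assumptions chosen for illustration. The sketch stores each distinct glyph shape once as a token and records every occurrence on the page as a reference to that token plus a position, which is the general principle shared with JBIG2 symbol coding.

```python
from typing import List, Tuple

# A glyph bitmap is a tuple of rows, each row a tuple of 0/1 pixels.
# (Hypothetical representation, chosen only for this sketch.)
Bitmap = Tuple[Tuple[int, ...], ...]
Glyph = Tuple[Bitmap, int, int]       # (bitmap, x, y) of one glyph on the page
Placement = Tuple[int, int, int]      # (token index, x, y)


def same_shape(a: Bitmap, b: Bitmap) -> bool:
    """Return True if both bitmaps have identical dimensions."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count differing pixels between two bitmaps of identical size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def tokenize(glyphs: List[Glyph], max_diff: int = 1) -> Tuple[List[Bitmap], List[Placement]]:
    """Replace near-identical glyph bitmaps with references to shared tokens.

    max_diff is an assumed tolerance: two same-sized bitmaps differing in at
    most this many pixels are treated as the same token (lossy matching).
    """
    tokens: List[Bitmap] = []
    placements: List[Placement] = []
    for bitmap, x, y in glyphs:
        # Look for the first existing token that is close enough.
        for idx, tok in enumerate(tokens):
            if same_shape(bitmap, tok) and hamming(bitmap, tok) <= max_diff:
                break
        else:
            # No match found: this bitmap becomes a new token.
            tokens.append(bitmap)
            idx = len(tokens) - 1
        placements.append((idx, x, y))
    return tokens, placements
```

In this sketch a page is represented by the token list plus the placement list. The more often a glyph shape repeats, the smaller that representation becomes compared with storing every glyph bitmap separately. A larger max_diff merges more glyphs but risks substituting one character for a visually similar one.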

Future Work

Tokenized PDF generation will be enhanced to incorporate automatically generated font information, in order to create valid PDF/A output and achieve a good compression ratio while maintaining a faithful representation of the captured documents.

Technical Details

Technical details can be found in:

Link to the documentation in the decapod-genpdf repository