Decapod PDF Generation Documentation
High-Level Description
This module implements the export feature of Decapod. It takes the dewarped, camera-captured document images as input and can convert them into several types of PDF:
- PDF containing only the original images
- PDF containing the original images with underlying text
- PDF/A containing the OCRed version of the documents including font information
The PDF generation uses OCRopus as its OCR system.
Current State
Currently operational features:
- PDF containing only the original images (this is the only supported output format for release 0.3)
- PDF containing the original images with underlying text
- Tokenized PDF: a proof of concept for the final PDF/A output. It uses token-based compression similar to JBIG2.
Future Work
Tokenized PDF generation will be enhanced to incorporate automatically generated font information in order to create valid PDF/A output and achieve a good compression ratio while maintaining a faithful representation of the captured documents.
Technical Details
Technical details can be found in: