0.4 functionality

Testing genpdf

Using the functionality in tip as of Aug 2, 2011

Observations:

automatic deskew correction if conditions are right
- not sure what conditions need to exist for deskew to occur
- does not happen with images -> possibly related to detected page geometry?
Type 2 and Type 3 PDF generation is functional. Detected text is not completely accurate (is this expected?).
binary version stored in Book Store
given a set of images that are handwritten with some colour (example), Type 2 PDF comes out greyscale with no selectable text (example).
given a set of strictly colour images, generating a Type 1 PDF gives different results per script:
- decapod-genpdfEXP.py produces an empty PDF.
- decapod-genpdf.py produces a greyscale PDF.
content of output PDFs is fit to A4 by default.
For Type 3 PDF, genpdf automatically reverts to Type 2 if character segmentation fails.
For Type 1 PDF, binary map generation, paragraph map generation, and line segmentation are always performed.
- As a consequence, images with no detectable text still go through lengthy processing.
Some PDF metadata is generated, e.g. PDF author is "DECAPOD GenPDF", Producer is "ReportLab http://www.reportlab.com"
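
The Type 3 to Type 2 fallback noted above can be sketched as follows. This is only an illustration: `resolve_pdf_type` and the constants are hypothetical names, not genpdf's actual code, and the mapping of Types 1, 2 and 3 to image, overlaid and scalable output is an assumption based on the formats listed under Global functions.

```python
# Hypothetical sketch; these names are illustrative, not genpdf's real API.
# Assumed mapping: Type 1 = image, Type 2 = overlaid text, Type 3 = scalable.
IMAGE_ONLY, OVERLAID, SCALABLE = 1, 2, 3


def resolve_pdf_type(requested, char_segmentation_ok):
    """Return the PDF type to generate for the requested type."""
    if requested not in (IMAGE_ONLY, OVERLAID, SCALABLE):
        raise ValueError("unknown PDF type: %r" % (requested,))
    # Type 3 depends on character segmentation; without it, fall back to Type 2.
    if requested == SCALABLE and not char_segmentation_ok:
        return OVERLAID
    return requested
```

A caller would run character segmentation first, then pass its success flag in, so the fallback decision lives in one place.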

Future Improvements & Questions

Colour output option: binary, greyscale, and colour.
Straight Image PDF generation option (no segmentation requested).
- Option to disable page & text segmentation.
PDF metadata options (user can specify their own metadata).
Ability to disable auto-deskew (so auto-deskew can happen earlier in the pipeline and not have to attempt deskew twice).
Does upgrading from OCRopus 0.4.4 improve text results? What work would be needed to support a newer version of OCRopus?
Is there a fail-quick method of detecting text on a page?
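
One possible answer to the fail-quick question is a cheap pre-check on the binarized page before any segmentation runs. The sketch below is a heuristic of our own, not something OCRopus or genpdf provides: it assumes printed text shows up as several horizontal bands of inked rows separated by blank rows, and that a page which is mostly ink is a picture. The function name and thresholds are hypothetical and would need tuning on real scans.

```python
def looks_like_text(binary_rows, min_bands=3, max_ink_ratio=0.5):
    """Cheap guess at whether a binarized page contains lines of text.

    binary_rows is a list of rows, each a list of 0 (background) or 1 (ink).
    """
    if not binary_rows or not binary_rows[0]:
        return False
    total = sum(len(row) for row in binary_rows)
    ink = sum(sum(row) for row in binary_rows)
    # Blank pages and mostly-ink pages (e.g. photos) are rejected early.
    if ink == 0 or ink > max_ink_ratio * total:
        return False
    # Count bands: each transition from a blank row to an inked row starts one.
    bands = 0
    previous_inked = False
    for row in binary_rows:
        inked = any(row)
        if inked and not previous_inked:
            bands += 1
        previous_inked = inked
    return bands >= min_bands
```

Running this on a heavily downsampled copy of the page would keep the check fast; a False result could then route the page straight to image-only output.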

Example Test: Type 2 PDF Test - "Computer Generated 600DPI document"

Example Test: Type 3 PDF Test - "Computer Generated 600DPI document"

Still to do

More testing of:

Remote Capture

Import from file system

Questions
- What if images are different resolutions? Unpredictable segmentation results.
- What if pages are inconsistent sizes? Unpredictable segmentation results.
- What if images contain a mix of page spreads and single pages? Okay if resolution and page sizes are consistent.
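
Since mixed resolutions and page sizes give unpredictable segmentation, an import step could warn about them up front. The following is a minimal sketch under assumptions: `PageInfo` and `find_inconsistencies` are hypothetical names, the first page is taken as the reference, and each image is assumed to carry a positive DPI value.

```python
from collections import namedtuple

# Hypothetical record for an imported image; dpi is assumed to be positive.
PageInfo = namedtuple("PageInfo", "name width_px height_px dpi")


def _size_in_inches(page):
    return (page.width_px / float(page.dpi), page.height_px / float(page.dpi))


def find_inconsistencies(pages, size_tolerance=0.02):
    """Compare every page against the first one and return warning strings."""
    warnings = []
    if not pages:
        return warnings
    ref = pages[0]
    ref_w, ref_h = _size_in_inches(ref)
    for page in pages[1:]:
        if page.dpi != ref.dpi:
            warnings.append("%s: resolution %d dpi differs from %d dpi"
                            % (page.name, page.dpi, ref.dpi))
        w, h = _size_in_inches(page)
        # Physical sizes may differ by a small relative tolerance.
        if (abs(w - ref_w) > size_tolerance * max(w, ref_w)
                or abs(h - ref_h) > size_tolerance * max(h, ref_h)):
            warnings.append("%s: page size differs from %s"
                            % (page.name, ref.name))
    return warnings
```

Comparing physical size rather than pixel size means a 300 DPI scan of the same page triggers only the resolution warning, not a size warning.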

During page management, present thumbnails and images as they would appear in the output file.

Image with text case:

automatically deskew image
automatically binarize image? Display images/thumbnails in binary, or keep them in colour and defer any colour processing to an export "preview"?

Image with no recognizable text case:

Global functions

choose colour depth of output: binary or colour (original)
choose PDF format: image, overlaid, or scalable.
Any way to fail quickly at page segmentation? This way Decapod can switch to image PDF output if it doesn't detect text.
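
The global functions above could be exposed as two export options plus a fallback rule. The sketch below is hypothetical: the `--colour` and `--format` flags are invented for illustration and are not options of decapod-genpdf.py, and `effective_format` simply encodes the "switch to image PDF if no text is detected" idea.

```python
import argparse


def build_parser():
    """Build a parser for the proposed (hypothetical) export options."""
    parser = argparse.ArgumentParser(
        description="Hypothetical export options, not the real genpdf CLI")
    parser.add_argument("--colour", choices=["binary", "colour"],
                        default="colour", help="colour depth of the output")
    parser.add_argument("--format", dest="pdf_format",
                        choices=["image", "overlaid", "scalable"],
                        default="overlaid", help="PDF format to generate")
    return parser


def effective_format(requested_format, text_detected):
    """Fall back to image output when no text was detected on the page."""
    if not text_detected:
        return "image"
    return requested_format
```

For example, `build_parser().parse_args(["--format", "scalable"])` yields `pdf_format == "scalable"`, which `effective_format` would downgrade to `"image"` for a page with no detected text.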