Decapod Training Module 3 - Digitization Best Practices

Camera capture

Using cameras to digitize printed material has its benefits and challenges. This topic will help guide you through some of the issues.

When to use cameras

Cameras are best suited for situations where:

  • the material is too delicate to be flattened (as on a flatbed scanner or in a similar mechanical process)
  • the book cannot be opened flat (due to a fragile spine or too many pages)
  • the material is too large for a traditional scanner
  • the work cannot be removed from its location (so a mobile digitization setup is important)
  • speed is desired (most cameras work faster than a consumer-grade flatbed scanner)

Properly lighting the material

Lighting is important when photographing printed material. Shadows, hotspots, and uneven lighting can make content hard to read and may affect OCR results.

Eliminating Shadows:

  • The easiest way to eliminate or reduce shadows is to photograph in a well lit room.
  • Good overhead lighting and sunlight through windows are good sources of light.
  • Avoid spot lights as they tend to make shadows darker and create hotspots.
  • Where good lighting is unavailable, you will need to provide additional light sources yourself (e.g. desk lamps or some other form of lighting).
  • If using your own lighting, positioning the lights on opposite sides of the book will help reduce shadows.
  • If using two lights, placing one at the top of the center gutter and the other at the bottom will help reduce the shadow caused by the page curving toward the gutter.
  • Avoid using the on-camera flash as this will likely create a hotspot in the center and shadows around the edges of the image.

Eliminating Hotspots:

Hotspots occur when a bright light is focused on a surface. The effect is a bright white spot on the image which washes away detail.

  • Avoid using camera flash and spotlighting.
  • If you must use camera flash or spotlighting, adding a light diffuser can help soften the light and reduce hotspots.
  • Alternatively, bouncing the light off another surface first can help scatter it, creating more even, ambient light.

Camera "noise" and ISO

If the lighting is poor, images from digital cameras can appear blotchy or noisy. In this situation, increasing the amount of ambient light will help. If the camera also has manual ISO settings, reducing the ISO will generally improve the results as well.

White balance

Most current digital cameras set white balance automatically, but occasionally the camera's automatic setting yields photos that appear blue or orange. This is caused by the camera setting the white balance improperly.

In these cases the user will need to set the white balance manually.

  • Change the white balance preset. Many cameras have preset white balance settings for different types of light: incandescent, fluorescent, cloudy, etc. Select the mode that best matches your situation.
  • Try taking photos in a room with different lighting or near a window.
  • Mixed lighting (e.g. fluorescent with incandescent bulbs) can also cause the colours in an image to look inaccurate. If possible, turn off the lights of one particular type (e.g. turn off the fluorescent lights).
  • If your camera supports custom white balance, using a neutral-grey card or a white card to calibrate will yield the best results.

Aperture and subject distance

There may be instances where parts of an image are in focus and other parts are not. This is caused by a combination of the camera's depth of field being too shallow and the distance from the subject to the camera not being uniform.

  • Make sure that from edge to edge, the material is equidistant from the camera (i.e. one part of the page is not closer to the camera than another part). You may need to prop up the material (e.g. with a cradle) to help position the page surface properly.
  • Make sure the lighting is bright and balanced (the brighter the scene, the more of it the camera will be able to keep in focus).
  • If the camera has manual aperture settings, set the aperture to F5.6.

Positioning materials

  • Position the material so that from edge to edge, it fills the camera's view as much as possible.
  • Keep the page surface as equidistant from the camera as possible.
  • If the material cannot be laid flat without damage to the spine, a cradle or support may help.
  • Once both camera and content are positioned correctly, do not change the height, camera angles, or content positioning. Modifying any of these once you have started can break the visual consistency of the images (e.g. text size varies from page to page).
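
Filling the frame matters because a camera has a fixed number of pixels: the more of them that land on the page, the finer the detail captured. As a rough sketch (the function name, the 4000-pixel image width, and the 8-inch page width are all hypothetical), the effective pixels per inch across the page could be estimated like this:

```python
def effective_ppi(image_width_px, page_width_in, fill_fraction=1.0):
    """Approximate pixels per inch captured across the page width.

    fill_fraction is the share of the image width that the page occupies
    (1.0 means the page spans the frame from edge to edge).
    """
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be greater than 0 and at most 1")
    return image_width_px * fill_fraction / page_width_in


if __name__ == "__main__":
    # Hypothetical camera producing images 4000 pixels wide, page 8 inches wide.
    print(effective_ppi(4000, 8))        # page fills the frame
    print(effective_ppi(4000, 8, 0.5))   # page fills only half the frame
```

With these assumed numbers, a page that spans the whole frame is captured at 500 pixels per inch, and one that fills only half the frame at 250.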

Page curl

  • Flatten the page by holding down the edges (e.g. with fingers, paperclips, or elastics).
  • Use a transparent weighted material to flatten the pages (e.g. glass or clear plastic).
  • Be careful not to damage the material.
  • If using Decapod's 3D capture methods, you will not have to physically flatten pages.

Perspective distortion

  • Ensure that the camera is at 90 degrees to the page surface, so that no part of the page is angled away from the lens.
  • This may be easier with a tripod, or by propping up the book at an angle.
  • If available, an adjustable arm or boom to position the camera over the material can also help.

Scanner capture

When to use a flatbed scanner?

  • small to medium-sized material that fits on the scanner's surface (typically A4 / 8.5 x 11 inches or smaller).
  • material that isn't very thick (thick material is harder to place on the scanner because the pages bulge, and the lid is harder to close).
  • sturdier material that can withstand being handled more.

Controlling bleed through

The bright light of a scanner can sometimes cause the text on the back of the page to "bleed" through the paper. To reduce this effect, place a black piece of paper behind the page being scanned.

Large materials

If the material to be digitized is larger than the glass surface of the scanner, you may have to scan it in multiple parts and then stitch the parts together using image editing software.

An alternative is to use a camera to digitize the material instead.

Image Resolution (DPI)

Image resolution on a flatbed scanner is measured in dots per inch, or DPI: the higher the DPI, the more detail will appear in the resulting scan. While a higher DPI is desirable for preserving detail, the resulting file will be larger and may pose challenges for storage and data transfer.

  • For archive and preservation quality, the minimum DPI is 600.
  • For printing - 300 to 600 DPI.
  • For screen viewing (and viewing over the web) - 100 to 200 DPI.
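
To get a feel for the storage trade-off, the uncompressed size of a scan can be estimated from the page size, the DPI, and the bit depth. The sketch below is only an estimate: the function name and the 8.5 x 11 inch page are illustrative assumptions, and real files will differ once file headers and any compression are taken into account.

```python
def estimate_scan_size(width_in, height_in, dpi, bits_per_pixel=24):
    """Return (width_px, height_px, size_bytes) for an uncompressed scan.

    Assumption: size is pixel count times bytes per pixel, ignoring file
    headers and any compression the file format may apply.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_bytes = width_px * height_px * bits_per_pixel // 8
    return width_px, height_px, size_bytes


if __name__ == "__main__":
    # Hypothetical letter-size page (8.5 x 11 inches) in 24-bit colour.
    for dpi in (100, 300, 600):
        w, h, size = estimate_scan_size(8.5, 11, dpi)
        print(f"{dpi} DPI: {w} x {h} px, about {size / 1_000_000:.1f} MB")
```

Doubling the DPI quadruples the pixel count, so under these assumptions a 600 DPI scan of the page is about 101 MB uncompressed, against about 25 MB at 300 DPI.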

Processing and Managing Images

Allowable enhancements, modifications, and fidelity

The decision to post-process images will depend on the requirements of the project. However, it is recommended that the master files be as faithful as possible to the original size, shape, colour, texture, and condition.

You can decide whether to remove marks that were added after publication and are not part of the original document, such as an ownership stamp or watermark. These marks can be removed digitally as long as important content is not removed in the process.

If bleed through is visible, you may also choose to fix it digitally, although reducing bleed through during digitization is often more efficient. See "Controlling bleed through" for more information.

Other digital enhancements should be used with caution as some processing can significantly alter the appearance of the content.

File formats

Images can exist in different formats, each with their own advantages and disadvantages.

  • For preservation and archiving, the TIFF file format is the industry standard. For colour content, 24-bit colour is used when creating the TIFF file. For black and white and greyscale content, 8-bit grey is used.
  • For web distribution: PNG or JPEG.
  • PDF is a convenient format for collecting all images into a single document, and it can contain searchable text.

File Management - Keeping Master files

In a digitization workflow, it is important to save a "Master" file before any further work is done to the image. The Master file is the original image as it was captured by the digitizing device (scanner, camera, etc.). This file is stored and preserved for 3 reasons:

  1. working from a copy of the master file provides the most flexibility compared to working from an image that has already been processed
  2. if contributing content to archiving services or content repositories, those services may only accept original master files
  3. the master file serves as a backup

Once a master file has been saved, copies of the master file can be produced and modified as needed.
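
The split between master and working copy could be scripted along these lines. This is a minimal sketch using only the Python standard library; the function name, the separate masters and working folders, the SHA-256 checksum, and the read-only permission are all assumptions made for illustration, not part of any Decapod workflow.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def archive_master(captured_file, masters_dir, working_dir):
    """Store a captured image as a read-only master and make a working copy.

    Returns (master_path, working_path, sha256_hex_of_master).
    """
    captured_file = Path(captured_file)
    masters_dir = Path(masters_dir)
    working_dir = Path(working_dir)
    masters_dir.mkdir(parents=True, exist_ok=True)
    working_dir.mkdir(parents=True, exist_ok=True)

    master_path = masters_dir / captured_file.name
    if master_path.exists():
        # Never overwrite an existing master.
        raise FileExistsError(f"master already exists: {master_path}")

    # copy2 keeps timestamps along with the file contents.
    shutil.copy2(captured_file, master_path)
    # The checksum lets the master be verified later (e.g. after a backup).
    checksum = hashlib.sha256(master_path.read_bytes()).hexdigest()
    # Mark the master read-only so that edits happen on the working copy.
    master_path.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

    working_path = working_dir / captured_file.name
    shutil.copyfile(master_path, working_path)  # contents only, stays writable
    return master_path, working_path, checksum
```

All further processing would then be done on the file in the working folder, leaving the master untouched.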

Unique Identifiers and File naming conventions

Unique identifiers for the works to be digitized, and for the resulting files, should follow the identifier convention given by the metadata standard in use.

For example, the Dublin Core metadata schema describes the identifier as follows:

An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Examples of formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Source: http://dublincore.org/documents/usageguide/elements.shtml

If a work is already catalogued in a database, it may already have a unique ID (such as an ISBN). In that case it is fine to use this ID, since it satisfies the metadata convention being followed.

Once an identifier has been chosen for a particular book, all digital files generated from that book should be named according to this chosen ID followed by an appropriate suffix.

Example

ISBN Identifier: 123-6-12-987654-0

Unique ID as entered into cataloging system or database: 123-6-12-987654-0

Digitized Images file naming:

  • 123-6-12-987654-0-0001.tif (i.e. front cover)
  • 123-6-12-987654-0-0002.tif (i.e. inside front cover)
  • 123-6-12-987654-0-0003.tif
  • 123-6-12-987654-0-0004.tif
  • 123-6-12-987654-0-0005.tif
  • ...

XML, Excel, or other files associated with the book:

  • 123-6-12-987654-0.zip
  • 123-6-12-987654-0.txt
  • 123-6-12-987654-0.xls
  • etc.
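
A naming scheme like this is easy to generate programmatically. The sketch below assumes a four-digit, zero-padded page sequence and a .tif extension, as in the example above; the function names are hypothetical.

```python
def page_filename(identifier, page_number, extension="tif", digits=4):
    """Build '<identifier>-<zero-padded page number>.<extension>'."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    return f"{identifier}-{page_number:0{digits}d}.{extension}"


def associated_filename(identifier, extension):
    """Build '<identifier>.<extension>' for files that describe the whole book."""
    return f"{identifier}.{extension}"


if __name__ == "__main__":
    book_id = "123-6-12-987654-0"  # example identifier from the text
    for n in range(1, 4):
        print(page_filename(book_id, n))
    print(associated_filename(book_id, "zip"))
```

Zero-padding the page number keeps the files in page order when they are sorted by name.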

Metadata standards

Metadata for books describes the resource for the purposes of discovery and consumption. The Dublin Core metadata standard is a widely adopted scheme for describing content for repositories, and features a structured way to describe content by title, creator, language, and other elements.

Following a metadata standard (like Dublin Core, http://dublincore.org/) will help collection administrators generate essential data on their content, but this data must then be stored and maintained.

While metadata can be stored in any digital format (text file, spreadsheet, or database), it is recommended that a standard be followed in all cases.

Suggested Dublin Core fields (adapt to your project as needed):

  • Title
  • Creator
  • Subject
  • Description
  • Date
  • Date Original
  • Type
  • Format Medium
  • Format Extent
  • Identifier
  • Source
  • Coverage Spatial
  • Rights
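
Since metadata can be kept in any digital format, one simple option is a CSV file with one row per book and one column per field. The sketch below uses the field names listed above as column headers; the record values, the function name, and the rule that Title and Identifier must be filled in are illustrative assumptions, not Dublin Core requirements.

```python
import csv
import io

FIELDS = [
    "Title", "Creator", "Subject", "Description", "Date", "Date Original",
    "Type", "Format Medium", "Format Extent", "Identifier", "Source",
    "Coverage Spatial", "Rights",
]


def records_to_csv(records, required=("Title", "Identifier")):
    """Serialise metadata records (dicts) to CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    for record in records:
        unknown = set(record) - set(FIELDS)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        missing = [name for name in required if not record.get(name)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        writer.writerow(record)  # fields not supplied are written as empty cells
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical record; only the identifier comes from the example in the text.
    print(records_to_csv([
        {"Title": "An Example Book", "Creator": "A. Author",
         "Identifier": "123-6-12-987654-0", "Type": "Text"},
    ]))
```

Rejecting unknown or incomplete records at this stage helps keep the fields consistent across the whole collection.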

Digital Content Management

Storing and maintaining digital content can become complex and unwieldy - therefore it is recommended that a strategy be developed before digitization of materials begins.

It is possible to develop a custom strategy, or to use a digital content management (DCM) system offered by another party, such as DSpace or Fedora Commons.

Copyright, permissions, and ownership

Before commencing any digitization effort, copyright and ownership should be addressed. It is important that a project have a clear permission statement for digitizing, distributing, modifying, and storing digital copies of printed content. Skipping this work can result in content that violates existing agreements and rights, and that content may consequently be voided.

Consistency

When digitizing content, it is desirable to be consistent page to page as this will ensure a nice, uniform final product. To help maintain consistency while digitizing, keep the following in mind:

  • Once digitization has started, do not make drastic adjustments to positioning, treatment, or settings.
  • Maintain a consistent margin or frame around the main content area from start to finish.

Omitting pages

In any given book, there may be pages not worth digitizing, such as blank pages or pages inserted after publication (non-original content).

Clear guidelines for omitting pages should be stated up front to reduce confusion as work progresses.