John Burns Skype Meeting Notes (May 15, 2009)

Date of Meeting: May 15, 2009
In attendance: Jess Mitchell, James Yoon, Jacob Farber, Jonathan Hung, John Burns

About JStor

  • JStor scans, stores, and makes available scholarly journals across disciplines.
  • Initially JStor was trying to help solve the storage crisis in libraries (i.e. too many books, not enough space).
  • JStor is more than just about saving bookshelf space now.
  • There are 5000 institutions participating in JStor.
  • Bulk of offerings are journals.
  • Recent journals are scanned when published, then held and released after a period of time.

Production

  • The publisher just gives permission. Everything else is done by JStor.
  • Finding the full run of a journal is hard.
  • Once a whole run is found, it is sent off somewhere to be scanned.
  • Finding missing journals in a run is hard, and getting them scanned is difficult and expensive.
    o Going into collections to fill in the gaps is one place where Decapod can come in.
  • JStor is now also digitizing pamphlets (screeds) and art auction catalogs (social interest).

Motivation for Decapod:

  • No consumer grade end-to-end solution for digitization
  • Originals are valuable.
    o Owners are not willing to let the objects off their property.
    o State-run collections (like archaeological artifacts) are locked up.
  • A portable station is important for solving this security problem.
  • How to recruit and train people to scan and still generate well-structured PDFs?
    o Ease of use is important for getting volunteers up and running quickly with little overhead.
  • Send out transcription to other sites (does this mean scanning of actual books? -JH)
  • Then put the results into the Decapod workflow for processing.

Functionality

Summary of Main operations

  • Reflow
  • Zoning / regions
  • Typing of regions
  • Dewarp
  • Crop and scale
  • De-noise
  • Page replacement

Reflow

  • Easily change the way information flows / reads on a page.

Dynamic font generation

  • Preserve the look of the original work.
  • Text can therefore be reflowed without any degradation in OCR conversion.

Reflow + Dynamic font generation = Success on mobile devices

  • Most mobile devices have small screens.
  • Therefore need to be able to reformat the flow (e.g. from two columns to a single tight column) while maintaining the look of the original work.
    o (i.e. without an embedded font, you won't be able to determine where to wrap text on the smaller screen).
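
A minimal sketch of why font metrics matter for reflow, assuming a per-character width table is available from the generated font. The names `text_width`, `reflow`, `glyph_widths`, and the default widths are all hypothetical and were not discussed in the meeting. The sketch breaks lines greedily by measured width rather than by character count:

```python
def text_width(word, glyph_widths, default=0.5):
    """Sum the advance widths of a word's characters (units are arbitrary)."""
    return sum(glyph_widths.get(ch, default) for ch in word)


def reflow(words, glyph_widths, max_width, space=0.25):
    """Greedily pack words, given in reading order, into lines of max_width."""
    lines, current, used = [], [], 0.0
    for word in words:
        width = text_width(word, glyph_widths)
        # Width of the line if this word is appended to it.
        needed = width if not current else used + space + width
        if current and needed > max_width:
            # Word does not fit: close the current line and start a new one.
            lines.append(" ".join(current))
            current, used = [word], width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(" ".join(current))
    return lines


# Words from the left column followed by the right column, in reading order.
two_column_words = "aa bb".split() + "cc".split()
narrow_lines = reflow(two_column_words, glyph_widths={}, max_width=2.5)
```

Without the width table the function has nothing to measure against, which is the point of the note above: the wrap positions depend on the font.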

Zoning

  • Categorize and mark regions on a page.
  • Possible interactions
    o One click - software automatically detects regions
    o Drag boundaries / bounding box
  • Typing of regions
  • a problem at JStor

What intelligence does the software have?

  • How does it know when it runs into problems?
  • Automatic quality control?
    o What exactly is this? How will it deal with bad situations? How will the user be notified?
    o Thomas already seems to have something in place, but what is it exactly and what control does the user have?

Use Cases:

  • Old documents may not have a document structure.
    o Tell the program how many columns etc. to reduce the false alarm rate.
  • How do you deal with patent books?
    o Some flow, but mostly diagrams and annotations.
    o In this case, tell the program not to attempt to reflow the page, or even to attempt document analysis.
  • Marginalia
    o Material in the margins is sometimes important and sometimes not.
    o The current feeling is to disregard this kind of material because it is vastly complicated.
  • Foreign text
    o How do you tell the software that you have foreign text?
  • Warnings
    o Warn when the software detects broken text.
  • Maps
    o Know when to stop processing as text and deal with pages of illustrations / maps.
  • Rezoning
    o Present alternatives simultaneously and pick which zones are correct (John Burns likes this).

Problems with OCR of old books:

"transfer": ink from one side imprints on the opposite page

"bleedthrough"

  • physical bleedthrough (ink bleed through page)
  • optical bleedthrough (back of the page shows through optically)

Metric: Confidence?

  • Is it text but not recognized? Push it onto the workflow.
  • Is it text but not English? What do you do?
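
A rough sketch of how a confidence metric could drive the two questions above. Everything here is an assumption for illustration: the `Zone` fields, the `route` function, the 0.8 threshold, and the queue names were not specified in the meeting, and the handling of non-English text remains an open question.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One region of a page after OCR (hypothetical structure)."""
    text: str
    confidence: float      # 0.0 (no confidence) to 1.0 (certain)
    language: str = "en"   # language guess for the zone


def route(zone, threshold=0.8, expected_language="en"):
    """Decide what happens to a zone after OCR."""
    if zone.language != expected_language:
        # Open question from the meeting: park it until a policy exists.
        return "unresolved_language"
    if zone.confidence < threshold:
        # Text that was not reliably recognized goes onto the workflow.
        return "manual_workflow"
    return "accept"


queues = [route(z) for z in (
    Zone("clear text", 0.95),
    Zone("sm?dg?d t?xt", 0.40),
    Zone("texte en français", 0.90, language="fr"),
)]
```

The language check comes first so that low confidence caused by the wrong language model is not mistaken for a bad scan.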

Future Meetings / Tasks:

John Burns to Toronto - June 8?

  • Talk about some wireframes and use cases.

JB anticipates two pilots:
1 - user testing on the UI early.
2 - prototype into the hands of early adopters.

People Mentioned in Conversation

Dina Markham - national archives.

  • Can't let artifacts out of the facility.
  • Publicly accessible digital replicas and a portable digitization station would be more convenient for getting people to see collections.

John B Howard http://www.educause.edu/Community/MemDir/Profiles/JohnBHoward/44694

Dave Simski?

  • HP had a massive OCR project.
  • Dave Simski developed a way to manually zone regions / mark up zones.

Software

  • ATIC / atise? / ATICE? - OCR? (not sure of the name)
  • Plastic Logic