John Burns Skype Meeting Notes (May 15, 2009)

Date of Meeting: May 15, 2009
In attendance: Jess Mitchell, James Yoon, Jacob Farber, Jonathan Hung, John Burns

About JStor

  • JStor scans, stores, and makes available scholarly journals across disciplines.

  • Initially JStor was trying to help solve the storage crisis in libraries (i.e. too many books, not enough space).

  • JStor is more than just about saving bookshelf space now.

  • There are 5000 institutions participating in JStor.

  • Bulk of offerings are journals.

  • Recent journals are scanned when published, then held and released after a period of time.

Production

  • The publisher just gives permission. Everything else is done by JStor.

  • Finding a full run of a journal is hard.

  • Once a whole run is found, it is sent off somewhere to be scanned.

  • Finding missing journals in a run is hard, and getting them scanned is difficult and expensive.
    o Going into collections to fill in the gaps is one place where Decapod can come in.

  • Now also digitizing pamphlets (screeds) and art auction catalogs (social interest).

Motivation for Decapod:

  • No consumer grade end-to-end solution for digitization

  • Originals are valuable.
    o Owners are not willing to let the objects leave their property.
    o State-run collections (like archaeological artifacts) are locked up.

  • A portable station is important for solving this security problem.

  • How to recruit and train scanners and somehow generate well-structured PDFs
    o Ease of use is important for getting volunteers up and running quickly with little overhead.

  • Send out transcription to other sites (does this mean scanning of actual books? -JH)

  • Then put it into the Decapod workflow for processing.

Functionality

Summary of Main operations

  • Reflow

  • Zoning / regions

  • Typing of regions

  • Dewarp

  • Crop and scale

  • De-noise

  • Page replacement

Reflow

  • Easily change the way information flows / reads on a page.

Dynamic font generation

  • Preserves the look of the original work.

  • Text can therefore be reflowed without any degradation from OCR conversion.

Reflow + Dynamic font generation = Success on mobile devices

  • Mobile devices mostly have small screens.

  • Therefore need to be able to reformat the flow (e.g. from two columns to a single tight column) while maintaining the look of the original work
    o (i.e. without an embedded font, you won't be able to determine where to wrap text on the smaller screen).
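
A minimal sketch of the two-column to single-column reflow described above, assuming the text has already been recognized as lines of words. The function name `reflow_columns` is hypothetical, and a character-count width stands in for the glyph widths that an embedded or generated font would supply; it is not Decapod's actual method.

```python
import textwrap


def reflow_columns(left_column, right_column, width):
    """Join two columns in reading order and re-wrap them to a narrower width."""
    # Assumption: the whole left column is read before the right column.
    words = " ".join(left_column + right_column).split()
    # Character count is a stand-in for real glyph widths from a font.
    return textwrap.wrap(" ".join(words), width=width)


left = ["Initially the archive", "was trying to help"]
right = ["solve the storage", "crisis in libraries."]
single = reflow_columns(left, right, width=24)
```

With real font metrics, the wrap point would be chosen from measured glyph widths rather than character counts, which is why the notes tie reflow to font generation.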

Zoning

  • Categorize and mark regions on a page.

  • Possible interactions
    o One click - software automatically detects regions
    o Drag boundaries / bounding box

  • Typing of regions

  • a problem at JStor
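
One way to picture the zoning interactions above is a typed bounding box whose edges can be dragged. This is only an illustrative sketch: the `Region` class, its field names, and the region labels are assumptions, not part of any existing Decapod code.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A rectangular zone on a page, in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int
    kind: str = "unknown"  # hypothetical labels, e.g. "text", "image", "table"

    def drag_edge(self, edge, new_value):
        """Move one boundary, as a user would by dragging it."""
        if edge not in ("left", "top", "right", "bottom"):
            raise ValueError(f"unknown edge: {edge}")
        setattr(self, edge, new_value)
        # Keep the box well-formed if an edge is dragged past its opposite.
        if self.left > self.right:
            self.left, self.right = self.right, self.left
        if self.top > self.bottom:
            self.top, self.bottom = self.bottom, self.top


region = Region(left=40, top=100, right=300, bottom=420)
region.drag_edge("right", 280)  # user tightens the right boundary
region.kind = "text"            # user (or software) types the region
```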

What intelligence does the software have?

  • How does it know when it runs into problems?

  • Automatic quality control?
    o What exactly is this? How will it deal with bad situations? How will the user be notified?
    o Thomas already seems to have something in place, but what is it exactly and what control does the user have?

Use Cases:

  • Old documents may not have a document structure.
    o Tell the program how many columns, etc., to reduce the false alarm rate.

  • How do you deal with patent books?
    o Some flow, but mostly diagrams and annotations.
    o In this case, tell the program not to even attempt to reflow a page; don't even attempt document analysis.

  • Marginalia
    o Content in the margins is sometimes important, sometimes not.
    o Current feeling is to disregard this kind of content because it is vastly complicated.

  • Foreign text
    o How to tell the software that you have foreign text?

  • Warnings
    o Warn when the software detects broken text.

  • Maps
    o Know when to stop processing as text and deal with pages of illustrations / maps.

  • Rezoning
    o Present alternatives simultaneously and pick which zones are correct (John Burns likes this).

Problems with OCR of old books:

"transfer": ink from one side imprints on the opposite page

"bleedthrough"

  • physical bleedthrough (ink bleed through page)

  • opposite bleedthrough (back page optically bleeds through)

Metric: Confidence?

  • Is it text but not recognized? Push it onto workflow

  • Is it text but not English? What do you do?
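
The two questions above could be sketched as a simple routing rule. Everything here is an assumption for illustration: the function name `route_region`, the 0.8 threshold, the language codes, and the outcome labels. The notes leave the non-English case open, so the sketch only flags it for the user.

```python
def route_region(is_text, confidence, language,
                 threshold=0.8, expected_language="en"):
    """Decide what to do with one zoned region after OCR."""
    if not is_text:
        return "skip_ocr"        # e.g. illustrations or maps
    if language != expected_language:
        return "flag_language"   # open question in the notes; just flag it
    if confidence < threshold:
        return "manual_review"   # text but not recognized: push onto workflow
    return "accept"


decisions = [
    route_region(True, 0.95, "en"),
    route_region(True, 0.40, "en"),
    route_region(True, 0.90, "de"),
    route_region(False, 0.00, "en"),
]
```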

Future Meetings / Tasks:

John Burns to Toronto - June 8?

  • talk about some wireframes, use cases

JB anticipates two pilots:
1 - user testing on UI early.
2 - prototype into hands of early adopters.

People Mentioned in Conversation

Dina Markham - national archives.

  • Can't let artifacts out of the facility.

  • Publicly accessible digital replicas and a portable digitization station are more convenient for getting people to see collections.

John B Howard http://www.educause.edu/Community/MemDir/Profiles/JohnBHoward/44694

Dave Simski?

  • HP had a massive OCR project

  • Dave Simski (spelling uncertain) developed a way to manually zone regions / mark up zones.

Software

  • ATIC / atise? / ATICE? - OCR? (not sure of the name)

  • Plastic Logic