John Burns Skype Meeting Notes (May 15, 2009)
Date of Meeting: May 15, 2009
In attendance: Jess Mitchell, James Yoon, Jacob Farber, Jonathan Hung, John Burns
About JStor
JStor: scans, stores, makes available scholarly journals, across disciplines
Initially JStor was trying to help solve the storage crisis in libraries (i.e. too many books, not enough space)
JStor is more than just about saving bookshelf space now.
There are 5000 institutions participating in JStor.
Bulk of offerings are journals.
Recent journals are scanned when published, held and released after a period of time
Production
publisher just gives permission. Everything else is done by JStor.
Finding full run of a journal is hard
Find a whole run, send it off to somewhere to be scanned.
Finding missing journals in a run is hard and getting it scanned is difficult and expensive.
o Going into collections to fill in the gaps is one place where Decapod can come in.
Digitizing pamphlets (screeds) now and art auction catalogs (social interest)
Motivation for Decapod:
No consumer grade end-to-end solution for digitization
originals are valuable.
o owners not willing to let the objects out of their property.
o State-run collections (like archaeological artifacts) are locked up
A portable station is important for solving this security problem.
How to recruit and train scanners and somehow generate well-structured PDFs
o Ease of use is important to getting volunteers up and running quickly with little overhead.
Send out transcription to other sites (does this mean scanning of actual books? -JH)
then put it into the decapod workflow for processing.
Functionality
Summary of Main operations
Reflow
Zoning / regions
Typing of regions
Dewarp
Crop and scale
De-noise
Page replacement
Reflow
easily change the way information flows / reads on a page
Dynamic font generation
preserve look of the original work
Therefore can be reflowable without any degradation in OCR conversion
Reflow + Dynamic font generation = Success on mobile devices
Mobile devices mostly have small screens
Therefore need to be able to reformat the flow (i.e. from 2 column to single tight column) while maintaining the look of the original work
o (i.e. if no embedded font, you won't be able to determine where to wrap text on the smaller screen).
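The link between reflow and font information can be illustrated with a minimal sketch. It assumes every word comes with a pixel width measured from the generated font; the `Word` type, the `reflow` function, and the widths used are hypothetical and not taken from Decapod. Without such widths there is no way to decide where a line should wrap in the narrower column.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    width: int  # rendered width in pixels, assumed to come from the font metrics


def reflow(words, column_width, space_width=4):
    """Greedily pack words, in reading order, into lines no wider than column_width."""
    lines, current, used = [], [], 0
    for word in words:
        # Width the line would have if this word were appended to it.
        needed = word.width if not current else used + space_width + word.width
        if current and needed > column_width:
            # Line is full: emit it and start a new one with this word.
            lines.append(" ".join(w.text for w in current))
            current, used = [word], word.width
        else:
            current.append(word)
            used = needed
    if current:
        lines.append(" ".join(w.text for w in current))
    return lines


# Words from a two-column page, already in reading order (left column, then right).
words = [Word("Scholarly", 60), Word("journals", 50), Word("across", 40), Word("disciplines", 70)]
print(reflow(words, column_width=120))
```

Narrowing `column_width` (for example to 60) puts each word on its own line, which is the single tight column case mentioned above.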
Zoning
Categorize and mark regions on a page.
Possible interactions
o One click - software automatically detects regions
o Drag boundaries / bounding box
Typing of regions
a problem at JStor
What intelligence does the software have?
how does it know when it runs into problems
automatic quality control?
o what exactly is this? How will it deal with bad situations? How will the user be notified?
o Thomas already seems to have something in place, but what is it exactly and what control does the user have?
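One way to picture zoning, typing of regions, and the open quality-control questions together is as a list of typed bounding boxes per page. The sketch below is an illustration only: the `Region` type, the region kinds, the `confidence` field, and the `needs_review` threshold are all assumptions, not Decapod's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class RegionKind(Enum):
    TEXT = "text"
    IMAGE = "image"
    MARGINALIA = "marginalia"
    UNKNOWN = "unknown"


@dataclass
class Region:
    x: int
    y: int
    width: int
    height: int
    kind: RegionKind = RegionKind.UNKNOWN
    confidence: float = 0.0  # hypothetical detector confidence, 0.0 to 1.0

    def resize(self, dx, dy):
        """Drag the bottom-right corner by (dx, dy); size never drops below 1 pixel."""
        self.width = max(1, self.width + dx)
        self.height = max(1, self.height + dy)


def needs_review(regions, threshold=0.8):
    """Return the regions a user should check: untyped or low-confidence ones."""
    return [r for r in regions if r.kind is RegionKind.UNKNOWN or r.confidence < threshold]


page = [
    Region(40, 50, 300, 800, RegionKind.TEXT, 0.95),
    Region(360, 50, 300, 400, RegionKind.IMAGE, 0.60),
    Region(10, 50, 25, 800),  # detected but not yet typed
]
for region in needs_review(page):
    print(region)
```

Here the "one click" interaction would fill in `page` automatically, while dragging a boundary corresponds to `resize`.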
Use Cases:
Old documents may not have a document structure.
o tell the program how many columns etc., to reduce the false alarm rate
How do you deal with patent books?
o some flow, but mostly diagrams and annotations.
o in this case, tell the program not to even attempt to reflow a page - don't even attempt document analysis.
Marginalia
o material in the margins is sometimes important, sometimes not.
o current feeling is to disregard this kind of material because it is vastly complicated.
Foreign text
o How to tell software that you have foreign text.
Warnings
o when the software detects broken text
Maps
o know when to stop processing as text and deal with pages of illustrations / maps.
Rezoning
o present alternatives simultaneously and pick which zones are correct. (John Burns likes this).
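Several of the use cases above amount to hints the user gives the software before processing. A sketch of how such hints might be collected follows; the names (`DocumentHints`, `plan_steps`) are hypothetical, the step labels are borrowed from the main operations list, and the order of the steps is an assumption, since the notes do not specify one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentHints:
    columns: Optional[int] = None   # e.g. 2 for a known two-column layout; None = auto-detect
    analyze_layout: bool = True     # False for patent books: skip document analysis entirely
    keep_marginalia: bool = False   # marginalia are disregarded unless asked for
    language: Optional[str] = None  # e.g. "de" for foreign text; None = unknown


def plan_steps(hints):
    """Return the processing steps to run for a document, given the user's hints."""
    steps = ["dewarp", "crop and scale", "de-noise"]
    if hints.analyze_layout:
        steps += ["zoning", "typing of regions", "reflow"]
    return steps


patent_book = DocumentHints(analyze_layout=False)
print(plan_steps(patent_book))
```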
Problems with OCR of old books:
"transfer" ink from one side imprints on opposite
"bleedthrough"
physical bleedthrough (ink bleed through page)
opposite bleedthrough (back page optically bleeds through)
Metric: Confidence?
Is it text but not recognized? Push it onto workflow
Is it text but not English? What do you do?
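The confidence questions above suggest a routing step after OCR. A minimal sketch, assuming each region carries an OCR confidence and a detected-language code; the `route` function, the threshold, and the outcome labels are hypothetical. The notes leave the non-English case open, so the sketch only flags it.

```python
def route(is_text, confidence, language, threshold=0.8, expected_language="en"):
    """Decide what happens to one region after OCR."""
    if not is_text:
        return "keep as image"
    if language != expected_language:
        # Open question in the notes; here the region is simply flagged for a person.
        return "flag: unexpected language"
    if confidence < threshold:
        # Text, but not recognized well enough: push it onto the manual workflow.
        return "manual review"
    return "accept"


print(route(is_text=True, confidence=0.55, language="en"))
```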
Future Meetings / Tasks:
John Burns to Toronto - June 8?
talk about some wireframes, use cases
JB anticipates two Pilots:
1 - user testing on UI early.
2 - prototype into hands of early adopters.
People Mentioned in Conversation
Dina Markham - national archives.
can't let artifacts out of the facility.
Publicly accessible digital replicas and a portable digitization station are more convenient for getting people to see collections.
John B Howard http://www.educause.edu/Community/MemDir/Profiles/JohnBHoward/44694
Dave Simski?
HP had a massive OCR project
Dave Simski (spelling uncertain) - developed a way to manually zone regions / mark up zones
Software
ATIC / atise? / ATICE? - OCR? (not sure of the name)
Plastic Logics