John Burns Skype Meeting Notes (May 15, 2009)
Date of Meeting: May 15, 2009
In attendance: Jess Mitchell, James Yoon, Jacob Farber, Jonathan Hung, John Burns
About JStor
- JStor: scans, stores, makes available scholarly journals, across disciplines
- Initially JStor was trying to help solve the storage crisis in libraries (i.e., too many books, not enough space)
- JStor is more than just about saving bookshelf space now.
- There are 5000 institutions participating in JStor.
- Bulk of offerings are journals.
- Recent journals are scanned when published, held and released after a period of time
Production
- The publisher just gives permission. Everything else is done by JStor.
- Finding a full run of a journal is hard
- Find a whole run, send it off to somewhere to be scanned.
- Finding missing journals in a run is hard, and getting them scanned is difficult and expensive.
o Going into collections to fill in the gaps is one place where Decapod can come in.
- Digitizing pamphlets (screeds) now and art auction catalogs (social interest)
Motivation for Decapod:
- No consumer grade end-to-end solution for digitization
- Originals are valuable.
o Owners are not willing to let the objects off their property.
o State-run collections (like archaeological artifacts) are locked up; a portable station is important for solving this security problem.
- How to recruit and train scanners and somehow generate well-structured PDFs
o Ease of use is important to getting volunteers up and running quickly with little overhead.
- Send out transcription to other sites (does this mean scanning of actual books? -JH)
- then put it into the Decapod workflow for processing.
Functionality
Summary of Main operations
- Reflow
- Zoning / regions
- Typing of regions
- Dewarp
- Crop and scale
- De-noise
- Page replacement
Reflow
- easily change the way information flows / reads on a page
Dynamic font generation
- preserve look of the original work
- Therefore can be reflowable without any degradation in OCR conversion
Reflow + Dynamic font generation = Success on mobile devices
- Mobile devices are mostly small screen
- Therefore need to be able to reformat the flow (i.e. from 2 column to single tight column) while maintaining the look of the original work
o (i.e. if no embedded font, you won't be able to determine where to wrap text on the smaller screen).
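A minimal sketch of why glyph widths matter for reflow, not Decapod's actual method. It assumes a hypothetical `glyph_widths` table (character to width, in arbitrary units) that would come from the generated font; any character missing from the table falls back to `default_width`. Words are packed greedily into lines no wider than `max_width`.

```python
def wrap_by_glyph_width(words, glyph_widths, max_width, default_width=1.0):
    """Greedily pack words into lines using per-glyph widths (hypothetical table)."""

    def measure(text):
        # Width of a string is the sum of its glyph widths.
        return sum(glyph_widths.get(ch, default_width) for ch in text)

    space = measure(" ")
    lines, current, current_width = [], [], 0.0
    for word in words:
        word_width = measure(word)
        # A word that is not first on the line also costs one space.
        extra = word_width if not current else space + word_width
        if current and current_width + extra > max_width:
            # Word does not fit: close the current line and start a new one.
            lines.append(" ".join(current))
            current, current_width = [word], word_width
        else:
            current.append(word)
            current_width += extra
    if current:
        lines.append(" ".join(current))
    return lines
```

Without the width table the function can only guess with `default_width`, which is the situation the note above describes for documents with no embedded font.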
Zoning
- Categorize and mark regions on a page.
- Possible interactions
o One click - software automatically detects regions
o Drag boundaries / bounding box
Typing of regions
- a problem at JStor
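One way to picture zoning and typing of regions is as a small data structure: a bounding box plus a type label, with an operation for dragging a boundary. This is only an illustrative sketch; the names `Region`, `RegionType`, and `drag_edge`, and the particular set of types, are assumptions and were not discussed in the meeting.

```python
from dataclasses import dataclass
from enum import Enum


class RegionType(Enum):
    # Hypothetical set of region types, for illustration only.
    TEXT = "text"
    ILLUSTRATION = "illustration"
    MARGINALIA = "marginalia"


@dataclass
class Region:
    x: int        # left edge, in pixels
    y: int        # top edge, in pixels
    width: int
    height: int
    kind: RegionType = RegionType.TEXT

    def drag_edge(self, edge, delta):
        """Move one boundary by `delta` pixels (positive = right or down)."""
        if edge == "left":
            self.x += delta
            self.width -= delta
        elif edge == "right":
            self.width += delta
        elif edge == "top":
            self.y += delta
            self.height -= delta
        elif edge == "bottom":
            self.height += delta
        else:
            raise ValueError(f"unknown edge: {edge}")
        if self.width <= 0 or self.height <= 0:
            raise ValueError("region collapsed to zero or negative size")
```

Under this sketch, "one click" detection would produce a list of `Region` objects, and the user would then correct them by dragging edges or changing `kind`.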
What intelligence does the software have?
- how does it know when it runs into problems
- automatic quality control?
o what exactly is this? How will it deal with bad situations? How will the user be notified?
o Thomas already seems to have something in place, but what is it exactly and what control does the user have?
Use Cases:
- Old documents may not have a document structure.
o tell the program how many columns etc., to reduce the false alarm rate
- How do you deal with patent books?
o some flow, but mostly diagrams and annotations.
o in this case, tell the program not to even attempt to reflow a page; don't even attempt document analysis.
- Marginalia
o material in the margins is sometimes important, sometimes not.
o current feeling is to disregard this kind of material because it is vastly complicated.
- Foreign text
o How to tell software that you have foreign text.
- Warnings
o when the software detects broken text
- Maps
o know when to stop processing as text and deal with pages of illustrations / maps.
- Rezoning
o present alternatives simultaneously and pick which zones are correct. (John Burns likes this).
Problems with OCR of old books:
"transfer": ink from one side imprints on the opposite page
"bleedthrough"
- physical bleedthrough (ink bleed through page)
- opposite bleedthrough (back page optically bleeds through)
Metric: Confidence?
- Is it text but not recognized? Push it onto workflow
- Is it text but not English? What do you do?
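The two questions above could be sketched as a routing rule on OCR confidence. This is a hypothetical illustration: the threshold value, the language codes, and the outcome labels are all assumptions, and the non-English case is left as an explicit "needs a decision" outcome because the meeting did not settle it.

```python
def route_region(is_text, confidence, language, threshold=0.8, supported=("en",)):
    """Decide what happens to a region after OCR (illustrative labels only)."""
    if not is_text:
        # Illustrations, maps, etc. are not sent through text recognition.
        return "non_text"
    if language not in supported:
        # Open question from the meeting: what to do with non-English text.
        return "needs_decision"
    if confidence < threshold:
        # Text that was not recognized well is pushed onto the workflow.
        return "manual_review"
    return "accept"
```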
Future Meetings / Tasks:
John Burns to Toronto - June 8?
- talk about some wireframes, use cases
JB anticipates two Pilots:
1 - user testing on UI early.
2 - prototype into hands of early adopters.
People Mentioned in Conversation
Dina Markham - national archives.
- can't let out artifacts from facility.
- Publicly accessible digital replicas and a portable digitization station are more convenient for getting people to see collections.
John B Howard http://www.educause.edu/Community/MemDir/Profiles/JohnBHoward/44694
Dave Simski?
- HP had a massive OCR project
- Dave Simksky - developed a way to manually zone regions / marking up zones
Software
- ATIC / atise? / ATICE?- ocr? (not sure of the name)
- Plastic Logics