fluid-work IRC Logs-2010-06-25

[03:26:35 CDT(-0500)] * thomas____ (~thomasain@213.246.129.247) has joined #fluid-work
[05:43:34 CDT(-0500)] * zafar (~zafar@2001:638:208:4807:a6ba:dbff:fe02:2266) has joined #fluid-work
[07:21:16 CDT(-0500)] * kasper (~kasper@189.130.61.33) has joined #fluid-work
[07:32:35 CDT(-0500)] * jameswy (~jameswy@pool-173-79-253-178.washdc.fios.verizon.net) has joined #fluid-work
[07:45:34 CDT(-0500)] * anastasiac (~stasia@dsl-173-206-253-45.tor.primus.ca) has joined #fluid-work
[07:58:52 CDT(-0500)] * yura (~yura@206-248-135-76.dsl.teksavvy.com) has joined #fluid-work
[08:02:45 CDT(-0500)] * jessm (~Jess@c-71-232-3-151.hsd1.ma.comcast.net) has joined #fluid-work
[08:10:11 CDT(-0500)] * michelled (~michelled@CPE001310472ade-CM0011aefd3ca8.cpe.net.cable.rogers.com) has joined #fluid-work
[08:18:52 CDT(-0500)] * clown (~clown@bas1-cooksville17-1177947439.dsl.bell.ca) has joined #fluid-work
[08:51:47 CDT(-0500)] * colinclark (~colin@bas2-toronto09-1176132185.dsl.bell.ca) has joined #fluid-work
[09:07:03 CDT(-0500)] * clown (~clown@bas1-cooksville17-1279271672.dsl.bell.ca) has joined #fluid-work
[09:59:46 CDT(-0500)] * bsparks (~bsparks@wsip-72-215-204-133.ph.ph.cox.net) has joined #fluid-work
[10:01:17 CDT(-0500)] * yura1 (~yura@206-248-135-76.dsl.teksavvy.com) has joined #fluid-work
[10:16:24 CDT(-0500)] * anastasiac_ (~stasia@dsl-173-206-253-45.tor.primus.ca) has joined #fluid-work
[10:36:42 CDT(-0500)] * mackrauss (~Armin@bas2-toronto09-2925336671.dsl.bell.ca) has joined #fluid-work
[10:37:49 CDT(-0500)] * clown (~clown@bas1-cooksville17-1279271586.dsl.bell.ca) has joined #fluid-work
[11:02:10 CDT(-0500)] * anastasiac (~stasia@dsl-173-206-253-45.tor.primus.ca) has joined #fluid-work
[12:18:45 CDT(-0500)] * clown (~clown@bas1-cooksville17-1279271586.dsl.bell.ca) has joined #fluid-work
[12:42:12 CDT(-0500)] * jhung (~Jon@H25.C204.cci.switchworks.net) has joined #fluid-work
[12:42:44 CDT(-0500)] <jhung> colinclark: you summoned?
[12:43:59 CDT(-0500)] <colinclark> hey jhung
[12:43:59 CDT(-0500)] <colinclark> ho
[12:43:59 CDT(-0500)] <colinclark> how's it going?
[12:44:03 CDT(-0500)] <colinclark> lol
[12:44:08 CDT(-0500)] <jhung> lol
[12:44:09 CDT(-0500)] * michelled (~michelled@CPE001310472ade-CM0011aefd3ca8.cpe.net.cable.rogers.com) has joined #fluid-work
[12:44:14 CDT(-0500)] <colinclark> So in the process of fixing other bugs, I was looking at DECA-82
[12:44:31 CDT(-0500)] <colinclark> As a side effect of getting pathing issues sorted out, I've fixed half of it
[12:44:35 CDT(-0500)] <jhung> ok let me look that up...
[12:44:38 CDT(-0500)] <colinclark> The other half mentions file name conventions
[12:44:45 CDT(-0500)] <colinclark> But I don't fully understand it
[12:44:48 CDT(-0500)] <colinclark> Maybe you can elaborate
[12:46:09 CDT(-0500)] <jhung> Sorry. it's a poorly written bug. :{
[12:46:28 CDT(-0500)] <jhung> Two issues: 1. The path should be something more logical.
[12:47:00 CDT(-0500)] <jhung> 2. The Image filenames should be expanded to use padded zeroes instead of "Image1.jpg" etc.
[12:47:40 CDT(-0500)] <jhung> The note about the Ocropus directory structure deals with both these issues - file naming and directory structure as it relates to book structure
[12:48:20 CDT(-0500)] <colinclark> Yeah
[12:48:25 CDT(-0500)] <colinclark> ok, so #1 is fixed
[12:48:42 CDT(-0500)] <colinclark> By default, captured images go into a "captured-images" directory inside the server, but this can be configured to be anywhere
[12:48:49 CDT(-0500)] <colinclark> #2 is basically padding
[12:48:54 CDT(-0500)] <colinclark> How many zeroes do we want?
[12:49:03 CDT(-0500)] <colinclark> 0001 seem about right?
[12:49:05 CDT(-0500)] <jhung> 4 should be sufficient.
[12:49:23 CDT(-0500)] <colinclark> Like 00001?
[12:49:41 CDT(-0500)] <jhung> 0001 - 9999
[12:49:45 CDT(-0500)] <colinclark> ok
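The padding scheme agreed on above (four digits, 0001 through 9999) can be sketched as follows. This is only an illustration, not the Decapod server's actual code: the function name `padded_image_name` and its `prefix`, `width`, and `ext` parameters are hypothetical, and the "Image" prefix and ".jpg" extension are simply taken from the "Image1.jpg" example mentioned in the discussion.

```python
def padded_image_name(index, width=4, prefix="Image", ext="jpg"):
    """Return a zero-padded file name such as Image0001.jpg.

    All names and defaults here are assumptions made for illustration.
    """
    max_index = 10 ** width - 1  # 9999 when width is 4
    if not 1 <= index <= max_index:
        raise ValueError(f"index must be between 1 and {max_index}")
    # The nested format spec pads the number with zeroes up to `width` digits.
    return f"{prefix}{index:0{width}d}.{ext}"


print(padded_image_name(1))
print(padded_image_name(42))
```

One practical benefit of fixed-width numbering is that a plain alphabetical sort of the file names matches capture order, which is not true of "Image1.jpg", "Image10.jpg", "Image2.jpg".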
[12:50:02 CDT(-0500)] <colinclark> So, tell me more about support for the OCRopus book format
[12:50:17 CDT(-0500)] <colinclark> Is that something currently handled by the genpdf script or something?
[12:50:31 CDT(-0500)] <jhung> Not exactly.
[12:51:11 CDT(-0500)] <jhung> Let me start by describing what I know of Ocropus' dir structure and then talk about gen-pdf.
[12:51:15 CDT(-0500)] <colinclark> ok
[12:52:17 CDT(-0500)] <jhung> A book is essentially a hierarchy of directories. The parent is the "book" directory container, and each 1st level child directory corresponds to a page.
[12:52:44 CDT(-0500)] <colinclark> makes sense so far
[12:53:17 CDT(-0500)] <jhung> Inside that 1st level directory are more subdirectories for intermediate files such as original "raw" images, binarized PNGs, and other intermediate files.
[12:53:44 CDT(-0500)] <colinclark> that makes sense
[12:53:48 CDT(-0500)] <jhung> So in a sense, each page's directory contains all its relevant information. So a collection of these atomic page directories creates your book.
[12:54:07 CDT(-0500)] <colinclark> are thumbnails also stored in that structure, or does ocropus not concern itself with such details?
[12:54:43 CDT(-0500)] <colinclark> Funnily enough, looking at all these pathing issues in the server I had come to the conclusion that we needed some reasonable containment hierarchy for books and pages
[12:54:44 CDT(-0500)] <jhung> Thumbnails are new and decapod specific. We'll have to add a /page/thumbnail/ directory ourselves.
[12:54:47 CDT(-0500)] <colinclark> So this is good
[12:54:49 CDT(-0500)] <colinclark> ok
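The hierarchy described above (a book directory, one child directory per page, and per-page subdirectories for raw images, binarized images, and Decapod's own thumbnails) might be laid out roughly like this. It is a sketch under stated assumptions: the subdirectory names "raw", "binarized", and "thumbnail" and the four-digit page directory names are hypothetical, and OCRopus's real on-disk naming is not specified in this conversation.

```python
import tempfile
from pathlib import Path

# Hypothetical subdirectory names; the actual OCRopus names may differ.
PAGE_SUBDIRS = ("raw", "binarized", "thumbnail")


def create_page_dir(book_root, page_number, width=4):
    """Create <book_root>/<padded page number>/ with one subdirectory per file kind."""
    page_dir = Path(book_root) / f"{page_number:0{width}d}"
    for name in PAGE_SUBDIRS:
        # parents=True also creates the book and page directories if needed.
        (page_dir / name).mkdir(parents=True, exist_ok=True)
    return page_dir


# Demo in a throwaway directory: build a two-page book and list what was created.
with tempfile.TemporaryDirectory() as book_root:
    for page_number in (1, 2):
        create_page_dir(book_root, page_number)
    for path in sorted(Path(book_root).rglob("*")):
        print(path.relative_to(book_root))
```

Keeping everything for a page under one directory is what makes the book a self-contained package: intermediate files travel with the page they belong to.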
[12:55:02 CDT(-0500)] <jhung> so on to gen-pdf?
[12:55:06 CDT(-0500)] <colinclark> Go for it
[12:55:37 CDT(-0500)] <jhung> so gen-pdf takes a sequence of PNGs on the command line, or a multipage TIFF, and generates a PDF.
[12:55:56 CDT(-0500)] <jhung> In the process of generating the PDF, the pages are segmented, OCR'ed etc.
[12:56:04 CDT(-0500)] <colinclark> when you say "multipage TIFF," does that mean a stitched page spread?
[12:56:30 CDT(-0500)] <jhung> Those intermediate computations from Gen-pdf can be stored into the book directory structure for later re-use and optimization.
[12:57:07 CDT(-0500)] <jhung> Multipage TIFF - a single TIFF file that has multiple images inside. Like an image PDF.
[12:57:12 CDT(-0500)] <colinclark> ah, ok
[12:57:45 CDT(-0500)] <jhung> Currently (I think), all of gen-pdf's intermediate files are being dumped into the "temp" directory specified in the command line.
[12:58:10 CDT(-0500)] <colinclark> and then just tossed, i guess?
[12:58:21 CDT(-0500)] <jhung> Yes. Each time we have to delete that temp directory.
[12:58:33 CDT(-0500)] <jhung> But I know Michael has been looking into reusing that.
[12:59:39 CDT(-0500)] <colinclark> Supporting the OCRopus book structure seems to make an awful lot of sense to me
[13:00:07 CDT(-0500)] <colinclark> since there will be a natural place for everything, as well as allowing us to carry around the intermediate files that would presumably speed up additional exports--makes a whole self-contained package
[13:00:15 CDT(-0500)] <jhung> yep. Especially with the amount of intermediate steps we take and the metadata for each page.
[13:00:29 CDT(-0500)] <jhung> yep
[13:00:34 CDT(-0500)] <colinclark> ok
[13:00:49 CDT(-0500)] <colinclark> As the bug says, it seems like a good feature for an upcoming release.
[13:01:04 CDT(-0500)] <colinclark> What I'll do is add the extra zero padding to file names now, since I've already touched that code.
[13:01:12 CDT(-0500)] <jhung> okay.
[13:01:14 CDT(-0500)] <colinclark> Then we can close DECA-82 and file a new one for OCRopus book format support
[13:01:25 CDT(-0500)] <colinclark> So, I have another fairly big question if you can spare the time
[13:01:27 CDT(-0500)] <jhung> Yep. I think there's already one for that
[13:01:45 CDT(-0500)] <jhung> Sure go ahead.
[13:01:53 CDT(-0500)] <colinclark> So stitching...
[13:02:02 CDT(-0500)] <colinclark> I don't have all the pieces of the puzzle clear in my head
[13:02:09 CDT(-0500)] <colinclark> I know we've got a JIRA for removing stitching
[13:02:29 CDT(-0500)] <colinclark> And, at the presentation layer, I think the idea was to push together two separate images using CSS
[13:02:39 CDT(-0500)] <colinclark> But is there more to it than that?
[13:02:46 CDT(-0500)] <jhung> yep.
[13:03:24 CDT(-0500)] <jhung> Sorry, I meant you're correct. We'll end up using the presentation layer to show two images instead of one.
[13:03:32 CDT(-0500)] <colinclark> ok
[13:03:41 CDT(-0500)] <colinclark> But are we losing anything else by throwing away stitching?
[13:03:53 CDT(-0500)] <colinclark> Will all the scripts continue to work? Will the outputted PDF still be good, etc?
[13:04:25 CDT(-0500)] <jhung> Yes, by throwing stitching away, we lose binarization. There's binarization code in there. There's a Jira to extract that into its own code.
[13:05:25 CDT(-0500)] <jhung> Export is independent of stitching because Export expects a set of single-page images. So stitching kinda works against that.
[13:05:35 CDT(-0500)] <jhung> (another reason why we don't need it anymore)
[13:05:49 CDT(-0500)] <colinclark> Meaning, we actually split the images again before doing export?
[13:06:11 CDT(-0500)] <jhung> yep.
[13:06:34 CDT(-0500)] <jhung> Well, not in the code, but it would have been that way if we kept stitching around.
[13:06:51 CDT(-0500)] <colinclark> So when you say we lose binarization by throwing away stitching, can you elaborate on that?
[13:07:03 CDT(-0500)] <colinclark> Is it that the stitch script does both stitching and binarization?
[13:07:57 CDT(-0500)] <jhung> Yep. There's greyscale and binarization that happens in stitching.
[13:08:03 CDT(-0500)] <jhung> This is filed as DECA-102.
[13:08:30 CDT(-0500)] <colinclark> ok, i'll take a look at that
[13:08:36 CDT(-0500)] <colinclark> So, just so I'm super clear
[13:08:54 CDT(-0500)] <colinclark> When we generate a PDF, we don't currently split the stitched images back apart?
[13:09:10 CDT(-0500)] <colinclark> But, in an alternate universe where we actually want stitching, we would want to do that?
[13:09:48 CDT(-0500)] <jhung> Yep. Currently we're not splitting the stitch on export. So our resulting PDFs currently have 2 pages per page.
[13:10:09 CDT(-0500)] <jhung> So for now if we keep stitching, we'll need to reverse that to extract the two images for exporting.
[13:10:48 CDT(-0500)] <colinclark> Aha, okay
[13:10:51 CDT(-0500)] <colinclark> this all makes sense
[13:11:10 CDT(-0500)] <jhung> Cool. (smile)
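To make the "reverse the stitch" step concrete, here is a minimal sketch of how a stitched spread could be divided back into two page regions before export. It assumes the stitch simply places the left and right pages side by side with the seam at the horizontal midpoint, which the conversation does not confirm; the function name `split_spread_boxes` is hypothetical. The boxes use the (left, upper, right, lower) convention, the same one accepted by Pillow's `Image.crop`, so they could be passed to an imaging library to do the actual cropping.

```python
def split_spread_boxes(width, height):
    """Return (left_box, right_box) for a spread of the given pixel size.

    Each box is (left, upper, right, lower). Assumes the seam is at the
    horizontal midpoint, which is an assumption made for illustration.
    """
    if width < 2 or height < 1:
        raise ValueError("spread must be at least 2 pixels wide and 1 pixel tall")
    middle = width // 2
    left_box = (0, 0, middle, height)
    right_box = (middle, 0, width, height)
    return left_box, right_box


left_box, right_box = split_spread_boxes(3000, 2000)
print(left_box)
print(right_box)
```

Dropping stitching altogether avoids this round trip: the captured images stay as individual pages, which is what export expects.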
[13:11:21 CDT(-0500)] <colinclark> So how come we added stitching in the first place? Was there the idea that pages were always going to be spreads or something like that?
[13:11:43 CDT(-0500)] <colinclark> No big deal either way, I'm just curious about the history of this code, now that it all seems to be clear.
[13:12:31 CDT(-0500)] <jhung> Initially we were naively expecting captures from left-right cameras to produce nice clean images of left and right pages. Then we'd just merge those images together for a spread.
[13:12:54 CDT(-0500)] <jhung> "we" I mean, mostly me. lol
[13:13:03 CDT(-0500)] <colinclark> (smile)
[13:13:05 CDT(-0500)] <colinclark> That makes sense to me
[13:13:18 CDT(-0500)] <colinclark> Ok, so I think this has been very helpful
[13:13:36 CDT(-0500)] <colinclark> While fixing bugs, I've been doing some architectural work for the processing pipeline in the server
[13:13:52 CDT(-0500)] <colinclark> And I had imagined that the world of pages was actually oriented around page spreads--pairs of images
[13:14:22 CDT(-0500)] <colinclark> But in fact, both scenarios (L/R independent capture and stereo capture) are really oriented around individual pages
[13:14:43 CDT(-0500)] <colinclark> There may be multiple files for a given page (in the stereo scenario), but it's just one page at a time
[13:14:45 CDT(-0500)] <jhung> yep. The user acts on individual pages too. So it helps to keep those models in sync.
[13:14:52 CDT(-0500)] <colinclark> cool
[13:15:09 CDT(-0500)] <colinclark> Interestingly, it may make refactoring easier to take stitching out sooner rather than later
[13:15:17 CDT(-0500)] <colinclark> I'll have to take a look
[13:15:19 CDT(-0500)] <jhung> Stereo is a different beast which will require a different approach though...
[13:15:43 CDT(-0500)] <colinclark> Different in what ways, jhung?
[13:15:55 CDT(-0500)] <jhung> In stereo we have a SINGLE image for two pages. In left-right, we have TWO images for two pages.
[13:16:18 CDT(-0500)] <jhung> And that's the case where the files have been processed.
[13:17:05 CDT(-0500)] <jhung> In a pre-processed state in stereo, we have TWO images for two pages.
[13:17:42 CDT(-0500)] <colinclark> oh, ah
[13:17:44 CDT(-0500)] <jhung> So from a presentation layer, in pre-process we display 2 images of unprocessed spreads. Then after processing, we show 1 image of the processed spread.
[13:18:01 CDT(-0500)] <colinclark> So stereo actually does take pictures of the whole book's surface... meaning it works on page spreads
[13:18:18 CDT(-0500)] <colinclark> Two images representing different angles on the whole book
[13:18:22 CDT(-0500)] <jhung> Yes.
[13:18:30 CDT(-0500)] <colinclark> which then gets dewarped and split up into two discrete pages
[13:18:47 CDT(-0500)] <colinclark> ok
[13:18:54 CDT(-0500)] <colinclark> this is all starting to gel
[13:19:07 CDT(-0500)] <colinclark> and sort of make my head spin
[13:19:17 CDT(-0500)] <colinclark> (smile)
[13:19:20 CDT(-0500)] <jhung> hold on, I have a wiki page that will help... (smile)
[13:19:24 CDT(-0500)] <colinclark> cool
[13:19:38 CDT(-0500)] <jhung> http://wiki.fluidproject.org/display/fluid/Post-Capture+Processing
[13:19:58 CDT(-0500)] <jhung> It's a little different now, but fundamentally the same.
[13:20:41 CDT(-0500)] <jhung> The stitching will be removed of course.
[13:21:05 CDT(-0500)] <colinclark> cool, thanks
[13:21:16 CDT(-0500)] <colinclark> I'll take a read through it
[13:21:22 CDT(-0500)] <colinclark> Ok, I think I've got lots to keep me going
[13:21:29 CDT(-0500)] <colinclark> My first step is to push this newly working mock server
[13:21:30 CDT(-0500)] <jhung> k.
[13:21:41 CDT(-0500)] <colinclark> And start to factor out the duplicate code so I can get the real server to also be working
[13:21:46 CDT(-0500)] <colinclark> It wouldn't hurt to have you take it for a test spin
[13:22:15 CDT(-0500)] <jhung> sure. Probably Monday.... still working through this CFI stuff.
[13:22:30 CDT(-0500)] <colinclark> no problem
[13:22:35 CDT(-0500)] <colinclark> What's the CFI stuff?
[13:23:02 CDT(-0500)] <jhung> Video captioning equipment (cameras, post-processing hardware, etc.)
[13:23:13 CDT(-0500)] <colinclark> ah, nice
[13:23:32 CDT(-0500)] <jhung> Makes my head spin.... it's a whole different world than photography.
[13:25:13 CDT(-0500)] <colinclark> For sure. Let me know if you need any advice
[13:35:55 CDT(-0500)] * michelled (~michelled@CPE001310472ade-CM0011aefd3ca8.cpe.net.cable.rogers.com) has joined #fluid-work
[14:09:24 CDT(-0500)] * elicochran (~elicochra@dhcp-169-229-212-27.LIPS.Berkeley.EDU) has joined #fluid-work
[15:47:44 CDT(-0500)] * colinclark_ (~colin@bas2-toronto09-1176132185.dsl.bell.ca) has joined #fluid-work
[17:12:07 CDT(-0500)] * colinclark (~colin@bas2-toronto09-1176132185.dsl.bell.ca) has joined #fluid-work
[18:18:38 CDT(-0500)] * bsparks (~bsparks@wsip-72-215-204-133.ph.ph.cox.net) has joined #fluid-work
[18:26:31 CDT(-0500)] * colinclark (~colin@bas2-toronto09-1176132185.dsl.bell.ca) has joined #fluid-work
[18:28:56 CDT(-0500)] * bsparks (~bsparks@wsip-72-215-204-133.ph.ph.cox.net) has joined #fluid-work
[19:00:26 CDT(-0500)] * anastasiac (~stasia@dsl-173-206-253-45.tor.primus.ca) has left #fluid-work
[20:09:43 CDT(-0500)] * kasper (~kasper@189.130.61.33) has joined #fluid-work
[20:10:46 CDT(-0500)] * kasper (~kasper@189.130.61.33) has joined #fluid-work
[21:05:23 CDT(-0500)] * jhung (~Jon@H25.C204.cci.switchworks.net) has left #fluid-work