Last modified: 2010-05-05 16:14:08 UTC
The ProofreadPage extension can now extract the text layer from DjVu files. Please enable this on all Wikisource subdomains. To enable it, add the following line to LocalSettings: $wgDjvutxt = 'djvutxt'; (see revision 48533, url provided)
I need to review this change with Brion later. I am assigning this to myself so it's on my list.
This isn't ready to be used. a) We haven't killed the Fedora boxes, so installing new software is not convenient for deployment. b) We're avoiding adding new NFS-based failure points; this would add another access-to-files-over-NFS which is dangerous.
Is your point b) something that should be fixed by changing the code? The DjVu file is already accessed when the thumbnail is created. Should the text be extracted at the same time?
The thumbnail is created on the image scaler servers, not on the core Apache servers, which at least partially isolates it (but not completely since metadata is pulled and uploads/moves/etc are done on main Apaches). Once we've moved to a separate storage architecture, things like this'll either need to be done at upload time while we're working with a local file, or will need to fetch the file from the store, work on the local temp file, and then discard the temp file.
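The fetch, work on a local temp file, then discard sequence described above can be sketched as follows. This is only an illustration in Python, not MediaWiki code: `fetch` is a hypothetical callable standing in for whatever the future file store would provide, `extract_page_text` and `run_djvutxt` are made-up names, and the `--page` option passed to `djvutxt` should be checked against the installed DjVuLibre version.

```python
import os
import subprocess
import tempfile


def run_djvutxt(path, page):
    """Run the djvutxt command on one page of a local file; return its text.

    Assumes a djvutxt binary on PATH that accepts --page=N.
    """
    result = subprocess.run(
        ["djvutxt", f"--page={page}", path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_page_text(fetch, page, extractor=run_djvutxt):
    """Fetch a DjVu file to a temp file, extract one page, discard the copy.

    `fetch` is a hypothetical callable that writes the stored file to the
    local path it is given.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".djvu")
    os.close(fd)  # only the path is needed; fetch() reopens the file itself
    try:
        fetch(tmp_path)
        return extractor(tmp_path, page)
    finally:
        # The temp copy is removed even if fetching or extraction fails.
        os.remove(tmp_path)
```

Note that every call pays for a full fetch of the file, whatever its size, to read a single page.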
I think you do not want to fetch and discard a 20M DjVu file every time you extract a single page from it. And I suppose that doing this at upload time would require a schema change. Perhaps the text extraction should be performed on the scaler servers? Please let me know if there is anything I can do to help or speed this up. I think having this feature is important for our project (currently our contributors need to ask robot owners to do the preprocessing for them; it will free them from this dependence), and I am willing to spend time on this if it can help.
How about performing the text extraction in DjvuHandler::doTransform? We could modify the thumbnail syntax a little bit, so that something like: /w/images/thumb/a/ab/Foo.djvu/page100-djvutxt-Foo.djvu.txt would return a text file.
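To make the proposed URL shape concrete, here is a minimal sketch of how such a path could be recognised. It is written in Python for illustration only and is not how DjvuHandler works; the function name `parse_text_request` and the exact pattern are assumptions based solely on the single example path given above.

```python
import re

# Matches ".../<name>.djvu/page<N>-djvutxt-<name>.djvu.txt", where both
# occurrences of <name> must be identical (assumed from the one example).
TEXT_REQUEST = re.compile(
    r"/(?P<name>[^/]+\.djvu)/page(?P<page>\d+)-djvutxt-(?P=name)\.txt$"
)


def parse_text_request(path):
    """Return (file name, page number) for a text request path, else None."""
    match = TEXT_REQUEST.search(path)
    if match is None:
        return None
    return match.group("name"), int(match.group("page"))


print(parse_text_request(
    "/w/images/thumb/a/ab/Foo.djvu/page100-djvutxt-Foo.djvu.txt"
))
```

A path that does not follow this shape, such as an ordinary image thumbnail, yields None, so the existing thumbnail handling would be left untouched.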
Reverted latest versions in r50026; extra dependencies and security holes.
(In reply to comment #7)
> Reverted latest versions in r50026; extra dependencies and security holes.
Does that mean this should be re-opened?
I don't think so; check more recent revisions.