Last modified: 2010-05-05 16:14:08 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T20046, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 18046 - Enable text layer extraction from djvu on Wikisource
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Rob Halsell
URL: http://www.mediawiki.org/wiki/Special...
Keywords: shell
Depends on: 17452
Blocks:
Reported: 2009-03-19 08:06 UTC by ThomasV
Modified: 2010-05-05 16:14 UTC (History)

Description ThomasV 2009-03-19 08:06:27 UTC
The ProofreadPage extension can now extract the text layer from djvu files.
Please enable this on all Wikisource subdomains.

To enable, add the following line to LocalSettings.php:

$wgDjvutxt = 'djvutxt';

(see revision 48533, url provided)
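
For readers unfamiliar with the tool behind that setting: `djvutxt` is the DjVuLibre command-line program that prints the text layer of a djvu file. The following is a minimal Python sketch of extracting a single page with it. It is an illustration only, not the ProofreadPage code; the function names are hypothetical, and it assumes `djvutxt` is installed and on the PATH.

```python
import shutil
import subprocess


def djvutxt_command(djvu_path, page, binary="djvutxt"):
    """Build the argument list for extracting the text layer of one page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # -page=N restricts DjVuLibre's djvutxt to a single page.
    return [binary, f"-page={page}", djvu_path]


def extract_page_text(djvu_path, page, binary="djvutxt"):
    """Run djvutxt and return the page text, or None if the tool is missing."""
    if shutil.which(binary) is None:
        return None
    result = subprocess.run(
        djvutxt_command(djvu_path, page, binary),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list, rather than as one shell string, means the file name is never interpreted by a shell.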
Comment 1 Rob Halsell 2009-03-19 15:18:31 UTC
I need to review this change with Brion later. I am assigning this to myself so it's on my list.
Comment 2 Brion Vibber 2009-03-19 16:32:07 UTC
This isn't ready to be used.

a) We haven't killed the Fedora boxes, so installing new software is not convenient for deployment.

b) We're avoiding adding new NFS-based failure points; this would add another file access over NFS, which is dangerous.
Comment 3 ThomasV 2009-03-19 20:47:32 UTC
Is your point b) something that should be fixed by changing the code?

The djvu file is already accessed when the thumbnail is created.
Should the text be extracted at the same time?
Comment 4 Brion Vibber 2009-03-19 20:54:23 UTC
The thumbnail is created on the image scaler servers, not on the core Apache servers, which at least partially isolates it (but not completely since metadata is pulled and uploads/moves/etc are done on main Apaches).

Once we've moved to a separate storage architecture, things like this'll either need to be done at upload time while we're working with a local file, or will need to fetch the file from the store, work on the local temp file, and then discard the temp file.
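
The second option described above (fetch the file from the store, work on a local temp file, then discard it) could look roughly like the following Python sketch. It is only an illustration of the pattern, not Wikimedia's storage code; `local_copy` and the `fetch` callable are hypothetical names.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def local_copy(fetch, suffix=".djvu"):
    """Fetch a stored file into a local temp file, yield its path, then delete it.

    `fetch` is a hypothetical callable that writes the stored file's bytes
    into the open binary file object it is given.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            fetch(tmp)
        yield path
    finally:
        # Discard the temp file even if the caller's work raised.
        os.remove(path)
```

A caller would wrap the work in `with local_copy(fetch) as path: ...`, so the temp file never outlives the operation that needed it.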
Comment 5 ThomasV 2009-03-21 08:05:15 UTC
I think you do not want to fetch and discard a 20M djvu file every time you extract a single page from it. And I suppose that doing this at upload time would require a schema change.

Perhaps the text extraction should be performed on the scaler servers?

Please let me know if there is anything I can do to help or speed this up. I think having this feature is important for our project (currently our contributors need to ask robot owners to do the preprocessing for them; this would free them from that dependence), and I am willing to spend time on this if it can help.

Comment 6 ThomasV 2009-03-23 17:29:32 UTC
How about performing the text extraction in DjvuHandler::doTransform?

We could modify the thumbnail syntax a little bit, so that something like:

/w/images/thumb/a/ab/Foo.djvu/page100-djvutxt-Foo.djvu.txt

would return a text file.
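
To make the proposed path syntax concrete, here is a small Python sketch of how such a request path could be recognised and split into file name and page number. It is not DjvuHandler code; the regular expression and `parse_text_request` are hypothetical, and the sketch assumes the two directory levels are one and two lowercase hex characters, as in the example path.

```python
import re

# Matches e.g. /w/images/thumb/a/ab/Foo.djvu/page100-djvutxt-Foo.djvu.txt
# The file name must appear twice and be identical both times.
THUMB_TEXT_RE = re.compile(
    r"^/w/images/thumb/[0-9a-f]/[0-9a-f]{2}/"
    r"(?P<name>[^/]+\.djvu)/"
    r"page(?P<page>\d+)-djvutxt-(?P=name)\.txt$"
)


def parse_text_request(path):
    """Return (file_name, page_number) for a text request path, else None."""
    match = THUMB_TEXT_RE.match(path)
    if match is None:
        return None
    return match.group("name"), int(match.group("page"))
```

Ordinary thumbnail paths, or paths where the two file names disagree, return `None` and would fall through to the existing thumbnail handling.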
Comment 7 Brion Vibber 2009-04-28 22:55:17 UTC
Reverted latest versions in r50026; extra dependencies and security holes.
Comment 8 Mike.lifeguard 2010-04-26 02:47:17 UTC
(In reply to comment #7)
> Reverted latest versions in r50026; extra dependencies and security holes.

Does that mean this should be re-opened?
Comment 9 ThomasV 2010-05-05 16:14:08 UTC
I don't think so; check more recent revisions.
