Last modified: 2014-09-23 19:53:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in; apart from displaying bug reports and their history, links may be broken. See T44466, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 42466 - Text layer of DjVu files doesn't appear in Page namespace due to higher memory consumption after upgrade to Ubuntu 12.04
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: wmf-deployment
Hardware: All
OS: All
Priority: Normal
Severity: normal (6 votes)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops
Depends on:
Blocks: Wikisource

Reported: 2012-11-27 00:34 UTC by Aristoi
Modified: 2014-09-23 19:53 UTC
CC: 17 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Aristoi 2012-11-27 00:34:17 UTC
The text layer of DjVu files doesn't appear when creating a new page in Page namespace.

Example : https://fr.wikisource.org/wiki/Livre:Barr%C3%A8s_-_Une_journ%C3%A9e_parlementaire_-_com%C3%A9die_de_m%C5%93urs_en_trois_actes_%281894%29.djvu

Problem appeared a few days ago, just after the last code update.
Comment 1 Philippe Elie 2012-11-27 01:01:28 UTC
The same trouble has been reported on en.ws, e.g. http://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Anyone_having_trouble_pulling_text_layers.3F

There has been no change in the extension regarding the text layer for a while. Does anyone know if something changed in the text layer extraction code in MediaWiki, or in something related to caching the text layer with file metadata?

I increased the priority to high/major, as no one can work on fr.ws with any newly uploaded DjVu.
Comment 2 baltoslavic 2012-11-28 02:11:16 UTC
This seems to affect the English, French, and Latin Wikisource projects, so presumably it is affecting *ALL* Wikisource projects.  For DjVu files newly uploaded to Commons (since the update), we cannot access the text layer component of the file.  

Normally, uploaded DjVu files contain a text layer that appears in the edit window when editing in the Page namespace, but for the past week, no such text has been appearing whenever anyone edits. This feature is used on Wikisource to convert the DjVu into wiki-text, so without access to the text layer, work on all the Wikisource projects will grind to a halt. For example, the English Wikisource has had to abandon its plans for the December collaboration, because we won't be able to pull the text from the file we were going to upload.

Until this bug is corrected, all Wikisource projects will be unable to begin any new texts from DjVu source files.
Comment 3 Tpt 2012-11-28 16:06:45 UTC
After some investigation this bug is not caused by ProofreadPage but by core or djvulibre.
Comment 4 Andre Klapper 2012-11-30 20:22:38 UTC
As written before, there have been no recent changes in
  mediawiki/core/includes/media/DjVu.php
  mediawiki/core/includes/media/DjVuImage.php
(In reply to comment #3)
> this bug is not caused by ProofreadPage but by core or djvulibre.

(In reply to comment #1)
> Does anyone know if something changed in the text layer extraction code in
> MediaWiki, or in something related to caching the text layer with file metadata?


(In reply to comment #2)
> Until this bug is corrected, all Wikisource projects will be unable to begin
> any new texts from DjVu source files.

=> Tentatively blocking bug 38865.
Comment 5 Doug 2012-12-01 01:37:12 UTC
(In reply to comment #2)
> 
> Until this bug is corrected, all Wikisource projects will be unable to begin
> any new texts from DjVu source files.

Well, not to minimize this bug but that's not true, it's only that they won't be able to rely on the text layer.  Frequently, the layer is such crap, especially on older texts, that this has no practical effect.  Furthermore, we have our own OCR tool that can be used on the fly with a gadget, implemented as a button above the edit box, that is turned on by default (i.e. IPs can use it).  For example, I just generated https://fr.wikisource.org/wiki/Page:De_la_D%C3%A9monomanie_des_Sorciers_%281587%29.djvu/141 using that tool.  Considering that's a 16th-century work, that's about as good as I'd expect from the text layer associated with the DjVu.

Furthermore, the text layer can be copy-pasted in or even botted in.  The text layer on this particular work is about equal to what the tool generated, and presumably the folks at IA were able to optimize ABBYY FineReader 8.0 for the language and type, unlike the built-in tool, which I think still uses Tesseract.

This isn't critical either: there is no internal data loss, which is part of our definition of critical; it's just a loss of function.  The text layer is still there in the file on Commons.

I'm not saying this isn't an important bug; I'm saying that if you're a Wikisourcerer, you shouldn't feel tied to the text layer that comes with a DjVu or PDF.
Comment 6 baltoslavic 2012-12-01 03:08:56 UTC
(In reply to comment #5)
> Well, not to minimize this bug but that's not true, it's only that they won't
> be able to rely on the text layer.  Frequently, the layer is such crap,
> especially on older texts, that this has no practical effect.  Furthermore, we
> have our own OCR tool...

The OCR text layer that comes along with a file uploaded from a source such 
as the Internet Archive is far superior to what gets generated by our OCR tool. 
If we have to rely on our own OCR tool, it will greatly increase the work that 
has to be done cleaning up problems generated by the OCR. Our own tool is 
prone to far more stupid mistakes.

And we don't do many of the "older texts" (17th century and earlier), at least 
not on the English Wikisource.

The community also feels (has expressly stated and agrees) that we should not 
work on newly uploaded texts until the bug is corrected, because we can't judge 
the accuracy of a match against the text layer, nor can we spot problems in a 
page to text match for edited files.  As one member put it: 

"[Our OCR tool is] only intended for one-off use when a single page is missing 
text or has a very poor text-layer and not for every page in a work that already 
has a text-layer."

And another:

"As history as taught us many times before - if you work a file while in a state 
of error, you might not like the result of your misplaced efforts once the error is resolved."

So, whatever you might think about it, the problem is choking off community work.
The relevant discussion thread is in Wikisource's Scriptorium under the headers 
"Anyone having trouble pulling text layers?" and "Index text pages".
Comment 7 Doug 2012-12-01 06:00:31 UTC
Match and Split appears to function.
Comment 8 Philippe Elie 2012-12-01 19:25:55 UTC
Match and Split doesn't rely on MediaWiki but extracts the text layer directly from the DjVu.
Comment 9 Doug 2012-12-01 19:44:40 UTC
My only point was that it is still finding the layer; the data isn't gone, it's just not being found by MediaWiki.
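
(For illustration, the kind of direct extraction described above bypasses MediaWiki entirely and calls the DjVuLibre command-line tools on the file itself. A minimal PHP sketch; the file path is purely hypothetical:)

    <?php
    // Dump the full text layer of a local DjVu file with djvutxt - the same data
    // MediaWiki is failing to surface in the Page: namespace.
    $file = '/tmp/example.djvu';                                  // hypothetical path
    $text = shell_exec( 'djvutxt ' . escapeshellarg( $file ) );   // plain-text dump

    // A single page can be pulled out the same way, e.g. page 5:
    $page5 = shell_exec( 'djvutxt --page=5 ' . escapeshellarg( $file ) );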
Comment 10 billinghurst 2012-12-03 13:52:20 UTC
A person who cannot access bugzilla has commented ...

Though the code involved for PDF and DjVu files is completely different when it comes to text dumping, this could mean the bug is not related to the wmf code update at all but is DjVuLibre-specific -- especially in light of the fact that none of the DjVu-related PHP code has been touched for some time now. Errors appearing after a main update, where none of the affected code changed in the interim, make me think the existing software has become outdated compared to the common programming applied today.

Also......

Code: DjVuImage.php

    Line 295 - possible incorrect file path
        out "pubtext/DjVuXML-s.dtd"
        in "share/djvu/pubtext/DjVuXML-s.dtd"


and points to the djvu updates.

Might that be a reasonable thing to do anyway?  Can it be done and tested on the appropriate test server?
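
(To make the suggestion concrete: one could probe both candidate locations of the DTD before changing the path. This is a purely hypothetical check, not the actual DjVuImage.php code, and the /usr install prefix is an assumption:)

    <?php
    // Hypothetical probe for the DjVuXML-s.dtd file under two candidate paths.
    $candidates = array(
        'pubtext/DjVuXML-s.dtd',              // path reportedly used at line 295
        'share/djvu/pubtext/DjVuXML-s.dtd',   // path suggested above
    );
    foreach ( $candidates as $rel ) {
        $abs = '/usr/' . $rel;                // install prefix is an assumption
        echo $rel, ' => ', ( file_exists( $abs ) ? 'found' : 'missing' ), "\n";
    }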
Comment 11 Tpt 2012-12-03 20:23:30 UTC
After some tests with the help of phe, we found that the issue is caused by increased memory consumption in the djvutxt version packaged in DjVuLibre 3.5.24. The 100 MB memory limit of the Wikimedia servers therefore made the script fail. The new djvutxt needs at least 300 MB.
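
(The 100 MB cap referred to here is the limit MediaWiki places on shell commands it spawns, such as the djvutxt call that extracts the text layer. A configuration sketch using the standard $wgMaxShellMemory setting; the exact values are assumptions, not the wmf configuration:)

    <?php
    // LocalSettings.php-style sketch. $wgMaxShellMemory caps the virtual memory
    // (in KB) available to commands run through wfShellExec().
    $wgMaxShellMemory = 102400;    // ~100 MB - too small for djvutxt from DjVuLibre 3.5.24
    // Raising the global cap past the ~300 MB djvutxt now needs would be one
    // blunt workaround:
    // $wgMaxShellMemory = 307200;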
Comment 12 Philippe Elie 2012-12-03 21:15:41 UTC
Looks like the trouble comes from the recent upgrade to Ubuntu 12.04. It's not caused by DjVuLibre, as the same version of these tools uses less than 60 MB on Slackware. Perhaps it is locale-file related, as locale files nowadays use a lot more virtual memory than the old way.
Comment 13 Andre Klapper 2012-12-05 13:44:31 UTC
Moving to "Wikimedia" product and removing blocking 38865 as per last comment, as this seems to be related to the server upgrades to Ubuntu 12.04.

Aaron: Do you have an idea who could look into this, by any chance?
Comment 14 Tpt 2012-12-05 18:33:37 UTC
I've uploaded a patch that increases the memory limit of the djvutxt call. This solves the bug on my Fedora 16 (with the same version of DjVuLibre as Ubuntu 12.04): https://gerrit.wikimedia.org/r/#/c/36632/
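
(A rough sketch of what such a per-call limit can look like, assuming wfShellExec()'s optional limits array; this is not the literal content of the gerrit change, and $dumpFile is a hypothetical variable:)

    <?php
    // Sketch: run djvutxt through MediaWiki's shell wrapper with a raised,
    // per-command memory limit instead of relying on the global default.
    $cmd = 'djvutxt ' . wfEscapeShellArg( $dumpFile );
    $txt = wfShellExec(
        $cmd,
        $retval,
        array(),                         // no extra environment variables
        array( 'memory' => 300000 )      // KB; djvutxt in 3.5.24 needs roughly 300 MB
    );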
Comment 15 Tpt 2012-12-07 15:37:04 UTC
This patch has been deployed on the production cluster. Extraction of the text layer works now. I'm leaving this bug open because it would be interesting to know why djvutxt uses so much memory.
Comment 16 Dereckson 2012-12-07 21:04:29 UTC
Tpt and I agree it would be a good idea to use a constant. This would make it possible to adjust the limit during the time needed to track down the issue.

Gerrit change #37495.
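
(The constant-based variant might look roughly like the sketch below; the constant and method names are assumptions, not quotes from the change:)

    <?php
    class DjVuImage {
        // A named limit is easier to adjust while the underlying memory
        // regression is investigated.
        const DJVUTXT_MEMORY_LIMIT = 300000;   // KB

        private function retrieveTextLayer( $dumpFile ) {
            $cmd = 'djvutxt ' . wfEscapeShellArg( $dumpFile );
            return wfShellExec( $cmd, $retval, array(),
                array( 'memory' => self::DJVUTXT_MEMORY_LIMIT ) );
        }
    }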
Comment 17 Andre Klapper 2012-12-11 15:11:23 UTC
[Workaround found => removing blocking 38865]
Comment 18 Sumana Harihareswara 2012-12-14 16:20:22 UTC
Both of Tpt's changes are now merged; is the problem still affecting Wikisources?
Comment 19 Tpt 2012-12-14 16:27:03 UTC
@Sumana Extraction of the text layer works fine on the Wikisources now, but we have kept this bug open because the increase in djvutxt's memory consumption is very strange.
Comment 20 George Orwell III 2013-04-22 17:14:32 UTC
Let's back up a bit before my head explodes....

First - a DjVu is nothing more than a glorified zip file archiving a bunch of stand-alone [indirect] djvu files: an index "file" within it directing the viewing order, plus any annotations, embedded hyperlinks, shared dictionaries, typical metadata, coordinate mappings of text layers, images, etc., for all the DjVus contained in it as a single [bundled] DjVu file. The premise behind the DjVu file format is largely mirrored by the Index: and Page: namespaces on Wikisource today.

Why it was treated like an image file rather than an archive file from day one around here I'll never quite understand (I can peek at a single .jpg or .txt file compacted within a .zip file without having to extract/deflate the entire .zip archive, and doing so doesn't re-classify the .zip file as a picture or a document just because I can... but I digress).

The point I'm trying to make is that DjVus were never meant to be anything more than a quick and easy, compact alternative to PDF files (a hack). THAT is why there will always be issues ....

https://bugzilla.wikimedia.org/show_bug.cgi?id=8263#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=9327#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=24824#c10

https://bugzilla.wikimedia.org/show_bug.cgi?id=30751#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=21526#c16

https://bugzilla.wikimedia.org/show_bug.cgi?id=28146#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=30906#c0

<<< and I'm sure there are more; it's my first day; sorry >>>

... with the current "plain text dump" approach over the never fully developed extract & parse approach. An XML version of the text layer generated via OCR is how Archive.org does it, and that is how we should be doing it too. Once the text is in XML form, you can wipe it from the DjVu file on Commons (leaving nothing but the image layers to pull thumbnails from) until, at the very least, it's fixed up by the Wikisource/Wikibooks people, if not just reinserted by bot if need be.

Someone needs to revisit DjVuImage.php and finish off the extract & convert/parse to/from XML development portion that [DjVuLibre?] abandoned or whatever because "it was too slow" 6 years ago. The current bloated text dump will still be there to fall back on.
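
(For what it's worth, DjVuLibre already ships a tool that emits the document, including its hidden text layer, as XML. A PHP sketch of invoking it; the file path is hypothetical, and there is no claim that this matches whatever DjVuImage.php once attempted:)

    <?php
    // Sketch: dump the document structure and hidden text layer as XML with
    // djvutoxml, rather than the flat dump that djvutxt produces.
    $file = '/tmp/example.djvu';                                    // hypothetical
    $xml  = shell_exec( 'djvutoxml ' . escapeshellarg( $file ) );   // XML on stdout
    // The plain-text equivalent, for comparison:
    $txt  = shell_exec( 'djvutxt ' . escapeshellarg( $file ) );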
Comment 21 Tpt 2013-04-25 05:54:55 UTC
@George Orwell III
Yes, you are right, but I'm not sure that this bug is the best place for this comment, as the topic of the bug is a very specific problem (the increase in memory consumption of the djvutxt version in Ubuntu 12.04).
I think you should open a new bug about the XML text layer and copy/paste your comment there.
Comment 22 Sumana Harihareswara 2014-09-23 19:53:09 UTC
What is the current status regarding memory consumption? Has it come down to a sustainable and serviceable level?


