Last modified: 2011-01-25 00:31:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T23526, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 21526 - Bug in Djvu text layer extraction
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: DjVu
Version: 1.16.x
Hardware: All
OS: All
Importance: High normal, 2 votes
Target Milestone: ---
Assigned To: ThomasV
Depends on:
Blocks:
Reported: 2009-11-15 21:21 UTC by Simon Lipp
Modified: 2011-01-25 00:31 UTC (History)
7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Patch (1.17 KB, patch)
2010-07-07 10:15 UTC, Simon Lipp
Details

Description Simon Lipp 2009-11-15 21:21:43 UTC
The bug was encountered on fr.wikisource:

MediaWiki  	1.16alpha-wmf (r58524)
PHP 	5.2.4-2ubuntu5.7wm1 (apache2handler)
MySQL 	4.0.40-wikimedia-log

When the text layer of the DjVu file contains « ") », the MediaWiki parser produces an empty page, and the text layer of all subsequent pages is shifted by one page relative to the images. An example of a problematic DjVu file can be found here:

http://commons.wikimedia.org/w/index.php?title=File:Sima_qian_chavannes_memoires_historiques_v4.djvu&oldid=31865251

In particular, page 80 contains the following text (poor scan quality): « La quatrième année (.\"),*)()) ». The problem can be seen in the proofread version of this scan:

http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/80&action=edit : the end of the text is missing
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/81&action=edit : no text layer
http://fr.wikisource.org/w/index.php?title=Page:Sima_qian_chavannes_memoires_historiques_v4.djvu/82&action=edit : text layer and image no longer match

I have been able to track down and fix the bug in my local MediaWiki installation (same branch, same revision as fr.wikisource). The problem is located in DjvuImage::retrieveMetadata (includes/DjvuImage.php:257): the regular expression treats any ") as an end-of-page marker, but a \ before the double quote should prevent this interpretation.

I replaced the current regular expression with this one, and now the problem is fixed:

$reg = "/\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*\"((?>\\\\.|(?:(?!\\\\|\").)++)*?)\"\s*\)/s";
$txt = preg_replace( $reg, "<PAGE value=\"$1\" />", $txt  );

Note on the regular expression: this is an adaptation of the regular expression used to match text between double quotes with backslash as the escape character, which in Perl would be
"((?>\\.|[^"\\]++)*?)". The rather ugly (but working) (?:(?!\\\\|\").) corresponds to the trivial [^"\\]; the problem is that [^\"] and [^"] are not really the same thing…
Comment 1 Lars Aronsson 2010-04-29 19:26:43 UTC
In the DjVu file
File:Post- och Inrikes Tidningar 1836-01-27.djvu,
page 4 contains the two-character sequence ")
properly escaped. After it, on the same page, is
the word "Eskilstuna", which you can search for and
find in djview if you download the DjVu file.

But text extraction for the Wikisource ProofreadPage
extension stops at the "). To verify this, go to
http://en.wikisource.org/wiki/Page:Post-_och_Inrikes_Tidningar_1836-01-27.djvu/4
and click "create". (But don't create that page on
the English Wikisource. It already exists on the
Swedish Wikisource.)
Comment 2 Lars Aronsson 2010-05-01 01:12:56 UTC
To extract the OCR text (without pixel coordinates for
each word) for page NNN, this command should do:

djvused -e 'select NNN; print-pure-txt' FILENAME.djvu
Comment 3 Lars Aronsson 2010-05-01 01:26:27 UTC
Page /66 of commons:File:Östgötars_minne.djvu
contains the two-character sequence ")
and that is where the extracted text ends.

For /67 the extracted text is empty.

For /68, the extracted text is the one that
belongs to the /67 image. All subsequent pages
have the text layer off by one or more pages.

The OCR quality is low (coming from Google), so a new
OCR should be generated before proofreading. But until
then, this file is another test case for this bug.

http://sv.wikisource.org/wiki/Index:%C3%96stg%C3%B6tars_minne.djvu
Comment 4 ThomasV 2010-07-06 08:12:44 UTC
The proposed patch is a Perl-compatible regexp. I am not familiar with that syntax, which is why I have not committed it.

Could someone have a look at it, or provide a POSIX regexp?
Comment 5 Simon Lipp 2010-07-06 08:32:09 UTC
> or provide a posix regexp ?

That’s not possible. Matching C-style quoted strings requires look-ahead and possessive operators, which are not available in POSIX syntax. But if you have any questions, feel free to contact me (I’m Sloonz on fr.wikisource).
Comment 6 ThomasV 2010-07-06 08:58:21 UTC
I tested your patch on this djvu file:
http://fr.wikisource.org/wiki/Livre:Revue_des_Romans_%281839%29.djvu

The file does not exhibit the bug; DjVu text extraction works without the patch. With the patch, pages are no longer aligned with the text.
Comment 8 Simon Lipp 2010-07-06 16:10:27 UTC
> With the patch, pages are no longer aligned with the text.

Strange; when I made the patch, I did not see this problem. I’ll look into it this week.
Comment 9 Simon Lipp 2010-07-07 10:15:52 UTC
Created attachment 7557 [details]
Patch

Found the problem (I had dropped the empty-page case). Attached an updated patch that fixes it. By applying htmlspecialchars after the matching phase, it becomes possible to get rid of the unreadable look-ahead. I also commented the regexp using the /x modifier of PCRE. But it is still not possible to convert this into a POSIX regexp, since ereg_* has no equivalent of preg_replace_callback.

Also, your file has a problem on page 8 (http://fr.wikisource.org/w/index.php?title=Page:Revue_des_Romans_%281839%29.djvu/8&action=edit). As a side effect, the patch fixes that too ;)
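The escape-then-match ordering described above can be sketched in Python; the pattern, callback name, and sample layer below are illustrative stand-ins, not the actual patch code:

```python
import html
import re

# Sketch of the approach described in the updated patch: run the
# escape-aware page regex on the raw text layer first, then
# HTML-escape the captured value inside a substitution callback
# (the Python analogue of preg_replace_callback), instead of
# escaping the whole layer up front, which forced the unreadable
# look-ahead on the &quot; entity.
page_re = re.compile(
    r'\(page\s[\d-]*\s[\d-]*\s[\d-]*\s[\d-]*\s*'
    r'"((?:\\.|[^"\\])*)"\s*\)',
    re.S,
)

def page_to_xml(m):
    # Escape only after matching, so the regex sees raw characters.
    return '<PAGE value="%s" />' % html.escape(m.group(1), quote=True)

# Two hypothetical page records; the second is an empty page, the
# case the first version of the patch dropped.
layer = '(page 0 0 100 100 "1 < 2")\n(page 0 0 100 100 "")'
print(page_re.sub(page_to_xml, layer))
```

Note that the `*` quantifier on the quoted content is what lets the empty-page record still produce a `<PAGE value="" />` marker, keeping image and text pages aligned.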
Comment 10 ThomasV 2010-07-07 11:07:00 UTC
Thanks for the patch and the detailed explanation.
I committed it (r69139).
Comment 11 billinghurst 2010-07-25 14:44:47 UTC
It would be nice if this bug fix could be deployed to the Wikisource sites out of session, ahead of the scheduled updates (the next full application review).

It is a minor bug with major consequences for the works concerned. It leaves a blank page, misaligns the text, and requires every subsequent page in a work to be moved forward incrementally.

Simple arithmetic: even if only 20 works are broken, with DjVu files typically 200-500 pages in size, that already equates to somewhere between 2,000 and 8,000 page moves.

Thanks for any consideration that could be given to this request.
Comment 12 Simon Lipp 2010-07-25 15:17:25 UTC
Well, in the meantime, it is still possible to fix the broken DjVu files manually; my own PDF-to-DjVu converter has these lines:

# Workaround for MediaWiki bug #21526
# see https://bugzilla.wikimedia.org/show_bug.cgi?id=21526
$text =~ s/"(?=\s*\))//g;

A quick look at man djvused gives me this simple command to fix a DjVu file (untested):

cp thefile.djvu thefile-fixed.djvu; djvused thefile.djvu -e output-all | perl -pe 's/"(?=\s*\))//g' | djvused thefile-fixed.djvu -s
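For reference, the Perl substitution above can be expressed in Python as well (a sketch with a made-up sample; as in the converter, it is meant to run on a page's raw OCR text before that text is quoted into the DjVu text layer):

```python
import re

# Python equivalent of the Perl workaround s/"(?=\s*\))//g: delete
# any double quote that is followed (allowing whitespace) by a
# closing parenthesis, so the raw OCR text can no longer produce
# the `")` sequence that the unpatched parser mistook for an
# end-of-page marker.
def strip_stray_quotes(ocr_text):
    return re.sub(r'"(?=\s*\))', '', ocr_text)

# Hypothetical OCR garbage of the kind reported on page 80.
print(strip_stray_quotes('La quatrieme annee (.")'))
```

This loses one garbage character from an already-garbled OCR run, which is a small price for keeping the text layer aligned on unpatched wikis.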
Comment 13 billinghurst 2010-11-03 10:55:19 UTC
This has been reported as fixed for some time now, and even after asking nicely for it to be given some priority for the Wikisource sites, there has been neither action nor evidence of it being noticed. Something somewhere somehow would be nice; even a rough indication of who needs to sleep with whom, and where we have to send the photographs, would be helpful. :-)
Comment 14 Tim Starling 2010-12-08 06:06:54 UTC
Deployed now. 

Note that the effect of create_function() is to create a global function with a random name and to return the name. Calling it in a loop will eventually use up all memory, because there is no way to delete global functions once they are created. For this reason alone, it shouldn't be used. But it is also slow, requiring a parse operation that is uncached by APC, and it's insecure in the sense that eval() is insecure: construction of PHP code can easily lead to arbitrary execution if user input is included in the code.
Comment 15 billinghurst 2010-12-08 11:28:21 UTC
Many thanks to all.

As a side note to Wikisourcerers: the files need to be purged at Commons to get them to reload the text layer properly.
Comment 16 Simon Lipp 2010-12-08 11:41:32 UTC
@Tim Starling
I wasn’t aware of the performance issues of using create_function, sorry.
But since the created function is static, it should be trivial to factor it out; I used create_function only because I’m used to using blocks in Ruby. The corresponding function should just be:

function convert_page_to_xml( $matches ) {
	return '<PAGE value="' . htmlspecialchars( $matches[1] ) . '" />';
}

Anyway, since the text layer is computed only once and then cached, I don’t think that’s a big issue.
Comment 17 MZMcBride 2010-12-08 14:36:39 UTC
(In reply to comment #16)
> @Tim Starling
> I wasn’t aware of the performance issues of using create_function, sorry.
> But since the created function is static, it should be trivial to factor it out
> ; I used create_function only because I’m used to use blocks in Ruby. The
> corresponding function should just be:
> 
> function convert_page_to_xml($matches) {
> return '<PAGE value="'.htmlspecialchars($matches[1]).'" />';
> }
> 
> Anyway, since the text layer is computed only once and then cached, I don’t fix
> that’s a big issue.

Tim fixed the issue in r78046. The two revisions were then merged from trunk in r78047.
