
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than those displaying bug reports and their history may be broken. See T30146, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 28146 - Memory limit hit while uploading DjVu file with embedded text
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Assigned To: Brion Vibber
Reported: 2011-03-21 07:23 UTC by Yann Forget
Modified: 2011-04-06 17:17 UTC
CC List: 4 users

Attachments
Work in progress test patch (requires PHP 5.3) (9.65 KB, patch)
2011-04-01 21:13 UTC, Brion Vibber

Description Yann Forget 2011-03-21 07:23:01 UTC
Hello,

I got a crash twice while uploading a file to Commons:
PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285: 
Allowed memory size of 125829120 bytes exhausted (tried to allocate 56 bytes)

The file is http://ia600301.us.archive.org/11/items/MN40239ucmf_2/MN40239ucmf_2.djvu
(a DjVu file from the Internet Archive).

Thanks, Yann
Comment 1 Yann Forget 2011-03-21 08:11:29 UTC
Zyephyrus tried 3 times: same error.
File is available from http://www.archive.org/details/MN40239ucmf_2
Comment 2 zephyrus4 2011-03-21 08:37:10 UTC
Here is the message that I got when trying:

PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285:
Allowed memory size of 125829120 bytes exhausted (tried to allocate 24 bytes) 

Zeph (Zyephyrus)
Comment 3 Tim Starling 2011-03-21 11:56:53 UTC
That would be the preg_match_all() in quickIsNFCVerify(). It could easily be rewritten to use preg_match() with an offset.
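
In outline, that suggestion looks something like this (a sketch only -- the pattern and the per-chunk handling are hypothetical stand-ins, not the actual UtfNormal internals):

<?php
// Scan the string chunk by chunk with preg_match() plus an advancing
// offset, instead of materializing every match at once via preg_match_all().
function scanChunks( $string, $pattern ) {
    $offset = 0;
    $len = strlen( $string );
    while ( $offset < $len &&
        preg_match( $pattern, $string, $m, PREG_OFFSET_CAPTURE, $offset )
    ) {
        list( $chunk, $pos ) = $m[0];      // [matched text, byte offset]
        processChunk( $chunk );            // hypothetical per-chunk work
        $offset = $pos + strlen( $chunk ); // resume scanning after this match
    }
}

Only one match is held in memory at a time, at the cost of one preg_match() call per chunk.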
Comment 4 Brion Vibber 2011-04-01 00:29:59 UTC
It might also be wise to divide up the giant DjVu data set better. It looks like the *entire* page text metadata for all pages in the file gets read in as one batch in DjVuImage::retrieveMetadata.

This entire output is run through UtfNormal::cleanUp() in one piece -- which is where the above error occurs -- then divided up into pages, and then put back into a giant XML string that gets saved as the file's metadata. That giant string later gets read back in and parsed into an XML DOM, but in the meantime it sits around bloating the image table record, memcached, and anybody fetching the document info via InstantCommons.
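
Schematically, the current flow is something like the following (a sketch with hypothetical helper names, not the real code):

<?php
// Outline of the flow described above.
$dump  = dumpAllPageText( $djvuFile ); // entire document's text in one string
$clean = UtfNormal::cleanUp( $dump );  // whole multi-MB blob normalized at once
$pages = splitIntoPages( $clean );     // only now divided into per-page pieces
$xml   = buildMetadataXml( $pages );   // giant XML string saved as metadata,
                                       // cached, and served via InstantCommons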
Comment 5 Tim Starling 2011-04-01 02:04:29 UTC
Sure, but that doesn't conflict with the simple improvement I'm proposing for quickIsNFCVerify(). There's no reason that function can't work with large strings.
Comment 6 Brion Vibber 2011-04-01 18:14:52 UTC
Agreed, putting it on my fun queue.
Comment 7 Brion Vibber 2011-04-01 21:13:06 UTC
Created attachment 8362 [details]
Work in progress test patch (requires PHP 5.3)

I made a quick attempt at serially running preg_match() and bumping the offset, and found it far too slow: it ran for at least several minutes on a large German test set without ever reaching completion.

Redoing it to use preg_replace_callback(), with the loop body moved into an anonymous function for convenience, works, but still shows a major performance regression on the German test set (from 14 MB/sec down to 0.5 MB/sec).

Russian, Japanese, and Korean are slowed down much less, from about 2.2 MB/sec to about 1.9 MB/sec.

This is likely because splitting apart the ASCII and non-ASCII sections is much more expensive for German, which, like most European languages, mixes ASCII and non-ASCII Latin characters together. The other scripts consist mostly of large non-ASCII blocks, so there are fewer pieces to split apart.

Per-loop overhead seems to be a lot higher with preg_replace_callback() (and much more so with serial preg_match()) than with the preg_match_all() + foreach approach... but the giant match array is also very inefficient for European languages, because many of the chunks are very short strings, which probably contributes to running out of memory.
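
For reference, the preg_replace_callback() variant is along these lines (a sketch only; the pattern and normalizeChunk() are placeholders, not the attached patch):

<?php
// The loop body moves into an anonymous function, hence the PHP 5.3
// requirement. Each matched chunk is normalized and substituted in place.
$result = preg_replace_callback(
    '/[\x80-\xff]+/',                   // placeholder: runs of non-ASCII bytes
    function ( $m ) {
        return normalizeChunk( $m[0] ); // hypothetical per-chunk normalizer
    },
    $string
);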
Comment 8 Brion Vibber 2011-04-01 21:15:48 UTC
The UtfNormalMemStress.php test script was added in r85155 so the tests can be reproduced. The times above were measured with the existing UtfNormalBench.php.
Comment 9 Brion Vibber 2011-04-04 21:02:32 UTC
As a workaround, in r85377 I've changed DjVuImage::retrieveMetaData() so it runs individual page texts through UtfNormal::cleanUp() rather than the entire dumped document.
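
Roughly, the workaround does this (a sketch, not the committed diff; $pages stands in for the per-page text array):

<?php
// Normalize each page's text separately, so peak memory tracks the
// largest single page rather than the whole multi-megabyte document.
foreach ( $pages as $i => $pageText ) {
    $pages[$i] = UtfNormal::cleanUp( $pageText );
}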

Verified that without the fix, I run out of memory uploading the sample file at 128M memory_limit, and with the fix I can upload it just fine.

This still should be fixed in UtfNormal itself; languages with heavy mixes of ASCII and non-ASCII text use a LOT of memory due to being split into so many short strings, which makes preg_match_all() much worse in memory terms than a simple copy of the string.

Very long page texts may also hit limits in these situations (the dump data for this DjVu file is about 3 megabytes of French text, which is not inconceivable for a really long wiki page), and it would be nice to fix that.
Comment 10 Mark A. Hershberger 2011-04-05 01:16:09 UTC
Could you update the bug summary to reflect the new (non-preg?) target?  Lowering priority since it sounds like a big part of the problem has been fixed.
Comment 11 Brion Vibber 2011-04-05 01:34:07 UTC
The current summary reflects the as-yet unsolved problem (which is why I've left it open).

I've broken out the UtfNormal general issue (really big string of mixed Latin text -> fails) to bug 28427, and updated the summary here to be specific to the original issue with DjVu files, as that's now worked around.
Comment 12 Yann Forget 2011-04-06 13:03:02 UTC
I suppose that this error is related to this bug?

PHP fatal error in /usr/local/apache/common-local/php-1.17/includes/normal/UtfNormal.php line 285: 
Allowed memory size of 125829120 bytes exhausted (tried to allocate 71 bytes)

http://fr.wikisource.org/w/index.php?title=Fichier:Port_-_Dictionnaire_historique,_g%C3%A9ographique_et_biographique_du_Maine-et-Loire,_tome_1.djvu&action=purge
Comment 13 Yann Forget 2011-04-06 13:05:55 UTC
This is a big file: 882 pages (85.71 MB)
Comment 14 Mark A. Hershberger 2011-04-06 15:47:18 UTC
(In reply to comment #11)
> The current summary reflects the as-yet unsolved problem (which is why I've
> left it open).

Looks like you closed it, though. Reopening since you apparently didn't intend to do that.
Comment 15 Brion Vibber 2011-04-06 15:50:19 UTC
That's the same bug. The fix should be merged to 1.17.
Comment 16 Brion Vibber 2011-04-06 17:17:59 UTC
(In reply to comment #14)
> (In reply to comment #11)
> > The current summary reflects the as-yet unsolved problem (which is why I've
> > left it open).
> 
> Looks like you closed it, though. Reopening since you apparently didn't intend
> to do that.

No, I did intend that -- that's why I broke out the unresolved parts to a separate bug and changed the summary on this bug to the specific issue that was reported. Please leave closed unless there's a regression in the specific issue. :)
