Last modified: 2011-10-23 19:11:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links may be broken. See T28304, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 26304 - OAIRepository fatal error stops search indexing
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
OAI (Other open bugs)
unspecified
All All
: High critical with 2 votes
: ---
Assigned To: Sam Reed (reedy)
: shell
: 28852 29679
Depends on:
Blocks:
Reported: 2010-12-11 14:56 UTC by Robert Stojnic
Modified: 2011-10-23 19:11 UTC (History)
8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
results of grep (2.50 KB, text/plain)
2011-07-02 23:37 UTC, Mark A. Hershberger
Details
More recent error logs (18.26 KB, text/plain)
2011-08-16 21:38 UTC, Sam Reed (reedy)
Details

Description Robert Stojnic 2010-12-11 14:56:44 UTC
This bug is related to bug 26203. OAIRepository returns 500 on certain queries on dewikisource. This makes the search indexer stuck. 

Here are the exact queries and errors: 

First query:
http://de.wikisource.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-03-17T21:09:10Z

Dec  2 14:48:29 10.0.2.227 apache2[32008]: PHP Fatal error:  Allowed memory size of 83886080 bytes exhausted (tried to allocate 76 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/normal/UtfNormal.php on line 272

Second query:
http://de.wikisource.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:1009861

Dec  2 15:03:39 10.0.2.182 apache2[9827]: PHP Fatal error:  Allowed memory size of 83886080 bytes exhausted (tried to allocate 26519414 bytes) in /usr/local/apache/common-local/wmf-deployment/includes/OutputHandler.php on line 8

The second query/error was produced when I tried to skip some articles by bumping the from= parameter forward in time.
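The skip-forward workaround described above (resume via resumptionToken, and on an HTTP 500 bump the from= timestamp past the offending records) can be sketched as follows. This is a hypothetical harvester-side sketch, not the actual lucene-search IncrementalUpdater; the fetch function is injected so the control flow can be shown without network access:

```python
from datetime import datetime, timedelta

BASE = ("http://de.wikisource.org/w/index.php?title=Special:OAIRepository"
        "&verb=ListRecords&metadataPrefix=mediawiki")

def harvest(fetch, start, max_skips=3, skip=timedelta(hours=1)):
    """Yield ListRecords response bodies, resuming via resumptionToken.

    On an HTTP 500, advance the from= timestamp by `skip` and retry,
    giving up after `max_skips` consecutive failures.
    """
    frm = start
    url = f"{BASE}&from={frm.strftime('%Y-%m-%dT%H:%M:%SZ')}"
    skips = 0
    while True:
        status, body, token = fetch(url)  # fetch returns (status, body, token)
        if status == 500:
            if skips >= max_skips:
                raise RuntimeError("giving up after repeated 500s at " + url)
            skips += 1
            frm += skip  # skip past whatever record crashes the server
            url = f"{BASE}&from={frm.strftime('%Y-%m-%dT%H:%M:%SZ')}"
            continue
        skips = 0
        yield body
        if token is None:  # no resumptionToken means the harvest is complete
            return
        url = f"{BASE}&resumptionToken={token}"
```

Skipping forward loses the records in the skipped window, so it is only a workaround to unstick the indexer, not a fix.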
Comment 1 Robert Stojnic 2011-02-25 22:22:03 UTC
Bump. Happened again, but now on en.wp:

http://en.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:414565305

I am going to try skipping some articles, but this really needs to be fixed.
Comment 3 Robert Stojnic 2011-05-06 09:55:49 UTC
And again on commons:

http://commons.wikimedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:53981669

Bumping priority on this as it is breaking search index updates at random times on multiple wikis.
Comment 4 Sam Reed (reedy) 2011-05-06 15:16:46 UTC
*** Bug 28852 has been marked as a duplicate of this bug. ***
Comment 5 Mitch Harris 2011-05-08 06:51:55 UTC
Search won't work at all this morning. Yesterday evening, results that had been missing for days started appearing, but this morning, searches do not display any results.
Comment 6 Robert Stojnic 2011-05-08 09:19:27 UTC
This is a known problem unrelated to OAIRepository. We have only one search host (search11) that serves most low-traffic wikis (including commons), and it needs to be restarted every Sunday morning. This introduces around 5-10 minutes of downtime, since the host carries so many wikis and it takes time for all ~800 of them to be properly loaded into memory. Hopefully at some point we'll be able to spare two search hosts so we can eliminate this downtime.
Comment 7 Mark A. Hershberger 2011-06-15 23:15:29 UTC
Are you still seeing this?  Brion thought some changes that were made may have mitigated this.
Comment 8 Robert Stojnic 2011-06-15 23:19:28 UTC
The URLs listed above now work. Closing as FIXED, will reopen if it occurs again.
Comment 9 Brion Vibber 2011-06-15 23:21:15 UTC
Bug 28427 notes a memory usage issue in UtfNormal::quickIsNFCVerify (part of the Unicode normalization routines we run on a lot of input and output), which could perhaps be contributing... I was thinking there was some related fix but that's actually for bug 28146 which was fixed by chopping up a run over a giant string into a run over smaller strings, and was specific to the calling code (in DjVu file handling) -- so that fix probably would not have an effect on this.

More generally, hitting memory limits here could simply be due to handling too much data in one request, and the repository may need to change how it divides the data into chunks.

It's possible that the errors occurred before and no longer do because the specific large pages that were coming over the channel are no longer in the queue.
Comment 10 Mitch Harris 2011-07-02 10:17:30 UTC
It's back again! The search engine update lags by 48 hours now.
Comment 11 Robert Stojnic 2011-07-02 12:43:24 UTC
*** Bug 29679 has been marked as a duplicate of this bug. ***
Comment 12 Helder 2011-07-02 13:00:18 UTC
(In reply to comment #11)
> *** Bug 29679 has been marked as a duplicate of this bug. ***
Just to be sure, if this bug is fixed and Bug 29679 is a duplicate of this one, when should we expect the index to be updated on Portuguese Wikibooks?
Comment 13 Robert Stojnic 2011-07-02 13:02:05 UTC
Both commons and ptwikibooks should be updated by tomorrow. However, the bug still occurs. So, reopening.
Comment 14 Robert Stojnic 2011-07-02 17:04:05 UTC
If anyone is going to track this down and needs some fresh URLs that fail, do the following:

ssh searchidx2
grep "HTTP response code: 500" /a/search/log/log-all | tail
Comment 15 Brion Vibber 2011-07-02 18:37:45 UTC
I have no server access, so I can't tell you exactly what's being recorded now. But when pulling some of the older links shown above (for enwiki & commons) using the testing password, I receive what at first looks like a legitimate OAI-PMH XML response; partway through, after the end of one of the <record>...</record> elements, the Wikimedia error page suddenly shows up like this:

... ^ OAI-PMH with embedded MediaWiki export XML
  </page>
</mediawiki>
</metadata>
</record>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>

  <title>Wikimedia Error</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta name="author" content="Mark Ryan, with translation by many people; see http://meta.wikimedia.org/wiki/Multilingual_error_messages"/>
  <meta name="copyright" content="(c) 2005-2007 Mark Ryan and others. Text licensed under the GNU Free Documentation License. http://www.gnu.org/licenses/fdl.txt"/>
... v HTML error page

The OAI output, like Special:Export, disables the regular output machinery and possibly the regular output buffering as well, so some error may be triggering that error page (have we modified PHP to output that error page directly on a fatal error, perhaps?)


Taking a quick look over the current OAI extension code, I don't see any explicit use of UtfNormal::cleanUp(); though it'll be implicitly called already by the xmlsafe() string escaping wrapper in WikiExport common code & a few bits in OAIRepo itself.

A very large page with a lot of non-ASCII characters could perhaps be an individual thing that eats up a lot of memory in one chunk; it's also possible there's a memory leak I haven't noticed before.

Could try tweaking the number in OAIRepo::chunkSize() down -- this'll make it run fewer pages through each ListRecords call, and might help if there's a leak forcing memory usage up from record to record -- but *won't* help if it's dying on an individual page record being too-big.
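The failure mode described above (a well-formed OAI-PMH stream abruptly replaced by an HTML error page) can be detected client-side before XML parsing, since a valid response never contains an HTML doctype. A minimal sketch, assuming a hypothetical harvester-side helper (not part of lucene-search or the OAI extension):

```python
def salvage(body):
    """Split off a spliced-in HTML error page from an OAI-PMH response.

    Returns (xml_prefix, spliced): the text up to the last complete
    </record> before the HTML doctype, and whether a splice was found.
    """
    cut = body.find("<!DOCTYPE html")
    if cut == -1:
        return body, False  # no error page appended; body is intact
    end = body.rfind("</record>", 0, cut)
    if end == -1:
        return "", True  # error page arrived before any complete record
    return body[:end + len("</record>")], True
```

A harvester could use this to keep the complete records from a partially failed chunk instead of discarding the whole response.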
Comment 16 Brion Vibber 2011-07-02 18:41:44 UTC
r91364 makes the ListRecords chunk size configurable via $oaiChunkSize (default is 50, the same as was previously hardcoded).

Should be straightforward to merge that to 1.17wmf1, push that out, and try lowering the value to see if it has any effect. Smaller chunk size will require more requests to get through the data so may slow it down a little, but if it's leaking memory from record to record this may help.
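The tradeoff Brion describes can be sketched abstractly: a smaller chunk size caps how many records a single ListRecords response materializes (bounding per-request memory), at the cost of more requests overall. An illustrative Python sketch (the function name is hypothetical; the real chunking lives in the PHP OAIRepo code):

```python
def chunked(ids, chunk_size=50):
    """Yield successive slices of `ids`, each at most `chunk_size` long."""
    for i in range(0, len(ids), chunk_size):
        yield ids[i:i + chunk_size]
```

Halving the chunk size roughly doubles the request count, which is why lowering it helps against per-chunk leaks but does nothing if a single oversized record exhausts memory on its own.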
Comment 17 Mark A. Hershberger 2011-07-02 23:37:34 UTC
Created attachment 8737 [details]
results of grep

results of grepping the logs
Comment 18 Rob Lanphier 2011-07-28 21:20:58 UTC
Sam, can you take this one on?  Brion believes that this is ready to deploy, if I'm reading him right.
Comment 19 Brion Vibber 2011-07-29 17:10:07 UTC
Has this not been deployed yet? Really should be...
Comment 20 Sam Reed (reedy) 2011-08-02 17:09:47 UTC
Merged in r91364

Pushed out to cluster with default 50

Just trying to find where the logs are, and I'll try and reduce the number a bit
Comment 21 Sam Reed (reedy) 2011-08-11 14:00:45 UTC
Still nothing since July 27th
Comment 22 Sam Reed (reedy) 2011-08-16 21:37:38 UTC
Nothing since the 27th of July, but errors seemingly started again on the 13th.

Reduced chunk size to 40 just now.

Will upload a more up-to-date list of just the recent errors.
Comment 23 Sam Reed (reedy) 2011-08-16 21:38:08 UTC
Created attachment 8928 [details]
More recent error logs
Comment 24 Brion Vibber 2011-08-16 21:39:28 UTC
        PHP fatal error in /usr/local/apache/common-local/php-1.17/extensions/OAI/OAIRepo_body.php line 881:
        Call to a member function getUrl() on a non-object
Comment 25 Brion Vibber 2011-08-16 21:42:31 UTC
Looks like a regression in r80770: introduces a chance for a fatal error if we have an image row but don't succeed at loading the file object. (Not entirely sure how that happens but that's where the error is.)
Comment 26 Sam Reed (reedy) 2011-08-16 21:51:38 UTC
(In reply to comment #24)
>         PHP fatal error in
> /usr/local/apache/common-local/php-1.17/extensions/OAI/OAIRepo_body.php line
> 881:
>         Call to a member function getUrl() on a non-object

Seems like it might be worth live hacking in some debugging for the time being.

If we get false from wfFindFile, output to debug, and return ''?
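The guard proposed here, sketched as a Python analogue (names are illustrative; the real code is PHP in OAIRepo_body.php, where wfFindFile() can return false instead of a file object):

```python
import logging

def image_url(find_file, title):
    """Return the URL for `title`, or '' with a debug log if lookup fails."""
    f = find_file(title)  # stand-in for wfFindFile(); may return a falsy value
    if not f:
        logging.debug("OAI: no file object for %s; emitting empty URL", title)
        return ""
    return f.get_url()  # stand-in for the getUrl() call that fatals on line 881
```

Returning an empty URL keeps the feed flowing while the debug log records which titles fail to load.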
Comment 28 Brion Vibber 2011-08-23 22:22:53 UTC
r95169 on trunk, r95170 on 1.17wmf1 should resolve this. Has it been deployed?

Still needs merge to 1.18.
Comment 29 Sam Reed (reedy) 2011-08-23 22:47:55 UTC
(In reply to comment #28)
> r95169 on trunk, r95170 on 1.17wmf1 should resolve this. Has it been deployed?
> 
> Still needs merge to 1.18.

Yes, it has been deployed.

Did tag it 1.18
Comment 30 Antoine "hashar" Musso (WMF) 2011-08-28 15:35:52 UTC
r95169 applied to 1.18 with r95628
Comment 31 Sam Reed (reedy) 2011-09-02 11:43:42 UTC
Nothing again since the 21st August....
Comment 32 Mitch Harris 2011-09-04 08:25:20 UTC
Wikimedia Commons search engine returns only stale results when searching, once again! The most recent returned results are at least four days old, if not older. Comment 31 says "since 21st August"; I can't confirm this, but it looks likely. This would be a record-breaking duration for a bug on a major database!
Comment 33 Robert Stojnic 2011-09-05 10:56:44 UTC
The problem reported in Comment 32 seems to be unrelated to this bug; it is due to a change in InitialiseSettings.php (enabling of HTTPSExperiment) that broke lucene-search's parsing of that file, which is needed to get the full URLs of all wikis for incremental updates. It should be fixed now by a manual override in lsearch-global-2.1.conf.
Comment 34 Sam Reed (reedy) 2011-09-05 11:16:54 UTC
Might want to ping roan if he doesn't already know
Comment 35 Roan Kattouw 2011-09-05 15:24:10 UTC
(In reply to comment #34)
> Might want to ping roan if he doesn't already know
I noticed this comment but didn't really understand what Robert meant. Parsing of what broke exactly? Is the override in lsearch-global-2.1.conf a suitable permanent solution?
Comment 36 Roan Kattouw 2011-09-05 16:27:57 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > Might want to ping roan if he doesn't already know
> I noticed this comment but didn't really understand what Robert meant. Parsing
> of what broke exactly? Is the override in lsearch-global-2.1.conf a suitable
> permanent solution?
Robert and I talked on IRC, which led to me fixing the real issue in r96287. This fix is now deployed.
Comment 37 Sam Reed (reedy) 2011-09-27 23:00:06 UTC
[main] WARN  org.wikimedia.lsearch.oai.IncrementalUpdater  - Retry later: error while processing update for igwiki : Server returned HTTP response code: 500 for URL: http://ig.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:08:59Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ig.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:08:59Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
2011-09-20 19:03:26,440 [main] WARN  org.wikimedia.lsearch.oai.IncrementalUpdater  - Retry later: error while processing update for iiwiki : Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ii.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2010-02-16
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
2011-09-20 19:03:26,935 [main] WARN  org.wikimedia.lsearch.oai.IncrementalUpdater  - Retry later: error while processing update for ikwiki : Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
java.io.IOException: Server returned HTTP response code: 500 for URL: http://ik.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&from=2011-09-20T16:09:04Z
2011-09-22 19:43:55,264 [main] WARN  org.wikimedia.lsearch.oai.OAIHarvester  - I/O exception listing records: Server returned HTTP response code: 500 for URL: http://de.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:90200283
java.io.IOException: Server returned HTTP response code: 500 for URL: http://de.wikipedia.org/w/index.php?title=Special:OAIRepository&verb=ListRecords&metadataPrefix=mediawiki&resumptionToken=mediawiki:90200283
Comment 38 Brion Vibber 2011-09-27 23:22:10 UTC
All of the above links work at present -- was this some transitory error?
Comment 39 Sam Reed (reedy) 2011-09-28 23:29:34 UTC
(In reply to comment #38)
> All of the above links work at present -- was this some transitory error?

It does seem that way. Do you think it's worth closing this for the time being? There isn't much that's actually actionable (i.e. "do X and Y will be fixed").
Comment 40 Mark A. Hershberger 2011-10-16 16:21:05 UTC
This is working now.  Bumping priority and closing so that if it is re-opened it shows up more quickly.
