Last modified: 2014-08-03 18:21:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T58041, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 56041 - updateCollation.php script prohibitively slow for very large wikis


Summary:	updateCollation.php script prohibitively slow for very large wikis

Status:	PATCH_TO_REVIEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Maintenance scripts
Version:	unspecified
Hardware:	All All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	performance

Depends on:
Blocks:	collations

Reported:	2013-10-23 10:57 UTC by Bartosz Dziewoński
Modified:	2014-08-03 18:21 UTC
CC List:	7 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Bartosz Dziewoński 2013-10-23 10:57:36 UTC

updateCollation.php script is awfully slow. It took over a week for
fr.wp (bug 54680), it'll probably take months if we ever decide to run
it on en.wp. That kinda sucks.

I'm not sure what can be done, or whether it's just a problem with the WMF
configuration, so I'm just filing this and asking for comments. Please resolve
as INVALID if we in fact can't do anything about this.

Possible causes:
* The workaround from bug 45970 which makes it use an index that's not
  entirely perfect for the task (but likely good enough, no idea how
  much slower that makes the script).
* Slave synchronisation, in which case maybe we can do something with
  ops involvement? Don't ask me.
* There's just too much data being sent back-and-forth between PHP and
  the database, in which case we can't do anything (unless we implement
  collating entirely database-side, which I've been told is a bad idea).

I'm CC-ing competent people. Help?

Comment 1 Bawolff (Brian Wolff) 2013-10-23 11:51:58 UTC

My guess would be that the ICU algorithm is slow, and that this is the bottleneck. (Possibly uca-fr is even slower than normal ICU, as it has to do special things with accents in that language.) However, that is pure speculation. We should do profiling to figure out where the bottleneck really is.
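
One way to test that guess is to time each phase of a batch loop separately. The following is a minimal sketch, not the script's actual code: fetch_batch, compute_sortkey and write_batch are hypothetical stand-ins for the SELECT, the ICU sortkey computation and the UPDATE, and would need to be replaced with the real calls before the numbers mean anything.

```python
import time
from collections import defaultdict


class PhaseTimer:
    """Accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)

    def measure(self, phase, func, *args):
        start = time.perf_counter()
        result = func(*args)
        self.totals[phase] += time.perf_counter() - start
        return result


# Hypothetical stand-in for the batched SELECT.
def fetch_batch(offset, size):
    return [f"Page {i}" for i in range(offset, offset + size)]


# Hypothetical stand-in for the ICU sortkey computation.
def compute_sortkey(title):
    return title.casefold().encode("utf-8")


# Hypothetical stand-in for the UPDATE; returns the number of rows written.
def write_batch(keys):
    return len(keys)


def run(total_rows=1000, batch_size=100):
    timer = PhaseTimer()
    updated = 0
    for offset in range(0, total_rows, batch_size):
        rows = timer.measure("select", fetch_batch, offset, batch_size)
        keys = timer.measure(
            "collate", lambda: [compute_sortkey(title) for title in rows]
        )
        updated += timer.measure("update", write_batch, keys)
    return updated, dict(timer.totals)


if __name__ == "__main__":
    updated, totals = run()
    for phase, seconds in sorted(totals.items()):
        print(f"{phase}: {seconds:.6f}s")
```

Whichever phase dominates the totals on a real run is the one worth optimising first.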

Comment 2 Tim Starling 2013-10-23 23:37:01 UTC

<springle>	Reedy: could UpdateCollation do reads from a slave?
<Reedy>	Couldn't see why not... We've got the wfWaitForSlave() calls in anyway, so they should be up to date for when we do the next select()
<Reedy>	TimStarling might have some input too
<TimStarling>	the select is too slow?
<springle>	yes. does a large filesort. innodb buffer pool gets out of whack pulling on old data, other writes pile up, then swap starts and everything crawls, then max_connections
<TimStarling>	I suppose it would work, as long as the slave is guaranteed to have no open snapshot
<springle>	trying without adaptive hash latch now, but not convinced that's a core issue -- 5.5 already has some of the old related bugs fixed
<TimStarling>	maybe it could use a $dbr->commit() before the select to be on the safe side
<springle>	the select does ORDER BY cl_to, cl_type, cl_from .. is that definitely needed?
<TimStarling>	springle: yes, see https://gerrit.wikimedia.org/r/#/c/53301/
<springle>	ok
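
To illustrate why a fixed ORDER BY cl_to, cl_type, cl_from matters for batching, here is a minimal sketch of resuming each batch from the last row of the previous one. It assumes a simplified categorylinks table in an in-memory SQLite database; make_demo_db, iter_batches and the index name cl_demo are hypothetical, and this is not the query that updateCollation.php actually runs.

```python
import sqlite3


def make_demo_db():
    """Build a tiny, simplified categorylinks table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categorylinks (cl_to TEXT, cl_type TEXT, cl_from INTEGER)"
    )
    # Index in the same column order as the ORDER BY below.
    conn.execute(
        "CREATE INDEX cl_demo ON categorylinks (cl_to, cl_type, cl_from)"
    )
    rows = [("Cats", "page", i) for i in range(1, 6)]
    rows += [("Dogs", "file", 7), ("Dogs", "page", 3)]
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", rows)
    return conn


def iter_batches(conn, batch_size):
    """Yield batches in (cl_to, cl_type, cl_from) order, resuming after the last row seen."""
    last = None
    while True:
        if last is None:
            where, params = "", ()
        else:
            # Rows strictly after `last` in the three-column ordering.
            where = (
                "WHERE cl_to > ? OR (cl_to = ? AND "
                "(cl_type > ? OR (cl_type = ? AND cl_from > ?)))"
            )
            params = (last[0], last[0], last[1], last[1], last[2])
        batch = conn.execute(
            "SELECT cl_to, cl_type, cl_from FROM categorylinks "
            f"{where} ORDER BY cl_to, cl_type, cl_from LIMIT ?",
            params + (batch_size,),
        ).fetchall()
        if not batch:
            return
        yield batch
        last = batch[-1]


if __name__ == "__main__":
    for batch in iter_batches(make_demo_db(), 3):
        print(batch)
```

Because every batch starts strictly after the previous batch's last row, no row is visited twice and the ordering must stay stable between queries, which is why the ORDER BY cannot simply be dropped in this scheme.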

Comment 3 Gerrit Notification Bot 2014-01-07 23:54:59 UTC

Change 106162 had a related patch set uploaded by Reedy:
Make SELECT queries against slaves in updateCollation.php

https://gerrit.wikimedia.org/r/106162
