Last modified: 2010-05-15 16:02:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T17460, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 15460 - Poor concurrent performance due to objectcache table locks
Poor concurrent performance due to objectcache table locks
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
General/Unknown (Other open bugs)
1.13.x
All All
: Normal critical (vote)
: ---
Assigned To: Tim Starling
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-09-03 17:07 UTC by Dan Barrett
Modified: 2010-05-15 16:02 UTC (History)
2 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
The "Wait" extension from the case study (299 bytes, text/plain)
2008-09-03 17:07 UTC, Dan Barrett
Details

Description Dan Barrett 2008-09-03 17:07:37 UTC
Created attachment 5279 [details]
The "Wait" extension from the case study

A single wiki page that takes a long time to render -- due to a
long-running tag extension -- blocks ALL other users of the wiki from
viewing ANY pages.

This bug might sound unlikely -- given that Wikipedia has scaled so
massively -- but I've reproduced it now on four
different MediaWiki servers, two running Windows 2003 Server, and two
running Linux (CentOS, Ubuntu), all with different PHP and Apache
versions & configurations. One of them was a plain vanilla MediaWiki
install created just to test this bug.

The problem doesn't seem to be related to sessions (e.g., when session
data is locked to prevent concurrent writes, as described at
http://us2.php.net/manual/en/function.session-write-close.php).  Even
with cookies disabled, the same behavior happens.

CASE STUDY TO REPRODUCE THE PROBLEM

These steps will create two web pages, one slow and one fast, and
demonstrate that ALL other users cannot use the wiki at all while
the slow page is rendering.

1. Start with a plain vanilla install of MediaWiki 1.13.

2. Install the "Wait" extension attached to this bug ticket,
that simply sleeps for N seconds. The source code is:

<?php
# Wait for N seconds
$wgExtensionFunctions[] = 'wfWaitSetup';
function wfWaitSetup() {
  global $wgParser;
  $wgParser->setHook('wait', 'wfWait');
}
function wfWait($input) {
  global $wgParser;
  $wgParser->disableCache();
  sleep($input);
  return "Slept for $input seconds";
}

3. Create two wiki articles: one called "Slow" that contains:

   <wait>30</wait>

and one called "Fast" that contains the text "Hello world".  (Note
that saving the "Slow" page will take 30 seconds as the <wait> tag is
rendered.)

Now you are ready to demonstrate the problem.

4. Open web browsers on TWO DIFFERENT PCs.  (Alternatively, open two
different browsers -- say, Firefox and IE -- on the same PC.)  Do not
just open 2 windows or tabs of the same browser instance, because
they might share session information, which would invalidate this test.

5. In the first browser, hit the "Slow" page.

6. While the Slow page is sleeping, in the second browser, hit the
"Fast" page.  Ensure that you are NOT getting a cached copy of Fast,
say, by force-refreshing the browser (shift-refresh in Firefox,
ctrl-F5 in IE, etc.) or appending action=purge.  (Or hit
Special:Version or other dynamic page instead of the "Fast" page.)

7. While the Slow page is sleeping, the Fast page WILL NOT RENDER.  It
will be delayed until the Slow page has finished sleeping.  This is
the problematic behavior: the Slow page -- in an entirely different
session -- is blocking the Fast page.

SOME ANALYSIS

Long-running parser tags like <wait> are realistic. Imagine a parser
tag that hits an external database to run SQL queries (that's how
we found this problem), or a tag that requires an external resource
that is slow or unavailable.

This does not seem to be an Apache problem.  While the "Slow" page is
sleeping, you can hit another non-MediaWiki PHP page on the same
webserver (say, test.php that runs "phpinfo()" and exits) and there
are no delays.

This also does not seem to be a PHP session problem. On the same
webserver, if you create two simple PHP scripts that open a session,
and hit them with separate browsers as above, the problem does NOT
occur.  Here are example scripts:

<?php
# Script number 1
session_start();
sleep(20);
echo "Done";

<?php
# Script number 2
session_start();
echo "Done";

If you hit script #1 and, while it's sleeping, you hit script #2,
using independent browsers in different sessions, the problem does not
occur. (Which is good & expected.)

Why doesn't Wikipedia experience this problem? Probably because of its
extreme caching architecture (with squid, etc.), and that it doesn't
have any long-running extensions installed.  Or maybe it has a magical
PHP/Apache/MySQL configuration that prevents the problem?

STACK DUMP

I did the case-study above and let the "Fast" page time out and fail
while "Slow" was sleeping. Here is the stack dump from that moment,
produced by xdebug:

Fatal error:  Maximum execution time of 30 seconds exceeded in includes/db/Database.php 
on line 579
Stack trace:
1. {main}() index.php:0
2. MediaWiki->checkInitialQueries() index.php:60
3. Title::newMainPage() includes/Wiki.php:105
4. wfMsgForContent() includes/Title.php:293
5. wfMsgReal() includes/GlobalFunctions.php:380
6. wfMsgGetKey() includes/GlobalFunctions.php:432
7. StubObject->get() includes/GlobalFunctions.php:467
8. StubObject->__call() includes/StubObject.php:0
9. StubObject->_call() includes/StubObject.php:76
10. call_user_func_array() includes/StubObject.php:58
11. MessageCache->get() includes/StubObject.php:0
12. MessageCache->getMsgFromNamespace() includes/MessageCache.php:543
13. MessageCache->load() includes/MessageCache.php:606
14. MessageCache->saveToCaches() includes/MessageCache.php:250
15. BagOStuff->add() includes/MessageCache.php:419
16. SqlBagOStuff->set() includes/BagOStuff.php:100
17. SqlBagOStuff->delete() includes/BagOStuff.php:274
18. SqlBagOStuff->_query() includes/BagOStuff.php:288
19. MediaWikiBagOStuff->_doquery() includes/BagOStuff.php:319
20. Database->query() includes/BagOStuff.php:433
21. Database->doQuery() includes/db/Database.php:540

And the offending line 579 is:

   $ret = mysql_query( $sql, $this->mConn );

BRION VIBBER'S TEST

Brion did a quick test with this on his local wiki and thought the
problem might be session-related. He writes in mediawiki-l:

"If I preview a page with <wait>20</wait>, then go load something up
in another tab in the same browser, it sits there waiting on both
tabs.  (Confirmed with Firefox 3 and Safari 3 on Mac OS X.) If on the
other hand I go load things up in another browser, there's no delay
there. If I disable cookies (thus removing session affinity), then a
second tab in the same browser has no slowdown. And indeed, it appears
that PHP session files are by default locked to prevent multiple
simultaneous accesses."

I don't know why Brion had different results than I did. Perhaps his
"Fast" page was cached so there was no slowdown due to the Slow page.
I've asked him to try again and hit Special:SpecialPages instead of
"Fast".  Personally, I would LOVE to hear that my wikis are
misconfigured in some way, or that I've made some mistake.
This problem is killing performance for us.
Comment 1 Brion Vibber 2008-09-03 17:13:14 UTC
I definitely do *not* get the described effect when using separate browsers.

Special:Specialpages, recentchanges, a few random pages, etc all load fine in my second browser (IE in a Windows VM) while my "slow" page is previewing, saving, or rendering in Firefox on my mac.
Comment 2 Antoine "hashar" Musso (WMF) 2008-09-03 17:16:02 UTC
Looks like a browser issue. In firefox you can probably tweak your
network parameters (about:config) such as :

network.http.max-connections
network.http.max-connections-per-server
network.http.max-persistent-connections-per-server
Comment 3 Dan Barrett 2008-09-03 17:22:18 UTC
I don't see how it can be a browser issue.  In the case study, we have two separate users on two separate PCs, and each is simply hitting ONE single (different) wiki page.

For the record, my settings are:

network.http.max-connections = 24
network.http.max-connections-per-server = 8
network.http.max-persistent-connections-per-server = 2
Comment 4 Brion Vibber 2008-09-03 17:24:17 UTC
Ok, I *do* see this effect on a fresh 1.13 install. I'll see if there's some configuration funniness going on...
Comment 5 Dan Barrett 2008-09-03 17:26:07 UTC
Thank you thank you thank you for looking at this. Can I provide more information that would be helpful?  php.ini?  httpd.conf?  LocalSettings.conf (though it's plain vanilla)?  mysql config?
Comment 6 Brion Vibber 2008-09-03 17:44:36 UTC
Ok here's the skinny:

1.12 branch fresh install, all default options -- no hang.

1.13 branch fresh install, all default options -- hang.

Trunk (1.14 dev) fresh install, all default options -- no hang.

These are all tested on Mac OS X 10.5, MySQL 5.0.51a, default system Apache/PHP (5.2.6).

So there may be some kind of regression in 1.13, which is interestingly fixed (or didn't happen) in dev trunk.
Comment 7 Brion Vibber 2008-09-03 18:15:11 UTC
Poked around a little looking for obvious changes in Article and Title DB usage, didn't see anything clear. Assigning to Tim to investigate this as a regression in release branch.
Comment 8 Dan Barrett 2008-09-03 18:19:00 UTC
I installed the 1.14 dev trunk, and the problem disappears for me too.  (And I
was pretty sure we hadn't experienced it in 1.12.)

Any chance that a fix could make it into the 1.13.1 patch release?  I am happy to help in any way I can.
Comment 9 Roan Kattouw 2008-09-03 18:32:26 UTC
(In reply to comment #8)
> I installed the 1.14 dev trunk, and the problem disappears for me too.  (And I
> was pretty sure we hadn't experienced it in 1.12.)
> 
> Any chance that a fix could make it into the 1.13.1 patch release?  I am happy
> to help in any way I can.
> 

If we want to track down what caused this, someone who has time for it (which I don't, and I guess it's safe to say Brion doesn't either) should do a binary search-like revision witchhunt (see <http://en.wikipedia.org/wiki/Binary_search#Number_Guessing_Game> for a quick explanation) to see which revision introduced this bug.
Comment 10 Brion Vibber 2008-09-03 18:35:31 UTC
That's why I *assigned* it to Tim. But if you want to do it before he wakes up, be my guest. :)
Comment 11 Roan Kattouw 2008-09-03 18:37:24 UTC
(In reply to comment #10)
> That's why I *assigned* it to Tim. But if you want to do it before he wakes up,
> be my guest. :)
> 

Yeah, I only noticed that later.
Comment 12 Dan Barrett 2008-09-03 19:45:32 UTC
Found it!  (Via the binary search technique suggested.)

The problem was fixed in 1.14alpha by Tim in revision 40209:

$ svn log -r 40209 -v
r40209 | tstarling | 2008-08-29 04:35:00 -0400 (Fri, 29 Aug 2008) | 3 lines
Changed paths:
   M /trunk/phase3/RELEASE-NOTES
   M /trunk/phase3/includes/BagOStuff.php

* Fixed intermittent deadlock errors involving objectcache table queries. Use a
separate database connection for the objectcache table to avoid long-lasting locks on that table.

I've confirmed that the bug is present in 40208 and fixed in 40209.
Comment 13 Dan Barrett 2008-09-03 19:47:47 UTC
Is the delta between 40208 and 40209 safe to patch into 1.13?
Comment 14 Dan Barrett 2008-09-03 19:48:21 UTC
On my own wiki, I mean.  (I don't have commit access to your svn repository.)
Comment 15 Antoine "hashar" Musso (WMF) 2008-09-03 19:55:54 UTC
I applied the delta to mw13 and it seems to fix the issue.
Comment 16 Antoine "hashar" Musso (WMF) 2008-09-03 20:17:42 UTC
Back ported in 1.13. Thanks for bisecting this issue.

------------------------------------------------------------------------
r40391 | hashar | 2008-09-03 22:15:33 +0200 (mer, 03 sep 2008) | 4 lines

Fix #15460 : One slow extension can block all wiki pages for all users
Backport of trunk@40209 by Tim Starling. Pointed by Dan Barrett.
------------------------------------------------------------------------

Comment 17 Dan Barrett 2008-09-04 00:23:06 UTC
Your patch works for me. Thanks so much Ashar and Brion!!!
Comment 18 Tim Starling 2008-09-04 07:18:08 UTC
Fixed summary.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links