Last modified: 2011-11-30 16:58:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T31233, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 29233 - Broken failover for DB slave connection errors
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Database
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Tim Starling
Depends on:
Blocks: 29068
Reported: 2011-06-01 19:35 UTC by ctwoo
Modified: 2011-11-30 16:58 UTC
CC List: 4 users

Description ctwoo 2011-06-01 19:35:32 UTC
A failure when reading from a replicated database can bring down the rest of the site.

When the application tries to read from a 'failed' db, it should catch the exception (error 2003?) and redirect the read to another db. Instead, today, the application hangs, resulting in a site failure.
Comment 1 Roan Kattouw 2011-06-01 19:37:57 UTC
Clarification: "DB" in this report seems to mean database server, so this is a load balancing thing.
Comment 2 Mark A. Hershberger 2011-06-16 23:04:13 UTC
More details from talking to CT (and my own surprise that this is not a "solved" problem):

1. The Apaches each end up using a single slave DB for their DB access.
2. When the DB goes down, client apps aren't smart enough to fail over to the next available DB -- they sit there waiting until they time out and fail.
3. Apparently, because Apache/PHP is stupid and doesn't realize that it shouldn't try again, step 2 keeps repeating until the DB is fixed.
Comment 3 Brion Vibber 2011-06-16 23:12:57 UTC
Apache/PHP doesn't select databases; MediaWiki's LoadBalancer class does. It should already be failing over to the next available server in the case of a connection error (MySQL error 2003 is "can't connect").

* what's "failed db" mean?
* what's "hangs" mean?
* what's "site failure" mean?

Is this a situation where there *are* no other working slave servers to fail over to? What exactly is it doing?
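The failover behaviour described in this comment can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not MediaWiki's actual LoadBalancer code (which is PHP). The names `connect_with_failover`, `ConnectionFailed`, `fake_connect`, and the host `db33` are all hypothetical and exist only for this example.

```python
class ConnectionFailed(Exception):
    """Raised by the hypothetical connect() helper when a server is unreachable."""


def connect_with_failover(servers, connect):
    """Try each replica in order and return the first working connection.

    `servers` is an ordered list of host names; `connect` is any callable
    that returns a connection or raises ConnectionFailed.
    """
    errors = []
    for host in servers:
        try:
            return host, connect(host)
        except ConnectionFailed as exc:
            # Record the failure and move on instead of retrying the same host.
            errors.append((host, str(exc)))
    # Every candidate failed: the fallback sequence is exhausted.
    raise ConnectionFailed(f"No working slave server: {errors}")


def fake_connect(host):
    # Stand-in for a real driver call (assumption); pretends db32 is down.
    if host == "db32":
        raise ConnectionFailed("Can't connect to MySQL server (2003)")
    return f"connection to {host}"


host, conn = connect_with_failover(["db32", "db33"], fake_connect)
```

The point of the sketch is that a connection error on one server is caught and the next server is tried, so an error only reaches the caller when no candidate is left.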
Comment 4 Mark A. Hershberger 2011-06-17 00:22:17 UTC
CT, could you provide more details (per Brion's comments) so that Tim has a clue what is actually happening?
Comment 5 Asher Feldman 2011-06-17 00:28:30 UTC
During an 8-minute period on May 29, mysql was down on db32 after being killed by the kernel due to an OOM condition, although the server was otherwise up. During that time en.wikipedia.org was observed to be down hard, and 1209654 messages of the following type were logged:

Sun May 29 2:05:32 UTC 2011     srv230  enwiki  Error connecting to 10.0.6.42:  Lost connection to MySQL server at 'reading initial communication packet', system error: 111
Sun May 29 2:05:32 UTC 2011     srv170  enwiki  Error connecting to 10.0.6.42:  Can't connect to MySQL server on '10.0.6.42' (115)
Sun May 29 2:05:32 UTC 2011     srv175  enwiki  Error connecting to 10.0.6.42: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 (10.0.6.42)

From the message formatting, it appears that all of these messages were logged within DatabaseMysql::open(). 

The connection failures should have occurred within milliseconds and it does appear that the LoadBalancer class should handle such an occurrence with minimal impact. However, LoadBalancer::reportConnectionError only calls wfLogDBError() in the following two ways:


                        wfLogDBError( "LB failure with no last connection\n" );

                        wfLogDBError( "Connection error: {$this->mLastError} ({$server})\n" );

Neither of these messages appears in the dberror log during the enwiki / db32 outage.
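A check like the one described in this comment could be scripted along these lines. This is only a sketch in Python: the regular expression is an assumption about the dberror log layout inferred from the three sample lines quoted above, and the category names (`open`, `loadbalancer`, `other`, `unparsed`) are made up for this example.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp ending in 'UTC YYYY'> <host> <wiki> <message>",
# with fields separated by runs of spaces or tabs.
LINE_RE = re.compile(
    r"^(?P<ts>.+? UTC \d{4})\s+(?P<host>\S+)\s+(?P<wiki>\S+)\s+(?P<msg>.*)$"
)

# Message prefixes taken from the two wfLogDBError() calls quoted above.
LB_PREFIXES = ("Connection error:", "LB failure with no last connection")


def classify(line):
    """Return a coarse category for one dberror log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return "unparsed"
    msg = match.group("msg")
    if msg.startswith("Error connecting to"):
        return "open"
    if msg.startswith(LB_PREFIXES):
        return "loadbalancer"
    return "other"


def tally(lines):
    """Count log lines per category."""
    return Counter(classify(line) for line in lines)


# The three sample lines from this comment.
sample = [
    "Sun May 29 2:05:32 UTC 2011     srv230  enwiki  Error connecting to 10.0.6.42:  Lost connection to MySQL server at 'reading initial communication packet', system error: 111",
    "Sun May 29 2:05:32 UTC 2011     srv170  enwiki  Error connecting to 10.0.6.42:  Can't connect to MySQL server on '10.0.6.42' (115)",
    "Sun May 29 2:05:32 UTC 2011     srv175  enwiki  Error connecting to 10.0.6.42: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 (10.0.6.42)",
]
counts = tally(sample)
```

On these three lines every entry falls into the `open` category and none into `loadbalancer`, which matches the observation that the LoadBalancer messages are absent for this outage.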
Comment 6 Tim Starling 2011-06-17 00:44:24 UTC
(In reply to comment #3)
> Apache/PHP doesn't select databases; MediaWiki's LoadBalancer class does. It
> should already be failing over to the next available server in the case of a
> connection error (MySQL error 2003 is "can't connect").

Yes, LoadBalancer has failover code. During the downtime on May 24, the appropriate sort of connection error was logged:

Tue May 24 13:41:01 UTC 2011	srv191	rowiki	Connection error: No working slave server: No working slave server: Unknown error ()

This error indicates that the fallback sequence was exhausted, so whatever is going on, it's clear that we're not just letting connection error exceptions leak out of LoadBalancer. There were 4775 instances on that day.

If we get an error in Database::query(), then we don't close the connection and switch over to another database. I'm not sure if that's what CT is asking for. We don't appear to properly log "MySQL server has gone away" or "Lost connection to MySQL server during query" errors. They are dealt with by automatically reconnecting, and then if the reconnection fails, a DBConnectionError would probably be thrown.
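The reconnect-then-throw behaviour described in the paragraph above might look roughly like this. It is a hedged Python sketch, not the actual Database::query() code: `LostConnection`, `FlakyConn`, `query_with_reconnect`, and the `log` callback are hypothetical names, and the logging call illustrates the kind of record that this comment says does not appear to be written properly today.

```python
class DBConnectionError(Exception):
    """Stand-in for the exception thrown when no connection can be made."""


class LostConnection(Exception):
    """Hypothetical error for 'server has gone away' / 'lost connection during query'."""


def query_with_reconnect(conn, sql, reconnect, log):
    """Run a query, reconnecting once if the connection was lost.

    `reconnect` returns a fresh connection or raises DBConnectionError;
    `log` is any callable that accepts a message string.
    """
    try:
        return conn.query(sql)
    except LostConnection as exc:
        # Log the lost connection before retrying.
        log(f"Lost connection during query: {exc}")
        # If this raises DBConnectionError, it propagates to the caller.
        conn = reconnect()
        return conn.query(sql)


class FlakyConn:
    """Fake connection for the example; optionally fails on its first query."""

    def __init__(self, fail_first):
        self.fail_first = fail_first

    def query(self, sql):
        if self.fail_first:
            self.fail_first = False
            raise LostConnection("MySQL server has gone away")
        return f"rows for {sql}"


messages = []
result = query_with_reconnect(
    FlakyConn(True), "SELECT 1", lambda: FlakyConn(False), messages.append
)
```

Note that this sketch retries on a new connection to whatever `reconnect` returns; it does not by itself switch to a different server, which is the gap this comment points out.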

There were 162,973 instances of "LB failure with no last connection" on May 24, which is somewhat concerning. But it's hard to know if that's the problem of interest. 

What we really need to know is: when exactly was this "site failure", and what were the observed symptoms of it?
Comment 7 ctwoo 2011-06-17 04:28:08 UTC
On that day, I noticed the EN site went down and was serving the error page when I tried to access en.wikipedia.org. About 7 minutes later, when the db server came back up, the site came up as well.
Asher's earlier comment has more detail on the failure.
Tim, I noticed you were referring to a problem on May 24. The problem in this report happened on May 29.
Comment 8 Asher Feldman 2011-06-17 05:08:27 UTC
The errors around this outage begin at Sun May 29 2:05:21 UTC 2011 and end at Sun May 29 2:19:21 UTC 2011; see dberror.log-20110529.gz.
Comment 9 Tim Starling 2011-06-17 07:14:09 UTC
I tracked it down to r75343.
Comment 10 Tim Starling 2011-06-20 00:48:35 UTC
The fix in r90423 is now deployed.
