Last modified: 2014-04-20 15:35:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T56934, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 54934 - Wikimedia Labs database replication has seemingly stopped (s1 and s2?)
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-10-03 20:11 UTC by Liangent
Modified: 2014-04-20 15:35 UTC (History)
CC List: 7 users


Description Liangent 2013-10-03 20:11:43 UTC
For the wikis listed in https://noc.wikimedia.org/conf/s2.dblist, max(rc_timestamp) is around 20131003170000 in most cases.

Other wikis seem fine.
Comment 1 Betacommand 2013-10-06 02:19:40 UTC
The problem has spread to at least enwiki.
Comment 2 MZMcBride 2013-10-06 03:00:03 UTC
Confirmed the issue:

MariaDB [zhwiki_p]> select max(rc_timestamp) from recentchanges\G
*************************** 1. row ***************************
max(rc_timestamp): 20131003170159
1 row in set (0.04 sec)

MariaDB [enwiki_p]> select max(rc_timestamp) from recentchanges\G
*************************** 1. row ***************************
max(rc_timestamp): 20131004074947
1 row in set (0.03 sec)

It seems database replication is broken. Is replication lag logged/graphed anywhere?

Copying Sean and Ryan L. here. I think Asher previously worked on Labs' database replication, but he's gone. I'm not sure who the new maintainer is.
Comment 3 Sean Pringle 2013-10-07 01:41:08 UTC
The relevant sanitarium (upstream) replication had stopped due to a lock wait timeout caused by a slow audit process. The issue has been fixed and labsdbs should catch up quickly.

Also found the icinga replication check for our mysql_multi_instance class in puppet is unreliable. Switching it over to the pt-heartbeat method used by the core dbs...
Comment 4 Sean Pringle 2013-10-07 02:46:26 UTC
jeremyb pointed out in IRC that I missed the question on replag.

The replag graph mysql_slave_lag is not set up for the sanitarium hosts. It can be done as part of the same general fix I mentioned in comment #3.

Don't know the ganglia situation on labsdb. Marc might. FWIW a replag graph on labs in this case would not have shown anything as the problem was upstream. Something graphing replication rate, rather than lag, would have been useful.
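
To illustrate the distinction between rate and lag, here is a minimal sketch of a rate measurement. It assumes you sample max(rc_timestamp) on the replica twice, a known number of wall-clock seconds apart, and that the wiki receives edits continuously. The function names are hypothetical and not part of any existing Labs tooling.

```python
from datetime import datetime, timezone


def parse_mw_timestamp(ts: str) -> datetime:
    """Parse a MediaWiki-style timestamp (YYYYMMDDHHMMSS, UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replication_rate(ts_before: str, ts_after: str, wall_seconds: float) -> float:
    """Seconds of replicated history applied per wall-clock second.

    Hypothetical helper: ts_before and ts_after are two samples of
    max(rc_timestamp) taken wall_seconds apart on the same replica.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    advanced = parse_mw_timestamp(ts_after) - parse_mw_timestamp(ts_before)
    return advanced.total_seconds() / wall_seconds
```

Under those assumptions, a rate near 0 means replication is stalled, a rate near 1 means the replica is keeping pace, and a rate above 1 means it is catching up.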
Comment 5 Liangent 2013-10-07 02:52:46 UTC
(In reply to comment #4)
> FWIW a replag graph on
> labs in this case would not have shown anything as the problem was upstream.
> Something graphing replication rate, rather than lag, would have been useful.

From a DBA's point of view, this is true; from a practical point of view, a graph of the difference between the latest recentchange entry's timestamp and the current timestamp would be useful enough, assuming there're always edits happening on the wiki.
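
The value such a graph would plot could be computed roughly as follows. This is only a sketch: it assumes the result of `select max(rc_timestamp) from recentchanges` has already been fetched as a string, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_mw_timestamp(ts: str) -> datetime:
    """Parse a MediaWiki-style timestamp (YYYYMMDDHHMMSS, UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replica_lag_seconds(max_rc_timestamp: str, now: Optional[datetime] = None) -> float:
    """Approximate lag: current time minus the newest recentchanges entry.

    Hypothetical helper; only meaningful if the wiki is edited
    frequently, since a quiet wiki would look "lagged" as well.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parse_mw_timestamp(max_rc_timestamp)).total_seconds()
```

Taking the posting time of comment 2 (2013-10-06 03:00:03 UTC) as an approximate "now", the values reported there give about 2.4 days for zhwiki and about 1.8 days for enwiki.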
Comment 6 jeremyb 2013-10-07 03:01:23 UTC
(In reply to comment #5)
> a graph of the difference
> between the latest recentchange entry's timestamp and the current timestamp
> would be useful enough, assuming there're always edits happening on the wiki.

We can probably do better than that. There's a heartbeat DB visible (at least on enwiki.labsdb) and we can probably open that up for everyone to read and graph it.
Comment 7 Betacommand 2014-04-20 12:10:12 UTC
enwiki replication is over two days behind
Comment 8 Tim Landscheidt 2014-04-20 14:01:50 UTC
(In reply to Betacommand from comment #7)
> enwiki replication is over two days behind

As this is a different issue, I've filed bug #64154 for that.
