Last modified: 2014-08-22 05:09:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T40945, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 38945 - Need a way to simulate replication lag to test replag issues
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops, performance
Depends on:
Blocks: db-repl-track 46716 51731 54579
Reported: 2012-08-02 08:38 UTC by Niklas Laxström
Modified: 2014-08-22 05:09 UTC (History)
CC List: 13 users


Description Niklas Laxström 2012-08-02 08:38:36 UTC
We really need an easy place or way to test for possible replication lag issues in core and extensions. Currently it is just rolling the dice, with a (relatively) slow deployment cycle.
Comment 1 Tim Starling 2012-08-02 23:14:47 UTC
You mean like a labs instance with MediaWiki and two instances of MySQL on it, a master and an artificially delayed slave?
Comment 2 Niklas Laxström 2012-08-03 09:48:30 UTC
Something like that, yes. I don't know how to implement such a thing technically, but I once read that some MySQL versions (probably not ours) have a configuration option to add replication delay.

The features I am looking for are:
* artificially increased delay to make it easier to catch the issues
* easy access for developers - should not be necessary to have someone else approve and deploy your commits while testing possible fixes
Comment 3 Andre Klapper 2012-12-31 15:07:44 UTC
Sounds like something for Wikimedia to me, but not necessarily MediaWiki codebase.
Comment 4 Krinkle 2013-04-09 02:16:42 UTC
What kind of bugs would we catch with this?

Having the environment is one thing, but what kind of tests do you have in mind (and how would they be run)?
Comment 5 Krinkle 2013-04-09 02:18:50 UTC
Filed under continuous integration for now.

Depending on the kind of issues you want to test for and how it is implemented, it may be more suitable to have QA test this from the outside instead of with PHPUnit from Jenkins.
Comment 6 Niklas Laxström 2013-04-09 07:33:17 UTC
I was more concerned about actually reproducing the issues reliably, being able to debug them easily to understand the causes, and coming up with and testing fixes without going through gerrit. I'm doubtful that you can do that with QA tests.
Comment 7 Krinkle 2013-04-10 06:33:27 UTC
Right, so if I understand you correctly, you're looking for an environment where you can work on fixing bugs and testing bugfixes related to replication lag.

In other words, a wiki (say, lagged.wikipedia.beta.wmflabs.org) to do things with (as a human being).

Not a build step for a continuous integration environment. Not a test suite for MediaWiki core.

If so, let's move this to a feature request for labs: set up a wiki there that is artificially lagged.
Comment 8 Siebrand Mazeland 2013-10-31 10:44:30 UTC
Adding some information from an email by Sean:

Nothing built into MariaDB 5.5, but Percona Toolkit has a decent tool:

http://www.percona.com/doc/percona-toolkit/2.2/pt-slave-delay.html

However, its usefulness will depend on how accurate the delay needs to be. The tool starts and stops the replication SQL thread predictably, but the minimum time granularity is one transaction, which fluctuates, obviously.

Essentially a delay on the order of minutes is easy to maintain. Seconds... sort of.

Oracle's MySQL 5.6 has slave delay built in using CHANGE MASTER TO MASTER_DELAY = <seconds>. The next MariaDB major release may get that port (haven't checked), but that doesn't help us today.
 

      - the DB are Ubuntu Lucid instances with MySQL installed manually (aka no puppet class applied)


Ubuntu has percona toolkit packages in our repos. At least coredb have them installed by default. Only depends on perl.
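
As an illustration of the MySQL 5.6 option mentioned above, the statements involved can be sketched as follows. This is a minimal sketch, not a tested procedure: the helper name master_delay_statements is hypothetical, it only builds the SQL text and never connects to a server, and it assumes replication is stopped before CHANGE MASTER TO is issued, as MySQL 5.6 requires.

```python
def master_delay_statements(delay_seconds):
    """Return the SQL statements that would set a replication delay.

    Hypothetical helper for illustration only: it builds strings and
    does not talk to any database server.
    """
    # Reject bool explicitly, since bool is a subclass of int in Python.
    if isinstance(delay_seconds, bool) or not isinstance(delay_seconds, int):
        raise TypeError("delay_seconds must be an integer number of seconds")
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")
    return [
        "STOP SLAVE;",  # assumption: replication is stopped before the change
        f"CHANGE MASTER TO MASTER_DELAY = {delay_seconds};",
        "START SLAVE;",
    ]


if __name__ == "__main__":
    # Print the statements for a 30 second delay.
    for statement in master_delay_statements(30):
        print(statement)
```

The statements would have to be run on the replica itself. As noted above, MariaDB 5.5 has no such option, so this only applies to a MySQL 5.6 replica.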
Comment 9 Niklas Laxström 2013-11-02 20:04:58 UTC
Labs would be nice, but something that allows debugging and tweaking of the code would be even nicer. I wonder if it would be possible to do this with MediaWiki-Vagrant.
Comment 10 Faidon Liambotis 2013-11-04 13:30:03 UTC
I doubt it but I know little about Vagrant; adding Ori to the loop.
Comment 11 Antoine "hashar" Musso (WMF) 2014-08-21 13:25:51 UTC
When we migrated the beta cluster from pmtpa to eqiad, Sean Pringle added a master / slave setup on beta.   Apparently the slave is usually laggy.

It seems to be possible to make it always lagged. Someone can reach out to Sean to figure out how to make it happen.
Comment 12 Niklas Laxström 2014-08-21 14:46:11 UTC
As far as I know, the replication delay setting only exists in recent MySQL [1] and not in MariaDB, unless the feature has been added recently.

[1] I created a three server setup with replication manually. Unfortunately it did not survive an upgrade so it was broken before we got the chance to use it.
Comment 13 Antoine "hashar" Musso (WMF) 2014-08-21 14:52:51 UTC
That might depend on the percona toolkit plus some custom setup. I think production has slaves which have a 24-hour delay.

Someone should talk about it with Sean Pringle.
Comment 14 Sean Pringle 2014-08-22 05:08:39 UTC
Production has slaves delayed by 24h using the MariaDB event scheduler [1] to start/stop the replication threads. This is fine for coarse lag values of a few minutes, but inaccurate for anything less.

The MySQL 5.6 CHANGE MASTER TO MASTER_DELAY = N; (seconds) can be more accurate, to roughly 10s, but is still highly dependent on the traffic generating the replicated events. Have also not seen it in action on our traffic, so... pinch of salt.

A series of 10+ second writes such as our periodic bot update/delete traffic on recentchanges or links can confuse both methods for short delays, with lag cycling between 0 and N*2.

It might be possible to achieve finer granularity on the beta slave by interleaving something like FLUSH TABLES WITH READ LOCK on another thread (or another event) to ensure the slave thread does not catch up so easily.
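
To show why start/stop based delays are coarse, here is a toy simulation in Python. It is only a sketch under stated assumptions, not a model of pt-slave-delay or of the production event scheduler setup: the master commits one event per second, the replica applies up to apply_rate events per second while its SQL thread runs, and a hypothetical controller checks the lag every check_interval seconds, stopping the thread when the lag is below target_lag and starting it otherwise. All names and numbers are made up for illustration.

```python
def simulate_toggle_delay(target_lag, check_interval, apply_rate, duration):
    """Simulate replica lag under a naive start/stop controller.

    Toy model: event k is committed on the master at second k, so the
    lag in seconds equals the number of events not yet applied.
    Returns the lag observed at the end of each simulated second.
    """
    applied = 0      # index of the last event applied on the replica
    running = True   # whether the replica SQL thread is running
    lags = []
    for now in range(1, duration + 1):
        # The controller only looks at the lag every check_interval seconds.
        if now % check_interval == 0:
            running = (now - applied) >= target_lag
        # While running, the replica catches up at apply_rate events/second.
        if running:
            applied = min(now, applied + apply_rate)
        lags.append(now - applied)
    return lags


if __name__ == "__main__":
    lags = simulate_toggle_delay(
        target_lag=10, check_interval=5, apply_rate=3, duration=60
    )
    print("min lag:", min(lags), "max lag:", max(lags))
    print(lags)
```

With these made-up numbers the lag is never held at the target. It swings between 0 and 10 seconds as a sawtooth, because the replica catches up completely each time the thread is started. Real traffic with long-running writes would be less regular than this.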
