Last modified: 2005-12-10 18:01:35 UTC
Special:Watchlist appears to have stopped working at about 11:00 CST.
DB server lags, removed from replication
Watchlists have stopped again at about noon.
Will catch up shortly.
For background, there are three main causes of this:
1. Too much load on the slave, so it can't keep up with replication while it
handles queries. In this situation, we adjust load by changing how much search
we turn off. If a slave gets significantly behind, we turn off search for
some of the big wikis using that slave so it can catch up more quickly.
2. An operating system version-related issue on the slave Bacon, which causes it
to stop replicating. We can't risk losing Bacon at present so we can't try
different operating system versions yet. Because we get fast reports of this
problem from en, we have this machine set to serve en and Zh wikipedias. The
rest are normally unaffected by this issue with the current setup, though in
the past any could be affected. The split is mainly for performance reasons - we
just had to choose which wikis got the one with the problem.
3. Any other operation which causes replication to stop. There is a wide range
of possibilities. This is less common than 1 or 2.
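To illustrate the load adjustment described in cause 1, here is a minimal
sketch of choosing which wikis lose search as a slave falls behind. It is not
the actual tooling: the Wiki record, the search_load figures, the function name
and both thresholds are hypothetical and only show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass
class Wiki:
    name: str
    search_load: float  # hypothetical relative share of search queries


def wikis_to_disable_search(lag_seconds, wikis, mild_lag=30, severe_lag=300):
    """Return names of wikis whose search should be turned off.

    The thresholds are illustrative assumptions, not production values.
    """
    if lag_seconds < mild_lag:
        # Slave is keeping up: leave search on everywhere.
        return []
    # Shed the heaviest search users first.
    by_load = sorted(wikis, key=lambda w: w.search_load, reverse=True)
    if lag_seconds < severe_lag:
        # Mildly behind: turn off search for the single biggest wiki.
        return [w.name for w in by_load[:1]]
    # Significantly behind: turn off search for the bigger half.
    count = max(1, len(by_load) // 2)
    return [w.name for w in by_load[:count]]


wikis = [Wiki("en", 0.6), Wiki("de", 0.2), Wiki("zh", 0.1), Wiki("fr", 0.1)]
calm = wikis_to_disable_search(10, wikis)
mild = wikis_to_disable_search(60, wikis)
severe = wikis_to_disable_search(600, wikis)
```

Shedding the biggest wikis first matches the approach above: they account for
most of the query load, so the slave catches up fastest when they go quiet.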
For 1 and 2, on 14 October 2004 we ordered two more database slaves to add to
the two we have. They are being set up now, after delays both at the vendor, due
to a compatibility issue, and because our install person was unavailable. The new
ones have a different operating system version from Bacon and will confirm
whether that resolves the problem Bacon is having, as well as giving us enough
excess capacity to risk losing Bacon for a while if there is a problem while
switching it to that version.
The Bacon problem is still around but has been worked around with a modification
to servmon which automatically corrects the problem. It's seen less often on the
new system with the later operating system version, only once so far.
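As a rough sketch of what an automatic correction like this involves, a
watchdog can look at the slave's replication status and pick an action. This is
not servmon's actual code. The field names are those reported by MySQL's SHOW
SLAVE STATUS (Seconds_Behind_Master may be missing on older versions); the
function name, the action labels and the 60 second limit are assumptions made
for illustration.

```python
def replication_action(status, max_lag_seconds=60):
    """Decide what a watchdog should do for one slave.

    `status` is a dict of SHOW SLAVE STATUS fields. The returned labels
    are hypothetical; a real monitor would map them to commands.
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    if not (io_ok and sql_ok):
        # Replication has stopped (causes 2 and 3): try to restart it.
        return "restart_replication"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # Threads are running but lag is not reported: nothing to act on.
        return "unknown"
    if lag > max_lag_seconds:
        # Replicating but behind (cause 1): shed query load instead.
        return "reduce_load"
    return "ok"


stopped = replication_action(
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No"}
)
lagging = replication_action(
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
     "Seconds_Behind_Master": 900}
)
healthy = replication_action(
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
     "Seconds_Behind_Master": 2}
)
```

Keeping "stopped" separate from "lagging" matters because the fixes differ:
a stopped slave needs replication restarted, a lagging one needs less load.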
The two new database servers have reduced the general lag problems. Search is
now on full time at full rate. Some MediaWiki 1.4 issues (changed queries) which
can cause lag are still being identified and dealt with - either with querybane
rules or programming changes in MediaWiki.
Two common causes of significant lag have been removed: special page updating is
now done on a separate server that is not in service, and copied in without
significant lag. Searchindex updating is also done while slaves are offline and
no longer causes lag.
Guess this is no longer an issue. ;)
*** Bug 2637 has been marked as a duplicate of this bug. ***