Last modified: 2005-12-10 18:01:35 UTC
Special:Watchlist appears to have stopped working at about 11:00 CST.
DB server lags, removed from replication
Watchlists have stopped again at about noon.
Will catch up shortly.
For background, there are three main causes of this:

1. Too much load on the slave, so it can't keep up with replication while it handles queries. In this situation, we manage load by varying how much search we turn off. If the slaves get significantly behind, we turn off search for some of the big wikis using that slave so it can catch up more quickly.

2. An operating system version-related issue on the slave Bacon, which causes it to stop replicating. We can't risk losing Bacon at present, so we can't try different operating system versions yet. Because we get fast reports of this problem from en, we have this machine set to serve the en and zh Wikipedias. The rest are normally unaffected by this issue with the current setup, though in the past any could be affected. The split is mainly for performance reasons; we just had to choose which wikis got the slave with the problem.

3. Any other operation which causes replication to stop. There is a wide range of possibilities. This is less common than 1 or 2.

For 1 and 2, on 14 October 2004 we ordered two more database slaves to add to the two we have. They are being set up now, after delays caused by a compatibility issue at the vendor and by our install person being unavailable. The new ones have a different operating system version from Bacon, so they will confirm whether that version resolves the problem Bacon is having. They will also give us enough excess capacity to risk losing Bacon for a while if there is a problem while switching it to that version.
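The load adjustment described under cause 1 could be sketched roughly as follows. This is only an illustration, not the procedure actually used: the function name, the 60-second step, and the idea of disabling one more wiki per step of lag are all assumptions made for the example.

```python
def search_wikis_to_disable(lag_seconds, wikis_largest_first, step_seconds=60):
    """Pick which wikis should have search turned off on a lagging slave.

    Hypothetical policy: for every `step_seconds` of replication lag,
    turn off search for one more wiki, starting with the largest.
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    # Negative or zero lag means the slave is caught up: disable nothing.
    steps = max(0, int(lag_seconds)) // step_seconds
    count = min(len(wikis_largest_first), steps)
    return list(wikis_largest_first[:count])
```

With a wiki list ordered largest first, a small lag leaves search fully on, and a growing lag turns it off for the biggest wikis first, which matches the intent of letting the slave catch up more quickly.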
The Bacon problem is still around but has been worked around with a modification to servmon which automatically corrects the problem. It's seen less often on the new system with the later operating system version, only once so far. The two new database servers have reduced the general lag problems. Search is now on full time at full rate. Some MediaWiki 1.4 issues (changed queries) which can cause lag are still being identified and dealt with, either with querybane rules or programming changes in MediaWiki. Two common causes of significant lag have been removed: special page updating is now done on a separate server that is not in service, and the results are copied in without significant lag. Searchindex updating is also done while slaves are offline and no longer causes lag.
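As a rough illustration of the kind of check an automatic workaround might make, the sketch below decides what to do from the output of MySQL's SHOW SLAVE STATUS. It is not servmon's actual logic, which is not described here. The function name, the action strings, and the 60-second threshold are assumptions; the field names are the standard SHOW SLAVE STATUS columns, and Seconds_Behind_Master is only available on MySQL versions that report it.

```python
def replication_action(status, max_lag_seconds=60):
    """Decide what a monitor should do for one slave.

    `status` is a dict holding one row of SHOW SLAVE STATUS.
    Returns a hypothetical action name: "restart_replication",
    "reduce_load", or "ok".
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    if not (io_running and sql_running):
        # A replication thread has stopped, as in the Bacon case.
        return "restart_replication"

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # The server does not report a lag value; treat as stopped.
        return "restart_replication"
    if lag > max_lag_seconds:
        # Replicating but behind: shed load rather than restart.
        return "reduce_load"
    return "ok"
```

Separating "stopped" from "running but behind" mirrors the distinction between causes 1 and 2 above: the first calls for less load, the second for restarting replication.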
Guess this is no longer an issue. ;)
*** Bug 2637 has been marked as a duplicate of this bug. ***