Last modified: 2014-09-22 22:05:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than bug reports and their history may be broken. See T22085, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 20085 - [scap] [l10n] Atomic updates for sync scripts
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Deployment systems (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Low enhancement (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: deploysprint-13
Depends on: 27294
Blocks:
Reported: 2009-08-05 23:55 UTC by Brion Vibber
Modified: 2014-09-22 22:05 UTC
CC: 15 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Brion Vibber 2009-08-05 23:55:07 UTC
Currently when we make site software updates with scap, sync-common-all, etc the web servers are still running while they work.

This has the unfortunate side effect that a portion of web requests will come in to a server whose copy of MediaWiki is only partially updated, which can cause transient but very scary-looking errors. A common type of error occurs when files in different directories both change and depend on each other; this is especially problematic with skin files, since skins may be synced out ahead of time, and can toss up big scary PHP fatal errors or exceptions.


We want the updates to be atomic, so any given request will get _either_ the old deployment version _or_ the new version, but never a mix.

There are two main ways we could implement this:

1) Shut down Apache before rsync, restart it after.

Simple, but could make updates slower, or leave us with most machines out of service simultaneously for a minute or two.

2) rsync to a staging directory, then swap the entire thing out for the live one.

(I'm not sure if it's possible to totally atomically swap out two directories in posix semantics.)

or maybe also

3) rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart.

This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)


(Another thing to consider might be keeping the 'live' skin and extension JS/CSS files in a separate subdir, so we can update those en masse first with no code safety issues, then run the code updates -- atomic per server -- guaranteeing we'll have the new css/JS on all new hits.)
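
Option 3 could be sketched as an Apache configuration fragment like the following; the versioned path is hypothetical, not the actual cluster layout:

```apacheconf
# Hypothetical httpd.conf fragment: the docroot points at a versioned
# copy of the tree. A deploy rsyncs the new revision into a sibling
# directory, rewrites this path, and runs "apachectl graceful" so each
# worker finishes its current request before picking up the new root.
DocumentRoot "/usr/local/apache/common-local-r12345"
```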
Comment 1 Tomasz Finc 2009-08-06 00:04:42 UTC
For any of these we would need strict control over how many machines are being activated at any point in time, along with the interval until the next block of activations, in order to minimize brownouts/blackouts.

Comment 2 Domas Mituzas 2009-08-06 13:16:03 UTC
Atomic scaps are a much, much smaller issue than having failed syncs or outdated trees running in the cluster for days.
Comment 3 Platonides 2011-10-25 21:53:45 UTC
> (I'm not sure if it's possible to totally atomically swap out two directories
> in posix semantics.)

I don't think you can do so, but you can atomically replace a symlink so that it magically points to a different directory (I am using that approach in a scap script).

> or maybe also
> 
> 3) rsync to a staging directory, then swap which directory we refer to in the
> .conf files and do an apachectl graceful restart.

This looks even better: as long as the disks don't fill up from the copying, it is easy to change that configuration entry, and anyone doing a deployment can do a graceful restart (no root-only errors).

> This would avoid holes in response time, but we may have a magical moving
> directory which could be confusing madness. :)

The directory could be named by the revision, so it looks logical.
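
The atomic symlink replacement described here can be sketched as follows; the /tmp paths and revision names are hypothetical demo values, and `mv -T` is the GNU coreutils spelling:

```shell
# Set up two revision directories and a "current" symlink (demo paths).
rm -rf /tmp/deploy
mkdir -p /tmp/deploy/r100 /tmp/deploy/r101
ln -s /tmp/deploy/r100 /tmp/deploy/current

# Build the new symlink beside the old one, then rename it over the old
# one. rename(2) replaces the destination atomically, so a concurrent
# reader sees either the old target or the new one, never a missing link.
ln -s /tmp/deploy/r101 /tmp/deploy/current.tmp
mv -T /tmp/deploy/current.tmp /tmp/deploy/current

readlink /tmp/deploy/current   # -> /tmp/deploy/r101
```

Naming the revision directories explicitly (as suggested above) keeps the moving target legible: `current` is the only thing that ever changes.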
Comment 4 Ori Livneh 2013-06-26 01:10:44 UTC
Today's mobile deployment removed SkinMobileBase.php from MobileFrontend, a file that was previously loaded on every request. This provided an illustration of the problem and gives some sense of its magnitude.

MaxSem started scap at 21:16. Between 21:18:39 and 21:26:24 we had a total of 65 fatals caused by SkinMobileBase.php having been deleted prior to the calling code receiving the update:

[25-Jun-2013 23:26:18] Fatal error: require() [<a href='function.require'>function.require</a>]: Failed opening required '/usr/local/apache/common-local/php-1.22wmf7/extensions/MobileFrontend/includes/skins/SkinMobileBase.php' (include_path='/usr/local/apache/common-local/php-1.22wmf7/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1.22wmf7:/usr/local/lib/php:/usr/share/php') at /usr/local/apache/common-local/php-1.22wmf7/includes/AutoLoader.php on line 1155

Of mw* hosts in the mediawiki-installation group, 172 had no errors, 30 had one fatal, 16 had 2 fatals, and one host had three fatals.

Using rsync's --delete-after or --delete-delay option would not make scap atomic, but it could still significantly reduce the rate at which these kinds of errors occur.
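
The deletion-timing difference can be illustrated locally; all of the paths below are made up for the demo:

```shell
# Build a source tree with a new file and a destination that still has
# a file slated for removal.
rm -rf /tmp/rsync-src /tmp/rsync-dst
mkdir -p /tmp/rsync-src /tmp/rsync-dst
echo 'new code' > /tmp/rsync-src/new.php
echo 'old code' > /tmp/rsync-dst/removed.php

# With plain --delete, removed.php could disappear before new.php has
# arrived; --delete-delay queues deletions and performs them only after
# the whole transfer finishes, narrowing that window.
rsync -a --delete-delay /tmp/rsync-src/ /tmp/rsync-dst/
ls /tmp/rsync-dst
```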
Comment 5 Greg Grossmeier 2014-02-21 17:01:03 UTC
(In reply to Brion Vibber from comment #0)
> There's two main ways we could implement this:
> 
> 1) Shut down Apache before rsync, restart it after.
>  
> 2) rsync to a staging directory, then swap the entire thing out for the live
> one.
>  
> or maybe also
> 
> 3) rsync to a staging directory, then swap which directory we refer to in
> the .conf files and do an apachectl graceful restart.

The simple fix for this that is currently in use is passing the --delay-updates option to rsync (when called from scap). This makes the rsync to any given Apache copy the files to a temporary directory and then switch them over as the last step (basically, Brion's suggestion #2).

There's also ideas on cluster wide atomicity, but I'm calling that out of scope for this bug :).

Is --delay-updates good enough for this bug for now?
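
A local sketch of what --delay-updates does on the receiving side; the paths are hypothetical demo values:

```shell
# Stage two "updated" files and an empty live tree.
rm -rf /tmp/stage /tmp/live
mkdir -p /tmp/stage /tmp/live
echo 'v2' > /tmp/stage/a.php
echo 'v2' > /tmp/stage/b.php

# With --delay-updates, each changed file is written to a temporary
# directory (.~tmp~/ by default) on the receiver and renamed into place
# only at the end of the transfer, so the changed files on a given host
# flip over almost simultaneously. Note this is per host, not
# cluster-wide atomicity.
rsync -a --delay-updates /tmp/stage/ /tmp/live/
cat /tmp/live/a.php   # -> v2
```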
Comment 6 Bryan Davis 2014-02-21 17:08:00 UTC
One thing --delay-updates doesn't fix is that the l10n cache isn't rebuilt on each node until after all nodes get the new code. This leads to errors like those seen yesterday.
Comment 7 James Forrester 2014-04-10 23:29:31 UTC
This bug not yet being fixed caused bug 63791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.
Comment 8 Andre Klapper 2014-04-10 23:50:22 UTC
Greg: Should this have higher priority and an assignee set?
Comment 9 Bryan Davis 2014-04-10 23:55:20 UTC
Fixing it requires rewriting the deployment system, and this requirement is on the list of known issues for that work.


