Last modified: 2014-09-22 22:05:49 UTC
Currently, when we make site software updates with scap, sync-common-all, etc., the web servers keep running while the sync is in progress. This has the unfortunate side effect that a portion of web requests will come in to a server whose copy of MediaWiki is only partially updated, which can cause transient but very scary-looking errors.

A common type of error is where files in different directories are both changed and depend on each other. This is especially problematic with skin files, since skins may be synced out ahead of time; the result can be big scary PHP fatal errors or exceptions.

We want the updates to be atomic, so any given request will get _either_ the old deployment version _or_ the new version, but never a mix.

There's two main ways we could implement this:

1) Shut down Apache before rsync, restart it after. Simple, but could make updates slower, or leave us with most machines out of service simultaneously for a minute or two.

2) rsync to a staging directory, then swap the entire thing out for the live one. (I'm not sure if it's possible to totally atomically swap out two directories in posix semantics.)

or maybe also

3) rsync to a staging directory, then swap which directory we refer to in the .conf files and do an apachectl graceful restart. This would avoid holes in response time, but we may have a magical moving directory which could be confusing madness. :)

(Another thing to consider might be keeping the 'live' skin and extension JS/CSS files in a separate subdir, so we can update those en masse first with no code safety issues, then run the code updates, atomic per server, guaranteeing we'll have the new CSS/JS on all new hits.)
For any of these we would need strict control over how many machines are being activated at any point in time, along with the interval before the next block of activations, in order to minimize brownouts and blackouts.
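The batching constraint described above can be sketched roughly as follows. This is only an illustration under assumptions: `batches`, `rolling_activate`, the `activate` callback, and the host names in the usage note are all hypothetical and not part of scap.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_activate(
    hosts: Sequence[str],
    batch_size: int,
    interval_s: float,
    activate: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Activate hosts one batch at a time, pausing between batches.

    `activate` is a hypothetical per-host callback (for example, the
    step that restarts or switches a single web server). Returns the
    number of batches processed.
    """
    processed = 0
    for group in batches(hosts, batch_size):
        if processed:
            # Wait before touching the next block of machines.
            sleep(interval_s)
        for host in group:
            activate(host)
        processed += 1
    return processed
```

For instance, `rolling_activate(["mw1", "mw2", "mw3"], 2, 30, print)` would handle two hosts, wait 30 seconds, then handle the third, so only `batch_size` machines are ever in flux at once.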
Atomic scap is a much, much smaller issue than having failed syncs or outdated trees running in the cluster for days.
> (I'm not sure if it's possible to totally atomically swap out two directories
> in posix semantics.)

I don't think you can do so, but you can atomically replace a symlink so that it magically points to a different directory (I am using that approach in a scap script).

> or maybe also
>
> 3) rsync to a staging directory, then swap which directory we refer to in the
> .conf files and do an apachectl graceful restart.

This looks even better, as long as the disks don't fill up from the extra copy: it is easy to change that configuration entry, and anyone doing a sync can do a graceful restart (no root-only errors).

> This would avoid holes in response time, but we may have a magical moving
> directory which could be confusing madness. :)

The directory could be named by the revision, so it looks logical.
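A minimal sketch of the symlink approach mentioned above, assuming a POSIX filesystem: create a temporary symlink next to the live one, then rename it over the live link. The rename replaces the link itself in one step, so a lookup sees either the old target or the new one. The function names and directory names (`php-old`, `php-new`, `live`) are hypothetical, and this is not the scap script referred to in the comment.

```python
import os
import tempfile


def swap_symlink(link_path: str, new_target: str) -> None:
    """Repoint `link_path` at `new_target` via rename-over-symlink."""
    tmp_link = f"{link_path}.tmp.{os.getpid()}"
    # Clear a stale temporary link left by an earlier failed run.
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(new_target, tmp_link)
    # rename(2) acts on the link itself, not on the directory it points to.
    os.replace(tmp_link, link_path)


def demo() -> str:
    """Build two fake release directories and flip a 'live' link."""
    with tempfile.TemporaryDirectory() as root:
        old_dir = os.path.join(root, "php-old")
        new_dir = os.path.join(root, "php-new")
        os.mkdir(old_dir)
        os.mkdir(new_dir)
        live = os.path.join(root, "live")
        os.symlink(old_dir, live)
        swap_symlink(live, new_dir)
        return os.path.basename(os.readlink(live))
```

Naming each release directory by revision, as suggested above, would make it obvious from the link target which version a server is running. A request that already resolved the old path before the swap would still finish against the old directory, so the old tree should not be deleted immediately.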
Today's mobile deployment removed SkinMobileBase.php from MobileFrontend, a file that was previously loaded on every request. This provides an illustration of the problem and gives some sense of its magnitude.

MaxSem started scap at 21:16. Between 21:18:39 and 21:26:24 we had a total of 65 fatals caused by SkinMobileBase.php having been deleted before the calling code received the update:

[25-Jun-2013 23:26:18] Fatal error: require() [<a href='function.require'>function.require</a>]: Failed opening required '/usr/local/apache/common-local/php-1.22wmf7/extensions/MobileFrontend/includes/skins/SkinMobileBase.php' (include_path='/usr/local/apache/common-local/php-1.22wmf7/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/usr/local/apache/common-local/php-1.22wmf7:/usr/local/lib/php:/usr/share/php') at /usr/local/apache/common-local/php-1.22wmf7/includes/AutoLoader.php on line 1155

Of the mw* hosts in the mediawiki-installation group, 172 had no errors, 30 had one fatal, 16 had two fatals, and one host had three fatals.

Using rsync's --delete-after or --delete-delayed option would not make scap atomic, but it could still significantly reduce the rate at which these kinds of errors occur.
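As a quick cross-check of the figures above, the per-host counts can be tallied directly. The variable names are illustrative only; the numbers are the ones reported in this comment.

```python
# Number of hosts, keyed by how many fatals each host logged.
fatals_per_host = {0: 172, 1: 30, 2: 16, 3: 1}

total_hosts = sum(fatals_per_host.values())
total_fatals = sum(n * hosts for n, hosts in fatals_per_host.items())
affected_hosts = total_hosts - fatals_per_host[0]
affected_share = affected_hosts / total_hosts

print(total_hosts, total_fatals, affected_hosts, f"{affected_share:.1%}")
```

The tally matches the 65 fatals reported: 219 hosts in total, of which 47 (roughly one in five) logged at least one fatal during the sync window.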
(In reply to Brion Vibber from comment #0)
> There's two main ways we could implement this:
>
> 1) Shut down Apache before rsync, restart it after.
>
> 2) rsync to a staging directory, then swap the entire thing out for the live
> one.
>
> or maybe also
>
> 3) rsync to a staging directory, then swap which directory we refer to in
> the .conf files and do an apachectl graceful restart.

The simple fix for this that is currently in use is passing the --delay-updates option to rsync (when called from scap). This makes it so the rsync to any given apache copies the files to a tmp dir, then switches them over as the last step (basically, Brion's suggestion #2).

There are also ideas on cluster-wide atomicity, but I'm calling that out of scope for this bug :).

Is --delay-updates good enough for this bug for now?
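For illustration, an rsync invocation with these options could be assembled as below. This is a sketch, not scap's actual command line: `build_rsync_argv` and the source module in the usage note are hypothetical. `--delay-updates` and `--delete-delayed` are real rsync options (note the plural spelling of the former). With `--delay-updates`, rsync holds updated files aside and renames them into place in quick succession at the end of the transfer, which narrows the window for mixed state without making the whole tree switch in a single atomic step.

```python
from typing import List


def build_rsync_argv(
    source: str,
    dest: str,
    delay_updates: bool = True,
    delete_delayed: bool = True,
) -> List[str]:
    """Assemble an rsync argument list; nothing is executed here."""
    argv = ["rsync", "-a"]
    if delay_updates:
        # Stage updated files, then rename them into place at the end.
        argv.append("--delay-updates")
    if delete_delayed:
        # Remove files missing from the source only after the transfer.
        argv.append("--delete-delayed")
    argv.extend([source, dest])
    return argv
```

For example, `build_rsync_argv("deploy-host::common/", "/usr/local/apache/common-local/")` returns a list that could be handed to `subprocess.run`; the `deploy-host::common/` source is a made-up placeholder.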
One thing --delay-updates doesn't fix is that the l10n cache isn't rebuilt on each node until after all nodes get the new code. This leads to errors like those seen yesterday.
This bug not yet being fixed caused bug 63791 today, the fifth or sixth VE breakage from this that I recall. :-( It'd be really great if we could get it fixed some time soon.
Greg: Should this have higher priority and an assignee set?
It requires re-writing the deployment system and the requirement is on the list of known issues for that work.