Last modified: 2014-09-29 08:48:04 UTC
The beta cluster is serving old Echo JS code, e.g. http://bits.beta.wmflabs.org/static-master/extensions/Echo/modules/overlay/ext.echo.overlay.js isn't the latest file. This makes user and browser tests invalid.
Erik Bernhardson says failures show up in deployment-bastion:/data/project/logs/scap.log and /var/log/syslog starting Aug 14 00:31:27, e.g.:
  Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
  Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync error: error in rsync protocol data stream (code 12) at io.c(1532) [sender=3.0.9]
rsync is failing because the root volume is full on deployment-bastion.eqiad.wmflabs
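A quick way to confirm a full volume is to filter the output of df -P for filesystems at or above a usage threshold. This is a minimal sketch, not something taken from the thread: the helper name full_mounts and the default threshold of 90 are assumptions.

```shell
# Read `df -P` output on stdin and print every mount point whose usage is
# at or above the threshold given as $1 (default 90, an arbitrary choice).
# `df -P` prints one line per filesystem: usage is field 5, mount point field 6.
full_mounts() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # "100%" -> "100"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }'
}

# Usage on a live host (hypothetical threshold): df -P | full_mounts 95
```

Reading from stdin keeps the parsing separate from the df call, so the same filter can be fed output collected from several instances.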
Mukunda: if Antoine doesn't beat you to it, can you take a look at this?
https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Deployment-prep/SAL&curid=9925&diff=123489&oldid=123370 02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G
Moved /srv-old to /mnt/srv-old and freed up 2.1G. scap has resumed its normal schedule. /var is within 100M of having the same problem. I'm still not seeing new code making it from deployment-bastion to deployment-mediawiki01 though, so I'm leaving the bug open.
Erik has freed enough space by moving /srv-old, which was in the root partition. Thank you!
Labs instances in eqiad have a 2GB /var/, which is often not large enough. There is 1.1GB in /var/log :-/ Top offenders:
  538M /var/log/account/
  335M /var/log/atop*.log
  168M /var/log/diamond/
When diamond got enabled on labs, it emitted a full debug log. That was bug 66458 "Service diamond creates 500+ MByte /var/log/diamond/diamond.log". I have manually removed the old large logs. I removed some archived files from /var/log/account/ but that will fill up quickly again.
Follow-up bugs:
* Bug 69601 - Log files on labs instance fill up disk (/var is only 2GB) (tracking)
** Bug 69602 - diamond does not compress its logs
** Bug 69604 - acct (process and login accounting) fill up instances /var/ partition
** Bug 69605 - atop (monitoring system) logs fill up instances /var/ partition
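A "top offenders" list like the one above can be produced with du and sort. The following is only a sketch of one way to do it; the function name top_offenders and the default of three entries are assumptions, not what was actually run on the instance.

```shell
# List the N largest entries directly under a directory, largest first.
# Sizes are printed in kilobytes (du -k) so that `sort -n` orders them reliably.
# $1 = directory to inspect, $2 = number of entries to show (default 3).
top_offenders() {
    du -k -s "$1"/* 2>/dev/null | sort -rn | head -n "${2:-3}"
}

# Usage on a live host: top_offenders /var/log 5
```

Errors from unreadable entries are discarded, so running it as an unprivileged user may under-report some directories.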
There's something else going on as well:
  ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null
  deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
  deployment-jobrunner01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
  deployment-mediawiki01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
  deployment-mediawiki02.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
  deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
  deployment-videoscaler01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125 /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
That change was from yesterday, so I looked up a more recent merge. https://gerrit.wikimedia.org/r/#/c/154278/ was merged 15 min ago and also didn't rsync all the way out:
  ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php 2>/dev/null
  deployment-bastion.eqiad.wmflabs: d52e9791a81870af920eb199494d1795 /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
  deployment-jobrunner01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
  deployment-mediawiki01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
  deployment-mediawiki02.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
  deployment-rsync01.eqiad.wmflabs: d52e9791a81870af920eb199494d1795 /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
  deployment-videoscaler01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
In both cases only deployment-bastion and deployment-rsync01 have the new file; the other hosts still have the old one.
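Comparing checksums by eye gets tedious with more hosts. As a rough sketch, the dsh output above could be piped through a small filter that prints the hosts whose checksum differs from a reference host. The function name stale_hosts is hypothetical, and the filter assumes the "host: checksum path" line format shown in the transcript.

```shell
# Read `dsh -M ... md5sum FILE` output on stdin ("host: checksum path") and
# print the hosts whose checksum differs from that of the reference host ($1).
# If the reference host is absent from the input, every host is reported.
stale_hosts() {
    awk -v ref="$1:" '
        { sum[$1] = $2 ""; order[NR] = $1 }   # "" forces string comparison
        END {
            for (i = 1; i <= NR; i++) {
                h = order[i]
                if (sum[h] != sum[ref]) { sub(/:$/, "", h); print h }
            }
        }'
}

# Usage (reference host is the deployment source):
#   dsh -M -g mediawiki-installation md5sum FILE 2>/dev/null \
#     | stale_hosts deployment-bastion.eqiad.wmflabs
```

An empty result means every host matches the reference.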
Reedy: since Bryan is out, can you look into this?
We deploy using scap, run from the Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ . I have manually changed the job to pass '--verbose' to scap: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17342/console
scap uses deployment-rsync01 as a proxy from which the application servers are instructed to pull. A shortened version of the rsync command executed on deployment-rsync01 is:
  rsync01$ rsync ... deployment-bastion.eqiad.wmflabs::common /usr/local/apache/common-local
And:
  rsync01$ readlink -f /usr/local/apache/common-local
  /usr/local/apache/common-local
That copy is up to date. rsync01 also has a /srv/common-local directory which is out of date. The most recent file I found there is from August 13th 21:13 UTC (there might be one a bit more recent). I suspect the apaches sync from /srv/common-local instead of /usr/local/apache/common-local, or that /usr/local/apache/common-local should be a symlink to /srv/common-local.
Running puppet on deployment-mediawiki01:
  Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/usr/local/apache/common-local] is already declared in file /etc/puppet/modules/beta/manifests/common.pp:8; cannot redeclare at /etc/puppet/modules/mediawiki/manifests/sync.pp:26 on node i-0000044e.eqiad.wmflabs
And that sounds familiar. So as usual, the issue lies in our configuration management, which is not surprising.
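The readlink check above shows the path resolving to itself, which is what a real directory does. A small helper can make the symlink/directory distinction explicit when checking several hosts. This is a sketch only; describe_path is a hypothetical name and the output wording is an assumption.

```shell
# Report whether a path is a symlink (and its target), a real directory,
# or neither. The -L test must come first: -d follows symlinks and would
# also succeed for a symlink that points at a directory.
describe_path() {
    if [ -L "$1" ]; then
        echo "symlink -> $(readlink "$1")"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "missing"
    fi
}

# Usage on a live host: describe_path /usr/local/apache/common-local
```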
The root cause is https://gerrit.wikimedia.org/r/#/c/153807/ "mediawiki: create common-local directory", merged on Aug 13 22:28. It adds to puppet class mediawiki::sync:
  + file { '/usr/local/apache/common-local':
  +     ensure => directory,
  +     owner  => 'mwdeploy',
  +     group  => 'mwdeploy',
  +     mode   => '0775',
  + }
On beta that should be a symbolic link, as described in beta::common:
  file { '/usr/local/apache/common-local':
      ensure => link,
      # Link to files managed by scap
      target => $::beta::config::scap_deploy_dir,
  }
The change causes two issues:
1) on deployment-rsync01 the path is no longer a symbolic link, so scap instructs the apaches to sync from one directory while it updates the other
2) it breaks puppet with a duplicate definition on the application servers.
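A duplicate declaration like this one can be spotted before puppet runs by searching the manifests for the resource title. The sketch below is a plain text search, not a Puppet parser: find_declarations is a hypothetical helper, and it assumes titles are written single-quoted and followed by a colon, as in the two snippets above.

```shell
# List every *.pp manifest under directory $1 that contains the
# single-quoted resource title $2 followed by a colon, e.g.
#   file { '/usr/local/apache/common-local':
# More than one result for a title suggests a possible duplicate declaration.
find_declarations() {
    find "$1" -type f -name '*.pp' -exec grep -lF -- "'$2':" {} + | sort
}

# Usage: find_declarations /etc/puppet/modules /usr/local/apache/common-local
```

Declarations inside mutually exclusive classes would also be listed, so a hit is a prompt to look closer rather than proof of a conflict.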
Change 154329 had a related patch set uploaded by Hashar: Revert "mediawiki: create common-local directory" https://gerrit.wikimedia.org/r/154329
Cherry-picked https://gerrit.wikimedia.org/r/154329 on the beta cluster puppet master. On rsync01 I have deleted all the content of /usr/local/apache/common-local and MANUALLY created a symbolic link to /srv/common-local.
I then triggered a run of scap on beta via https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
Rerunning Erik Bernhardson's command:
  hashar@deployment-bastion:~$ sudo -u mwdeploy dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null|cut -d\  -f-2
  deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
  deployment-jobrunner01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
  deployment-mediawiki01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
  deployment-mediawiki02.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
  deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
  deployment-videoscaler01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
All good. Lowering the priority of the bug since it is hacked/manually fixed. I am leaving it open until the Gerrit change is reviewed / agreed on / a better solution is found.
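The "all good" check above amounts to confirming that every host reports the same checksum. For use in a script, that could be reduced to an exit status, roughly as follows. The function name in_sync is an assumption, and the input is expected in the "host: checksum" form shown in the transcript.

```shell
# Read "host: checksum" lines on stdin and succeed (exit status 0) only if
# exactly one distinct checksum appears. Empty input counts as not in sync.
in_sync() {
    distinct=$(awk '{ print $2 }' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Usage:
#   dsh -M -g mediawiki-installation md5sum FILE 2>/dev/null | in_sync \
#     && echo "all hosts match"
```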
Change 154329 abandoned by Hashar: Revert "mediawiki: create common-local directory" Reason: The paths have been reworked entirely in both prod and beta. We now use /srv/mediawiki/ and /srv/mediawiki-staging/ https://gerrit.wikimedia.org/r/154329