Last modified: 2014-09-29 08:48:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T71590, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69590 - rsync errors to beta cluster, inconsistent state after scap
Status: PATCH_TO_REVIEW
Product: Wikimedia
Classification: Unclassified
Component: Deployment systems
Version: unspecified
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Antoine "hashar" Musso (WMF)
Depends on: 69601
Reported: 2014-08-15 02:13 UTC by spage
Modified: 2014-09-29 08:48 UTC (History)

Description spage 2014-08-15 02:13:52 UTC
The beta cluster is serving old Echo JS code.
e.g. 
  http://bits.beta.wmflabs.org/static-master/extensions/Echo/modules/overlay/ext.echo.overlay.js

isn't the latest file.  This makes user and browser tests invalid.

Erik Bernhardson says failures show up in deployment-bastion:/data/project/logs/scap.log and /var/log/syslog, starting Aug 14 00:31:27.

E.g.
Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Connection reset by peer (104)
Aug 14 06:44:04 deployment-bastion rsyncd[1091]: rsync error: error in rsync protocol data stream (code 12) at io.c(1532) [sender=3.0.9]
Comment 1 Erik Bernhardson 2014-08-15 02:26:38 UTC
rsync is failing because the root volume is full on deployment-bastion.eqiad.wmflabs
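
A full root volume like this can be confirmed from the shell. The following is a minimal sketch, not the check that was actually run: the helper names `usage_pct` and `warn_if_full` and the 95% threshold are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; names and threshold are illustrative only.

# Print the percentage of space used on the filesystem holding path $1.
usage_pct() {
    # df -P forces the POSIX format: one line per filesystem,
    # with the use percentage (e.g. "97%") in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether the filesystem holding $1 is at or above $2 percent used.
warn_if_full() {
    pct=$(usage_pct "$1")
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}

# Check the root volume against an assumed 95% threshold.
warn_if_full / 95 || true
```

The same function can be pointed at /var, which later comments identify as the next partition likely to fill up.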
Comment 2 Greg Grossmeier 2014-08-15 03:16:36 UTC
Mukunda: if Antoine doesn't beat you to it, can you take a look into this?
Comment 3 jeremyb 2014-08-15 03:22:12 UTC
https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Deployment-prep/SAL&curid=9925&diff=123489&oldid=123370

02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G
Comment 4 Erik Bernhardson 2014-08-15 03:38:37 UTC
moved /srv-old to /mnt/srv-old and freed up 2.1G

scap has resumed its normal schedule.   /var is within 100M of having the same problem.

I'm still not seeing new code making it from deployment-bastion to deployment-mediawiki01 though, so leaving the bug open
Comment 5 Antoine "hashar" Musso (WMF) 2014-08-15 12:42:13 UTC
Erik has freed enough space by moving /srv-old which was in the root partition. Thank you!

Labs instances in eqiad have a 2GB /var/ which is often not large enough.  There is 1.1GB in /var/log :-/


Top offenders:

538M	/var/log/account/
335M    /var/log/atop*.log
168M	/var/log/diamond/
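
A listing like the one above can be produced with du. This is a sketch under assumptions: the function name `top_offenders` is hypothetical, and sizes are printed in KiB (du -sk) rather than in the human-readable form shown above, so that a plain numeric sort works everywhere.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest entries directly under a directory.
# $1 = directory, $2 = number of entries to show (defaults to 3).
top_offenders() {
    dir=$1
    n=${2:-3}
    # du -sk prints one "size-in-KiB<TAB>path" line per entry;
    # sort -rn orders them largest first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

top_offenders /var/log 3
```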


When diamond got enabled on labs, it had some full debug log being emitted. That was bug 66458 "Service diamond creates 500+ MByte /var/log/diamond/diamond.log". I have manually removed the old large logs.

I removed some archived files from /var/log/account/ but that will fill up quickly again.

Follow up bugs:

* Bug 69601 - Log files on labs instance fill up disk (/var is only 2GB) (tracking)
** Bug 69602 - diamond does not compress its logs
** Bug 69604 - acct (process and login accounting) fill up instances /var/ partition
** Bug 69605 - atop (monitoring system) logs fill up instances /var/ partition
Comment 6 Erik Bernhardson 2014-08-15 16:41:02 UTC
There's something else going on as well:

    ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null
    deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
    deployment-jobrunner01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
    deployment-mediawiki01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
    deployment-mediawiki02.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
    deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php
    deployment-videoscaler01.eqiad.wmflabs: 2344d153193c02780cf2a02bd724c125  /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php

That change was from yesterday, so I looked up a more recent merge. https://gerrit.wikimedia.org/r/#/c/154278/ was merged 15 min ago and also didn't rsync all the way out:

    ebernhardson@deployment-bastion:~$ dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php 2>/dev/null
    deployment-bastion.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
    deployment-jobrunner01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
    deployment-mediawiki01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
    deployment-mediawiki02.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
    deployment-rsync01.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
    deployment-videoscaler01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
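
With more hosts, the split between up-to-date and stale copies is easier to see when hosts are grouped by checksum. A minimal sketch, assuming the "host: checksum  path" line format shown above; the function name `group_by_checksum` is hypothetical, and the sample input is three lines copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read "host: md5  path" lines on stdin and print
# each distinct checksum followed by the hosts that reported it.
group_by_checksum() {
    awk '{
        host = $1
        sub(/:$/, "", host)          # strip the trailing colon after the host
        if ($2 in hosts)
            hosts[$2] = hosts[$2] " " host
        else
            hosts[$2] = host
    }
    END {
        for (sum in hosts)
            print sum ": " hosts[sum]
    }' | sort
}

# Sample input taken from the output above.
group_by_checksum <<'EOF'
deployment-bastion.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-jobrunner01.eqiad.wmflabs: 78c1075dd7743c8609fab81c7428be4a  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
deployment-rsync01.eqiad.wmflabs: d52e9791a81870af920eb199494d1795  /a/common/php-master/extensions/BetaFeatures/BetaFeaturesHooks.php
EOF
```

More than one line of output means the file differs between hosts. On the bastion, the dsh command above could be piped straight into the function.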
Comment 7 Greg Grossmeier 2014-08-15 17:30:14 UTC
Reedy: since Bryan is out, can you look into this?
Comment 8 Antoine "hashar" Musso (WMF) 2014-08-15 19:09:40 UTC
We deploy using scap harnessed in Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ .

I have manually changed the job to pass '--verbose' to scap: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17342/console
Comment 9 Antoine "hashar" Musso (WMF) 2014-08-15 20:27:14 UTC
scap uses deployment-rsync01 as a proxy from which application servers are instructed to pull.

A shortened version of the rsync command executed on deployment-rsync01 is:

 rsync01$ rsync ... deployment-bastion.eqiad.wmflabs::common /usr/local/apache/common-local

And:

 rsync01$ readlink -f /usr/local/apache/common-local
 /usr/local/apache/common-local

That copy is up to date.

rsync01 also has a /srv/common-local directory which is out of date. The most recent file I found is from August 13th 21:13 UTC (there might be one a bit more recent).

I suspect the apaches sync from /srv/common-local instead of /usr/local/apache/common-local, or that /usr/local/apache/common-local should be a symlink to /srv/common-local.
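
Since readlink -f prints the same path for a real directory, it helps to test explicitly whether the path is a symlink. A minimal sketch; the function name `describe_path` and its output strings are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: say whether a path is a symlink or a real directory.
describe_path() {
    if [ -L "$1" ]; then
        # -L is true for a symlink; readlink prints its immediate target.
        echo "symlink -> $(readlink "$1")"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "missing or not a directory"
    fi
}

describe_path /usr/local/apache/common-local
```

The -L test has to come first, because -d follows symlinks and would also be true for a link that points at a directory.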



Running puppet on deployment-mediawiki01 :

Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
   Duplicate declaration:
   File[/usr/local/apache/common-local] is already declared in file
     /etc/puppet/modules/beta/manifests/common.pp:8;
   cannot redeclare at
     /etc/puppet/modules/mediawiki/manifests/sync.pp:26 on node i-0000044e.eqiad.wmflabs


And that sounds familiar.  So as usual, the issue lies in our configuration management which is not surprising.
Comment 10 Antoine "hashar" Musso (WMF) 2014-08-15 20:37:47 UTC
The root cause is:

https://gerrit.wikimedia.org/r/#/c/153807/
mediawiki: create common-local directory

merged on Aug 13 22:28

It adds to puppet class mediawiki::sync :

+    file { '/usr/local/apache/common-local':
+        ensure => directory,
+        owner  => 'mwdeploy',
+        group  => 'mwdeploy',
+        mode   => '0775',
+    }


On beta that should be a symbolic link as described in beta::common:

    file { '/usr/local/apache/common-local':
        ensure => link,
        # Link to files managed by scap
        target => $::beta::config::scap_deploy_dir,
    }

The change causes two issues:

1) On deployment-rsync01 the path is no longer a symbolic link, so scap updates one directory while the apaches are instructed to sync from the other.

2) It breaks puppet with a duplicate declaration on the application servers.
Comment 11 Gerrit Notification Bot 2014-08-15 21:00:48 UTC
Change 154329 had a related patch set uploaded by Hashar:
Revert "mediawiki: create common-local directory"

https://gerrit.wikimedia.org/r/154329
Comment 12 Antoine "hashar" Musso (WMF) 2014-08-15 21:05:40 UTC
Cherry picked https://gerrit.wikimedia.org/r/154329 on beta cluster puppet master.

On rsync01 I have deleted all the content of /usr/local/apache/common-local and MANUALLY created a symbolic link to /srv/common-local
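
The manual step can be sketched roughly as follows. This is an illustration, not the commands that were actually run: the function name `replace_dir_with_link` is hypothetical, and the demonstration uses throwaway paths under a temporary directory instead of the real beta paths.

```shell
#!/bin/sh
# Hypothetical helper: replace directory $1 with a symlink to $2.
# Destructive: everything under $1 is deleted.
replace_dir_with_link() {
    dir=$1
    target=$2
    if [ -L "$dir" ]; then
        # Already a symlink: drop only the link, never the target's content.
        rm -f -- "$dir"
    else
        rm -rf -- "$dir"
    fi
    ln -s -- "$target" "$dir"
}

# Demonstration on throwaway paths (not the real beta paths).
demo=$(mktemp -d)
mkdir "$demo/srv-common-local" "$demo/common-local"
echo stale > "$demo/common-local/old-file"
replace_dir_with_link "$demo/common-local" "$demo/srv-common-local"
ls -ld "$demo/common-local"
rm -rf -- "$demo"
```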

I then triggered a run of scap on beta via https://integration.wikimedia.org/ci/job/beta-scap-eqiad/

Rerunning Erik Bernhardson's command:


hashar@deployment-bastion:~$ sudo -u mwdeploy  dsh -M -g mediawiki-installation md5sum /a/common/php-master/extensions/Flow/includes/Formatter/TopicListFormatter.php 2>/dev/null|cut -d\  -f-2
deployment-bastion.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-jobrunner01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-mediawiki01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-mediawiki02.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-rsync01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1
deployment-videoscaler01.eqiad.wmflabs: aebe7cb482bd4db26a9a34942b2881c1



All good.


Lowering priority of the bug since it is hacked/manually fixed.  I am leaving it open until the Gerrit change is reviewed / agreed / better solution found.
Comment 13 Gerrit Notification Bot 2014-09-29 08:48:04 UTC
Change 154329 abandoned by Hashar:
Revert "mediawiki: create common-local directory"

Reason:
The paths have been reworked entirely in both prod and beta. We now use /srv/mediawiki/ and /srv/mediawiki-staging/

https://gerrit.wikimedia.org/r/154329
