Last modified: 2014-11-14 01:45:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 72275 - no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs
no log in deployment-bastion:/data/project/logs from "503 server unavailable"...
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
deployment-prep (beta) (Other open bugs)
All All
: High normal
: ---
Assigned To: Sam Reed (reedy)
Depends on:
Blocks: documentation
  Show dependency treegraph
Reported: 2014-10-20 20:42 UTC by spage
Modified: 2014-11-14 01:45 UTC (History)
9 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description spage 2014-10-20 20:42:13 UTC
I'm getting 503 errors from beta labs when I visit

I can't find these errors in deployment-bastion:/data/project/logs where I would expect to find something. I notice some weirdness:

* No fatal.log since August 19 (maybe it's all in HHVM logs?)
* No hhvm log
* No apache log (but maybe this isn't relevant with hhvm?)
* The error message and its HTTP header gives no indication what server handled it, so you have to visit all the front-end servers.

I think I found the problem in deployment-mediawiki02:/var/log/hhvm/error.log . Maybe this bug more succinctly is "Server hhvm logs are not brought to deployment-bastion".  FWIW there is a fluorine:/a/mw-log/hhvm.log for production logs.

Here's the complete error headers and text of error message.  To be clear, I'm not reporting this particular 503 error, I'm reporting problems *logging* 503 errors.

HTTP/1.1 503 Service Unavailable^M
Server: Varnish^M
Content-Type: text/html; charset=utf-8^M
X-Varnish: 281336368, 1116870927^M
Via: 1.1 varnish, 1.1 varnish^M
Content-Length: 2717^M
Accept-Ranges: bytes^M
Date: Mon, 20 Oct 2014 19:59:36 GMT^M
Age: 0^M
Connection: keep-alive^M
X-Cache: deployment-cache-text02 miss (0), deployment-cache-text02 frontend miss (0)^M
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate^M
X-Analytics: php=zend^M
Set-Cookie: GeoIP=::::v4; Path=/;

Wikimedia Foundation

Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes.
If you report this error to the Wikimedia System Administrators, please include the details below.
Request: GET, from via deployment-cache-text02 deployment-cache-text02 ([]:3128), Varnish XID 281335977
Forwarded for:,
Error: 503, Service Unavailable at Mon, 20 Oct 2014 19:57:01 GMT
Comment 1 Antoine "hashar" Musso (WMF) 2014-10-21 09:25:36 UTC
Well operations/mediawiki-config has:

  $ mwscript eval.php --wiki=enwiki
  > print $wmfUdp2logDest;

which is deployment-bastion.

udp2log / python demux script  crash from time to time so we have to restart it:

 deployment-bastion:~$ sudo /etc/init.d/udp2log-mw restart
  * Restarting UDP log receiver (udp2log) udp2log-mw      [ OK ] 

So that should fix the log writing for now.

We also have an instance udplog.eqiad.wmflabs  which receives some other stream of udp2log events, maybe from hhvm instance?  That is rather inconsistent, maybe we should have everything to point to udplog instance and remove the service from deployment-bastion.   CCing Ori who has set up the new instance.
Comment 2 spage 2014-10-24 19:44:11 UTC
(In reply to Antoine "hashar" Musso (WMF) from comment #1)

> udp2log / python demux script  crash from time to time so we have to restart
> it:
> ...
> So that should fix the log writing for now.

No, there's nothing.  There are a few archived hhvm-log-<DATE>.gz files in
deployment-bastion:/data/project/logs/archive , last one from August 28th.

> We also have an instance udplog.eqiad.wmflabs  which receives some other
> stream of udp2log events, maybe from hhvm instance?  That is rather
> inconsistent

That seems to have the same content in /data/project/logs and ../archive as deployment-bastion

Also, ErikB comments the HHVM fatals don't show up on, I don't know if that's a separate bug.

This means we don't notice problems on the beta cluster and they're harder to debug, so I'm bumping the priority.
Comment 3 Antoine "hashar" Musso (WMF) 2014-10-24 19:52:39 UTC
I haven't took time to properly investigate the issue.  But recently we had a new udp2log-mw instance on udplog.eqiad.wmflabs in addition to the one that always existed on deployment-bastion.eqiad.wmflabs.

So we have two daemons writing to the same file, and two logrotate runs on different instances that moves the files and reload the daemon so it open new files.   That is definitely a race condition which eventually cause the logs to be written in the wrong place and/or kill the daemon.

What would need to be done is figure out what udplog.eqiad.wmflabs is for, Ori set it up (is on CC of this bug) so there must be a purpose.  From there, the easiest is probably to get rid of the udplog-mw instance on deployment-bastion and point everything at the new udplog.eqiad.wmflabs.  That would need a few changes in operations/puppet.git and operations/mediawiki-config.git.
Comment 4 spage 2014-10-24 20:11:34 UTC
(In reply to spage from comment #2)
> We also have an instance udplog.eqiad.wmflabs 

There's also a deployment-fluoride machine. Both udplog and deployment-bastion have the same /etc/udp2log/mw trying to send '^(fatal|exception) to it, but first fatals have to be in the logs.

> That seems to have the same content in /data/project/logs and ../archive as
> deployment-bastion

Duh, they all mount the same deployment-project NFS share :-)
Comment 5 Bryan Davis 2014-11-05 01:24:07 UTC
Fatals come from the apache or hhvm process logs. It looks like we haven't recorded any new apache2.log data in the beta cluster since 2014-07-23 and the last hhvm-fatal.log is dated 2014-08-29. I would expect these log streams to come in via rsyslog forwarding from each host as configured by modules/mediawiki/files/rsyslog.conf in operations/puppet.

Our loss of logging for the hhv-fatal.log file quite nicely lines up with the first commit of that bit of refactored puppet configuration on Aug 28, 2014. The last log message we have is timestamped "Aug 28 19:47:43" and the change was merged on 08-28 13:47 (<>).
Comment 6 Greg Grossmeier 2014-11-06 21:57:58 UTC
(Commented on that change)
Comment 7 Bryan Davis 2014-11-14 01:45:23 UTC
Debugging some...

On udplog.eqiad.wmflabs `sudo tcpdump udp port 8420` does not show any log packets being received from any host.

On deployment-mediawiki01 `sudo tcpdump udp port 8420 and host` shows lots of traffic headed to deployment-bastion. This is traffic from the udp2log client built into MediaWiki. It's making it to the right place and being logged to files and forwarded to logstash.

Again on deployment-mediawiki01 `sudo tcpdump udp port 8420 and host` shows no traffic headed to udplog.eqiad.wmflabs.

The /etc/rsyslog.d/40-mediawiki.conf file on deployment-mediawiki01 matches <>. This config appears to be active as there is data in the local /var/log/apache2.log file.

So for some reason currently unknown to me, the rsyslog server on deployment-mediawiki01 is up and running and processing local files but not forwarding messages via udp to udplog.eqiad.wmflabs.

Note You need to log in before you can comment on or make changes to this bug.