Last modified: 2014-11-14 01:45:23 UTC
I'm getting 503 errors from beta labs when I visit http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Topic:S2kt0b7x2jrv7gw0&action=compare-postsummary-revisions&topicsummary_newRevision=s3ems8b3lyhnoafm I can't find these errors in deployment-bastion:/data/project/logs where I would expect to find something. I notice some weirdness: * No fatal.log since August 19 (maybe it's all in HHVM logs?) * No hhvm log * No apache log (but maybe this isn't relevant with hhvm?) * The error message and its HTTP header gives no indication what server handled it, so you have to visit all the front-end servers. I think I found the problem in deployment-mediawiki02:/var/log/hhvm/error.log . Maybe this bug more succinctly is "Server hhvm logs are not brought to deployment-bastion". FWIW there is a fluorine:/a/mw-log/hhvm.log for production logs. Here's the complete error headers and text of error message. To be clear, I'm not reporting this particular 503 error, I'm reporting problems *logging* 503 errors. HTTP/1.1 503 Service Unavailable^M Server: Varnish^M Content-Type: text/html; charset=utf-8^M X-Varnish: 281336368, 1116870927^M Via: 1.1 varnish, 1.1 varnish^M Content-Length: 2717^M Accept-Ranges: bytes^M Date: Mon, 20 Oct 2014 19:59:36 GMT^M Age: 0^M Connection: keep-alive^M X-Cache: deployment-cache-text02 miss (0), deployment-cache-text02 frontend miss (0)^M Cache-Control: private, s-maxage=0, max-age=0, must-revalidate^M X-Analytics: php=zend^M Set-Cookie: GeoIP=::::v4; Path=/; Domain=.wmflabs.org Wikimedia Foundation Error Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. If you report this error to the Wikimedia System Administrators, please include the details below. Request: GET http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Topic:S2kt0b7x2jrv7gw0&action=compare-postsummary-revisions&topicsummary_newRevision=s3ems8b3lyhnoafm, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 281335977 Forwarded for: 198.73.209.3, 127.0.0.1 Error: 503, Service Unavailable at Mon, 20 Oct 2014 19:57:01 GMT
Well operations/mediawiki-config has: $ mwscript eval.php --wiki=enwiki > print $wmfUdp2logDest; 10.68.16.58:8420 > which is deployment-bastion. udp2log / python demux script crash from time to time so we have to restart it: deployment-bastion:~$ sudo /etc/init.d/udp2log-mw restart * Restarting UDP log receiver (udp2log) udp2log-mw [ OK ] $ So that should fix the log writing for now. We also have an instance udplog.eqiad.wmflabs which receives some other stream of udp2log events, maybe from hhvm instance? That is rather inconsistent, maybe we should have everything to point to udplog instance and remove the service from deployment-bastion. CCing Ori who has set up the new instance.
(In reply to Antoine "hashar" Musso (WMF) from comment #1) > udp2log / python demux script crash from time to time so we have to restart > it: > ... > So that should fix the log writing for now. No, there's nothing. There are a few archived hhvm-log-<DATE>.gz files in deployment-bastion:/data/project/logs/archive , last one from August 28th. > We also have an instance udplog.eqiad.wmflabs which receives some other > stream of udp2log events, maybe from hhvm instance? That is rather > inconsistent That seems to have the same content in /data/project/logs and ../archive as deployment-bastion Also, ErikB comments the HHVM fatals don't show up on logstash-beta.wmflabs.org, I don't know if that's a separate bug. This means we don't notice problems on the beta cluster and they're harder to debug, so I'm bumping the priority.
I haven't took time to properly investigate the issue. But recently we had a new udp2log-mw instance on udplog.eqiad.wmflabs in addition to the one that always existed on deployment-bastion.eqiad.wmflabs. So we have two daemons writing to the same file, and two logrotate runs on different instances that moves the files and reload the daemon so it open new files. That is definitely a race condition which eventually cause the logs to be written in the wrong place and/or kill the daemon. What would need to be done is figure out what udplog.eqiad.wmflabs is for, Ori set it up (is on CC of this bug) so there must be a purpose. From there, the easiest is probably to get rid of the udplog-mw instance on deployment-bastion and point everything at the new udplog.eqiad.wmflabs. That would need a few changes in operations/puppet.git and operations/mediawiki-config.git.
(In reply to spage from comment #2) > We also have an instance udplog.eqiad.wmflabs There's also a deployment-fluoride machine. Both udplog and deployment-bastion have the same /etc/udp2log/mw trying to send '^(fatal|exception) to it, but first fatals have to be in the logs. > That seems to have the same content in /data/project/logs and ../archive as > deployment-bastion Duh, they all mount the same deployment-project NFS share :-)
Fatals come from the apache or hhvm process logs. It looks like we haven't recorded any new apache2.log data in the beta cluster since 2014-07-23 and the last hhvm-fatal.log is dated 2014-08-29. I would expect these log streams to come in via rsyslog forwarding from each host as configured by modules/mediawiki/files/rsyslog.conf in operations/puppet. Our loss of logging for the hhv-fatal.log file quite nicely lines up with the first commit of that bit of refactored puppet configuration on Aug 28, 2014. The last log message we have is timestamped "Aug 28 19:47:43" and the change was merged on 08-28 13:47 (<https://gerrit.wikimedia.org/r/#/c/154710/>).
(Commented on that change)
Debugging some... On udplog.eqiad.wmflabs `sudo tcpdump udp port 8420` does not show any log packets being received from any host. On deployment-mediawiki01 `sudo tcpdump udp port 8420 and host 10.68.16.58` shows lots of traffic headed to deployment-bastion. This is traffic from the udp2log client built into MediaWiki. It's making it to the right place and being logged to files and forwarded to logstash. Again on deployment-mediawiki01 `sudo tcpdump udp port 8420 and host 10.68.17.218` shows no traffic headed to udplog.eqiad.wmflabs. The /etc/rsyslog.d/40-mediawiki.conf file on deployment-mediawiki01 matches <https://github.com/wikimedia/operations-puppet/blob/565672c41b016b465afc871080dbf440bced9451/modules/mediawiki/files/rsyslog.conf>. This config appears to be active as there is data in the local /var/log/apache2.log file. So for some reason currently unknown to me, the rsyslog server on deployment-mediawiki01 is up and running and processing local files but not forwarding messages via udp to udplog.eqiad.wmflabs.