Bug 57249 - intermittent 503 errors on
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta)
Hardware/OS: All / All
Importance: High normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Whiteboard: browser-test-bug
Depends on:
Reported: 2013-11-19 15:02 UTC by Chris McMahon
Modified: 2014-01-21 01:22 UTC
CC: 15 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments:
503 from automated test on beta labs (170.69 KB, image/png)
2013-11-19 15:02 UTC, Chris McMahon
Description Chris McMahon 2013-11-19 15:02:07 UTC
Created attachment 13833 [details]
503 from automated test on beta labs

Seen as of Nov 19, possibly earlier.

Automated tests sometimes trigger a 503 error; see the attached screenshot.
Comment 1 Arthur Richards 2013-11-26 16:26:06 UTC
Been seeing this more frequently yesterday and today, particularly on large articles on mobile - eg
Comment 2 Andre Klapper 2013-11-26 17:36:34 UTC
(In reply to comment #1)
> Been seeing this more frequently yesterday and today, particularly on large
> articles on mobile

Referring to yesterday and today only: according to the Operations team (mailing list), yesterday there were serious network issues on the eqiad-esams link (including packet loss). This caused site-wide outages in Europe.

Bug 57499 might be a very similar / same problem.
Comment 3 Arthur Richards 2013-12-02 20:16:35 UTC
I'm still getting occasional 503 errors on betalabs on the Barack Obama article. This is causing a lot of noise as we have automated browser tests failing as a result that use that article to test some things that historically have broken on large articles.
Comment 4 Jon 2013-12-03 18:08:38 UTC
In the meantime, can't we just reduce the size of the Barack Obama article to get the tests to stop failing..?
Comment 5 Arthur Richards 2013-12-03 18:10:56 UTC
Just like we don't want to disable the tests because they keep failing and generating noise, we should really get the underlying problem fixed rather than come up with a workaround.
Comment 6 Faidon Liambotis 2013-12-03 18:30:21 UTC
esams has nothing to do with betalabs -- betalabs runs exclusively on pmtpa and there's no cross-datacenter traffic that would be affected by networking issues.

I don't think anyone from ops has made any changes recently that would affect the betalabs infrastructure. I haven't debugged this but my guess is that this is either some underlying MediaWiki issue (lingering issues from #57026 perhaps?) that is making pages take longer to load, or the Labs infrastructure being overloaded and vCPUs running slower.

In any case, I made a change yesterday that affects both production & betalabs, increasing Varnish's & nginx's timeouts to 180s to put them on par with PHP's max_execution_time setting (these would be & ).
Comment 7 Andre Klapper 2013-12-04 12:02:13 UTC
(In reply to comment #6)
> or the Labs infrastructure being overloaded and vCPUs running slower.

Reminded me of bug 57968 but that only refers to the non-HTTP Labs website.
Comment 8 Faidon Liambotis 2013-12-05 14:50:39 UTC
Is this still happening?
Comment 9 Antoine "hashar" Musso (WMF) 2013-12-05 16:05:19 UTC
I am not sure how to confirm. I was looking for a Varnish metric in Ganglia that would give us the number of backend requests that time out, but haven't found such a metric :(
Comment 10 Chris McMahon 2013-12-05 22:31:28 UTC
This is still occurring as of 5 Dec.  A fresh request for the Barack Obama page still often yields a 503 error.
Comment 11 Antoine "hashar" Musso (WMF) 2013-12-06 11:09:28 UTC
On beta I had a 503 on

Error: 503, Service Unavailable at Fri, 06 Dec 2013 11:04:16 GMT

The related slow-parse.log entry is:

2013-12-06 11:04:25 deployment-apache33 enwiki: 43.42 Barack_Obama

Something is hitting a timeout but I have no idea which part exactly :(
Comment 12 Bartosz Dziewoński 2013-12-07 01:18:21 UTC
Another report (just reopened) is bug 57026.
Comment 13 Matthew Flaschen 2013-12-07 01:41:30 UTC
I'm not sure if it's the same underlying cause (we may or may not need a separate bug). However, the intermittent 503s on are not limited to Barack Obama, or even to long articles.

I've also gotten it just today trying to POST a new, one-sentence Draft (I'm testing on Beta), and trying to create an account (accidentally left on the CAPTCHA if that matters).
Comment 14 dan 2013-12-09 22:57:47 UTC
Request: POST, from via deployment-cache-text1 frontend ([]:80), Varnish XID 2142135504
Forwarded for:
Error: 503, Service Unavailable at Mon, 09 Dec 2013 20:20:02 GMT
Comment 15 Chris McMahon 2013-12-10 18:19:09 UTC
I took a look at the historical browser test failures and to my surprise discovered that all of the 503 errors are happening in Chrome and none (or very few) are happening in Firefox. 

I asked about this in #wikimedia-ops.  It may be that "Chrome presents a header that the varnish setup has a Vary on". 

I haven't yet investigated the headers being sent by MF in the browsers, but this might be a factor in the ongoing 503s.
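For illustration of the theory above (a toy model, not Varnish internals): a response with a Vary header is cached separately per distinct value of the listed request headers, so two browsers sending different values each start with their own cache miss. The header values below are hypothetical examples:

```ruby
# Toy model of HTTP caching with a Vary header (illustration only).
class VaryCache
  def initialize
    @store = {}
  end

  # Cache key is the URL plus the request's values for each header named in Vary.
  def key_for(url, vary_headers, request_headers)
    [url, vary_headers.map { |h| request_headers[h] }]
  end

  def fetch(url, vary_headers, request_headers)
    key = key_for(url, vary_headers, request_headers)
    if @store.key?(key)
      [:hit, @store[key]]
    else
      @store[key] = "body-for-#{url}"
      [:miss, @store[key]]
    end
  end
end

cache = VaryCache.new
vary = ["Accept-Encoding"]
chrome  = { "Accept-Encoding" => "gzip, deflate, sdch" }  # hypothetical values
firefox = { "Accept-Encoding" => "gzip, deflate" }

puts cache.fetch("/wiki/Barack_Obama", vary, chrome)[0]   # miss: Chrome fills its own variant
puts cache.fetch("/wiki/Barack_Obama", vary, firefox)[0]  # miss: Firefox's variant is separate
puts cache.fetch("/wiki/Barack_Obama", vary, chrome)[0]   # hit
```

If each variant has to be rendered separately on a miss, a browser sending an unusual header value would hit the slow backend path more often, which would fit the Chrome-only pattern.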
Comment 16 Matthew Flaschen 2013-12-10 19:26:42 UTC
(In reply to comment #15)
> I took a look at the historical browser test failures and to my surprise
> discovered that all of the 503 errors are happening in Chrome and none (or
> very few) are happening in Firefox. 

I've had it in Firefox (manual testing, not an automated browser test).  I believe the examples I posted above were.  The next time it happens in Firefox, I'll try to post the debugging info here.
Comment 17 Antoine "hashar" Musso (WMF) 2013-12-11 11:05:18 UTC
I have started a background varnishlog process on deployment-cache-text1.pmtpa.wmflabs instance. Started as root using:

 nohup varnishlog -w /data/project/logs/varnish-cache-text1.log \
  > /data/project/logs/varnishlogger-cache-text1.log 2>&1 &

The log file being on the shared directory, it is accessible from deployment-bastion.pmtpa.wmflabs.
Comment 18 Antoine "hashar" Musso (WMF) 2013-12-11 11:09:05 UTC
The log file is not that great for human consumption/filtering. One can review the logs by connecting to deployment-cache-text1 and then running

 varnishlog -r /data/project/logs/varnish-cache-text1.log

That will replay all the logs. One can later filter out the 503s and hopefully find the reason for the failure.
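As a sketch of that filtering step, assuming the binary log has been replayed to a plain-text dump (e.g. with `varnishlog -r ... > dump.txt`; in Varnish 3 the client response status is logged under the TxStatus tag), one could pull out the failing transaction IDs. The sample lines are hypothetical; real varnishlog output is more verbose:

```ruby
# Extract transaction IDs whose response status was 503 from a plain-text
# varnishlog dump (hypothetical sample; the first column is the XID).
def failed_xids(dump)
  dump.lines
      .select { |l| l =~ /TxStatus\s+c\s+503/ }
      .map    { |l| l.split.first }
end

sample = <<~LOG
  2142135501 TxStatus     c 200
  2142135504 TxStatus     c 503
  2142135509 TxStatus     c 503
LOG

puts failed_xids(sample).inspect  # => ["2142135504", "2142135509"]
```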
Comment 19 Arthur Richards 2013-12-11 19:11:19 UTC
Thanks so much Antoine, I am really happy you are on the path to getting to the bottom of this :)
Comment 20 Ryan Kaldari 2013-12-12 00:10:08 UTC
Almost all of my API requests on beta labs are returning 503 errors today.
Comment 21 Ryan Kaldari 2013-12-12 00:21:10 UTC
I'm getting the 503s in Firefox (regarding comment 15):

Request: GET, from via deployment-cache-mobile01 frontend ([]:80), Varnish XID 2095358945
Forwarded for:
Error: 503, Service Unavailable at Thu, 12 Dec 2013 00:06:23 GMT
Comment 22 Antoine "hashar" Musso (WMF) 2013-12-12 10:25:11 UTC
One issue is that puppet keeps restarting Apache. I haven't correlated the restarts mentioned in /data/project/logs/apache-error.log with the 503 errors reported in the varnish log. Anyway, I filed an RT ticket for ops to look at the puppet conf: RT #6500.
Comment 23 Antoine "hashar" Musso (WMF) 2013-12-12 13:31:20 UTC
RT #6500 fixed; puppet no longer restarts Apache every time it runs. That impacted production as well. It was definitely the root cause of some of the 503s we were receiving, albeit probably not the only cause.
Comment 24 Gerrit Notification Bot 2013-12-12 13:38:58 UTC
Change 100988 had a related patch set uploaded by Hashar:
beta: dont restart apache on fake mwsync
Comment 25 Gerrit Notification Bot 2013-12-12 13:54:11 UTC
Change 100988 merged by ArielGlenn:
beta: dont restart apache on fake mwsync
Comment 26 Antoine "hashar" Musso (WMF) 2013-12-12 14:02:44 UTC
Puppet no longer restarts Apache on the beta cluster.

I have cleared the log file mentioned in Comment #17.
Comment 27 Chris McMahon 2013-12-19 16:50:54 UTC
As of Dec 12 we see 503s on beta for Firefox and also for Chrome:
Comment 28 Chris McMahon 2013-12-20 15:24:55 UTC
I figured out what's going on here, and I created a workaround in the Mobile browser tests for it.  Here is some more information: 

We have four tests that use the Barack_Obama page.  

The first test to hit the Barack_Obama page (or, I assume, any number of other pages on beta labs) gets a 503 error.

After that, the Barack_Obama page is in the Varnish cache, so the next three tests to load the page succeed without error.  

At some point after the Mobile browser test build finishes, the version of the page in Varnish expires, so the next browser test build encounters the 503 again. 

The workaround is to have the test load the page while disregarding any error, then load the page again for real.  It is not a significant performance hit to do this: 

Given /^I am on the (.+) article$/ do |article|
  # First visit puts the page into the Varnish cache, avoiding 503 errors
  visit(ArticlePage, using_params: { article_name: article })
  visit(ArticlePage, using_params: { article_name: article })
end
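In generic form, the workaround is just "attempt once, swallowing the expected failure, then do the real request". A hypothetical sketch of that pattern in plain Ruby (the error class and the simulated visit are stand-ins, not the real page-object API):

```ruby
# Generic cache-warming wrapper: the first attempt may 503 while the backend
# parses the page into the Varnish cache; swallow that and retry for real.
class ServiceUnavailable < StandardError; end

def warm_then_visit
  begin
    yield  # warming pass; a 503 here is expected and ignored
  rescue ServiceUnavailable
    # the page should now be parsed into the cache
  end
  yield  # the real visit; let any error propagate this time
end

calls = 0
result = warm_then_visit do
  calls += 1
  raise ServiceUnavailable if calls == 1  # simulate the first-hit 503
  "page loaded"
end
puts result  # => "page loaded" on the second call
```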
Comment 29 Arthur Richards 2013-12-20 17:43:57 UTC
Glad we at least have a workaround for now, thanks Chris!
Comment 30 Antoine "hashar" Musso (WMF) 2013-12-20 20:02:54 UTC
Some entries from /data/project/logs/archive/slow-parse.log, which logs the parsing time of articles that take longer than X seconds to parse:

2013-12-18 06:56:11 deployment-apache32 enwiki: 34.53 Barack_Obama
2013-12-18 15:11:21 deployment-apache32 enwiki: 34.99 Barack_Obama
2013-12-18 15:11:35 deployment-apache33 enwiki: 14.24 Barack_Obama
2013-12-18 21:29:00 deployment-apache33 enwiki: 37.49 Barack_Obama
2013-12-18 21:29:11 deployment-apache33 enwiki: 13.17 Barack_Obama
2013-12-19 06:16:51 deployment-apache32 enwiki: 13.79 Barack_Obama
2013-12-19 06:16:51 deployment-apache32 enwiki: 48.87 Barack_Obama
2013-12-19 06:16:51 deployment-apache33 enwiki: 83.96 Barack_Obama
2013-12-19 17:27:58 deployment-apache33 enwiki: 43.98 Barack_Obama
2013-12-19 17:28:01 deployment-apache32 enwiki: 12.60 Barack_Obama
2013-12-19 21:32:00 deployment-apache32 enwiki: 41.26 Barack_Obama
2013-12-19 21:32:07 deployment-apache32 enwiki: 12.45 Barack_Obama
2013-12-20 06:09:31 deployment-apache33 enwiki: 35.53 Barack_Obama
2013-12-20 06:09:43 deployment-apache32 enwiki: 12.77 Barack_Obama
2013-12-20 14:43:16 deployment-apache33 simplewiki: 3.17  Barack_Obama

As of Fri Dec 20 19:58:03 UTC 2013

The text caches have a 185-second first-byte timeout.

The mobile caches apparently have a 35-second first-byte timeout. If the 503s are mostly encountered by mobile tests, that would explain them.

One might want to talk to ops and find out why the timeouts differ between text and mobile. We might be able to raise the mobile one for the beta cluster (puppet change in manifests/role/cache.pp)
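Under that theory, each slow-parse entry above can be checked against the two first-byte timeouts. A quick sketch, with the timeout values taken from this comment and a few parse times from the log excerpt:

```ruby
# Flag slow-parse entries that would exceed a given first_byte_timeout.
MOBILE_TIMEOUT = 35.0   # seconds, per the comment above
TEXT_TIMEOUT   = 185.0

def over_timeout(entries, timeout)
  entries.select { |e| e[:seconds] > timeout }
end

entries = [
  { title: "Barack_Obama", seconds: 34.53 },
  { title: "Barack_Obama", seconds: 48.87 },
  { title: "Barack_Obama", seconds: 83.96 },
]

puts over_timeout(entries, MOBILE_TIMEOUT).length  # => 2 would 503 on mobile
puts over_timeout(entries, TEXT_TIMEOUT).length    # => 0 would 503 on text
```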
Comment 31 Gerrit Notification Bot 2013-12-23 16:54:50 UTC
Change 103376 had a related patch set uploaded by Faidon Liambotis:
varnish: adjust first_byte_timeout for mobile too
Comment 32 Gerrit Notification Bot 2013-12-23 16:55:59 UTC
Change 103376 merged by Faidon Liambotis:
varnish: adjust first_byte_timeout for mobile too
Comment 33 Faidon Liambotis 2013-12-23 17:01:05 UTC
Chris, your theory sounds very plausible. As seen above, I fixed & merged the patch to increase the timeout for mobile as well, which should help reduce the occurrence of this or even eliminate it.

It's not the underlying cause, though. In comment #6 I theorized that these are lingering issues from bug 57026; that bug has since been reopened (by others :), and Brad has investigated, identified another issue, and proposed a fix that is not yet merged.

What probably happens is that, due to that bug, double parsing occurs and doubles the time it takes to render large pages such as Barack Obama, pushing response times past the 30s Varnish timeout mark.
Comment 34 Antoine "hashar" Musso (WMF) 2013-12-31 13:58:55 UTC
Is this still an issue? We can enable varnishlog on the mobile cache as described in comment #17.
Comment 35 Chris McMahon 2013-12-31 15:10:43 UTC
Between changing the timeout value for Mobile varnish, putting the first page load in a try/catch clause in the automated browser test, and hopefully reducing parsing time everywhere, I'm going to go ahead and close this.
