Last modified: 2013-06-18 15:18:48 UTC
[originally reported on wikitech]
I've been using secure for login for over a year now, and at first it seemed
pretty good, other than the inability to switch sites easily (bug 5440).
And always editing links from secure.wikimedia.org/.../w to en.wikipedia.org/w,
but I've gotten used to doing that extra bit by hand.
Anyway, it's just been a dog lately. During EDT daylight hours, it often
gives a "not able to access page" error, especially when saving.
So, I've reverted to the old practice from the days of 2005-2006, and
mostly edit in very off-peak hours. Yet it slowed down drastically again!
Here's my test log, edits queued and ready to go, demonstrating roughly how
long they take to come back and display:
# 2009-07-01T06:55:59 1 minute 14 seconds
# 2009-07-01T17:06:17 53 seconds
# 2009-07-01T17:06:53 36 seconds
# 2009-07-01T17:08:00 1 minute 7 seconds
# 2009-07-01T17:08:40 40 seconds
# 2009-07-01T17:09:45 1 minute 5 seconds
# 2009-07-01T17:11:49 2 minutes 4 seconds
# 2009-07-01T17:12:44 55 seconds
# 2009-07-01T17:13:49 1 minute 5 seconds
# 2009-07-01T17:15:00 1 minute 11 seconds
# 2009-07-01T17:16:10 1 minute 10 seconds
In short, sometimes as slow off-peak as peak.
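The durations in the log above equal the gaps between consecutive timestamps in the 17:06 to 17:16 run, which suggests each figure is the time since the previous save. A minimal Python sketch, assuming that reading (the function name gaps_in_seconds is made up for illustration), recomputes the gaps and summarizes the run:

```python
from datetime import datetime

# Timestamps of the consecutive evening saves, copied from the log above.
STAMPS = [
    "2009-07-01T17:06:17",
    "2009-07-01T17:06:53",
    "2009-07-01T17:08:00",
    "2009-07-01T17:08:40",
    "2009-07-01T17:09:45",
    "2009-07-01T17:11:49",
    "2009-07-01T17:12:44",
    "2009-07-01T17:13:49",
    "2009-07-01T17:15:00",
    "2009-07-01T17:16:10",
]


def gaps_in_seconds(stamps):
    """Return the seconds elapsed between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(STAMPS)
print(gaps)
print("mean: %.1f s, worst: %d s" % (sum(gaps) / len(gaps), max(gaps)))
```

For this run the gaps average a little over a minute, with the worst at just over two minutes.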
Does this mean that many secure users are from Asia?
Are there too many secure users?
Is there anywhere that configuration and usage of secure is listed?
Fred, can you take a peek and see if we can monitor the status of the secure server? I haven't noticed any problems using it, nor did the load graphs look particularly unpleasant when I checked last week, but we want to make sure it's not going to crap when we're not looking.
*** Bug 19588 has been marked as a duplicate of this bug. ***
secure.wikimedia.org seems to point to bart, which uses apache2 to proxy the SSL connection over to the cluster.
However, bart is also the nagios monitoring server and will therefore see spikes in CPU usage from time to time, depending on the nagios scheduler.
Also, this server is very low on memory:
[root@bart conf]# free -m
             total       used       free     shared    buffers     cached
Mem:          3550       3129        420          0        406       1118
-/+ buffers/cache:       1604       1945
Swap:         1983          0       1983
which could cause some of the issues you are seeing.
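As a reading aid for the free -m output above: the "-/+ buffers/cache" row is derived from the "Mem:" row by counting buffers and page cache as reclaimable. A small Python sketch of that arithmetic, using the figures from the output (the helper name adjust_for_cache is hypothetical, and the results may differ from the printed row by about 1 MB because of rounding):

```python
def adjust_for_cache(used, free, buffers, cached):
    """Return (used, free) in MB with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# "Mem:" row from bart, in MB.
app_used, effectively_free = adjust_for_cache(
    used=3129, free=420, buffers=406, cached=1118
)
print(app_used, effectively_free)
```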
I will enable process accounting on that server to try and get a better view as to what is going on.
Ganglia graphs available at http://ganglia.wikimedia.org/pmtpa/?c=Miscellaneous&h=bart.wikimedia.org&m=&r=hour&s=descending&hc=4
Also note, this server is set to be decommissioned in the near future.
Created attachment 6367
peaks during off-peak time
Thank you for the ganglia link. The server list had "ssl"
instead of secure.wikimedia.org, so I'd missed it.
I've been looking at the graphs from time to time, and
this was a fine example.
Noting for the record that http://nagios.wikimedia.org/ has been reporting
MEMCACHED CRITICAL - Can not connect to 10.0.2.159:11000 (Connection refused)
for some time now....
Created attachment 6371
peaks at same time each day
Comparing 07-19 and 07-20, the first CPU peak occurs at the same time
on both days. However, there is also a 07-20 network peak at the same time as the
second 07-19 CPU peak, which suggests some kind of regular process, too.
Dunno if this helps, but I've noticed this problem only on these pages so far (and I look at a lot of Wikipedia articles):
I'm accessing Wikipedia from New Zealand. The pages seem to be perpetually inaccessible (a few days so far). Of course the non-secure pages work fine.
The above three pages are still inaccessible for me. Also I've found another:
The first time I tried to access it, I got a 502 error about the proxy not being able to read. The second time, it went through rather quickly. Perhaps the parser cache is separate for secure and everything else, and that page just takes so insanely long to render that it times out(?)
Some more: https://secure.wikimedia.org/wikipedia/en/wiki/World_War_II
(In reply to comment #10)
> Some more: https://secure.wikimedia.org/wikipedia/en/wiki/World_War_II
For me too. All return "502 Proxy Error"
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /wikipedia/en/wiki/World_War_II.
Reason: Error reading from remote server
Apache/2.2.8 (Ubuntu) mod_fastcgi/2.4.6 PHP/5.2.4-2ubuntu5.12wm1 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g Server at secure.wikimedia.org Port 443
>Statuscode:502 Bad Gateway
>Date:Wed, 02 Mar 2011 07:33:01 GMT
That again appears to be the proxy timing out. I only get the 502 when the page was not served from the parser cache. If it's served from the parser cache, it works fine from secure.
Probably the timeout on the proxy server needs to be increased (or someone could make the parser be super fast, but that's a little more difficult ;)
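The theory above can be put as a toy model: the proxy gives up after a fixed timeout, so an uncached page that takes longer than that to render comes back as 502, while a parser-cache hit returns quickly. The Python sketch below is illustrative only; the function name, the timeout and the timing values are invented, not measured on secure.wikimedia.org.

```python
def proxy_status(render_seconds, in_parser_cache,
                 proxy_timeout=30.0, cache_hit_seconds=0.5):
    """Return the HTTP status a timeout-limited proxy would hand back (toy model).

    All default values are hypothetical, chosen only to illustrate the idea.
    """
    backend_seconds = cache_hit_seconds if in_parser_cache else render_seconds
    return 200 if backend_seconds <= proxy_timeout else 502


# Hypothetical slow page that needs 45 s to render from scratch.
print(proxy_status(45.0, in_parser_cache=False))  # first request, cache miss
print(proxy_status(45.0, in_parser_cache=True))   # later request, cache hit
```

In this model the only ways to turn the 502 into a 200 are to raise proxy_timeout above the render time or to make the render faster, which are the two options mentioned above.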
[tried sending this via email, trying again]
We've seen these Proxy errors before with the server overloaded. It's
currently on singer. But I don't see (via Ganglia) the huge cpu spikes
we used to have on bart with nagios.
However, I was just going to post to wikitech that I've been seeing
other problems from secure lately, too:
* Edits don't seem to flush the cache properly. After noticing this
weekend, I had to action=flush a dozen pages by hand to see my article
and category changes reflected via normal access.
* It's losing the user name on edits, showing up with IP instead. I'm
not sure this wasn't due to my user error somehow -- but it was fairly
frequent back in the old overloaded days, hadn't happened to me for a
couple of years, and just showed up again yesterday!
(In reply to comment #13)
> [tried sending this via email, trying again]
yeah, trying to reply by email to bugmail doesn't work.
> We've seen these Proxy errors before with the server overloaded. It's
> currently on singer. But I don't see (via Ganglia) the huge cpu spikes
> we used to have on bart with nagios.
If my theory is correct, it's not caused by load.
> * Edits don't seem to flush the cache properly. After noticing this
> weekend, I had to action=flush a dozen pages by hand to see my article
> and category changes reflected via normal access.
There were recently some issues with the job queue (bug 27727); it may be related to that. (That wouldn't be secure-specific, though.)
> * It's losing the user name on edits, showing up with IP instead. I'm
> not sure this wasn't due to my user error somehow -- but it was fairly
> frequent back in the old overloaded days, hadn't happened to me for a
> couple of years, and just showed up again yesterday!
That's a more interesting issue, I have no idea what could cause that.
Giving half of Fred's old bugs to Ashar since I trust him to get it done or reassign if he doesn't have time.
Resetting this back to wikibugs, and almost willing to close it.
It appears this was assigned back before we had status.* and the related tools, and that is what was wanted. Getting it included there is now bug 27912.
And the 502 errors are also a separate bug (bug 25271), which could probably get duped either way.
Assigning back to me. Pending actions:
- make sure it is monitored by nagios and ganglia
- check whether the peaks have disappeared; if not, either
--- move the process generating them elsewhere
--- move secure.w.o somewhere else
Regarding comment 16, I had already filed Bug 19588 on the Proxy errors, but Brion marked it as a duplicate of this bug (back in comment 2). So maybe they should be split again?
Merge with bug 25271?
The Wikimedia Foundation operations team is rebuilding the HTTPS system from scratch, which will solve this bug for good.
HTTPS was enabled on test some days ago:
Therefore, this bug will not be fixed since the architecture is going to be replaced.
Sounds more like "almost FIXED" than WONTFIX. :)