Last modified: 2014-09-23 22:33:09 UTC
See also: https://bugzilla.wikimedia.org/show_bug.cgi?id=16043 (this blocks it, if anything)

Great change has taken place in #wikipedia with regard to opping practices. It remains difficult to manage the channel during downtime, especially with little or no support from sysadmins (if the channel gets particularly hectic, while +m might not be warranted, it is impossible to read both the channel and #wikimedia-tech).

A far better solution than the one Mike suggests is for the Wikimedia sysadmins to create some easy, quick-to-update and accessible method of telling users what is going on. Not many people use or are familiar with IRC, and I'd expect that for 90% of people who see the "site are down" message, their usual next step would be to (ironically!) visit Wikipedia to see what IRC means! It therefore serves very few users as a means of providing status updates.

It would be relatively trivial for someone to create (yet another) IRC bot for #wikimedia-tech which could write comments given to it to either a blog or something like Twitter (and thus an RSS feed). This would be accessible to many more users affected by Wikimedia downtime. An IRC channel is no longer fit for purpose.
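For illustration only, here is a rough Python sketch of the bot idea described above: pick status comments out of raw IRC lines and turn them into RSS items. The `!status` command, the function names, the feed title and the example.org link are all assumptions made up for this sketch; nothing here reflects an existing Wikimedia bot, and the IRC connection handling is left out entirely.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical command prefix; the discussion does not specify one.
STATUS_COMMAND = "!status"

# Matches a raw IRC line of the form ":nick!user@host PRIVMSG #channel :text".
PRIVMSG_RE = re.compile(
    r"^:(?P<nick>[^!\s]+)!\S+ PRIVMSG (?P<channel>\S+) :(?P<text>.*)$"
)


def extract_status(raw_line, channel="#wikimedia-tech"):
    """Return the status text if the line is a status command sent to
    the given channel, otherwise None."""
    match = PRIVMSG_RE.match(raw_line.rstrip("\r\n"))
    if match is None or match.group("channel") != channel:
        return None
    text = match.group("text")
    if text != STATUS_COMMAND and not text.startswith(STATUS_COMMAND + " "):
        return None
    status = text[len(STATUS_COMMAND):].strip()
    return status or None


def build_rss(statuses):
    """Render (datetime, text) pairs as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    chan = ET.SubElement(rss, "channel")
    # Placeholder feed metadata (assumed values).
    ET.SubElement(chan, "title").text = "Site status (example feed)"
    ET.SubElement(chan, "link").text = "http://status.example.org/"
    ET.SubElement(chan, "description").text = "Status notices relayed from IRC"
    for when, text in statuses:
        item = ET.SubElement(chan, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "pubDate").text = format_datetime(when)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    line = ":alice!a@host PRIVMSG #wikimedia-tech :!status Databases are read-only"
    status = extract_status(line)
    if status is not None:
        print(build_rss([(datetime.now(timezone.utc), status)]))
```

The feed would still have to be hosted somewhere that stays reachable during an outage, which this sketch does not address.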
Where would you place that status page so it doesn't get "slashdotted" on a Wikipedia outage? A long time ago, there was an external page serving that purpose, which was taken down on Wikipedia failures. Now Wikipedia traffic is orders of magnitude greater. An appropriate place to set the messages could be the toolserver (if WM-DE is OK with it), independent but nearby. However, it only makes sense if the source of the problem isn't in esams itself! I can't think of a scenario where the squids present the error message, the toolserver is not accessible, and which isn't trivially solved by rerouting to Tampa. Nonetheless I feel there might be an unsuspected problem there.
(In reply to comment #1)
> Where would you place that status page so it doesn't get "slashdotted" on a
> Wikipedia outage?
>
> A long time ago, there was an external page serving that purpose, which was
> taken down on Wikipedia failures. Now Wikipedia traffic is orders of
> magnitude greater.
>
> An appropriate place to set the messages could be the toolserver (if WM-DE
> is OK with it), independent but nearby. However, it only makes sense if the
> source of the problem isn't in esams itself!
> I can't think of a scenario where the squids present the error message, the
> toolserver is not accessible, and which isn't trivially solved by rerouting
> to Tampa. Nonetheless I feel there might be an unsuspected problem there.

Toolserver did cross my mind. Alternatively, use a completely separate service such as a hosted blog or, as an increasing number of services do, use Twitter.

Another option: could the "Site down" notice be modified so that it draws a short status string from somewhere and presents it to users?
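To make the last idea a bit more concrete, here is a minimal Python sketch of a notice that includes a short status string only when one can be fetched, and falls back to the generic text otherwise. Everything in it is hypothetical: the function name, the message wording, the length limit, and the notion that the notice is rendered by Python at all (the real error page is served by the caching layer). The point is only that a failing status source must never break the notice itself.

```python
import html

# Placeholder wording, not the real error message.
GENERIC_MESSAGE = "Our servers are currently experiencing a technical problem."
# Assumed cap so that a runaway status string cannot flood the page.
MAX_STATUS_LENGTH = 200


def render_notice(fetch_status):
    """Build the notice HTML. `fetch_status` is any callable returning a
    short status string (or None); if it fails or returns nothing, only
    the generic message is shown."""
    try:
        status = (fetch_status() or "").strip()
    except Exception:
        # The status source being unreachable must not break the notice.
        status = ""
    parts = ["<p>" + html.escape(GENERIC_MESSAGE) + "</p>"]
    if status:
        shortened = status[:MAX_STATUS_LENGTH]
        parts.append('<p class="status">' + html.escape(shortened) + "</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_notice(lambda: "Databases are read-only; fix in progress."))
```

Where the status string lives (toolserver, an external host, a file pushed to the caches) is exactly the open question raised in comment #1; the sketch deliberately leaves that behind the `fetch_status` callable.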
message bot already spews to twitter. http://twitter.com/wikimediatech
Linking to http://identi.ca/wikimediatech instead of Twitter would be more consistent with our values. Still, I don't think the Server admin log is appropriate as a source of general status information. Make a feed with #wikipedia-tech topic? :)
IRC is useful for many people. So are identi.ca and twitter, client (or app) and browser based. We should provide a few routes, not just one. There's no need not to tell people about IRC as one of those. If anything's up IRC will surely get to know of it. I was in #wikipedia on 24 May and demands weren't unreasonable, posted a message there now and then, people got the idea. Easy.

Suggestion:

<Standard and user-friendly generic error message>

If this persists more than a few minutes, the current status and updates can be viewed at:
* IRC: <channel details> <http://web link> (web based)
* identi.ca: <details>
* Twitter: wikimedia-network-status <http://search.twitter.com/search?q=wikimedia-network-status> (web based)
* Our external status pages: <list>

Almost-current versions of articles can be read from the following cache websites:
* <list>
(In reply to comment #5)
> IRC is useful for many people. So are identi.ca and twitter, client (or app)
> and browser based. We should provide a few routes, not just one.

That's bug 16043. This bug asks for
> some easy, quick to update and accessible method of
> telling users what is going on.

http://status.wikimedia.org/ seems the way, but sysadmins need to decide how to update it with notices.

[Link to largely unhelpful discussion, just for historical purposes: http://thread.gmane.org/gmane.org.wikimedia.foundation/52853 .]
Guillaume, Sumana, Tilman, Matthew or whoever is responsible for this: do we have *any* location right now where users can expect to find information about (if not report) current outages and technical problems, which could be linked from the error page? As of bug 16043 comment 24, the new (varnish) error page won't have even a link to IRC, while it would be nice if it gave some directions.

status.wikimedia.org doesn't give any updates; I think even Twitter would be better than nothing, but I don't remember https://twitter.com/wikimedia consistently reporting such information with less than a few hours' delay. Perhaps https://wikitech.wikimedia.org/view/Server_admin_log would be a suitable target? It's both more open for posting (hence more complete) and "moderated" (by editing the wiki). It mostly contains obscure information, but during outages the top lines will probably be about what people are looking for; informative messages can easily be made in bold.

This issue has seen no progress in years... Can we find a simple solution, and who is in a position to make a decision on the topic?
Just post stuff /both/ to Identi.ca and Twitter (with the most popular accounts or hashtags) with a simple IRC-to-Identi.ca/Twitter bot. Won't hurt.
(In reply to comment #7 by Nemo)
> Perhaps https://wikitech.wikimedia.org/view/Server_admin_log would be a
> suitable target?

See comment 4 - it's likely too techy.

This request has some bikeshed potential - is there a defined scope for which kinds of issues users should be informed about? (I probably shouldn't ask this, to keep this focused.)
Hello everyone, Apologies for being late to this discussion. Is the sort of information we are currently exposing at http://status.wikimedia.org the kind of information you are looking for? Or something else? Thanks.
Hello Ken.

(In reply to comment #10)
> Is the sort of information we are currently exposing at
> http://status.wikimedia.org the kind of information you are looking for? Or
> something else?

Something else. status.wikimedia.org reports only the worst cases of downtime (when sites are not even accessible), for some of the services. What's needed is information on whether the sites are functioning (e.g. up, down, read only, r/w but there's a fatal if you try to save, Europe cut off) and what's being done about it. A recent example could be https://status.github.com/messages
Change 97190 had a related patch set uploaded by Nemo bis: Add Twitter account to Varnish's error page https://gerrit.wikimedia.org/r/97190
(In reply to comment #12)
> Change 97190 had a related patch set uploaded by Nemo bis:
> Add Twitter account to Varnish's error page
>
> https://gerrit.wikimedia.org/r/97190

I think this proposed change might mistakenly give the impression that the "wikimedia" Twitter account is used to provide site status information and it's definitely not, even during actual outages and issues.
(In reply to comment #13)

Comment 4 notes that Twitter is not really aligned with Wikimedia's open source values, though I believe identi.ca has ceased to exist in the time since comment 4 was made. :-/
Copy from gerrit comments:

> Dzahn: didn't you mean https://twitter.com/wikimediatech instead of https://twitter.com/wikimedia ?

...snip...

I disagree with using the wikitech logs because most end users will not understand what they mean eg:

<p858snake|l> most end users will not know what "cp1002 hdd is full" or "fenari is in swap" or perhaps "exim is being stupid" means
<p858snake|l> or how that relates to why its boke
(In reply to comment #13)
> (In reply to comment #12)
> > Change 97190 had a related patch set uploaded by Nemo bis:
> > Add Twitter account to Varnish's error page
> >
> > https://gerrit.wikimedia.org/r/97190
>
> I think this proposed change might mistakenly give the impression that the
> "wikimedia" Twitter account is used to provide site status information and
> it's definitely not, even during actual outages and issues.

It's definitely been used for major outages, see e.g.:

https://twitter.com/Wikimedia/status/232469652691894272
https://twitter.com/Wikimedia/status/232519974663643136
https://twitter.com/Wikimedia/status/350485792956755968
https://twitter.com/Wikipedia/status/398888528039276544 (was retweeted by @wikimedia, too)

Since mid-2011, Twitter has been listed as a communications tool for such cases at https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public .

Of course it's a matter of judgment how severe an incident needs to be to be reported on @wikimedia. Issues that don't affect a lot of users, or short outages, may indeed not be covered there. The wording in the patch ("You may be able to get further information in Wikimedia's <a href="https://twitter.com/wikimedia" >Twitter feed</a>") should be sufficiently non-committal.
It would be interesting to have a guesstimate of how many views of that error page happen to coincide with something that would trigger an update of that Twitter handle, i.e. whether it's mostly seen during major outages (a lot of views in rare events) or minor ones (fewer views but many more events). Since IRC was removed, the error message no longer provides any way (however hard) to get truly up-to-date information. I don't know, however, whether that's a goal; maybe not.
I don't think it's an issue if folks check out @wikimedia from an error message and find no updates there, as long as the message is worded accordingly. The proposed message in https://gerrit.wikimedia.org/r/#/c/97190/ already says "may be", which I think is sufficient, but we could also add "in case of ongoing outages" since we'll likely never tweet something for an intermittent site issue.
(In reply to comment #16)
>> I think this proposed change might mistakenly give the impression that the
>> "wikimedia" Twitter account is used to provide site status information and
>> it's definitely not, even during actual outages and issues.
>
> It's definitely been used for major outages[...]

Yes, it has been used previously. But site outages and issues happen 24/7 and I can assure you we've had many outages and large site issues of varying strengths that have gone unreported to Twitter. There's also the issue of tweets coming post-incident (see below).

> [...] see e.g.:
>
> https://twitter.com/Wikimedia/status/232469652691894272
> https://twitter.com/Wikimedia/status/232519974663643136
> https://twitter.com/Wikimedia/status/350485792956755968
> https://twitter.com/Wikipedia/status/398888528039276544 (was retweeted by
> @wikimedia, too)

A user visits Wikipedia and sees an error page. They refresh or come back a few minutes later and the site is back. In only one of the four cases mentioned here would there have been any useful information from Twitter. In three of the four cases, the message was put out after the site issue was resolved (e.g., "Site back to normal after problems affecting logged-in users."). Any user who saw the Wikimedia error message and clicked over to Twitter would not have been provided any useful information.

If we insist on including a link to Twitter, I think it might be better to include a link such as <https://twitter.com/search?q=wikipedia+down>. That's how a user can actually determine whether the site is having issues during an actual outage. Otherwise we will simply be directing users to a feed (@wikimedia) of "check out this project on Wikisource" or "see the Commons image of the day" when the sites are inaccessible. That doesn't seem ideal to me.