Last modified: 2011-03-13 18:06:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in; beyond displaying bug reports and their history, links may be broken. See T23179, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 21179 - Random page redirects are incorrectly performed with HTTP 302
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Component: Redirects (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Lowest minor (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2009-10-19 11:46 UTC by Bogdan Stancescu
Modified: 2011-03-13 18:06 UTC (History)
CC: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
- Patch for includes/OutputPage.php (3.25 KB, patch), 2009-10-19 11:46 UTC, Bogdan Stancescu
- Diff for includes/OutputPage.php (4.46 KB, patch), 2009-10-19 11:48 UTC, Bogdan Stancescu
- Diff for includes/specials/SpecialRandompage.php (743 bytes, patch), 2009-10-19 11:49 UTC, Bogdan Stancescu

Description Bogdan Stancescu 2009-10-19 11:46:57 UTC
Created attachment 6682 [details]
Patch for includes/OutputPage.php

When a user requests a random page, the MediaWiki software responds with an HTTP 302 "Found" status code. According to RFC 2616, HTTP status code 302 is meant to be used when a resource has been temporarily moved. As such, "Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests." Of course, this is not appropriate for random pages -- and the result of this implementation is that spiders improperly index random pages.

I have changed the code so as to provide the proper response, i.e. HTTP status code 303 "See Other", which seems far more appropriate: "This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource. The new URI is not a substitute reference for the originally requested resource. The 303 response MUST NOT be cached, but the response to the second (redirected) request might be cacheable."

I'm attaching patches to this bug. The diffs are made against the latest versions of the files in the SVN repository (OutputPage.php Revision 57608, SpecialRandomPage.php Revision 55188).
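The actual patches are against MediaWiki's PHP (includes/OutputPage.php and includes/specials/SpecialRandompage.php) and are not reproduced here; as a rough illustration of what the proposed change amounts to, here is a minimal Python sketch of choosing the status line and cache headers for the redirect, with the function name and header choices invented for the example:

```python
# Illustrative sketch only, not MediaWiki's actual code: the proposed
# change amounts to sending 303 instead of 302 for random-page
# redirects, and making the response explicitly uncacheable.

def redirect_headers(location, random_target=False):
    """Build a minimal status line and header list for a redirect.

    302 "Found" (the pre-patch behavior) marks the target as a
    temporary stand-in for the requested URI; 303 "See Other" (the
    patch's proposal) says the target is NOT a substitute reference
    and that the response MUST NOT be cached (RFC 2616, 10.3.4).
    """
    status = "303 See Other" if random_target else "302 Found"
    headers = [("Location", location)]
    if random_target:
        # RFC 2616: "The 303 response MUST NOT be cached"
        headers.append(("Cache-Control", "no-cache"))
    return status, headers
```

For example, `redirect_headers("/wiki/Some_Article", random_target=True)` would yield a 303 status line, while an ordinary redirect keeps the 302.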
Comment 1 Bogdan Stancescu 2009-10-19 11:48:47 UTC
Created attachment 6683 [details]
Diff for includes/OutputPage.php
Comment 2 Bogdan Stancescu 2009-10-19 11:49:26 UTC
Created attachment 6684 [details]
Diff for includes/specials/SpecialRandompage.php
Comment 3 Platonides 2009-10-19 16:17:38 UTC
Could you use unified diffs?
Comment 4 Bogdan Stancescu 2009-10-19 18:32:29 UTC
This was a manual diff, as I don't have a working copy checked out, so I decided to provide a bit more context for the patch utility to work with in case someone changes the repository before this patch makes it in. Given that patch works fine with both formats, and that I wouldn't include SVN-specific material anyway, would it really help if I provided unified diffs in addition to the current ones?
Comment 5 Brion Vibber 2009-10-19 22:51:47 UTC
Are you aware of anything that specifically treats a 303 differently from a 302?

I'm a little worried that I don't see clear evidence of actual useful support in a couple quick web searches (like say a page from Google's webmaster guidelines), while the spec itself says that a 302 is safer and does the same thing:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

      Note: Many pre-HTTP/1.1 user agents do not understand the 303
      status. When interoperability with such clients is a concern, the
      302 status code may be used instead, since most user agents react
      to a 302 response as described here for 303.
Comment 6 Bogdan Stancescu 2009-10-20 00:19:44 UTC
The RFC you're quoting is almost ten years old; I'm sure all compatibility wrinkles have been sorted out by now. As for whether Google or any other piece of software follows standards, I don't think we should be concerned with that -- we should follow them regardless of what others are doing.

Having said that, I think the Googlebot actually does distinguish between the two, at least judging by the fact that they document them distinctly here: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40132 ("The server returns this code when the requestor should make a separate GET request to a different location to retrieve the response.")
Comment 7 Brion Vibber 2009-10-20 00:29:32 UTC
That page just mirrors the RFC and doesn't describe any distinction in Googlebot's crawl behavior or indexing behavior. In the absence of documented evidence that search indexes treat these differently, dropping as WONTFIX.
Comment 8 Bogdan Stancescu 2009-10-20 09:56:05 UTC
That's a surprising decision. You know our implementation is incorrect, you know how to fix it, you have the actual code to fix it, and yet you refuse to fix it because you have no evidence Google would make the difference between the incorrect implementation and the correct implementation. This should be fixed even if we had evidence that Google DIDN'T make the difference -- I simply don't understand why you'd make a voluntary decision not to follow standards based on the assumption that following them might produce the same results as not following them.
Comment 9 Brion Vibber 2009-10-20 13:39:27 UTC
On the contrary, the spec specifically says our behavior is both correct and more compatible. Please provide actual arguments in favor of a change if you wish to continue.
Comment 10 Bogdan Stancescu 2009-10-20 14:06:52 UTC
Fair enough, let's analyze the specification. But first, let's define the concepts we're using. When you access the URI for Special:Randompage you don't identify a specific resource, but rather make a request for a service (random redirection). As such, the URI for Special:Randompage MUST NOT be associated with whatever resource ends up being served as a result; in that specification's context we can exchange "MUST NOT be associated" with "MUST NOT be cached", since they're basically the same thing.

The specification for 302 reads "The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests." When I access Special:Randompage, did the resource at Special:Randompage move temporarily under a different URI? As such, should the client keep using the Special:Randompage URI in order to reach the (random) resource it ended up with? (The former is obviously incorrect, and the latter is Google's current behavior, which forces us to use robots.txt in order to disallow access to random pages.)

The spec for 303 on the other hand reads "The response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource. This method exists primarily to allow the output of a POST-activated script to redirect the user agent to a selected resource. The new URI is not a substitute reference for the originally requested resource." This is altogether more appropriate for our purpose: the specification itself says the request is processed by a script and provides a redirection (precisely what we're doing); in addition the new URI is explicitly said not to be a substitute reference for the original request -- and that's exactly what we're after.

If that wasn't enough, the spec for 302 reads "This response is only cacheable if indicated by a Cache-Control or Expires header field", whereas the spec for 303 specifically states "The 303 response MUST NOT be cached".

Regarding your concern with the compatibility note, that reads "When interoperability with [pre-HTTP/1.1 user agents] is a concern, the 302 status code may be used instead". Which pre-HTTP/1.1 agents are you concerned about? Also, even if we could conceivably find some archaic HTTP 1.0 clients still in sporadic use today, we'd only hinder their access to the Random page functionality, which is in no way a core functionality in MediaWiki -- so basically we'd be concerned with a negligible minority being negligibly affected.
Comment 11 Brion Vibber 2009-10-20 14:51:20 UTC
The spec's not at issue.

Really what we need here is some evidence that 1) they actually behave differently so there's some benefit to changing and 2) client support is consistent enough (including in hacked-together client-side bot tools) that there's no downside.
Comment 12 Bogdan Stancescu 2009-10-20 15:14:48 UTC
In your previous reply you asserted the specification "specifically says our behavior is both correct and more compatible" -- now the spec's not an issue. I'm not sure what I should address, if not the issues you're raising.

Regarding benefits, your logic is fallacious: the fact that Google or whoever else does or does not abide by standards doesn't mean that OUR abiding by standards isn't beneficial -- the two statements are not dependent on one another. Let's say Google completely disregards status codes (which it doesn't; we know there's a difference between the way it treats 301 and 302, we're just not sure about 302 versus 303). If everybody uses 301, 302 and 303 properly then Google will take note, even if it hadn't so far. And since MediaWiki is widely deployed, this is one of the places where we actually have some leverage to push for standards. And then again -- this is only IF Google disregards the difference, which we don't know for sure.

As for your other argument, I'm sorry, but that's nothing short of preposterous. We'll never find a study regarding the behavior of "hacked-together client-side bot tools" in respect to HTTP status code 303 -- but if you do have a study regarding said tools' behavior in respect to 302 I'd love to read it. Of course you don't, but if we were to continue this discussion you'd say they worked so far -- but by that rationale we'd be forced never to change anything, which I'm sure you'd disagree with in other respects. Lack of proof for something unprovable is not an argument.

I'm not reopening this -- we've been running in circles for a couple of exchanges and the fact that you're asking me to prove something which is utterly impossible to prove tells me this conversation is useless.
Comment 13 P.Copp 2009-10-20 15:28:42 UTC
(In reply to comment #10)
> Regarding your concern with the compatibility note, that reads "When
> interoperability with [pre-HTTP/1.1 user agents] is a concern, the 302 status
> code may be used instead". Which pre-HTTP/1.1 agents are you concerned about?
> Also, even if we could conceivably find some archaic HTTP 1.0 clients still in
> sporadic use today, we'd only hinder their access to the Random page
> functionality, which is in no way a core functionality in Mediawiki -- so
> basically we'd be concerned with a negligible minority being negligibly
> affected.
> 

IIRC, Wikimedia's squid servers still use HTTP/1.0. It wouldn't be very standards-compliant for them to respond with "HTTP/1.0 303 See Other", as this status code doesn't exist in 1.0 :)
Comment 14 Bogdan Stancescu 2009-10-20 15:38:50 UTC
That's a legitimate concern -- Wikimedia's servers currently respond "HTTP/1.x 302 Moved Temporarily", which I assume means the response is intended to be backwards-compatible. As such, switching the whole thing to HTTP/1.1 might involve more complications than I had anticipated.
Comment 15 Brion Vibber 2009-10-20 16:23:18 UTC
Bogdan, what I mean by "the spec's not at issue" is that we don't disagree about what the spec says -- we're disagreeing on the importance and relative benefits/risks of using an HTTP 1.1-only feature that is more semantically correct.

Being that we interop with both HTTP 1.0 (really, HTTP 1.0 plus the Host header...) and "real" 1.1 clients, using an HTTP 1.1-only response doesn't make much sense to me without a clear benefit. The spec explicitly says that 302 is a correct and backwards-compatible usage, so in the absence of a practical benefit I believe sticking with 302 is the best behavior.
Comment 16 Bogdan Stancescu 2009-10-20 16:33:54 UTC
I wasn't aware of Wikimedia's servers' current behavior -- now I have seen the light and I completely understand this is not worth the hassle in that particular context (it may not be worth it even if Google treats it properly, since the problem is easily avoided altogether with robots.txt).

Having said that, MediaWiki is a stand-alone product -- how difficult would it be to implement this on an optional basis, so as to allow other installations to benefit from the improvements HTTP 1.1 offers in this regard?
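An opt-in along these lines might look like the following sketch (again Python rather than MediaWiki's PHP, purely for illustration; the setting name use_303_for_random is invented here, not an actual MediaWiki configuration variable). The version guard reflects the squid concern raised above, since 303 does not exist in HTTP/1.0:

```python
# Hypothetical opt-in: installations behind HTTP/1.1-capable stacks
# could enable 303, while everyone else keeps the safe default.

def random_redirect_status(use_303_for_random, client_http_version):
    """Pick the redirect status code for Special:Random.

    use_303_for_random: a (hypothetical) site configuration flag.
    client_http_version: protocol version as a (major, minor) tuple.
    """
    if use_303_for_random and client_http_version >= (1, 1):
        return 303
    return 302  # safe, backwards-compatible default
```

With the flag off, or for an HTTP/1.0 client, behavior is unchanged from today's 302.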
Comment 17 Bogdan Stancescu 2009-10-28 23:47:27 UTC
For the record, I tested this on my own MediaWiki installation -- Brion's intuition was right, Google doesn't discriminate between 302 and 303. That's quite surprising to me, but it's the way it is.
Comment 18 Bogdan Stancescu 2009-10-29 13:46:15 UTC
Update: I brought this up explicitly on the Google Webmaster Central forum: http://www.google.com/support/forum/p/Webmasters/thread?tid=7f545a23e5276203&hl=en

While I haven't (yet) received a definitive answer on how Google treats status code 303, it appears the most useful approach is to use rel=canonical in order to specify the proper URL. Interestingly, MediaWiki already does that for page redirects -- but only for page redirects (search for "canonical" in includes/Article.php). Why isn't that included unconditionally?
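For readers unfamiliar with the mechanism: emitting the canonical link unconditionally would mean every rendered page declares its own preferred URL in the HTML head. A minimal sketch of such an emitter, in Python for illustration only (MediaWiki's actual output code is PHP, and the helper name here is invented):

```python
from html import escape

def canonical_link(canonical_url):
    """Return a <link rel="canonical"> element for the page <head>.

    Escaping the URL guards against query strings (e.g. "&") breaking
    the attribute value.
    """
    return '<link rel="canonical" href="%s"/>' % escape(canonical_url, quote=True)
```

A wiki would emit this on every article view with the article's canonical URL, so crawlers attribute whatever URL they arrived through (including a random-page redirect) to the proper page.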
