Last modified: 2011-01-16 00:48:50 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T4585, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 2585 - Server should return a 404 HTTP status code if the page does not exist
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.16.x
Hardware: All All
Importance: Normal enhancement with 9 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch, patch-need-review
Duplicates: 26282
Depends on:
Blocks:
Reported: 2005-06-28 19:13 UTC by Théo
Modified: 2011-01-16 00:48 UTC
CC: 19 users



Attachments
trivial hack to output 404 responses for pages (and Special pages) that don't exist (2.43 KB, patch)
2005-08-15 18:52 UTC, Rowan Collins [IMSoP]
Trivial patch to return 404 in the File namespace (318 bytes, patch)
2011-01-11 09:17 UTC, wikimedia.bugzilla

Description Théo 2005-06-28 19:13:14 UTC
MediaWiki does not use HTTP/1.1 status codes. The list is here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html .
MediaWiki could use them to help some users.
When a page doesn't exist, MediaWiki returns code 200 (OK). When a special page
is not found, MediaWiki returns code 200, and when the page contains a redirect,
it's 200 too.
That is not very good.
And it's easy to add this feature...
A simple PHP function can do it!
We have to decide which error codes to return. There are choices to make!
(A 404 error code could be returned when a special page isn't found.)

Excuse me for my poor English :(
Comment 1 Brion Vibber 2005-06-28 19:24:17 UTC
(Changed summary to indicate that it does _not_ do this presently)
Comment 2 Gregory Szorc 2005-07-11 15:18:26 UTC
This enhancement would benefit search engines as well.

Using the PHP header($s) function, the following values of $s should be used when specific conditions are met:

Page Not Found:  $s = "HTTP/1.1 404 Not Found"

Page is a redirect: $s = "HTTP/1.1 301 Moved Permanently"
   or $s = "HTTP/1.1 307 Temporary Redirect" 

For the 404 error, page content can be output normally.  This status code will prevent spiders from indexing pages that do not exist.

The redirect case takes a little more work.  Ideally, the 301 or 307 header should be followed by a Location header.  The client will then submit a second HTTP GET request.  However, if we still want to display the "Redirected from" message on a page, we will have to use some $_SESSION magic before the header("Location: foo") call.  The Location header output should be immediately followed by a die() or similar function call to stop script execution.
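
The status-line choices proposed in comment 2 can be sketched as a small pure function. This is only an illustration in Python, not MediaWiki code: `choose_response`, its parameters, and the example target path are hypothetical, and the function only models which status line and headers would be sent.

```python
from typing import Dict, Optional, Tuple


def choose_response(exists: bool,
                    redirect_target: Optional[str] = None,
                    permanent: bool = True) -> Tuple[str, Dict[str, str], bool]:
    """Return (status_line, extra_headers, send_body) for a page request.

    Hypothetical helper that only models the rules proposed in comment 2.
    """
    if not exists:
        # The page body is still sent; the status code alone tells spiders
        # not to index it.
        return "HTTP/1.1 404 Not Found", {}, True
    if redirect_target is not None:
        status = ("HTTP/1.1 301 Moved Permanently" if permanent
                  else "HTTP/1.1 307 Temporary Redirect")
        # A redirect status is followed by a Location header and no page body.
        return status, {"Location": redirect_target}, False
    return "HTTP/1.1 200 OK", {}, True
```

Note that the redirect branch is the part comment 3 goes on to object to; the 404 branch is what the rest of this bug is about.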
Comment 3 Rowan Collins [IMSoP] 2005-08-12 21:36:07 UTC
Using 3xx codes is normally considered unsuitable for redirects,
because it requires both extra effort to display the page correctly (special
session or parameter handling; not to mention how caching would work) and extra
server load to display *anything* (the server has to process two requests
instead of one). 

A 404 response for non-existent pages / Special pages would however be very
useful for those writing automated tools which interact with MediaWiki only by HTTP.
Comment 4 Rowan Collins [IMSoP] 2005-08-15 18:52:18 UTC
Created attachment 790
trivial hack to output 404 responses for pages (and Special pages) that don't exist

This is a very quick hack (which seems to work for me) which returns HTTP 404
if the request is for a non-existent page or a non-existent Special: page. 

Unfortunately, MediaWiki still does all the parsing and everything when you do
a HEAD request (it just gets thrown away), so it's probably still "cheaper" to
use Special:Export. Ideally, a HEAD request would result in a kind of "null
view" where nothing much was done, but I'm not sure at what level to insert
that to ensure that some meaningful headers are still generated.
Comment 5 Rowan Collins [IMSoP] 2005-08-15 22:10:46 UTC
See also bug 3161, which deals with returning a 404 from Special:Export.
Comment 6 Brion Vibber 2005-08-17 00:30:58 UTC
Note that we _need_ to do all the work, as HTTP requires that all the headers on
a HEAD request match those that would be sent with a GET. That means things like
Content-length need to match...
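
A rough Python sketch of the constraint described in comment 6, assuming a page that has already been rendered to bytes. The function names are hypothetical; the point is only that the HEAD headers are derived from the same rendering as GET, so the rendering cost cannot be skipped.

```python
from typing import Dict, Tuple


def render_get(body: bytes) -> Tuple[Dict[str, str], bytes]:
    """Build the headers and body a GET would send for a rendered page."""
    headers = {
        "Content-Type": "text/html; charset=UTF-8",
        "Content-Length": str(len(body)),
    }
    return headers, body


def render_head(body: bytes) -> Tuple[Dict[str, str], bytes]:
    """Same headers as GET, but with the body dropped.

    The page still has to be rendered to learn its length, which is the
    cost described in comment 6.
    """
    headers, _ = render_get(body)
    return headers, b""
```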
Comment 7 Zigger 2005-08-17 01:23:20 UTC
&action=raw can be used to get a 404, although IE6SP1 & Firefox 1.06 try to
download the result.
Comment 8 Gregory Szorc 2005-08-17 01:29:06 UTC
With regard to comment #7, even though a 404 is served, it doesn't mean you
can't serve content.  You could have a fully functional web site serving out
pages with 404 codes all day long and your browser wouldn't care.  Search
engines DO care.  They won't index pages sent with a 404 (or at least they
shouldn't).  Look at the 404 error of any web site sometime.  View the source. 
There is HTML there and it doesn't come from your browser.
Comment 9 Brion Vibber 2005-10-24 02:44:47 UTC
Several users have reported being persistently unable to access nonexistent
pages, including edit pages, leading to inability to create new pages to access
user pages of new users and IP addresses.

I haven't been able to reproduce the problems locally; it may be specific to
certain versions or due to odd settings or local proxies that misbehave.

Reverting changes to index.php and Article.php which produce 404 response codes,
and reopening bug.
Comment 10 Gregory Szorc 2006-05-01 18:06:20 UTC
I have made the following changes to MediaWiki 1.6.3:

Line 751 of Article.php has the following:

#Set status code to 404 if article doesn't exist
if ($this->mTitle->getArticleID() === 0) $wgOut->setStatusCode(404);
Comment 11 Keul 2007-08-21 15:56:05 UTC
> Several users have reported being persistently unable to access nonexistent pages, including edit pages
Several scientists have reported being persistently unable to access the last digits of pi, including its binary and decimal digits.

When you use a text editor, it warns you that the file does not exist and asks whether to create it before editing.
Wikimedia does not warn Googlebot and users that the file is a 404 before editing.

I think the big problem is the creation of articles:
- wikipedia.org/AntarticRabbits should return 404 (with whatever you want in it, maybe an article creation form, as long as there's a 404 error code)
- wikipedia.org/CreateArticle:AntarticRabbits, with CreateArticle as the page NAME and AntarticRabbits as the PARAMETER, which is the name of the article you want to create, should return code 200

Google should index wikipedia.org/CreateArticle but not wikipedia.org/CreateArticle:*, because the parameter has infinitely many possible values, including names of articles that don't exist.
I think robots.txt could handle this.
All red links to nonexistent articles should point to wikipedia.org/CreateArticle:name_of_article.

Advantages:
- 404 error code for articles that don't exist.
- users can create articles with a logical URI/link
- Google can immediately remove articles from its database if an article was removed by a moderator

Inconveniences:
- none
Comment 12 Roan Kattouw 2007-08-21 21:18:25 UTC
(In reply to comment #11)
> Inconveniences:
> - none

Oh yes there are. You're making new page creation overly complicated just to facilitate something as silly as 404s (which I think shouldn't happen at all: technically, you're requesting /w/index.php?title=Name_of_article and a 404 would mean index.php doesn't exist, which it does). The whole CreateArticle: thing is crazy: you're introducing a special namespace-like keyword in front of the title which will require a whole new framework to be parsed, and you're making it impossible for people to have a CreateArticle namespace (not that it's a common namespace name, but it's a matter of principle). What is the reason for having separate pages that return 404 and 200 anyway? Is there a problem with returning 404 and a page? There shouldn't be, as long as the page HTML is >512 bytes.
Comment 13 Roan Kattouw 2007-08-21 21:19:34 UTC
Oh and as for Google, there's always $wgOut->setRobotpolicy( "noindex,follow" ); which is already used for nonexistent special pages.
Comment 14 Stephen Bain 2007-12-17 10:26:38 UTC
Another possibility would be setting 410 for deleted pages. Arguably this is more important for the timely updating of search engine caches, and I doubt there would be the same problems with badly-configured clients or proxies.
Comment 15 Brion Vibber 2007-12-18 01:38:11 UTC
(In reply to comment #12)
> technically, you're requesting /w/index.php?title=Name_of_article and a 404
> would mean index.php doesn't exist, which it does

Errors apply to URLs; URLs include query strings. They do not apply only to parts of URLs.

> Is there a problem with returning 404 and a page? There shouldn't be, as
> long as the page HTML is >512 bytes.

When we instituted this (r11307) we did in fact have problems -- we got many complaints from contributors who received error pages in their browser. Unfortunately I was never able to reproduce it so we didn't get enough data to really debug it.

One possibility is that there were proxies which filtered the data.

Another is that some of the browsers didn't obey the expected size limit (eg a lower limit than expected).

Another is that we had pages legitimately below the limit and failed to pad them... but this seems like it should be unlikely given the size of the pages.
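
For illustration, padding a short error body past a size threshold could look roughly like the Python sketch below. `pad_error_body` is a hypothetical helper, and the 512-byte default is simply the figure quoted from comment 12; as noted above, some browsers may have applied a different limit.

```python
def pad_error_body(body: bytes, min_bytes: int = 512) -> bytes:
    """Pad an HTML error body so that it is strictly longer than min_bytes.

    Hypothetical helper; the 512-byte default is the figure quoted in
    comment 12, not a value taken from any browser's documentation.
    """
    shortfall = min_bytes + 1 - len(body)
    if shortfall <= 0:
        return body  # already above the threshold
    opener, closer = b"<!-- ", b" -->"
    # The filler is an HTML comment, so it does not change what is displayed.
    filler_len = max(shortfall - len(opener) - len(closer), 0)
    return body + opener + b"x" * filler_len + closer
```

Appending the comment after the closing tags keeps the sketch short; a real implementation would more likely place the filler inside the document.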



(In reply to comment #14)
> Another possibility would be setting 410 for deleted pages. Arguably this is
> more important for the timely updating of search engine caches, and I doubt
> there would be the same problems with badly-configured clients or proxies.

410 sounds like it might be more appropriate for deleted pages than general non-existent pages, and definitely not appropriate for edit pages.
Comment 16 Tristan Miller 2007-12-18 19:32:25 UTC
See Bug 12345 for the 410 issue.
Comment 17 Dan Jacobson 2008-01-30 01:16:02 UTC
Whatever you do, remember that "categories with no content, but with
members" should still return 200, and not 404. See also
http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/36416
and Bug 7897: My http://radioscanningtw.jidanni.org/ and
http://taizhongbus.jidanni.org/ are wikis with many purposefully
"empty" category pages with lots of radio frequency and bus stop
category members.
Comment 18 Dan Jacobson 2008-05-04 03:14:28 UTC
Regarding using the API to check as a workaround,
In http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/37879
"RK" == Roan Kattouw says

RK> First, save up a list of articles you wanna check.

Several lists in fact are needed, one for each wikimedia site that one
wants to check.

RK> When you've got a couple hundred of them (or have run out of
RK> articles to check), issue an API request like:

RK> http://en.wikipedia.org/w/api.php?action=query&titles=Dog|WP:WAX|Jidanni|Talk:|Indian_elephant

RK> It returns some basic data (namespace and existence) for every
RK> article...

And then one needs to parse that all again too...

All because MediaWiki is so special a Web 2.0 new technology that old
fashioned 404s are for squares.

Brion> As noted there we had problems when originally implementing it,
Brion> which may or may not be legit or continuing, and we haven't got 'round to
Brion> reimplementing it.

OK, glad to know that 404s are not old-fashioned, and best wishes for
a speedy implementation.
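
The batched API workaround quoted in comment 18 could be scripted roughly as follows. This is a sketch under assumptions: `format=json` is added to the query, the response is assumed to list pages under `query.pages` with a `missing` key on nonexistent ones, and the sample response is illustrative rather than captured from a real server. The function names are hypothetical.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_query_url(api_url: str, titles: List[str]) -> str:
    """Build one batched existence query; titles are joined with '|'."""
    params = {"action": "query", "titles": "|".join(titles), "format": "json"}
    return api_url + "?" + urlencode(params)


def existence_map(response_text: str) -> Dict[str, bool]:
    """Map each returned title to True if it exists, False if marked missing."""
    pages = json.loads(response_text)["query"]["pages"]
    return {page["title"]: "missing" not in page for page in pages.values()}


# Illustrative response only; not captured from a real server.
SAMPLE = ('{"query": {"pages": {'
          '"4269567": {"ns": 0, "title": "Dog"}, '
          '"-1": {"ns": 0, "title": "Jidanni", "missing": ""}}}}')
```

This is the extra parsing step comment 18 complains about: a 404 on the page URL itself would make the second request and the JSON handling unnecessary.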
Comment 19 Dan Jacobson 2008-06-16 01:10:29 UTC
One almost wishes there was a User Preference:
Give me real [X] 404's, and [X] 302's !
Comment 20 Juliano F. Ravasi 2008-08-11 20:38:36 UTC
Something that must be taken into consideration: the specification says that if the "resource" doesn't exist, the server returns 404 (or 410). But what is a "resource"? The specification is not clear about what is to be considered a resource: the full URL or only the absolute path (without the query).

The relevance of this is the question: when not using short URLs, is it possible for a user agent to interpret a 404 for "/w/index.php?title=Page_name" to be that "/w/index.php" is not found? In this case, that would be wrong and possibly bring a lot of problems.

This feature must be configurable, and default to disabled. If the above concern holds true, users should only enable this feature after configuring short URLs, so that MW only returns 404 errors for addresses that are canonically unique for their respective "resources".
Comment 21 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-08-11 21:00:05 UTC
(In reply to comment #20)
> Something that must be taken in consideration: The specification says that if
> the "resource" doesn't exist, the server returns 404 (or 410). But what is a
> "resource"? The specification is not clear about what is to be considered a
> resource: the full URL or only the absolute path (without the query).

RFC 2616 is very clear: 404 means "The server has not found anything matching the Request-URI."  The Request-URI includes the query string, if one was provided, as indicated in section 5.1.2.  Besides, any other behavior would make absolutely no sense.  This is a non-issue.

Note for reference: the old revisions dealing with this are r11307 (adding feature) and r11474 (reverting).
Comment 22 Juliano F. Ravasi 2008-08-11 23:22:03 UTC
(In reply to comment #21)
> RFC 2616 is very clear: 404 means "The server has not found anything matching
> the Request-URI."  The Request-URI includes the query string, if one was
> provided, as indicated in section 5.1.2.  Besides, any other behavior would
> make absolutely no sense.  This is a non-issue.

Check section 3.2.2 of the same RFC and you will see where the ambiguity lies. Quoting:

   3.2.2 http URL

   (...)
   http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

   If the port is empty or not given, port 80 is assumed. The semantics
   are that the identified resource is located at the server listening
   for TCP connections on that port of that host, and the Request-URI
   for the resource is abs_path (section 5.1.2).

Emphasis on "the Request-URI for the resource is abs_path". Note that in this paragraph, it is clear that the resource doesn't include the query string. In this sense, '/w/index.php' alone is to be considered the resource, the query string is a message passed to the resource. Then, later in 5.1.2:

   5.1.2 Request-URI

   (...)
   The most common form of Request-URI is that used to identify a
   resource on an origin server or gateway. In this case the absolute
   path of the URI MUST be transmitted (see section 3.2.1, abs_path) as
   the Request-URI, (...)

Note how it says that the Request-URI identifies the resource, referring to the abs_path, and forgets to mention the query string.

The description of 404 mentions "anything matching the Request-URI", but isn't this referring to the resource as per definition in section 3.2.2? Note the description of status codes 200, 202, 203, 206, 300, 301, 302, 303, 305, 307, 401, 405, 406, 407, 409, 410, 412, 415 and 416 (section 10.2.1 onwards) all mention "resource" specifically, in a sense that seems to be the definition used in section 3.2.2, that is, until abs_path, not including the query string. The description of 404 only mentions "resource" when pointing to 410.

Not that I see a lot of problem here... Just to note that the RFC is open to some interpretation that differs from common sense, if there is one.
Comment 23 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-08-11 23:53:27 UTC
Other parts of the RFC make it clear that they include the query string as part of the Request-URI, e.g.,

5.1 Request-Line

   The Request-Line begins with a method token, followed by the
   Request-URI and the protocol version, and ending with CRLF. The
   elements are separated by SP characters. No CR or LF is allowed
   except in the final CRLF sequence.

        Request-Line   = Method SP Request-URI SP HTTP-Version CRLF

which by your logic forbids query strings in the Request-Line.  The reference to abs_path either was not meant to be precise or is erroneous.  There's no other possible interpretation here that's remotely sane.  Also see, for instance,

   The server is refusing to service the request because the Request-URI
   is longer than the server is willing to interpret. This rare
   condition is only likely to occur when a client has improperly
   converted a POST request to a GET request with long query
   information

i.e., the query information is part of the Request-URI (otherwise it couldn't lengthen it).

Saying Request-URI can't include a query string more or less means that query strings don't exist in HTTP, since pretty much nothing in the standard would then distinguish between Request-URIs that differ only in their query strings.  Your interpretation is not possible.
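
To make the Request-Line argument concrete, here is a small Python sketch that splits a request line along the grammar quoted above and shows that the query string travels inside the Request-URI. `parse_request_line` is a hypothetical helper, not part of any HTTP library.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_request_line(line: str) -> Tuple[str, str, str]:
    """Split 'Method SP Request-URI SP HTTP-Version CRLF' into three parts."""
    method, request_uri, version = line.rstrip("\r\n").split(" ")
    return method, request_uri, version


method, request_uri, version = parse_request_line(
    "GET /w/index.php?title=Page_name HTTP/1.1\r\n")
parts = urlsplit(request_uri)
# parts.path is the abs_path; parts.query is carried inside the same
# Request-URI, so two URLs differing only in the query are distinct.
```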
Comment 24 Josef 2008-10-04 20:06:06 UTC
When considering MediaWiki as a service, e.g. in automated workflows, it would surely be helpful to be able to distinguish between existing and non-existing pages. For this purpose, there is no alternative to a 404 code for http://de.wikipedia.org/wiki//Hétérogénéité and similar non-existing pages. This matches the REST paradigm, where users can then POST to have the page created.
Comment 25 Brion Vibber 2008-12-22 23:40:36 UTC
Reimplemented this in r44919.

This is less expansive than the old 2005 implementation (r11307), hitting only page views (won't affect action=edit) and doesn't attempt to cover error conditions either (many of which should probably return a different code).

Pages which exist in the DB or return true for Title::isAlwaysKnown() such as file pages for existing files, as well as category pages that exist, are treated as existing by returning true for Article::hasViewableContent().
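
The scope described in comment 25 can be modelled in a few lines of Python. This is a hypothetical model, not the code from r44919: `status_for_request` and its boolean parameters are stand-ins, with `always_known` representing the result of Title::isAlwaysKnown().

```python
def status_for_request(action: str, exists_in_db: bool, always_known: bool) -> int:
    """Return the HTTP status code under the rules summarized in comment 25.

    Hypothetical model: 'always_known' stands in for Title::isAlwaysKnown().
    """
    if action != "view":
        return 200  # e.g. action=edit is left alone
    # Mirrors the idea behind Article::hasViewableContent().
    has_viewable_content = exists_in_db or always_known
    return 200 if has_viewable_content else 404
```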
Comment 26 Splarka 2008-12-27 02:54:48 UTC
Observations:

It is very annoying to have must-revalidate along with 404, as the browser cannot cache the page even with forward/backward navigation. This isn't terribly critical, but form elements are also reinitialized in most browsers in such a case. If the MediaWiki:Noarticletext has custom forms they get erased (and even the search box). This should probably become a separate bug entry (suppress must-revalidate on 404s?).

The fringe cases of ISP-hijacked, browser-overridden, seems /mostly/ harmless, since content without a page (populated missing categories, shared media without local descriptions) which can be accessed via easy means (such as via Search [GO]) are not given 404. However, one case where they are, but still GOable, is any page starting with the User: prefix (whether the user exists or not, whether the page is a base page or user subpage). Example: http://meta.wikimedia.org/w/index.php?search=User%3AIamgoingto404ville/foo&go=Go 302s to a 404 which should never happen. This should probably become a separate bug entry (never 302 to a 404?).
Comment 27 Dan Jacobson 2009-05-13 00:07:48 UTC
As seen in https://issues.apache.org/bugzilla/show_bug.cgi?id=47186, you are on one hand returning
  'badtitletext'=> 'The requested page title was invalid, empty, or an incorrectly linked inter-language or inter-wiki title.
and at the same time a cheery 200 OK. Are you sure you want to do that?
Comment 28 mac.med02 2009-10-02 18:52:03 UTC
*** Bug 6545 has been marked as a duplicate of this bug. ***
Comment 29 Manfred Krüger 2009-10-17 18:09:19 UTC
http://en.wikipedia.org/wiki/File:TestIt.jpg still yields status 200, so for files it doesn't work.
Comment 30 Brion Vibber 2009-10-19 23:57:23 UTC
Looks fine on a regular page; may be something funky about image pages.

ImagePage::hasViewableContent() correctly returns false here, so we've probably got something shortcutting and the skip doesn't end up going right. The check & 404 are in Article::showMissingArticle(), which gets called from Article::view()... in a scary scary scary loop that I don't understand at first glance. :)
Comment 31 Dan Jacobson 2009-11-15 02:28:38 UTC
As of r59081 for regular pages it still breaks if one appends a %25E9 to the URL:
$ GET -PSd http://taizhongbus.jidanni.org/index.php?title=NOSUCHPAGE
GET http://taizhongbus.jidanni.org/index.php?title=NOSUCHPAGE --> 404 Not Found
$ GET -PSd http://taizhongbus.jidanni.org/index.php?title=NOSUCHPAGE%25E9
GET http://taizhongbus.jidanni.org/index.php?title=NOSUCHPAGE%25E9 --> 200 OK
Comment 32 Platonides 2009-11-15 15:19:29 UTC
(In reply to comment #31)
> As of r59081 for regular pages it still breaks if one appends a %25E9 to the
> URL:

"The requested page title was invalid, empty, or an incorrectly linked inter-language or inter-wiki title. It may contain one or more characters which cannot be used in titles."
Not providing a 404 on such case doesn't look like a bug.
OTOH 200 might not be the best status code either, so perhaps give out a 400?
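
The difference between the two requests in comment 31 comes down to one round of URL decoding: `%25` decodes to a literal `%`, leaving `%E9` in the title. A Python sketch of the 400 option floated here follows; `status_for_title` is hypothetical, and the rule that a decoded title still containing a percent-encoded sequence is invalid is an assumption made for the example.

```python
import re
from urllib.parse import unquote

# Assumption: a decoded title that still contains '%' plus two hex digits
# is treated as invalid.
_PERCENT_SEQ = re.compile(r"%[0-9A-Fa-f]{2}")


def status_for_title(raw_title: str, existing_titles: set) -> int:
    """Hypothetical classifier: 400 for invalid titles, 404 for missing pages."""
    title = unquote(raw_title)  # one round of URL decoding
    if not title or _PERCENT_SEQ.search(title):
        return 400  # malformed title, the option floated in comment 32
    return 200 if title in existing_titles else 404
```

With this rule, NOSUCHPAGE would map to 404 and NOSUCHPAGE%25E9 to 400, rather than the 200 observed in comment 31.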
Comment 33 Alexandre Emsenhuber [IAlex] 2010-12-08 18:21:52 UTC
*** Bug 26282 has been marked as a duplicate of this bug. ***
Comment 34 wikimedia.bugzilla 2011-01-11 09:17:10 UTC
Created attachment 7970
Trivial patch to return 404 in the File namespace
Comment 35 Roger W Haworth 2011-01-14 17:56:39 UTC
Deleted or never-existed categories should return HTTP 404. I have raised this request as a separate bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=26729
Comment 36 Bawolff (Brian Wolff) 2011-01-16 00:48:50 UTC
Image page issue fixed in r80407.

Category page issue (bug 26729) in previous comment fixed in r80406.

In regards to Comment 31, I agree with Platonides: 404 is not the appropriate response code in that situation.

Marking as fixed. If I missed anything (this bug is long, and it is hard to keep track of why it was re-opened), please re-open.
