Last modified: 2010-03-15 21:03:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links may be broken. See T17663, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 15663 - Blank lines at the end of global robots.txt cause syntax problems when MediaWiki:robots.txt is appended
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/wiki/MediaWik...
Keywords: easy, shell
Depends on:
Blocks:

Reported: 2008-09-20 16:49 UTC by Ilmari Karonen
Modified: 2010-03-15 21:03 UTC
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Ilmari Karonen 2008-09-20 16:49:34 UTC
The Robots Exclusion Standard (http://www.robotstxt.org/orig.html), which defines the syntax of robots.txt files, says, among other things, that:

1. there may not be more than one record with "User-Agent: *", and
2. sections may not contain blank lines.

Thus, the only standard-conforming (and thus reasonably reliable) way for users editing the local part of robots.txt, as specified in [[MediaWiki:Robots.txt]], to include rules pertaining to all robots is to place them at the very top with no blank lines preceding them, so that they are appended directly to the "User-Agent: *" section of the global robots.txt.

Unfortunately, even this doesn't currently work right, since the global part of Wikimedia's robots.txt contains some blank lines at the end.  Please remove said lines or replace them with comments.
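To make the failure mode concrete, here is a made-up example of the merged output when the global part ends in blank lines (the paths are invented for illustration, not taken from the actual Wikimedia file):

  User-agent: *
  Disallow: /w/

  Disallow: /wiki/Example_page

A strictly conforming parser treats the "User-agent: *" record as ending at the first blank line, so the "Disallow: /wiki/Example_page" rule appended from MediaWiki:Robots.txt belongs to no record and is silently ignored.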

(Actually, the current implementation is somewhat annoyingly fragile in general.  It might be better for MediaWiki itself to parse the content of both the local and global parts of robots.txt (it's not hard), preferably with fairly relaxed parsing rules, and merge them properly into a single file guaranteed to have correct syntax.  While at it, the software could try to provide notification of any unrecognized lines and other potential errors detected during the parsing.)
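As a rough illustration of such a merge step, the sketch below parses both parts into per-agent records and re-serializes them with exactly one record per user agent and no stray blank lines. It is a minimal sketch in PHP (the language of robots.php); the function and variable names are invented for the example and do not come from the actual MediaWiki code:

<?php
// Parse a robots.txt body into records keyed by (lowercased) user agent.
// Simplified: records sharing several User-agent lines are not handled.
function parseRobotsRecords( $text ) {
	$records = array();
	$agent = null;
	foreach ( explode( "\n", $text ) as $line ) {
		$line = trim( $line );
		if ( $line === '' ) {
			$agent = null; // a blank line ends the current record
			continue;
		}
		if ( $line[0] === '#' ) {
			continue; // comments are allowed anywhere and ignored
		}
		if ( preg_match( '/^User-agent:\s*(.+)$/i', $line, $m ) ) {
			$agent = strtolower( trim( $m[1] ) );
			if ( !isset( $records[$agent] ) ) {
				$records[$agent] = array();
			}
		} elseif ( $agent !== null ) {
			$records[$agent][] = $line; // e.g. a Disallow line
		}
		// Rule lines before any User-agent header are dropped here; a
		// real implementation should report them as syntax errors.
	}
	return $records;
}

// Merge the local (wiki-editable) records into the global ones and
// re-serialize, yielding one record per agent and no stray blank lines.
function mergeRobots( $global, $local ) {
	$records = parseRobotsRecords( $global );
	foreach ( parseRobotsRecords( $local ) as $agent => $rules ) {
		$existing = isset( $records[$agent] ) ? $records[$agent] : array();
		$records[$agent] = array_merge( $existing, $rules );
	}
	$out = '';
	foreach ( $records as $agent => $rules ) {
		$out .= "User-agent: $agent\n" . implode( "\n", $rules ) . "\n\n";
	}
	return rtrim( $out ) . "\n";
}

With this approach the output is well-formed regardless of how many blank lines either input contains, so user edits to MediaWiki:Robots.txt would no longer need to go at the very top of the page.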
Comment 1 Mike.lifeguard 2009-03-20 17:48:02 UTC
Adding JeLuF to CC, as they wrote this, IIRC.
Comment 2 Mike.lifeguard 2009-08-06 14:29:17 UTC
JeLuF, could you please take a look at this and/or bug 15878? There are reports that search spiders are indexing what they shouldn't be (http://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist&oldid=1589272#COIBot_reports_showing_up_in_Google_results).
Comment 3 Ilmari Karonen 2010-01-07 10:37:45 UTC
It's been over a year; would someone please fix this bug?  All it should take to fix the immediate issue is to remove the blank lines from the end of the global robots.txt or to replace them with comments.
Comment 4 JeLuF 2010-03-15 21:03:57 UTC
Fixed.

There was a hardcoded \n\n in robots.php causing the problems. Should now be fine.
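Presumably the fix was along these lines; this is only a guess at its shape (the actual robots.php source is not quoted in this report, and the variable names are invented):

// Before: a hardcoded "\n\n" left blank lines between the global part
// and the appended local rules, ending the "User-agent: *" record early.
$robotsTxt = $globalPart . "\n\n" . $localPart;

// After: trim trailing whitespace from the global part so the local
// rules attach directly to its last record.
$robotsTxt = rtrim( $globalPart ) . "\n" . $localPart;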
