The Robots Exclusion Standard (http://www.robotstxt.org/orig.html), which defines the syntax of robots.txt files, says, among other things, that: 1. there may be no more than one record with "User-Agent: *", and 2. records may not contain blank lines. Thus, the only standard-conforming (and therefore reasonably reliable) way for users editing the local part of robots.txt, as specified in [[MediaWiki:Robots.txt]], to include rules pertaining to all robots is to place them at the very top with no blank lines preceding them, so that they get appended directly to the "User-Agent: *" record of the global robots.txt. Unfortunately, even this doesn't currently work, since the global part of Wikimedia's robots.txt ends with some blank lines. Please remove those lines or replace them with comments.

(Actually, the current implementation is rather fragile in general. It might be better for MediaWiki itself to parse the content of both the local and global parts of robots.txt (it's not hard), preferably with fairly relaxed parsing rules, and merge them properly into a single file guaranteed to have correct syntax; a sketch of what such a merge might look like follows below. While at it, the software could also report any unrecognized lines and other potential errors detected during parsing.)
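For illustration, here is a minimal sketch of such a parse-and-merge step in PHP, assuming the global part and the local [[MediaWiki:Robots.txt]] content are already available as strings. The function names and the exact parsing rules are hypothetical and are not the code currently used by robots.php:

<?php
// Hypothetical sketch only: parse the global and local parts of
// robots.txt into records and merge records that target the same
// user agent(s), so that the output never contains a blank line
// inside a record.

function parseRobotsRecords( $text ) {
    $records = array();
    // Records are separated by one or more blank lines.
    foreach ( preg_split( "/\n\s*\n/", trim( $text ) ) as $block ) {
        $agents = array();
        $rules = array();
        foreach ( explode( "\n", $block ) as $line ) {
            $line = rtrim( $line );
            if ( $line === '' || $line[0] === '#' ) {
                continue; // skip comments and stray blank lines
            }
            if ( preg_match( '/^User-agent:\s*(.*)$/i', $line, $m ) ) {
                $agents[] = $m[1];
            } else {
                $rules[] = $line; // Disallow:, Allow:, etc.
            }
        }
        if ( $agents ) {
            $records[] = array( 'agents' => $agents, 'rules' => $rules );
        }
    }
    return $records;
}

function mergeRobots( $globalText, $localText ) {
    $merged = array();
    $all = array_merge( parseRobotsRecords( $globalText ),
        parseRobotsRecords( $localText ) );
    foreach ( $all as $rec ) {
        $key = strtolower( implode( ',', $rec['agents'] ) );
        if ( isset( $merged[$key] ) ) {
            // Same user agent(s) seen before: append the rules to the
            // existing record instead of emitting a duplicate record.
            $merged[$key]['rules'] = array_merge(
                $merged[$key]['rules'], $rec['rules'] );
        } else {
            $merged[$key] = $rec;
        }
    }
    $out = array();
    foreach ( $merged as $rec ) {
        $lines = array();
        foreach ( $rec['agents'] as $agent ) {
            $lines[] = "User-agent: $agent";
        }
        $out[] = implode( "\n", array_merge( $lines, $rec['rules'] ) );
    }
    // Exactly one blank line between records, none inside a record.
    return implode( "\n\n", $out ) . "\n";
}

The important property is only the last step: however the parser handles the input, the emitted file has each group of rules directly under its User-agent line(s), with blank lines used solely as record separators, so nothing a local editor writes can accidentally produce an invalid file.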
Adding JeLuF to CC, as they wrote this, IIRC.
JeLuF, could you please take a look at this and/or bug 15878? There are reports that search spiders are indexing what they shouldn't be (http://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist&oldid=1589272#COIBot_reports_showing_up_in_Google_results).
It's been over a year; would someone please fix this bug? All it should take to fix the immediate issue is to remove the blank lines from the end of the global robots.txt or to replace them with comments.
Fixed. There was a hardcoded \n\n in robots.php causing the problems. Should now be fine.
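For anyone hitting the same problem elsewhere, the kind of change described above amounts to not joining the two parts with a literal "\n\n". A minimal sketch, with hypothetical variable names rather than the actual robots.php code:

// Sketch only: $globalText holds the shared Wikimedia robots.txt and
// $localText the content of [[MediaWiki:Robots.txt]]. Joining them with
// a bare "\n\n" leaves a blank line that terminates the "User-Agent: *"
// record before the appended local rules; trimming the global part and
// joining with a single newline avoids that.
echo rtrim( $globalText ) . "\n" . rtrim( $localText ) . "\n";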