Last modified: 2014-06-02 22:43:46 UTC
URLs with action=render are indexed by external search engines like Google. For example, see https://www.google.com/search?q=site:en.wikipedia.org+inurl:action%3Drender . I'm not sure of the best approach to fix this. The pages are not well-formed (there is no <head> at all), so I don't know whether a <meta> robots directive (http://www.robotstxt.org/meta.html) would work. It might be necessary to use robots.txt.
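For illustration, the robots.txt route would mean a rule along these lines. This is only a sketch: the URL pattern is hypothetical, and wildcard matching in Disallow is a de-facto extension honoured by Google and Bing rather than part of the original robots.txt standard.

    # Sketch: ask crawlers not to fetch render-view URLs at all.
    User-agent: *
    Disallow: /*action=render

Note that Disallow only stops crawling; URLs that are already indexed may remain in the index as URL-only entries.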
<meta> will likely work, since <html>, <head> and <body> are optional: browsers automatically create a head and body for text/html documents, and the relevant tags are hoisted into the <head> accordingly. However, it would be undesirable for a more important reason: action=render is used to retrieve partial documents. If that output started including non-content, some applications would treat the <meta> tag as part of the content and could then incorrectly treat articles as non-indexable. This sounds like a perfect case for an HTTP header.
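To make the concern concrete: action=render returns a bare HTML fragment, so an injected robots directive would sit right next to the article markup rather than in a real <head>. The fragment below is purely hypothetical, but anything consuming the partial document (e.g. splicing it into another page) would carry the tag along as if it were content.

    <meta name="robots" content="noindex"/>
    <p>First paragraph of the article...</p>
    <p>More body HTML; there is no <html>, <head> or <body> wrapper to hoist the tag into.</p>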
(In reply to Krinkle from comment #1)
> This sounds like a perfect case for an HTTP header.

Specifically: X-Robots-Tag: noindex. This is also used elsewhere on the web as the way to exclude internal APIs that don't respond with HTML (e.g. JSON responses, or images) when robots.txt hacking is not desired.
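For reference, with that approach the directive travels out of band in the response headers and the body stays untouched; abbreviated, an action=render response would then look something like:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    X-Robots-Tag: noindex

    ...partial article HTML, exactly as before...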
Good find. X-Robots-Tag looks like the way to go. It's supported by Google (https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?csw=1), Bing (http://www.bing.com/webmaster/help/how-can-i-remove-a-url-or-page-from-the-bing-index-37c07477), and I'm sure others.
Change 134996 had a related patch set uploaded by devunt: Add 'X-Robots-Tag: noindex' header in action=render pages https://gerrit.wikimedia.org/r/134996
Change 134996 merged by jenkins-bot: Add 'X-Robots-Tag: noindex' header in action=render pages https://gerrit.wikimedia.org/r/134996
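Once the change is deployed, the header should be observable from the command line. As a quick check (the article title here is just an example; -I sends a HEAD request, which the server should answer with the same headers as a GET), something like this should print the new header:

    curl -sI 'https://en.wikipedia.org/w/index.php?title=Example&action=render' | grep -i x-robots-tag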
Merged by Mattflaschen.