Last modified: 2014-06-02 22:43:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T65891, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 63891 - URLs with action=render should not be indexed by search engines
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.23.0
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: JuneHyeon Bae (devunt)
Keywords: easy
Reported: 2014-04-14 04:42 UTC by Matthew Flaschen
Modified: 2014-06-02 22:43 UTC

Description Matthew Flaschen 2014-04-14 04:42:23 UTC
URLs with action=render are indexed by external search engines like Google.  For example, see https://www.google.com/search?q=site:en.wikipedia.org+inurl:action%3Drender .

I'm not sure what the best approach to fixing this is.  The pages are not complete HTML documents (there's no <head> at all), so I don't know whether a <meta> robots directive (http://www.robotstxt.org/meta.html) will work.  It might be necessary to use robots.txt.
Comment 1 Krinkle 2014-04-26 09:19:45 UTC
<meta> will likely work since <html>, <head> and <body> are optional. Browsers automatically create a head and body for text/html documents, and relevant tags are hoisted to the <head> accordingly.

However, that would be undesirable for a more important reason: action=render is used to retrieve partial documents. If the output started including non-content markup, some applications would treat the <meta> tag as part of the content they embed, and the pages embedding it could then incorrectly be treated as non-indexable.

This sounds like a perfect case for an HTTP header.
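
A minimal sketch of the concern in comment 1, written in Python rather than MediaWiki's PHP. It assumes a hypothetical consumer that pastes the action=render fragment into its own page; the fragment, the host page and the helper name robots_directives are made up for illustration. It shows that a robots <meta> tag placed in the fragment ends up in the host page's markup, where a crawler could apply it to the whole page.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives.append(attr.get("content") or "")


def robots_directives(html):
    """Return the robots directives found anywhere in the given markup."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.directives


# Hypothetical action=render output if a <meta> tag were added to it.
fragment = '<meta name="robots" content="noindex"><p>Article text</p>'

# Hypothetical consumer that embeds the fragment in its own page.
host_page = (
    "<html><head><title>Mirror</title></head><body>"
    + fragment
    + "</body></html>"
)

print(robots_directives(host_page))
```

A fragment without the tag yields an empty list, which is why keeping the directive out of the body (and in a response header instead) avoids the problem.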
Comment 2 Krinkle 2014-04-26 09:21:25 UTC
(In reply to Krinkle from comment #1)
> This sounds like a perfect case for an HTTP header.

Specifically,

  X-Robots-Tag: noindex

This header is also commonly used on the web to keep responses that are not HTML (e.g. JSON API responses or images) out of search indexes when editing robots.txt is not desired.
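
The idea can be sketched as follows. This is an illustration in Python, not the actual MediaWiki change (which is PHP and is linked in the later comments); the function name extra_headers and the sample query strings are assumptions made for the example. It returns the extra response header only when the request asks for action=render.

```python
from urllib.parse import parse_qs


def extra_headers(query_string):
    """Return extra response headers for a request's query string.

    Hypothetical helper: adds 'X-Robots-Tag: noindex' only for
    action=render requests, leaving the response body untouched.
    """
    params = parse_qs(query_string)
    # If the parameter is repeated, use the last value given.
    action = params.get("action", [""])[-1]
    if action == "render":
        return [("X-Robots-Tag", "noindex")]
    return []


print(extra_headers("title=Foo&action=render"))
print(extra_headers("title=Foo"))
```

Because the directive travels in the HTTP response rather than in the markup, applications that embed the rendered fragment never see it.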
Comment 3 Matthew Flaschen 2014-04-29 00:46:21 UTC
Good find. X-Robots-Tag looks like the way to go.

It's supported by Google (https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?csw=1), Bing (http://www.bing.com/webmaster/help/how-can-i-remove-a-url-or-page-from-the-bing-index-37c07477), and I'm sure others.
Comment 4 Gerrit Notification Bot 2014-05-23 05:07:28 UTC
Change 134996 had a related patch set uploaded by devunt:
Add 'X-Robots-Tag: noindex' header in action=render pages

https://gerrit.wikimedia.org/r/134996
Comment 5 Gerrit Notification Bot 2014-06-02 20:15:07 UTC
Change 134996 merged by jenkins-bot:
Add 'X-Robots-Tag: noindex' header in action=render pages

https://gerrit.wikimedia.org/r/134996
Comment 6 JuneHyeon Bae (devunt) 2014-06-02 22:43:46 UTC
merged by Mattflaschen
