Last modified: 2014-06-02 22:43:46 UTC
URLs with action=render are indexed by external search engines like Google. For example, see https://www.google.com/search?q=site:en.wikipedia.org+inurl:action%3Drender . I'm not sure of the best approach to fix this. The pages are not well-formed (there is no <head> at all), so I don't know whether a <meta> robots directive (http://www.robotstxt.org/meta.html) would work. It might be necessary to use robots.txt.
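For illustration, the robots.txt route would mean a rule along these lines. This is only a sketch: the URL pattern is hypothetical, and wildcard matching in Disallow is a de-facto extension honoured by Google and Bing rather than part of the original robots.txt standard.

    # Sketch: ask crawlers not to fetch render-view URLs at all.
    User-agent: *
    Disallow: /*action=render

Note that Disallow only stops crawling; URLs that are already indexed may remain in the index as URL-only entries.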
<meta> will likely work, since <html>, <head> and <body> are optional: browsers automatically create a head and body for text/html documents, and the relevant tags are hoisted into the <head> accordingly. However, it would be undesirable for a more important reason: action=render is used to retrieve partial documents. If that output started including non-content, some applications would treat the <meta> tag as part of the content and could then incorrectly treat articles as non-indexable. This sounds like a perfect case for an HTTP header.
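To make the concern concrete: action=render returns a bare HTML fragment, so an injected robots directive would sit right next to the article markup rather than in a real <head>. The fragment below is purely hypothetical, but anything consuming the partial document (e.g. splicing it into another page) would carry the tag along as if it were content.

    <meta name="robots" content="noindex"/>
    <p>First paragraph of the article...</p>
    <p>More body HTML; there is no <html>, <head> or <body> wrapper to hoist the tag into.</p>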
(In reply to Krinkle from comment #1)
> This sounds like a perfect case for an HTTP header.

Specifically: X-Robots-Tag: noindex. This is also used elsewhere on the web as the way to exclude internal APIs that don't respond with HTML (e.g. JSON responses, or images) when robots.txt hacking is not desired.
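For reference, with that approach the directive travels out of band in the response headers and the body stays untouched; abbreviated, an action=render response would then look something like:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    X-Robots-Tag: noindex

    ...partial article HTML, exactly as before...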
Good find. X-Robots-Tag looks like the way to go. It's supported by Google (https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?csw=1), Bing (http://www.bing.com/webmaster/help/how-can-i-remove-a-url-or-page-from-the-bing-index-37c07477), and I'm sure others.
Change 134996 had a related patch set uploaded by devunt: Add 'X-Robots-Tag: noindex' header in action=render pages https://gerrit.wikimedia.org/r/134996
Change 134996 merged by jenkins-bot: Add 'X-Robots-Tag: noindex' header in action=render pages https://gerrit.wikimedia.org/r/134996
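Once the change is deployed, the header should be observable from the command line. As a quick check (the article title here is just an example; -I sends a HEAD request, which the server should answer with the same headers as a GET), something like this should print the new header:

    curl -sI 'https://en.wikipedia.org/w/index.php?title=Example&action=render' | grep -i x-robots-tag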
Merged by Mattflaschen.