Crawlers may still index pages that should be disallowed, because percent-encoded characters in the URL bypass the rules. See the following example:

Disallow: /wiki/Wikipedia:Arbitration/
Disallow: /wiki/Wikipedia%3AArbitration/
Disallow: /wiki/Wikipedia%3AArbitration%2F
Disallow: /wiki/Wikipedia:Arbitration%2F

MediaWiki should generate these extra rules automatically for users.
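A minimal sketch of how such rules could be generated automatically (hypothetical Python, not MediaWiki code; it simply expands every encodable character in the title into its literal and percent-encoded spellings):

from itertools import product
from urllib.parse import quote

ENCODABLE = {":", "/"}  # characters crawlers may see in either spelling

def disallow_variants(prefix, title):
    # Keep `prefix` (e.g. "/wiki/") literal and vary only `title`:
    # each encodable character contributes both its literal and its
    # percent-encoded form; the cartesian product gives every rule.
    options = []
    for ch in title:
        if ch in ENCODABLE:
            options.append((ch, quote(ch, safe="")))  # ":" -> "%3A", "/" -> "%2F"
        else:
            options.append((ch,))
    return ["Disallow: " + prefix + "".join(combo) for combo in product(*options)]

for rule in disallow_variants("/wiki/", "Wikipedia:Arbitration/"):
    print(rule)

This prints the four rules above (in a different order).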
Better still, use <link rel="canonical" href="http://en.wikipedia.org/wiki/X"> for this purpose. More info: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
Very useful; however, we need to check what happens if some of the alternate URLs for a canonical page are excluded by robots.txt and some aren't. Does it apply the robots.txt rule it has for the "canonical" page to all the alternatives, or does it get confused? Example: /Wikipedia%3AArbitration%2FExample is stated to have /Wikipedia:Arbitration/Example as its canonical link. However, one of these is NOINDEXed (via robots.txt or in its header) and the other isn't. Knowing the canonical URL helps to identify these as "duplicates" and "the same page", but does it guarantee both will be treated as NOINDEXed when only one of them is? Or do we still have to cover all variants of the URL in robots.txt?
To clarify, URL variants where robots.txt or header tags prohibit spidering will probably be excluded from spidering in the first place. So Google will be left to collate those URL variants it came across where robots.txt or header tags _didn't_ prevent spidering -- and a "canonical" setting which states these are all the same page. I.e. this setting could help avoid duplicates, but my guess is it probably _won't_ prevent URLs not stopped by robots.txt or header tags from being listed in results.
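As a quick sanity check on which spellings the published rules actually match, one could use Python's standard urllib.robotparser (a stand-in only; its percent-encoding normalisation is not necessarily the same as Google's):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://en.wikipedia.org/robots.txt")
rp.read()

for url in [
    "http://en.wikipedia.org/wiki/Wikipedia:Arbitration/Example",
    "http://en.wikipedia.org/wiki/Wikipedia%3AArbitration%2FExample",
]:
    # can_fetch() returns False when some rule disallows this URL for "*"
    print(url, "->", "blocked" if not rp.can_fetch("*", url) else "allowed")

If the two variants come back with different answers, the rules don't yet cover every spelling.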
Changing product/component to Wikimedia/Site requests. MediaWiki:Robots.txt is a WMF hack; there's no such feature in MediaWiki core.
(In reply to comment #0)
> MediaWiki should generate these extra rules automatically for users.

(In reply to comment #4)
> MediaWiki:Robots.txt is a WMF hack; there's no such feature in MediaWiki core.

Now, how to prioritize...