Last modified: 2014-04-14 04:43:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T48424, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46424 - Urls with curid query indexed by Google
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: 1.20.x
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
 
Reported: 2013-03-21 14:53 UTC by Krinkle
Modified: 2014-04-14 04:43 UTC (History)

Description Krinkle 2013-03-21 14:53:43 UTC
This was supposedly fixed (bug 16865; r45360).

And though MediaWiki is indeed outputting "noindex", Google appears to be ignoring it and as such is indexing duplicate content.
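
One way to confirm that a given page really carries the directive is to parse its HTML for the robots meta tag. The following is only a minimal sketch using the Python standard library; the class name `RobotsMeta`, the helper `has_noindex` and the sample markup are illustrative assumptions, not MediaWiki code or its exact output.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Collect the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.policies = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.policies.append(attrs.get("content") or "")


def has_noindex(html):
    """Return True if any robots meta tag in the markup lists 'noindex'."""
    parser = RobotsMeta()
    parser.feed(html)
    return any(
        "noindex" in [d.strip() for d in policy.lower().split(",")]
        for policy in parser.policies
    )


# Hypothetical markup, standing in for the <head> of a ?curid= page.
sample = '<head><meta name="robots" content="noindex,follow" /></head>'
print(has_noindex(sample))
```

If this reports a noindex directive for a URL that still shows up in the search results, the problem is on the crawler side rather than in the page output.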


A few examples:

https://www.google.com/search?q=inurl:curid+site:mediawiki.org

1. Discussion - MediaWiki
   www.mediawiki.org/?curid=84252
   Mar 29, 2012
   Hi! Searching for the shortest urls for wikis using scripts other then Latin was a longtime nightmare. urls using the "wgArticleId" from ...

2. Link to - MediaWiki
   www.mediawiki.org/?curid=84277
   Mar 30, 2012
   mw.config.set({,,,, wgPageName":"Ernst_Lossa","wgTitle":"Ernst Lossa", "wgCurRevisionId":99548829,"wgArticleId":2809853, ...} ...) nr.

https://www.google.com/search?q=inurl:curid+site:wikipedia.org

3. [edit] Notes - Wikipedia
   en.wikipedia.org/wiki/index.html?curid=7490642&action=render
   My Brightest Diamond is the project of singer–songwriter and multi-instrumentalist Shara Worden. The band has released three studio albums, 2006's Bring Me ...

4. Wikipedia, the free encyclopedia
   en.wikipedia.org/wiki?curid=
   The 1950 Atlantic hurricane season was the first year in the Atlantic hurricane database (HURDAT) in which storms were given names by the United States Air ...

5. Wikipedia
   simple.wikipedia.org/?curid=
   This is the front page of the Simple English Wikipedia. Wikipedias are places where people work together to write encyclopedias in different languages. We use ...

6. Table tennis at the 2004 Summer Paralympics - Wikipedia, the free ...
   en.wikipedia.org/wiki/index.html?curid=1011065
   Table Tennis at the 2004 Summer Paralympics was staged at the Galatsi Olympic Hall from September 18 to September 27. Competitors were divided into ten ...

7. Upper Eastside - Wikipedia, the free encyclopedia
   en.m.wikipedia.org/wiki/index.html?curid=19698600
   A MiMo restaurant on Biscayne Boulevard in the Upper Eastside. The Upper Eastside is famous for its post war MiMo architecture, and is home to the MiMo ...

8. Robert Loggia - Wikipédia
   fr.wikipedia.org/wiki/?curid=899678
   Translate this page
   Vous pouvez partager vos connaissances en l'améliorant (comment ?) selon les recommandations des projets correspondants. Robert Loggia est un acteur et ...


What I found is that:

- The ones from mediawiki.org are LiquidThreads pages. LQT apparently overrides this logic from Article.php and as such is not outputting "noindex". So those are a flaw on our end.

- #3 has action=render. That's never supposed to be indexed (separate bug?) but the way it is used circumvents some of our defenses. #3 accesses an article by the name of "index.html", but then overrides it with curid and tacks on action=render. Basically doing:
   en.wikipedia.org/wiki/Some_page_name?curid=7490642&action=render

- #4 and #5 have an empty curid

- #6 and #7 are more examples of this odd "index.html" title

- #8 is like the ones on mediawiki.org except that these are not from LQT and are actually outputting "noindex". This is the main problem.
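
The grouping above can be reproduced mechanically from the query string and path of each URL. A minimal sketch in Python, using only the standard library; the function name `classify` and the category labels are illustrative and not part of MediaWiki:

```python
from urllib.parse import urlsplit, parse_qs


def classify(url):
    """Return the set of oddities found in a curid URL (labels are illustrative)."""
    # The search results omit the scheme; "//" lets urlsplit find the host.
    parts = urlsplit(url if "://" in url else "//" + url)
    # keep_blank_values so that "?curid=" is reported as an empty value.
    query = parse_qs(parts.query, keep_blank_values=True)
    found = set()
    if query.get("curid") == [""]:
        found.add("empty-curid")
    if query.get("action") == ["render"]:
        found.add("action-render")
    if parts.path.endswith("/index.html"):
        found.add("index.html-title")
    return found


examples = [
    "en.wikipedia.org/wiki/index.html?curid=7490642&action=render",  # 3
    "en.wikipedia.org/wiki?curid=",                                  # 4
    "fr.wikipedia.org/wiki/?curid=899678",                           # 8
]
for url in examples:
    print(url, sorted(classify(url)))
```

A URL like #8 comes back with no labels at all, which matches the observation that it is a plain curid request and therefore the main problem.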


Though it is somewhat outside the scope of this bug, I think we should:
* Always output rel=canonical when viewing a regular page
  (that is: not a Special page, no action other than view, no diff or oldid).
  This covers any url, no matter how weirdly constructed, with:
  - /?title=
  - /w?title=
  - /w/index.php?title=
  - any of the above with curid instead of title
  - any of the above via /wiki/
  - any of the above with action=view

  Right now we're only doing rel=canonical on redirects, which makes no sense to me.
  It is perfectly fine to output rel=canonical on the canonical page itself.

* Always output noindex when viewing a page without rel=canonical.
  That is any wikipage/action=view that is not a simple view of the latest version of an article,
  e.g. one with diff or oldid.
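
The two rules proposed above could be sketched roughly as follows. This is Python pseudologic for the proposal only, not MediaWiki's actual implementation; the function name `head_policy`, its parameters and its return shape are hypothetical.

```python
def head_policy(params, is_special_page=False):
    """Decide (emit_canonical, robots) for a request, following the proposal.

    params: dict of query parameters, e.g. {"curid": "899678"}.
    robots is None when the proposal does not prescribe a value.
    """
    action = params.get("action", "view")
    is_page_view = not is_special_page and action == "view"
    is_latest = "diff" not in params and "oldid" not in params

    if is_page_view and is_latest:
        # Regular view of the latest revision: emit rel=canonical no matter
        # how the URL was built (title, curid, /wiki/, /w/index.php, ...).
        return True, None
    if is_page_view:
        # Still a page view, but of a diff or an old revision: no canonical
        # link, so ask crawlers not to index it.
        return False, "noindex"
    # Special pages and non-view actions are outside the proposal.
    return False, None


print(head_policy({"curid": "899678"}))
print(head_policy({"title": "Foo", "oldid": "12345"}))
```

Under this sketch a ?curid= request and a ?title= request for the same page both get rel=canonical, so a crawler that ignores noindex would still be told which URL is the preferred one.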
Comment 1 Matthew Flaschen 2014-04-14 04:43:00 UTC
Filed a separate bug for action=render: bug 63891.
