Last modified: 2008-08-04 00:53:52 UTC
On user pages (and maybe some other namespaces as well) it should be possible to use a magic word, something like __NOGOOGLE__, to stop the Google robot from indexing the page. For instance, my user page has a set of subpages, sandboxes where I play, test, or draft what could later become real Wikipedia articles. I don't want Google to index these pages, yet they currently appear early in Google's search results. On an HTML page, the solution is to add the line
<pre>
<meta name="robots" content="noindex,nofollow">
</pre>
Could someone implement something like __NOGOOGLE__ for users who don't want their user pages indexed?
No. Namespaces which robots are asked not to index can be configured; however, in this case, if it's public, then it's indexable. A __NOINDEX__-type magic word has been discussed before and rejected simply because it's subject to abuse and misunderstanding. Google are quite quick at re-crawling bits of Wikipedia content, so if a draft page has moved to the article space, they'll usually reflect it within a few days.
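For reference, the per-namespace configuration mentioned above can be expressed with the $wgNamespaceRobotPolicies setting in recent MediaWiki versions; a minimal LocalSettings.php sketch (the namespaces listed are purely illustrative, and whether to exclude them is a site policy decision) would look something like this:
<pre>
// Illustrative only: per-namespace robot policies in LocalSettings.php.
// Which namespaces, if any, a site excludes is up to its operators.
$wgNamespaceRobotPolicies = array(
    NS_USER      => 'noindex,nofollow',
    NS_USER_TALK => 'noindex,nofollow',
);
</pre>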
But suppose the magic word __NOINDEX__ had an effect only on subpages of the User namespace and nowhere else, i.e. only on pages like http://xx.wikipedia.org/wiki/User:N_N/a_subpage. Would that be a possible compromise?
No, it's up to the people who manage the web site to determine what is and is not indexed by search engines, and Wikimedia wikis generally have everything indexed bar pages such as VfD/AfD/whatever the trendy TLA for deletion debates is, which external viewers don't typically understand. There is _no reason_ to disable indexing of your user page or any other page in that namespace. What you are posting to a public web site is public. If you don't want anyone else to be able to read it or edit it or whatever, _don't post it_.
Reopening this, as we're considering this or something similar as an improvement over lots of manual editing of the global robots.txt.
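For context, the kind of rules that currently have to be maintained by hand in the site-wide robots.txt look roughly like the following (the paths are illustrative, matching the deletion-debate pages mentioned earlier; Disallow lines are prefix matches):
<pre>
# Illustrative excerpt of manually maintained robots.txt rules.
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Disallow: /wiki/Wikipedia:Votes_for_deletion/
Disallow: /wiki/Wikipedia:Miscellany_for_deletion/
</pre>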
We frequently get complaints via OTRS from people who want various logs that malign their companies, etc., removed. Those pages usually serve a purpose, but it's not as though they're content. Just one example: http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spam/LinkReports
Having a __NOINDEX__ magic word is probably the best strategy if we want to differentiate what content ought to appear in search engines in more than a very crude way. Routinely editing robots.txt is no solution, and I consider it undesirable to simply block out very broad categories of material (such as everything that is not an article).
I looked into the code, but it appears that $wgOut->setRobotPolicy is called at the very beginning of Article::view. That is a lot of lines before the page content is parsed and magic words are evaluated. Does anybody have an idea how to do this?
It should be possible to call it again to override it with specific data. You'd have to do this when pulling wiki output out of the ParserOutput object (otherwise the parser cache will always eat everything).
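To make that suggestion concrete, here is a rough sketch of the approach (not the eventual committed code); the flag name and accessor are hypothetical, standing in for whatever the parser records when it strips __NOINDEX__ from the wikitext:
<pre>
// Sketch only. Assumes a hypothetical ParserOutput flag
// (setNoIndex()/getNoIndex()) set during parsing. Because the flag
// lives on the ParserOutput object, it survives the parser cache,
// as suggested above.

// In Article::view(), after the ParserOutput has been obtained
// (freshly parsed or fetched from the parser cache):
if ( $parserOutput->getNoIndex() ) {
    global $wgOut;
    // Override the default policy set at the top of Article::view().
    $wgOut->setRobotPolicy( 'noindex,nofollow' );
}
</pre>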
*** Bug 14209 has been marked as a duplicate of this bug. ***
Fixed in r37973. I patterned the code after __NEWSECTIONLINK__, and it seems to work fine.
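For anyone following along, the __NEWSECTIONLINK__ pattern referred to here works roughly like this (a simplified sketch, not the actual r37973 diff; the __NOINDEX__ setter name is hypothetical):
<pre>
// Simplified sketch of the double-underscore handling in
// Parser::doDoubleUnderscore(); the real r37973 change may differ.
$mwa = MagicWord::getDoubleUnderscoreArray();
$this->mDoubleUnderscores = $mwa->matchAndRemoveAll( $text );

// Existing handling for __NEWSECTIONLINK__:
if ( isset( $this->mDoubleUnderscores['newsectionlink'] ) ) {
    $this->mOutput->setNewSection( true );
}

// A __NOINDEX__ handler does the analogous thing: record a flag on
// the ParserOutput that is later turned into a "noindex,nofollow"
// robot policy when the output is rendered.
if ( isset( $this->mDoubleUnderscores['noindex'] ) ) {
    $this->mOutput->setNoIndex( true ); // hypothetical setter
}
</pre>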