Last modified: 2011-02-16 01:06:45 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29173, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27173 - Remove noindex meta tag from HTML of logged in users
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: FlaggedRevs (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal trivial (1 vote)
Target Milestone: ---
Assigned To: Rob Lanphier
Depends on:
Blocks:
Reported: 2011-02-05 08:42 UTC by Mathias Schindler
Modified: 2011-02-16 01:06 UTC
CC: 10 users
See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Mathias Schindler 2011-02-05 08:42:32 UTC
Logged-in users of (German-language) Wikipedia pages are served the following line in the HTML:

<meta name="robots" content="noindex,nofollow" /> 

This line is meaningless for logged in users and can be removed without any downside.
Comment 1 Mark A. Hershberger 2011-02-05 22:04:50 UTC
I don't understand: what advantage would users see if this were removed?
Comment 2 Mathias Schindler 2011-02-05 22:06:03 UTC
A 50 byte size reduction per page (before compression).
Comment 3 Platonides 2011-02-05 22:12:30 UTC
Those pages are indeed not suitable for robot indexing. The 50 byte size reduction is not significant.
We could remove more bytes by removing the script variables or comments, or by applying HTML5 minification techniques. But you would need to explain why those few bytes make a difference.
Comment 4 Yellowcard 2011-02-06 16:51:26 UTC
I support the bug Mathias has described above, and therefore the deletion of the meta robots tag for logged-in users. I do not care about 50 bytes more or less, and that is not the point here; rather, the meta robots tag introduces real risks. These risks can be seen in the current debate about Google.de not listing German Wikipedia articles under some circumstances.

Basically, there are two possible scenarios, which I describe below. When I say "Google Bot" I mean any search engine crawler as well; I only single out the Google crawler because of its current relevance.

1st: Google Bot crawls pages as an anonymous user (not sending header cookies). This is the standard scenario we assume right now. We do not know of any search engine bots that crawl while logged in. Therefore, it is pointless to send robots information to logged-in users, as they are generally not crawlers and thus never read robots directives. The 1st scenario means: the meta information is obsolete.

2nd: Google Bot crawls pages as a logged-in user. In this case, robots information is also meaningful for logged-in users. Then, however, this could be the reason (or one reason among others) for the Google <-> Wikipedia problem existing right now. If the 2nd scenario applies, the robots information should be removed temporarily just to make sure it is NOT responsible for the problems. The 2nd scenario means: it is likely that the robots information and the Google problem are related. To fix the problem as fast as possible, the robots information should be disabled (at least for a while).

As you can see, both possible cases lead me to urge deletion of the robots information, at least for a couple of weeks. As soon as Google lists all the Wikipedia articles again and both MediaWiki techs and Google techs have found the cause of the problem, they should deliberate whether these meta tags are reasonable and should be added back.

However, according to statements from Wikimedia, neither Google Bot nor any other search engine crawler logs in. If this is true, there is no need for these meta tags, as they are NEVER read by crawlers and are nothing more than source code waste.
Comment 5 LordAndrew 2011-02-06 17:37:02 UTC
The Googlebot indexing problem (bug 27155) is a problem on Google's end. I don't see why anything has to be done here. Removing the robot indexing policy would result in a bunch of useless pages being indexed, but if the search engine isn't going to display it in the search results then all we've done is waste resources.

Search engine indexing bots ''shouldn't'' be indexing while logged in. But if someone does write a search engine bot that logs in for some reason, it should follow the same indexing policies as all other search engine bots. If the robots policy is removed for logged-in users, then such a bot would be getting different indexing instructions than those that don't log in. Why would we grant an exception to the robot indexing policies simply because the bot logs in?
Comment 6 Yellowcard 2011-02-06 17:41:37 UTC
(In reply to comment #5)
> The Googlebot indexing problem (bug 27155) is a problem on Google's end. 

This is not proven yet - not at all.
Comment 7 PDD 2011-02-06 17:51:32 UTC
(In reply to comment #5)
> But if
> someone does write a search engine bot that logs in for some reason, it should
> follow the same indexing policies as all other search engine bots.

Exactly, and the site-wide robots indexing policy *is not* and *should not be* set via META tags. The META tags were introduced as a new feature of FlaggedRevs to prevent unflagged (!) revisions of pages from being indexed. So, if anything, only unflagged revisions should have the META tag with NOINDEX,NOFOLLOW, but for logged-in users *all* pages (flagged and unflagged) have this META tag. This is a bug. Bugs should be fixed.
Comment 8 Derk-Jan Hartman 2011-02-06 18:17:18 UTC
PDD is right; this is a bug in FlaggedRevs. Changing component.
Comment 9 Platonides 2011-02-06 18:28:52 UTC
So the real reason for this bug is that Google is miscrawling Wikipedia, and someone thought that this line was at fault. It is not.
If that were the reason, all Wikipedias would be affected, not only dewiki; no Wikipedia article would be listed. And if the crawler were logged in, cached pages would contain the Google user name at the top.
Comment 10 PDD 2011-02-06 18:35:01 UTC
(In reply to comment #9)
> If that was the reason, all wikipedias would be affected, not only dewiki.

Erm, are you commenting here without actually having looked into the matter? The META tag bug affects dewiki, huwiki and plwiki *only*, so of course it can't have any effect on "all wikipedias", no matter what that effect might be...
Comment 11 DerHexer 2011-02-06 18:37:57 UTC
These are two separate bugs: one about that Google issue and one about the useless 'noindex,nofollow' in flagged (and not only unflagged) revisions. The latter is [[bugzilla:27173]] (this one here), the former [[bugzilla:27155]].
Comment 12 Chad H. 2011-02-06 19:06:27 UTC
FWIW, the indexing problem is an issue on Google's end, not ours (bug 27155 tracks that)

We've actually been serving noindex,nofollow to logged-in users in FlaggedRevs for quite some time now (the code in this regard hasn't changed in about a year). I think Googlebot's problems just raised people's awareness, and this was a decent initial assumption as to the cause.

Whether or not we should serve noindex,nofollow to logged in users is debatable, and I guess this bug serves that purpose.
Comment 13 Yellowcard 2011-02-06 19:13:53 UTC
(In reply to comment #12)
> Whether or not we should serve noindex,nofollow to logged in users is
> debatable, and I guess this bug serves that purpose.

Chad got the point. Mentioning Google was just an example and nothing else. Regardless of whether crawlers do their job logged in or not, the meta tags are senseless and have to be removed. Or can anyone explain to me why a crawler must not index an article version that has been checked ("flagged")?
Comment 14 Derk-Jan Hartman 2011-02-06 19:30:26 UTC
This is caused by the following check in setRobotPolicy() in FlaggedArticleView.php:

<pre>
if ( !$this->pageOverride() && $this->article->isStableShownByDefault() ) {
// set noindex
}
</pre>

In this check, $this->pageOverride() returns false for stable versions for logged-in users, yet true for stable versions for non-logged-in users.

pageOverride() returns false for logged in users, due to the following check:

<pre>
$config = $this->article->getVisibilitySettings();
# Does the stable version override the current one?
if ( $config['override'] ) {
    if ( $this->showDraftByDefault() ) {
        return ( $wgRequest->getIntOrNull( 'stable' ) === 1 );
    }
    # Viewer sees stable by default
    return !( $wgRequest->getIntOrNull( 'stable' ) === 0 );
}
</pre>

Ergo, pageOverride() does not account for user group settings when viewing stable pages; it only takes into account user settings, page settings and URL overrides.
Comment 15 Aaron Schulz 2011-02-06 20:06:39 UTC
(In reply to comment #14)
> ergo, pageOverride() does not account for usergroup settings in viewing stable
> pages, it only takes into account usersettings, page settings and url
> overrides.

Yes, it does check that. That's what showDraftByDefault() does.

The real cause is that logged-in users see the current version by default, even if it is synced with the stable version. Try logging in and adding ?stable=1 to the page URL (noindex goes away). The two versions are almost the same, except the stable has filetimestamp=X added to thumbnail links. In rare cases, the current version might use newer versions of Commons files too (feature of bug 15748).

One way to index these would be to have setRobotPolicy() check for this scenario (viewing the draft when the stable synced with it).

I was doing some refactoring yesterday to make the code easier to read. I'll deal with this after finishing that.
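[Editorial note] The sync check described in this comment can be sketched roughly as follows. This is an illustrative sketch only, not the actual FlaggedRevs code; the function and parameter names are hypothetical:

```php
<?php
// Hypothetical sketch of the proposed refinement to the noindex decision:
// only apply noindex to a draft view when the draft actually differs from
// the stable version. Names here are illustrative, not the FlaggedRevs API.

function shouldApplyNoIndex(
    bool $viewingDraft,          // viewer sees the current (draft) version
    bool $stableShownByDefault,  // page shows the stable version to readers
    bool $draftSyncedWithStable  // draft is identical to the stable version
): bool {
    // Only draft views of stable-by-default pages are noindex candidates.
    if ( !$viewingDraft || !$stableShownByDefault ) {
        return false;
    }
    // Proposed refinement: if the draft is synced with the stable version,
    // there is nothing unreviewed to hide, so do not emit noindex.
    return !$draftSyncedWithStable;
}
```

Under this sketch, a logged-in user viewing a draft that is identical to the stable version would no longer receive the noindex meta tag.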
Comment 16 Aaron Schulz 2011-02-10 03:04:32 UTC
Sync check (per comment #15) added in r81874.


