Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and beyond displaying bug reports and their history, links may be broken. See T63132, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61132 - robots.txt should let search engines index tools.wmflabs.org
Status: RESOLVED WONTFIX
Product: Wikimedia Labs
Classification: Unclassified
Component: tools (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
URL: http://tools.wmflabs.org/robots.txt
Keywords: code-update-regression
Depends on:
Blocks: tool-missing-ts-feat

Reported: 2014-02-10 11:07 UTC by Nemo
Modified: 2014-03-25 17:51 UTC
CC: 4 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments: (none)

Description Nemo 2014-02-10 11:07:35 UTC
http://tools.wmflabs.org/robots.txt:

User-agent: *
Disallow: /

Tools are already obscure and hard enough to find without also forbidding search engines from doing their job as they do on Toolserver...

(In reply to bug 59118 comment 5)
> Set up robots.txt as a temporary measure to:
> 
> | User-agent: *
> | Disallow: /
Comment 1 Tim Landscheidt 2014-02-10 11:14:37 UTC
Do you mean the tools themselves (e.g. https://tools.wmflabs.org/wikilint/) or the index (just https://tools.wmflabs.org/)?

The first is a WONTFIX; for the second I haven't found a solution yet.  Do you have an idea?
Comment 2 Nemo 2014-02-10 11:17:17 UTC
Why would the first be a WONTFIX?
For the second see the docs,

Allow: /$

is supposed to work (at least with Google).
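
Put together, the suggestion would amount to a robots.txt along these lines (a sketch only; Allow and the $ end-of-URL anchor are extensions honored by some crawlers such as Googlebot, not part of the original robots exclusion standard):

User-agent: *
Allow: /$
Disallow: /

A crawler that supports these extensions would index only the front page at https://tools.wmflabs.org/ and nothing below it.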
Comment 3 Tim Landscheidt 2014-02-10 11:37:20 UTC
(In reply to comment #2)
> Why would the first be a WONTFIX?

Because there are tools that are linked from every wiki page and any spider accessing them brings the house down.  As tools are created and updated without any review by admins, and wiki edits are not monitored either, blacklisting them after the meltdown doesn't work.

So unlimited spider access is not possible.

> For the second see the docs,

Unfortunately, there is no specification for robots.txt; that's the core of the problem.

> Allow: /$

> is supposed to work (at least with Google).

According to [[de:Robots Exclusion Standard]], that works with Googlebot, Yahoo! Slurp and msnbot.  And the other spiders?  Will they read it in the same way or as "/"?  How do we whitelist "/?Rules"?
Comment 4 Nemo 2014-02-10 12:36:26 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > Why would the first be a WONTFIX?
> 
> Because there are tools that are linked from every wiki page 

Blacklist them, then?

https://toolserver.org/robots.txt has:

User-agent: *
Disallow: /~magnus/geo/geohack.php
Disallow: /~daniel/WikiSense
Disallow: /~geohack/
Disallow: /~enwp10/
Disallow: /~cbm/cgi-bin/

> and any spider accessing them brings the house down.  As tools are created
> and updated without any review by admins, and wiki edits are not monitored
> either, blacklisting them after the meltdown doesn't work.
> 
> So unlimited spider access is not possible.

Nobody said unlimited. This works on Toolserver, so it's not inherently impossible. It's unfortunate that the migration implies such usability regressions, because tool developers will then try to postpone migration as long as possible and we'll have little time.

> 
> > For the second see the docs,
> 
> Unfortunately, there is no specification for robots.txt; that's the core of
> the problem.

Not really: there is a specification, but everyone has extensions. I meant Google's, as I said.

> msnbot.  And the other spiders?  Will they read it in the same way or as
> "/"? 

You'll find out with experience.

> How do we whitelist "/?Rules"?

Mentioning it specifically, no?
However, while I can understand blocking everything except the root page, whitelisting individual pages is rather crazy and I don't see how /?Rules would be more interesting than most other pages. Horrible waste of time to go haunt them, you could as well just snail mail a print of webpages on demand.
Comment 5 Tim Landscheidt 2014-02-10 12:58:45 UTC
(In reply to comment #4)
> [...]

> > and any spider accessing them brings the house down.  As tools are created
> > and updated without any review by admins, and wiki edits are not monitored
> > either, blacklisting them after the meltdown doesn't work.

> > So unlimited spider access is not possible.

> Nobody said unlimited. This works on Toolserver, so it's not inherently
> impossible. It's unfortunate that the migration implies such usability
> regressions, because tool developers will then try to postpone migration
> as long as possible and we'll have little time.

I haven't met a tool developer who postpones migration because of robots.txt (or who cares about that at all, because their tools are linked from Wikipedia).  No one has even asked to change robots.txt.  Who are they?

If tool developers guarantee that a specific tool is resistant to spiders, we can whitelist it (even in an automated way, à la ~/.description).

> [...]

> > msnbot.  And the other spiders?  Will they read it in the same way or as
> > "/"? 

> You'll find out with experience.

> [...]

Why would we take that risk with only marginal benefit gained?  "Experience" means a lot of people yelling.
Comment 6 Nemo 2014-02-10 13:08:03 UTC
(In reply to comment #5)
> I haven't met a tool developer who postpones migration because of robots.txt

Why would you meet them? People unaware of this obscure dark corner of the internet called Tool Labs, hidden from the rest of the WWW, will never find their way to us.
Comment 7 Tim Landscheidt 2014-02-10 13:12:33 UTC
(In reply to comment #6)
> > I haven't met a tool developer who postpones migration because of robots.txt

> Why would you meet them? People unaware of this obscure dark corner of the
> internet called Tool Labs, hidden from the rest of the WWW, will never find
> their way to us.

That's why I asked you: Who postpones migration to Labs because of robots.txt?
Comment 8 Tim Landscheidt 2014-02-10 13:13:58 UTC
Sorry, that was too fast.
Comment 9 Nemo 2014-02-10 13:17:37 UTC
(In reply to comment #7)
> That's why I asked you: Who postpones migration to Labs because of
> robots.txt?

Sorry, it's not my job to go ask dozens or hundreds of tool owners why they haven't migrated their tools yet.

Missed this:

(In reply to comment #5)
> Why would we take that risk with only marginal benefit gained? [...]

Ah, right, marginal benefit. I had forgotten that Tool Labs was only built as a monument to computer science; having people find and use the tools and pages that are useful to them is just an accessory, a marginal benefit.
Comment 10 Tim Landscheidt 2014-02-10 13:36:44 UTC
(In reply to comment #9)
> (In reply to comment #7)
> > That's why I asked you: Who postpones migration to Labs because of
> > robots.txt?

> Sorry, it's not my job to go ask dozens or hundreds of tool owners why
> they haven't migrated their tools yet.

Then why do you claim that it is related to robots.txt?

> Missed this:

> (In reply to comment #5)
> > Why would we take that risk with only marginal benefit gained? [...]

> Ah, right, marginal benefit. I had forgotten that Tool Labs was only built
> as a monument to computer science; having people find and use the tools and
> pages that are useful to them is just an accessory, a marginal benefit.

This bug isn't about "people finding and using tools and pages useful for them", but robots.txt.  If you want to increase the visibility of the available tools at Tools, you can set up a mirror at a more prominent wiki very easily.  The code for https://tools.wmflabs.org/ is at <http://git.wikimedia.org/blob/labs%2Ftoollabs.git/master/www%2Fcontent%2Flist.php>.
Comment 11 Jarry1250 2014-02-10 14:21:29 UTC
I need an exception to robots.txt for my tool, http://tools.wmflabs.org/wmukevents , which is a calendar feed. For users to add it to their Google calendars, the Google Calendar bot must be able to access it. Unfortunately, the Google Calendar bot uses the same user agent as the regular Google spider.

That said, I mentioned this to Coren a while back, he twiddled some levers (can't recall precisely what) and now it WORKSFORME, so perhaps I've misremembered the problem on some level.
Comment 12 Merlijn van Deen (test) 2014-02-10 14:31:32 UTC
> Ah, right, marginal benefit. I had forgotten that Tool Labs was only built
> as a monument to computer science; having people find and use the tools and
> pages that are useful to them is just an accessory, a marginal benefit.

Google is smart enough to do its job even without robots.txt:

https://encrypted.google.com/search?q=gerrit%20patch%20uploader
Comment 13 Merlijn van Deen (test) 2014-02-10 14:32:12 UTC
Sorry, that should have read 'Google is smart enough to do its job even when blocked by robots.txt'.
Comment 14 Marc A. Pelletier 2014-03-25 17:46:51 UTC
Closing as WONTFIX for the general case.  Individual tool owners are welcome to request whitelisting of their tool, so long as they have properly validated that a bot spidering it cannot cause issues.

In particular, tools that return pages with dynamic content that is, or may be, expensive to generate on the database side, and that contain further internal links, generally throw spiders into a loop and consume a great deal of resources, impacting all other tools.
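
For illustration, a robots.txt under this policy might look like the following sketch, where /mytool/ stands in for a hypothetical tool whose owner has validated it against spidering; everything else stays blocked:

User-agent: *
Allow: /mytool/
Disallow: /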
Comment 15 Nemo 2014-03-25 17:51:00 UTC
Meh. Ok, I'll host my stuff elsewhere; I'd like it to be found and used. :)
