Last modified: 2014-02-03 20:18:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T35406, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 33406 - Install sitemap extension into bugzilla, and then update bugzilla robots.txt
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: Bugzilla (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Lowest normal (vote)
Target Milestone: ---
Assigned To: Daniel Zahn
Keywords: ops
Depends on: 46328
Blocks:
Reported: 2011-12-29 01:10 UTC by Tim Landscheidt
Modified: 2014-02-03 20:18 UTC (History)
CC List: 10 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Tim Landscheidt 2011-12-29 01:10:30 UTC
ATM, MediaZilla isn't indexed by search engines, which means that searching for MediaWiki bugs will *never* get one here, but directs one at most to one of those many fishy websites that pair up the bug mailing list with advertisements.  Even if one then reads the bug number and searches for "mediawiki bug 4711", one still doesn't get here.  So please remove robots.txt.
Comment 1 Mark A. Hershberger 2011-12-29 16:42:02 UTC
If you know the bug number, it is quicker to use http://bugzilla.wikimedia.org/4711 to find the bug.
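For instance, a quick way to confirm the shortcut from the command line (a sketch; it assumes the server answers with a redirect to the full show_bug.cgi URL):

wget --server-response --spider https://bugzilla.wikimedia.org/4711 2>&1 | grep -i 'Location:'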

Fulltext, freeform search would be great, but a lot can be done with advanced search.

Given the recent struggles here with vandalism and load, though, increasing our problems by introducing bots isn't a high priority right now.
Comment 2 Tim Landscheidt 2011-12-29 18:44:02 UTC
Just a disclaimer: I do not get paid for the time I spend here.  If WMF wants me to jump through some hoops, that's fine, but no, thanks.

  There's a nice Google Tech Talk by Spolsky where he explains the design principles of stackoverflow.com and the road bumps that impede workflows.  If WMF has some data that vandalism and load on the bugtracker outweigh the ease of use for and value of potential patches from MediaWiki users, so be it.
Comment 3 Sam Reed (reedy) 2011-12-29 20:07:06 UTC
(In reply to comment #1)
> If you know the bug number, it is quicker to use
> http://bugzilla.wikimedia.org/4711 to find the bug.
> 
> Fulltext, freeform search would be great, but a lot can be done with advanced
> search.
> 
> Given the recent struggles here with vandalism and load, though, increasing our
> problems by introducing bots isn't a high priority right now.

True about going directly if you know the number, but most people wouldn't know that bug 1234 is actually [1]

I'm not sure what the issue is with having Google, among others, index our Bugzilla instance. It doesn't open us up to any more spam.

Looking at [3], it seems we have the default BZ robots.txt installed.

A bit of searching around ([2] among others) seems to suggest we'll need to install the sitemap extension [4].

I also don't think a blanket removal of the robots.txt is a good idea. However, following the example of [5] and updating ours to something along those lines seems very sane. I'm not sure why the default is so limiting. The sitemap extension also includes an improved robots.txt.


We can easily get ops to update the robots.txt, because it's a quick fix, but we might need a bit more time to get ops to actually install the extension, presumably followed by a submission to Google Webmaster Tools (see the sketch after the reference list).


[1] https://bugzilla.wikimedia.org/1234
[2] http://bugzillatips.wordpress.com/2011/05/04/search-bugzilla-using-google/
[3] https://bugzilla.wikimedia.org/robots.txt
[4] http://code.google.com/p/bugzilla-sitemap/
[5] https://bugzilla.mozilla.org/robots.txt
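As a sketch of that last step: at the time, Google also accepted a plain HTTP ping to register a sitemap, so something like the following would work (assuming the sitemap URL is the one the extension serves):

wget -qO- "https://www.google.com/ping?sitemap=https%3A%2F%2Fbugzilla.wikimedia.org%2Fpage.cgi%3Fid%3Dsitemap%2Fsitemap.xml"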
Comment 4 Daniel Friesen 2011-12-29 20:07:40 UTC
I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower issues we have here. Vandalism works even though we have a robots.txt so naturally it's completely ignoring that. And I know for a fact that e-mail addresses are already being harvested from our bugtracker, so robots.txt isn't helping there.

The only thing that robots.txt is doing is keeping out all the good bots; all we have now are the bad ones.
Comment 5 Sam Reed (reedy) 2011-12-29 20:14:40 UTC
RT #2194
Comment 6 p858snake 2011-12-29 23:42:00 UTC
Are we sure we want this? I would imagine it would be similar to why we don't really want the lists indexed, because of the amount of cruft it could potentially introduce into the results.
Comment 7 Mark A. Hershberger 2011-12-30 04:50:34 UTC
(In reply to comment #4)
> I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower issues we
> have here. Vandalism works even though we have a robots.txt so naturally it's
> completely ignoring that. And I know for a fact that e-mail addresses are
> already being harvested from our bugtracker, so robots.txt isn't helping there.

Just to be clear, I wasn't saying that we are keeping vandalism at bay by having a stricter robots.txt file.  As pointed out in comment #1, there are plenty of links to the tracker all over the internet that vandals could follow if that was how they found bug trackers to play with.

In the past (perhaps less so currently?) "well behaved" spiders that respected robots.txt have routinely wreaked havoc on sites like this one that are, essentially, a bunch of cgi scripts that result in a process being forked for each request.

So, last week, we dealt with some apparent vandalism when someone brought the server to a halt by requesting a particular URL over and over.

My point was simply that if we suddenly make bugzilla visible to spiders who respect robots.txt, they would probably send a ton of queries to the server (e.g. several spiders from each search engine) to quickly discover the newly available data.

That sort of sudden visibility could very well look a lot like the vandalism we saw last week.

That said, something like https://bugzilla.mozilla.org/robots.txt is a good thing to consider.
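A partial mitigation for the crawl-load concern, sketched: some crawlers of that era, such as Yahoo's and Bing's, honored the non-standard Crawl-delay directive in robots.txt, though Google ignored it. For example:

User-agent: *
Crawl-delay: 10

would ask compliant spiders to wait ten seconds between requests.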
Comment 8 Daniel Zahn 2012-01-02 17:06:04 UTC
- created RT #2198 for that
- installed SiteMap extension as suggested per  http://code.google.com/p/bugzilla-sitemap/
- this automatically changed robots.txt and I left it that way. It is now:

User-agent: *
Disallow: /*.cgi
Disallow: /*show_bug.cgi*ctype=*
Allow: /
Allow: /*index.cgi
Allow: /*show_bug.cgi
Allow: /*describecomponents.cgi
Allow: /*page.cgi

The sitemap has already been actively submitted to Google; there was just a failure with Yahoo.

Replacing ./robots.txt. (The old version will be saved as
"./robots.txt.old". You can delete the old version if you do not need
its contents.)
Pinging search engines to let them know about our sitemap:
      Live: OK
    Google: OK
       Ask: OK
Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden
     Yahoo: FAILED


But also note that it might take a while.
Comment 9 Mark A. Hershberger 2012-01-03 18:54:55 UTC
(In reply to comment #0)
> Even if one then reads the bug number and searches for
> "mediawiki bug 4711", one still doesn't get here.  So please remove robots.txt.

robots.txt has been updated, the sitemap read, the site re-indexed, and you can see bugzilla links in search results.

Beware, though, that this probably isn't what you want.  Your query doesn't really give any better results now.

The only way I can get the bug report is with this Google string: "bug 4711 site:bugzilla.wikimedia.org".  Searches on Live and Ask were similarly unfruitful.
Comment 10 Tim Landscheidt 2012-01-04 00:10:41 UTC
(In reply to comment #8)
Thanks.  When I try to download:

> [...]
> Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to
> Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden
>      Yahoo: FAILED
> [...]

I get an empty file (after a short delay).  That doesn't look right to me.
Comment 11 Tim Landscheidt 2012-01-04 00:13:28 UTC
(In reply to comment #9)
> [...]
> Beware, though, that this probably isn't what you want.  You're query doesn't
> really give any better results now.

> The only way I can get the bug report is this google string: "bug 4711
> site:bugzilla.wikimedia.org".  Searches on live, and ask were similarly
> unfruitful.

That's right.  I have no SEO knowledge, but I do notice that the pages don't contain "MediaWiki" in any (prominent) place.

  I suggest waiting a few days or weeks to see if the pages gain karma from incoming links; if not, I'll file a new bug.
Comment 12 MZMcBride 2012-01-04 03:36:02 UTC
https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml doesn't seem to be returning any content. This bug kind of has a murky scope, but if this is the correct sitemap URL and it's supposed to be returning something, this bug should be re-opened.
Comment 13 Mark A. Hershberger 2012-01-04 22:26:24 UTC
(In reply to comment #12)
> if this is the correct sitemap URL and it's supposed to be returning something,
> this bug should be re-opened.

Daniel already reopened the RT ticket, too.
Comment 14 Tim Landscheidt 2012-09-19 16:42:08 UTC
(In reply to comment #13)
> > if this is the correct sitemap URL and it's supposed to be returning something,
> > this bug should be re-opened.

> Daniel already reopened the RT ticket, too.

Any update on this?  https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml still gives no content.

Is the Bugzilla configuration (or patched sources) accessible somewhere?  I didn't see anything obvious on Gerrit.
Comment 15 Chad H. 2012-09-19 16:45:12 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > > if this is the correct sitemap URL and it's supposed to be returning something,
> > > this bug should be re-opened.
> 
> > Daniel already reopened the RT ticket, too.
> 
> Any update on this? 
> https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml still gives no
> content.
> 
> Is the Bugzilla configuration (or patched sources) accessible somewhere?  I
> didn't see anything obvious on Gerrit.

We haven't moved the bugzilla customizations to gerrit yet. We probably should.
Comment 16 Sumana Harihareswara 2012-09-19 16:50:55 UTC
Adding Andre in case he can help with this.
Comment 17 Andre Klapper 2012-09-19 17:31:17 UTC
(In reply to comment #0)
> ATM, MediaZilla isn't indexed by search engines

This statement isn't correct anymore. I get results from bugzilla.wikimedia.org on google.com (though not with perfect ranking).
I don't know exactly what the benefits of the aforementioned bugzilla-sitemap would be compared to the current situation.

> (In reply to comment #15)
> We haven't moved the bugzilla customizations to gerrit yet. We probably should.

Might be worth a separate ticket.
Comment 18 Andre Klapper 2012-11-29 01:40:41 UTC
(In reply to comment #15)
> We haven't moved the bugzilla customizations to gerrit yet. We probably should.

We have https://gerrit.wikimedia.org/r/gitweb?p=wikimedia%2Fbugzilla%2Fmodifications.git;a=summary but it's not up to date (e.g. missing the SiteMap extension which is deployed).

What would https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml offer?
It's not clear to me what needs to be done to fix this report.
Comment 19 Andre Klapper 2012-12-17 21:35:55 UTC
Tim: It's not clear to me what needs to be done to fix this report. Could you please clarify?  Otherwise I might close this as WORKSFORME as I simply don't know what's missing...
Comment 20 Jesús Martínez Novo (Ciencia Al Poder) 2012-12-17 21:49:04 UTC
See [[Site map]]. The current sitemap is broken: it's an empty file, which is also invalid XML.

Another benefit of a sitemap, apart from letting search engines know about all bugs, is having a "last modified" field for each page to be indexed. If a particular page (or bug, in this case) has already been indexed by the search engine, it won't be reindexed unless the last-modified date is newer than the cached copy: that should save some CPU and bandwidth, because old bugs won't be re-crawled.
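For illustration, a minimal sitemap entry in the sitemaps.org format with such a "last modified" field (the URL and date here are just examples):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bugzilla.wikimedia.org/show_bug.cgi?id=33406</loc>
    <lastmod>2012-12-17</lastmod>
  </url>
</urlset>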
Comment 21 Tim Landscheidt 2012-12-18 18:40:24 UTC
(In reply to comment #19)
> Tim: It's not clear to me what needs to be done to fix this report. Could you
> please clarify?  Otherwise I might close this as WORKSFORME as I simply don't
> know what's missing...

Essentially, as Jesús said, the sitemap extension seems to be broken as deployed, as it doesn't return any sitemap.  Its installation was suggested by Sam in comment #3.  For starters, it would be nice if someone could relay the status of RT #2198.

Without a working configuration, it's hard to assess whether the bad search rankings are due to this error.
Comment 22 Andre Klapper 2013-01-04 20:08:42 UTC
I see. To compare with a working version (Mozilla), run

wget -qO- https://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml

$:andre\> wget https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml
--2013-01-04 21:03:59--  https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml
Resolving bugzilla.wikimedia.org... 208.80.152.149
Connecting to bugzilla.wikimedia.org|208.80.152.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
2013-01-04 21:04:21 (0.00 B/s) - “page.cgi?id=sitemap%2Fsitemap.xml” saved [0/0]
Comment 23 Mark A. Hershberger 2013-03-19 19:08:36 UTC
Note that right now Google doesn't even do anything with the sitemap because of bug #46328.
Comment 24 This, that and the other (TTO) 2013-08-11 11:03:17 UTC
A couple of things:

- Currently, https://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml is
  just timing out for me. 
- Google does not seem to know about bugzilla.wikimedia.org at all. See 
  https://www.google.com.au/search?q=wikimedia+bugzilla - you would expect to
  at least see this domain appearing there.
Comment 25 MZMcBride 2013-08-11 17:06:56 UTC
From <https://bugzilla.wikimedia.org/robots.txt>:

---
User-agent: *
Disallow: /*.cgi
Disallow: /*show_bug.cgi*ctype=*
Allow: /
Allow: /*index.cgi
Allow: /*show_bug.cgi
Allow: /*describecomponents.cgi
Allow: /*page.cgi
---

http://www.robotstxt.org/faq/robotstxt.html seems to indicate that wildcards are unsupported in robots.txt files:

---
Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
---

There also seems to be an assumption that Allow rules can override previous Disallow rules. I'm not sure if this is actually the case. If *.cgi is disallowed, will *show_bug.cgi become allowed with a later directive?
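For what it's worth, Google's documented extension of the robots.txt rules resolves this by specificity: the longest matching rule wins, and Allow wins ties. A rough sketch of that precedence logic (illustrative only, based on Google's documented behavior, not its actual code):

import re

def rule_matches(pattern, path):
    # Translate robots.txt wildcards: '*' matches any text, '$' anchors the end.
    regex = '^' + re.escape(pattern).replace(r'\*', '.*').replace(r'\$', '$')
    return re.match(regex, path) is not None

def allowed(path, allows, disallows):
    # Longest matching rule wins; on a tie, Allow wins.
    best_allow = max((len(p) for p in allows if rule_matches(p, path)), default=-1)
    best_disallow = max((len(p) for p in disallows if rule_matches(p, path)), default=-1)
    return best_allow >= best_disallow

# With Disallow: /*.cgi and Allow: /*show_bug.cgi, a bug page stays crawlable
# because the Allow pattern is the longer match.
print(allowed('/show_bug.cgi?id=33406', ['/', '/*show_bug.cgi'], ['/*.cgi']))  # True

Under that interpretation, the Allow lines in our robots.txt would indeed override the broader Disallow for crawlers that implement wildcards; crawlers following only the original 1994 spec would ignore both the wildcards and the Allow lines.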

https://encrypted.google.com/search?hl=en&q=site%3Abugzilla.wikimedia.org indicates that, as stated in comment 24, bugzilla.wikimedia.org is not being indexed by Google at all currently.
Comment 26 Jesús Martínez Novo (Ciencia Al Poder) 2013-08-11 17:10:49 UTC
(In reply to comment #24)
> - Currently, https://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml is
>   just timing out for me. 

Works for me at this moment. Maybe it was a temporary issue. It displays a list of 17 elements (in XML).

> - Google does not seem to know about bugzilla.wikimedia.org at all. See 
>   https://www.google.com.au/search?q=wikimedia+bugzilla - you would expect to
>   at least see this domain appearing there.

Agreed: https://www.google.com/search?q=site%3Abugzilla.wikimedia.org
Comment 27 Tim Landscheidt 2013-08-11 17:46:26 UTC
(In reply to comment #26)
> (In reply to comment #24)
> > - Currently, https://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml is
> >   just timing out for me.

> Works for me at this moment. Maybe it was a temporary issue. It displays a
> list of 17 elements (in XML).

> [...]

Those are 17 links to bugzilla.*mozilla*.org.
Comment 28 This, that and the other (TTO) 2013-08-11 20:48:57 UTC
(In reply to comment #27)
> Those are 17 links to bugzilla.*mozilla*.org.

My apologies; clearly I meant https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml, and yes, now it is working (although very slow, and still delivering a blank page).
Comment 29 Tim Landscheidt 2013-08-11 21:44:05 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > Those are 17 links to bugzilla.*mozilla*.org.

> My apologies; clearly I meant
> https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml, and yes, now
> it is working (although very slow, and still delivering a blank page).

So the same as:

> (In reply to comment #8)
> Thanks.  When I try to download:

> > [...]
> > Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to
> > Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden
> >      Yahoo: FAILED
> > [...]

> I get an empty file (after a short delay).  That doesn't look right to me.

which I wrote 2012-01-04? :-)

Unfortunately, I'm not privy to RT #2198, so maybe there have been (unsuccessful) discussions there.
Comment 30 Andre Klapper 2013-08-13 15:34:12 UTC
(In reply to comment #24)
(In reply to comment #25)
> From <https://bugzilla.wikimedia.org/robots.txt>:
> User-agent: *

Different bug, I'd say. :) Looks like this robots.txt is not in operations/puppet/files/apache/sites/bugzilla.wikimedia.org; wondering where it is (or if it's puppetized at all).
Comment 31 Andre Klapper 2013-12-02 01:32:24 UTC
On Fedora 20, I checked out the upstream Bugzilla 4.4 branch via bzr and applied https://git.wikimedia.org/summary/wikimedia%2Fbugzilla%2Fmodifications.git on top of it.

Running ./checksetup.pl, extensions/Sitemap fails with the Perl module "Search-Sitemap" not found. Ubuntu does not list a package on http://packages.ubuntu.com either, but it might still be packaged for other distributions (e.g. there is a package called "perl-Search-Sitemap" for openSUSE at http://ftp.uni-stuttgart.de/opensuse-buildservice/devel:/languages:/perl:/CPAN-S/openSUSE_12.3/noarch/perl-Search-Sitemap-2.13-5.1.noarch.rpm).
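For anyone reproducing this, a quick way to check whether the module is visible to the system Perl, and to pull it from CPAN when no distro package exists (standard Perl tooling, nothing Bugzilla-specific):

perl -MSearch::Sitemap -e 1   # silent if installed; "Can't locate ..." otherwise
cpan Search::Sitemap          # fallback: install from CPAN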

A recent mailing list thread at https://groups.google.com/forum/#!msg/mozilla.support.bugzilla/j60P0Uw9fOU/PPDgFZMIrtsJ even implied that it might not be needed anymore, but not sure if that is correct.

http://bzr.mozilla.org/bugzilla/extensions/sitemap/trunk/files has not been updated since 2010.
Comment 32 Andre Klapper 2013-12-14 00:57:12 UTC
Strongly proposing WONTFIX.
There is no distro-packaged Search::Sitemap available, and the code is ancient and not even half-working. Let's remove this from production, from the new Bugzilla, and from https://git.wikimedia.org/summary/wikimedia%2Fbugzilla%2Fmodifications.git

Using it on boogs.wmflabs.org, I get this every time:

Pinging search engines to let them know about our sitemap:
Submitting http://boogs.wmflabs.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Ask=HASH(0x903e608) failed: 500 Can't connect to submissions.ask.com:80 (Bad hostname)
       Ask: FAILED
      Live: OK
Submitting http://boogs.wmflabs.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Yahoo=HASH(0x8d93158) failed: 403 Forbidden
     Yahoo: FAILED
    Google: OK
There were some failures while submitting the sitemap to certain search
engines. If you wait a few minutes and run checksetup again, we will
attempt to submit your sitemap again.
Comment 33 This, that and the other (TTO) 2013-12-14 01:17:39 UTC
Fine, but surely this is not the only way to fix the core issue?

(In reply to comment #0)
> ATM, MediaZilla isn't indexed by search engines, which means that searching
> for MediaWiki bugs will *never* get one here

I like:

(In reply to comment #7)
> That said, something like https://bugzilla.mozilla.org/robots.txt is a good
> thing to consider.
Comment 34 Jesús Martínez Novo (Ciencia Al Poder) 2013-12-15 14:05:47 UTC
A sitemap is not needed for search engines to index MediaZilla, since all bugs are sent to wikibugs-l and end up listed on various web pages. But search engines aren't indexing MediaZilla because of this entry in robots.txt:

 Disallow: /*.cgi

But without a sitemap, search engines don't know when a bug is updated, and end up reindexing the entire site every time, producing a lot of overhead on the servers and bringing the site down. With a sitemap, only bugs updated since the last site index would be crawled again (supposedly), reducing the overhead on the site, although I'm not sure to what extent.

From comment 32, it just needs to generate a sitemap; there's no need to ping search engines about its existence. They'll know about it when they fetch robots.txt again and find a sitemap file location there. I don't see why it's pinging search engines.
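For reference, that discovery mechanism is a single line in robots.txt; the Sitemap directive is part of the sitemaps.org protocol and may appear anywhere in the file:

Sitemap: https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml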

The sitemap on bmo (bugzilla.mozilla.org) seems to be generated with a different extension, or a modification of this one, according to this:
https://code.google.com/p/bugzilla-sitemap/issues/detail?id=1

From what I see, that patch doesn't ping search engines; it also saves the sitemap on the server and sends that to the search engines, instead of regenerating the sitemap *every time* the URL is requested, for a period of time defined in SITEMAP_AGE. This should be more convenient (see the sketch below). Maybe we can get the extension that bmo is using from somewhere? Or at least consider using that patch if it looks sane.
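A rough sketch of that age-based caching, with illustrative names and an assumed age value (not the extension's actual code):

import os, time

SITEMAP_AGE = 12 * 3600  # assumed maximum age in seconds before regeneration

def build_sitemap():
    # Placeholder for the real generator, which would walk the bug table.
    return '<?xml version="1.0" encoding="UTF-8"?><urlset/>'

def get_sitemap(path='sitemap.xml'):
    # Serve the cached file while it is fresh enough; otherwise rebuild it.
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < SITEMAP_AGE:
        with open(path) as f:
            return f.read()
    xml = build_sitemap()
    with open(path, 'w') as f:
        f.write(xml)
    return xml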
Comment 35 Andre Klapper 2014-02-03 13:03:04 UTC
WONTFIXing in favor of fixing bug 13881.
