Last modified: 2011-03-13 18:05:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You can still run searches in Bugzilla or access your list of votes, but bug reports in Bugzilla are obviously no longer up to date.
Bug 9563 - Include Sitemaps file on wikimedia's robots.txt
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Lowest
Severity: enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2007-04-11 21:07 UTC by Mathias Schindler
Modified: 2011-03-13 18:05 UTC
CC: 2 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Description Mathias Schindler 2007-04-11 21:07:55 UTC
Sitemaps, formerly a Google-only project, is now supported by several of the
larger search engines. It is an XML-based site map of a given web site.
MediaWiki supports creating Sitemap XML files, and we do so on a regular
basis. The Sitemaps protocol now supports listing the XML file's location in
robots.txt to make it discoverable by other search engines.

Feature request: please list the location of the corresponding Sitemap XML
file in each of our robots.txt files, as described in
http://sitemaps.org/protocol.html#submit_robots

The result could be faster and more efficient indexing of the Wikimedia
content on our servers, and fewer useless requests for unchanged pages.
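
The mechanism referenced above is a single additional line in robots.txt. A hypothetical example entry (the sitemap URL here is illustrative, not an actual Wikimedia path):

```text
User-agent: *
Disallow: /w/

Sitemap: http://en.wikipedia.org/sitemap-index.xml
```

Per the protocol, the Sitemap: line is independent of any User-agent section and may appear anywhere in the file.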
Comment 1 River Tarnell 2007-05-06 15:42:54 UTC
This can't be done easily at the moment because robots.txt is shared between
all wikis.
Comment 2 Dan Jacobson 2008-03-26 23:15:06 UTC
Yes, it would be great to move away from having to contact search engines
individually. If they are interested in indexing us, they know where to look:
the "Sitemap:" entry in robots.txt.

We would no longer need a "*oogle/*ahoo! webmaster tools account", or any
special catering to, or even knowledge of, who the search engines are.

So how to do it?

In http://sitemaps.org/protocol.php:
  Sitemaps & Cross Submits
  To submit Sitemaps for multiple hosts from a single host, you need to
  "prove" ownership of the host(s) for which URLs are being submitted in
  a Sitemap... You can do this by modifying the robots.txt file on
  www.host1.com to point to the Sitemap on www.sitemaphost.com...
  ...You can specify more than one Sitemap file per robots.txt file.
  Sitemap: <sitemap1_location>
  Sitemap: <sitemap2_location>

Note: we do not see mention of being able to use sitemap INDEX files
on the "Sitemap:" line of robots.txt, just plain sitemap.xml.gz files.

But that is good enough for me. I'm putting, e.g.,
Sitemap: http://radioscanningtw.jidanni.org/sitemap-radioscanningtw-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_5-0.xml.gz
Sitemap: http://transgender-taiwan.org/sitemap-transgender-wiki_-NS_0-0.xml.gz
in my robots.txt, which is shared between my three wikis, and hoping for
the best.
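
This "one shared file, multiple hosts" setup can be checked mechanically. A minimal Python sketch (using the entries listed above) that groups the Sitemap: lines by host:

```python
from urllib.parse import urlparse

ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
Sitemap: http://radioscanningtw.jidanni.org/sitemap-radioscanningtw-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_5-0.xml.gz
Sitemap: http://transgender-taiwan.org/sitemap-transgender-wiki_-NS_0-0.xml.gz
"""

def sitemaps_by_host(robots_txt):
    """Group every Sitemap: URL found in a robots.txt body by its host."""
    hosts = {}
    for line in robots_txt.splitlines():
        # The field name is case-insensitive per the sitemaps.org protocol.
        if line.lower().startswith("sitemap:"):
            url = line.split(":", 1)[1].strip()
            hosts.setdefault(urlparse(url).netloc, []).append(url)
    return hosts

by_host = sitemaps_by_host(ROBOTS_TXT)
for host, urls in sorted(by_host.items()):
    print(host, len(urls))
```

Each search engine can then pick out the entries for the host it is crawling and skip the rest.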

Here I can pick and choose amongst all the different namespaces made by
generateSitemap.php (achieving Bug #12860), without having to get tangled up
with editing the sitemap-index-*.xml files that generateSitemap.php also
creates, or with their bugs (paths: Bug #9675, xsd: Bug #13527). Indeed,
I will rm sitemap-index-*.xml .
Comment 3 Dan Jacobson 2008-03-26 23:32:17 UTC
Wait, Wikimedia sites are big, so maybe you can use
Sitemap: http://aaa.../sitemap-index-yyy.xml
Sitemap: http://bbb.../sitemap-index-zzz.xml
according to http://sitemaps.org/protocol.php:

    Specifying the Sitemap location in your robots.txt file:
    If you have a Sitemap index file, you can include the location of just
    that file. You don't need to list each individual Sitemap listed in
    the index file.

if that's indeed what it is trying to say.
Comment 4 Dan Jacobson 2008-03-27 00:30:09 UTC
OK, here is the makefile I will use:

T=transgender-taiwan.org
R=radioscanningtw.jidanni.org
B=taizhongbus.jidanni.org
S=$T $R $B
robots.txt:robots-base-jidanni.txt $(addsuffix .SITEMAPS,$S)
	> $@ echo \#Made by $(MAKEFILE_LIST), will get overwritten
	>> $@ cat $<
	ls sitemap-*-NS_{[0-5],1[2345]}-*.xml.gz|perl -pwe \
	'if(/trans/){s@^@$T/@}elsif(/bus/){s@^@$B/@}$(\
	)elsif(/radio/){s@^@$R/@};s@^@Sitemap: http://@' >> $@
%.SITEMAPS:
	cd ../$*/maintenance && \
	    php generateSitemap.php --server=http://$* --fspath=../
	rm sitemap-index-*-wiki_.xml
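
The makefile's core step, mapping each generated sitemap file to its wiki's hostname and emitting one "Sitemap:" line per file, can be sketched in plain shell. This is a self-contained demo, not the author's exact setup: it creates its own dummy inputs, and the base filename is hypothetical.

```shell
#!/bin/sh
# Sketch of the makefile's Sitemap: generation step: choose each file's
# host from its filename pattern, then append one Sitemap: line per file.
set -e

# demo inputs (filenames follow the patterns used in the makefile above)
printf 'User-agent: *\nDisallow: /w/\n' > robots-base.txt
: > sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
: > sitemap-transgender-wiki_-NS_0-0.xml.gz

{
  echo '# Generated file, will get overwritten'
  cat robots-base.txt
  for f in sitemap-*.xml.gz; do
    case $f in                       # map filename pattern -> host
      *trans*) host=transgender-taiwan.org ;;
      *bus*)   host=taizhongbus.jidanni.org ;;
      *radio*) host=radioscanningtw.jidanni.org ;;
      *)       continue ;;
    esac
    echo "Sitemap: http://$host/$f"
  done
} > robots.txt

grep -c '^Sitemap:' robots.txt   # prints 2
```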

Sure hope this paragraph from http://sitemaps.org/protocol.php,

     When a particular host's robots.txt, say
     http://www.host1.com/robots.txt, points to a Sitemap or a Sitemap
     index on another host; it is expected that for each of the target
     Sitemaps, such as http://www.sitemaphost.com/sitemap-host1.xml, all
     the URLs belong to the host pointing to it. This is because, as noted
     earlier, a Sitemap is expected to have URLs from a single host only.

won't spoil my plans.
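
The quoted restriction is straightforward to check. A minimal sketch (the host and page URLs are hypothetical, echoing the protocol's own examples) that verifies every URL inside a sitemap belongs to the host whose robots.txt references it:

```python
from urllib.parse import urlparse

def cross_submit_ok(robots_host, sitemap_urls):
    """Per sitemaps.org, a sitemap referenced from host1's robots.txt may
    live on another host, but every URL inside it must belong to host1."""
    return all(urlparse(u).netloc == robots_host for u in sitemap_urls)

# hypothetical page URLs found inside a sitemap hosted elsewhere
pages = ["http://www.host1.com/wiki/Main_Page", "http://www.host1.com/wiki/Help"]
print(cross_submit_ok("www.host1.com", pages))                              # True
print(cross_submit_ok("www.host1.com", pages + ["http://other.example/x"])) # False
```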
Comment 5 Dan Jacobson 2008-03-27 01:05:22 UTC
So I removed my previous sitemap from my Google Webmaster Tools
account, with confidence from
http://www.google.com/support/webmasters/bin/answer.py?answer=64748

    You can tell Google and other search engines about your sitemap by
    adding the following line to your robots.txt file

    We still recommend that you submit your sitemap through your Webmaster
    Tools account so you can make sure that the Sitemap was processed
    without any issues and to get additional statistics about your
    site.

Well, Google will now just have to poke around my robots.txt to find
where my sitemap is, just like the other search engines will. Now I
need not have any special knowledge or (proactive) contact with any
particular search engine company.
Comment 6 JeLuF 2008-03-27 05:21:20 UTC
The problem is that we don't have a robots.txt per wiki. All wikis share one robots.txt. We can't add Sitemap:-lines to the robots.txt because we'd need different entries per wiki.
Comment 7 Dan Jacobson 2008-03-28 23:28:33 UTC
Are you sure you need different entries per wiki?
I just put them all in the same file, expecting that each search engine will
ignore the entries it is not interested in.

$ HEAD http://transgender-taiwan.org/robots.txt \
http://radioscanningtw.jidanni.org/robots.txt \
http://taizhongbus.jidanni.org/robots.txt |grep Length
Content-Length: 1763
Content-Length: 1763
Content-Length: 1763

My logs (only two days' worth so far) show that search engines fetch the ones
they want. Can you find a statement that what I did is against the protocol?


