Last modified: 2011-03-13 18:05:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T11563, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 9563 - Include Sitemaps file on wikimedia's robots.txt
Product: Wikimedia
Classification: Unclassified
General/Unknown (Other open bugs)
All All
Importance: Lowest enhancement (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Reported: 2007-04-11 21:07 UTC by Mathias Schindler
Modified: 2011-03-13 18:05 UTC (History)
2 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Mathias Schindler 2007-04-11 21:07:55 UTC
Sitemaps, formerly a Google-only format, is now a project supported by some of
the larger search engines. It is an XML-based site map of a given web site.
MediaWiki does support creating Sitemap XML files, and we do so on a regular
basis. The Sitemaps protocol now supports declaring the location of the XML
file in robots.txt to make it findable by other search engines.

Feature Request: Please mention the location of the corresponding Sitemaps XML
file in each of our robots.txt files, as described in

The result "could" be faster and more efficient indexing of Wikimedia content
on our servers and fewer useless requests for unchanged pages.
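For illustration, such a declaration is a single line in robots.txt; the hostname and path below are hypothetical, not Wikimedia's actual layout:

```text
# Shared crawler rules
User-agent: *
Disallow: /w/

# Hypothetical sitemap location for this host
Sitemap: http://en.example.org/sitemap-index-enwiki.xml
```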
Comment 1 River Tarnell 2007-05-06 15:42:54 UTC
This can't be done easily at the moment because robots.txt is shared between
all wikis.
Comment 2 Dan Jacobson 2008-03-26 23:15:06 UTC
Yes, it would be great to break away from having to do any special
contacting of search engines. If they are interested in indexing us,
they know where to look: robots.txt's "Sitemap:" entry.

We would no longer need a "*oogle/*ahoo! webmaster tools account", any
special catering, or even knowledge of who the search engines are.

So how to do it?

  Sitemaps & Cross Submits
  To submit Sitemaps for multiple hosts from a single host, you need to
  "prove" ownership of the host(s) for which URLs are being submitted in
  a Sitemap... You can do this by modifying the robots.txt file on to point to the Sitemap on
  ...You can specify more than one Sitemap file per robots.txt file.
  Sitemap: <sitemap1_location>
  Sitemap: <sitemap2_location>

Note: we do not see mention of being able to use sitemap INDEX files
on the "Sitemap:" line of robots.txt, just plain sitemap.xml.gz files.

But that is good enough for me:
I'm putting e.g.,
in my robots.txt, which is shared between my three wikis, and
hoping for the best.

Here I can pick and choose amongst all the different namespaces made
by generateSitemap.php (achieving Bug #12860), without having to get
tangled up with editing the sitemap-index-*.xml that generateSitemap.php
also creates, or its bugs: paths: Bug #9675, xsd: Bug #13527. Indeed,
I will rm sitemap-index-*.xml .
Comment 3 Dan Jacobson 2008-03-26 23:32:17 UTC
Wait, Wikimedia sites are big, so maybe you can use
Sitemap: http://aaa.../sitemap-index-yyy.xml
Sitemap: http://bbb.../sitemap-index-zzz.xml
according to

    Specifying the Sitemap location in your robots.txt file:
    If you have a Sitemap index file, you can include the location of just
    that file. You don't need to list each individual Sitemap listed in
    the index file.

if that's indeed what it is trying to say.
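For reference, a sitemap index file of the kind generateSitemap.php emits is just a list of child sitemaps; the URLs here are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <sitemap> entry points at one plain sitemap file -->
  <sitemap>
    <loc>http://aaa.example.org/sitemap-aaa-NS_0-0.xml.gz</loc>
  </sitemap>
  <sitemap>
    <loc>http://aaa.example.org/sitemap-aaa-NS_1-0.xml.gz</loc>
  </sitemap>
</sitemapindex>
```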
Comment 4 Dan Jacobson 2008-03-27 00:30:09 UTC
OK, and here is the makefile I will use
S=$T $R $B
robots.txt:robots-base-jidanni.txt $(addsuffix .SITEMAPS,$S)
	> $@ echo \#Made by $(MAKEFILE_LIST), will get overwritten
	>> $@ cat $<
	ls sitemap-*-NS_{[0-5],1[2345]}-*.xml.gz|perl -pwe \
	)elsif(/radio/){s@^@$R/@};s@^@Sitemap: http://@' >> $@
	cd ../$*/maintenance && \
	    php generateSitemap.php --server=http://$* --fspath=../
	rm sitemap-index-*-wiki_.xml
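The idea behind the makefile can be sketched in plain shell: rebuild a shared robots.txt from a hand-maintained base file plus one "Sitemap:" line per generated sitemap file. The host name and file names below are hypothetical stand-ins, not the poster's actual setup:

```shell
# Sketch only: regenerate robots.txt from a base file plus Sitemap: lines.
set -e
printf 'User-agent: *\nDisallow: /w/\n' > robots-base.txt   # stand-in for the hand-maintained base
touch sitemap-examplewiki-NS_0-0.xml.gz                     # stand-in for generateSitemap.php output
host=http://wiki.example.org                                # hypothetical host

printf '# Generated file, will get overwritten\n' > robots.txt
cat robots-base.txt >> robots.txt
# Append one Sitemap: line per generated sitemap file.
for f in sitemap-*.xml.gz; do
    printf 'Sitemap: %s/%s\n' "$host" "$f" >> robots.txt
done
```

The base file carries the normal crawler rules; only the generated Sitemap: lines change between runs, so the whole robots.txt can be safely regenerated.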

Sure hope this paragraph from,

     When a particular host's robots.txt, say, points to a Sitemap or a Sitemap
     index on another host, it is expected that for each of the target
     Sitemaps, such as, all
     the URLs belong to the host pointing to it. This is because, as noted
     earlier, a Sitemap is expected to have URLs from a single host only.

won't spoil my plans.
Comment 5 Dan Jacobson 2008-03-27 01:05:22 UTC
So I removed my previous sitemap from my Google Webmaster Tools
account, with confidence from

    You can tell Google and other search engines about your sitemap by
    adding the following line to your robots.txt file

    We still recommend that you submit your sitemap through your Webmaster
    Tools account so you can make sure that the Sitemap was processed
    without any issues and to get additional statistics about your

Well, Google will now just have to poke around my robots.txt to find
where my sitemap is, just like the other search engines will. Now I
need not have any special knowledge or (proactive) contact with any
particular search engine company.
Comment 6 JeLuF 2008-03-27 05:21:20 UTC
The problem is that we don't have a robots.txt per wiki. All wikis share one robots.txt. We can't add Sitemap:-lines to the robots.txt because we'd need different entries per wiki.
Comment 7 Dan Jacobson 2008-03-28 23:28:33 UTC
Are you sure you need different entries per wiki?
I just put them all in the same file, expecting that each search engine will
ignore the entries it isn't interested in.

$ HEAD \ \ |grep Length
Content-Length: 1763
Content-Length: 1763
Content-Length: 1763

My logs (but only two days so far) show search engines get the ones they want.
Can you find a statement that what I did is against the protocol?
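Per the Sitemaps protocol, a Sitemap: directive is independent of any User-agent: section, so a crawler can simply pick the lines out of a shared robots.txt. A minimal sketch, with hypothetical file content:

```shell
# Sketch: how a crawler might extract Sitemap: lines from a shared
# robots.txt. The sample content below is made up for illustration.
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /w/
Sitemap: http://aaa.example.org/sitemap-index-aaa.xml
Sitemap: http://bbb.example.org/sitemap-index-bbb.xml
Sitemap: http://ccc.example.org/sitemap-index-ccc.xml
EOF

# The directive is case-insensitive and not tied to a User-agent block,
# so a plain pattern match is enough to list every declared sitemap.
grep -i '^sitemap:' robots.txt | sed 's/^[Ss]itemap:[[:space:]]*//'
```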
