Last modified: 2011-03-13 18:05:49 UTC
Sitemaps was formerly a Google-only project; it is now supported by several of the larger search engines. A Sitemap is an XML-based map of a given web site. MediaWiki supports creating Sitemap XML files, and we generate them on a regular basis. The Sitemaps protocol now supports referencing the location of the Sitemap XML file from robots.txt, making it discoverable by other search engines. Feature request: please mention the location of the corresponding Sitemap XML file in each of our robots.txt files, as described in http://sitemaps.org/protocol.html#submit_robots The result "could" be faster and more efficient indexing of Wikimedia content on our servers, and fewer useless requests for unchanged pages.
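For illustration, a robots.txt using this feature might look like the following (hostname, path, and sitemap filename are hypothetical, not the actual Wikimedia layout):

```
User-agent: *
Disallow: /w/

Sitemap: http://www.example.org/sitemap.xml.gz
```

The Sitemap: line stands outside any User-agent: section, so every crawler that reads the file can find it.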
This can't be done easily at the moment because robots.txt is shared between all wikis.
Yes, it would be great to break away from having to do any special contacting of search engines. If they are interested in indexing us, they know where to look: the "Sitemap:" entry in robots.txt. We would no longer need a "*oogle/*ahoo! webmaster tools account", any special catering, or even to know who the search engines are.

So how to do it? From http://sitemaps.org/protocol.php:

    Sitemaps & Cross Submits
    To submit Sitemaps for multiple hosts from a single host, you need to
    "prove" ownership of the host(s) for which URLs are being submitted in
    a Sitemap... You can do this by modifying the robots.txt file on
    www.host1.com to point to the Sitemap on www.sitemaphost.com...
    ...You can specify more than one Sitemap file per robots.txt file.
    Sitemap: <sitemap1_location>
    Sitemap: <sitemap2_location>

Note: I see no mention of being able to use sitemap INDEX files on the "Sitemap:" line of robots.txt, just plain sitemap.xml.gz files. But that is good enough for me: I'm putting e.g.

    Sitemap: http://radioscanningtw.jidanni.org/sitemap-radioscanningtw-wiki_-NS_0-0.xml.gz
    Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
    Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_5-0.xml.gz
    Sitemap: http://transgender-taiwan.org/sitemap-transgender-wiki_-NS_0-0.xml.gz

in my robots.txt, which is shared between my three wikis, and hoping for the best. Here I can pick and choose amongst all the different namespace files made by generateSitemap.php (achieving Bug #12860), without having to get tangled up with editing the sitemap-index-*.xml files that generateSitemap.php also creates, or their bugs: paths: Bug #9675, xsd: Bug #13527. Indeed, I will rm sitemap-index-*.xml .
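A minimal sketch of what a crawler does with such a file: pull out the Sitemap: lines and fetch each URL. The robots.txt content below is a made-up sample with example hostnames, not any of the files discussed above.

```shell
# Write a sample robots.txt (hosts are hypothetical), then extract the
# Sitemap: entries from it the way a crawler would.
cat > /tmp/robots-sample.txt <<'EOF'
User-agent: *
Disallow: /w/
Sitemap: http://www.host1.example/sitemap-host1-NS_0-0.xml.gz
Sitemap: http://www.host2.example/sitemap-host2-NS_0-0.xml.gz
EOF
# Field 2 of each "Sitemap:" line is the absolute sitemap URL.
grep -i '^Sitemap:' /tmp/robots-sample.txt | awk '{print $2}'
# → http://www.host1.example/sitemap-host1-NS_0-0.xml.gz
# → http://www.host2.example/sitemap-host2-NS_0-0.xml.gz
```

Note that both entries may point at different hosts than the one serving the robots.txt; that is exactly the cross-submit case the protocol text above describes.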
Wait, Wikimedia sites are big, so maybe you can use sitemap index files after all:

    Sitemap: http://aaa.../sitemap-index-yyy.xml
    Sitemap: http://bbb.../sitemap-index-zzz.xml

According to http://sitemaps.org/protocol.php, under "Specifying the Sitemap location in your robots.txt file":

    If you have a Sitemap index file, you can include the location of just
    that file. You don't need to list each individual Sitemap listed in
    the index file.

if that's indeed what it is trying to say.
OK, and here is the makefile I will use:

    T=transgender-taiwan.org
    R=radioscanningtw.jidanni.org
    B=taizhongbus.jidanni.org
    S=$T $R $B
    robots.txt: robots-base-jidanni.txt $(addsuffix .SITEMAPS,$S)
    	echo \#Made by $(MAKEFILE_LIST), will get overwritten > $@
    	cat $< >> $@
    	ls sitemap-*-NS_{[0-5],1[2345]}-*.xml.gz|perl -pwe \
    	'if(/trans/){s@^@$T/@}elsif(/bus/){s@^@$B/@}$(\
    	)elsif(/radio/){s@^@$R/@};s@^@Sitemap: http://@' >> $@
    %.SITEMAPS:
    	cd ../$*/maintenance && \
    	php generateSitemap.php --server=http://$* --fspath=../
    	rm sitemap-index-*-wiki_.xml

I sure hope this paragraph from http://sitemaps.org/protocol.php won't spoil my plans:

    When a particular host's robots.txt, say http://www.host1.com/robots.txt,
    points to a Sitemap or a Sitemap index on another host; it is expected
    that for each of the target Sitemaps, such as
    http://www.sitemaphost.com/sitemap-host1.xml, all the URLs belong to
    the host pointing to it. This is because, as noted earlier, a Sitemap
    is expected to have URLs from a single host only.
So I removed my previous sitemap from my Google Webmaster Tools account, with confidence from http://www.google.com/support/webmasters/bin/answer.py?answer=64748:

    You can tell Google and other search engines about your sitemap by
    adding the following line to your robots.txt file... We still
    recommend that you submit your sitemap through your Webmaster Tools
    account so you can make sure that the Sitemap was processed without
    any issues and to get additional statistics about your site.

Well, Google will now just have to poke around my robots.txt to find where my sitemap is, just like the other search engines will. Now I need not have any special knowledge of, or (proactive) contact with, any particular search engine company.
The problem is that we don't have a robots.txt per wiki; all wikis share one robots.txt. We can't add Sitemap: lines to it because we'd need different entries per wiki.
Are you sure you need different entries per wiki? I just put them all in the same file, expecting that each search engine will ignore the entries it isn't interested in:

    $ HEAD http://transgender-taiwan.org/robots.txt \
           http://radioscanningtw.jidanni.org/robots.txt \
           http://taizhongbus.jidanni.org/robots.txt |grep Length
    Content-Length: 1763
    Content-Length: 1763
    Content-Length: 1763

My logs (only two days' worth so far) show the search engines get the ones they want. Can you find a statement that what I did is against the protocol?
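The same "one file served to every host" check can be done without trusting Content-Length alone, by comparing the fetched copies byte for byte. A sketch with local stand-in files, since we can't fetch those hosts here:

```shell
# Simulate fetching the shared robots.txt from two hosts: if every host
# serves the same file, the copies compare equal byte for byte.
# /tmp/robots-a.txt and /tmp/robots-b.txt stand in for the fetched copies.
printf 'User-agent: *\nSitemap: http://www.host1.example/sitemap.xml.gz\n' > /tmp/robots-a.txt
cp /tmp/robots-a.txt /tmp/robots-b.txt
cmp -s /tmp/robots-a.txt /tmp/robots-b.txt && echo identical
# → identical
```

Two files of equal length can still differ, so cmp is the stricter test.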