Last modified: 2008-08-19 23:46:09 UTC
Created attachment 4606 [details] Namespace limit patch for generateSitemap.php
Sometimes, in addition to restricting crawling via robots.txt, it's a good idea to limit what goes into the sitemap. For example, some extensions create non-content namespaces (similar to the MediaWiki namespace), and it doesn't make sense to include them in the sitemap. Attached is a patch that lets the user specify a list of namespaces for which to generate sitemaps.
Perhaps do this via command-line options instead of a site config var?
Not sure. There are usually quite a lot of namespaces in a configuration, and it's a pain in the neck to put them all on the command line. Besides, LocalSettings.php is usually changed a lot anyway, and adding more to it is quite usual (I have tons of things in there). Also, having a blacklist instead of a whitelist might be a good idea, because there are fewer namespaces to exclude and that list is usually constant: if someone adds a new namespace, it's most probably content and should be indexed by crawlers. I've added changes for exclusion and fixed a bug with an undefined variable.
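To illustrate the blacklist idea, here is a minimal sketch of deriving the namespaces to index from an exclusion list. The function name sitemapNamespaces() and both of its parameters are hypothetical and are not taken from the attached patch; only the numeric namespace IDs (0 = main, 1 = talk, 8 = MediaWiki, 14 = category) are the ones MediaWiki core defines.

```php
<?php
// Hypothetical helper, NOT part of the attached patch: turn an exclusion
// list (blacklist) into the list of namespaces that should be indexed.
function sitemapNamespaces( array $allNamespaces, array $excluded ) {
	// array_diff() keeps the values of the first array that are absent
	// from the second; array_values() reindexes the result from 0.
	return array_values( array_diff( $allNamespaces, $excluded ) );
}

// Namespace IDs as defined by MediaWiki core:
// 0 = NS_MAIN, 1 = NS_TALK, 8 = NS_MEDIAWIKI, 14 = NS_CATEGORY.
$all = array( 0, 1, 8, 14 );
$excluded = array( 8 ); // skip the non-content MediaWiki namespace

print_r( sitemapNamespaces( $all, $excluded ) );
```

With this approach a newly added namespace is indexed by default, and only the short, rarely changing exclusion list has to be maintained.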
Created attachment 4646 [details] Patch to implement blacklisting and whitelisting namespaces for sitemap generation
Oops. The patch also adds the full server URL to the sitemap. I believe I saw a bug for that, but can't remember where. Feel free to remove the change if you don't want to include it.
Fixed in r33498 using a slightly modified version of the first patch. Exclusion may be added later, but it seems a rather large jump from no discrimination whatsoever; I suggest you open another bug for that.
Perhaps add an example of usage. Say, in LocalSettings.php put: $wgSitemapNamespaces = array( NS_MAIN, NS_TALK, ... NS_CATEGORY, NS_CATEGORY_TALK, ); Actually, it seems a waste to put it in LocalSettings.php, as it might be used only 1/999999 of the times that file is read...
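For reference, a complete fragment might look like the sketch below. The selection of namespaces is only illustrative and is not a recommendation from this thread; the NS_* constants are the ones MediaWiki defines, so the fragment is meant to live inside LocalSettings.php rather than run on its own.

```php
<?php
// LocalSettings.php fragment (illustrative selection of namespaces only).
// Sitemaps are generated just for the namespaces listed here.
$wgSitemapNamespaces = array(
	NS_MAIN,
	NS_CATEGORY,
);
```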