Last modified: 2012-01-15 00:47:16 UTC
Google sitemap validation reports errors. maintenance/generateSitemap.php doesn't write full (absolute) locations into the sitemap index.
Created attachment 3511 [details] generateSitemap.php.patch fixes the bug for my http://perl6.cz see http://perl6.cz/sitemap-index-perl6.xml
Is that guaranteed to be a correct path? That seems to assume that all output files will be in the root URL directory at the wiki's $wgServer path.
There is:
--server=<server> The protocol and server name to use in URLs, e.g. http://en.wikipedia.org. This is sometimes necessary because server name detection may fail in command line scripts.
and:
$wgServer = $options['server'];
And the path?
And more generally, not breaking the links to files if files and content are at different places?
The sitemap index and the sitemap files should be in the same directory. $wgServer should be only the server name, without a path. Nothing like http://en.wikipedia.org/wiki, because this breaks content links.
I have tested this patch using the latest version of MediaWiki from SVN. It works in my configuration (which is very complex), but all my sitemaps are at the root of the website, so I was unable to test for the problem that Brion said might occur. When can this be added to the trunk? It would be very useful.
I fixed this with a sed one-liner, if namespace sitemaps and index are in sitemaps/:
sed 's/>sitemap-/>http\:\/\/domain.tld\/sitemaps\/sitemap-/' $BASEDIR/sitemaps/sitemap-index.xml > $BASEDIR/sitemap.xml
Ugly but it works.
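For readers who would rather not pattern-match on ">sitemap-", the same post-processing idea can be sketched in Python. This is only an illustration of the workaround above, not part of any patch in this thread: the function name absolutize_index and the sample file name are made up for the example, and http://domain.tld/sitemaps is the placeholder base URL from the sed command.

```python
import re

# Matches anything that already starts with a URI scheme such as http://
ABSOLUTE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def absolutize_index(index_xml, base_url):
    """Prefix every relative <loc> value with base_url; leave absolute ones alone.

    Hypothetical helper: assumes the index uses plain <loc>...</loc> elements
    without attributes, as generateSitemap.php writes them.
    """
    base = base_url.rstrip("/") + "/"

    def fix(match):
        loc = match.group(1).strip()
        if ABSOLUTE.match(loc):
            return "<loc>" + loc + "</loc>"
        return "<loc>" + base + loc.lstrip("/") + "</loc>"

    return re.sub(r"<loc>(.*?)</loc>", fix, index_xml, flags=re.DOTALL)


# Sample input with a made-up sitemap file name.
sample = "<sitemap>\n\t<loc>sitemap-wiki-NS_0-0.xml.gz</loc>\n</sitemap>\n"
print(absolutize_index(sample, "http://domain.tld/sitemaps"))
```

Unlike the sed version, running this twice on the same file does not double the prefix, because entries that are already absolute are skipped.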
I also faced this issue and created a similar patch for it: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=generateSitemap.php.patch;att=1;bug=460831 Any chance it will get fixed soon?
Created attachment 4828 [details] Allow using full path via specification of path to web root
This patch adds a new command line option --path allowing you to specify the path to the sitemap file relative to the web root, e.g. --server=http://en.wikipedia.org --path=/w/ would generate http://en.wikipedia.org/w/sitemapname....xml
It doesn't seem the cleanest way, but it is the only feasible one I could find.
Here in 1.13 (I just changed it above too, hope that's OK), I don't see why generateSitemap.php's --server option only affects the sitemaps, and not the sitemap indexes. Using
php generateSitemap.php --server=http://taizhongbus.jidanni.org --fspath=../
I get sitemap-taizhongbus-wiki_-NS_0-0.xml.gz ... with absolute URIs, but sitemap-index-taizhongbus-wiki_.xml with relative URIs. So the chain
robots.txt -> sitemap-index-taizhongbus-wiki_.xml -> sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
ends up being
absolute -> relative -> absolute
We see in http://www.sitemaps.org/protocol.php at "Sample XML Sitemap Index" that they use absolute URIs and not relative URIs, so wouldn't it be best, given the murky nature of all this, to follow their example? One knows that according to the rules, both robots.txt "Sitemap:" entries and the sitemaps themselves must contain absolute URIs, so one wonders how the middle link in the chain can take the risk of containing relative URIs.
In http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd: "The URI must conform to RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt)." (But that RFC also covers relative URIs, so who knows.)
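A quick way to see whether the middle link of that chain is relative is to parse the index and list every <loc> that lacks a scheme and host. The following is a minimal Python sketch, not something from the MediaWiki tree: the function name relative_locs is made up, the namespace URI is the one from the sitemaps.org schema cited above, and the NS_1 entry in the sample is invented purely to show an absolute entry next to a relative one.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by sitemap and sitemap index files (sitemaps.org 0.9 schema).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def relative_locs(index_xml):
    """Return the <loc> values in a sitemap index that are not absolute URIs."""
    root = ET.fromstring(index_xml)
    bad = []
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        value = (loc.text or "").strip()
        parts = urlsplit(value)
        # Treat a value as absolute only if it has both a scheme and a host.
        if not (parts.scheme and parts.netloc):
            bad.append(value)
    return bad


# Sample index: the first entry mirrors the relative output described above,
# the second (hypothetical) entry is already absolute.
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>sitemap-taizhongbus-wiki_-NS_0-0.xml.gz</loc></sitemap>
  <sitemap><loc>http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_1-0.xml.gz</loc></sitemap>
</sitemapindex>"""

for value in relative_locs(index_xml):
    print("relative:", value)
```

An empty result means every entry in the index is absolute, which is what the sample index on sitemaps.org shows.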
(In reply to comment #11)
> Here in 1.13 (I just changed it above too, hope that's OK),
> I don't see why generateSitemap.php's --server option
> only affects the sitemaps, and not the sitemap indexes.
This is because the path to the sitemaps is not known, and therefore cannot reliably be placed in the sitemap index file unless an additional parameter to specify it is added to the script. E.g. if your sitemaps were located in the directory /var/www/sitemaps/ as opposed to /var/www/ or /var/www/w/, MediaWiki would have no way of knowing this, and your --server parameter would cause the index file to list them as http://www.example.com/<sitemap> or http://www.example.com/w/<sitemap> (depending on implementation) rather than http://www.example.com/sitemaps/<sitemap>
OK, I sure hope there will be a final way to get a http:// URL in the indexes. For now I will just use my wacko Makefile:
# Make sitemaps for my wikis, that all live in the same tree
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson http://jidanni.org/
# Created On : Thu Mar 27 04:11:10 2008
# Last Modified On: Fri Aug 22 07:39:26 2008
# Update Count : 80
# https://bugzilla.wikimedia.org/show_bug.cgi?id=9675
T=transgender-taiwan.org
R=radioscanningtw.jidanni.org
B=taizhongbus.jidanni.org
S=$T $R $B
all:$(addsuffix .SITEMAPS,$S)
%.SITEMAPS:
	cd ../$*/maintenance && \
	php generateSitemap.php --server=http://$* --fspath=../
	perl -wpi -e 'use strict; use warnings FATAL => q(all);$(\
	)s@(<loc>)(sitemap)@$$1http://$*/$$2@' `ls -t sitemap-index-*.xml|sed q`
	sleep 2
New here, not too sure of the form, but I want this fixed so I don't have to remember to swap in my own code *again* after the next update. Here is how I fix this problem. The main point to note is that I sidestep the whole issue of finding out the correct path by asking the human. This thing has to be run by a sysadmin from the command line, so we are not talking monkeys here. This is documented with these lines:
+ --webpath=<dir> If you are placing the sitemap files in a sub folder
+ i.e. using the --fspath option and specify somewhere other than root
+ you need to place here the directory name e.g:
+
+ if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+ then --webpath = /mediawiki_sitemaps *Note, no trailing / needed
I hope this helps.
$ svn diff generateSitemap.php
Index: generateSitemap.php
===================================================================
--- generateSitemap.php (revision 38559)
+++ generateSitemap.php (working copy)
@@ -1,4 +1,4 @@
-<?php
+x<?php
 define( 'GS_MAIN', -2 );
 define( 'GS_TALK', -1 );
 /**
@@ -367,9 +367,11 @@
 	 * @return string
 	 */
 	function indexEntry( $filename ) {
+		global $wgServer;
+		global $wgWebpath;
 		return
 			"\t<sitemap>\n" .
-			"\t\t<loc>$filename</loc>\n" .
+			"\t\t<loc>$wgServer$wgWebpath/$filename</loc>\n" .
 			"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
 			"\t</sitemap>\n";
 	}
@@ -457,18 +459,30 @@
 server name detection may fail in command line scripts.
 --compress=[yes|no] compress the sitemap files, default yes
+
+ --webpath=<dir> If you are placing the sitemap files in a sub folder
+ i.e. using the --fspath option and specify somewhere other than root
+ you need to place here the directory name e.g:
+
+ if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+ then --webpath = /mediawiki_sitemaps *Note, no trailing / needed
+
 EOT;
 	die( -1 );
 }
-$optionsWithArgs = array( 'fspath', 'server', 'compress' );
+$optionsWithArgs = array( 'fspath', 'server', 'compress', 'webpath' );
 require_once( dirname( __FILE__ ) . '/commandLine.inc' );
 if ( isset( $options['server'] ) ) {
 	$wgServer = $options['server'];
 }
+if ( isset( $options['webpath'] ) ) {
+	$wgWebpath = $options['webpath'];
+}
+
 $gs = new GenerateSitemap( @$options['fspath'], @$options['compress'] !== 'no' );
 $gs->main();
Created attachment 5218 [details] Proposed replacement for generateSitemap.php
Here is a very short patch for this problem:
--- generateSitemap.php 2008-11-03 11:37:53.000000000 +0100
+++ /srv/www/htdocs/mw/esl/maintenance/generateSitemap.php 2008-11-03 11:40:40.000000000 +0100
@@ -392,9 +392,13 @@
 	 * @return string
 	 */
 	function indexEntry( $filename ) {
+		global $wgServer;
+		$title = Title::makeTitle( '', '' );
+		$location = $wgServer . $title->getLocalUrl() . $filename;
+
 		return
 			"\t<sitemap>\n" .
-			"\t\t<loc>$filename</loc>\n" .
+			"\t\t<loc>$location</loc>\n" .
 			"\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
 			"\t</sitemap>\n";
 	}
*** Bug 14397 has been marked as a duplicate of this bug. ***
Any news on this issue?
I'll stick the current workaround Makefile I'm "forced to use" in the URL box above. Also noting inconsistent use of --server in maintenance scripts, just in case one day somebody wants to unify them:
maintenance/generateSitemap.php:481: --server=<server> The protocol and server name to use in URLs, e.g.
maintenance/dumpBackup.php:87: --server=h Force reading from MySQL server h
maintenance/dumpTextPass.php:518: --server=h Force reading from MySQL server h
See also bug 19593.
The problem seems pretty easy to resolve, i.e. just add the full path (as most MediaWiki users have to do themselves if they want to use sitemaps with Google without an error message). What's causing the apparent delay in resolving this bug?
Please consider MediaWiki farms like this: https://fusionforge.org/plugins/mediawiki/wiki/fusionforge/index.php/Main_Page
Fixed in r77176. I took a KISS approach to fixing this issue and simply added a new --urlpath parameter, which can be used to specify the URL path corresponding to --fspath. For most installations, this will probably equal or begin with the server name, but I figured the minor redundancy should be worth the flexibility and simplicity.
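To make the idea of a URL prefix that mirrors --fspath concrete, here is a rough Python sketch of how an index entry could be assembled once such a prefix is supplied. It is not the code from r77176; the function name index_entry, the trailing-slash handling, the XML escaping, and the sample timestamp and file name are all assumptions made for illustration.

```python
from typing import Optional
from xml.sax.saxutils import escape


def index_entry(filename: str, timestamp: str, urlpath: Optional[str] = None) -> str:
    """Build one <sitemap> entry, prefixing the file name with urlpath if given.

    Hypothetical sketch: urlpath plays the role of the --urlpath value, i.e.
    the URL that corresponds to the directory passed as --fspath.
    """
    prefix = ""
    if urlpath:
        # Assumed behaviour: make sure exactly one slash separates path and file.
        prefix = urlpath if urlpath.endswith("/") else urlpath + "/"
    loc = escape(prefix + filename)
    return (
        "\t<sitemap>\n"
        f"\t\t<loc>{loc}</loc>\n"
        f"\t\t<lastmod>{timestamp}</lastmod>\n"
        "\t</sitemap>\n"
    )


# Made-up file name and timestamp; the URL is the example host used earlier
# in this thread.
print(index_entry("sitemap-wiki-NS_0-0.xml.gz", "2010-11-23T00:00:00Z",
                  "http://www.example.com/sitemaps"))
```

Passing the whole URL rather than deriving it from $wgServer covers the cases raised earlier in the thread, where the sitemap directory or host differs from where the wiki content is served.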