Last modified: 2012-01-15 00:47:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T11675, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 9675 - sitemap-index doesn't include full location path
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Maintenance scripts
Version: 1.16.x
Hardware: All All
Importance: Normal major with 7 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://transgender-taiwan.org/jidanni...
Keywords: patch, patch-need-review
Duplicates: 14397
Depends on:
Blocks:
Reported: 2007-04-24 11:29 UTC by Michal Jurosz
Modified: 2012-01-15 00:47 UTC (History)
14 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
generateSitemap.php.patch (824 bytes, patch)
2007-04-24 11:31 UTC, Michal Jurosz
Allow using full path via specification of path to web root (2.21 KB, patch)
2008-04-17 18:45 UTC, Robert Leverington
Proposed replacement for generateSitemap.php (11.26 KB, patch)
2008-08-27 20:50 UTC, The GadgetDoctor

Description Michal Jurosz 2007-04-24 11:29:45 UTC
Google sitemap validation reports errors.
maintenance/generateSitemap.php doesn't write the full location (absolute URL) of each sitemap into the sitemap index.
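
To illustrate the report: the sample sitemap index at sitemaps.org gives every `<loc>` as a full URL, so a bare file name is the kind of entry a validator objects to. The following is a minimal Python sketch of such a check, not the validator Google uses; the function name `relative_locs` and the sample index content are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# XML namespace used by sitemaps.org sitemap and sitemap index files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def relative_locs(index_xml):
    """Return every <loc> value in a sitemap index that is not an absolute URL."""
    root = ET.fromstring(index_xml)
    bad = []
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        value = (loc.text or "").strip()
        parts = urlparse(value)
        # An absolute URL needs both a scheme (http) and a host.
        if not (parts.scheme and parts.netloc):
            bad.append(value)
    return bad


# Hypothetical index: the first entry is relative, the second is absolute.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>sitemap-wiki-NS_0-0.xml.gz</loc></sitemap>
  <sitemap><loc>http://www.example.com/sitemap-wiki-NS_1-0.xml.gz</loc></sitemap>
</sitemapindex>
"""

print(relative_locs(SAMPLE_INDEX))  # -> ['sitemap-wiki-NS_0-0.xml.gz']
```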
Comment 1 Michal Jurosz 2007-04-24 11:31:34 UTC
Created attachment 3511
generateSitemap.php.patch

Fixes the bug for my site http://perl6.cz;
see http://perl6.cz/sitemap-index-perl6.xml
Comment 2 Brion Vibber 2007-04-26 18:39:11 UTC
Is that guaranteed to be a correct path? That seems to assume that all output
files will be in the root URL directory at the wiki's $wgServer path.
Comment 3 Michal Jurosz 2007-04-26 20:20:23 UTC
There is

  --server=<server>	The protocol and server name to use in URLs, e.g.
		http://en.wikipedia.org. This is sometimes necessary because
		server name detection may fail in command line scripts.

and 

 $wgServer = $options['server'];
Comment 4 Brion Vibber 2007-04-26 20:33:44 UTC
And the path?
Comment 5 Brion Vibber 2007-04-26 20:34:16 UTC
And more generally, not breaking the links to files if files and content are at
different places?
Comment 6 Michal Jurosz 2007-04-27 06:47:33 UTC
The sitemap index and the sitemap files should be in the same directory.
$wgServer should be only the server name, without a path: nothing like
http://en.wikipedia.org/wiki, because this breaks content links.
Comment 7 Robert Leverington 2007-06-24 12:11:52 UTC
I have tested this patch using the latest version of MediaWiki from SVN. It works in my configuration (which is very complex), but all my sitemaps are at the root of the website, so I was unable to test for the problem that Brion said might occur. When can this be added to trunk? It would be very useful.
Comment 8 Alexander 2007-09-13 00:59:14 UTC
I fixed this with a sed one-liner, if namespace sitemaps and index are in sitemaps/:

sed 's/>sitemap-/>http\:\/\/domain.tld\/sitemaps\/sitemap-/' $BASEDIR/sitemaps/sitemap-index.xml > $BASEDIR/sitemap.xml

Ugly but it works.
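
The same post-processing step could be sketched in Python as below. This is only an illustration of the workaround, assuming, as the sed expression does, that every relative entry starts with `sitemap-`; the function name `absolutize_index` is hypothetical.

```python
import re


def absolutize_index(index_xml, base_url):
    """Prefix base_url to every <loc> value that starts with 'sitemap-'."""
    prefix = base_url.rstrip("/") + "/"
    # Entries that already start with http:// no longer match the pattern,
    # so running the function twice leaves the text unchanged.
    return re.sub(
        r"(<loc>)(sitemap-)",
        lambda m: m.group(1) + prefix + m.group(2),
        index_xml,
    )


line = "<sitemap><loc>sitemap-wiki-NS_0-0.xml.gz</loc></sitemap>"
print(absolutize_index(line, "http://domain.tld/sitemaps/"))
```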
Comment 9 Michal Čihař 2008-01-26 03:09:45 UTC
I also faced this issue and created similar patch for it: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=5;filename=generateSitemap.php.patch;att=1;bug=460831

Any chance it will get fixed soon?
Comment 10 Robert Leverington 2008-04-17 18:45:50 UTC
Created attachment 4828
Allow using full path via specification of path to web root

This patch adds a new command line option --path allowing you to specify the path to the sitemap file relative to the system root.

e.g. --server=http://en.wikipedia.org --path=/w/

would generate

http://en.wikipedia.org/w/sitemapname....xml

It doesn't seem the cleanest way, but it is the only feasible one I could find.
Comment 11 Dan Jacobson 2008-08-21 19:29:43 UTC
Here in 1.13 (I just changed it above too, hope that's OK),
I don't see why generateSitemap.php's --server option
only affects the sitemaps, and not the sitemap indexes.

Using
  php generateSitemap.php --server=http://taizhongbus.jidanni.org --fspath=../
I get sitemap-taizhongbus-wiki_-NS_0-0.xml.gz ...
with absolute URIs, but sitemap-index-taizhongbus-wiki_.xml
with relative URIs.

So the chain robots.txt -> sitemap-index-taizhongbus-wiki_.xml ->
sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
ends up being absolute -> relative -> absolute

As we see in http://www.sitemaps.org/protocol.php at "Sample XML
Sitemap Index", they use absolute URIs and not relative URIs, so
wouldn't it be best, given the murky nature of all this, to follow
their example?

One knows that according to the rules, both robots.txt "Sitemap:" entries,
and the sitemaps themselves must contain absolute URIs, so one wonders how the
middle link in the chain can take the risk of containing relative URIs.

In http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd:
 "The URI must conform to RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt)."
(but in that RFC there are also relative URIs, so who knows.)
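
For what it is worth, standard relative-reference resolution would let a consumer resolve a relative `<loc>` against the URL the index itself was fetched from, though nothing in this thread shows that sitemap consumers actually do so. A small sketch with Python's `urllib.parse.urljoin`; the index URL is an assumption built from the file names above, since the thread does not say where the index is served.

```python
from urllib.parse import urljoin

# Assumed location of the index file; not confirmed anywhere in the thread.
index_url = "http://taizhongbus.jidanni.org/sitemap-index-taizhongbus-wiki_.xml"
relative_loc = "sitemap-taizhongbus-wiki_-NS_0-0.xml.gz"

# A relative reference resolves against the directory of the index URL.
resolved = urljoin(index_url, relative_loc)
print(resolved)

# An absolute reference is returned unchanged, whatever the base is.
absolute_loc = "http://taizhongbus.jidanni.org/" + relative_loc
print(urljoin(index_url, absolute_loc))
```

A consumer that skips this resolution step has no way to interpret the relative entry, which is the risk described above.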
Comment 12 Robert Leverington 2008-08-21 21:20:31 UTC
(In reply to comment #11)
> Here in 1.13 (I just changed it above too, hope that's OK),
> I don't see why generateSitemap.php's --server option
> only affects the sitemaps, and not the sitemap indexes.

This is because the path to the sitemaps is not known and therefore cannot reliably be placed in the sitemap index file without an additional parameter to specify this being added to the script.

e.g. if your sitemaps were located in the directory

/var/www/sitemaps/

as opposed to

/var/www/

or

/var/www/w/

MediaWiki would have no way of knowing this and your --server parameter would cause the index file to list them as

http://www.example.com/<sitemap>

or

http://www.example.com/w/<sitemap>

(depending on implementation) rather than

http://www.example.com/sitemaps/<sitemap>
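
The point can be made concrete: mapping a directory on disk to a URL requires the web server's document root, which the script does not know. In the hypothetical Python sketch below, the `docroot` parameter is exactly that missing piece of information; the function name is made up, and the sketch ignores aliases and rewrite rules, which would break even this mapping.

```python
from pathlib import PurePosixPath


def url_for_sitemap(server, docroot, fspath, filename):
    """Map a sitemap file on disk to its public URL, given the document root."""
    # relative_to() raises ValueError when fspath is not inside docroot.
    rel = PurePosixPath(fspath).relative_to(PurePosixPath(docroot))
    parts = list(rel.parts) + [filename]
    return server.rstrip("/") + "/" + "/".join(parts)


# The three directories from the comment above, all under the same docroot.
for directory in ("/var/www/sitemaps/", "/var/www/", "/var/www/w/"):
    print(url_for_sitemap("http://www.example.com", "/var/www", directory, "sitemap.xml"))
```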
Comment 13 Dan Jacobson 2008-08-21 23:42:05 UTC
OK, I sure hope there will be a final way to get a http:// URL in the indexes.

For now I will just use my wacko Makefile:

# Make sitemaps for my wikis, that all live in the same tree
# Copyright       : http://www.fsf.org/copyleft/gpl.html
# Author          : Dan Jacobson http://jidanni.org/
# Created On      : Thu Mar 27 04:11:10 2008
# Last Modified On: Fri Aug 22 07:39:26 2008
# Update Count    : 80
# https://bugzilla.wikimedia.org/show_bug.cgi?id=9675

T=transgender-taiwan.org
R=radioscanningtw.jidanni.org
B=taizhongbus.jidanni.org
S=$T $R $B
all:$(addsuffix .SITEMAPS,$S)
%.SITEMAPS:
	cd ../$*/maintenance && \
	    php generateSitemap.php --server=http://$* --fspath=../
	perl -wpi -e 'use strict; use warnings FATAL => q(all);$(\
	)s@(<loc>)(sitemap)@$$1http://$*/$$2@' `ls -t sitemap-index-*.xml|sed q`
	sleep 2
Comment 14 The GadgetDoctor 2008-08-22 13:50:43 UTC
New here and not too sure of the form, but I want this fixed so I don't have to remember to swap in my own code *again* after the next update. Here is how I fix this problem.

The main point to note is that I sidestep the whole issue of finding out the correct path by asking the human. This thing has to be run by a sysadmin from the command line, so we are not talking monkeys here. This is documented with these lines:

+       --webpath=<dir>         If you are placing the sitemap files in a sub folder
+               i.e. using the --fspath option and specify somewhere other than root
+               you need to place here the directory name e.g:
+                 
+                 if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+                 then --webpath = /mediawiki_sitemaps   *Note, no trailing / needed

I hope this helps.



$ svn diff generateSitemap.php 
Index: generateSitemap.php
===================================================================
--- generateSitemap.php (revision 38559)
+++ generateSitemap.php (working copy)
@@ -1,4 +1,4 @@
-<?php
+x<?php
 define( 'GS_MAIN', -2 );
 define( 'GS_TALK', -1 );
 /**
@@ -367,9 +367,11 @@
         * @return string
         */
        function indexEntry( $filename ) {
+         global $wgServer;
+         global $wgWebpath;
                return
                        "\t<sitemap>\n" .
-                       "\t\t<loc>$filename</loc>\n" .
+                       "\t\t<loc>$wgServer$wgWebpath/$filename</loc>\n" .
                        "\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
                        "\t</sitemap>\n";
        }
@@ -457,18 +459,30 @@
                server name detection may fail in command line scripts.
 
        --compress=[yes|no]     compress the sitemap files, default yes
+
+       --webpath=<dir>         If you are placing the sitemap files in a sub folder
+               i.e. using the --fspath option and specify somewhere other than root
+               you need to place here the directory name e.g:
+                 
+                 if -fspath = /var/www/httpdocs/mediawiki_sitemaps/ for example
+                 then --webpath = /mediawiki_sitemaps   *Note, no trailing / needed
+                 
 
 EOT;
        die( -1 );
 }
 
-$optionsWithArgs = array( 'fspath', 'server', 'compress' );
+$optionsWithArgs = array( 'fspath', 'server', 'compress', 'webpath' );
 require_once( dirname( __FILE__ ) . '/commandLine.inc' );
 
 if ( isset( $options['server'] ) ) {
        $wgServer = $options['server'];
 }
 
+if ( isset( $options['webpath'] ) ) {
+       $wgWebpath = $options['webpath'];
+}
+
 $gs = new GenerateSitemap( @$options['fspath'], @$options['compress'] !== 'no' );
 $gs->main();


 
Comment 15 The GadgetDoctor 2008-08-27 20:50:38 UTC
Created attachment 5218
Proposed replacement for generateSitemap.php
Comment 16 Szőts Ákos 2008-11-03 10:42:04 UTC
Here is a very short patch for this problem:

--- generateSitemap.php 2008-11-03 11:37:53.000000000 +0100
+++ /srv/www/htdocs/mw/esl/maintenance/generateSitemap.php      2008-11-03 11:40:40.000000000 +0100
@@ -392,9 +392,13 @@
         * @return string
         */
        function indexEntry( $filename ) {
+               global $wgServer;
+               $title = Title::makeTitle( '', '' );
+               $location = $wgServer . $title->getLocalUrl() . $filename;
+
                return
                        "\t<sitemap>\n" .
-                       "\t\t<loc>$filename</loc>\n" .
+                       "\t\t<loc>$location</loc>\n" .
                        "\t\t<lastmod>{$this->timestamp}</lastmod>\n" .
                        "\t</sitemap>\n";
        }
Comment 17 DaSch 2008-12-21 12:34:43 UTC
*** Bug 14397 has been marked as a duplicate of this bug. ***
Comment 18 ZRHwiki 2009-07-09 23:13:41 UTC
Any news on this issue?
Comment 19 Dan Jacobson 2009-07-10 19:36:38 UTC
I'll stick the current workaround Makefile I'm "forced to use" in the URL box above.

Also noting inconsistent use of --server in maintenance scripts, just in case one day somebody
wants to unify them:
maintenance/generateSitemap.php:481:	--server=<server>	The protocol and server name to use in URLs, e.g.
maintenance/dumpBackup.php:87:  --server=h  Force reading from MySQL server h
maintenance/dumpTextPass.php:518:  --server=h  Force reading from MySQL server h
Comment 20 Dan Jacobson 2009-07-12 19:15:12 UTC
See also bug 19593.
Comment 21 ZRHwiki 2010-03-12 21:29:56 UTC
The problem seems pretty easy to resolve, i.e. just add the full path (as most MediaWiki users have to do themselves if they want to use sitemaps with Google without an error message). What's causing the apparent delay in resolving this bug?
Comment 22 Thorsten Glaser 2010-06-25 07:22:33 UTC
Please consider MediaWiki farms like this:

https://fusionforge.org/plugins/mediawiki/wiki/fusionforge/index.php/Main_Page
Comment 23 Ilmari Karonen 2010-11-23 19:29:34 UTC
Fixed in r77176.  I took a KISS approach to fixing this issue and simply added a new --urlpath parameter, which can be used to specify the URL path corresponding to --fspath.  For most installations, this will probably equal or begin with the server name, but I figured the minor redundancy should be worth the flexibility and simplicity.
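
As a rough illustration of the idea, and not the code committed in r77176, an index entry built with a caller-supplied URL prefix might look like the Python sketch below. The function name, the trailing-slash handling, and the example values are all assumptions.

```python
from xml.sax.saxutils import escape


def index_entry(filename, timestamp, urlpath=""):
    """Build one <sitemap> element, prefixing the file name with urlpath if given."""
    # Assumption: a missing trailing slash on urlpath is added automatically.
    if urlpath and not urlpath.endswith("/"):
        urlpath += "/"
    loc = escape(urlpath + filename)  # escape &, < and > for XML
    return (
        "\t<sitemap>\n"
        f"\t\t<loc>{loc}</loc>\n"
        f"\t\t<lastmod>{timestamp}</lastmod>\n"
        "\t</sitemap>\n"
    )


print(index_entry("sitemap-wiki-NS_0-0.xml.gz", "2010-11-23T19:29:34Z",
                  "http://www.example.com/sitemaps"))
```

With an empty `urlpath` the sketch emits the bare file name, i.e. the behaviour this bug was filed about.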
