Last modified: 2014-03-02 19:13:04 UTC
The XML dump contains a siteinfo header with a <namespaces> tag that is very useful for processing the text in the dumps. It looks something like this:

<mediawiki ...snip... >
  <siteinfo>
    <sitename>Վիքիպեդիա</sitename>
    <base>http://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB</base>
    <generator>MediaWiki 1.23wmf15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Մեդիա</namespace>
      <namespace key="-1" case="first-letter">Սպասարկող</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Քննարկում</namespace>
      <namespace key="2" case="first-letter">Մասնակից</namespace>
      ...snip...
    </namespaces>
  </siteinfo>

Regretfully, this header does not include canonical namespace names or namespace aliases. However, an API request for "meta=siteinfo" does include these bits. For example, the call for

http://hy.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases

returns the following XML:

<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>
      ...snip...
    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>

The XML dump should be updated to include this important metadata about namespaces.
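To illustrate what a dump consumer could do with this metadata, here is a minimal sketch in Python that turns the API response above into a lookup table from namespace name to namespace id. The names `API_RESPONSE`, `build_namespace_map` and `NAMESPACE_MAP` are made up for this example, the sample is truncated to the entries quoted in this report, and lowercasing the names is a simplification of how MediaWiki matches namespace prefixes.

```python
import xml.etree.ElementTree as ET

# Sample copied from the API response quoted in this report
# (truncated to the entries shown there).
API_RESPONSE = """<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>
    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>"""


def build_namespace_map(api_xml):
    """Map every known namespace name (local, canonical, alias) to its id.

    Hypothetical helper: keys are lowercased so lookups can be
    case-insensitive, which is an assumption made for this sketch.
    """
    root = ET.fromstring(api_xml)
    name_to_id = {}
    # <ns> elements appear under both <namespaces> and <namespacealiases>.
    for ns in root.iter("ns"):
        ns_id = int(ns.get("id"))
        # The element text is the local name or the alias; the "canonical"
        # attribute, when present, holds the canonical English name.
        for name in (ns.text, ns.get("canonical")):
            if name:  # the main namespace (id 0) has no name at all
                name_to_id[name.lower()] = ns_id
    return name_to_id


NAMESPACE_MAP = build_namespace_map(API_RESPONSE)
```

With only the dump header available, the same table would contain the local names but neither "Talk" nor "Image talk".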
What would be the use case of having this information in the dump?
(In reply to Jesús Martínez Novo (Ciencia Al Poder) from comment #1)
> What would be the use case of having this information in the dump?

As I understand it, the XML dumps are targeted for offline use.

(In reply to Aaron Halfaker from comment #0)
> Regretfully, this header does not include canonical namespace names or
> namespace aliases. However, an API request for "meta=siteinfo" does include
> these bits.

This sounds as though people trying to re-use the dumps need to go online to get this information. I think this is a perfectly reasonable enhancement request.

I'm marking this ticket with the "easy" keyword because it shouldn't be very difficult to add this additional information to the XML dumps. The most challenging part here is figuring out whether it's the PHP or the Python maintenance scripts that generate these particular dumps. The actual output logic can probably be cribbed from the MediaWiki API.
Regarding the use case: one common activity when processing wiki dumps is to extract historical link information, something that can't be done with pagelinks. Let's say I'm processing an enwiki dump and I encounter the following link:

[[WP:Foo]]

Without knowing that "WP" is an alias of ns=4 ("Project"/"Wikipedia"), I'd have to assume that "WP:Foo" is the title of an ns=0 article. This is a problem for canonical namespace names too. The following link would reference the same page:

[[Project:Foo]]
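A rough sketch of the lookup I have in mind, in Python. The table `ENWIKI_NS` is hand-written from the example above (a real one would come from the siteinfo metadata), and `resolve_link` is a hypothetical helper, not part of any existing dump tool. It only looks at the prefix before the first colon and deliberately ignores interwiki prefixes, leading colons, and section anchors.

```python
# Hand-written for illustration: local name, canonical name and alias
# of ns=4 on enwiki, as described in the comment above.
ENWIKI_NS = {"wikipedia": 4, "project": 4, "wp": 4}


def resolve_link(target, name_to_id):
    """Split a link target into (namespace id, title).

    Falls back to the main namespace (0) with the full target as the
    title when the prefix is not a known namespace name.
    """
    prefix, sep, rest = target.partition(":")
    if sep:
        # Simplified normalization: trim, treat "_" as a space, lowercase.
        key = prefix.strip().replace("_", " ").lower()
        if key in name_to_id:
            return name_to_id[key], rest.strip()
    return 0, target.strip()


with_metadata = resolve_link("WP:Foo", ENWIKI_NS)
without_metadata = resolve_link("WP:Foo", {})
```

With the alias table, `with_metadata` is `(4, "Foo")`, and `[[Project:Foo]]` resolves to the same pair. With an empty table, which is all the current dump header allows for aliases, `without_metadata` is `(0, "WP:Foo")`.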
What processing are you talking about? Do you have any script that handles the dump, other than importDump.php? And what about interwiki links? Would you assume that [[commons:Foo]] would also be a page in the main namespace?