Last modified: 2014-03-02 19:13:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T64109, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 62109 - Add canonical namespaces and aliases to XML dumps


Summary:	Add canonical namespaces and aliases to XML dumps

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Export/Import
Version:	1.23.0
Hardware:	All All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	easy

Depends on:
Blocks:	62111

Reported:	2014-03-01 19:43 UTC by Aaron Halfaker
Modified:	2014-03-02 19:13 UTC (History)
CC List:	3 users (show)

See Also:	40010
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Aaron Halfaker 2014-03-01 19:43:25 UTC

The XML dump contains a siteinfo header with a <namespaces> tag that is very useful for processing the text in the dumps.  It looks something like this:

<mediawiki ...snip... >
  <siteinfo>
    <sitename>Վիքիպեդիա</sitename>
    <base>http://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB</base>
    <generator>MediaWiki 1.23wmf15</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Մեդիա</namespace>
      <namespace key="-1" case="first-letter">Սպասարկող</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Քննարկում</namespace>
      <namespace key="2" case="first-letter">Մասնակից</namespace>

  ...snip...

    </namespaces>
  </siteinfo>

Regretfully, this header does not include canonical namespace names or namespace aliases.  However, an API request for "meta=siteinfo" does include these bits.  For example, the call for http://hy.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases returns the following XML:

<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>

  ...snip...

    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>

The XML dump should be updated to include this important metadata about namespaces.
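
As an illustration of what dump consumers currently have to fetch online, here is a minimal Python sketch that collects local names, canonical names, and aliases from the API response quoted above. The names `API_XML` and `namespace_names` are made up for this example and are not part of MediaWiki; the sample data is abbreviated to the entries shown in this report.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the API response quoted in the description (hywiki).
API_XML = """<api>
  <query>
    <namespaces>
      <ns id="-2" case="first-letter" canonical="Media" xml:space="preserve">Մեդիա</ns>
      <ns id="-1" case="first-letter" canonical="Special" xml:space="preserve">Սպասարկող</ns>
      <ns id="0" case="first-letter" content="" xml:space="preserve" />
      <ns id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Քննարկում</ns>
      <ns id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">Մասնակից</ns>
    </namespaces>
    <namespacealiases>
      <ns id="6" xml:space="preserve">Image</ns>
      <ns id="7" xml:space="preserve">Image talk</ns>
    </namespacealiases>
  </query>
</api>"""


def namespace_names(api_xml):
    """Map every namespace name (local, canonical, or alias) to its numeric id."""
    root = ET.fromstring(api_xml)
    names = {}
    # <ns> elements appear under both <namespaces> and <namespacealiases>.
    for ns in root.iter("ns"):
        ns_id = int(ns.get("id"))
        # The main namespace (id 0) has no text and no canonical name, so it is skipped.
        for name in (ns.text, ns.get("canonical")):
            if name:
                names[name] = ns_id
    return names


names = namespace_names(API_XML)
```

With this lookup, both "Talk" and "Քննարկում" map to namespace 1, and "Image" maps to namespace 6. None of the canonical names or aliases could be recovered from the dump's <siteinfo> header alone.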

Comment 1 Jesús Martínez Novo (Ciencia Al Poder) 2014-03-01 19:51:21 UTC

What would be the use case of having this information in the dump?

Comment 2 MZMcBride 2014-03-01 20:22:54 UTC

(In reply to Jesús Martínez Novo (Ciencia Al Poder) from comment #1)
> What would be the use case of having this information in the dump?

As I understand it, the XML dumps are targeted for offline use.

(In reply to Aaron Halfaker from comment #0)
> Regretfully, this header does not include canonical namespace names or
> namespace aliases.  However, an API request for "meta=siteinfo" does include
> these bits.

This sounds as though people trying to re-use the dumps need to go online to get this information. I think this is a perfectly reasonable enhancement request.

I'm marking this ticket with the "easy" keyword because it shouldn't be very difficult to add this additional information to the XML dumps. The most challenging part here is figuring out whether it's the PHP or the Python maintenance scripts that generate these particular dumps. The actual output logic can probably be cribbed from the MediaWiki API.

Comment 3 Aaron Halfaker 2014-03-01 21:25:16 UTC

Re. use case,

One common activity when processing wiki dumps is to extract historical link information -- something that can't be done with pagelinks.  Let's say I'm processing an enwiki dump and I encounter the following link:

[[WP:Foo]]

Without knowing that "WP" is an alias of ns=4 ("Project"/"Wikipedia") I'd have to assume that "WP:Foo" is the title of an ns=0 article.  

This is a problem for canonical namespace names too.  The following link would reference the same page:

[[Project:Foo]]
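
The link-resolution problem described in this comment can be sketched in Python as follows. This is only an illustration: `split_title` and `ENWIKI_NAMES` are hypothetical names, the table holds only the ns=4 names mentioned above, and the sketch assumes namespace prefixes are matched case-insensitively with underscores treated as spaces.

```python
# Hypothetical lookup table (lowercased keys) with the ns=4 names from this comment.
ENWIKI_NAMES = {"wikipedia": 4, "project": 4, "wp": 4}


def split_title(title, names):
    """Return (namespace_id, page_name) for a link target such as "WP:Foo"."""
    prefix, sep, rest = title.partition(":")
    # Normalize the prefix the same way the lookup keys are normalized.
    key = prefix.strip().replace("_", " ").lower()
    if sep and key in names:
        return names[key], rest.strip()
    # Unknown or missing prefix: treat the whole string as a main-namespace title.
    return 0, title.strip()


resolved = [split_title(t, ENWIKI_NAMES) for t in ("WP:Foo", "Project:Foo", "Foo")]
```

If the alias and canonical entries are missing from the table, "WP:Foo" and "Project:Foo" both fall through to the last branch and are misread as ns=0 titles. That is the failure described above. Interwiki prefixes are not handled by this sketch and would need a separate lookup.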

Comment 4 Jesús Martínez Novo (Ciencia Al Poder) 2014-03-02 19:13:04 UTC

What processing are you talking about? Do you have any script that handles the dump, other than importDump.php?

And what about interwiki links? Would you assume that [[commons:Foo]] would also be a page in the main namespace?
