Last modified: 2011-11-25 00:11:07 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T34376, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 32376 - XML dump contains gender-specific namespaces that break search indexing of those namespaces
Status: RESOLVED FIXED
Product: Datasets
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: High major
Assigned To: Ariel T. Glenn
Duplicates: 32629
Depends on:
Blocks: 31697

Reported: 2011-11-12 12:23 UTC by Robert Stojnic
Modified: 2011-11-25 00:11 UTC (History)

Description Robert Stojnic 2011-11-12 12:23:24 UTC
Currently Lucene doesn't support the gender-specific namespaces, which appear in page titles in the XML dumps although they don't appear in the dump header. Could we have the XML dumps just use the canonical namespaces again, or add the non-canonical ones to the header?

Please note this completely breaks User namespace indexing, and makes user pages appear as main-namespace pages in search!!!
Comment 1 Roan Kattouw 2011-11-12 12:26:02 UTC
We should add them to the XML dumps.
Comment 2 Niklas Laxström 2011-11-12 12:29:02 UTC
What? Dumps don't have canonical namespace names or ids?
Comment 3 Roan Kattouw 2011-11-12 12:32:57 UTC
(In reply to comment #2)
> What? Dumps don't have canonical namespace names or ids?
There is a patch on bug 30513 to add namespace IDs. Right now people have to parse the namespace out of each title using the namespace map at the beginning of the dump, and that namespace map is missing gendered namespaces.
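
To illustrate the failure mode, here is a minimal sketch of title-based namespace resolution. The map contents and the function name `namespace_of` are assumptions for illustration, not the actual dump format or the search indexer's code; IDs 2 and 3 are MediaWiki's standard User and User talk namespace IDs.

```python
# Hypothetical namespace map, as a dump consumer might build it from the
# header. It lists only the canonical names, not the gendered variants.
HEADER_NAMESPACES = {
    "Benutzer": 2,
    "Benutzer Diskussion": 3,
}


def namespace_of(title, namespaces):
    """Return (namespace_id, page_name) by matching the title prefix."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix], rest
    # Unknown prefix: the whole title is treated as a main-namespace page.
    return 0, title


print(namespace_of("Benutzer Diskussion:MrsMyer", HEADER_NAMESPACES))
print(namespace_of("Benutzerin Diskussion:MrsMyer", HEADER_NAMESPACES))
```

With this map, the canonical title resolves to namespace 3, while the gendered title falls through to namespace 0 with the prefix left inside the page name, which matches the symptom of user pages showing up as main-namespace pages.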
Comment 4 Robert Stojnic 2011-11-12 12:34:09 UTC
Test case: try exporting User_talk:MrsMyer on de.wp, then look at the title of the exported page.
Comment 5 Diederik van Liere 2011-11-13 02:09:51 UTC
My German is a bit rusty, but I think this illustrates your point:

http://de.wikipedia.org/w/index.php?title=Benutzerin_Diskussion:MrsMyer&printable=yes

In the URL the title starts with Benutzerin_Diskussion, while the export page shows Benutzer Diskussion.
Comment 6 Diederik van Liere 2011-11-13 03:08:20 UTC
Rev. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/82029 and Rev. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/97461 (part of MW 1.18) introduced gender-sensitive namespaces.

The XML dump file contains both Benutzerin and Benutzer. Updating the <namespace> tag with both variants is probably the cleanest solution.
Comment 7 Roan Kattouw 2011-11-14 06:59:16 UTC
(In reply to comment #6)
> The XML dump file contains both Benutzerin and Benutzer. Updating the
> <namespace> tag with both variants is probably the cleanest solution.
While that's a good idea, your patch on bug 30513 for adding a namespace tag/field would also make this problem go away.
Comment 8 Ariel T. Glenn 2011-11-14 13:25:05 UTC
I think the right behavior is probably what the reporter suggested: use the canonical namespace names in the dump. I'm not opposed to including the variants in the siteinfo, along with the gender each one goes with, but that's secondary.
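
As a rough sketch of what "use the canonical namespace names" could mean for a title, the snippet below rewrites a gendered prefix to its canonical form. The alias table and the function name `canonical_title` are hypothetical and shown only for illustration; this is not the export code in MediaWiki.

```python
# Hypothetical alias table: gendered namespace name -> canonical name.
GENDER_ALIASES = {
    "Benutzerin": "Benutzer",
    "Benutzerin Diskussion": "Benutzer Diskussion",
}


def canonical_title(title, aliases=GENDER_ALIASES):
    """Replace a gendered namespace prefix with its canonical form."""
    prefix, sep, rest = title.partition(":")
    # URLs use underscores where titles use spaces.
    prefix = prefix.replace("_", " ")
    if sep and prefix in aliases:
        return aliases[prefix] + ":" + rest
    # No known gendered prefix: leave the title untouched.
    return title


print(canonical_title("Benutzerin_Diskussion:MrsMyer"))
```

A consumer that only knows the canonical names from the header can then resolve the rewritten title without needing the variants at all.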
Comment 9 Brion Vibber 2011-11-22 19:40:50 UTC
r103945 switches the export to canonical form.

I'd like to list the aliases and all, but we'll need to adjust the <siteinfo> format to make sure it doesn't explode on anything.
Comment 10 Brion Vibber 2011-11-22 19:41:12 UTC
Removed dep on bug 30513 -- that can remain independently open.
Comment 11 Robert Stojnic 2011-11-24 00:16:52 UTC
See comment:

https://www.mediawiki.org/wiki/Special:Code/MediaWiki/103945#c26471
Comment 12 Robert Stojnic 2011-11-24 11:51:59 UTC
Resolved in r104124
Comment 13 Robert Stojnic 2011-11-25 00:11:07 UTC
*** Bug 32629 has been marked as a duplicate of this bug. ***
