Last modified: 2014-02-12 23:38:08 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T47282, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45282 - Normalize titles before lookup in the SiteLinksTable
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Normal normal with 1 vote
Target Milestone: ---
Assigned To: Wikidata bugs
Depends on:
Blocks:
Reported: 2013-02-22 13:54 UTC by Daniel Kinzler
Modified: 2014-02-12 23:38 UTC (History)
CC List: 6 users

Description Daniel Kinzler 2013-02-22 13:54:31 UTC
SiteLinkTable should apply lightweight normalization to page titles before storing the[m]. This would avoid issues with specifying titles with or without underscores in place of spaces, for example as parameters to API calls.

The following normalization should be applied:

* strip leading and trailing whitespace
* Unicode normalization
* converting underscores to spaces (currently, the items_per_site table uses spaces in the page titles, in violation of current practice elsewhere in the database schema)

The following normalization should not be applied:
* namespace normalization (this requires knowledge of the target wiki's config)
* first letter capitalization (requires knowledge about the target wiki's content language, but also about namespaces)
* redirect resolution (requires access to the target wiki's database)
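
The "cheap" normalization listed above could be sketched roughly as follows. This is a minimal Python illustration, not the Wikibase code: the function name `normalize_title_cheap` is hypothetical, and the choice of NFC as the Unicode normalization form is an assumption, since the report does not name a form. Underscores are converted before trimming so that leading and trailing underscores are removed as well.

```python
import unicodedata


def normalize_title_cheap(title: str) -> str:
    """Apply only normalization that needs no knowledge of the target wiki.

    Hypothetical helper; NFC is an assumed normalization form.
    """
    # Unicode normalization: composed and decomposed spellings compare equal.
    title = unicodedata.normalize("NFC", title)
    # Underscores become spaces, matching how items_per_site stores titles.
    title = title.replace("_", " ")
    # Strip leading and trailing whitespace (including former underscores).
    return title.strip()


# Deliberately NOT done here: namespace normalization, first-letter
# capitalization, and redirect resolution. All three depend on the
# target wiki's configuration or database.
if __name__ == "__main__":
    print(normalize_title_cheap("  New_York_City "))
```

With this sketch, "New_York_City" and " New York City " map to the same string, while "berlin" is left uncapitalized.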
Comment 1 jeblad 2013-02-25 10:52:21 UTC
..storing the? "Them" or something else?
Comment 2 jeblad 2013-02-25 10:58:16 UTC
I don't think this class is the correct place to do such a rewrite. This class should use whatever string is passed to it, or throw an error.
Comment 3 Daniel Kinzler 2013-02-26 20:17:47 UTC
@jeblad: you are right, I also noticed that when poking at the issue yesterday.

The problem seems to be that SiteLinkTable's interface is a bit asymmetric: it stores information from SiteLink objects, but for queries, it takes a site ID and page title as a string. That is convenient, but introduces inconsistencies.

Perhaps the necessary normalization should be done in the SiteLink class, and we should use SiteLink instances for querying the SiteLinkTable. But even the SiteLink class doesn't have the necessary information (namely, whether the target is a MediaWiki instance). That would have to be done in the Site object.

So, this is my current take on the issue:

* Site::normalizePageName() should get an option for enabling/disabling expensive canonical normalization. This is a core change.
* SiteLinkTable should not take a site ID and page title as strings, but always operate on SiteLink instances.
* SiteLink should provide a way to create an instance with or without "expensive" normalization, and always apply "cheap" normalization.
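
To make the proposal above concrete, here is a rough Python sketch of the symmetric interface. All names (`cheap_normalize`, `SiteLink.create`, the `expensive` parameter, `SiteLinkTable.save`, `SiteLinkTable.get_item_id`) are hypothetical stand-ins rather than the actual Wikibase or MediaWiki core API, and the in-memory dictionary only imitates the database table. The `expensive` callback stands in for something like Site::normalizePageName() with canonical normalization enabled.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def cheap_normalize(title: str) -> str:
    # Assumed cheap steps: NFC, underscores to spaces, trim whitespace.
    return unicodedata.normalize("NFC", title).replace("_", " ").strip()


@dataclass(frozen=True)
class SiteLink:
    site_id: str
    page_name: str

    @classmethod
    def create(
        cls,
        site_id: str,
        page_name: str,
        expensive: Optional[Callable[[str], str]] = None,
    ) -> "SiteLink":
        # Cheap normalization is always applied.
        name = cheap_normalize(page_name)
        # Expensive normalization is opt-in; in practice it would ask the
        # target wiki, so the caller decides whether to pay for it.
        if expensive is not None:
            name = cheap_normalize(expensive(name))
        return cls(site_id, name)


class SiteLinkTable:
    """In-memory stand-in that stores and queries by SiteLink only."""

    def __init__(self) -> None:
        self._rows: Dict[SiteLink, str] = {}

    def save(self, link: SiteLink, item_id: str) -> None:
        self._rows[link] = item_id

    def get_item_id(self, link: SiteLink) -> Optional[str]:
        return self._rows.get(link)


if __name__ == "__main__":
    table = SiteLinkTable()
    table.save(SiteLink.create("enwiki", "New_York_City "), "Q123")
    # The lookup goes through the same normalization as the store.
    print(table.get_item_id(SiteLink.create("enwiki", " New York City")))
```

Because both storing and querying go through `SiteLink.create`, the table itself never rewrites strings, which is in line with the objection in Comment 2.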
Comment 4 Gerrit Notification Bot 2013-05-15 21:29:56 UTC
Related URL: https://gerrit.wikimedia.org/r/63967 (Gerrit Change I86c72ac3a9da52dfd3ee1aca86b247c59d3098ce)
Comment 5 Lydia Pintscher 2013-10-08 13:43:56 UTC
Do we still need to do this or can this be closed?
Comment 6 Daniel Kinzler 2013-10-29 16:05:11 UTC
Unicode normalization is still not applied consistently (this is relevant not only for the SiteLink table).

Perhaps we could file that as a separate bug and close this one.
