Last modified: 2014-02-12 23:38:08 UTC
SiteLinkTable should apply lightweight normalization to page titles before storing the. This would avoid issues with specifying titles with or without spaces as parameters to API calls, etc.
The following normalization should be applied:
* stripping leading and trailing whitespace
* applying Unicode normalization
* converting underscores to spaces (currently, the items_per_site table uses spaces in the page titles, in violation of current practice elsewhere in the database schema)
The following normalization should not be applied:
* namespace normalization (this requires knowledge of the target wiki's config)
* first letter capitalization (requires knowledge about the target wiki's content language, but also about namespaces)
* redirect resolution (requires access to the target wiki's database)
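For illustration, the three "cheap" steps listed above could be sketched roughly as follows. This is Python rather than the PHP the extension is written in, the function name `normalize_title_cheap` is hypothetical, and the choice of NFC as the Unicode normalization form is an assumption (the report does not name a form). Underscores are converted before stripping so that leading or trailing underscores are treated like whitespace, which is also a design choice of this sketch.

```python
import unicodedata


def normalize_title_cheap(title: str) -> str:
    """Hypothetical helper: apply only the site-independent normalization steps."""
    # Convert underscores to spaces first, so "_Foo_" is stripped like " Foo ".
    title = title.replace("_", " ")
    # Strip leading and trailing whitespace.
    title = title.strip()
    # Unicode normalization; NFC is an assumption of this sketch.
    return unicodedata.normalize("NFC", title)


# Deliberately NOT done here: namespace normalization, first letter
# capitalization, and redirect resolution, since all of these need
# knowledge of (or access to) the target wiki.
```

With this sketch, `normalize_title_cheap("  New_York_City ")` and `normalize_title_cheap("New York City")` yield the same string, while the case of the first letter is left untouched.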
..storing the? "Them" or something else?
I don't think this class is the correct place to do such a rewrite; this class should use whatever string is passed to it or throw an error.
@jeblad: you are right, I also noticed that when poking at the issue yesterday. The problem seems to be that SiteLinkTable's interface is a bit asymmetric: it stores information from SiteLink objects, but for queries, it takes a site ID and page title as strings. That is convenient, but introduces inconsistencies. Perhaps the necessary normalization should be done in the SiteLink class, and we should use SiteLink instances for querying the SiteLinkTable. But even the SiteLink class doesn't have the necessary information (namely, whether the target is a MediaWiki instance). That would have to be done in the Site object.
So, this is my current take on the issue:
* Site::normalizePageName() should get an option for enabling/disabling expensive canonical normalization. This is a core change.
* SiteLinkTable should not take site ID and page title as strings, but always operate on SiteLink instances.
* SiteLink should provide a way to create an instance with or without "expensive" normalization, and apply "cheap" normalization always.
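To make the proposed split concrete, here is a rough Python sketch of the idea, not the actual Wikibase or MediaWiki code. All names (`cheap_normalize`, `SiteLink.create`, the `canonicalize` callback, the in-memory `SiteLinkTable`) are hypothetical, and NFC is again an assumed normalization form. The `canonicalize` callback stands in for the optional "expensive" normalization that only the Site object could perform.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def cheap_normalize(title: str) -> str:
    # Site-independent steps only: underscores, whitespace, Unicode (NFC assumed).
    return unicodedata.normalize("NFC", title.replace("_", " ").strip())


@dataclass(frozen=True)
class SiteLink:
    site_id: str
    page: str

    @classmethod
    def create(
        cls,
        site_id: str,
        title: str,
        canonicalize: Optional[Callable[[str, str], str]] = None,
    ) -> "SiteLink":
        # Cheap normalization is always applied.
        page = cheap_normalize(title)
        # Expensive normalization is opt-in and supplied by the caller.
        if canonicalize is not None:
            page = cheap_normalize(canonicalize(site_id, page))
        return cls(site_id, page)


class SiteLinkTable:
    """In-memory stand-in: stores and queries by SiteLink, never by raw strings."""

    def __init__(self) -> None:
        self._items: Dict[SiteLink, str] = {}

    def save(self, link: SiteLink, item_id: str) -> None:
        self._items[link] = item_id

    def get_item_id(self, link: SiteLink) -> Optional[str]:
        return self._items.get(link)
```

Because both storing and querying go through `SiteLink.create()`, a link saved as `New_York_City` is found again when queried as ` New York City `, without the table itself rewriting any strings.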
Related URL: https://gerrit.wikimedia.org/r/63967 (Gerrit Change I86c72ac3a9da52dfd3ee1aca86b247c59d3098ce)
Do we still need to do this or can this be closed?
Unicode normalization is still not applied consistently (this is relevant not only for the SiteLink table). Perhaps we could file that as a separate bug and close this one.