Last modified: 2014-11-17 10:36:45 UTC
On every access to Wikidata we need to normalize the page title supplied by the client, including its namespace. This must be done both when querying Wikidata and when storing. Also consider the constraints of the secondary storage system and their rationales.
The normalization done in the API during lookup should probably be reported somehow. Also note that the present system has both normalization and redirects. Normalization is always done, while redirects are something the normal API must be told to follow.

Take for example the URL for "_noreg" at no.wp (http://no.wikipedia.org/w/api.php?action=query&prop=info&titles=_noreg&format=jsonfm&redirects) and what it reports back. It reports both a normalization of "_noreg" into "Noreg" and then a redirect from "Noreg" to "Norge" (the first form is Nynorsk, the last form is Bokmål). Actual output is:

    {
      "query": {
        "normalized": [ { "from": "_noreg", "to": "Noreg" } ],
        "redirects": [ { "from": "Noreg", "to": "Norge" } ],
        "pages": {
          "728": { "pageid": 728, "ns": 0, "title": "Norge", "touched": "2012-05-03T00:05:03Z", "lastrevid": 10449638, "counter": "", "length": 59329 }
        }
      }
    }

Compare this with "WP:T" at no.wp (http://no.wikipedia.org/w/api.php?action=query&prop=info&titles=WP:T&format=jsonfm&redirects), which is a normal redirect. Actual output is:

    {
      "query": {
        "redirects": [ { "from": "WP:T", "to": "Wikipedia:Tinget" } ],
        "pages": {
          "1230": { "pageid": 1230, "ns": 4, "title": "Wikipedia:Tinget", "touched": "2012-05-03T21:15:50Z", "lastrevid": 10454551, "counter": "", "length": 126570 }
        }
      }
    }

Then consider the same lookup at en.wp (http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=WP:T&format=jsonfm&redirects), which involves a namespace alias. Actual output is:

    {
      "query": {
        "normalized": [ { "from": "WP:T", "to": "Wikipedia:T" } ],
        "redirects": [ { "from": "Wikipedia:T", "to": "Wikipedia:Tutorial" } ],
        "pages": {
          "497846": { "pageid": 497846, "ns": 4, "title": "Wikipedia:Tutorial", "touched": "2012-05-02T14:28:59Z", "lastrevid": 484946903, "counter": "", "length": 4224 }
        }
      }
    }

If the requested site-title pair is not equal to the site-title pair in the found item, the normalized form should be reported in the API.
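To make the two steps in the responses quoted above concrete, here is a minimal Python sketch that walks a response and records what happened to the requested title. The function name trace_title and the trimmed SAMPLE payload are only illustrative, not part of any existing API, and the sketch assumes the response has the same shape as the en.wp output quoted above.

```python
import json

# Trimmed copy of the en.wp response for "WP:T" quoted above.
SAMPLE = json.loads("""
{"query": {
  "normalized": [{"from": "WP:T", "to": "Wikipedia:T"}],
  "redirects": [{"from": "Wikipedia:T", "to": "Wikipedia:Tutorial"}],
  "pages": {"497846": {"pageid": 497846, "ns": 4, "title": "Wikipedia:Tutorial"}}
}}
""")


def trace_title(response, requested):
    """Return (final_title, steps) for a requested title.

    Hypothetical helper: 'steps' lists each normalization or redirect
    the API reported, in the order it was applied.
    """
    query = response.get("query", {})
    normalized = {e["from"]: e["to"] for e in query.get("normalized", [])}
    redirects = {e["from"]: e["to"] for e in query.get("redirects", [])}

    steps = []
    title = requested

    # Normalization is always applied first, at most once per title.
    if title in normalized:
        steps.append(("normalized", title, normalized[title]))
        title = normalized[title]

    # Redirects only appear when the request asked for them.
    seen = set()  # guards against a redirect loop in the reported data
    while title in redirects and title not in seen:
        seen.add(title)
        steps.append(("redirect", title, redirects[title]))
        title = redirects[title]

    return title, steps


if __name__ == "__main__":
    final, steps = trace_title(SAMPLE, "WP:T")
    print(final)
    for kind, src, dst in steps:
        print(kind, src, "->", dst)
```

Keeping the steps separate, rather than returning only the final title, lets a caller report the normalized form without necessarily following the redirect.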
Not all wikis have the same set of namespace aliases, and the same namespace can have different names in different languages. There are also "canonical names" for the namespaces that are reported and available to the browser. For example, a page in the "Bruker" ("User") namespace at no.wp will have the following definitions:

    mw.config.set({
      "wgCanonicalNamespace":"User",
      "wgCanonicalSpecialPageName":false,
      "wgNamespaceNumber":2,
      "wgPageName":"Bruker:John_Erling_Blad_(WMDE)",
      "wgTitle":"John Erling Blad (WMDE)",
      "wgCurRevisionId":0,
      "wgArticleId":0,
      "wgIsArticle":true,
      "wgAction":"view",
      "wgUserName":"John Erling Blad (WMDE)",
      ...
      "wgRelevantPageName":"Bruker:John_Erling_Blad_(WMDE)",
      ...
    });

There are several values in there that can be interesting, but those are the ones I usually use. Note also what happens if you try to follow "WP:T" at en.wp (en.wikipedia.org/wiki/WP:T):

    mw.config.set({
      "wgCanonicalNamespace":"Project",
      "wgCanonicalSpecialPageName":false,
      "wgNamespaceNumber":4,
      "wgPageName":"Wikipedia:Tutorial",
      "wgTitle":"Tutorial",
      ...
      "wgRedirectedFrom":"Wikipedia:T",
      ...
    });

In this case you will have the source of the redirect available. You will not, however, have the prenormalized form. If a page manipulates the title through {{DISPLAYTITLE}}, like "iPad" on en.wp (http://en.wikipedia.org/wiki/IPad), the wgTitle is still the correct one for the page (note that wgPageTitle has an "invisible" namespace):

    mw.config.set({
      "wgCanonicalNamespace":"",
      "wgCanonicalSpecialPageName":false,
      "wgNamespaceNumber":0,
      "wgPageName":"IPad",
      "wgTitle":"IPad",
      ...
    });

The short answer seems to be to use "wgCanonicalNamespace" and "wgTitle" to form a new "wgCanonicalPageName", and to use that as the page title for later requests from the client and browsers. I believe this will work even if there is no common canonical name among the wikis, but I have not checked. The important thing is to avoid "wgPageName" as it is now.
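The proposal to combine "wgCanonicalNamespace" and "wgTitle" could look roughly like the Python sketch below. It is only an illustration: canonical_page_name is a hypothetical helper, "wgCanonicalPageName" is the new name suggested in this comment rather than an existing config variable, and the input dicts are trimmed copies of the mw.config values quoted above.

```python
def canonical_page_name(config):
    """Join the canonical namespace and the title into one page name.

    Hypothetical helper: 'config' is a dict holding the mw.config values
    'wgCanonicalNamespace' and 'wgTitle'. The main namespace has an empty
    canonical name, so no prefix is added there.
    """
    namespace = config.get("wgCanonicalNamespace", "")
    title = config["wgTitle"]
    return f"{namespace}:{title}" if namespace else title


if __name__ == "__main__":
    user_page = {
        "wgCanonicalNamespace": "User",
        "wgPageName": "Bruker:John_Erling_Blad_(WMDE)",
        "wgTitle": "John Erling Blad (WMDE)",
    }
    ipad = {"wgCanonicalNamespace": "", "wgPageName": "IPad", "wgTitle": "IPad"}

    print(canonical_page_name(user_page))
    print(canonical_page_name(ipad))
```

Note that the localized prefix "Bruker" never enters the result, which is the point of avoiding "wgPageName".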
An excellent analysis! I fully agree, except that I don't think wgCanonicalNamespace should be used, since: 1) if it is used, it will be displayed in link titles and on hover, so it will look uglier and might not even be understandable to all users; and 2) I'm not sure that custom namespaces exist in this form. Since we are contacting the local Wikipedia anyway when adding a link, in order to get the autocomplete list, we can request the canonical name at the same time with minimal overhead. Another possibility is to make a function similar to Title::newFromText() that is locale-aware and can normalize links in any locale (but then again, what to do with custom namespaces?)
I would suggest using the normalized name, not wgCanonicalNamespace + wgTitle: just what we get in query.normalized.to from the API. I am not sure about automatically resolving redirects. I guess we can leave this out for now and maybe consider it later. For now, this item means just "normalize".
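A rough Python sketch of the "normalize only" variant follows. It builds the query URL without the redirects parameter and maps each requested title to the "to" value of its query.normalized entry, falling back to the title itself when the API reported no normalization. The helper names build_query_url and normalized_titles are made up for illustration, and format=json is used here instead of the jsonfm seen in the URLs earlier in this thread, on the assumption that a client wants the machine-readable form.

```python
from urllib.parse import urlencode


def build_query_url(api_base, titles):
    """Build an action=query URL that does NOT ask the API to follow redirects."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),  # several titles are separated by "|"
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def normalized_titles(response, titles):
    """Map each requested title to its normalized form.

    Titles the API did not list under query.normalized are already in
    normalized form, so they map to themselves.
    """
    entries = response.get("query", {}).get("normalized", [])
    mapping = {e["from"]: e["to"] for e in entries}
    return {t: mapping.get(t, t) for t in titles}


if __name__ == "__main__":
    print(build_query_url("http://no.wikipedia.org/w/api.php", ["_noreg"]))
    response = {"query": {"normalized": [{"from": "_noreg", "to": "Noreg"}]}}
    print(normalized_titles(response, ["_noreg", "Norge"]))
```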
Picked in Sprint 7.
Verified in Wikidata demo time for sprint 8
Shouldn't https://wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Ren%C3%A9_Vautier work now then? Why isn't the underscore understood as a space?