Last modified: 2014-11-17 00:02:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T53326, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 51326 - Chillu letters in Wikidata API
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: Wikidata bugs
Whiteboard: aklapper-moreinfo
Keywords: testme
Depends on:
Blocks: repoapi
Reported: 2013-07-14 17:52 UTC by emaus
Modified: 2014-11-17 00:02 UTC (History)

Description emaus 2013-07-14 17:52:15 UTC
The Wikibase extension API behaves differently from the general MediaWiki API when handling invisible characters such as \u200d (ZERO WIDTH JOINER). The general MediaWiki API removes invisible characters from the titles in a query and returns results for the titles without them. For example, see [1].

The Wikibase extension API doesn't remove these characters, so it returns different results for queries with and without them. For example, see [2] and [3].

It would be useful to have a single policy on these characters for both the general API and its extension.

[1] http://ml.wikipedia.org/w/api.php?action=query&format=json&prop=langlinks&lllimit=500&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D
[2] http://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&sites=mlwiki&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D
[3] http://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&sites=mlwiki&titles=%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC
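
To make the difference between the titles in [2] and [3] concrete, here is a minimal sketch that decodes both percent-encoded titles and lists their code points. The helper name `code_points` is purely illustrative and is not part of either API.

```python
from urllib.parse import unquote

# Percent-encoded titles copied from links [2] and [3] above.
TITLE_WITH_ZWJ = "%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D"
TITLE_ATOMIC = "%E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC"


def code_points(encoded_title):
    """Decode a percent-encoded UTF-8 title into a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unquote(encoded_title)]


print(code_points(TITLE_WITH_ZWJ))
print(code_points(TITLE_ATOMIC))
```

The first title ends in U+0D30, U+0D4D, U+200D (RA, virama, ZWJ), while the second ends in the single code point U+0D7C.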
Comment 1 emaus 2013-07-28 08:25:06 UTC
Correction of bug description.

The problem is caused not by invisible symbols as such, but by a special type of letter in the Malayalam script called "chillu". See [1].

The Wikipedia API can convert them to a normal form and the Wikidata API cannot.

[1] https://en.wikipedia.org/wiki/Malayalam_alphabet#Chillus_in_Unicode
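
As a rough illustration of what "converting to a normal form" could look like, the sketch below replaces legacy chillu sequences (consonant + virama + ZWJ) with the atomic chillu letters. The function name `fold_legacy_chillus` and the mapping table are assumptions made for this example, based on the Unicode 5.1 correspondences. This is not the code that MediaWiki or Wikibase actually runs.

```python
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Hypothetical mapping table: legacy sequence -> atomic chillu letter.
LEGACY_TO_ATOMIC = {
    "\u0D23" + VIRAMA + ZWJ: "\u0D7A",  # NNA + virama + ZWJ -> CHILLU NN
    "\u0D28" + VIRAMA + ZWJ: "\u0D7B",  # NA  + virama + ZWJ -> CHILLU N
    "\u0D30" + VIRAMA + ZWJ: "\u0D7C",  # RA  + virama + ZWJ -> CHILLU RR
    "\u0D32" + VIRAMA + ZWJ: "\u0D7D",  # LA  + virama + ZWJ -> CHILLU L
    "\u0D33" + VIRAMA + ZWJ: "\u0D7E",  # LLA + virama + ZWJ -> CHILLU LL
}


def fold_legacy_chillus(title):
    """Return the title with legacy chillu sequences replaced by atomic letters."""
    for legacy, atomic in LEGACY_TO_ATOMIC.items():
        title = title.replace(legacy, atomic)
    return title


legacy_title = "\u0D28\u0D35\u0D02\u0D2C\u0D30\u0D4D\u200D"
atomic_title = "\u0D28\u0D35\u0D02\u0D2C\u0D7C"
print(fold_legacy_chillus(legacy_title) == atomic_title)
```

Applying such a fold to titles before lookup would make both spellings resolve to the same key. Whether folding towards the atomic form is the right direction is a design choice this sketch does not settle.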
Comment 2 jeblad 2013-07-28 08:53:53 UTC
I wonder if this is in fact ZWJ and ZWNJ... They are used in a really crappy way at the end of strings, and there they are stripped, if I remember correctly. They should probably be left there, at least for Malayalam.

The new encoding in Unicode does not have this problem; it only affects the old, faultily encoded strings from before 5.1 (and legacy systems... and possibly legacy fingers).
Comment 3 jeblad 2013-07-28 08:55:36 UTC
To clarify: it is not the new chillu letters but the ZWJ/ZWNJ used to encode those letters from before Unicode 5.1.
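
One consequence of having two encodings is that the standard Unicode normalization forms do not unify them, because the legacy ZWJ sequence and the atomic chillu letter are neither canonically nor compatibility equivalent. The following minimal sketch checks this with Python's `unicodedata` module. The variable and function names are illustrative only.

```python
import unicodedata

# The same word in the pre-5.1 encoding (ending in RA + virama + ZWJ)
# and in the atomic encoding (ending in U+0D7C CHILLU RR).
legacy = "\u0D28\u0D35\u0D02\u0D2C\u0D30\u0D4D\u200D"
atomic = "\u0D28\u0D35\u0D02\u0D2C\u0D7C"


def equal_under(form):
    """Report whether both spellings compare equal after the given normalization."""
    return unicodedata.normalize(form, legacy) == unicodedata.normalize(form, atomic)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equal_under(form))
```

Any software that wants to treat the two spellings as the same title therefore needs an explicit, script-specific mapping on top of generic normalization.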
Comment 4 Andre Klapper 2013-10-31 12:15:11 UTC
[replacing wikidata keyword by adding CC - see bug 56417]
Comment 5 Lydia Pintscher 2014-09-25 12:24:46 UTC
Is this still a problem? If so can you please provide links and cases where this causes problems?
Comment 6 Andre Klapper 2014-11-15 12:04:57 UTC
emaus: Is this still a problem? If so can you please provide links and cases where this causes problems?
Comment 7 emaus 2014-11-17 00:02:40 UTC
Andre Klapper: the second link of my initial post is still not working. The Wikipedia API processes both representations of chillu letters, whereas the Wikidata API processes only the one that doesn't contain ZWJ. In the examples above, the Wikidata API doesn't process the title %E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B4%B0%E0%B5%8D%E2%80%8D but does process %E0%B4%A8%E0%B4%B5%E0%B4%82%E0%B4%AC%E0%B5%BC, even though they represent the same word.
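
For anyone who wants to reproduce the comparison, the sketch below rebuilds the two wbgetentities request URLs from the raw titles. It only constructs the URLs and makes no network request. The helper name `wbgetentities_url` is hypothetical, and the parameters are taken from the links in the description.

```python
from urllib.parse import urlencode

API = "http://www.wikidata.org/w/api.php"


def wbgetentities_url(title, site="mlwiki"):
    """Build a wbgetentities URL that looks up an item by site and page title."""
    params = {
        "action": "wbgetentities",
        "format": "json",
        "props": "sitelinks",
        "sites": site,
        "titles": title,
    }
    return API + "?" + urlencode(params)


# Title with the legacy sequence RA + virama + ZWJ (the failing case).
print(wbgetentities_url("\u0D28\u0D35\u0D02\u0D2C\u0D30\u0D4D\u200D"))
# Title with the atomic letter U+0D7C (the working case).
print(wbgetentities_url("\u0D28\u0D35\u0D02\u0D2C\u0D7C"))
```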
