Last modified: 2011-04-17 12:58:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T30541, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 28541 - api can't output binary data, like icu sortkeys
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: API
Version: 1.18.x
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Roan Kattouw
URL: http://translatewiki.net/w/api.php?ac...
Reported: 2011-04-14 19:49 UTC by Bawolff (Brian Wolff)
Modified: 2011-04-17 12:58 UTC (History)

Description Bawolff (Brian Wolff) 2011-04-14 19:49:20 UTC
When doing an API query that could return binary data (ICU sortkeys, for example), if the binary data isn't valid UTF-8 (which will happen most of the time), the API outputs an empty string instead.

For example:
http://translatewiki.net/w/api.php?action=query&list=categorymembers&cmprop=sortkey&cmtitle=Category:User%20en&cmlimit=1

Expected output is a big long binary string.

This is rather confusing when trying to debug stuff, although arguably much of the time this behaviour does kind of make sense. Anyways, caused by ApiResult::cleanUpUTF8.
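As an illustration of the symptom (not MediaWiki's actual code), here is a minimal Python sketch. The function clean_up_utf8 is a hypothetical stand-in that returns an empty string for anything that is not valid UTF-8, and the binary key bytes are made up rather than real ICU output.

```python
def clean_up_utf8(raw: bytes) -> str:
    """Hypothetical stand-in: return the text if valid UTF-8, else ''."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Mirrors the reported symptom: invalid input collapses to nothing.
        return ""


# A human-readable sort key is ordinary text and survives.
text_key = "User en".encode("utf-8")

# Made-up bytes standing in for a binary collation key; 0xDC starts a
# two-byte sequence but is followed by 0x08, so this is not valid UTF-8.
binary_key = bytes([0x4F, 0x45, 0x01, 0xDC, 0x08, 0xFF])

print(repr(clean_up_utf8(text_key)))
print(repr(clean_up_utf8(binary_key)))
```

The text key comes back unchanged, while the binary key comes back as an empty string, which matches what the query above shows.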
Comment 1 Bawolff (Brian Wolff) 2011-04-15 18:56:22 UTC
>This is rather confusing when trying to debug stuff, although arguably much of
>the time this behaviour does kind of make sense. Anyways, caused by
>ApiResult::cleanUpUTF8.

For reference, this was partially caused by a bug in UtfNormal::cleanUp when the intl extension was enabled (it returned false instead of the fixed string). That's fixed in r86130. The larger issue of the API normalizing binary data (which reminds me, are you even allowed to put things like nulls into entity references in an XML document?) is still there.
Comment 2 Brion Vibber 2011-04-15 19:02:44 UTC
If you need to output binary data in a structured text tree, it should be as hex strings (as with SHA-1 hashes of image files) or perhaps occasionally as base64.

It sounds like something is overriding the category sort key (a manually specified piece of text that replaces the page's own title for sorting purposes in category lists, and which defaults to the original title if not specified) with a normalized binary sort key that's generated from it and used at the low level for sorting.

That should probably not be returned as cmsortkey (a well-known text field), but as an additional parameter which is specifically a post-processed binary key.

Probably whatever's changed the sort key is doing it directly in the database instead of adding a new field?
Comment 3 Brion Vibber 2011-04-15 19:09:09 UTC
(If it's an extension that overrides saving of cl_sortkey with a value generated from the original value passed in, then it could probably also override the API output of the sort key; converting the output into a hex string will double the length of the returned data, but should lead to consistent behavior of clients that attempt to sort things based on the key values. Be careful of any other attempted use of the sort key in this situation...)
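To make the trade-off concrete, here is a rough Python sketch using made-up byte strings in place of real sort keys. It checks that hex output is twice the input length and that sorting the hex strings gives the same order as sorting the raw bytes. Base64 is included only for comparison; for these sample keys its string order does not match the byte order.

```python
import base64

# Made-up byte strings standing in for binary sort keys.
keys = [bytes([0xFF, 0x01]), bytes([0x2A, 0x10]), bytes([0x2A, 0xFF, 0x01])]

hex_keys = [k.hex() for k in keys]
b64_keys = [base64.b64encode(k).decode("ascii") for k in keys]

# Every byte becomes two hex digits, so the output is twice as long.
hex_doubles = all(len(h) == 2 * len(k) for h, k in zip(hex_keys, keys))

# Lowercase hex compares digit by digit in the same order as the bytes,
# so clients that sort on the hex string get the byte-wise order.
hex_order_matches = sorted(keys) == [bytes.fromhex(h) for h in sorted(hex_keys)]

# The base64 alphabet is not in ASCII order, so string order can differ.
b64_order_matches = sorted(keys) == [base64.b64decode(b) for b in sorted(b64_keys)]

print(hex_keys, hex_doubles, hex_order_matches, b64_order_matches)
```

Base64 would be shorter (about a third longer than the input instead of double), but a client comparing the encoded strings would have to decode them first to sort correctly.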
Comment 4 Bawolff (Brian Wolff) 2011-04-15 19:13:41 UTC
Actually both can be output now. The actual sortkey as specified by humans is called sortkey_prefix (may or may not have the underscore in the API, can't remember). The question is, what are users of the API using the original sortkey for: to show what it's sorted under, or to sort items? One of the rationales for doing it as it is currently (binary data in the sortkey field) is that if people are using the info from the API to sort things, they would want to sort by comparing the binary field. So doing it this way somewhat preserves the semantics of the original field, in a sense. See bug 24650.

hex might be the way to go here.
Comment 5 Roan Kattouw 2011-04-17 12:58:52 UTC
Fixed in r86257 by hex-encoding the sortkeys.

I experimented with armoring binary data and managed to get it to output correctly in JSON, but XML explicitly forbids non-printable ASCII characters. Even escaping them as numeric character references is forbidden.
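The difference between the two formats can be reproduced with Python's standard library. This is only an illustrative sketch: the element name k and the sample control character are arbitrary, and the parser here is Python's, not the one any API client would necessarily use.

```python
import json
import xml.etree.ElementTree as ET

control = "\x01"  # an arbitrary non-printable control character

# JSON can carry it as a \u escape and it round-trips intact.
encoded = json.dumps(control)
json_ok = json.loads(encoded) == control


def xml_accepts(doc: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


raw_ok = xml_accepts("<k>\x01</k>")   # literal control character
ref_ok = xml_accepts("<k>&#1;</k>")   # same character as a numeric reference
hex_ok = xml_accepts("<k>01</k>")     # hex-encoded text is plain ASCII

print(encoded, json_ok, raw_ok, ref_ok, hex_ok)
```

Both XML variants carrying the control character are rejected as not well-formed, while the hex-encoded form parses like any other text.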

Since sending binary data for sortkey presentation wasn't gonna work, that also presented a problem for sortkeys used as part of the cmcontinue parameter. For that reason I decided to hex-encode all sortkeys (in list=categorymembers and prop=categories output, and in cmcontinue) and decode hex sortkeys in the cmcontinue back to binary when using them in the SQL query.
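A rough Python sketch of that round trip follows. The token layout used here (hex sortkey, a pipe, then a page ID) and the function names are assumptions made for illustration; they are not taken from r86257.

```python
from typing import Tuple


def encode_continue(sortkey: bytes, page_id: int) -> str:
    """Build a hypothetical continuation token: '<hex sortkey>|<page id>'."""
    return f"{sortkey.hex()}|{page_id}"


def decode_continue(token: str) -> Tuple[bytes, int]:
    """Turn the token back into the binary sortkey and page ID.

    Raises ValueError if the sortkey part is not valid hex or the
    page ID is not an integer.
    """
    hex_key, _, page_id = token.partition("|")
    return bytes.fromhex(hex_key), int(page_id)


# Made-up bytes standing in for a binary sortkey.
sortkey = bytes([0x2A, 0xFF, 0x01, 0x00])

token = encode_continue(sortkey, 12345)
decoded_key, decoded_id = decode_continue(token)

print(token, decoded_key == sortkey, decoded_id)
```

The token contains only ASCII, so it is safe in XML, JSON, and URLs, and the decoded bytes can then be passed to the database query as a bound parameter.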
