Last modified: 2009-01-12 14:11:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T18798, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 16798 - JSON encoding errors for characters outside the BMP
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: API
Version: 1.14.x
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Roan Kattouw
Keywords: patch, patch-need-review
Depends on:
Blocks:
Reported: 2008-12-26 06:09 UTC by Brad Jorsch
Modified: 2009-01-12 14:11 UTC

Attachments
Patch (6.41 KB, patch)
2008-12-26 16:53 UTC, Brad Jorsch

Description Brad Jorsch 2008-12-26 06:09:21 UTC
Consider the following query: http://localhost/w/api.php?action=query&format=xml&action=expandtemplates&text=%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz

It contains 6 characters: U+fffd, U+10000, U+f000f, U+007a, U+007a, and U+007a. In JSON encoding, they should be \ufffd\ud800\udc00\udb80\udc0fzzz (U+10000 and U+f000f lie outside the BMP, so each must be encoded as a UTF-16 surrogate pair).
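
For reference, the expected escapes can be reproduced with a minimal sketch. This is Python, not the MediaWiki PHP code, and the function name json_escape is an assumption made for illustration. It ignores quotes and control characters and only shows the surrogate-pair arithmetic.

```python
def json_escape(text):
    """Escape non-ASCII characters as \\uXXXX, using surrogate pairs above U+FFFF.

    Illustrative only: quotes, backslashes and control characters are not handled.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                       # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)           # BMP character: one escape
        else:
            v = cp - 0x10000                     # 20-bit offset into the supplementary planes
            high = 0xD800 | (v >> 10)            # top 10 bits -> high surrogate
            low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> low surrogate
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


sample = "\ufffd\U00010000\U000f000fzzz"         # the six characters from the query
print(json_escape(sample))                       # \ufffd\ud800\udc00\udb80\udc0fzzz
```

Each supplementary character yields two complete \uXXXX escapes, and both halves need their own \u prefix.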

If I change the format to jsonfm, the first three characters are instead encoded as \ufffd\ud800dc00\udb80dc0fzzz: the \u prefix of each low surrogate is missing, so the output cannot be decoded correctly. This should be relatively simple to fix, I think.

If I change the format to json, it's even worse: the first two characters are output correctly as \ufffd\ud800\udc00, but the output stops there. Apparently PHP's built-in json_encode silently mishandles anything above U+1ffff: U+20000-U+3ffff, U+80000-U+bffff, and U+100000-U+10ffff seem to be incorrectly encoded as U+10000-U+1ffff, while U+40000-U+7ffff and U+c0000-U+fffff seem to cause the silent truncation just mentioned. The only fix I can think of is to detect whether these characters are present and use the fallback code instead.
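
One way to detect such characters is to scan the UTF-8 bytes directly. The following is a rough Python sketch, not the code from the patch, and the name needs_fallback is hypothetical. It assumes the input is valid UTF-8. In valid UTF-8, code points above U+1ffff start with the byte F0 followed by A0-BF, or with a lead byte F1-F4.

```python
import re

# Matches the start of any UTF-8 sequence for a code point above U+1FFFF:
#   F0 A0..BF ...  -> U+20000..U+3FFFF
#   F1..F4 ...     -> U+40000..U+10FFFF
# (F0 90..9F covers U+10000..U+1FFFF, which the report says is encoded correctly.)
ABOVE_1FFFF = re.compile(rb"\xf0[\xa0-\xbf]|[\xf1-\xf4]")


def needs_fallback(utf8_bytes):
    """Return True if the (assumed valid) UTF-8 input has a code point above U+1FFFF."""
    return ABOVE_1FFFF.search(utf8_bytes) is not None


print(needs_fallback("\ufffd\U00010000zzz".encode("utf-8")))   # False
print(needs_fallback("\U000f000f".encode("utf-8")))            # True
```

Checking bytes rather than decoded characters keeps the test cheap for the common case where no such characters are present.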

I'll see about posting a patch later on.
Comment 1 Brad Jorsch 2008-12-26 16:53:12 UTC
Created attachment 5625
Patch

The PHP bug has been reported at http://bugs.php.net/bug.php?id=46944

This patch adjusts the fallback JSON encoder to be able to handle UTF-16 surrogate pairs, and removes some of the support for invalid UTF-8 encoded characters above U+10FFFF.

It also adds a check to see if the PHP built-in json_encode is affected by PHP bug 46944, and uses our fallback code if so.
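
The idea of probing the built-in encoder can be sketched as follows. This is a Python illustration under stated assumptions, not the check from the attached patch: encoder_is_broken and truncating_encode are hypothetical names, and truncating_encode merely imitates the truncation symptom described in the report.

```python
import json

PROBE = "\U000f000f"  # one of the characters from the original report


def encoder_is_broken(encode):
    """Return True if encode() fails to round-trip a character above U+1FFFF."""
    try:
        return json.loads(encode(PROBE)) != PROBE
    except ValueError:                 # malformed output also counts as broken
        return True


def truncating_encode(text):
    """Hypothetical stand-in for an affected encoder: stops at the first
    character above U+1FFFF, imitating the reported silent truncation."""
    kept = []
    for ch in text:
        if ord(ch) > 0x1FFFF:
            break
        kept.append(ch)
    return json.dumps("".join(kept))


print(encoder_is_broken(json.dumps))          # False
print(encoder_is_broken(truncating_encode))   # True
```

A caller would run the probe once and route all encoding through the fallback encoder whenever it returns True.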
Comment 2 Brad Jorsch 2008-12-30 02:38:22 UTC
Heh, wrong example URL in the original post. That should obviously be http://en.wikipedia.org/w/api.php?action=query&format=xml&action=expandtemplates&text=%EF%BF%BD%F0%90%80%80%F3%B0%80%8Fzzz
Comment 3 Roan Kattouw 2009-01-04 22:31:10 UTC
Will try to review this soon.
Comment 4 Chad H. 2009-01-06 14:32:37 UTC
On a side note, PHP reports this as being fixed now.
Comment 5 Roan Kattouw 2009-01-07 14:45:05 UTC
(In reply to comment #4)
> On a side note, PHP reports this as being fixed now.

That's nice, but it means that older versions of PHP still have broken JSON formatters. At a quick glance, the patch seems to accommodate that and only fall back to our own JSON formatter if PHP's is broken.
Comment 6 Roan Kattouw 2009-01-12 14:11:58 UTC
Slightly modified patch applied in r45674.
