
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29849, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 27849 - API: add normalized info also for unicode normalization of titles
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: API (Other open bugs)
Version: 1.18.x
Hardware: All
OS: All
Importance: Normal priority, normal severity, 2 votes
Target Milestone: Future release
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/w/api.php?act...
Keywords: patch, patch-reviewed
Depends on:
Blocks:
Reported: 2011-03-04 16:15 UTC by P.Copp
Modified: 2014-09-23 23:13 UTC
CC: 13 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments:
Stashing my work-in-progress changes here, this is as good a place as any (5.40 KB, patch), 2011-05-05 16:09 UTC, Roan Kattouw
Patch with debugging code removed (5.36 KB, patch), 2011-05-05 16:12 UTC, Roan Kattouw

Description P.Copp 2011-03-04 16:15:55 UTC
When you give the API titles in non-NFC form (see URL), in the output they are silently normalized to NFC, which makes it difficult for the user to match the input with the output.

So there should be a 'normalized' entry for every title given in non-NFC form, like the ones for other title normalizations.
Comment 1 Brion Vibber 2011-03-05 00:22:10 UTC
IIRC, this normalization is applied on raw input in WebRequest, so the API code would only ever see the NFC form in the first place.

For it to know anything had changed, it would have to manually compare against $_GET and $_POST source variables.
Comment 2 Bryan Tong Minh 2011-03-05 14:28:30 UTC
We can add a function to WebRequest to return the original value instead of the normalized.
Comment 3 Sam Reed (reedy) 2011-03-05 14:40:56 UTC
Looks like we might need to cache it earlier...

It looks like whenever the normalization is called, it just overwrites them all...
Comment 4 Bryan Tong Minh 2011-03-05 14:42:33 UTC
The normalization is done in getGPCValue. Just add a boolean parameter $normalize.
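
A minimal sketch of that idea (illustrative names only, not MediaWiki's actual code):

	// Sketch: an optional $normalize flag lets callers fetch the raw,
	// un-normalized value exactly as the client sent it.
	function getGPCValue( $arr, $name, $default, $normalize = true ) {
		if ( !isset( $arr[$name] ) ) {
			return $default;
		}
		$value = $arr[$name];
		// UtfNormal::cleanUp() validates UTF-8 and applies NFC.
		return $normalize ? UtfNormal::cleanUp( $value ) : $value;
	}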
Comment 5 merl 2011-04-12 15:44:30 UTC
This should also be implemented for the special UTF-8 normalization that is done for ml and ar, e.g. http://ml.wikipedia.org/w/api.php?action=query&titles=അനിമേഷന്‍ . I think it's NFKC?
Comment 6 Roan Kattouw 2011-05-05 16:07:27 UTC
I took a stab at this this afternoon, but ran into an issue that I think makes this impossible to solve. I managed to delay Unicode normalization of the titles parameter until ApiPageSet::processTitlesArray(), and got ?action=query&titles=Ϋ&format=jsonfm to output a 'normalized' object. However, all data in the API result data structure is Unicode-normalized before being output, so you get stuff like: 

		"normalized": [
			{
				"from": "\u03ab",
				"to": "\u03ab"
			}
		],

where the "from" entry was originally "\u03a5\u0308" (the value specified in the query string) but got normalized prior to being output. This means from and to will always be equal (sans underscores to spaces and other existing normalizations), so this is useless.

I could armor the from value to protect it from Unicode normalization (I've written code for that before; I threw it out, but I should be able to reproduce it quickly), but that would allow the injection of arbitrary non-normalized data into the result, which may be invalid UTF-8, which would break e.g. XML parsers.

Is there a way I can do this only for cases where we want this? Is "\u03a5\u0308" a string that is valid UTF-8/Unicode but is nevertheless changed by Language::normalize()? Is this true for all cases where we want this feature? Is it possible to detect this somehow? CC'ing Brion because he probably knows more about this subject than I do.
Comment 7 Roan Kattouw 2011-05-05 16:09:39 UTC
Created attachment 8504
Stashing my work-in-progress changes here, this is as good a place as any
Comment 8 Roan Kattouw 2011-05-05 16:12:16 UTC
Created attachment 8505
Patch with debugging code removed
Comment 9 Bryan Tong Minh 2011-05-05 17:00:52 UTC
(In reply to comment #6)
> I could armor the from value to protect it from Unicode normalization (I've
> written code for that before; I threw it out but I should be able to reproduce
> it quickly), but that would allow the injection of arbitrary non-normalized
> data into the result, which may be invalid UTF-8, which would break e.g. XML
> parsers.
> 
Invalid UTF-8 is essentially random binary data and should thus be encoded, for example in base64.
Comment 10 Roan Kattouw 2011-05-05 17:21:54 UTC
(In reply to comment #9)
> Invalid UTF-8 is essentially random binary data and should thus be encoded, for
> example in base64.
Yeah. But I think it's fair not to offer this feature (normalization data) when the client gives us invalid UTF-8. Clients sending invalid UTF-8 deserve that punishment :)

However, I think there are edge cases where we normalize things that are already valid UTF-8 but not quite the way we like to see it. I'd like to be able to detect those cases.
Comment 11 Brion Vibber 2011-05-05 17:28:14 UTC
There are essentially two layers of work here, which our input validation merges into a single step:

1) invalid UTF-8 sequences must be found and replaced with valid placeholder characters

2) valid UTF-8 sequences are normalized to form C (e.g., replacing 'e' followed by 'combining acute accent' with the precombined character 'e with acute')

The invalid UTF-8 sequences found in part 1) **cannot be represented as strings in JSON or XML output**, because JSON and XML formats are based on Unicode text. Even if you wanted them, you can't just output them directly, nor can you use any escaping method to represent the original bad sequences.

Outputting the original bogus UTF-8 into the document would cause it to be unreadable, breaking the API.


Most likely, only 2) is of real interest: "\u03a5\u0308" is a perfectly valid Unicode string, and can be shipped around either with the JSON string escapes as above or as literals in any Unicode encoding for any JSON or XML document. We can perfectly well expect clients to send that string, and we should be able to represent it in output.

That we normalize strings into NFC for most internal purposes should generally be an implementation detail of our data formats and how we do title comparisons, so it's reasonable to expect clients that input a given non-NFC string to see the same thing on the other side when we report how we normalized the title string.

 
Only UTF-8 sequence validation should run at the $wgRequest boundary; stuff like the NFC conversion to avoid extra combining characters should really happen at processing and comparison boundaries like Title normalization.


So in short: don't worry about representing invalid UTF-8 byte sequences: either use a 'before' value that's been validated as UTF-8, or let the API output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on all output)
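
A rough sketch of those two layers as separate steps (assuming PHP's mbstring and intl extensions; variable names are illustrative):

	// Illustrative: the raw bytes the client sent, before any cleanup,
	// e.g. "\u03a5\u0308" in the query string.
	$from = isset( $_GET['title'] ) ? $_GET['title'] : '';
	if ( !mb_check_encoding( $from, 'UTF-8' ) ) {
		// Layer 1: invalid byte sequences can't be represented in JSON
		// or XML output, so don't report normalization info for them.
		$reportNormalization = false;
	} else {
		// Layer 2: valid UTF-8 that isn't already in form C is exactly
		// the case worth reporting as a 'normalized' entry.
		$to = Normalizer::normalize( $from, Normalizer::FORM_C );
		$reportNormalization = ( $to !== $from );
	}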
Comment 12 Roan Kattouw 2011-05-05 17:32:41 UTC
(In reply to comment #11)
> So in short: don't worry about representing invalid UTF-8 byte sequences:
> either use a 'before' value that's been validated as UTF-8, or let the API
> output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion on
> all output)
Yeah I wanted to do the former, but I have no idea how. What kind of function would I call to find out if something is valid UTF-8?

(This is mostly why I'm asking you for help, I know next to nothing about MediaWiki's Unicode facilities, and IIRC you wrote them ;) )
Comment 13 Brion Vibber 2011-05-05 17:43:15 UTC
Honestly, I don't think we have a good way to do that right now; UtfNormal combines it with the NFC stuff in quickIsNFCVerify(), and our fallbacks mean that a call to iconv() or mb_convert_encoding() might not actually apply anything...

Blech!
Comment 14 Bawolff (Brian Wolff) 2011-05-05 22:18:09 UTC
Can't you do something like
$string2 = $string;
UtfNormal::quickIsNFCVerify( $string2 ); // cleans $string2 in place (passed by reference)
$stringIsValidUTF8 = $string === $string2 ? true : false;

As far as I can tell, the quickIsNFCVerify doesn't seem to do anything with the string argument other than remove invalid sequences, and remove control characters (or replace them with the replacement character).
Comment 15 Brion Vibber 2011-05-05 22:35:30 UTC
(In reply to comment #14)
> Can't you do something like
> $string2 = $string;
> UtfNormal::quickIsNFCVerify( $string2 );
> $stringIsValidUTF8 = $string === $string2 ? true : false;
> 
> As far as I can tell, the quickIsNFCVerify doesn't seem to do anything with the
> string argument other than remove invalid sequences, and remove control
> characters (or replace them with the replacement character).

Hmmmmm, you know what, that should work just fine actually. :)

Downside: may be slower than UtfNormal::cleanUp() on some input texts on some systems, e.g. if NORMALIZE_ICU is on and that extension is in use. In other modes, that same code is already getting run if we're calling UtfNormal::cleanUp(), so it should be about the same speed for common cases if we're using either the default or the NORMALIZE_INTL mode (since it calls quickIsNFCVerify anyway to validate UTF-8 before doing the normalization call).
Comment 16 merl 2011-05-05 23:35:47 UTC
Just some statistics from my interwiki bot:
Each of my API requests normally contains 50 title values. The title values themselves are the result of other API requests, so they should all be valid UTF-8.
After reading the normalized and converted information, on average 17 of the requested titles cannot be found in the result on mlwiki, and on arwiki it's about 3-4 titles.

To work around this problem I re-request each missing title in its own request, so I know that the single element contained in the response must correspond to the requested title. In summary, my bot needs about 18 read requests for 50 titles on mlwiki instead of only one.
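
Roughly, the workaround looks like this (illustrative PHP, not my actual bot code):

	// Titles that were absent from the batched response (example value).
	$missingTitles = array( 'അനിമേഷന്‍' );
	$results = array();
	// Re-request each missing title on its own: with exactly one title per
	// request, the single page in the response must correspond to it.
	foreach ( $missingTitles as $title ) {
		$url = 'http://ml.wikipedia.org/w/api.php?action=query&format=json'
			. '&titles=' . urlencode( $title );
		$response = json_decode( file_get_contents( $url ), true );
		$pages = $response['query']['pages'];
		$results[$title] = reset( $pages );
	}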
Comment 17 Bawolff (Brian Wolff) 2011-05-06 00:48:34 UTC
Btw, if I recall correctly, we do some other normalization beyond NFC for ml and ar wikis (done only on wikis with those content languages, for performance reasons; so if you get an interwiki link title from an en wiki, it might have different normalization on ml or ar).
Comment 18 Bryan Tong Minh 2011-05-06 07:30:31 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > Can't you do something like
> > $string2 = $string;
> > UtfNormal::quickIsNFCVerify( $string2 );
> > $stringIsValidUTF8 = $string === $string2 ? true : false;
> > 
> > As far as I can tell, the quickIsNFCVerify doesn't seem to do anything with the
> > string argument other than remove invalid sequences, and remove control
> > characters (or replace them with the replacement character).
> 
> Hmmmmm, you know what, that should work just fine actually. :)
> 
> Downside: may be slower than UtfNormal::cleanUp() on some input texts on some
> systems, eg if NORMALIZE_ICU is on and using that extension. In other modes,
> that same code is already getting run if we're calling UtfNormal::cleanUp(), so
> it should be about the same speed for common cases if we're using either the
> default or the NORMALIZE_INTL mode (since it calls quickIsNFCVerify anyway to
> validate UTF-8 before doing the normalization call).

It's only done on 255-byte strings, so the slowdown should be negligible.
Comment 19 Mark A. Hershberger 2011-06-29 16:42:09 UTC
Bryan, Bawolff,

Could one of you take this and make the necessary changes to close the bug?
Comment 20 Mark A. Hershberger 2011-06-29 16:50:45 UTC
Leaving this as a deployment blocker since all that seems to be needed here is a SMOP (a simple matter of programming).
Comment 21 Bawolff (Brian Wolff) 2011-06-30 03:29:49 UTC
(In reply to comment #20)
> leaving this as a deployment blocker since all that seems to be needed here is
> a SMOP.

This could potentially lead to invalid output for XML formats (since certain characters like some control characters are not allowed to appear in XML files, even in entity form)
Comment 22 Bawolff (Brian Wolff) 2011-06-30 03:30:48 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > leaving this as a deployment blocker since all that seems to be needed here is
> > a SMOP.
> 
> This could potentially lead to invalid output for XML formats (since certain
> characters like some control characters are not allowed to appear in XML files,
> even in entity form)

Hmm, I suppose I should read what other people say before commenting ;)
Comment 23 Sumana Harihareswara 2011-11-09 21:49:29 UTC
+reviewed
Comment 24 Mark A. Hershberger 2012-01-23 15:46:21 UTC
switch to milestone, remove release tracking dep
Comment 25 Roan Kattouw 2012-09-05 21:40:50 UTC
Moved patch into Gerrit, see https://gerrit.wikimedia.org/r/#/c/22831/ . It doesn't actually work yet, because the unnormalized data needs to be armored to bypass ApiResult::cleanUpUTF8() somehow.
Comment 26 Mark A. Hershberger 2012-09-30 18:00:11 UTC
Punting to some point in the future.
Comment 27 Nemo 2013-03-07 16:07:26 UTC
(In reply to comment #0)
> When you give the API titles in non-NFC form (see URL), in the output they are
> silently normalized to NFC, which makes it difficult for the user to match the
> input with the output.

I suppose bug 45848 is another way to look at the problem? It's causing major problems with some API consumers like LiquidThreads; Nikerabbit proposed a solution there.
Comment 28 Gerrit Notification Bot 2014-02-02 20:00:02 UTC
Change 22831 abandoned by Hashar:
(bug 27849) Add normalized info for Unicode normalization of titles

Reason:
Cleaning up very old change. Feel free to resurrect if there is any interest in finishing this.

https://gerrit.wikimedia.org/r/22831
Comment 29 Andre Klapper 2014-03-11 13:55:27 UTC
High priority set for more than three years => reflecting reality by setting to Normal priority. Comment 27 and comment 28 describe the status here.
