Last modified: 2014-09-23 23:13:23 UTC
When you give the API titles in non-NFC form (see URL), they are silently normalized to NFC in the output, which makes it difficult for the user to match the input with the output. There should be a 'normalized' entry for every given title in non-NFC form, like the ones for other title normalizations.
IIRC, this normalization is applied on raw input in WebRequest, so the API code would only ever see the NFC form in the first place. For it to know anything had changed, it would have to manually compare against $_GET and $_POST source variables.
We can add a function to WebRequest to return the original value instead of the normalized one.
Looks like we might need to cache it earlier: whenever the normalization is called, it seems to override them all.
The normalization is done in getGPCValue. Just add a boolean parameter $normalize.
This should also be implemented for the special UTF-8 normalization that is done for ml and ar, e.g. http://ml.wikipedia.org/w/api.php?action=query&titles=അനിമേഷന് — I think it's NFKC?
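(Whether the ml normalization is actually NFKC is left open above; for comparison, this is what NFKC does beyond NFC on a compatibility character. Python is used purely for illustration here; MediaWiki's actual normalization lives in PHP.)

```python
import unicodedata

s = "\ufb01"  # LATIN SMALL LIGATURE FI, a compatibility character

# NFC leaves compatibility characters alone...
print(unicodedata.normalize("NFC", s))   # "ﬁ" (unchanged)
# ...while NFKC replaces them with their compatibility decomposition.
print(unicodedata.normalize("NFKC", s))  # "fi" (two ASCII letters)
```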
I took a stab at this this afternoon, but ran into an issue that I think makes this impossible to solve. I managed to delay Unicode normalization of the titles parameter until ApiPageSet::processTitlesArray(), and got ?action=query&titles=Ϋ&format=jsonfm to output a 'normalized' object. However, all data in the API result data structure is Unicode-normalized before being output, so you get output like:

"normalized": [ { "from": "\u03ab", "to": "\u03ab" } ]

where the "from" entry was originally "\u03a5\u0308" (the value specified in the query string) but got normalized prior to being output. This means "from" and "to" will always be equal (apart from underscores-to-spaces and other existing normalizations), so this is useless.

I could armor the "from" value to protect it from Unicode normalization (I've written code for that before; I threw it out, but I should be able to reproduce it quickly), but that would allow the injection of arbitrary non-normalized data into the result, which may be invalid UTF-8, which would break e.g. XML parsers.

Is there a way I can do this only for the cases where we want it? Is "\u03a5\u0308" a string that is valid UTF-8/Unicode but is nevertheless changed by Language::normalize()? Is this true for all cases where we want this feature? Is it possible to detect this somehow?

CC'ing Brion because he probably knows more about this subject than I do.
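To illustrate the case in question: "\u03a5\u0308" (capital upsilon plus combining diaeresis) is perfectly valid Unicode that NFC composes into the single code point "\u03ab". A quick sketch in Python (used here purely for illustration; the MediaWiki code path is PHP's UtfNormal):

```python
import unicodedata

# The value from the query string: U+03A5 (GREEK CAPITAL LETTER UPSILON)
# followed by U+0308 (COMBINING DIAERESIS) -- two code points, valid Unicode.
decomposed = "\u03a5\u0308"

# NFC composes the pair into the single precomposed code point U+03AB.
composed = unicodedata.normalize("NFC", decomposed)

print(composed == "\u03ab")            # True
print(len(decomposed), len(composed))  # 2 1
```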
Created attachment 8504 [details] Stashing my work-in-progress changes here, this is as good a place as any
Created attachment 8505 [details] Patch with debugging code removed
(In reply to comment #6)
> I could armor the from value to protect it from Unicode normalization (I've
> written code for that before; I threw it out but I should be able to
> reproduce it quickly), but that would allow the injection of arbitrary
> non-normalized data into the result, which may be invalid UTF-8, which would
> break e.g. XML parsers.

Invalid UTF-8 is essentially random binary data and should thus be encoded, for example in base64.
(In reply to comment #9)
> Invalid UTF-8 is essentially random binary data and should thus be encoded,
> for example in base64.

Yeah. But I think it's fair not to offer this feature (normalization data) when the client gives us invalid UTF-8. Clients sending invalid UTF-8 deserve that punishment :) However, I think there are edge cases where we normalize things that are already valid UTF-8 but not quite the way we like to see it. I'd like to be able to detect those cases.
There are essentially two layers of work here, which our input validation merges into a single step:

1) Invalid UTF-8 sequences must be found and replaced with valid placeholder characters.
2) Valid UTF-8 sequences are normalized to form C (e.g., replacing 'e' followed by 'combining acute accent' with the precombined character 'e with acute').

The invalid UTF-8 sequences found in part 1) **cannot be represented as strings in JSON or XML output**, because JSON and XML are based on Unicode text. Even if you wanted them, you can't output them directly, nor can you use any escaping method to represent the original bad sequences. Outputting the original bogus UTF-8 into the document would make it unreadable, breaking the API.

Most likely, only 2) is of real interest: "\u03a5\u0308" is a perfectly valid Unicode string, and can be shipped around either with JSON string escapes as above or as literals in any Unicode encoding in any JSON or XML document. We can perfectly well expect clients to send that string, and we should be able to represent it in output.

That we normalize strings to NFC for most internal purposes should generally be an implementation detail of our data formats and how we do title comparisons, so it's reasonable to expect clients that input a given non-NFC string to see the same thing on the other side when we report how we normalized the title string. Ideally we would run only UTF-8 sequence validation at the $wgRequest boundary, and do things like the NFC conversion (to avoid extra combining characters) at processing and comparison boundaries such as Title normalization.

So in short: don't worry about representing invalid UTF-8 byte sequences. Either use a 'before' value that's been validated as UTF-8, or let the API output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion to all output).
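The two layers described above can be sketched as follows (a Python analogue for illustration only; the real code path is PHP's UtfNormal):

```python
import unicodedata

def clean_and_normalize(raw: bytes) -> str:
    """Sketch of the two layers: repair invalid UTF-8, then apply NFC."""
    # Layer 1: replace invalid UTF-8 byte sequences with U+FFFD
    # (REPLACEMENT CHARACTER), so the result is always valid Unicode text.
    text = raw.decode("utf-8", errors="replace")
    # Layer 2: normalize the (now valid) text to form C.
    return unicodedata.normalize("NFC", text)

# Invalid byte 0xFF is replaced in layer 1; layer 2 then has nothing to do:
print(clean_and_normalize(b"\xffA"))  # '\ufffdA'
# Valid but decomposed input passes layer 1 untouched and is composed by layer 2:
print(clean_and_normalize("\u03a5\u0308".encode("utf-8")) == "\u03ab")  # True
```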
(In reply to comment #11)
> So in short: don't worry about representing invalid UTF-8 byte sequences:
> either use a 'before' value that's been validated as UTF-8, or let the API
> output do UTF-8 validation (but make sure it *doesn't* apply NFC conversion
> on all output)

Yeah, I wanted to do the former, but I have no idea how. What kind of function would I call to find out whether something is valid UTF-8? (This is mostly why I'm asking you for help; I know next to nothing about MediaWiki's Unicode facilities, and IIRC you wrote them ;) )
Honestly, I don't think we have a good way to do that right now; UtfNormal combines it with the NFC stuff in quickIsNFCVerify(), and our fallbacks mean that a call to iconv() or mb_convert_encoding() might not actually do anything... Blech!
Can't you do something like:

$string2 = $string;
UtfNormal::quickIsNFCVerify( $string2 );
$stringIsValidUTF8 = $string === $string2;

As far as I can tell, quickIsNFCVerify doesn't seem to do anything with the string argument other than removing invalid sequences and removing control characters (or replacing them with the replacement character).
(In reply to comment #14)
> Can't you do something like
> $string2 = $string;
> UtfNormal::quickIsNFCVerify( $string2 );
> $stringIsValidUTF8 = $string === $string2;

Hmmmmm, you know what, that should work just fine actually. :)

Downside: it may be slower than UtfNormal::cleanUp() on some input texts on some systems, e.g. if NORMALIZE_ICU is on and using that extension. In other modes, that same code is already getting run when we call UtfNormal::cleanUp(), so it should be about the same speed for common cases if we're using either the default or the NORMALIZE_INTL mode (since that calls quickIsNFCVerify anyway to validate UTF-8 before doing the normalization call).
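The compare-against-a-cleaned-copy trick agreed on above can be expressed generically like this (a Python analogue for illustration; the MediaWiki version would run UtfNormal::quickIsNFCVerify() on the copy instead):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Validity check by comparison: clean a copy, then see if it changed.

    Decoding with errors="replace" substitutes U+FFFD for every invalid
    byte sequence; if re-encoding gives back the original bytes, nothing
    was invalid in the first place.
    """
    cleaned = raw.decode("utf-8", errors="replace").encode("utf-8")
    return cleaned == raw

print(is_valid_utf8("\u03a5\u0308".encode("utf-8")))  # True (valid, just not NFC)
print(is_valid_utf8(b"\xc3\x28"))                     # False (truncated sequence)
```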
Just some statistics from my interwiki bot: each of my API requests normally contains 50 title values. The title values are themselves the result of other API requests, so they should all be valid UTF-8. Even after reading the 'normalized' and 'converted' information, on average 17 of the requested titles cannot be found in the result on mlwiki, and on arwiki it's about 3-4 titles. To work around this, I re-request each title that was not found in its own request, so I know that the single element contained in the response must correspond to the requested title. In summary, my bot needs about 18 read requests for 50 titles on mlwiki instead of only one.
By the way, if I recall correctly, we do some other normalization beyond NFC for the ml and ar wikis (applied only on wikis with those content languages, for performance reasons; so if you get an interwiki link title from an en wiki, it might have different normalization on ml or ar).
(In reply to comment #15)
> Downside: may be slower than UtfNormal::cleanUp() on some input texts on
> some systems, eg if NORMALIZE_ICU is on and using that extension.

It's only done on 255-byte strings, so the slowdown should be negligible.
Bryan, Bawolff, Could one of you take this and make the necessary changes to close the bug?
Leaving this as a deployment blocker, since all that seems to be needed here is a SMOP.
(In reply to comment #20)
> leaving this as a deployment blocker since all that seems to be needed here
> is a SMOP.

This could potentially lead to invalid output for the XML format, since certain characters (like some control characters) are not allowed to appear in XML files, even in entity form.
(In reply to comment #21)
> This could potentially lead to invalid output for the XML format, since
> certain characters (like some control characters) are not allowed to appear
> in XML files, even in entity form.

Hmm, I suppose I should read what other people say before commenting ;)
+reviewed
switch to milestone, remove release tracking dep
Moved the patch into Gerrit: https://gerrit.wikimedia.org/r/#/c/22831/. It doesn't actually work yet, because the unnormalized data needs to be armored to bypass ApiResult::cleanUpUTF8() somehow.
Punting to some point in the future.
(In reply to comment #0)
> When you give the API titles in non-NFC form (see URL), in the output they
> are silently normalized to NFC, which makes it difficult for the user to
> match the input with the output.

I suppose bug 45848 is another way to look at the problem? It's causing major problems with some API consumers like LiquidThreads; Nikerabbit proposed a solution there.
Change 22831 abandoned by Hashar: (bug 27849) Add normalized info for Unicode normalization of titles Reason: Cleaning up very old change. Feel free to resurrect if there is any interest in finishing this. https://gerrit.wikimedia.org/r/22831
High priority set for more than three years => reflecting reality by setting to Normal priority. Comment 27 and comment 28 describe the status here.