Last modified: 2009-08-23 00:45:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than those displaying bug reports and their history may be broken. See T13143, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 11143 - Invalid UTF-8 in percent-encoded links cause page rendering error
Product: MediaWiki
Classification: Unclassified
Component: Parser
Hardware: All All
Importance: Normal minor with 1 vote
Target Milestone: ---
Assigned To: Brion Vibber
Duplicates: 20346
Reported: 2007-08-31 23:50 UTC by Dan Collins
Modified: 2009-08-23 00:45 UTC (History)

Description Dan Collins 2007-08-31 23:50:07 UTC
Where one links to a typical page such as:
Where the last character is not a proper A, but instead from the 'greek encoding block':

od -c
0000000 316 221  \n
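
For reference, the two octal bytes in the od output above can be decoded as follows. This is a minimal Python sketch, not part of the original report, and the variable names are purely illustrative.

```python
import unicodedata

# The two bytes printed by `od -c` above, written in octal as od shows them.
raw = bytes([0o316, 0o221])

# Interpreted as UTF-8, the two bytes form a single code point.
char = raw.decode("utf-8")

code_point = f"U+{ord(char):04X}"
name = unicodedata.name(char)

print(raw.hex(), code_point, name)
# prints: ce91 U+0391 GREEK CAPITAL LETTER ALPHA
```

So the percent-encoded form of this character is %CE%91, and the character itself is well-formed UTF-8 when both bytes are present.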

The page on which the link is displayed will, intermittently, appear blank. See and try purging the cache or making a null edit, if you can see the page correctly. It appears correctly in previews and diffs, and seems to happen more often on longer pages.

This allows anyone to insert a special link and cause the page to appear blank. It seems to me that the best solution would be to sanitize such links, but I can't tell where this problem is occurring, or if it is a symptom of a bigger problem.

We had an issue with this on enwiki, at [[RMS Titanic]], and there was a discussion at [[VP/T]].
Comment 1 Brion Vibber 2007-09-11 18:09:49 UTC
The basic problem is that the PCRE library in PHP 5.2.x is a lot more strict about input in UTF-8 mode. It now rejects an input string which isn't 100% valid, which has a nasty habit of breaking everything.

Further complicating things is that we need in certain circumstances to allow (urlencoded) non-valid-UTF-8 titles for legacy interwiki links, so blithely validating all urldecoded titles could break those. I'm not 100% sure what's the best way to handle the combination.
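
To illustrate the tension described above: a percent-encoded byte can be a perfectly good title in a legacy single-byte encoding and still be invalid UTF-8. The sketch below is in Python rather than MediaWiki's PHP. The title "Caf%E9", the choice of Latin-1 as the legacy encoding, and the helper name is_valid_utf8 are all assumptions made for illustration.

```python
from urllib.parse import unquote_to_bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical legacy link target: "Café" percent-encoded as Latin-1.
legacy = unquote_to_bytes("Caf%E9")
# The same title percent-encoded as UTF-8.
modern = unquote_to_bytes("Caf%C3%A9")

print(is_valid_utf8(legacy))   # False
print(is_valid_utf8(modern))   # True
print(legacy.decode("latin-1") == modern.decode("utf-8"))  # True
```

A blanket "reject anything that is not valid UTF-8" rule would therefore also reject the first form, which is the kind of legacy link this comment wants to keep working.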
Comment 2 Carl Fürstenberg 2008-01-04 18:00:18 UTC
Wouldn't "Α" result in ? i.e. [[Special:Allpages/Α]]
Comment 3 mary DeMelo 2008-09-09 15:15:52 UTC
So sugar coat it with wonder why and not who he makes microsoft alot of $ i was told to buy another ph. i have went thru 5 guess nobody but me can speak it outloud no downloads with bugs right thanx anyway it was cool to see what been going thru in the proper language when entering my credit card info to purchase online even on a pc it erases or no i enter so yeah this goes on everyway all the time when on pc internet explorer goes nuts i am tired and there.s so much more that this tomcat sicko does i wish he wud get over it i knew in upoc told him off nice guy thou only bad thing he said was go play in traffic 200 items i didnt write all down to interestd in how intelegent u all are and want to read it again it makes me feel normal its a hard and difficult situation i shud be in a mental institution but nope i know everything 
Comment 4 Brion Vibber 2008-09-11 17:56:37 UTC
(In reply to comment #2)
> Wouldn't "Α" result in ?
> i.e. [[Special:Allpages/Α]]

Yes it would. However this is about this case:


which may as well be:


where "xxx" is anything that's not /[a-f]/i.

Our link normalization sees the "%"s in the link and does a transformation of /%[0-9a-f]/i sequences, to make something like this:


where "y" is a byte which, by itself, does *not* make up a valid UTF-8 sequence.

The result is we have invalid UTF-8 in our internal parser strings, and eventually it goes through the newer, stricter PCRE which barfs and silently destroys the entire string instead of processing it the way we'd expect.

The proper way to deal with this is probably for the title normalization to detect the bad UTF-8 and reject it, so we don't create a bogus link in the first place.
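
A rough sketch of the "detect and reject" idea, in Python rather than MediaWiki's PHP. The function name normalize_link_target and the example targets are hypothetical; this is not MediaWiki's actual title normalization, and it does not handle the legacy interwiki exception mentioned in comment 1.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def normalize_link_target(target: str) -> Optional[str]:
    """Percent-decode a link target; return None if the result is not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    decoded = unquote_to_bytes(target)
    try:
        return decoded.decode("utf-8")
    except UnicodeDecodeError:
        # Reject here, so the invalid bytes never reach later regex passes.
        # What the caller does with a rejected target is left open.
        return None


# Complete two-byte sequence: accepted and decoded.
print(normalize_link_target("Special:Allpages/%CE%91"))
# Lead byte followed by something that is not a hex escape: rejected.
print(normalize_link_target("Special:Allpages/%CE%zz"))
```

The point of rejecting at this stage is that the invalid byte is caught before it becomes part of the parser's internal strings, rather than surfacing later as a failed regex operation.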
Comment 5 Tim Starling 2009-01-09 05:37:30 UTC
Downgrading severity/priority: no data loss, no features broken, not relevant to security.
Comment 6 Brion Vibber 2009-08-23 00:36:50 UTC
*** Bug 20346 has been marked as a duplicate of this bug. ***
Comment 7 Brion Vibber 2009-08-23 00:45:58 UTC
r55382 adds a Unicode-enabled regex check for whitespace (for bug 15248) which has the happy side effect of eliminating this bug.

I've added a comment to this effect in r55514.
