Last modified: 2008-03-11 17:57:57 UTC
In my install (MediaWiki: 1.11.0 (r26292), PHP: 5.2.1 (apache2handler) on Windows XAMPPlite) , just the annotation [[Test à stuff::Main Page]] (note the à -- a accent grave) on a page leads to: a) No wiki text appearing in page display and preview (only the surrounding skin appears). b) In Special:Browse it displays as a garbled string Test � stuff Main Page I spent 6 hours debugging this 8-). SMW_Factbox.php and SMW_SpecialBrowse.php both do preg_replace('/[\s]/', ... This is not multibyte safe, so the string can get garbled. In this annotation, I think the agrave accented character is C3A0 in UTF-8 , and the code point A0 in ISO8859 is NBSP. The result *on certain PHP configurations* (?) is a garbled string where the agrave is partly replaced. a) Special:Browse displays the garbled string. b) Many of MediaWiki's own preg_* functions use the /u PCRE modifier, so that "Pattern strings are treated as UTF-8." But as someone commented in http://php.net/manual/en/reference.pcre.pattern.modifiers.php#54805 , Regarding the validity of a UTF-8 string when using the /u pattern modifier, ... 2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8 In this example the Factbox code garbles this string in its generated HTML, and when MediaWiki's parser calls MagicWords which does some replacements with /u, BOOM, the first one (stripToc) returns nothing. In these two cases I found a fix is for SMW to use the /u PCRE modifier in its own preg* functions. a) SMW_SpecialBrowse.php around line 178 change $html .= '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/', ' ', smwfT($att), 2)) . "</strong> \n"; to $html .= '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/u', ' ', smwfT($att), 2)) . "</strong> \n"; b) SMW_Factbox.php around line 273 change $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/',' ',$property->getText(),2) . ']] </td><td class="smwprops">'; to $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/u',' ',$property->getText(),2) . ']] </td><td class="smwprops">'; What's confounding is this doesn't happen on ontoworld.org, or even on my ISP (see http://www.skierpage.com/tmp/mb_bug.php , works OK). Maybe it's only an issue on Windows or with XAMPPLITE.
I did implement the suggested changes in SVN, but I did not find an explicit documentation that this should really fix the problem. The PHP-docu states that the u-modifier only affects UTF-8 compatibility of the pattern, not the matched/replaced string. So it might still be an insecure fix. Also, I see a large number of preg_* functions in MediaWiki code. Are they all operating on ASCII strings only (or in a UTF-aware way?). Some comment in the PHP online docu suggests to use multi-byte extensions for ereg such as mb_ereg(), mb_eregi(), pb_ereg_replace(). Do we need that to really ensure UTF compatibility? Is this extension present in standard PHP distributions?
The /u option on PCRE patterns tells it to match based on UTF-8 characters rather than raw bytes. Most of the time this only makes a difference if your pattern uses specific lengths (that might split characters) or character classes (where byte-based matches won't match correctly). The above example uses the whitespace character class (\s); you thus need to add the /u modifier there or you're going to get bad matches, as some UTF-8 component bytes might match an 8-bit whitespace (such as the Latin-1 'non-breaking space'). If your pattern very explicitly doesn't use anything that might trip problems, then you don't really need to add the /u. But if in doubt, add it in. (The above comments do note that in some versions of PHP / PCRE, the /u mode will break if the input string is not valid UTF-8. Make sure your input is not invalid. :)
Thanks, I have now updated all our preg_s accordingly (there were quite some more associated to particular functions, especially input value parsing). This should then close that bug.