Last modified: 2008-03-11 17:57:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports should be created and updated in Wikimedia Phabricator instead. Please create an account in Phabricator and add your Bugzilla email address to it.
Wikimedia Bugzilla is read-only. If you try to edit or create any bug report in Bugzilla you will be shown an intentional error message.
In order to access the Phabricator task corresponding to a Bugzilla report, just remove "static-" from its URL.
You could still run searches in Bugzilla or access your list of votes but bug reports will obviously not be up-to-date in Bugzilla.
Bug 13321 - SMW mb unsafe regexps can produce garbled UTF-8 strings
SMW mb unsafe regexps can produce garbled UTF-8 strings
Product: MediaWiki extensions
Classification: Unclassified
Semantic MediaWiki (Other open bugs)
All All
: Normal major (vote)
: ---
Assigned To: Markus Krötzsch
Depends on:
  Show dependency treegraph
Reported: 2008-03-11 03:29 UTC by S Page
Modified: 2008-03-11 17:57 UTC (History)
0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description S Page 2008-03-11 03:29:56 UTC
In my install (MediaWiki: 1.11.0 (r26292), PHP: 5.2.1 (apache2handler) on Windows XAMPPlite) , just the annotation

[[Test à stuff::Main Page]]

(note the à -- a accent grave) on a page leads to:

a) No wiki text appearing in page display and preview (only the surrounding skin appears).

b) In Special:Browse it displays as a garbled string 
   Test �  stuff  Main Page

I spent 6 hours debugging this 8-).

SMW_Factbox.php and SMW_SpecialBrowse.php both do
  preg_replace('/[\s]/', ...

This is not multibyte safe, so the string can get garbled.
In this annotation, I think the agrave accented character is C3A0 in UTF-8 , and the code point A0 in ISO8859 is NBSP.

The result *on certain PHP configurations* (?) is a garbled string where the agrave is partly replaced. 

a) Special:Browse displays the garbled string.

b) Many of MediaWiki's own preg_* functions use the /u PCRE modifier, so that "Pattern strings are treated as UTF-8."  But as someone commented in ,
 Regarding the validity of a UTF-8 string when using the /u pattern modifier,
2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

In this example the Factbox code garbles this string in its generated HTML, and when MediaWiki's parser calls MagicWords which does some replacements with /u, BOOM, the first one (stripToc) returns nothing.

In these two cases I found a fix is for SMW to use the /u PCRE modifier in its own preg* functions.
a) SMW_SpecialBrowse.php around line 178 change
  $html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";
  $html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/u', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";

b) SMW_Factbox.php around line 273 change
 $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';

  $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/u','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';

What's confounding is this doesn't happen on, or even on my ISP (see , works OK).  Maybe it's only an issue on Windows or with XAMPPLITE.
Comment 1 Markus Krötzsch 2008-03-11 15:37:49 UTC
I did implement the suggested changes in SVN, but I did not find an explicit documentation that this should really fix the problem. The PHP-docu states that the u-modifier only affects UTF-8 compatibility of the pattern, not the matched/replaced string. So it might still be an insecure fix.

Also, I see a large number of preg_* functions in MediaWiki code. Are they all operating on ASCII strings only (or in a UTF-aware way?). Some comment in the PHP online docu suggests to use multi-byte extensions for ereg such as mb_ereg(), mb_eregi(), pb_ereg_replace(). Do we need that to really ensure UTF compatibility? Is this extension present in standard PHP distributions?
Comment 2 Brion Vibber 2008-03-11 16:28:03 UTC
The /u option on PCRE patterns tells it to match based on UTF-8 characters rather than raw bytes. Most of the time this only makes a difference if your pattern uses specific lengths (that might split characters) or character classes (where byte-based matches won't match correctly).

The above example uses the whitespace character class (\s); you thus need to add the /u modifier there or you're going to get bad matches, as some UTF-8 component bytes might match an 8-bit whitespace (such as the Latin-1 'non-breaking space').

If your pattern very explicitly doesn't use anything that might trip problems, then you don't really need to add the /u. But if in doubt, add it in.

(The above comments do note that in some versions of PHP / PCRE, the /u mode will break if the input string is not valid UTF-8. Make sure your input is not invalid. :)
Comment 3 Markus Krötzsch 2008-03-11 17:57:57 UTC
Thanks, I have now updated all our preg_s accordingly (there were quite some more associated to particular functions, especially input value parsing). This should then close that bug.

Note You need to log in before you can comment on or make changes to this bug.