Last modified: 2008-03-11 17:57:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T15321, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 13321 - SMW mb unsafe regexps can produce garbled UTF-8 strings


Summary:	SMW mb unsafe regexps can produce garbled UTF-8 strings

Status:	RESOLVED FIXED

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	Semantic MediaWiki (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal major (vote)
Target Milestone:	---
Assigned To:	Markus Krötzsch

URL:	http://www.skierpage.com/tmp/mb_bug.txt
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2008-03-11 03:29 UTC by S Page
Modified:	2008-03-11 17:57 UTC (History)
CC List:	0 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description S Page 2008-03-11 03:29:56 UTC

In my install (MediaWiki: 1.11.0 (r26292), PHP: 5.2.1 (apache2handler) on Windows XAMPPlite) , just the annotation

[[Test à stuff::Main Page]]

(note the à -- a accent grave) on a page leads to:

a) No wiki text appearing in page display and preview (only the surrounding skin appears).

b) In Special:Browse it displays as a garbled string 
   Test �  stuff  Main Page


I spent 6 hours debugging this 8-).

SMW_Factbox.php and SMW_SpecialBrowse.php both do
  preg_replace('/[\s]/', ...

This is not multibyte safe, so the string can get garbled.
In this annotation, I think the agrave accented character is C3A0 in UTF-8 , and the code point A0 in ISO8859 is NBSP.

The result *on certain PHP configurations* (?) is a garbled string where the agrave is partly replaced. 

a) Special:Browse displays the garbled string.

b) Many of MediaWiki's own preg_* functions use the /u PCRE modifier, so that "Pattern strings are treated as UTF-8."  But as someone commented in http://php.net/manual/en/reference.pcre.pattern.modifiers.php#54805 ,
 Regarding the validity of a UTF-8 string when using the /u pattern modifier,
...
2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically result in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8

In this example the Factbox code garbles this string in its generated HTML, and when MediaWiki's parser calls MagicWords which does some replacements with /u, BOOM, the first one (stripToc) returns nothing.


In these two cases I found a fix is for SMW to use the /u PCRE modifier in its own preg* functions.
a) SMW_SpecialBrowse.php around line 178 change
  $html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";
to
  $html .=  '<strong>' . $skin->makeKnownLinkObj($att, preg_replace('/[\s]/u', '&nbsp;', smwfT($att), 2)) . "</strong>&nbsp; \n";

b) SMW_Factbox.php around line 273 change
 $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';

to
  $text .= '<tr><td class="smwpropname">[[' . $property->getPrefixedText() . '|' . preg_replace('/[\s]/u','&nbsp;',$property->getText(),2) . ']] </td><td class="smwprops">';


What's confounding is this doesn't happen on ontoworld.org, or even on my ISP (see http://www.skierpage.com/tmp/mb_bug.php , works OK).  Maybe it's only an issue on Windows or with XAMPPLITE.

Comment 1 Markus Krötzsch 2008-03-11 15:37:49 UTC

I did implement the suggested changes in SVN, but I did not find an explicit documentation that this should really fix the problem. The PHP-docu states that the u-modifier only affects UTF-8 compatibility of the pattern, not the matched/replaced string. So it might still be an insecure fix.

Also, I see a large number of preg_* functions in MediaWiki code. Are they all operating on ASCII strings only (or in a UTF-aware way?). Some comment in the PHP online docu suggests to use multi-byte extensions for ereg such as mb_ereg(), mb_eregi(), pb_ereg_replace(). Do we need that to really ensure UTF compatibility? Is this extension present in standard PHP distributions?

Comment 2 Brion Vibber 2008-03-11 16:28:03 UTC

The /u option on PCRE patterns tells it to match based on UTF-8 characters rather than raw bytes. Most of the time this only makes a difference if your pattern uses specific lengths (that might split characters) or character classes (where byte-based matches won't match correctly).

The above example uses the whitespace character class (\s); you thus need to add the /u modifier there or you're going to get bad matches, as some UTF-8 component bytes might match an 8-bit whitespace (such as the Latin-1 'non-breaking space').

If your pattern very explicitly doesn't use anything that might trip problems, then you don't really need to add the /u. But if in doubt, add it in.

(The above comments do note that in some versions of PHP / PCRE, the /u mode will break if the input string is not valid UTF-8. Make sure your input is not invalid. :)

Comment 3 Markus Krötzsch 2008-03-11 17:57:57 UTC

Thanks, I have now updated all our preg_s accordingly (there were quite some more associated to particular functions, especially input value parsing). This should then close that bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links