Last modified: 2013-06-18 13:29:35 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T24761, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 22761 - Abuse filter regex \b considers unicode characters as word boundaries
Abuse filter regex \b considers unicode characters as word boundaries
Status: RESOLVED WORKSFORME
Product: MediaWiki extensions
Classification: Unclassified
AbuseFilter (Other open bugs)
unspecified
All All
: Normal major (vote)
: ---
Assigned To: Nobody - You can work on this!
:
: 36129 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-03-08 03:28 UTC by Matthew P. Del Buono
Modified: 2013-06-18 13:29 UTC (History)
10 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Matthew P. Del Buono 2010-03-08 03:28:43 UTC
In analyzing a false positive, I've been trying to track down the reason my regex debugger says a regex doesn't match yet it does match on the abuse filter. Eventually I found what appears to be a good lead on the issue.

Details of the incorrect match are here: http://test.wikipedia.org/w/index.php?title=Special:AbuseLog&details=1784

It appears what's going on is the é (which appears to be encoded in UTF-8) is mishandled when testing against the regex. It appears to the regex engine as a word boundary, so the match succeeds (specifically, "\brence\b" matches "conférence"). 

Hopefully there's a way to correct this and it's not a problem in the heart of PHP instead.

Please let me know if you need any additional information.

-- Shirik @ enwiki
Comment 1 Jérémie Roquet 2010-06-24 10:31:27 UTC
Hello,

We have the same problem at frwiki: accented letters cause many false positives because they are recognized as word boundaries.

Best regards,

-- 
Arkanosis @ frwiki
Comment 2 Helder 2011-03-24 01:41:06 UTC
Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.
Comment 3 Helder 2011-04-28 14:30:23 UTC
(In reply to comment #2)
> Same problem on ptwikibooks, where a '\bpica\b' matched 'típica'.
and also 'hípica' when this edit was made last week:
http://pt.wikibooks.org/?diff=prev&oldid=218088
This forced us have to use something like [^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015] in a regex, so that it doesn't match accented letters
Comment 4 Helder 2011-04-28 14:32:54 UTC
(In reply to comment #3)
> [^A-ZÁÀÂÃÇÉÊẼÍÓÒÔÕQ̃ÚŰÜŨ015]
Where "015" is there as an workaround to Bug 27987.
Comment 5 Andrew Garrett 2012-02-10 23:49:24 UTC
The /u flag is definitely there, as far as I can see.

	public static function keywordRegex( $str, $regex, $pos, $insensitive = false ) {
		$str = $str->toString();
		$pattern = $regex->toString();

		$pattern = preg_replace( '!(\\\\\\\\)*(\\\\)?/!', '$1\/', $pattern );
		$pattern = "/$pattern/u";

		if( $insensitive ) {
			$pattern .= 'i';
		}

		$handler = new AFPRegexErrorHandler( $pattern, $pos );
		try {
			$handler->install();
			$result = preg_match( $pattern, $str );
			$handler->restore();
		} catch ( Exception $e ) {
			$handler->restore();
			throw $e;
		}
		return new AFPData( self::DBool, (bool)$result );
	}
Comment 6 Mark Nelson 2013-01-31 22:54:05 UTC
Is this still happening? My understanding is that it was a PHP behavior (possibly a "bug" depending on your interpretation) that was changed in more recent versions.

My understanding of the problem:

PHP passes its regexes to PCRE, and the traditional PCRE behavior is that the "traditional" character class \w is defined as just [A-Za-z0-9_], even when matching a UTF8 string. The word boundary class \b is defined based on \w, so behaves similarly. If you wanted the Unicode notion of "letter character", there are instead Unicode character classes, such as \pL. The /u switch in PHP just tells PCRE it's matching a Unicode string, but didn't change the definition of the legacy character classes.

But, PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the traditional character classes into aliases for morally equivalent Unicode character classes. That should produce something closer to the expected behavior, at least in our case. From what I can find, this PCRE switch is enabled when /u is specified in newer versions of PHP, starting with 5.3.4, which came out in late 2010. I assume Wikimedia must be using a sufficiently new version by now?
Comment 7 Mark Nelson 2013-01-31 23:01:04 UTC
*** Bug 36129 has been marked as a duplicate of this bug. ***
Comment 8 Jérémie Roquet 2013-02-01 09:46:30 UTC
(In reply to comment #6)
> Is this still happening? My understanding is that it was a PHP behavior
> (possibly a "bug" depending on your interpretation) that was changed in more
> recent versions.
>
> <snip>
> 
> PCRE added a switch 'PCRE_UCP' in mid-2010, which if set makes the
> traditional character classes into aliases for morally equivalent Unicode
> character classes. That should produce something closer to the expected
> behavior, at least in our case. From what I can find, this PCRE switch is
> enabled when /u is specified in newer versions of PHP, starting with 5.3.4,
> which came out in late 2010. I assume Wikimedia must be using a sufficiently
> new version by now?

Indeed, I haven't managed to reproduce the bug this morning.
I'm marking the bug as RESOLVED FIXED but don't hesitate to reopen if it's not OK for you.

Thanks Mark for the detailed explanation!
Comment 9 Nemo 2013-02-05 10:16:52 UTC
Surely it's not FIXED if the behaviour changed in PHP -> WORKSFORME.
WMF wikis are currently using PHP 5.3.10-1ubuntu3.4+wmf1, by the way.
Comment 10 Jérémie Roquet 2013-02-05 10:29:20 UTC
It's fixed somehow by updating PHP (it didn't workforme back in 2010, now it does). But I guess this is of little importance.

Thanks for the info on the PHP version :)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links