Last modified: 2011-05-03 20:25:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T28332, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 26332 - Spam-blacklist does not support unicode characters in regex, needed to filter internationalized domain names
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: Spam Blacklist
Version: unspecified
Hardware: All All
Importance: High major with 1 vote
Target Milestone: ---
Assigned To: Mark A. Hershberger
Keywords: patch, patch-need-review
Reported: 2010-12-14 12:51 UTC by Alex Lazovsky
Modified: 2011-05-03 20:25 UTC

Attachments
Suggested patch (783 bytes, patch)
2011-04-27 05:02 UTC, Mark A. Hershberger

Comment 1 Bawolff (Brian Wolff) 2010-12-15 19:42:23 UTC
Presumably the SpamBlacklist extension needs to be modified to use the u flag for the regexes it makes so it interprets them as UTF-8.
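
To illustrate what interpreting a regex as UTF-8 changes, here is a rough analogy in Python rather than PHP/PCRE. Python's str patterns work on characters, while its bytes patterns work on raw bytes, which is roughly comparable to PCRE without the u flag. This is only a sketch of the general idea; it does not use the SpamBlacklist code, and PCRE's exact behaviour may differ depending on version and build options.

```python
import re

text = "м"                   # one Cyrillic character
raw = text.encode("utf-8")   # the same character as two bytes: 0xD0 0xBC

# Text mode: the regex engine sees one character, so "." covers it.
one_char_text = re.fullmatch(r".", text) is not None

# Byte mode: the same data is two units, so a single "." cannot cover it.
one_char_bytes = re.fullmatch(rb".", raw) is not None

# Case-insensitive matching of non-ASCII letters only works in text mode.
ci_text = re.search(r"рф", "РФ", re.IGNORECASE) is not None
ci_bytes = re.search("рф".encode("utf-8"), "РФ".encode("utf-8"),
                     re.IGNORECASE) is not None

print(one_char_text, one_char_bytes, ci_text, ci_bytes)
# prints: True False True False
```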

As a temporary workaround, you can escape Unicode characters as their UTF-8 bytes using \xHH (replace HH with the hex value of each byte). For example:

\bмакросъемка\.рф  becomes \b\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84

\bпример\.испытание becomes \b\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
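
The escaping step itself can be automated. A minimal Python sketch, assuming the rest of the pattern is plain ASCII regex syntax; escape_non_ascii is a hypothetical helper name, not part of SpamBlacklist. It only performs the byte escaping and says nothing about whether the resulting pattern matches (Comment 3 drops the leading \b).

```python
def escape_non_ascii(pattern: str) -> str:
    """Replace each non-ASCII character with \\xHH escapes of its UTF-8 bytes."""
    out = []
    for ch in pattern:
        if ord(ch) < 128:
            out.append(ch)  # ASCII regex syntax is kept unchanged
        else:
            out.extend("\\x%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


escaped = escape_non_ascii(r"\bмакросъемка\.рф")
print(escaped)
```

For the first pattern above this reproduces the escaped form shown in this comment.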
Comment 2 Alex Lazovsky 2010-12-16 11:36:21 UTC
At first glance, this workaround does not work:
http://ru.wikipedia.org/w/index.php?diff=30229518
http://ru.wikipedia.org/w/index.php?diff=30229527

Now I use AbuseFilter http://ru.wikipedia.org/wiki/Special:AbuseFilter/117 to block such links, but this approach has some drawbacks.
Comment 3 Bawolff (Brian Wolff) 2010-12-27 03:36:18 UTC
Sorry, the workaround should not have the \b in it (presumably because bytes like \xD0 aren't word characters in non-UTF-8 mode).

\bмакросъемка\.рф  becomes
\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84

\bпример\.испытание becomes
\xD0\xBF\xD1\x80\xD0\xB8\xD0\xBC\xD0\xB5\xD1\x80\.\xD0\xB8\xD1\x81\xD0\xBF\xD1\x8B\xD1\x82\xD0\xB0\xD0\xBD\xD0\xB8\xD0\xB5
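
The \b problem can be reproduced in Python, whose bytes patterns treat only ASCII letters, digits and underscore as word characters. This is an analogy for PCRE without the u flag, not a test of the extension itself, and the URL below is made up for illustration.

```python
import re

# A link as it would appear in UTF-8 encoded page text (illustrative URL).
url = "http://макросъемка.рф/".encode("utf-8")

escaped = (rb"\xD0\xBC\xD0\xB0\xD0\xBA\xD1\x80\xD0\xBE\xD1\x81\xD1\x8A"
           rb"\xD0\xB5\xD0\xBC\xD0\xBA\xD0\xB0\.\xD1\x80\xD1\x84")

with_b = re.compile(rb"\b" + escaped)
without_b = re.compile(escaped)

# "/" and the byte 0xD0 are both non-word characters in byte mode,
# so there is no word boundary between them and \b cannot match.
print(with_b.search(url))                    # prints: None
print(without_b.search(url) is not None)     # prints: True
```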

-----

Would someone who knows about such things be able to comment on whether adding the /u flag to the generated regexes would have any adverse performance effects?
Comment 4 Alex Lazovsky 2011-01-04 23:25:58 UTC
This workaround works fine, thanks!

Alex
Comment 5 Brion Vibber 2011-02-13 22:39:48 UTC
I haven't tried profiling, but tossing a /u on in SpamRegexBatch::buildRegexes() doesn't seem to break anything, at least. It should, however, be double-checked with the full-size blacklists.

However -- this isn't necessarily sufficient for handling IDN domain spam, as it won't match the punycode form of the name if it's linked that way. May require some normalization to really do this right.
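
One possible shape for that normalization, sketched in Python under stated assumptions: host names are converted to their Unicode form before the blacklist regex runs, so the punycode and Unicode spellings are treated alike. The names host_to_unicode, is_blacklisted and BLACKLIST are hypothetical, and Python's built-in idna codec follows the older IDNA 2003 rules; a real fix in the extension would have to use whatever PHP provides.

```python
import re

# Hypothetical one-entry blacklist, written with literal Unicode characters.
BLACKLIST = re.compile(r"пример\.испытание", re.IGNORECASE)


def host_to_unicode(host: str) -> str:
    """Decode punycode (xn--) labels to Unicode; leave other hosts as they are."""
    host = host.lower()
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        # Already non-ASCII, or not valid punycode: use the host unchanged.
        return host


def is_blacklisted(host: str) -> bool:
    return BLACKLIST.search(host_to_unicode(host)) is not None


# Punycode spelling of the same domain, produced by the idna codec.
puny = "пример.испытание".encode("idna").decode("ascii")

print(puny.startswith("xn--"))            # prints: True
print(is_blacklisted(puny))               # prints: True
print(is_blacklisted("пример.испытание")) # prints: True
print(is_blacklisted("example.org"))      # prints: False
```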
Comment 6 Mark A. Hershberger 2011-04-27 05:02:29 UTC
Created attachment 8465
Suggested patch

Could you verify that the attached patch is where you think the /u should go to fix this?
Comment 7 Mark A. Hershberger 2011-05-03 20:25:40 UTC
r87352
