Last modified: 2014-08-05 06:45:19 UTC
On the Marathi-language Wikipedia we usually use a parameter like contains_any(added_lines, "तू") to filter a given word, say when we want to stop use of the word "तू". To avoid false positives caused by prefixes and suffixes attached to the word, we want to be able to use \b as a word boundary on either side of the word, or on both sides, as required. We wish the following would work:
*contains_any(added_lines, "तू\b") should work, so that we do not get a false positive on the word "तूप" and many similar words.
*contains_any(added_lines, "\bतू") should work, so that we do not get a false positive on the word "धातू" and many similar words.
*contains_any(added_lines, "\bतू\b") should work, so that we do not get a false positive on the word "दुकानातून" and many similar words.
The related edit (abuse) filter on the Marathi-language Wikipedia is http://mr.wikipedia.org/wiki/विशेष:दुरूपयोग_गाळणी/10
For words with few possible prefixes and suffixes we are using the ! parameter, but it is not sufficient for words where too many suffixes or prefixes are possible. If the \b parameter could be made to work, or if there were some other good option for word boundaries, it would be useful to many wikis using Devanagari script, such as Hindi and many others.
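The requested behaviour can be sketched outside AbuseFilter with Java's regex engine (an analogy only, not AbuseFilter syntax): the UNICODE_CHARACTER_CLASS flag makes \b treat Devanagari letters and combining vowel signs as word characters, which is the kind of Unicode-aware boundary matching being asked for here. The class name and example strings are illustrative.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // With UNICODE_CHARACTER_CLASS, \w covers Unicode letters and
    // combining marks (so Devanagari matras count as word characters),
    // and \b therefore falls only at the real edges of the word.
    static final Pattern WORD =
        Pattern.compile("\\bतू\\b", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        System.out.println(WORD.matcher("तू").find());        // true: the bare word
        System.out.println(WORD.matcher("तूप").find());       // false: suffixed form
        System.out.println(WORD.matcher("धातू").find());      // false: prefixed form
        System.out.println(WORD.matcher("दुकानातून").find()); // false: word-internal
    }
}
```

This is exactly the three-case behaviour listed above: the bare word matches, while the suffixed, prefixed, and word-internal occurrences are all rejected by the boundary check.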
*One suggestion was to use 'added_lines rlike "\bतू\b"', but this did not work either.
*It seems some European languages had problems related to the \b parameter which have since been resolved, so we request the developers to support Devanagari-script languages in this respect as well.
I just tried to fix this using the following expression: added_lines irlike "\bतू\b" (this is how it is done on the French Wikipedia, from what I've seen). The expression was not matched, as I expected. Apparently this comes from the fact that "\b" does not support UTF-8 characters. Regards, Quentinv57
Hi everyone, Using rlike is indeed the way to go as contains_any works on plain strings, not regular expressions (in this context, \b is nothing more than an invalid escaped character). We've indeed already had the same problem on several European languages such as French and Portuguese (see bug 22761), but it has been fixed by updating PHP to a newer version which provides UTF-8-aware special characters. Now it would be interesting to test PCRE alone, to see if it can handle this well. Best regards
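The distinction between the two operators can be illustrated with a small Java sketch (an analogy, not AbuseFilter code): a plain substring search, like contains_any, happily finds the word inside a longer word — and would treat a "\b" in the needle as literal text — while a regex engine with Unicode word semantics honours \b as a boundary.

```java
import java.util.regex.Pattern;

public class LiteralVsRegex {
    public static void main(String[] args) {
        String added = "दुकानातून"; // contains "तू" word-internally

        // contains_any analogue: a literal substring search. A "\b" here
        // would simply be searched for as characters, never as a boundary.
        boolean literalHit = added.contains("तू");

        // rlike/irlike analogue: a real regex engine interprets \b,
        // with Unicode word semantics enabled via the flag.
        boolean regexHit = Pattern
                .compile("\\bतू\\b", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(added)
                .find();

        System.out.println(literalHit); // true: the false positive
        System.out.println(regexHit);   // false: boundary check rejects it
    }
}
```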
*In most filters we are shifting from contains_any(added_lines, "") to added_lines irlike " ".
*Of course we still need a solution to the \b word-boundary issue; it is very important to several of our filters and to Indic-language wikis.
*(BTW, a bit of a subject diversion: added_lines irlike " " seems to have some unstated limit on the number of words/strings it can handle in a single filter? Or does this behaviour occur only with Devanagari script?)
(In reply to comment #3)
> We've indeed already had the same problem on several European languages such
> as French and Portuguese (see bug 22761), but it has been fixed by updating
> PHP to a newer version which provides UTF-8-aware special characters.
So it should be reported upstream to PHP?
-upstream keyword: "Bugs marked this way *should* include a link to the upstream bug report in the "See Also" field!" (https://bugzilla.wikimedia.org/describekeywords.cgi)
(In reply to comment #6) > -upstream keyword: "Bugs marked this way *should* include a link to the > upstream bug report in the "See Also" field!" > (https://bugzilla.wikimedia.org/describekeywords.cgi) Sure. That's why I added it.
But there's no PHP bug URL in the See Also field...
Is there something like a minimal test script to trigger this? Also wondering about our "PHP version" and "Package affected". See https://bugs.php.net/report.php
Change 71718 had a related patch set uploaded by Hashar: test word boundaries in devanagari words https://gerrit.wikimedia.org/r/71718
Created attachment 12734 [details] PCRE unit tests without and with unicode mode
The root cause is that PCRE does not look up Unicode character properties by default and so does not recognize word boundaries in various scripts. To make PCRE match these word boundaries, we need PCRE to act in Unicode mode using the 'u' regex modifier. That makes PCRE look up the character properties in a huge table, which might be a bit slow. So this is definitely doable, but we have to look at the performance impact.
The change https://gerrit.wikimedia.org/r/71718 adds a simple test in MediaWiki core which shows the problem:
$ php phpunit.php --testdox includes/bug46773Test.php
PHPUnit 3.7.21 by Sebastian Bergmann.
Configuration read from /Users/amusso/projects/mediawiki/core/tests/phpunit/suite.xml
bug46773
 [ ] Regex boundaries devanagari
 [x] Regex boundaries devanagari in unicode mode
 [x] Media wiki test case parent setup called
$
(an 'x' denotes a passing test). Attached is the --tap output of the test.
Change 71718 abandoned by Hashar: test word boundaries in devanagari words Reason: That was an example for bug 46773 https://gerrit.wikimedia.org/r/71718
Posted a comment on wikitech-l to attract more people: http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070191.html
Running a basic preg_match one million times with and without the modifier, the runs with 'u' averaged 15% longer. Running regexes isn't the only thing AbuseFilter does, so I think we would be safe enabling it behind a flag; we can then watch its performance to make sure we don't see anything too crazy.
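The measurement above was done with PHP's preg_match; for anyone wanting to reproduce the shape of the experiment, here is a comparable micro-benchmark sketch in Java, where UNICODE_CHARACTER_CLASS plays the role of PCRE's 'u' modifier. The iteration count and haystack are arbitrary choices, and absolute numbers will of course differ from the PHP figures.

```java
import java.util.regex.Pattern;

public class BoundaryBenchmark {
    public static void main(String[] args) {
        // Same pattern, with and without Unicode word semantics.
        Pattern plain = Pattern.compile("\\bतू\\b");
        Pattern unicode =
            Pattern.compile("\\bतू\\b", Pattern.UNICODE_CHARACTER_CLASS);
        String haystack = "धातू तू दुकानातून तूप"; // words from the report
        final int runs = 1_000_000;

        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) plain.matcher(haystack).find();
        long plainNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) unicode.matcher(haystack).find();
        long unicodeNs = System.nanoTime() - t0;

        System.out.printf("plain: %d ms, unicode: %d ms (%.1f%% overhead)%n",
                plainNs / 1_000_000, unicodeNs / 1_000_000,
                100.0 * (unicodeNs - plainNs) / plainNs);
    }
}
```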
(In reply to comment #14)
> safe enabling it with a flag
Hi, are we expected to do any edit-filter testing on our local wiki? Thanks and regards
Is there any good news for us on this bug, please?