Last modified: 2014-08-05 06:45:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T48773, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 46773 - Word boundary parameter \b not working with Unicode devanagari words


Summary:	Word boundary parameter \b not working with Unicode devanagari words

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	AbuseFilter
Version:	unspecified
Hardware:	All All

Importance:	Normal major
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n, upstream

Depends on:
Blocks:

Reported:	2013-04-02 06:49 UTC by Mahitgar
Modified:	2014-08-05 06:45 UTC
CC List:	14 users

See Also:	20008 22761
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
PCRE unit tests without and with unicode mode (1.25 KB, text/plain) 2013-07-02 20:46 UTC, Antoine "hashar" Musso (WMF)

Description Mahitgar 2013-04-02 06:49:35 UTC

On the Marathi Wikipedia I usually use contains_any(added_lines,"तू") to filter a given word; say, for example, that I want to stop use of the word "तू".

To avoid false positives caused by prefixes and suffixes attached to the word, we want to use \b as a word boundary on either side of the word, or on both sides, as required.

We would like the following to work:

*contains_any(added_lines,"तू\b"), so that we do not get a false positive on the word "तूप" and many similar words.

*contains_any(added_lines,"\bतू"), so that we do not get a false positive on the word "धातू" and many similar words.

*contains_any(added_lines,"\bतू\b"), so that we do not get a false positive on the word "दुकानातून" and many similar words.
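The three cases above amount to a whole-word test. The following is a minimal Python sketch of the intended semantics only; it is not AbuseFilter syntax. The names `is_word_char` and `contains_whole_word` are hypothetical, and the sketch assumes that letters, combining marks (such as Devanagari vowel signs) and digits all count as word characters.

```python
import unicodedata


def is_word_char(ch):
    """Assumption: letters (L*), marks (M*), digits (N*) and '_' are word characters."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "M", "N")


def contains_whole_word(text, word):
    """Return True if `word` occurs in `text` with no word character on either side."""
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            return True
        # Keep looking for a later occurrence.
        start = text.find(word, start + 1)
    return False


for sample in ["तू कोण आहेस", "तूप", "धातू", "दुकानातून"]:
    print(sample, contains_whole_word(sample, "तू"))
```

Counting marks as word characters matters here: in "धातू" the character before "तू" is a vowel sign, not a letter.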

The related edit (abuse) filter on the Marathi Wikipedia is http://mr.wikipedia.org/wiki/विशेष:दुरूपयोग_गाळणी/10

For words with few prefixes and suffixes we are using the ! parameter, but it is not sufficient for words where too many suffixes or prefixes are possible.

If \b, or any other good option for a word boundary, can be made to work, it will be useful to many wikis that use Devanagari script, such as Hindi and many others.

Comment 1 Mahitgar 2013-04-02 06:54:52 UTC

*One suggestion was to use 'added_lines rlike "\bतू\b"', but this did not work either.

*It seems some European languages had problems related to \b and those have been resolved, so I request that developers support languages written in Devanagari script in the same way.

Comment 2 Quentinv57 2013-04-03 11:47:17 UTC

I just tried to fix this using the following expression:
       added_lines irlike "\bतू\b"
(this is the way it is done on the French Wikipedia, from what I've seen)

The expression did not match as I expected. Apparently this is because "\b" does not support UTF-8 characters.

Regards,

Quentinv57

Comment 3 Jérémie Roquet 2013-04-03 13:36:54 UTC

Hi everyone,

Using rlike is indeed the way to go as contains_any works on plain strings, not regular expressions (in this context, \b is nothing more than an invalid escaped character).
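To illustrate the difference, here is a small Python sketch of plain substring matching. The `contains_any` function below is a hypothetical stand-in, not AbuseFilter's implementation, and it assumes the backslash and "b" reach the matcher as ordinary characters; the thread does not say how AbuseFilter actually treats the invalid escape.

```python
def contains_any(haystack, *needles):
    """Hypothetical stand-in: True if any needle occurs as a plain substring."""
    return any(needle in haystack for needle in needles)


# A plain needle also matches inside a longer word (the false positive from the report).
plain_hit = contains_any("तूप", "तू")

# Here the backslash and "b" are literal characters, not a word-boundary assertion,
# so the needle is not found even though the standalone word is present.
escaped_hit = contains_any("तू कोण", "तू\\b")

print(plain_hit, escaped_hit)
```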

We've indeed already had the same problem on several European languages such as French and Portuguese (see bug 22761), but it has been fixed by updating PHP to a newer version which provides UTF-8-aware special characters.

Now it would be interesting to test PCRE alone, to see if it can handle this well.

Best regards

Comment 4 Mahitgar 2013-04-15 01:50:28 UTC

*In most filters we are shifting from 'contains_any(added_lines,"")' to 'added_lines irlike " "'.

*Of course we still need a solution to the \b word boundary issue; it is very important to several of our filters and to Indic language wikis.

*(BTW, a bit off topic: 'added_lines irlike " "' seems to have some unstated limit on the number of words/strings it can handle in a single filter. Or does this behaviour occur only with Devanagari script?)

Comment 5 Nemo 2013-06-09 09:57:44 UTC

(In reply to comment #3)
> We've indeed already had the same problem on several European languages such
> as French and Portuguese (see bug 22761), but it has been fixed by updating
> PHP to a newer version which provides UTF-8-aware special characters.

So it should be reported upstream to PHP?

Comment 6 Alex Monk 2013-06-09 12:58:47 UTC

-upstream keyword: "Bugs marked this way *should* include a link to the upstream bug report in the "See Also" field!" (https://bugzilla.wikimedia.org/describekeywords.cgi)

Comment 7 Nemo 2013-06-09 13:30:38 UTC

(In reply to comment #6)
> -upstream keyword: "Bugs marked this way *should* include a link to the
> upstream bug report in the "See Also" field!"
> (https://bugzilla.wikimedia.org/describekeywords.cgi)

Sure. That's why I added it.

Comment 8 Alex Monk 2013-06-09 13:37:12 UTC

But there's no PHP bug URL in the See Also field...

Comment 9 Andre Klapper 2013-06-11 18:06:14 UTC

Is there something like a minimal test script to trigger this? Also wondering about our "PHP version" and "Package affected". 
See https://bugs.php.net/report.php

Comment 10 Gerrit Notification Bot 2013-07-02 20:40:37 UTC

Change 71718 had a related patch set uploaded by Hashar:
test word boundaries in devanagari words

https://gerrit.wikimedia.org/r/71718

Comment 11 Antoine "hashar" Musso (WMF) 2013-07-02 20:46:12 UTC

Created attachment 12734 [details]
PCRE unit tests without and with unicode mode

The root cause is that PCRE does not look up Unicode character properties by default, so it does not recognize word boundaries in various scripts.

To make PCRE match the word boundaries, we need PCRE to act in Unicode mode using the 'u' regex modifier.  That makes PCRE look up the character properties in a huge table, which might be a bit slow.

So that is definitely doable, but we have to look at the performance impact.
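As a rough analogy for the non-Unicode case, Python's bytes patterns also treat only ASCII characters as word characters. The sketch below is not PCRE, and it only assumes that PCRE without the 'u' modifier behaves similarly on UTF-8 input; `byte_mode_match` is a hypothetical helper name.

```python
import re

word = "तू"

# In a bytes pattern, \b and \w only know ASCII, so every UTF-8 byte of the
# Devanagari word counts as a non-word byte.
byte_pattern = re.compile(rb"\b" + re.escape(word.encode("utf-8")) + rb"\b")


def byte_mode_match(text):
    """Search the UTF-8 encoding of `text` with the ASCII-only pattern."""
    return byte_pattern.search(text.encode("utf-8")) is not None


# The standalone word is missed: there is no word/non-word transition around it.
standalone = byte_mode_match("तू कोण")

# Glued between ASCII letters it matches, because the only "boundaries" the
# matcher sees are between ASCII word characters and the non-ASCII bytes.
glued_to_ascii = byte_mode_match("aतूb")

print(standalone, glued_to_ascii)
```

In this model the boundary logic is effectively inverted for non-ASCII words, which is consistent with the failing test without Unicode mode.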


The change https://gerrit.wikimedia.org/r/71718 adds a lame test in MediaWiki core which shows the problem.


$ php phpunit.php --testdox includes/bug46773Test.php 
PHPUnit 3.7.21 by Sebastian Bergmann.

Configuration read from /Users/amusso/projects/mediawiki/core/tests/phpunit/suite.xml

bug46773
 [ ] Regex boundaries devanagari
 [x] Regex boundaries devanagari in unicode mode
 [x] Media wiki test case parent setup called
$

(an 'x' denotes a passing test).


Attached is the --tap output of the test.

Comment 12 Gerrit Notification Bot 2013-07-02 20:47:03 UTC

Change 71718 abandoned by Hashar:
test word boundaries in devanagari words

Reason:
That was an example for bug 46773

https://gerrit.wikimedia.org/r/71718

Comment 13 Antoine "hashar" Musso (WMF) 2013-07-02 20:59:14 UTC

Posted a comment on wikitech-l to attract more people: http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070191.html

Comment 14 Chris Steipp 2013-07-09 03:18:45 UTC

I ran a basic preg_match 1M times with and without the modifier; with the u modifier it averaged 15% longer.

Doing regexes isn't the only thing AbuseFilter does, so I think we would be safe enabling it with a flag, and then we can watch the performance of it to make sure we don't see anything too crazy.
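The shape of such a measurement can be sketched in Python as follows. This is not the benchmark that was run: it uses Python's `re` with str and bytes patterns as stand-ins for matching with and without Unicode awareness, `average_seconds` is a hypothetical helper, and the numbers it prints say nothing about PHP or the 15% figure.

```python
import re
import timeit


def average_seconds(pattern, subject, runs=10_000):
    """Average time of one search, measured over `runs` repetitions."""
    return timeit.timeit(lambda: pattern.search(subject), number=runs) / runs


text = "दुकानातून तू आला"

# Neither pattern finds "zzz", so both searches scan the whole subject.
unicode_time = average_seconds(re.compile(r"\bzzz\b"), text)
byte_time = average_seconds(re.compile(rb"\bzzz\b"), text.encode("utf-8"))

print(f"str pattern: {unicode_time:.3e} s, bytes pattern: {byte_time:.3e} s")
```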

Comment 15 Mahitgar 2013-07-26 07:13:26 UTC

(In reply to comment #14)

>>safe enabling it with a flag<<

Hi,

Are we expected to do any Edit filter testing at our local wiki? 


Thanks and regards

Comment 16 Mahitgar 2014-08-05 06:45:19 UTC

Any good news for us on this bug, please?
