Last modified: 2014-11-17 10:34:54 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T29987, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 27987 - Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5"


Summary:	Function ccnorm shouldn't convert "I" and "L" to "1", "O" to "0" and "S" to "5"

Status:	PATCH_TO_REVIEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	AntiSpoof
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 2 votes
Target Milestone:	---
Assigned To:	Ryan Kaldari

URL:
Whiteboard:
Keywords:

Duplicates:	56189
Depends on:
Blocks:

Reported:	2011-03-11 02:36 UTC by Helder
Modified:	2014-11-17 10:34 UTC
CC List:	11 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Helder 2011-03-11 02:36:58 UTC

Currently the result of 
------------------------------------------------
 ccnorm("ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuwxyz")
------------------------------------------------

is
------------------------------------------------
 ABCDEFGH1JK1MN0PQR5TUVWXYZ_ABCDEFGH1JK1MN0PQR5TUWXYZ
------------------------------------------------

This makes creating filters on [[Special:AbuseFilter]] unintuitive: if we want to catch all variations of a word like "testing" and try to use something like
------------------------------------------------
 words :="TESTING|VANDALIZING";
 (ccnorm(added_lines) rlike words)
 & !(ccnorm(removed_lines) rlike words)
------------------------------------------------

it won't work. Instead of this natural approach, the pattern would need to be written in its already-normalized form:
------------------------------------------------
 words :="TE5T1NG|VANDA11Z1NG";
------------------------------------------------


You can confirm the problem on [[Special:AbuseFilter/tools]], by using the following:
------------------------------------------------
 words :="TESTING|VANDALIZING";
 ccnorm("I'm testing here. I'm vandalizing the article!") rlike words
------------------------------------------------

The regex above will not match, but the following one will:
------------------------------------------------
 words := "TE5T1NG|VANDA11Z1NG";
 ccnorm("I'm testing here. I'm vandalizing the article!") rlike words
------------------------------------------------

Could this be fixed?
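
The behaviour described above can be reproduced outside the wiki with a small Python sketch. The `toy_ccnorm` function and its four-entry table are assumptions made for illustration: they cover only the letters named in this report, whereas the real function relies on the much larger AntiSpoof equivalence table and may do more than this.

```python
import re

# Hypothetical four-entry table covering only the letters named in this
# report; the real table in AntiSpoof is far larger.
TOY_TABLE = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5"})


def toy_ccnorm(text: str) -> str:
    """Uppercase the text, then replace the look-alike letters with digits."""
    return text.upper().translate(TOY_TABLE)


sample = toy_ccnorm("I'm testing here. I'm vandalizing the article!")

# The intuitive pattern fails: the normalized text no longer contains
# any "S", "I" or "L" for it to match against.
intuitive_hit = re.search("TESTING|VANDALIZING", sample) is not None

# The hand-normalized pattern matches.
workaround_hit = re.search("TE5T1NG|VANDA11Z1NG", sample) is not None

print(sample)
print(intuitive_hit, workaround_hit)
```

Under these assumptions the filter author has to know the normalized spelling of every word in advance, which is the usability problem this report is about.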

Comment 1 Helder 2011-04-28 14:36:34 UTC

This caused a false positive on this edit
http://pt.wikibooks.org/?diff=218742&oldid=218711
and forced us to work around the bug by adding "015" to some regexes in the filter:
http://pt.wikibooks.org/wiki/Especial:AbuseFilter/history/9/diff/prev/97

Comment 2 Helder 2011-10-07 18:02:41 UTC

These inconsistencies on ccnorm just caused one more false positive on Portuguese Wikibooks. It should convert "ó" to "O" but it is always converting to "ó" (i.e. it doesn't change this character).

Please fix this!

https://pt.wikibooks.org/wiki/Especial:AbuseLog/1550
https://pt.wikibooks.org/wiki/Especial:AbuseFilter/history/9/diff/prev/103

Comment 3 Helder 2011-12-14 18:27:50 UTC

Besides the characters mentioned above, "ï" should be converted to "I".

Does anyone know whether this bug is complicated to fix?

I imagine there is some list of conversion pairs somewhere which just needs to be updated.

Comment 4 Nikola Kovacs 2011-12-26 21:25:16 UTC

The list is part of Extension:AntiSpoof. The file is http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/AntiSpoof/maintenance/equivset.in

Comment 5 Andrew Garrett 2012-02-10 23:50:59 UTC

In general you should use ccnorm() on both items in a comparison.

i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")

Comment 6 Nischay Nahata 2013-03-27 20:33:48 UTC

(In reply to comment #5)
> In general you should use ccnorm() on both items in a comparison.
> 
> i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")

In that case, should this bug be closed as invalid, or are some changes to equivset.in needed?

Comment 7 Helder 2013-10-06 23:39:33 UTC

(In reply to comment #5)
> In general you should use ccnorm() on both items in a comparison.
> 
> i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")

But then we cannot use "|" in the regular expression, since
ccnorm( 'ab|cd|...|yz' ) === 'AB1CD1...1YZ'

Comment 8 Ryan Kaldari 2013-10-26 03:22:04 UTC

*** Bug 56189 has been marked as a duplicate of this bug. ***

Comment 9 Kunal Mehta (Legoktm) 2013-10-26 03:25:13 UTC

l is a look-alike character to 1, O is a look-alike to 0 and S is a look-alike to 5. This isn't going to be changed in the AbuseFilter extension, since it just uses the normalizations from AntiSpoof. If there are characters that aren't being normalized properly, file bugs against AntiSpoof.

Comment 10 Ryan Kaldari 2013-10-26 03:30:33 UTC

Legoktm: I think you are misunderstanding the bug. Yes, the characters are supposed to be mapped, but they are being mapped in the wrong direction.

Comment 11 Ryan Kaldari 2013-10-26 03:36:20 UTC

This bug is actually trivial to fix. It looks like there are a handful of entries in AntiSpoof/equivset.php that have the index and key reversed. Among them, the 4 mentioned in the bug summary: I, L, O, and S. The only problem with fixing this bug is it will temporarily break some AbuseFilter filters that were previously working around this bug.

Comment 12 Ryan Kaldari 2013-10-26 03:37:37 UTC

Sorry, meant to say 'value and key reversed' not 'index and key reversed' :)
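
To make "value and key reversed" concrete, here is a minimal Python sketch. Both dictionaries are hypothetical stand-ins, not the contents of equivset.php; they only illustrate how swapping key and value flips the direction of normalization.

```python
# Hypothetical entries in the direction this report complains about:
# letters are rewritten to the digits that resemble them.
reversed_entries = {"I": "1", "L": "1", "O": "0", "S": "5"}

# The direction asked for in this report: digits map to a letter.
# "1" also resembles "L"; choosing "I" is an arbitrary decision made
# for this sketch only.
intended_entries = {"1": "I", "0": "O", "5": "S"}


def normalize(text: str, table: dict) -> str:
    """Uppercase the text and substitute every character found in the table."""
    return "".join(table.get(ch, ch) for ch in text.upper())


print(normalize("testing", reversed_entries))
print(normalize("te5t1ng", intended_entries))
print(normalize("testing", intended_entries))
```

With the intended direction, an ordinary word is already its own normalized form, so a filter can spell "TESTING" as written. The sketch also shows why the change is disruptive: any filter that hard-codes "TE5T1NG" stops matching once the direction is flipped.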

Comment 13 Kunal Mehta (Legoktm) 2013-10-26 03:40:43 UTC

I'm not sure I agree with that, but shouldn't this bug be in the AntiSpoof component then?

If someone does want to switch them, they're going to need to somehow coordinate regenerating the spoof user table on every wiki + CentralAuth as well as updating any AbuseFilter rule on every wiki that uses ccnorm.

Comment 14 Kunal Mehta (Legoktm) 2013-10-26 03:41:10 UTC

(In reply to comment #13)
> I'm not sure I agree with that, but shouldn't this bug be in the AntiSpoof
> component then?
> 
Mid-air collision. Thanks for moving it.

Comment 15 Ryan Kaldari 2013-10-26 03:56:50 UTC

Legoktm: Sorry for suggesting you didn't understand the bug. You were totally right that the bug was in AntiSpoof, not AbuseFilter. I was originally thinking that AbuseFilter was reversing all the mappings, but it turns out that AntiSpoof/equivset.php is just a big mess. Not sure how it got into such a sorry state.

Luckily, it looks like AntiSpoof doesn't actually use equivset.php itself, which probably means that CentralAuth is OK as well. Fixing all the AbuseFilter rules is going to be a mess though. It's sad such a trivial-to-fix bug was allowed to languish for 4 years leading to so many workarounds.

Comment 16 Gerrit Notification Bot 2013-10-26 08:25:57 UTC

Change 92057 had a related patch set uploaded by Kaldari:
Make sure AntiSpoof mappings are mapping in the correct direction. For example, 5 should map to S, not the other way around. Also correcting some lowercase to uppercase mappings. There are a lot more mappings in equivset.in that need to be fixed but this 

https://gerrit.wikimedia.org/r/92057

Comment 17 Nemo 2013-10-26 08:52:40 UTC

Kaldari, *thank you* for working on this!

(In reply to comment #15)
> It's sad such a trivial-to-fix bug was
> allowed to languish for 4 years leading to so many workarounds.

Indeed. The AbuseFilter component has not received much (systematic) love in the last 4-5 years, I'm sure there's plenty of such low hanging fruit which could be found with some bug triaging by people knowing the underlying PHP/AntiSpoof magic/libraries a bit.

Comment 18 Tim Starling 2013-11-07 04:55:45 UTC

I don't really understand this bug. Surely the real issue is that the output of ccnorm() is exposed to the user, rather than being hidden in some opaque object? There's no way you could generate a canonical form which would make sense for readers of every language. Wouldn't the use case described in the original report be better served by syntax along the lines of:

added_lines cclike ["testing", "vandalizing"]

rather than requiring the user to look up a table of confusable characters in git and convert each character by hand or something?

Comment 19 Helder 2013-11-07 08:03:59 UTC

We need to use regex syntax ([abc], [^abc], ^, $, ?, etc.). See e.g. the discussion (in Portuguese) about
https://pt.wikipedia.org/wiki/Special:AbuseFilter/18
at
https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/18#Filtro_112_quebrou.21

PS: I use the [[Special:AbuseFilter/tools]] to get the value of ccnorm('foobar') to put into the regex

Comment 20 Tim Starling 2013-11-08 04:44:55 UTC

(In reply to comment #19)
> We need to use regex syntax ([abc], [^abc] ^, $, ?, etc).

Well, how about

added_lines cclike "testing|vandalizing"

Where the regex would be tokenized and reassembled, with alphabetic parts normalised with equivset?
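
One way to picture "tokenized and reassembled" is the Python sketch below. It is not an implementation of the proposed `cclike` operator, which does not exist in this form; `toy_ccnorm`, its table and `normalize_pattern` are hypothetical names used only for illustration.

```python
import re

# Hypothetical stand-in for the equivset table, limited to the letters
# named in this report.
TOY_TABLE = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5"})


def toy_ccnorm(text: str) -> str:
    """Uppercase the text, then replace the look-alike letters with digits."""
    return text.upper().translate(TOY_TABLE)


# A token is either a backslash escape such as \b or \d (left untouched)
# or a run of letters (normalized). Anything else, including "|", is
# never matched by this expression and therefore passes through as-is.
TOKEN = re.compile(r"\\.|[^\W\d_]+", re.DOTALL)


def normalize_pattern(pattern: str) -> str:
    """Normalize only the alphabetic runs of a regex, keeping its syntax."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        return token if token.startswith("\\") else toy_ccnorm(token)

    return TOKEN.sub(fix, pattern)


pattern = normalize_pattern(r"\b(testing|vandalizing)\b")
text = toy_ccnorm("I'm testing here. I'm vandalizing the article!")
print(pattern)
print(re.search(pattern, text) is not None)
```

This sketch deliberately ignores harder cases such as ranges inside character classes, multi-character escapes and named groups, all of which a real tokenizer would have to handle.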

Comment 21 Helder 2013-11-08 12:21:12 UTC

(In reply to comment #20)
> (In reply to comment #19)
> > We need to use regex syntax ([abc], [^abc] ^, $, ?, etc).
> 
> Well, how about
> 
> added_lines cclike "testing|vandalizing"
> 
> Where the regex would be tokenized and reassembled, with alphabetic parts
> normalised with equivset?

Seems good, if that is feasible.

Comment 22 Ryan Kaldari 2013-11-23 15:11:42 UTC

Is it possible right now to do something like:

words :="testing|vandalizing";
ccnorm(added_lines) rlike ccnorm(words)

That would at least make the filters a lot more readable (until something like Tim's solution is implemented).

Comment 23 Helder 2013-11-23 15:49:56 UTC

(In reply to comment #22)
The problem is that

ccnorm( '\\{}()[].?*+-^$|1iI' )
    === '\\{}()[].?*+-^$1111'

so it is not possible to distinguish between the letters "i", "I" etc... and the regex symbol "|".

Comment 24 Nemo 2014-03-28 21:46:45 UTC

(In reply to Helder from comment #23)
> so it is not possible to distinguish between the letters "i", "I" etc... and
> the regex symbol "|".

That's not really a problem. The point is that to compare two things they must *both* be normalised; you can't just assume you know what the normalised form is. So if not

> words :="testing|vandalizing";
> ccnorm(added_lines) rlike ccnorm(words)

then

words := ccnorm("testing") + "|" + ccnorm("vandalizing");
ccnorm(added_lines) rlike words

as long as they're not apples and oranges.
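
The difference between normalizing the whole pattern and normalizing each alternative can be sketched in Python. `toy_ccnorm` and its table are assumptions for illustration; the "|" entry mirrors the behaviour reported in comments 7 and 23, and the words are plain literals with no regex syntax of their own.

```python
import re

# Hypothetical table: the four letters from this report plus "|", which
# comments 7 and 23 report as being normalized to "1" as well.
TOY_TABLE = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5", "|": "1"})


def toy_ccnorm(text: str) -> str:
    """Uppercase the text, then replace the look-alike characters."""
    return text.upper().translate(TOY_TABLE)


words = ["testing", "vandalizing"]

# Normalizing the joined pattern turns the alternation bar into "1".
whole = toy_ccnorm("|".join(words))

# Normalizing each word first and joining afterwards keeps the bar intact.
per_word = "|".join(toy_ccnorm(word) for word in words)

text = toy_ccnorm("I'm testing here.")
print(whole, re.search(whole, text) is not None)
print(per_word, re.search(per_word, text) is not None)
```

Only the per-word form still behaves as an alternation, at the cost of spelling out the normalization call for every literal in the pattern.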

Comment 25 Nemo 2014-03-28 22:23:48 UTC

Comments 20, 22, 24: split to bug 63242 "ccnorm revamp: add a more sensible interface for normalised comparison".
