Last modified: 2014-11-17 10:34:54 UTC
Currently the result of
------------------------------------------------
ccnorm("ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuwxyz")
------------------------------------------------
is
------------------------------------------------
ABCDEFGH1JK1MN0PQR5TUVWXYZ_ABCDEFGH1JK1MN0PQR5TUWXYZ
------------------------------------------------
This makes creating filters on [[Special:AbuseFilter]] unintuitive. If we want to catch all variations of a word like "testing" and try to use something like
------------------------------------------------
words := "TESTING|VANDALIZING";
(ccnorm(added_lines) rlike words) & !(ccnorm(removed_lines) rlike words)
------------------------------------------------
it won't work. Instead of this natural approach, the text would need to be changed to
------------------------------------------------
words := "TE5T1NG|VANDA11Z1NG";
------------------------------------------------
You can confirm the problem on [[Special:AbuseFilter/tools]] by using the following:
------------------------------------------------
words := "TESTING|VANDALIZING";
ccnorm("I'm testing here. I'm vandalizing the article!") rlike words
------------------------------------------------
The regex above will not match, but the following one will:
------------------------------------------------
words := "TE5T1NG|VANDA11Z1NG";
ccnorm("I'm testing here. I'm vandalizing the article!") rlike words
------------------------------------------------
Could this be fixed?
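For readers without access to [[Special:AbuseFilter/tools]], the reported behaviour can be approximated outside the wiki. The Python sketch below is only a toy model: `ccnorm_reported` and its four-entry table are hypothetical names built from the output quoted above, not the real AntiSpoof equivalence set, which is far larger.

```python
import re

# Toy stand-in for the reported behaviour: uppercase the text, then map
# I, L, O and S to the digits they resemble. Only these four pairs come
# from the report; everything else about the real table is left out.
REPORTED_MAP = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5"})


def ccnorm_reported(text: str) -> str:
    """Hypothetical model of ccnorm() as described in this report."""
    return text.upper().translate(REPORTED_MAP)


sample = "I'm testing here. I'm vandalizing the article!"
normalized = ccnorm_reported(sample)

natural = "TESTING|VANDALIZING"        # what a filter author would write
workaround = "TE5T1NG|VANDA11Z1NG"     # what currently has to be written

print(normalized)
print(bool(re.search(natural, normalized)))     # False
print(bool(re.search(workaround, normalized)))  # True
```

Under these assumptions the natural pattern can never match, because the normalized text no longer contains any "I", "L", "O" or "S".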
This caused a false positive on this edit http://pt.wikibooks.org/?diff=218742&oldid=218711 and forced us to work around the bug by adding "015" to some regexes in the filter: http://pt.wikibooks.org/wiki/Especial:AbuseFilter/history/9/diff/prev/97
These inconsistencies in ccnorm just caused one more false positive on Portuguese Wikibooks. It should convert "ó" to "O", but it always leaves the character unchanged as "ó". Please fix this! https://pt.wikibooks.org/wiki/Especial:AbuseLog/1550 https://pt.wikibooks.org/wiki/Especial:AbuseFilter/history/9/diff/prev/103
Besides the characters mentioned above, "ï" should be converted to "I". Does anyone know whether this bug is complicated to fix? I imagine there is a list of conversion pairs somewhere that just needs to be updated.
The list is part of Extension:AntiSpoof. The file is http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/AntiSpoof/maintenance/equivset.in
In general you should use ccnorm() on both items in a comparison. i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")
(In reply to comment #5)
> In general you should use ccnorm() on both items in a comparison.
>
> i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")

In that case, should this bug be closed as invalid, or are some changes needed in equivset.in?
(In reply to comment #5)
> In general you should use ccnorm() on both items in a comparison.
>
> i.e. you should use ccnorm(new_wikitext) contains ccnorm("Vandalism")

But then we cannot use "|" in the regular expression, since
ccnorm( 'ab|cd|...|yz' ) === 'AB1CD1...1YZ'
*** Bug 56189 has been marked as a duplicate of this bug. ***
l is a lookalike character to 1, O is a lookalike to 0 and S is a lookalike to 5. This isn't going to be changed in the AbuseFilter extension, since it just uses the normalizations from AntiSpoof. If there are characters that aren't being normalized properly, file bugs against AntiSpoof.
Legoktm: I think you are misunderstanding the bug. Yes, the characters are supposed to be mapped, but they are being mapped in the wrong direction.
This bug is actually trivial to fix. It looks like there are a handful of entries in AntiSpoof/equivset.php that have the index and key reversed. Among them, the 4 mentioned in the bug summary: I, L, O, and S. The only problem with fixing this bug is it will temporarily break some AbuseFilter filters that were previously working around this bug.
Sorry, meant to say 'value and key reversed' not 'index and key reversed' :)
I'm not sure I agree with that, but shouldn't this bug be in the AntiSpoof component then? If someone does want to switch them, they're going to need to somehow coordinate regenerating the spoof user table on every wiki + centralauth as well as updating any abusefilter rule on every wiki that uses ccnorm.
(In reply to comment #13)
> I'm not sure I agree with that, but shouldn't this bug be in the AntiSpoof
> component then?
>

Mid-air collision. Thanks for moving it.
Legoktm: Sorry for suggesting you didn't understand the bug. You were totally right that the bug was in AntiSpoof, not AbuseFilter. I was originally thinking that Abusefilter was reversing all the mappings, but it turns out that AntiSpoof/equivset.php is just a big mess. Not sure how it got into such a sorry state. Luckily, it looks like AntiSpoof doesn't actually use equivset.php itself, which probably means that CentralAuth is OK as well. Fixing all the AbuseFilter rules is going to be a mess though. It's sad such a trivial-to-fix bug was allowed to languish for 4 years leading to so many workarounds.
Change 92057 had a related patch set uploaded by Kaldari: Make sure AntiSpoof mappings are mapping in the correct direction. For example, 5 should map to S, not the other way around. Also correcting some lowercase to uppercase mappings. There are a lot more mappings in equivset.in that need to be fixed but this https://gerrit.wikimedia.org/r/92057
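To illustrate what "mapping in the correct direction" means for filter authors, here is a small Python sketch. It is not the patch itself: `ccnorm_fixed` and its table are hypothetical, only the 5 to S pair is stated in the change description, and the 1 to I and 0 to O pairs are assumptions that follow the same idea.

```python
import re

# Hypothetical corrected direction: lookalike digits fold into letters.
# "5" -> "S" is from the change description; "0" -> "O" and "1" -> "I"
# are assumed for illustration and are not taken from the real table.
FIXED_MAP = str.maketrans({"5": "S", "0": "O", "1": "I"})


def ccnorm_fixed(text: str) -> str:
    """Toy model of ccnorm() after the direction of the mappings is fixed."""
    return text.upper().translate(FIXED_MAP)


words = "TESTING|VANDALIZING"  # the natural pattern from the original report

for sample in ("I'm testing here.", "I'm te5t1ng here.", "I'm vandalizing!"):
    print(sample, "->", bool(re.search(words, ccnorm_fixed(sample))))

# Filters that worked around the old behaviour stop matching:
print(bool(re.search("TE5T1NG", ccnorm_fixed("testing"))))  # False
```

The last line shows the migration cost mentioned earlier in this thread: patterns written in the old workaround form would need to be rewritten once the direction changes.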
Kaldari, *thank you* for working on this!

(In reply to comment #15)
> It's sad such a trivial-to-fix bug was
> allowed to languish for 4 years leading to so many workarounds.

Indeed. The AbuseFilter component has not received much (systematic) love in the last 4-5 years. I'm sure there's plenty of such low-hanging fruit that could be found with some bug triaging by people who know the underlying PHP/AntiSpoof magic/libraries a bit.
I don't really understand this bug. Surely the real issue is that the output of ccnorm() is exposed to the user, rather than being hidden in some opaque object? There's no way you could generate a canonical form which would make sense for readers of every language.

Wouldn't the use case described in the original report be better served by syntax along the lines of:

added_lines cclike ["testing", "vandalizing"]

rather than requiring the user to look up a table of confusable characters in git and convert each character by hand or something?
We need to use regex syntax ([abc], [^abc] ^, $, ?, etc). See e.g. the discussion (in Portuguese) about https://pt.wikipedia.org/wiki/Special:AbuseFilter/18 at https://pt.wikipedia.org/wiki/WP:Filtro_de_edições/18#Filtro_112_quebrou.21 PS: I use the [[Special:AbuseFilter/tools]] to get the value of ccnorm('foobar') to put into the regex
(In reply to comment #19)
> We need to use regex syntax ([abc], [^abc] ^, $, ?, etc).

Well, how about

added_lines cclike "testing|vandalizing"

Where the regex would be tokenized and reassembled, with alphabetic parts normalised with equivset?
(In reply to comment #20)
> (In reply to comment #19)
> > We need to use regex syntax ([abc], [^abc] ^, $, ?, etc).
>
> Well, how about
>
> added_lines cclike "testing|vandalizing"
>
> Where the regex would be tokenized and reassembled, with alphabetic parts
> normalised with equivset?

Seems good, if that is feasible.
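A rough Python sketch of the "tokenize and reassemble" idea may help judge feasibility. Everything here is hypothetical: `cclike` does not exist in AbuseFilter, `ccnorm_toy` is a five-entry stand-in for the real equivalence set, and the tokenizer only distinguishes backslash escapes, runs of ASCII letters, and everything else.

```python
import re

# Toy stand-in for ccnorm(), using only mappings quoted in this thread.
TOY_MAP = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5", "|": "1"})


def ccnorm_toy(text: str) -> str:
    return text.upper().translate(TOY_MAP)


# A token is either a backslash escape (kept as-is, so \b stays \b)
# or a run of ASCII letters (normalised). Anything else is untouched.
TOKEN = re.compile(r"\\.|[A-Za-z]+", re.DOTALL)


def normalize_pattern(pattern: str) -> str:
    """Normalise the literal letters of a regex, leaving its syntax alone."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        return token if token.startswith("\\") else ccnorm_toy(token)
    return TOKEN.sub(repl, pattern)


def cclike(text: str, pattern: str) -> bool:
    """Hypothetical 'cclike': compare normalised text to a normalised regex."""
    return re.search(normalize_pattern(pattern), ccnorm_toy(text)) is not None


print(normalize_pattern("testing|vandalizing"))
print(cclike("I'm testing here.", "testing|vandalizing"))
print(cclike("Nothing to see", r"\btest(ing)?\b"))
```

The sketch deliberately ignores character classes such as `[a-z]`, named groups and inline flags; a real implementation would need a proper regex parser for those.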
Is it possible right now to do something like:

words := "testing|vandalizing";
ccnorm(added_lines) rlike ccnorm(words)

That would at least make the filters a lot more readable (until something like Tim's solution is implemented).
(In reply to comment #22)

The problem is that

ccnorm( '\\{}()[].?*+-^$|1iI' ) === '\\{}()[].?*+-^$1111'

so it is not possible to distinguish between the letters "i", "I" etc... and the regex symbol "|".
(In reply to Helder from comment #23)
> so it is not possible to distinguish between the letters "i", "I" etc... and
> the regex symbol "|".

That's not really a problem. The point is that to compare two things they must *both* be normalised; you can't just assume you know what the normalised form is. So if not

> words :="testing|vandalizing";
> ccnorm(added_lines) rlike ccnorm(words)

then

words := ccnorm("testing") + "|" + ccnorm("vandalizing");
ccnorm(added_lines) rlike words

as long as they're not apples and oranges.
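The difference between normalising the whole pattern and normalising each alternative can be checked with a short Python sketch. `ccnorm_toy` is a hypothetical stand-in that covers only the mappings quoted in this thread, including "|" becoming "1".

```python
import re

# Toy stand-in for ccnorm(), using only mappings quoted in this thread.
TOY_MAP = str.maketrans({"I": "1", "L": "1", "O": "0", "S": "5", "|": "1"})


def ccnorm_toy(text: str) -> str:
    return text.upper().translate(TOY_MAP)


words = ["testing", "vandalizing"]

# Normalising the joined pattern destroys the "|" separator.
whole = ccnorm_toy("|".join(words))

# Normalising each alternative first keeps the regex syntax intact.
per_word = "|".join(ccnorm_toy(word) for word in words)

text = ccnorm_toy("I'm testing here.")

print(whole)     # TE5T1NG1VANDA11Z1NG
print(per_word)  # TE5T1NG|VANDA11Z1NG
print(bool(re.search(whole, text)))     # False
print(bool(re.search(per_word, text)))  # True
```

This only works while the alternatives are plain words; any regex syntax inside an alternative would still be mangled by the normalisation.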
Comment 20, 22, 24: split to bug 63242 "ccnorm revamp: add a more sensible interface for normalised comparison".