Last modified: 2012-11-27 13:08:17 UTC
An increasingly common form of abuse consists of vandals and trolls registering new accounts that "look like" other users' accounts, by using characters that look like other characters. For example, "l" may be used instead of "I", or an acute-accented "i" used instead of an ordinary one. These accounts can cause no end of trouble by being used to conceal other kinds of mischief, or to get the impersonated user into trouble. It is very difficult to tell these apart without detailed inspection, and the software at present has no idea of visual similarity between usernames. Proposed solution: Keep a homograph character table, and for each new username, canonicalize it by applying the homograph table to it. Then compare this canonicalized version of the name with a pre-existing list of canonicalized usernames, and block it if it occurs in that list. In this way, registering a username will block the registration of other "confusingly similar" usernames. The good news is that the heavy lifting for this work has already been performed as part of trying to close the same spoofing hole for internationalized domain names, and homograph lists have already been compiled as part of this work. E-mail me if you want me to dig out the lists; I don't have links to them to hand on this machine.
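A minimal sketch of the proposal above, in Python. The table here is a tiny hand-picked illustration and not a real homograph list, and the names `HOMOGRAPHS`, `canonicalize` and `UsernameRegistry` are hypothetical:

```python
# Illustrative homograph table only; a real table would be far larger.
HOMOGRAPHS = {
    "l": "I",       # lowercase L looks like capital I
    "1": "I",       # digit one looks like capital I
    "0": "O",       # digit zero looks like capital O
    "\u00ed": "i",  # i with acute accent looks like plain i
}


def canonicalize(username):
    """Replace every character that has a table entry with its canonical form."""
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in username)


class UsernameRegistry:
    """Keeps canonical forms of registered names and blocks look-alikes."""

    def __init__(self):
        self._by_canonical = {}  # canonical form -> first name registered

    def register(self, username):
        """Return (True, None) on success, or (False, conflicting_name)."""
        key = canonicalize(username)
        existing = self._by_canonical.get(key)
        if existing is not None:
            return False, existing
        self._by_canonical[key] = username
        return True, None
```

With this sketch, registering "Ian" first means a later attempt to register "lan" is refused, because both canonicalize to the same string.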
See the references towards the end of http://unicode.org/reports/tr36/ for a very simple example of a confusables data file; but I know that much more complete ones have been compiled elsewhere...
Here is the URL for the very nicely compiled multilingual confusables file, in what I hope is a sufficiently self-documenting format: http://unicode.org/reports/tr36/draft/confusables.txt Presumably the "official" TR36 file, and any updates, will also be in a similar format.
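For anyone wanting to load such a file, here is a rough Python sketch of a parser. It assumes the layout used by later published versions of confusables.txt (semicolon-separated fields, source and target given as space-separated hex code points, `#` starting a comment); the draft linked above may differ in detail. The function name and the sample lines are illustrative only:

```python
def parse_confusables(lines):
    """Build a dict mapping each source string to its target string.

    Assumed line layout: "<source hex> ; <target hex ...> ; <type> # comment".
    """
    mapping = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line in the assumed layout
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


# Hand-written sample in the assumed layout, not copied from the real file.
SAMPLE = """\
# illustrative lines only
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
006D ; 0072 006E ; MA # LATIN SMALL LETTER M -> r, n
"""

table = parse_confusables(SAMPLE.splitlines())
```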
During a vandal attack on a MediaWiki installation I run, the vandal used Cyrillic lookalikes to impersonate an administrator. No amount of visual scrutiny would have revealed anything, since typically Cyrillic glyphs are copied from the Latin lookalikes. Fortunately this is also covered in the confusables table.
*** Bug 3313 has been marked as a duplicate of this bug. ***
I just want to add that many Cyrillic letters look the same as letters in Latin script, so confusion is possible. The letters are "A B C E H J K M O P T X a c e j o p x" as opposed to "%D0%90 %D0%92 %D0%A1 %D0%95 %D0%9D %D0%88 %D0%9A %D0%9C %D0%9E %D0%A0 %D0%A2 %D0%A5 %D0%90 %D1%81 %D0%B5 %D1%98 %D0%BE %D1%80 %D1%85" (as shown in the nav-bar). They are all the same, except for one pair, which is extremely similar.
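As an illustration, the letter pairs listed above can be turned into a translation table in Python. This is only a sketch: the names are hypothetical, and it assumes the counterpart of lowercase "a" is U+0430 (which would be %D0%B0 in the percent-encoded form):

```python
# Latin letters from the list above and their Cyrillic look-alikes,
# in the same order. The lowercase "a" entry is assumed to be U+0430.
LATIN = "ABCEHJKMOPTXacejopx"
CYRILLIC = (
    "\u0410\u0412\u0421\u0415\u041d\u0408\u041a\u041c\u041e\u0420\u0422\u0425"
    "\u0430\u0441\u0435\u0458\u043e\u0440\u0445"
)
CYRILLIC_TO_LATIN = str.maketrans(CYRILLIC, LATIN)


def fold_cyrillic_lookalikes(name):
    """Replace each listed Cyrillic letter with its Latin look-alike."""
    return name.translate(CYRILLIC_TO_LATIN)


def contains_cyrillic_lookalike(name):
    """True if the name uses any of the listed Cyrillic letters."""
    return any(ch in CYRILLIC for ch in name)
```

A name such as "\u0410dmin" (Cyrillic capital A followed by Latin "dmin") folds to plain "Admin", which is what makes the impersonation detectable.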
*** This bug has been marked as a duplicate of 1524 ***
People are/were discussing this at bug 1524, but this remains a separate issue. It took me forever to find this by searching, since it was closed.
*** Bug 3982 has been marked as a duplicate of this bug. ***
Created attachment 2347 [details] Python code for filtering usernames Here's some Python code to canonicalize user names to reject most spoofing attacks. The program also returns an error status if the username is malformed, for example by containing non-script characters, or mixing two incompatible scripts. The general idea is to keep a canonicalized version of each username in another table, and, when registering a new username, look up the canonicalized username to see if it is already registered. If it is, the user should be told that their username is too similar to an existing username, and prompted to try again. For example: "SOME USERNAME" canonicalizes to v1:50MEU5EMAME (the v1: is a version tag, in case the canonicalization code ever changes). The same canonical string will be generated for "some username", "SOME USERNAME!!!!!", "S0ME U5ERNAME", and so on... I can easily add other filters, so that, for example, "Some Username5" canonicalizes to the same string as "Some Username 4", and "Bad, bad user" would canonicalize to the same string as "Bad, bad, bad user". This version of the code is a bit aggressive, as it assumes that labels can be in any one script, so E, H, and N are currently considered equivalent because of the need for transitivity between different cases of different scripts: if usernames can be restricted to a small subset of possible scripts, some of the more aggressive canonicalization can be relaxed, and E, H, and N can again be distinguished. Preliminary testing shows that this code appears to have a false-positive rate of under 1% on random plausible names, which is probably acceptable.
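To make the example concrete, here is a toy Python sketch that reproduces the canonical string quoted above. It is not the attached code: the three folding rules (drop spaces and punctuation, fold "RN" to "M", fold "S" to "5" and "O" to "0") are inferred from that one example, and it handles ASCII only, with no script checks:

```python
import re

VERSION_TAG = "v1"  # bumped whenever the canonicalization rules change

# Rules inferred from the single worked example; illustrative only.
MULTI_CHAR_FOLDS = [("RN", "M")]             # "rn" can be mistaken for "m"
SINGLE_CHAR_FOLDS = str.maketrans({"S": "5", "O": "0"})


def canonicalize(username):
    """Return a version-tagged canonical form of an ASCII username."""
    text = username.upper()
    text = re.sub(r"[^0-9A-Z]", "", text)    # drop spaces and punctuation
    for source, target in MULTI_CHAR_FOLDS:
        text = text.replace(source, target)
    text = text.translate(SINGLE_CHAR_FOLDS)
    return f"{VERSION_TAG}:{text}"
```

Because the version tag is part of the stored string, canonical forms produced by an older rule set can be told apart from newer ones and regenerated if the rules change.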
Oh, and I should mention, just in case you're not reading the code, that it works on a vast number of scripts.
Created attachment 2348 [details] Python code for filtering usernames Murphy's law in action: the example I gave in the attachment comment is an edge case that didn't get tested properly: now fixed.
Created attachment 2354 [details] Experimental language-code-to-script-code mapping This file attempts to map languages to sets of possible scripts. Where a language can be written in more than one script, all known script codes are included. Where an example character does not have a script code, it is output as U+XXXX.
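A small Python sketch of what such a mapping might look like. The entries are a handful of well-known examples using ISO 15924 script codes, not the contents of the attachment, and the script lookup by Unicode character-name prefix is a rough stand-in (Python's standard library has no script property), so all names here are hypothetical:

```python
import unicodedata

# Illustrative entries only: language code -> set of ISO 15924 script codes.
LANGUAGE_SCRIPTS = {
    "en": {"Latn"},
    "ru": {"Cyrl"},
    "sr": {"Cyrl", "Latn"},          # written in both scripts
    "ja": {"Hani", "Hira", "Kana"},
}

# Rough approximation: guess the script from the character-name prefix.
NAME_PREFIX_TO_SCRIPT = {"LATIN": "Latn", "CYRILLIC": "Cyrl", "GREEK": "Grek"}


def scripts_for(language_code):
    """Return the known scripts for a language, or an empty set."""
    return LANGUAGE_SCRIPTS.get(language_code, set())


def script_label(ch):
    """Return a script code, or U+XXXX when no script code is known."""
    prefix = unicodedata.name(ch, "").split(" ", 1)[0]
    return NAME_PREFIX_TO_SCRIPT.get(prefix, f"U+{ord(ch):04X}")


def repertoire(sample_text):
    """Collect the script labels seen in a sample of text."""
    return {script_label(ch) for ch in sample_text if not ch.isspace()}
```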
Created attachment 2363 [details] Experimental language-code-to-script-code mapping; Now with 79 more script repertoires, based on analyzing the wikipedia.org front page
Created attachment 2369 [details] Python code for filtering usernames, v0.3 Now uses stdin/stdout for input and output, thus allowing for batch conversion and freeing the command line up for later addition of option flags.
Created attachment 2370 [details] Python code for filtering usernames, v0.4 Now with exception handling, just in case of nasty attacks (e.g. BiDi violations) intended to blow up the low-level Unicode-processing code.
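The batch-filter shape described in the last two attachments might look roughly like this in Python. This is a sketch, not the attached code: `canonicalize` is a trivial stand-in, and the `OK`/`ERROR` output format is an assumption made for illustration:

```python
import unicodedata


def canonicalize(name):
    """Stand-in canonicalizer; the real one would do far more."""
    if not name.strip():
        raise ValueError("empty username")
    return unicodedata.normalize("NFKC", name).casefold()


def process(lines, out):
    """Canonicalize one username per input line, writing one result per line.

    Any exception raised for a single name is reported on that name's
    output line instead of aborting the whole batch.
    Returns the number of names that failed.
    """
    failures = 0
    for line in lines:
        name = line.rstrip("\n")
        try:
            out.write(f"OK\t{canonicalize(name)}\n")
        except Exception as exc:  # deliberately broad: input is hostile
            failures += 1
            out.write(f"ERROR\t{type(exc).__name__}: {exc}\n")
    return failures
```

Calling `process(sys.stdin, sys.stdout)` from a script gives the stdin/stdout behaviour, and keeps the command line free for option flags.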
I've translated Neil's code to PHP, committed in r16555. Can build an extension around that to check on account creation. Currently there are some lazy and inefficient bits; it runs about 30% slower than the Python version on the set of usernames from meta.wikimedia.org, but that's plenty fast for the individual checking, a smidge under a millisecond per name on a 2 GHz G5. (Live check will just be a single name munging and a DB lookup.)
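The live check (one name munging plus one DB lookup) could be sketched as follows. This is Python with an in-memory SQLite database purely for illustration; it is not the PHP code from r16555, and the table name, column names and stand-in `canonicalize` are all hypothetical:

```python
import sqlite3


def canonicalize(name):
    """Stand-in for the real canonicalizer."""
    return "".join(name.split()).casefold()


def open_store():
    """Create an in-memory table keyed by canonical form."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE spoof_names ("
        "canonical TEXT PRIMARY KEY, username TEXT NOT NULL)"
    )
    return db


def try_register(db, username):
    """Return None on success, or the existing name that conflicts."""
    key = canonicalize(username)
    row = db.execute(
        "SELECT username FROM spoof_names WHERE canonical = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]
    db.execute(
        "INSERT INTO spoof_names (canonical, username) VALUES (?, ?)",
        (key, username),
    )
    db.commit()
    return None
```

Keying the table on the canonical form means the lookup is a single indexed query, however many names are registered.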
There are false positive problems with the existing code which need a more careful second pass to check strings which match the initial checks. Code to follow...
Created attachment 2699 [details] New confusables equivalence sets file, generated from UTR#39 confusables.txt Note: this file is encoded in UTF-8, and contains exotic characters, many of which may display as spaces or not at all: beware! This is a transitive closure of the single-character to single-character mappings within UTR #39's confusables.txt file. Remember to normalize strings before applying these mappings...
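For readers unfamiliar with the term, a transitive closure of pairwise mappings can be computed with a union-find structure. The Python below is a sketch of that general technique, not the code used to generate the attachment; the function name and sample pairs are illustrative:

```python
def equivalence_sets(pairs):
    """Group characters into sets, merging any two linked by a chain of pairs."""
    parent = {}

    def find(x):
        """Return the representative of x's set, shortening paths on the way."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    groups = {}
    for ch in list(parent):
        groups.setdefault(find(ch), set()).add(ch)
    # Sort for stable, readable output.
    return sorted((sorted(group) for group in groups.values()),
                  key=lambda group: group[0])
```

Given the pairs ("l", "I") and ("I", "1"), all three characters end up in one set even though "l" and "1" were never paired directly; that is what the closure adds.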
Created attachment 2700 [details] Some extra confusables (UTF-8 format text file) Some extra confusables that are not in UTR#39, spotted by eye.
Created attachment 2702 [details] Some extra confusables, v2 (UTF-8 format text file) A second version of the above...
Created attachment 2703 [details] New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras Note: this file is encoded in UTF-8, and contains exotic characters, many of which may display as spaces or not at all: beware! This is a transitive closure of the single-character to single-character mappings within UTR #39's confusables.txt file, combined with my extra_confusables.txt file. Remember to normalize strings before applying these mappings...
Note: some letterforms are confusable with more than one other letterform, but these other letterforms are not confusable with each other. This should be taken into account in later, more sophisticated, versions of this code.
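One way a later version might take this into account is a second pass that compares two names position by position using only directly listed pairs, rather than the merged equivalence sets. A hedged Python sketch, with a made-up pair list in which "1" and "I" are deliberately not paired with each other:

```python
# Illustrative direct pairs only; whether any real table agrees is not claimed.
CONFUSABLE_PAIRS = {
    frozenset(pair) for pair in [("1", "l"), ("l", "I"), ("O", "0")]
}


def chars_confusable(a, b):
    """True if the characters are identical or directly listed as a pair."""
    return a == b or frozenset((a, b)) in CONFUSABLE_PAIRS


def names_confusable(first, second):
    """Second-pass check: every position must match or be directly confusable."""
    return len(first) == len(second) and all(
        chars_confusable(a, b) for a, b in zip(first, second)
    )
```

Under this toy table "lan" conflicts with both "Ian" and "1an", but "1an" and "Ian" do not conflict with each other, which a purely transitive table cannot express.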
Created attachment 2704 [details] Python code for creating equivalence sets of characters
This was poked, prodded, converted and ported into the AntiSpoof extension, available in Subversion.
Not sure if this comment belongs against this bug but I have userid "Lar" on many WMF wikis. I recently started having trouble registering this userid on new wikis as a conflict with user "Iar"... based on discussion on #mediawiki it was suggested that this is because the software sees uppercase I and lowercase L as similar, and that's tripping me up. I'm not sure how to get around that best, but it's a nuisance to have to contact each wiki admin separately. See Neil Harris's comment of 11-14 01:29 which perhaps alludes to this... presumably once WMF wikis have SUL this goes away?
It would no longer be a problem for existing users, but it would still be a problem for people signing up for a WMF account for the first time, so it's still undesirable. See bug 8257.
I have a question to ask here: in different languages, can the same characters be identified by different names? Does the Python code take care of this? Can this thread be closed?
Anu: This report/"thread" has been closed as RESOLVED FIXED six years ago already, and MediaWiki does not use Python code here. Please refrain from commenting on this ticket - thanks. :)