Last modified: 2012-11-27 13:08:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T4290, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 2290 - Disallow usernames that are too similar to existing names (confusables, impersonation)



Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	User login and signup
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement with 3 votes
Target Milestone:	---
Assigned To:	Neil Harris

URL:
Whiteboard:
Keywords:

Duplicates:	3313 3982
Depends on:
Blocks:	unicode 3985

Reported:	2005-06-02 13:24 UTC by Neil Harris
Modified:	2012-11-27 13:08 UTC
CC List:	6 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Python code for filtering usernames (89.45 KB, text/plain) 2006-09-14 12:46 UTC, Neil Harris
Python code for filtering usernames (89.59 KB, text/plain) 2006-09-14 14:10 UTC, Neil Harris
Experimental language-code-to-script-code mapping (2.56 KB, text/plain) 2006-09-14 23:02 UTC, Neil Harris
Experimental language-code-to-script-code mapping; (4.03 KB, text/plain) 2006-09-15 21:55 UTC, Neil Harris
Python code for filtering usernames, v0.3 (89.96 KB, text/plain) 2006-09-18 08:00 UTC, Neil Harris
Python code for filtering usernames, v0.4 (90.14 KB, text/plain) 2006-09-18 08:42 UTC, Neil Harris
New confusables equivalence sets file, generated from UTR#39 confusables.txt (33.90 KB, text/plain) 2006-11-14 00:40 UTC, Neil Harris
Some extra confusables (UTF-8 format text file) (291 bytes, text/plain) 2006-11-14 01:06 UTC, Neil Harris
Some extra confusables, v2 (UTF-8 format text file) (325 bytes, text/plain) 2006-11-14 01:21 UTC, Neil Harris
New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras (33.65 KB, text/plain) 2006-11-14 01:25 UTC, Neil Harris
Python code for creating equivalence sets of characters (1.33 KB, text/plain) 2006-11-14 01:32 UTC, Neil Harris

Description Neil Harris 2005-06-02 13:24:36 UTC

A more and more common form of abuse consists of vandals and trolls registering
new accounts that "look like" other users' accounts, by using characters that
look like other characters. For example, "l" may be used instead of "I", or an
acute-accented 'i' used instead of an ordinary one. These accounts can cause no
end of trouble by being used to conceal other kinds of mischief, or to get the
impersonated user into trouble. It is very difficult to tell these apart without
detailed inspection, and the software at present has no idea of visual
similarity between usernames.

Proposed solution:

Keep a homograph character table, and for each new username, canonicalize it by
applying the homograph table to it. Then compare this canonicalized version of
the name with a pre-existing list of canonicalized usernames, and block it if it
occurs in that list. In this way, registering a username will block the
registration of other "confusingly similar" usernames.
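
A minimal sketch of the scheme proposed above, assuming a tiny hand-written homograph table. The table entries and the names HOMOGRAPHS, canonicalize and UsernameRegistry are illustrative assumptions, not the real confusables data or any MediaWiki code:

```python
import unicodedata

# Illustrative homograph table (an assumption, not real confusables data).
# Each key is folded to the value on its right.
HOMOGRAPHS = {
    "l": "i",       # lowercase L looks like uppercase I
    "1": "i",
    "0": "o",
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def canonicalize(name):
    """Fold a username to a form that look-alike names share."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore case, then apply the homograph table character by character.
    return "".join(HOMOGRAPHS.get(ch, ch) for ch in stripped.casefold())


class UsernameRegistry:
    """Keeps canonicalized names and refuses confusingly similar ones."""

    def __init__(self):
        self._taken = {}  # canonical form -> name that registered it

    def register(self, name):
        key = canonicalize(name)
        if key in self._taken:
            return False  # too similar to an existing username
        self._taken[key] = name
        return True
```

With this table, registering "Jimbo" would block "Jimb0", an acute-accented "Jímbo", and a "Jimbo" spelled with a Cyrillic "о", while unrelated names still register normally.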

The good news is that the heavy lifting for this work has already been
performed as part of trying to close the same spoofing hole for
internationalized domain names, and homograph lists have already been compiled
as part of this work. E-mail me if you want me to dig out the lists; I don't
have links to them to hand on this machine.

Comment 1 Neil Harris 2005-06-02 13:29:44 UTC

See the references towards the end of http://unicode.org/reports/tr36/ for a
very simple example of confusables data file; but I know that much more complete
ones have been compiled elsewhere...

Comment 2 Neil Harris 2005-06-04 00:43:37 UTC

Here is the URL for the very nicely compiled multilingual confusables file, in
what I hope is a sufficiently self-documenting format:

http://unicode.org/reports/tr36/draft/confusables.txt

Presumably the "official" TR36 file, and any updates, will also be in a similar
format.

Comment 3 Zhen Lin 2005-06-06 01:38:00 UTC

During a vandal attack on a MediaWiki installation I run, the vandal used
Cyrillic lookalikes to impersonate an administrator. No amount of visual
scrutiny would have revealed anything, since typically Cyrillic glyphs are
copied from the Latin lookalikes. Fortunately this is also covered in the
confusables table.

Comment 4 Zigger 2005-09-06 13:30:59 UTC

*** Bug 3313 has been marked as a duplicate of this bug. ***

Comment 5 Filip Maljkovic [Dungodung] 2005-09-06 16:07:11 UTC

I just want to add that many Cyrillic letters look the same as letters in the Latin
script, so confusion is possible. The letters are "A B C E H J K M O P T X a c
e j o p x" as opposed to "%D0%90 %D0%92 %D0%A1 %D0%95 %D0%9D %D0%88 %D0%9A
%D0%9C %D0%9E %D0%A0 %D0%A2 %D0%A5 %D0%B0 %D1%81 %D0%B5 %D1%98 %D0%BE %D1%80
%D1%85" (as shown in the nav-bar). They are all the same, except for one pair,
which is extremely similar.

Comment 6 Ævar Arnfjörð Bjarmason 2005-10-07 20:55:47 UTC


*** This bug has been marked as a duplicate of 1524 ***

Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-04 15:12:20 UTC

People are/were discussing this at bug 1524, but this remains a separate issue.
It took me forever to find this by searching, since it was closed.

Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-04 15:12:47 UTC

*** Bug 3982 has been marked as a duplicate of this bug. ***

Comment 9 Neil Harris 2006-09-14 12:46:21 UTC

Created attachment 2347 [details]
Python code for filtering usernames

Here's some Python code to canonicalize user names to reject most spoofing
attacks. The program also returns an error status if the username is malformed,
for example by containing non-script characters, or mixing two incompatible
scripts.
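
As a rough illustration of the mixed-script check, here is a sketch that approximates a character's script by the first word of its Unicode name. That shortcut is an assumption made for brevity (Python's standard library does not expose the Unicode Script property), and the function names are hypothetical; this is not the logic of the attached code:

```python
import unicodedata


def scripts_used(name):
    """Return the set of (approximate) scripts of the letters in a name."""
    scripts = set()
    for ch in name:
        if ch.isalpha():
            # e.g. "CYRILLIC CAPITAL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(name):
    """True if the letters of the name come from more than one script."""
    return len(scripts_used(name)) > 1
```

This sketch treats every pair of scripts as incompatible, which is stricter than the rule described above; a real check would need a list of script combinations that legitimately occur together.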

The general idea is to keep a canonicalized version of each username in another
table, and, when registering a new username, look up the canonicalized username
to see if it is already registered. If it is, the user should be told that
their username is too similar to an existing username, and prompted to try
again.

For example: 

"SOME USERNAME" canonicalizes to v1:50MEU5EMAME (the v1: is a version tag, in
case the canonicalization code ever changes). The same canonical string will be
generated for "some username", "SOME USERNAME!!!!!", "S0ME U5ERNAME", and so
on... 
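
To make the example concrete, here is a toy canonicalizer whose rules were chosen only so that it reproduces the output quoted above. The three rules (S to 5, O to 0, RN to M) are guesses inferred from that one example, and the real attachment covers far more characters:

```python
VERSION_TAG = "v1"  # bumped whenever the canonicalization rules change

# Toy rules inferred from the single example above (an assumption).
SEQUENCE_RULES = [("RN", "M")]                    # "rn" can look like "m"
CHAR_RULES = str.maketrans({"S": "5", "O": "0"})  # letter/digit look-alikes


def canonicalize(name):
    """Return a version-tagged canonical form of a username."""
    # Ignore case, spacing and punctuation.
    kept = "".join(ch for ch in name.upper() if ch.isalnum())
    # Collapse multi-character look-alikes, then single characters.
    for sequence, replacement in SEQUENCE_RULES:
        kept = kept.replace(sequence, replacement)
    return VERSION_TAG + ":" + kept.translate(CHAR_RULES)
```

Storing the tag with each canonical string means that strings produced by an older rule set can be recognised and regenerated if the rules change.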

I can easily add other filters, so that, for example, "Some Username5"
canonicalizes to the same string as "Some Username 4", and "Bad, bad user"
would canonicalize to the same string as "Bad, bad, bad user".

This version of the code is a bit aggressive, as it assumes that labels can be
in any one script, so E, H, and N are currently considered equivalent because
of the need for transitivity between different cases of different scripts: if
usernames can be restricted to a small subset of possible scripts, some of the
more aggressive canonicalization can be relaxed, and E, H, and N can again be
distinguished.

Preliminary testing shows that this code appears to have a false-positive rate
of under 1% on random plausible names, which is probably acceptable.

Comment 10 Neil Harris 2006-09-14 12:47:59 UTC

Oh, and I should mention, just in case you're not reading the code, that it
works on a vast number of scripts.

Comment 11 Neil Harris 2006-09-14 14:10:43 UTC

Created attachment 2348 [details]
Python code for filtering usernames

Murphy's law in action: the example I gave in the attachment comment is an edge
case that didn't get tested properly. Now fixed.

Comment 12 Neil Harris 2006-09-14 23:02:24 UTC

Created attachment 2354 [details]
Experimental language-code-to-script-code mapping

This file attempts to map languages to sets of possible scripts. Where a
language can be written in more than one script, all known script codes are
included.

Where an example character does not have a script code, it is output as U+XXXX.
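
A sketch of how such a mapping might be represented and used to restrict the scripts a wiki accepts. The four entries are illustrative and are not taken from the attachment; the codes are ISO 15924 script codes, and the names LANGUAGE_SCRIPTS and allowed_scripts are hypothetical:

```python
# Illustrative entries only: language code -> set of ISO 15924 script codes.
LANGUAGE_SCRIPTS = {
    "en": {"Latn"},
    "ru": {"Cyrl"},
    "sr": {"Cyrl", "Latn"},          # Serbian is written in both scripts
    "ja": {"Hani", "Hira", "Kana"},  # kanji, hiragana and katakana
}


def allowed_scripts(language_codes):
    """Union of the scripts used by the given languages."""
    allowed = set()
    for code in language_codes:
        # Unknown language codes contribute no scripts.
        allowed |= LANGUAGE_SCRIPTS.get(code, set())
    return allowed
```

Restricting a wiki to the scripts of its own languages would allow the canonicalization to merge fewer characters, as suggested in comment 9.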

Comment 13 Neil Harris 2006-09-15 21:55:47 UTC

Created attachment 2363 [details]
Experimental language-code-to-script-code mapping;

Now with 79 more script repertoires, based on analyzing the wikipedia.org front
page

Comment 14 Neil Harris 2006-09-18 08:00:11 UTC

Created attachment 2369 [details]
Python code for filtering usernames, v0.3

Now uses stdin/stdout for input and output, thus allowing for batch conversion
and freeing the command line up for later addition of option flags.

Comment 15 Neil Harris 2006-09-18 08:42:27 UTC

Created attachment 2370 [details]
Python code for filtering usernames, v0.4

Now with exception handling, just in case of nasty attacks (e.g. BiDi
violations) intended to blow up the low-level Unicode-processing code.

Comment 16 Brion Vibber 2006-09-19 11:01:55 UTC

I've translated Neil's code to PHP, committed in r16555.

Can build an extension around that to check on account creation.

Currently there are some lazy and inefficient bits; it runs about 30% slower than the Python 
version on the set of usernames from meta.wikimedia.org, but that's plenty fast for the individual 
checking, a smidge under a millisecond per name on a 2 GHz G5. (Live check will just be a single 
name munging and a DB lookup.)
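
The "one munging plus one DB lookup" live check could look roughly like the sketch below. It uses an in-memory SQLite database so that it is self-contained; the table name, column names and the placeholder canonicalize function are assumptions, not the schema or code of the eventual extension:

```python
import sqlite3


def canonicalize(name):
    # Placeholder munging: ignore case, spacing and punctuation.
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def open_db():
    conn = sqlite3.connect(":memory:")
    # The primary key gives an indexed lookup on the canonical form.
    conn.execute(
        "CREATE TABLE spoof_keys ("
        "canonical TEXT PRIMARY KEY, username TEXT NOT NULL)"
    )
    return conn


def find_conflict(conn, name):
    """Return the existing username that the new name collides with, if any."""
    row = conn.execute(
        "SELECT username FROM spoof_keys WHERE canonical = ?",
        (canonicalize(name),),
    ).fetchone()
    return row[0] if row else None


def record(conn, name):
    """Store the canonical form of a newly registered name."""
    conn.execute(
        "INSERT INTO spoof_keys (canonical, username) VALUES (?, ?)",
        (canonicalize(name), name),
    )
```

Account creation would call find_conflict first and call record only when it returns None.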

Comment 17 Neil Harris 2006-11-13 01:11:51 UTC

There are false positive problems with the existing code which need a more
careful second pass to check strings which match the initial checks. Code to
follow...

Comment 18 Neil Harris 2006-11-14 00:40:56 UTC

Created attachment 2699 [details]
New confusables equivalence sets file, generated from UTR#39 confusables.txt

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file. Remember to normalize strings
before applying these mappings...
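
For readers unfamiliar with the term, a transitive closure of pairwise mappings can be computed with a union-find structure, as in this sketch. The input format (a list of character pairs) and the function name are assumptions; parsing confusables.txt itself is not shown:

```python
def equivalence_sets(pairs):
    """Group characters into sets closed under 'is confusable with'."""
    parent = {}

    def find(x):
        # Follow parent links to the representative, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each confusable pair merges the two sets its members belong to.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for ch in list(parent):
        groups.setdefault(find(ch), set()).add(ch)
    # Sorted output keeps the result stable from run to run.
    return sorted((sorted(group) for group in groups.values()),
                  key=lambda group: group[0])
```

Given the pairs ("l", "I"), ("I", "1") and ("O", "0"), this puts "l", "I" and "1" into one set even though "l" and "1" were never listed as a pair, which is the same effect that makes E, H and N equivalent in comment 9.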

Comment 19 Neil Harris 2006-11-14 01:06:42 UTC

Created attachment 2700 [details]
Some extra confusables (UTF-8 format text file)

Some extra confusables that are not in UTR#39, spotted by eye.

Comment 20 Neil Harris 2006-11-14 01:21:59 UTC

Created attachment 2702 [details]
Some extra confusables, v2 (UTF-8 format text file)

A second version of the above...

Comment 21 Neil Harris 2006-11-14 01:25:14 UTC

Created attachment 2703 [details]
New confusables equivalence sets file v2, generated from UTR#39 confusables.txt + extras

Note: this file is encoded in UTF-8, and contains exotic characters, many of
which may display as spaces or not at all: beware!

This is a transitive closure of the single-character to single-character
mappings within UTR #39's confusables.txt file, combined with my
extra_confusables.txt file. Remember to normalize strings before applying
these mappings...

Comment 22 Neil Harris 2006-11-14 01:29:39 UTC

Note: some letterforms are confusable with more than one other letterform, but
these other letterforms are not confusable with each other. This should be taken
into account in later, more sophisticated, versions of this code.
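
One possible way to respect this, sketched under the assumption that confusability is stored as a plain symmetric relation instead of as equivalence sets. The two pairs below are made up for illustration, and the function names are hypothetical:

```python
# Made-up relation: "l" resembles both "I" and "1", but "I" and "1"
# are deliberately NOT listed as confusable with each other.
CONFUSABLE_PAIRS = {frozenset(pair) for pair in [("l", "I"), ("l", "1")]}


def chars_confusable(a, b):
    """True if the characters are identical or directly confusable."""
    return a == b or frozenset((a, b)) in CONFUSABLE_PAIRS


def names_confusable(x, y):
    """Compare two names position by position, without transitivity."""
    return len(x) == len(y) and all(
        chars_confusable(a, b) for a, b in zip(x, y)
    )
```

Here "lar" collides with both "Iar" and "1ar", but "Iar" and "1ar" do not collide with each other. The cost is that a pairwise comparison cannot be answered by a single lookup on a canonical string, so it fits better as a second pass over candidates found by the cheaper check.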

Comment 23 Neil Harris 2006-11-14 01:32:57 UTC

Created attachment 2704 [details]
Python code for creating equivalence sets of characters

Comment 24 Rob Church 2006-11-29 15:34:47 UTC

This was poked, prodded, converted and ported into the AntiSpoof extension,
available in Subversion.

Comment 25 Larry Pieniazek 2006-12-13 17:35:22 UTC

Not sure if this comment belongs against this bug, but I have userid "Lar" on
many WMF wikis. I recently started having trouble registering this userid on new
wikis because of a conflict with user "Iar". Based on discussion on #mediawiki,
it was suggested that this is because the software sees uppercase I and
lowercase L as similar, and that's tripping me up. I'm not sure how best to get
around that, but it's a nuisance to have to contact each wiki admin separately.
See Neil Harris's comment of 11-14 01:29, which perhaps alludes to this.
Presumably this goes away once WMF wikis have SUL?

Comment 26 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-12-14 03:33:54 UTC

It would no longer be a problem for existing users, but it would still be a
problem for people signing up for a WMF account for the first time, so it's
still undesirable.

See bug 8257.

Comment 27 Anu 2012-11-27 10:27:56 UTC

I have a question to ask here: in different languages, can the same characters be identified as different names? Does the Python code take care of this?

Can this thread be closed?

Comment 28 Andre Klapper 2012-11-27 13:08:17 UTC

Anu: This report/"thread" has been closed as RESOLVED FIXED six years ago already, and MediaWiki does not use Python code here. 
Please refrain from commenting on this ticket - thanks. :)
