Last modified: 2010-05-15 15:37:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3524, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 1524 - Usernames should use unicode whitelist


Summary:	Usernames should use unicode whitelist

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	User login and signup (Other open bugs)
Version:	1.5.x
Hardware:	All All

Importance:	Normal major with 2 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Duplicates:	7463 (view as bug list)
Depends on:
Blocks:	unicode 3985 2593 12499

Reported:	2005-02-13 20:50 UTC by River Tarnell
Modified:	2010-05-15 15:37 UTC (History)
CC List:	6 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description River Tarnell 2005-02-13 20:50:31 UTC

usernames should be restricted to a whitelist of characters which includes only
valid alphanumeric characters in each language, and punctuation.  otherwise,
creating usernames (and page titles) with invalid characters will make it hard
to block vandals.

Comment 1 Brion Vibber 2005-02-13 21:35:27 UTC

*Invalid* characters (those that are illegal in XML or don't reliably cut and paste) need to be outright 
blocked in titles.

Characters that some people are simply unable to type should not be a real problem, as either there 
should be a direct 'block' link, or cut-and-paste will always be available.

I'm not really inclined to proclaim what characters are appropriate for each language, as this will make 
interoperability, writing on foreign topics, shared data, shared user accounts, global user accounts etc 
very hard and will require a lot of manual mucking about as people whine for whitelists to be updated.

Comment 2 p_simoons 2005-08-30 18:11:12 UTC

Agreed. There are to my knowledge no legit users on the English wiki that use
non-ASCII characters in their name, but it's a favorite trick of vandals and
impersonators.

Comment 3 Ævar Arnfjörð Bjarmason 2005-10-07 20:55:49 UTC

*** Bug 2290 has been marked as a duplicate of this bug. ***

Comment 4 lɛʁi לערי ריינהארט 2005-10-07 22:01:57 UTC

(In reply to comment #0)
> usernames should be restricted to a whitelist of characters which includes only
> valid alphanumeric characters in each language, and punctuation.

This requirement and single user login will conflict with the wish to use
*native* (non-Latin) alphabets in user names.

Comment 5 River Tarnell 2005-10-07 22:09:10 UTC

(In reply to comment #4) 
> (In reply to comment #0) 
> > usernames should be restricted to a whitelist of characters which includes only 
> > valid alphanumeric characters in each language, and punctuation. 
 
> This requirement and single user login will conflict with the wish to use 
> *native* (non-Latin) alphabets in user names. 
 
why?

Comment 6 Zhen Lin 2005-10-08 07:24:33 UTC

Usernames shouldn't be stored in a normalised form, however, users should not be
permitted to register names which would conflict with existing usernames, when
normalised.

Perhaps this could be achieved by adding a new field to the user table -
'username_normal' - and storing the normalised username there. Add a unique
constraint to the field, and then attempts to register a username which will
result in a collision when normalised will... well, result in a database error.
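
A minimal sketch of the idea above, written in Python with SQLite rather than MediaWiki's actual schema. The normaliser here (NFKC plus case folding) is only a stand-in assumption, and the column name 'user_name' is hypothetical; 'username_normal' is the field proposed above.

```python
import sqlite3
import unicodedata


def normalise(name: str) -> str:
    """Stand-in normaliser (assumption): NFKC followed by case folding."""
    return unicodedata.normalize("NFKC", name).casefold()


def register(conn: sqlite3.Connection, name: str) -> bool:
    """Store the name as typed; refuse it if its normalised form is taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user (user_name, username_normal) VALUES (?, ?)",
                (name, normalise(name)),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint on username_normal was violated.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user ("
    " user_name TEXT NOT NULL,"               # the name exactly as registered
    " username_normal TEXT NOT NULL UNIQUE)"  # the normalised form, must be unique
)
```

With this table, registering "Example" succeeds and a later attempt to register "EXAMPLE" is refused, while the stored name keeps its original spelling.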

Now the question is, where do we get a reasonable map of confusable characters. 
http://www.unicode.org/draft/reports/tr36/Attic/confusables.txt isn't
particularly extensive, but should work for most malicious cases. Perhaps we
should try to get a copy of the IDN normalisation map. The Unicode Consortium
has a long document about visual spoofing:
http://www.unicode.org/draft/reports/tr36/tr36.html
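
To illustrate how such a map might be applied, here is a rough sketch. The entries are a tiny hand-written sample chosen for illustration, not taken from confusables.txt or from any IDN table, and the function name 'skeleton' is hypothetical.

```python
# Hand-written sample entries (assumption): a real table would be far larger.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(name: str) -> str:
    """Case-fold the name, then replace each confusable with its stand-in."""
    folded = name.casefold()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)
```

Two names would then be treated as colliding when their skeletons are equal, so a name that swaps in a Cyrillic "а" for a Latin "a" maps to the same skeleton as the all-Latin spelling.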

Comment 7 lɛʁi לערי ריינהארט 2005-10-08 11:33:52 UTC

(In reply to comment #5)

> why?

There are many opinions about the restriction of usernames:
"Since this is the English Wikipedia, usernames ought to be constructed using
English characters, with allowances for scripts from other languages ..." from
[[en:Wikipedia_talk:Username#On_Unicode_and_other_odd_characters_in_usernames]]

Nevertheless the community's decision about this should be more tolerant. With
regard to single user login, it should be allowed to use Arabic, Cyrillic, Hebrew,
Hindu, Georgian, or whatever other alphabets.

I would not object to usernames as [[user:۞]], [[user:░]], [[User:–]] etc. The
usernames are part of personality and creativity. Whatever opinion we have on
this / how we deal with this it is *reality* that there are also usernames like
[[en:user:god]] - see [[en:user talk:god]], [[en:user:satan]],
[[en:user:antichrist]] etc.

Comment 8 lɛʁi לערי ריינהארט 2005-10-12 16:11:18 UTC

some examples related to
bug 337: inconsistent treatment of character entities and illegal characters in
titles/links

http://en.wikipedia.org/wiki/User:%E2%80%8F
http://en.wikipedia.org/wiki/Special:Contributions/%E2%80%8F
http://en.wikipedia.org/wiki/User:Gangleri/tests/bugzilla:00337#User:.26rlm.3B

Comment 9 lɛʁi לערי ריינהארט 2005-10-12 18:10:14 UTC

http://en.wikipedia.org/wiki/User:%C2%A0

is a "construct" based on
bug 2173: Fatal error when removing an article with an whitespace title from the
watchlist

Comment 10 lɛʁi לערי ריינהארט 2005-10-13 11:26:55 UTC

compare with

bug 3696: Unicode Control Characters should be restricted in title text

Comment 11 lɛʁi לערי ריינהארט 2005-10-14 14:51:31 UTC

see also

bug 2593: Non-printing characters allowed in registration

Comment 12 lɛʁi לערי ריינהארט 2005-10-26 14:52:42 UTC

(In reply to comment #6)
> Usernames shouldn't be stored in a normalised form, however, users should not be
> permitted to register names which would conflict with existing usernames, when
> normalised.

Depending on the font used, two "ו" characters can look like one "װ" character:
[[yi:user:גאַװיאַל]] and [[yi:user:גאַוויאַל]]

Comment 13 Zhen Lin 2005-10-27 10:27:33 UTC

Hmm, you could say similar things about vv and w (though generally w is
narrower)...

Comment 14 lɛʁi לערי ריינהארט 2005-11-15 21:56:37 UTC

compare with
bug 3982: Maybe...

Comment 15 lɛʁi לערי ריינהארט 2005-12-19 06:36:12 UTC

*** Bug 4312 has been marked as a duplicate of this bug. ***

Comment 16 lɛʁi לערי ריינהארט 2006-02-21 04:20:29 UTC

Is this FIXED already?

I could create a user page
http://test.wikipedia.org/wiki/User:%E2%80%AEresu_ladnav_%E2%80%AD%E2%80%AC
but I could not create such an *account*.

Please see
http://mail.wikipedia.org/pipermail/mediawiki-cvs/2006-February/013973.html
User.php,1.212,1.213 by Brion
"Blocking some Unicode whitespace characters in usernames. Should check if some
or all should be blocked from all page titles."

A block list is equivalent to a whitelist.

It might be a good idea to give feedback on why the user name entered when creating
a new account is invalid / show which Unicode character was used.
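
A rough sketch of what such feedback could look like. The blocked set below is an assumption made up of the characters that appear in the example URLs in this report; it is not the list from the commit quoted above, and the function name is hypothetical.

```python
import unicodedata

# Illustrative blocklist (assumption): characters seen in this report's URLs.
BLOCKED = {
    "\u00a0",  # NO-BREAK SPACE
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
}


def describe_blocked(name: str) -> list[str]:
    """List every blocked character in the name by code point and Unicode name."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
        for ch in name
        if ch in BLOCKED
    ]
```

An empty list means the name contains none of the blocked characters; otherwise each entry could be shown to the user on the account creation form.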

For "transparency" of wiki configuration the list of blocked characters should
be displayed.

best regards reinhardt [[user:gangleri]]

Comment 17 Rob Church 2006-02-21 11:56:22 UTC

(sigh)

Blocking != Whitelisting

The list of blocked characters is available if you look at the code and also the
relevant commit message in the mediawiki-cvs archives.

Comment 18 Neil Harris 2006-02-21 12:08:58 UTC

Here's a good way of filtering names: 
1) first, do Nameprep
2) only allow the use of characters specific to one particular writing system in
the resulting string, and a few carefully selected non-alphabetic characters
(such as space, apostrophe, and any others you want to add to the whitelist).

This is being used in IDN at the moment, and it's very successful at preventing
a very wide variety of potential abuses, such as mixed-script spoofing and the
use of exotic Unicode characters to break rendering engines.
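
The two steps above can be sketched roughly as follows. This is not the C code mentioned below and not real Nameprep: NFKC plus case folding stands in for step 1, and the script of a character is approximated from the first word of its Unicode name, since Python's unicodedata module has no script property. Digits are rejected here purely to keep the sketch short.

```python
import unicodedata

ALLOWED_EXTRA = {" ", "'"}  # assumed whitelist of non-alphabetic characters


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_acceptable(name: str) -> bool:
    # Step 1 (stand-in for Nameprep, an assumption): NFKC plus case folding.
    prepared = unicodedata.normalize("NFKC", name).casefold()
    # Step 2: every letter must come from the same (approximate) script.
    scripts = set()
    for ch in prepared:
        if ch in ALLOWED_EXTRA:
            continue
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # combining marks are accepted but not counted
        if not category.startswith("L"):
            return False  # controls, symbols, digits and so on are rejected
        scripts.add(rough_script(ch))
    return len(scripts) == 1
```

A name written entirely in Latin or entirely in Hebrew letters passes, while a Latin name with one Cyrillic letter mixed in, or a name containing a directional override character, does not.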

I happen to have some nice compact table-driven C code for doing this: mail me
if you want it.

We should file the within-script character spoofing problem as a separate bug:
as stated above, this is easily dealt with by storing a normalized form of each
name alongside the real name, and checking that no normalized form is ever
duplicated: given this, the only problem is working out the ruleset for
normalizing these strings.

Comment 19 Brion Vibber 2006-10-02 18:06:57 UTC

*** Bug 7463 has been marked as a duplicate of this bug. ***

Comment 20 Invalid Account 2006-10-03 18:34:37 UTC

I emailed Neil and he told me that there is a MediaWiki extension out to block Unicode in usernames.  
Can anyone confirm or deny this?

Comment 21 Brion Vibber 2006-10-03 18:54:21 UTC

We will never "block out Unicode" as that doesn't make sense.
*Every* username is Unicode, with *no exceptions*.

What we will do is enforce restrictions on some characters
and mixed-script names. Please see the code in AntiSpoof extension.

Comment 22 Invalid Account 2006-10-03 21:10:53 UTC

I downloaded the files, and AntiSpoof has no docs or explanations that I could find on mediawiki, Google, or 
in the code.  I had to read through the code of the six files to determine which one to include.

First, is AntiSpoof still in testing and not working correctly yet?

Also, is patch-antispoof.sql.txt needed or is some SQL work needed to be done first before using 
AntiSpoof?

And is its log something saved to a file like debug.log, something stored only in MySQL, or 
something viewable in MediaWiki itself?

Comment 23 Brion Vibber 2006-10-03 21:34:59 UTC

This bug entry is not a discussion forum. If you want to ask general
questions about how to operate software, please do it separately.

Comment 24 Aaron Schulz 2008-05-16 20:13:00 UTC

Done reasonably with AntiSpoof
