Last modified: 2011-09-20 10:19:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the pages displaying bug reports and their history, links might be broken. See T26999, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 24999 - Cannot create a username containing a Zero width joiner on languages where a ZWJ makes a visible difference and is required
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: User login and signup
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Santhosh Thottingal
Keywords: i18n
Depends on:
Blocks:
Reported: 2010-08-31 14:59 UTC by Lee
Modified: 2011-09-20 10:19 UTC (History)
CC List: 7 users


Description Lee 2010-08-31 14:59:22 UTC
Hi,

We are having some issues with creating users in Sinhala Wikipedia. We are not allowed to create user names like "සසීන්ද්‍ර" and "නන්දිමිත්‍ර". 

This looks like something to do with modifiers on Sinhala letters. Maybe the zero width joiner (ZWJ) needs to be allowed?

rakaranshaya (්‍ර) is written as: hal kereema + zero width joiner(ZWJ) + ra


Thanks in advance,
/Lee
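
The sequence described in the report (hal kereema + ZWJ + ra) can be inspected code point by code point. This is only a minimal sketch: the variable name is made up for illustration, and it assumes that "hal kereema" corresponds to U+0DCA and "ra" to U+0DBB.

```python
import unicodedata

# Assumed code points: hal kereema (U+0DCA), zero width joiner (U+200D), ra (U+0DBB)
rakaransaya = "\u0dca\u200d\u0dbb"

# Print each code point with its official Unicode character name
for ch in rakaransaya:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The middle code point is the ZWJ (U+200D), the character that the username check rejects.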
Comment 1 Platonides 2010-08-31 19:24:32 UTC
The zero width joiner has been forbidden from appearing in a username since r13007.

Perhaps we could allow it if surrounded by Sinhala characters? :s
Comment 2 Lee 2010-09-01 03:37:20 UTC
Is there any reason why these characters are blacklisted?
Comment 3 Platonides 2010-09-01 22:24:25 UTC
If you have a user called "Some admin", having another account called "Some admin" but using a non-default space is confusing. Moreover, when trying to block the vandal you are likely to block the right user (or be unable to, if the account with the normal space didn't exist).

I think that's what Brion referred to as 'troublemaker characters'.

On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.
Comment 4 Roan Kattouw 2010-09-12 15:26:01 UTC
(In reply to comment #3)
> If you have a user called "Some admin", having another account called "Some
> admin" but using a non-default space is confusing. Moreover, trying to block
> the vandal you are likely to block the right user (or be unable to, if the
> account with normal space didn't exist).
Don't we have Extension:AntiSpoof for this?

> I think that's what Brion referred to as 'troublemaker characters'.
> 
> On the other hand, the request to use නන්දිමිත්‍ර is perfectly reasonable.
Most of the banned characters (whitespace, nbsp, control chars) do look like troublemakers, but ZWJ seems perfectly reasonable to me.
Comment 5 Platonides 2010-09-12 21:55:29 UTC
> Don't we have Extension:AntiSpoof for this?

AntiSpoof is more powerful: it checks for similar characters, blocks mixed scripts...


> Most of the banned characters (whitespace, nbsp, control chars) do look like
> troublemakers, but ZWJ seems perfectly reasonable to me.

Are you sure? Please compare in your browser [[User:Catrope]] vs [[User:Cat‍rope]]. There's no visual difference in mine.
Comment 6 Roan Kattouw 2010-09-13 16:54:24 UTC
(In reply to comment #5)
> Are you sure? Please compare in your browser [[User:Catrope]] vs
> [[User:Cat‍rope]]. There's no visual difference in mine.
That's what we have AntiSpoof for, right? I'm sure there's plenty of characters that look very much like an ASCII 'C'.
Comment 7 Platonides 2010-09-13 22:39:56 UTC
You failed. That C is the normal one.
What I did was insert a ZWJ between Cat and rope.
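
The effect described here, two names that render the same but are different strings, can be reproduced in a few lines. This is an illustrative sketch only; the variable names are made up.

```python
ZWJ = "\u200d"  # zero width joiner

plain = "Catrope"
spoof = "Cat" + ZWJ + "rope"  # a ZWJ inserted between "Cat" and "rope"

# The strings differ even if a browser renders them identically
print(plain == spoof)
print(len(plain), len(spoof))

# Stripping the ZWJ recovers the plain name
print(spoof.replace(ZWJ, "") == plain)
```

Since Latin letters do not join, the ZWJ typically has no visible effect there, which is why the two names can look the same on screen.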
Comment 8 Roan Kattouw 2010-09-14 12:13:09 UTC
(In reply to comment #7)
> You failed. That C is the normal one.
> What I did was insert a ZWJ between Cat and rope.
I knew that, I was just pointing out there are other ways to construct a username looking just like 'Catrope' without using ZWJs or other characters currently forbidden in usernames.
Comment 9 Platonides 2010-09-14 12:45:46 UTC
Sure you could use [[С]] for writing [[User:Сatrope]], and that would be blocked by AntiSpoof.
The point is, ZWJ should not be allowed in usernames unless the bad usage stays blocked.
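
For comparison, the look-alike letter mentioned above is an ordinary visible character from another script. The sketch below only shows that the two letters are different code points; it does not reproduce how AntiSpoof works, and the variable names are made up.

```python
import unicodedata

latin_c = "C"           # U+0043
cyrillic_es = "\u0421"  # U+0421, looks like a Latin "C" in most fonts

# Print each code point with its official Unicode character name
for ch in (latin_c, cyrillic_es):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A name starting with the Cyrillic letter is a different string
print((cyrillic_es + "atrope") == "Catrope")
```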
Comment 10 Lee 2010-12-10 14:51:08 UTC
Do we have any update on this?
Comment 11 Platonides 2010-12-11 22:03:46 UTC
Lee, can you figure out in which cases a ZWJ makes a visual difference?
I think that's the blocker here. If we can isolate some unambiguous instances of ZWJ, we could try whitelisting them.
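
The idea floated in comment 1, allowing a ZWJ only when it is surrounded by Sinhala characters, could look roughly like this. It is a sketch under assumptions: "Sinhala character" is taken to mean anything in the Sinhala Unicode block (U+0D80 to U+0DFF), and the function name is hypothetical, not MediaWiki code.

```python
import re

ZWJ = "\u200d"

# Assumption: any code point in the Sinhala block counts as a Sinhala character
SINHALA = r"[\u0D80-\u0DFF]"

# Matches a ZWJ that has a Sinhala character directly before and after it
_ZWJ_BETWEEN_SINHALA = re.compile(f"(?<={SINHALA}){ZWJ}(?={SINHALA})")


def zwj_only_between_sinhala(name: str) -> bool:
    """Return True if every ZWJ in name sits between two Sinhala characters."""
    # Drop the acceptable ZWJs; any ZWJ that remains is in a disallowed position
    return ZWJ not in _ZWJ_BETWEEN_SINHALA.sub("", name)


print(zwj_only_between_sinhala("\u0dad\u0dca\u200d\u0dbb"))  # Sinhala letter + hal kereema + ZWJ + ra
print(zwj_only_between_sinhala("Cat\u200drope"))
```

A rule this coarse may still admit a ZWJ between Sinhala characters where it makes no visible difference, so it would not fully answer the question raised above.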
Comment 12 Bawolff (Brian Wolff) 2010-12-12 05:24:49 UTC
According to Wikipedia, Arabic and most Indic scripts have at least some characters where it makes a visual difference.


Googling, http://www.unicode.org/reports/tr31/ (section 2.3) seems to have some advice on when and when not to ban ZWJ. (It even gives Perl regexes, but using the fancy stuff that I don't think is supported by PCRE.)

http://unicode.org/review/pr-96.html also seems to have some advice (and seems more to the point), but it's unclear what the status of that document is.
Comment 13 Lee 2010-12-17 12:55:37 UTC
It looks like I'm going to need some help to answer that question. I'm not much of an expert in the language. I'll ask around so someone with the proper knowledge can help here.
Comment 14 Santhosh Thottingal 2011-09-06 15:37:39 UTC
According to Unicode Annex 31(http://www.unicode.org/reports/tr31/), Identifier patterns, as an exception to the usual exclusion of ZWJ is not allowed for certain scripts. That includes Sinhala. But the policy is strict about where and how one can use ZWJ. 
Sinhala, many Indian languages and Arabic require ZWJ, which makes a visual difference.
We need to implement UAX31 on top of r13007.
Comment 15 Santhosh Thottingal 2011-09-06 15:40:53 UTC
(In reply to comment #14)
> According to Unicode Annex 31(http://www.unicode.org/reports/t0/), Identifier
> patterns, as an exception to the usual exclusion of ZWJ is not allowed for
> certain scripts. 
Sorry, read it as:

According to Unicode Annex 31(http://www.unicode.org/reports/t0/), Identifier patterns, as an exception to the usual exclusion, ZWJ *is allowed* for certain scripts,
Comment 16 Platonides 2011-09-15 20:53:07 UTC
The right url seems to be http://unicode.org/reports/tr31/
There are some regular expressions reported, I think they are based on \L{} (Unicode properties). Luckily, we can do some slow things on this path.
Comment 17 Bawolff (Brian Wolff) 2011-09-15 20:58:15 UTC
(In reply to comment #16)
> The right url seems to be http://unicode.org/reports/tr31/
> There are some regular expressions reported, I think they are based on \L{}
> (Unicode properties). Luckily, we can do some slow things on this path.

I think the rXXX in the URL is screwing it up with magic revision auto-linking. Let's try http://www.unicode.org/reports/t%7231/

Last time I looked at that page, the regexes used things based on the more complex Unicode properties supported by Perl but not PCRE. However it was still very doable, one just needed to create a fairly large (not huge though) character class by hand.
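
A context check in the spirit of the UAX #31 advice could be sketched as follows. This is a deliberately simplified assumption, not the full rule from the annex: it accepts a ZWJ only when it directly follows a virama (canonical combining class 9) that directly follows a letter, and it ignores the other conditions in the annex. Instead of a hand-built character class it leans on Python's `unicodedata` tables; the function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def is_virama(ch: str) -> bool:
    # Canonical combining class 9 is the Virama class in the Unicode database
    return unicodedata.combining(ch) == 9


def is_letter(ch: str) -> bool:
    # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo)
    return unicodedata.category(ch).startswith("L")


def zwj_in_allowed_context(name: str) -> bool:
    """Simplified check: every ZWJ must come right after letter + virama."""
    for i, ch in enumerate(name):
        if ch != ZWJ:
            continue
        # A ZWJ needs two preceding characters: a letter, then a virama
        if i < 2 or not is_virama(name[i - 1]) or not is_letter(name[i - 2]):
            return False
    return True


print(zwj_in_allowed_context("\u0dad\u0dca\u200d\u0dbb"))  # letter + hal kereema + ZWJ + ra
print(zwj_in_allowed_context("Cat\u200drope"))
```

In an engine without these property lookups, the set of virama characters would have to be listed explicitly, which is the hand-made character class mentioned above.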
