Last modified: 2008-07-12 20:52:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T6185, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 4185 - feature request: provide a notification for irregular Unicode characters
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 3985
Reported: 2005-12-05 19:01 UTC by lɛʁi לערי ריינהארט
Modified: 2008-07-12 20:52 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description lɛʁi לערי ריינהארט 2005-12-05 19:01:52 UTC

This request proposes a single, combined solution for several bugs:
a) Bug 1414: Unicode whitespaces allowed in article title
b) Bug 1524: usernames should use unicode whitelist
c) Bug 2593: Non-printing characters allowed in registration
d) Bug 3819: strip phantom general punctuation characters from page titles

Requests and solutions can be "restrictive", but such solutions would make it impossible
to use these characters at all. Personally I do not like restrictive solutions.

The solution proposed here is to implement a notification for "action=submit"
(preview or save) indicating that saving would generate "irregular links", links
containing "irregular characters".

The notification should list *all* "irregular links" individually (what counts as
an irregular link should be defined in a .php include file) and offer a "save anyway" option.

*notifications* are not new in MediaWiki:
- Special:Upload notifies if the size of a file to be uploaded is above a limit.
- Special:Upload notifies if a file would be uploaded with a title that
already exists.
Both notifications use [[MediaWiki:Uploadwarning]], with the button
[[MediaWiki:Savefile]], the text [[MediaWiki:Ignorewarning]] etc.

The proposed solution would meet the main goal:
- generating a warning if something could happen that causes trouble
- if the link is intended, it is up to the user to generate it anyway
Benefit: the warning should prevent "unintended" "irregular links" from being generated.

The list of the "irregular links" should display the "irregular characters" as
HTML entities where such exist, otherwise in &#nnnn; notation, and *not* as UTF-8, because
many of them would be impossible to see or distinguish as UTF-8.
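
The display rule described above could look roughly like the following Python sketch. It is not MediaWiki code. The names `is_irregular` and `show_irregular` are hypothetical, and the test for what counts as "irregular" (Unicode separator, control and format categories, excluding the ordinary space) is only an assumption for illustration.

```python
import unicodedata
from html.entities import codepoint2name  # HTML 4 named entities, e.g. 8206 -> "lrm"

def is_irregular(ch: str) -> bool:
    """Assumed definition: separators, controls and format characters,
    with the ordinary ASCII space treated as regular."""
    if ch == " ":
        return False
    return unicodedata.category(ch) in {"Zs", "Zl", "Zp", "Cc", "Cf"}

def show_irregular(text: str) -> str:
    """Replace each irregular character by a named entity if one exists,
    otherwise by its decimal &#nnnn; notation."""
    out = []
    for ch in text:
        if is_irregular(ch):
            name = codepoint2name.get(ord(ch))
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
        else:
            out.append(ch)
    return "".join(out)

print(show_irregular("Main\u200ePage"))  # Main&lrm;Page
print(show_irregular("a\u2060b"))        # a&#8288;b
```

A real notification would additionally have to escape the "&" when the list is rendered, so that the entity is shown literally instead of being interpreted again.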

*main* "irregular characters" identified so far:
- whitespace / non-printing characters
- general punctuation characters
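
As a rough illustration of how the "irregular links" of a page could be listed, here is a minimal Python sketch. All names in it (`IRREGULAR_CATEGORIES`, `IRREGULAR_RANGES`, `irregular_links` ...) are hypothetical, and the configuration is only one possible reading of the two groups above: Unicode separator / control / format categories for "whitespace / non-printing", and the whole General Punctuation block U+2000..U+206F for "general punctuation". A real configuration would probably be narrower, because that block also contains ordinary dashes and quotation marks.

```python
import re
import unicodedata

# Assumed configuration; the request leaves the real definition to an include file.
IRREGULAR_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}  # separators, controls, format chars
IRREGULAR_RANGES = [(0x2000, 0x206F)]                  # General Punctuation block
ALLOWED = {" "}                                        # the ordinary space is regular

# Matches [[target]] and [[target|label]]; group 1 is the link target.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

def irregular_chars(target: str) -> list[str]:
    """Return the code points (as U+XXXX) of all irregular characters in target."""
    found = []
    for ch in target:
        if ch in ALLOWED:
            continue
        cp = ord(ch)
        in_range = any(lo <= cp <= hi for lo, hi in IRREGULAR_RANGES)
        if in_range or unicodedata.category(ch) in IRREGULAR_CATEGORIES:
            found.append(f"U+{cp:04X}")
    return found

def irregular_links(wikitext: str) -> list[tuple[str, list[str]]]:
    """List every link target that contains at least one irregular character."""
    report = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1)
        chars = irregular_chars(target)
        if chars:
            report.append((target, chars))
    return report

sample = "See [[Main Page]] and [[Main\u200ePage|home]] and [[A\u00a0B]]."
for target, chars in irregular_links(sample):
    print(repr(target), chars)
```

The sketch only looks at literal characters in plain [[...]] links; entity-encoded or percent-encoded characters and links built with parser functions would have to be decoded first.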

The notification should support all encodings of the "irregular
characters": UTF-8, HTML entities (&lrm;, &rlm;, ...), &#nnnn;, &#xnnnn; and %XX%YY%ZZ
in links or their parameters (also inside {{localurl}}, {{fullurl}} ...).
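
One way to handle all of these encodings is to decode every link target to plain Unicode before checking it. The following Python sketch shows the idea with the standard library only; `decode_link_target` is a hypothetical name, and decoding entities first and percent-escapes second is an assumption, not the order MediaWiki itself uses.

```python
from html import unescape
from urllib.parse import unquote

def decode_link_target(raw: str) -> str:
    """Decode named entities, &#nnnn;, &#xnnnn; and %XX%YY%ZZ (UTF-8) to characters."""
    return unquote(unescape(raw))

# Five spellings of the same target containing U+200E LEFT-TO-RIGHT MARK.
variants = ["A\u200eB", "A&lrm;B", "A&#8206;B", "A&#x200E;B", "A%E2%80%8EB"]
for raw in variants:
    decoded = decode_link_target(raw)
    print(raw.encode("unicode_escape").decode(), "->", f"U+{ord(decoded[1]):04X}")
```

After this step a single check on the decoded text covers all spellings. Links produced through {{localurl}} or {{fullurl}} would need the same treatment applied to their parameters.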

The proposed solution would make it easy to identify such forms of vandalism or
mistakes caused by copy and paste or incorrect editing due to insertion /
deletion of such characters. Detecting and fixing them now is very time consuming.

*other* "irregular characters"
It should be evaluated whether this function can also be used for "Unicode character
normalisation". This concerns MediaWiki's conversion of Unicode
precomposed characters to a group of Unicode characters.

An optimal achievement would be to generate "proposals" "what to replace with
what" offering checkboxes beside the links.

The Unicode character HEBREW LETTER ALEF WITH PATAH - U+FB2E would be replaced
anyway by MediaWiki with the two characters HEBREW LETTER ALEF - U+05D0 and
HEBREW POINT PATAH - U+05B7. So if the built-in title normalisation changes
these characters, why not also be able to change
- the &#nnnn; representation &#64302; to &#1488;&#1463;
- the &#xnnnn; representation &#xFB2E; to &#x05D0;&#x05B7;
- the %EF%AC%AE to %D7%90%D6%B7
in the source of the page?
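
The three replacement pairs above can be derived mechanically. The Python sketch below assumes that the title normalisation in question behaves like Unicode normalisation form C (NFC); `normalization_proposals` is a hypothetical helper, not an existing MediaWiki function.

```python
import unicodedata
from urllib.parse import quote

def normalization_proposals(text: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs in &#nnnn;, &#xnnnn; and %XX notation,
    or an empty list if NFC normalisation leaves the text unchanged."""
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return []

    def decimal(s: str) -> str:
        return "".join(f"&#{ord(c)};" for c in s)

    def hexadecimal(s: str) -> str:
        return "".join(f"&#x{ord(c):04X};" for c in s)

    return [
        (decimal(text), decimal(normalized)),
        (hexadecimal(text), hexadecimal(normalized)),
        (quote(text), quote(normalized)),  # percent-encoded UTF-8
    ]

for old, new in normalization_proposals("\uFB2E"):
    print(old, "->", new)
# &#64302; -> &#1488;&#1463;
# &#xFB2E; -> &#x05D0;&#x05B7;
# %EF%AC%AE -> %D7%90%D6%B7
```

Each pair could be offered as one checkbox in the notification, so that the editor decides which occurrences to convert.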
Keeping these only causes trouble. See Bug 3860: links generated with
precombined characters show red despite the fact that the normalised links exist
testcase: [[wiktionary:yi:bugzilla/03860]]

Because changes would be controlled by checkboxes it would still be possible to
maintain precombined characters for documentation, testing ... However fixing /
"converting to the standard" would be achieved with a "built-in help" "knowledge
tool" and can save much time.

some bugs dealing with Unicode normalization:
- Bug 1375: Unicode normalization leaves red links
- Bug 1527: problem on URL with Devanagari characters
- Bug 2399: Unicode normalization interferes with Hebrew and Arabic with vowels

Best regards reinhardt [[user:gangleri]]
Comment 1 lɛʁi לערי ריינהארט 2005-12-05 20:04:44 UTC
(In reply to comment #0)
> An optimal achievement would be to generate "proposals" "what to replace with
> what" offering checkboxes beside the links.

This handles "character conversion".
adding blocks
Bug 3985: character conversion (tracking)
Comment 2 lɛʁi לערי ריינהארט 2005-12-05 21:57:01 UTC

This request handles only the occurrence of "irregular characters" in links. For
the handling in the rest of the page source see
Bug 4012: feature request: add a flexible magic character conversion to the
build in editor
Comment 3 lɛʁi לערי ריינהארט 2005-12-10 02:15:27 UTC

Because this request is related to action=submit it should also analyse
{{PAGENAME}}. This will prevent the creation of such pages and alert editors to
the problem.

However this request does not specify an analysis of {{PAGENAME}} for other
actions such as view, watch, history, move, delete, validate etc.
Comment 4 Brion Vibber 2005-12-11 02:29:48 UTC
Problem characters would simply be forbidden. "Notification" is unnecessary.
Comment 5 lɛʁi לערי ריינהארט 2008-03-13 06:11:47 UTC
REOPENing this bug and changing title to
feature request: provide a notification for irregular Unicode characters

Dear friends; describes how persistent and irritating *invisible* Unicode characters (such as the General Unicode Punctuation characters) can be.

As a documentation text was copied and pasted from the page

General Unicode Punctuation characters *infected*
and whatever other pages, emails etc. which used these pages as a source.

[[user:Splarka]] made
which is available for tests at

With this tool it is possible to identify a configurable set of "''Evil Unicode characters''".
The source of the page content is displayed as
        Author  Title   Year    Library         Sysno
        ‫ לנסקי, אהרן,1955- ‬     ‫ נגד כיוון ההיסטוריה :הרפתקאותיו המופלאות של האיש שהצ ‬     2005    HAI Haifa U.    006639172

This is a very convenient way to eliminate all unwanted "''Evil Unicode characters''".

Please reconsider including this or similar code as a standard function in MediaWiki.

Thanks in advance for all your efforts.

Best regards
Reinhardt [[user:Gangleri]]
Comment 6 Niklas Laxström 2008-07-12 20:52:39 UTC
Sounds like a job for an extension or a gadget, which already seems to exist.
