Last modified: 2008-03-13 06:17:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T5819, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 3819 - strip phantom general punctuation characters from page titles
Status: RESOLVED DUPLICATE of bug 3696
Product: MediaWiki
Classification: Unclassified
Component: Database
Hardware: All All
Importance: Normal trivial with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 3887
Blocks: rtl 3985 1381
Reported: 2005-10-28 17:01 UTC by lɛʁi לערי ריינהארט
Modified: 2008-03-13 06:17 UTC (History)

Description lɛʁi לערי ריינהארט 2005-10-28 17:01:05 UTC
Sorry for this!


a) I tested character normalisation, which seems to be part of title normalisation.
For precombined versus non-precombined characters this works fine: both forms
point to the same page despite the different encoding.

b) The bug's URL will list four different pages with an "identical visible title".
There are "phantom" trailing general punctuation characters generating different
URLs. Compare:
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
  UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

The generated URLs are:

There are many aspects to this:
a) possible vandalism - suggestion: please evaluate whether "phantom" (= unnecessary)
leading or trailing punctuation should be stripped from database titles
++ this looks like a normalisation issue
b) garbage in - garbage out
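
The stripping suggested in a) could look roughly like the sketch below. It is only an illustration in Python, not MediaWiki code: the function name and the character set are assumptions, with the set taken from the control characters named in this bug report.

```python
# Hypothetical helper, not part of MediaWiki. The set below is an assumption
# based on the bidi control characters mentioned in this bug report.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"


def strip_phantom_controls(title: str) -> str:
    """Remove leading and trailing bidi control characters from a title."""
    # str.strip(chars) removes any of the given characters from both ends
    # and leaves occurrences inside the string untouched.
    return title.strip(BIDI_CONTROLS)


print(repr(strip_phantom_controls("\u202bExample\u202c")))
```

Characters inside the title are deliberately left alone here, since the report only asks about leading and trailing ones at this point.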

Regards Reinhardt [[user:gangleri]]

P.S. I ran into this because of textual ambiguities at Wikipedia in Yiddish
relating to the usage of "tsvey vovn" versus "vov + vov", "tsvey-yudn" versus
"yud + yud" etc.

example 1: There is an article [[yi:וויץ]] but not [[yi:װיץ]] .

example 2: contains "vey iz (tsu) mir"
which is written *there* both with "vov + vov" and "yud + yud". Nevertheless translates with
"tsvey vovn" and "tsvey-yudn": װײ איז (צו) מיר!

It seems that automatic character substitution is not possible because of
ambiguities when three characters meet, as in
farvunderung - פֿאַרווונדערונג , "farvundert" - פֿאַרווונדערט
and the other way around at
oyspruvn - אויספּרווון
Comment 1 lɛʁi לערי ריינהארט 2005-11-04 18:59:50 UTC
You will find typical examples at the end of and at .

Summary is available at .

These pages were created because I "compiled" the titles with "copy and
paste" (of Hebrew characters) between different Firefox browsers on Windows.

A workaround is to use a useful keyboard as described at
and avoid this silly "copy and paste".
See : Yiddish Pasekh and Keyman
keyboard for Windows

Regards Reinhardt [[user:gangleri]]
Comment 2 lɛʁi לערי ריינהארט 2005-11-05 02:20:45 UTC

This bug can cause some confusion in a wiki. I assume that many contributors are
using "copy and paste" to insert a few Hebrew characters.

As you can see from
%E2%80%AB can be
- at the beginning of a title
- at the end of a title
- (I assume also inside the title)

There would be different things to do:
- avoid generation of such titles during editing, linking etc.
- clear the database - this is a maintenance issue
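
As a rough sketch of how the maintenance side might locate such titles, the Python snippet below decodes a percent-encoded title and reports where U+202B occurs. The function name and the sample inputs are assumptions made for illustration; nothing here reflects how MediaWiki stores or queries titles.

```python
from urllib.parse import unquote

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING, percent-encoded as %E2%80%AB


def rle_positions(encoded_title):
    """Report where U+202B occurs in a percent-encoded title (hypothetical helper)."""
    title = unquote(encoded_title)  # unquote() decodes the bytes as UTF-8 by default
    found = []
    if title.startswith(RLE):
        found.append("beginning")
    if title.endswith(RLE):
        found.append("end")
    # After removing leading/trailing occurrences, anything left is inside the title.
    if RLE in title.strip(RLE):
        found.append("inside")
    return found


print(rle_positions("%E2%80%ABabc"))
print(rle_positions("a%E2%80%ABb%E2%80%AB"))
```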

Regards Reinhardt [[user:gangleri]]
Comment 3 lɛʁi לערי ריינהארט 2005-11-05 09:40:41 UTC

I found more incorrect titles (only with leading RIGHT-TO-LEFT_EMBEDDING) in
other projects with

Besides the RTL wikis [[ar:]] [[fa:]] [[he:]] [[ur:]] [[yi:]], their wiktionaries etc.,
all other projects can be affected.

These wrong titles at [[yi:]] were created by 5 contributors. This shows
that it is a general problem. If contributors "copy" text from a web page and
paste it (as Hebrew characters) into the browser's URL bar (I mainly use
Firefox myself), they might copy and paste leading or trailing punctuation characters,
and the browser will *generate* these URLs.

Of course this is not the proper way to generate titles (one should use a
keyboard), and it might or might not be a Firefox issue (I do not know whether it is reported at ; if not, please do so), but it is common practice among a significant
number of contributors to RTL projects.

You will find the affected titles at:

Best regards Reinhardt [[user:gangleri]]
Comment 4 lɛʁi לערי ריינהארט 2005-11-05 10:24:03 UTC
more characters:

I found
which originally contained a trailing %E2%80%AC

Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

Compare also:
Unicode Character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
UTF-8 (hex) 0xE2 0x80 0xAA (e280aa)
Unicode Character 'POP DIRECTIONAL FORMATTING' (U+202C)
UTF-8 (hex) 0xE2 0x80 0xAC (e280ac)
Unicode Character 'LEFT-TO-RIGHT OVERRIDE' (U+202D)
UTF-8 (hex) 0xE2 0x80 0xAD (e280ad)
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
UTF-8 (hex) 0xE2 0x80 0xAE (e280ae)
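
The byte sequences for this block of characters can be double-checked with a few lines of Python. This is only a verification sketch; the helper name `describe` is made up for the example, and the names come from Python's bundled Unicode database.

```python
import unicodedata


def describe(code_point):
    """Return (name, UTF-8 hex, percent-encoded form) for one code point."""
    char = chr(code_point)
    encoded = char.encode("utf-8")
    percent = "".join(f"%{byte:02X}" for byte in encoded)
    return unicodedata.name(char), encoded.hex(), percent


# U+202A..U+202E: the embedding, pop and override controls discussed here.
for cp in range(0x202A, 0x202F):
    name, utf8_hex, percent = describe(cp)
    print(f"U+{cp:04X} {name}: {utf8_hex} {percent}")
```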

Variations / modifications of
are of limited use only, because (theoretically) these characters can be included
anywhere in a title.

I will open another enhancement request about a special page allowing an in-string
search of titles by specifying %nn values.
Comment 5 lɛʁi לערי ריינהארט 2005-11-05 10:43:18 UTC
(In reply to comment #4)
> I will open another enhancement request about a special page allowing an in-string
> search of titles by specifying %nn values.

bug 3887: create a special page for instring search of titles specifying %nn values
Comment 6 lɛʁi לערי ריינהארט 2005-11-05 12:09:56 UTC
sorry for this


You may say: "garbage in, garbage out"

But this seems to be a subsequent error. It "seems" to interfere with the setup
of case sensitive / non case sensitive titles. The earlier this bug gets
fixed, the fewer subsequent errors we get.
Comment 7 lɛʁi לערי ריינהארט 2005-11-06 15:30:30 UTC
sorry for this
this title is invalid because it starts with %E2%80%AB = Unicode Character

However it is a mess editing BiDi and generate pages like
and also taking care of all these !*%$$€@*# bugs.

These pages look fine, but the titles they link to should be invalid and the
links should not show red. It would be best to leave them with [[ and ]] brackets,
the same as invalid links.

Best regards Reinhardt [[user:gangleri]]
Comment 8 lɛʁi לערי ריינהארט 2005-11-06 15:49:40 UTC
(In reply to comment #7)
> sorry for this
> and also taking care of all these !*%$$€@*# bugs.

I fixed the links involved, so the Whatlinkshere is no longer valid. Compare:
bug 3894 white space characters, BiDi control characters should show up in diff
Comment 9 lɛʁi לערי ריינהארט 2005-11-13 10:45:21 UTC
Fixing this would later require a validation according to
bug 3904 disallow user pages and user_talk pages starting with lower case on
case sensitive wikis

adding blocks bug 3904
Comment 10 lɛʁi לערי ריינהארט 2005-11-16 20:04:42 UTC
Hi! The code on FiverAlpha is changing.
and bug 3888 comment 3

The category illustrates that the
punctuation characters can be used for fraud and vandalism.

If you are not used to the punctuation topics you may *not* notice that
the edit of this *false account* contains punctuation characters in

- one way to see these characters is to check the URL; this is simple if most
of the contained characters are 7-bit ASCII;
- another way to see these characters is to place the cursor in the text and
move it through the text area
- another way to see these characters is to mark the text with the mouse
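
A further option, outside the browser, is to make the characters visible programmatically. The Python sketch below replaces every Unicode format character (general category Cf, which includes the bidi controls) with a visible tag; the function name and tag format are assumptions chosen for the example.

```python
import unicodedata


def reveal_format_chars(text: str) -> str:
    """Replace Unicode format characters (category Cf) with visible <U+XXXX> tags."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


# A made-up user name with a leading U+202B and a trailing U+200F.
print(reveal_format_chars("\u202bUser\u200f"))
```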

Because these characters cause more trouble than benefit, I suggest
suppressing the punctuation characters in titles until a generally accepted
solution can be provided. As it is now, mimic accounts can be created.
This opens the door to fraud and vandalism.

Regards Reinhardt [[user:gangleri]]
Comment 11 lɛʁi לערי ריינהארט 2005-11-20 19:23:38 UTC
(In reply to comment #4)
> more characters:
I found also
Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E)
UTF-8 (hex) 0xE2 0x80 0x8E (e2808e)
Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)

Comment 12 lɛʁi לערי ריינהארט 2005-12-10 02:14:39 UTC

I would like to CANCEL this request / withdraw it. (There is no such MediaZilla

The request is too restrictive to me, and other methods to avoid the problem / to
fix affected pages should be found.

Such tools are requested at
- Bug 4012: feature request: add a flexible magic character conversion to the
built-in editor
which would allow these characters to be identified in the editor
- Bug 4185: feature request: provide a notification for irregular links
which would warn users before they submit such links / such pages (either new or
Comment 13 Gabriel Wicke 2006-03-24 11:49:47 UTC
Closing as requested
Comment 14 lɛʁi לערי ריינהארט 2008-03-13 06:15:38 UTC
As the status is now, this is more of a DUPLICATE of

bug 3696 Unicode Control Characters should be restricted in title text (RLM, LRM, RLO, LRO, . . .)

*** This bug has been marked as a duplicate of bug 3696 ***
