Last modified: 2008-03-13 06:17:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T5819, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 3819 - strip phantom general punctuation characters from page titles
Status: RESOLVED DUPLICATE of bug 3696
Product: MediaWiki
Classification: Unclassified
Component: Database
Version: unspecified
Hardware: All All
Importance: Normal trivial with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://yi.wikipedia.org/w/index.php?t...
Depends on: 3887
Blocks: rtl 3985 1381
Reported: 2005-10-28 17:01 UTC by lɛʁi לערי ריינהארט
Modified: 2008-03-13 06:17 UTC (History)

Description lɛʁi לערי ריינהארט 2005-10-28 17:01:05 UTC
Sorry for this!

Hello!

a) I tested character normalisation, which seems to be part of title normalisation.
For precombined versus non-precombined characters this works fine:
[[User:Gangleri/tests/אָ]]
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%EF%AC%AF
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%D7%90%D6%B8
point to the same page despite the different encoding.
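
A minimal sketch of why the two URLs above can resolve to the same title, assuming the title normaliser applies Unicode NFC (this report does not say which normalisation form is used). Python's standard `unicodedata` module stands in for the wiki's own code purely for illustration:

```python
import unicodedata
from urllib.parse import unquote

# The two percent-encoded title suffixes from the URLs above.
precomposed = unquote("%EF%AC%AF")    # U+FB2F HEBREW LETTER ALEF WITH QAMATS
decomposed = unquote("%D7%90%D6%B8")  # U+05D0 ALEF followed by U+05B8 QAMATS


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


# U+FB2F is a composition exclusion, so NFC yields the two-character form.
normalised = unicodedata.normalize("NFC", precomposed)

print(code_points(precomposed))
print(code_points(decomposed))
print(normalised == decomposed)
```

Both spellings end up as the same two code points after normalisation, which matches the observation that both URLs reach one page.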

b) The bug's URL lists four different pages with visually identical titles.
"Phantom" trailing general punctuation characters generate different
URLs. Compare:
http://www.fileformat.info/info/unicode/char/202b/index.htm
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
  UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)
http://homepage1.nifty.com/nomenclator/unicode/data/punct.htm

The generated URLs are:
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB%E2%80%AB%E2%80%AB
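
To make the difference between these four URLs visible, the following sketch decodes the title suffixes and counts the U+202B characters in each. It uses only Python's standard library; the variable names are illustrative and not taken from any wiki code:

```python
from urllib.parse import unquote

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING

base = "%E2%80%AB%D7%B0%D7%99%D7%A5"
# The four title suffixes listed above: 0 to 3 extra trailing %E2%80%AB.
suffixes = [base + "%E2%80%AB" * n for n in range(4)]

titles = [unquote(s) for s in suffixes]

for title in titles:
    # Length, number of U+202B characters, and the raw code points.
    print(len(title), title.count(RLE), [hex(ord(ch)) for ch in title])

# All four are distinct strings, yet identical once U+202B is removed.
visible = {title.replace(RLE, "") for title in titles}
print(len(set(titles)), len(visible))
```

The four strings differ only in how many invisible U+202B characters they carry, so they collapse to a single visible title.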

There are several aspects to this:
a) possible vandalism - suggestion: please evaluate whether "phantom" (i.e. unnecessary)
leading or trailing punctuation should be stripped from database titles;
this looks like a normalisation issue
b) garbage in - garbage out

Regards Reinhardt [[user:gangleri]]

P.S. I ran into this because of textual ambiguities on the Yiddish Wikipedia
relating to the use of "tsvey vovn" versus "vov + vov", "tsvey-yudn" versus
"yud + yud", etc.

example 1: There is an article [[yi:וויץ]] but not [[yi:װיץ]] .


example 2: http://www.yiddishdictionaryonline.com/ contains "vey iz (tsu) mir"
which is written *there* both with "vov + vov" and "yud + yud". Nevertheless
http://www.cs.engr.uky.edu/~raphael/yiddish/makeyiddish.html translates with
"tsvey vovn" and "tsvey-yudn": װײ איז (צו) מיר!

It seems that automatic character substitution is not possible because of
ambiguities when three characters meet, as in
http://www.yiddishdictionaryonline.com/ at
farvunderung - פֿאַרווונדערונג , "farvundert" - פֿאַרווונדערט
and the other way around at
oyspruvn - אויספּרווון
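
A minimal sketch of the ambiguity described in this P.S. It assumes "tsvey vovn" refers to the single character U+05F0 and "vov + vov" to two U+05D5 characters; `naive_ligate` is a hypothetical helper, not an existing tool:

```python
import unicodedata

VOV = "\u05d5"         # HEBREW LETTER VAV
TSVEY_VOVN = "\u05f0"  # HEBREW LIGATURE YIDDISH DOUBLE VAV

# Example 1: same visible word, different code points, hence different titles.
two_letters = VOV + VOV + "\u05d9\u05e5"  # vov + vov + yud + final tsadi
ligature = TSVEY_VOVN + "\u05d9\u05e5"    # tsvey vovn + yud + final tsadi
print(two_letters == ligature)

# Unicode normalisation does not fold one spelling into the other.
print(unicodedata.normalize("NFKC", ligature) == ligature)


def naive_ligate(text):
    """Replace each leftmost vov + vov pair with the ligature."""
    return text.replace(VOV + VOV, TSVEY_VOVN)


# Three vovn in a row: the leftmost pair always wins, whichever
# reading the word actually needs.
triple = VOV * 3
print(naive_ligate(triple) == TSVEY_VOVN + VOV)
```

A plain search-and-replace always ligates the first two of three vovn. Since the report says "farvunderung" and "oyspruvn" need opposite groupings, one fixed rule cannot be right for both.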
Comment 1 lɛʁi לערי ריינהארט 2005-11-04 18:59:50 UTC
You will find typical examples at the end of
http://yi.wiktionary.org/wiki/Special:Allpages and at
http://yi.wiktionary.org/w/index.php?title=Category:Bugzilla .

Summary is available at http://yi.wiktionary.org/wiki/%E2%80%AB .

These pages were created because I "compiled" the titles with "copy and
paste" (of Hebrew characters) between different Firefox browsers on Windows.

A workaround is to use a useful keyboard as described at http://www.uyip.org/
and avoid this silly "copy and paste".
See http://www.geocities.com/fontboard/yiddish.html : Yiddish Pasekh and Keyman
keyboard for Windows

Regards Reinhardt [[user:gangleri]]
Comment 2 lɛʁi לערי ריינהארט 2005-11-05 02:20:45 UTC
Note:

This bug can cause some confusion in a wiki. I assume that many contributors are
using "copy and paste" to insert a few Hebrew characters.

As you can see from
http://yi.wikipedia.org/wiki/User:Gangleri/tests/%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB
%E2%80%AB can be
- at the beginning of a title
- at the end of a title
- (I assume also inside the title)

There would be different things to do:
- avoid generation of such titles during editing, linking etc.
- clear the database - this is a maintenance issue
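
For the maintenance side, a hedged sketch of how affected titles could be classified by where U+202B occurs. `rle_positions` is a hypothetical helper; a real cleanup would have to run against the wiki's page table:

```python
from urllib.parse import unquote

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING


def rle_positions(title):
    """Return where U+202B occurs: 'leading', 'trailing' and/or 'inner'."""
    found = set()
    if title.startswith(RLE):
        found.add("leading")
    if title.endswith(RLE):
        found.add("trailing")
    # Anything left after trimming both ends sits inside the title.
    if RLE in title.strip(RLE):
        found.add("inner")
    return found


# Title suffix from the URL quoted in this comment.
example = unquote("%E2%80%AB%D7%B0%D7%99%D7%A5%E2%80%AB")
print(sorted(rle_positions(example)))
```
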

Regards Reinhardt [[user:gangleri]]
Comment 3 lɛʁi לערי ריינהארט 2005-11-05 09:40:41 UTC
additions:

I found more incorrect titles (only with a leading RIGHT-TO-LEFT EMBEDDING) in
other projects with
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AB
in addition to
http://yi.wiktionary.org/wiki/Special:Prefixindex/%E2%80%AB

Besides the RTL wikis [[ar:]] [[fa:]] [[he:]] [[ur:]] [[yi:]], their Wiktionaries etc.,
all other projects can be affected.

These wrong titles at [[yi:]] were created by 5 contributors. This shows
that it is a general problem. If contributors copy text from a web page and
paste it (as Hebrew characters) into the browser's URL bar (I mainly use
Firefox myself), they may also copy and paste leading or trailing punctuation
characters, and the browser will *generate* these URLs.

Of course this is not the proper way to generate titles (one should use a
keyboard), and it may or may not be a Firefox issue (I do not know whether it is
reported at bugzilla.org; if not, please do so), but it is common practice for a
significant number of contributors to RTL projects.


You will find the affected titles at:
[[yi:Category:Bugzilla/Unicode_character_RIGHT-TO-LEFT_EMBEDDING_-_U_202B]]
http://yi.wiktionary.org/wiki/Category:Bugzilla/Unicode_character_RIGHT-TO-LEFT_EMBEDDING_-_U_202B

Best regards Reinhardt [[user:gangleri]]
Comment 4 lɛʁi לערי ריינהארט 2005-11-05 10:24:03 UTC
more characters:

I found
http://yi.wikipedia.org/w/index.php?title=%E2%80%AB%D7%A7%D7%94%D7%9C_%D7%A4%D6%BF%D7%95%D7%9F_%E2%80%AB%D7%96%D7%A2%D7%9C%D7%91%D7%A9%D7%98%D7%A2%D7%A0%D7%93%D7%99%D7%A7%D7%A2%D7%A8_%D7%A9%D7%98%D7%90%D6%B7%D7%98%D7%9F%E2%80%AC&redirect=no
which originally contained a trailing %E2%80%AC

Besides
http://www.fileformat.info/info/unicode/char/202b/index.htm
Unicode Character 'RIGHT-TO-LEFT EMBEDDING' (U+202B)
UTF-8 (hex) 0xE2 0x80 0xAB (e280ab)

Compare also:
http://www.fileformat.info/info/unicode/char/202a/index.htm
Unicode Character 'LEFT-TO-RIGHT EMBEDDING' (U+202A)
UTF-8 (hex) 0xE2 0x80 0xAA (e280aa)

http://www.fileformat.info/info/unicode/char/202c/index.htm
Unicode Character 'POP DIRECTIONAL FORMATTING' (U+202C)
UTF-8 (hex) 0xE2 0x80 0xAC (e280ac)

http://www.fileformat.info/info/unicode/char/202d/index.htm
Unicode Character 'LEFT-TO-RIGHT OVERRIDE' (U+202D)
UTF-8 (hex) 0xE2 0x80 0xAD (e280ad)

http://www.fileformat.info/info/unicode/char/202e/index.htm
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
UTF-8 (hex) 0xE2 0x80 0xAE (e280ae)
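
The byte sequences in this list can be re-derived instead of copied by hand. A small sketch using only Python's standard library (`describe` is an illustrative helper name):

```python
import unicodedata
from urllib.parse import quote

# The five directional formatting characters listed above.
CODE_POINTS = [0x202A, 0x202B, 0x202C, 0x202D, 0x202E]


def describe(cp):
    """Return (name, UTF-8 hex, percent-encoded form) for one code point."""
    ch = chr(cp)
    return (unicodedata.name(ch), ch.encode("utf-8").hex(), quote(ch))


table = {cp: describe(cp) for cp in CODE_POINTS}
for cp, (name, utf8_hex, escaped) in table.items():
    print("U+%04X  %-28s %s  %s" % (cp, name, utf8_hex, escaped))
```

The last column is the %nn form as it appears in the URLs in this report.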

Variations / modifications of
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AB
such as
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AA
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AC
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AD
http://yi.wikipedia.org/wiki/Special:Prefixindex/%E2%80%AE
are of limited use because (theoretically) these characters can be included
anywhere in a title.

I will open another enhancement request for a special page allowing an in-string
search of titles by specifying %nn values.
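
A rough sketch of what such an in-string search could do, in contrast to Prefixindex, which only matches the start of a title. `find_titles` and the sample titles are hypothetical; a real special page would query the wiki database rather than a Python list:

```python
from urllib.parse import unquote


def find_titles(titles, percent_needle):
    """Return the titles that contain the decoded needle at any position."""
    needle = unquote(percent_needle)
    return [t for t in titles if needle in t]


# Hypothetical sample titles, for illustration only.
sample = [
    "\u202b\u05f0\u05d9\u05e5",  # leading U+202B
    "\u05f0\u05d9\u05e5\u202c",  # trailing U+202C
    "ab\u202ecd",                # U+202E inside the title
    "\u05f0\u05d9\u05e5",        # clean title
]

hits = find_titles(sample, "%E2%80%AE")
print(len(hits))
```
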
Comment 5 lɛʁi לערי ריינהארט 2005-11-05 10:43:18 UTC
(In reply to comment #4)
> I will open another enhancement request for a special page allowing an in-string
> search of titles by specifying %nn values.

bug 3887: create a special page for instring search of titles specifying %nn values
Comment 6 lɛʁi לערי ריינהארט 2005-11-05 12:09:56 UTC
sorry for this

see
http://yi.wikipedia.org/wiki/%E2%80%AEtest
http://yi.wiktionary.org/wiki/%E2%80%AEtest

You may say: "garbage in, garbage out".

But this seems to be a follow-on error. It seems to interfere with the setup
for case-sensitive / non-case-sensitive titles. The earlier this bug gets
fixed, the fewer follow-on errors we get.
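
One possible explanation, sketched under the assumption that the wiki upper-cases only the first character of a title (this report does not confirm how the two wikis are configured). `ucfirst` is an illustrative stand-in, not the wiki's own function:

```python
def ucfirst(title):
    """Upper-case only the first character, leaving the rest untouched."""
    return title[:1].upper() + title[1:]


plain = ucfirst("test")
# U+202E RIGHT-TO-LEFT OVERRIDE has no upper-case form, so it is left as is
# and the visible first letter stays lower case.
phantom = ucfirst("\u202etest")

print(repr(plain))
print(repr(phantom))
```

Under that assumption the invisible first character absorbs the capitalisation step, so a title that looks like "test" can exist next to "Test".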
Comment 7 lɛʁi לערי ריינהארט 2005-11-06 15:30:30 UTC
sorry for this

http://yi.wiktionary.org/wiki/Special:Whatlinkshere/%E2%80%AB%D7%B0%D7%90%D6%B8%D7%9B%D7%A0%D7%98%D7%90%D6%B8%D7%92
this title is invalid because it starts with %E2%80%AB = Unicode Character
'RIGHT-TO-LEFT EMBEDDING' (U+202B)

However, it is a mess editing BiDi text and generating pages like
http://yi.wiktionary.org/wiki/%D7%B0%D7%90%D6%B8%D7%9A
http://yi.wiktionary.org/wiki/%D7%98%D7%90%D6%B8%D7%92
and also taking care of all these !*%$$€@*# bugs.

These pages look fine, but the titles they link to should be invalid and the
links should not show up red. It would be best to leave them with [[ and ]] brackets,
the same as invalid links.

Best regards Reinhardt [[user:gangleri]]
Comment 8 lɛʁi לערי ריינהארט 2005-11-06 15:49:40 UTC
(In reply to comment #7)
> sorry for this
> and also taking care of all these !*%$$€@*# bugs.

I fixed the links involved, so the Whatlinkshere above is no longer valid. Compare:
http://yi.wiktionary.org/w/index.php?title=%D7%B0%D7%90%D6%B8%D7%9A&diff=4483&oldid=4477
http://yi.wiktionary.org/w/index.php?title=%D7%95%D7%95%D7%90%D6%B8%D7%9B%D7%A0%D7%98%D7%90%D6%B8%D7%92&diff=4482&oldid=4463
and
bug 3894 white space characters, BiDi control characters should show up in diff
Comment 9 lɛʁi לערי ריינהארט 2005-11-13 10:45:21 UTC
Fixing this would later require a validation according to
bug 3904 disallow user pages and user_talk pages starting with lower case on
case sensitive wikis

adding blocks bug 3904
Comment 10 lɛʁi לערי ריינהארט 2005-11-16 20:04:42 UTC
Hi! The code on FiverAlpha is changing.
See http://test.leuksman.com/view/Category:Mimic
and bug 3888 comment 3

The category http://test.leuksman.com/view/Category:Mimic illustrates that the
punctuation characters can be used for fraud and vandalism.

If you are not familiar with these punctuation issues you may *not* notice that
http://test.leuksman.com/edit/User:Brion%E2%80%AD%E2%80%AC?oldid=9812
the edit of this *false account* contains punctuation characters in
[[User:Brion|Brion]].

- one way to see these characters is to check the URL; this is simple if most
of the contained characters are 7-bit ASCII
- another way to see these characters is to place the cursor in the text and
move it through the text area
- another way to see these characters is to select the text with the mouse

Because these characters cause more trouble than benefit, I suggest
suppressing the punctuation characters in titles until a generally accepted
solution can be provided. As it is now, mimic accounts can be created.
This opens the door to fraud and vandalism.
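
A minimal sketch of how such a mimic name could be detected at account creation. `skeleton` and `is_mimic` are hypothetical helpers, and the character set is limited to the five characters listed in comment 4:

```python
from urllib.parse import unquote

# Directional formatting characters U+202A to U+202E.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e"
STRIP_TABLE = {ord(ch): None for ch in BIDI_CONTROLS}


def skeleton(name):
    """Return name with the directional formatting characters removed."""
    return name.translate(STRIP_TABLE)


def is_mimic(candidate, existing_names):
    """True if candidate differs from an existing name only by bidi controls."""
    return candidate not in existing_names and skeleton(candidate) in existing_names


# User name taken from the URL above.
fake = unquote("Brion%E2%80%AD%E2%80%AC")
print(len(fake), repr(skeleton(fake)), is_mimic(fake, {"Brion"}))
```

The name is seven characters long but renders like the five-letter original, which is what makes the account hard to spot.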

Regards Reinhardt [[user:gangleri]]
Comment 11 lɛʁi לערי ריינהארט 2005-11-20 19:23:38 UTC
(In reply to comment #4)
> more characters:
> 
I also found

http://www.fileformat.info/info/unicode/char/200e/index.htm
Unicode Character 'LEFT-TO-RIGHT MARK' (U+200E)
UTF-8 (hex) 0xE2 0x80 0x8E (e2808e)

http://www.fileformat.info/info/unicode/char/200f/index.htm
Unicode Character 'RIGHT-TO-LEFT MARK' (U+200F)
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)

source:
http://www.fileformat.info/info/unicode/block/general_punctuation/list.htm
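
With these two marks the list grows to seven characters. The sketch below shows that all seven share the Unicode general category Cf (format), and why filtering on that category alone would be too broad. `strip_reported` is a hypothetical helper that removes only the characters named in this report:

```python
import unicodedata

# The seven characters named in this report so far.
REPORTED = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

categories = {unicodedata.category(ch) for ch in REPORTED}
print(categories)

# ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are also category Cf, but
# some scripts need them, so an explicit list is the safer filter.
joiners = "\u200c\u200d"
print({unicodedata.category(ch) for ch in joiners})


def strip_reported(title):
    """Remove only the explicitly listed characters from title."""
    return "".join(ch for ch in title if ch not in REPORTED)


print(repr(strip_reported("\u200etest\u200f")))
```
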
Comment 12 lɛʁi לערי ריינהארט 2005-12-10 02:14:39 UTC
Hello!

I would like to CANCEL this request / withdraw it. (There is no such MediaZilla
resolution.)

The request is too restrictive for me, and other methods to avoid the problem / to
fix affected pages should be found.

Such tools are requested at
- Bug 4012: feature request: add a flexible magic character conversion to the
built-in editor
which would allow these characters to be identified in the editor
- Bug 4185: feature request: provide a notification for irregular links
which would alert users before they submit such links / such pages (either new or
changed).
Comment 13 Gabriel Wicke 2006-03-24 11:49:47 UTC
Closing as requested
Comment 14 lɛʁi לערי ריינהארט 2008-03-13 06:15:38 UTC
As the status stands now, this is more of a DUPLICATE of

bug 3696 Unicode Control Characters should be restricted in title text (RLM, LRM, RLO, LRO, . . .)

*** This bug has been marked as a duplicate of bug 3696 ***
