Last modified: 2011-09-04 14:56:51 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3527, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 1527 - *first* perform Unicode normalisation and check for existence of pages *after* the normalisation
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: 1.4.x
Hardware: All All
Importance: Normal major with 3 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://hi.wiktionary.org
Duplicates: 1375 3860
Depends on: 2399
Blocks: 3985 4917
 
Reported: 2005-02-14 14:48 UTC by Yann Forget
Modified: 2011-09-04 14:56 UTC (History)

Description Yann Forget 2005-02-14 14:48:32 UTC
Hi,  
  
I found a problem with URLs containing some Devanagari characters on the current (14.02.2005) Hindi Wiktionary
project. I tested with Konqueror and Mozilla, and I think it is only present in 1.4.

URLs with some Devanagari characters (at least ज़, ड़ and फ़) can't be resolved. Links appear in red
although the article exists. The same happens when using Unicode numeric references, respectively &#x095B;, &#x095C; and
&#x095E; for the 3 characters above.
  
Examples :  
http://hi.wiktionary.org/wiki/शनिवार  
Article [[हफ़्ता]] exists, but is not accessible on http://hi.wiktionary.org/wiki/हफ़्ता  
  
Thanks a lot, 
Yann
Comment 3 Yann Forget 2005-02-26 22:56:15 UTC
This bug also appears with Firefox and IE on Windows, so it's independent of the browser. 
Comment 4 Yann Forget 2005-03-01 17:35:53 UTC
Same bug as http://bugzilla.wikipedia.org/show_bug.cgi?id=1375 
Comment 5 Yann Forget 2005-03-01 18:27:45 UTC
Here is a way to get out of it, thanks to Muke. Yann 
 
<MukeUTF-8> I have run into the same bug 
<MukeUTF-8> It is because of Unicode normalization 
<MukeUTF-8> the same happened with old articles using the Greek acute accent  
<MukeUTF-8> I think it is the same problem.  i am looking into ti 
<MukeUTF-8> *it 
<yannf> MukeUTF-8, oh interesting 
<MukeUTF-8> the reason is because, say. 
<MukeUTF-8> the mediawiki software takes the "ज़" that you type in 
<MukeUTF-8> and it converts it into the "ज" plus the dot 
<MukeUTF-8> as two separate characters, because the Unicode standard defines them as identical. 
<MukeUTF-8> the problem is that your article with "ज़" in the title was created first... and it was never 
converted 
<yannf> yes, it appears on URLs with letters with a dot 
<yannf> what do you mean by "converted" ? 
<MukeUTF-8> I mean that it converts the one character "ज़" into the two characters "ज" and "़" 
<yannf> how can we solve this ? 
<MukeUTF-8> Someone has to go into the database and convert the old article titles. 
<yannf> there are also articles which are accessible, but the link remains red 
<MukeUTF-8> or at least convert whatever points to the articles 
<yannf> also with a dot in the URL 
<yannf> "at least convert whatever points to the articles" <- but the links seem to be ok 
<MukeUTF-8> I mean in the database 
<MukeUTF-8> I don't really know the details of how it could be fixed. 
<yannf> why it appears only in 1.4 ? 
<MukeUTF-8> Because Unicode normalization was implemented 
<MukeUTF-8> which means, for convenience of storage and searching and whatnot, characters that are defined as 
identical are stored in a canonical form, which may not be the same form as was typed in 
<MukeUTF-8> another example was the Greek characters I mentioned... where "ά" (greek alpha with old acute 
accent) was typed in before, it is now converted to "ά" (greek alpha with modern tonos) 
<MukeUTF-8> So old article titles with "ά" with the old accent can't be reached anymore, because it will always 
be turned into the letter with the modern accent by the software 
<yannf> what if i copy the articles by hand ? 
<MukeUTF-8> if you can get to the article 
<MukeUTF-8> New articles shouldn't have any trouble 
<MukeUTF-8> only ones from before the conversion 
<yannf> yes, but there are also articles which are accessible, but the link remains red 
<MukeUTF-8> that i'm not sure about 
<MukeUTF-8> oh wait 
<MukeUTF-8> When was the last time the page with the link was edited? 
<yannf> http://hi.wiktionary.org/w/index.php?title=Template:-fr-&action=history 
<yannf> Dec 30, 2004 
<MukeUTF-8> because not only old article titles, but old article text was not converted.  So if the link 
contains an "old" character, it will consider it a red link, even though the target page with the "new" 
character exists.    But the conversion is in place now, so if you edit the page, it should convert it to a 
"new" character and work properly.  Try it now, edit the page and hit "preview" 
<MukeUTF-8> (the page is not loading for me atm, or i would check this myself) 
<yannf> if i edit the page, the link becomes red on http://hi.wiktionary.org/wiki/Template:-fr- 
<MukeUTF-8> ah...  
<MukeUTF-8> that's because the page with the "old" character exists, but not the page with the "new" character 
<yannf> yes, i think i understood 
<MukeUTF-8> http://bugzilla.wikipedia.org/show_bug.cgi?id=1375 
<yannf> on this page, the link was red, i edited, and it's now blue, http://hi.wiktionary.org/wiki/Template:kk 
<MukeUTF-8> *nod* 
<MukeUTF-8> the articles can be updated to the new characters by editing them... but the titles need to be 
edited by someone with access to the database, because we can't reach them from here 
<MukeUTF-8> I posted on the wiktionary mailing list for them to do it for the Greek words involved but it never 
happened :\ 
<yannf> well, there are only a handful of them, so i could even create them again, if it solves the pb 
<MukeUTF-8> but then, the things i ask for never seem to happen...  
<yannf> i have a dump of the old database 
<MukeUTF-8> true, you could make them again, though you lose the history 
<MukeUTF-8> and attributions 
<yannf> yes, i am the only editor on the indi wiktionary ;) 
<yannf> *hindi 
<MukeUTF-8> ah, well, then that is probably ok :x) 
<MukeUTF-8> i'm just about the only editor on the latin one, so I know how it is ;) 
<yannf> ;) 
<MukeUTF-8> there is like... one other regular user.  but he only speaks Japanese, and only adds proper 
names...  
<MukeUTF-8> so I don't generally count him o-o 
<yannf> there will be a few lost articles in the database, that the only remaining pb 
<MukeUTF-8> hmm, i suppose i could pull those greek articles out of the old db dumps... 
<yannf> may i copy the log of this chat to the bug report ? 
<yannf> it would be others 
<yannf> it would help others 
<MukeUTF-8> ok 
<MukeUTF-8> I have to go to work now.  ttyl. 
<yannf> ok thanks 
<MukeUTF-8> no problem :) 
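
The conversion Muke describes in the log above can be reproduced outside the wiki. This is a minimal sketch, assuming the normalisation form applied by the software is NFC (the log only says "Unicode normalization"). The helper names `codepoints` and `to_nfc` are made up for this illustration, and the characters are written as escapes so that the example does not depend on how this page itself is encoded.

```python
import unicodedata


def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def to_nfc(text):
    """Normalise to NFC (assumed here to be the form the wiki software applies)."""
    return unicodedata.normalize("NFC", text)


# Precomposed Devanagari letters named in this report (U+095B, U+095C, U+095E),
# plus Greek alpha with the old acute accent (U+1F71).
samples = ["\u095b", "\u095c", "\u095e", "\u1f71"]

for ch in samples:
    print(codepoints(ch), "->", codepoints(to_nfc(ch)))
```

Each Devanagari letter comes out as a base letter followed by the nukta (U+093C), and the Greek letter comes out as U+03AC, so a title stored in the old form no longer matches what the software looks up.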
Comment 6 Yann Forget 2005-03-05 09:41:16 UTC
So I created again the inaccessible articles. Now the old ones need to be deleted: all articles with ड़ 
(&#x095C;), ज़ (&#x095B;) or फ़ (&#x095E;) in the URL created before the conversion have to be deleted.  
 
Comment 7 lɛʁi לערי ריינהארט 2005-12-11 06:24:18 UTC
Hallo!

please see
- http://hi.wiktionary.org/w/index.php?diff=10894&oldid=5133
- http://hi.wiktionary.org/w/index.php?diff=10895&oldid=5009

This fixed the problem both for the section and the category and also
[[wiktionary:hi:अंग्रेज़ी]]. (All links are blue now / some black at
[[wiktionary:hi:अंग्रेज़ी]]).
http://hi.wiktionary.org/w/index.php?title=%E0%A4%85%E0%A4%82%E0%A4%97%E0%A5%8D%E0%A4%B0%E0%A5%87%E0%A4%9C%E0%A4%BC%E0%A5%80&action=purge

A duplicate of this is
Bug 3860: links generated with precombined characters show red despite the fact
that the normalised links exist

best regards reinhardt [[user:gangleri]]
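
The percent-encoded title in the purge link of this comment can be decoded to check which form it uses. This is only a sketch with Python's standard library; the variable names are illustrative, and NFC is again an assumption about the normalisation form.

```python
import unicodedata
from urllib.parse import unquote

# Title portion of the purge URL quoted in this comment.
encoded_title = (
    "%E0%A4%85%E0%A4%82%E0%A4%97%E0%A5%8D%E0%A4%B0"
    "%E0%A5%87%E0%A4%9C%E0%A4%BC%E0%A5%80"
)

title = unquote(encoded_title)  # percent-decoding, interpreted as UTF-8
code_points = [f"U+{ord(ch):04X}" for ch in title]

# True if the title contains JA followed by a separate nukta (the normalised form).
uses_decomposed_ja = "\u091c\u093c" in title
# True if normalising to NFC leaves the title unchanged.
is_nfc = unicodedata.normalize("NFC", title) == title

print(code_points)
print(uses_decomposed_ja, is_nfc)
```

The decoded title contains U+091C followed by U+093C rather than the single code point U+095B, which is consistent with the links turning blue once the pages were re-saved.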
Comment 8 lɛʁi לערי ריינהארט 2005-12-11 06:24:31 UTC
*** Bug 3860 has been marked as a duplicate of this bug. ***
Comment 9 lɛʁi לערי ריינהארט 2005-12-11 08:01:16 UTC
making readjustments for component and dependencies
There are some plans to make this easier in Bugzilla:
- Bug [Bugzilla] 102161: Resolving as duplicate should display field differences
- Bug [Bugzilla] 319803: feature request: when changing product, component etc. display old product, old component, other fields in all required steps
- Bug [Bugzilla] 65382: Let people know when deps exist as resolving duplicate.

Bug 3860
depends on Bug 2399: Unicode normalization interferes with Hebrew and Arabic
with vowels
blocks Bug 3985: character conversion (tracking)

"Component" will be changed to "Internationalization" in a next "edit".
Comment 10 lɛʁi לערי ריינהארט 2005-12-11 09:02:30 UTC
*** Bug 1375 has been marked as a duplicate of this bug. ***
Comment 11 lɛʁi לערי ריינהארט 2005-12-11 09:13:04 UTC
changing summary from
problem on URL with Devanagari characters
to
*first* perform Unicode normalisation and check for existence of pages *after*
the normalisation

Hope that this would be easy to fix. Unicode normalisation should always be
performed *first*.

changing Severity from "normal" to "major".

Bug 1375: Unicode normalization leaves red links
mentions that special:Whatlinkshere might be affected as well. Please verify if
this will be fixed as well.

Hopefully there are no other places in the code where the Unicode normalisation
is *not* performed first.

best regards reinhardt [[user:gangleri]]
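
The failure and the requested behaviour can be illustrated with a toy lookup. This is not MediaWiki code: the set standing in for the page table and the functions `normalize_title` and `page_exists` are hypothetical, and NFC is assumed as the normalisation form.

```python
import unicodedata


def normalize_title(title):
    """Normalise a title to NFC (assumed normalisation form)."""
    return unicodedata.normalize("NFC", title)


def page_exists(title, stored_titles):
    """Normalise first, then check for existence among the stored titles."""
    return normalize_title(title) in stored_titles


# Hypothetical page table holding one title saved before normalisation was
# introduced: it still contains the precomposed letter U+095E.
legacy_titles = {"\u0939\u095e\u094d\u0924\u093e"}

requested = "\u0939\u095e\u094d\u0924\u093e"  # what a link or URL asks for

# The lookup key is normalised but the stored title is not, so the page
# is reported as missing (a red link) although it exists.
print(page_exists(requested, legacy_titles))

# Once the stored titles are normalised as well, the same lookup succeeds.
migrated_titles = {normalize_title(t) for t in legacy_titles}
print(page_exists(requested, migrated_titles))
```

The sketch shows why normalising only the lookup key is not enough: both sides of the comparison have to be in the same form, which is what converting the old titles in the database achieves.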
Comment 12 lɛʁi לערי ריינהארט 2005-12-11 09:23:55 UTC
(In reply to comment #11)
> Bug 1375: Unicode normalization leaves red links
> mentions that special:Whatlinkshere might be affected as well. Please verify if
> this will be fixed as well.

http://la.wiktionary.org/wiki/Special:Whatlinkshere/%E1%BD%88%CE%BE%CF%8D%CF%82
does *not* show "[[wiktionary:la:Ὀξύς]]"

*but* *every* [[Special:Whatlinkshere/foo]] shows [[foo]] in the list.
This is easier to see at [[Special:Whatlinkshere/Tofu]].
Why is this *not* the case at [[wiktionary:la:Special:Whatlinkshere/Ὀξύς]]?
Comment 13 Brion Vibber 2005-12-11 09:29:15 UTC
Removed bogus dependency.
Comment 14 lɛʁi לערי ריינהארט 2005-12-12 00:01:09 UTC
(In reply to comment #6)
> So I created again the inaccessible articles. Now the old ones need to be
> deleted: all articles with ड़ (&#x095C;), ज़ (&#x095B;) or फ़ (&#x095E;) in the
> URL created before the conversion have to be deleted.

please also read the discussion in comment #5

Yann, I understand that there was / there is also *another* problem related to
page titles you cannot access and which should be deleted.
Please go to [[wiktionary:hi:special:Allpages]]. Try to identify whether you see
titles which would not open or which appear to be listed twice. Please do all of the following:
a) make a screen dump and mark some of the titles which have / create problems
b) please provide the links
c) please describe the problem from *your* point of view (what you expect, what
you can do, what does not work)
d) How many namespaces are affected?
Thanks in advance!

best regards reinhardt [[user:gangleri]]
Comment 15 lɛʁi לערי ריינהארט 2006-02-08 15:26:26 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > Bug 1375: Unicode normalization leaves red links
> > mentions that special:Whatlinkshere might be affected as well. Please verify if
> > this will be fixed as well.

"special:Whatlinkshere might be affected as well" see also
[[user:Gangleri/tests/bugzilla/03860]]
Comment 17 Niklas Laxström 2008-07-13 08:54:57 UTC
As far as I can see the problem only affects very old titles, and I think your script that checks invalid titles should catch them.
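
As a rough illustration of what such a check could look like (this is not the script referred to above), the following sketch lists titles that are not already in normalised form. The function name `find_unnormalized` and the sample titles are hypothetical, and NFC is assumed as the normalisation form.

```python
import unicodedata


def find_unnormalized(titles):
    """Return (title, nfc_title) pairs for titles that are not already in NFC."""
    pairs = []
    for title in titles:
        nfc_title = unicodedata.normalize("NFC", title)
        if nfc_title != title:
            pairs.append((title, nfc_title))
    return pairs


# Hypothetical sample, not real database content.
sample_titles = [
    "\u0939\u095e\u094d\u0924\u093e",        # contains precomposed U+095E
    "\u0936\u0928\u093f\u0935\u093e\u0930",  # already in NFC
]

for old, new in find_unnormalized(sample_titles):
    print(ascii(old), "->", ascii(new))
```

Only the first sample title is reported, together with the form it would take after conversion.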
