Last modified: 2005-11-15 08:11:16 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2065, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 65 - Links mixing latin1 literal characters and entities produce bad UTF-8
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Highest major with 2 votes
Target Milestone: ---
Assigned To: Brion Vibber
Keywords: utf8
Duplicates: 66 67 555 909 1554 1591 1597
Depends on:
Blocks: unicode
 
Reported: 2004-08-16 05:21 UTC by Timwi
Modified: 2005-11-15 08:11 UTC (History)

Description Timwi 2004-08-16 05:21:26 UTC
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=978668&group_id=34373&atid=411192
Originally submitted by Ilguiz Latypov (ilgiz)  2004-06-24 04:29


This bug description may require changing the browser's
Character Encoding to UTF-8 in order to see Unicode
characters.

The Unicode link at
  http://en.wikipedia.org/wiki/Tatar_language
was entered as
  [[:tt:Şüräle]]

It is displayed correctly as:
   tt:Şüräle
and the resulting link is correctly encoded in UTF-8:
  http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

However, on clicking the link, I am redirected to a
different URL
  http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le

This URL rewriting happens on the server, as the
following log shows:

==============================
ilgiz@ei:~$ telnet tt.wikipedia.org 80
Trying 207.142.131.248...
Connected to tt.wikipedia.org.
Escape character is '^]'.
GET /wiki/%C5%9E%FCr%E4le HTTP/1.1
Host: tt.wikipedia.org

HTTP/1.0 301 Moved Permanently
Date: Thu, 24 Jun 2004 02:23:29 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.4
X-Powered-By: PHP/4.3.4
Vary: Accept-Encoding,Cookie
Expires: -1
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Thu, 24 Jun 2004 02:23:29 GMT
Location: http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le
Content-Type: text/html
X-Cache: MISS from wikipedia.org
Connection: close

==============================

It seems that the URL rewriting attempts to encode the
link names from a supposed character set into another
one.  Because the original link was UTF-8 encoded, the
additional translation corrupted the link.

------------------------- Additional comments ------------------------
Date: 2004-06-24 07:40
Sender: SF user vibber

The following link is corrupt:
http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

The Ş is encoded according to UTF-8, but the literal ü and ä characters are
encoded according to 8-bit ISO 8859-1. Thus, when received, the wiki sees that
the incoming URL is *not* valid UTF-8 and tries to upconvert it from
ISO 8859-1. This correctly converts the ü and ä, but of course the Ş is
further corrupted.

To work around the inconsistency, use numeric character references or
hard-coded % codes in the link.
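
The upconversion described above can be reproduced with a short Python
sketch. The function name upconvert_path is hypothetical and this is not
the MediaWiki code; it only mimics "if the path is not valid UTF-8, treat
it as a legacy 8-bit encoding". Windows-1252 is assumed as the fallback
because the observed redirect turns byte 0x9E into %C5%BE (U+017E), which
matches the Windows-1252 table rather than strict ISO 8859-1.

```python
from urllib.parse import quote, unquote_to_bytes


def upconvert_path(path: str) -> str:
    """Hypothetical stand-in for the server-side redirect logic."""
    raw = unquote_to_bytes(path)
    try:
        # A consistently UTF-8 path decodes cleanly and is left alone.
        title = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 as a whole: every byte is treated as one
        # legacy 8-bit character (assumed here: Windows-1252).
        title = raw.decode("cp1252")
    # Re-encode the title as UTF-8 percent-escapes.
    return quote(title)


print(upconvert_path("%C5%9E%FCr%E4le"))
# %C3%85%C5%BE%C3%BCr%C3%A4le
```

The two UTF-8 bytes of Ş are each read as a separate character, which is
why the redirect target is longer than the request and points at the
wrong title.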

-------------------------------------------------
Date: 2004-06-24 14:32
Sender: SF user ilgiz

Thanks for getting into the root cause.  I've changed the
rest of the link into numeric HTML entity characters, and it
worked.

I now realize the server should have treated all the
characters I entered in the edit field as Unicode only. 
This would be possible if the ISO-8859-1 character coding of
the :en: pages was replaced with UTF-8.

In trying to do the best, the server recognized that the
first Unicode character U+015E in the POST data couldn't be
kept in the page's ISO-8859-1 code page and correctly
presented it as an HTML entity &#350;.  However, the rest of
the Unicode characters in the input field (U+00FC, ü
and U+00E4, ä) could be represented by the ISO-8859-1
characters, and the server returned their ISO-8859-1
encoding in the saved page text, in agreement with the
page's character coding.
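
A small Python sketch of the behaviour described in this paragraph,
followed by one guess at how the mixed href could then arise. The second
step is an assumption made for illustration (the entity expanded to UTF-8
bytes while the literal bytes pass through untouched); it is not taken
from the server code.

```python
from urllib.parse import quote_from_bytes

typed = "Şüräle"  # what was typed into the edit field

# Saving into an ISO 8859-1 page: characters outside the code page
# become numeric character references, the rest stay single bytes.
stored = typed.encode("latin-1", errors="xmlcharrefreplace")
print(stored)  # b'&#350;\xfcr\xe4le'

# Assumed buggy link building: only the entity is expanded to UTF-8,
# the literal Latin-1 bytes are copied as they are.
mixed = stored.replace(b"&#350;", "Ş".encode("utf-8"))
print(quote_from_bytes(mixed))  # %C5%9E%FCr%E4le
```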

In fact, it could be the Firefox browser's problem triggered
by the presence of characters outside the page's character
coding in the form of numeric HTML entities.  The browser
should have correctly encoded the text of link into UTF-8,
as you mentioned, regardless of the page's character
encoding.  When I move the mouse over the link, the browser
incorrectly shows the first character as 2 ISO-8859-1
characters (probably derived from the UTF-8 representation). 
Besides, the browser keeps the ISO-8859-1 encoding even when
I click the link.

-------------------------------------------------
Date: 2004-06-24 18:40
Sender: SF user vibber

Firefox has nothing to do with it; the link is inherently wrong and would
work under no circumstances.
Comment 1 Timwi 2004-08-16 05:30:48 UTC
*** Bug 67 has been marked as a duplicate of this bug. ***
Comment 2 Timwi 2004-08-16 05:30:55 UTC
*** Bug 66 has been marked as a duplicate of this bug. ***
Comment 3 Ilguiz Latypov 2004-08-16 10:06:12 UTC
Thank you again for pointing out that the original link I used for testing 

  http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

was internally inconsistent.  I agree that replacing all non-ASCII characters in
the editable links is the way to avoid confusion.
And you are right, there wasn't a problem with Firefox.  The Wikipedia server
doesn't encode href attributes correctly.

The above inconsistent link was produced by the wikipedia server from the valid
input

  Tatar myths, including the story of [[:tt:Şüräle]]

in the page at the address

  http://en.wikipedia.org/wiki/Tatar_language

Here is what the page source would look like when typing/pasting the non-ISO
link in the edit form literally:

  Tatar myths,  including the story of <a
href="http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le" class='extiw' 
    title="tt:Å&#382;üräle">tt:&#350;üräle</a></p>

If I encode the link manually as you suggest,

  Tatar myths, including the story of [[:tt:&#350;&#252;r&#228;le]]

then the href attribute produced by the wikipedia server becomes correct:

  Tatar myths,  including the story of <a
href="http://tt.wikipedia.org/wiki/%C5%9E%C3%BCr%C3%A4le" class='extiw' 
    title="tt:">tt:&#350;üräle</a>

In both cases the text node of the <a> tag (the word in front of </a>) is
generated correctly.
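
For comparison, the correct href path for the manually encoded link can be
checked in Python. The helper name href_path is hypothetical; the sketch
resolves the numeric character references to Unicode first and only then
percent-encodes the whole title as UTF-8.

```python
import html
from urllib.parse import quote


def href_path(link_text: str) -> str:
    """Resolve character references, then percent-encode as UTF-8."""
    title = html.unescape(link_text)
    return quote(title)


print(href_path("&#350;&#252;r&#228;le"))  # %C5%9E%C3%BCr%C3%A4le
```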

I still suspect that having to enter non-ASCII wiki names using the "HTML
entity" &#NUMBER; or the "UTF-8/URL" %HEXBYTE encoding might not be convenient.
 People who create new links may not know about the existing problem and enter
the links literally.  

A note on HTML 4.0 suggests that all non-ASCII href characters should be
represented the "UTF-8/URL" way in the HTML page regardless of the character set
of the page:

  http://www.w3.org/TR/PR-html40-971107/appendix/notes.html#h-B.1

Interestingly, the tt.wikipedia.org server seems to convert any non-ASCII
character in href according to the w3 recommendation.  This could be related to
the fact that the tt server produces the UTF-8 pages while the en server
produces ISO-8859-1 pages.  So the quick fix might be as easy as changing the
default encoding of the en server to UTF-8.  If that solution is impossible due
to some other concerns, the href encoding of the en server needs to be fixed.
Comment 4 JeLuF 2004-08-29 20:16:49 UTC
Changing en to UTF-8 is not as trivial as it sounds. 
It would require converting the entire database.
This will probably happen in the future, but not in the next days.

And I agree, it's probably the only sane way to fix this bug.
Always creating interwiki links as UTF-8 would break links to
the few remaining latin-1 wikis. For the time being, please use
the workaround.
Comment 5 Brion Vibber 2004-08-29 20:57:44 UTC
Creating interwiki links as UTF-8 should work fine with the Latin-1 wikis; they'll detect the encoding in the 
URL and convert it. However this means knowing what you're dealing with before trying to interpret the 
character references.
Comment 6 Brion Vibber 2004-09-22 11:56:24 UTC
*** Bug 555 has been marked as a duplicate of this bug. ***
Comment 7 Brion Vibber 2004-11-19 09:32:30 UTC
*** Bug 909 has been marked as a duplicate of this bug. ***
Comment 8 Brion Vibber 2005-01-30 04:50:05 UTC
Changed summary to explain problem
Comment 9 Brion Vibber 2005-02-17 18:54:05 UTC
*** Bug 1554 has been marked as a duplicate of this bug. ***
Comment 10 Ilguiz Latypov 2005-02-17 20:09:01 UTC
Changed the summary from 
  Links mixing latin1 literal characters and entities produce bad UTF-8
to 
 Typing in a non-latin1 interwiki link produces a wrong hex UTF-8 URL  in "a href"
Comment 11 Brion Vibber 2005-02-25 19:54:49 UTC
*** Bug 1591 has been marked as a duplicate of this bug. ***
Comment 12 Brion Vibber 2005-02-25 19:55:22 UTC
Restored legible summary.
Comment 13 Brion Vibber 2005-02-26 03:29:01 UTC
*** Bug 1597 has been marked as a duplicate of this bug. ***
Comment 14 Brion Vibber 2005-02-26 03:29:43 UTC
This is getting reported a lot and it's starting to piss me off. ;)

Raising priority.
Comment 15 lɛʁi לערי ריינהארט 2005-02-26 06:28:01 UTC
Thanks Brion!

http://en.wikipedia.org/wiki/User:Gangleri/tests/bugzilla:1691 gives more
examples and variations for the two characters ú and š.

Regards Reinhardt
Comment 16 Brion Vibber 2005-02-26 10:11:29 UTC
I've managed to whip up some code to normalize 'mixed' interwiki links to UTF-8, 
which fixes nearly all of Gangleri's test cases.

I'll make a couple more tweaks in the morning and check it in.
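
One possible shape for such a normalization, as a Python sketch. This is
not the code that was checked in; the function name and the page_encoding
parameter are made up for illustration. The idea is to bring both the
literal bytes and the character references into Unicode before any
percent-encoding happens.

```python
import html
from urllib.parse import quote


def normalize_interwiki_target(stored: bytes,
                               page_encoding: str = "latin-1") -> str:
    """Hypothetical normalizer for a link target stored on a Latin-1 wiki."""
    text = stored.decode(page_encoding)  # literal 8-bit bytes -> Unicode
    text = html.unescape(text)           # character references -> Unicode
    return quote(text)                   # whole title -> UTF-8 escapes


print(normalize_interwiki_target(b"&#350;\xfcr\xe4le"))
# %C5%9E%C3%BCr%C3%A4le
```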
Comment 17 Brion Vibber 2005-02-27 06:10:50 UTC
Checked in and put live. Seems to be working.

A couple of notes re: Gangleri's test cases...
* The w:ro links are broken due to bug 563.

* NEVER use &#154; or &#x9A; for s-caron. Numeric character references always 
refer to Unicode code points, and U+009A is a reserved control character, *not* 
s-caron. It might appear to work sometimes due to a fluke and crappy workarounds 
for compatibility with a Windows bug, but should definitely not be relied upon. Use 
the real Unicode number, &#353;. The same goes for the other characters in the 
Windows CP1252 extended range (see 
http://en.wikipedia.org/wiki/ISO_8859-1#Windows-1252 )

* For the moment the only named character references that will work in links are the 
ISO 8859-1 ones (s-caron does not appear in ISO 8859-1). Stick with the numbers 
for now.
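
The difference between the code point U+009A and the CP1252 byte 0x9A can
be checked with Python's standard library. This is only a quick
illustration of the point about s-caron, nothing here is specific to
MediaWiki.

```python
import unicodedata

# U+009A as a Unicode code point is a C1 control character.
print(unicodedata.category(chr(0x9A)))  # Cc

# Byte 0x9A in Windows CP1252 is s-caron.
s_caron = bytes([0x9A]).decode("cp1252")
print(s_caron)  # š

# The real Unicode number for s-caron, as used in &#353;.
print(ord(s_caron), hex(ord(s_caron)))  # 353 0x161
```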
Comment 18 Brion Vibber 2005-02-28 00:24:50 UTC
Someone removed all the CCs on this bug and put in an apparently unrelated page as the sample URL without explanation. Removing.
Comment 19 lɛʁi לערי ריינהארט 2005-05-09 20:23:32 UTC
Hi Brion!

Sorry for reopening this bug. You managed to solve 80%.

See: http://jadesukka.homelinux.org:8180/betawiki/Bugzilla_0065
and http://test.leuksman.com/index.php/Bugzilla_0065

THERE links as [[wikipedia:bg:Бела Барток]] will FAIL.
THERE links as [[wikipedia:sr:User:Горан Анђелковић]] are OK ( User is used )
but [[wikipedia:sr:Корисник:Горан Анђелковић]] will FAIL.

----

Please remember that at en.wikipedia links as [[w:he:א]] will fail because this
is RTL.

For your help: At the pages mentioned above:

THERE links as [[wikipedia:he:User:גראנ]] are OK ( User is used ) but
THERE links as [[wikipedia:he:משתמש:גראנ]] will FAIL.

----

Please see also:
http://jadesukka.homelinux.org:8180/betawiki/Bugzilla_unknown_02
http://test.leuksman.com/index.php/Bugzilla_unknown_02
[[en:User:Gangleri/tests/sister projects]]

where two problems are described:
* <nowiki>[[wikibooks:]]</nowiki> generates
** [[wikibooks:]]
** this looks like: '''<span
class="plainlinks">http://en.wikibooks.org/wiki/</span>"
title="wikibooks:">wikibooks:'''
** [[wikibooks:Main Page]] is '''OK'''
* <nowiki>[[rfc:]]</nowiki> generates
** [[rfc:]]
** this looks like: '''<span
class="plainlinks">http://www.rfc-editor.org/rfc/rfc.txt</span>" title="rfc:">rfc'''

----

At the described pages the following links will fail: 

[[wikipedia:da:]]
[[wikipedia:de:]]
[[wikipedia:eo:]]
[[wikipedia:es:]]
[[wikipedia:ar:]]
[[wikipedia:he:]]

It seems to be the same problem as experienced at
[[en:User:Gangleri/tests/sister projects]] where

[[w:]], [[b:]], [[n:]] and [[q:]] are OK but
[[w:en:]], [[b:en:]], [[n:en:]] and [[q:en:]] will FAIL.

Thanks for your patience and fixing these bugs.

Best regards Reinhardt [[user:gangleri]]
Comment 20 Brion Vibber 2005-05-09 21:07:08 UTC
Right-to-leftness probably has nothing at all to do with anything, since that's a 
property of text *DISPLAY* and is far, far, far, far outside anything that can 
affect this.

Most likely a character validity check is tripping on conflicting bytes again.
Comment 21 lɛʁi לערי ריינהארט 2005-06-03 07:48:23 UTC
[[wikipedia:bg:Бела Барток]] still fails

changed Severity to "major" - Regards Reinhardt
Comment 22 Brion Vibber 2005-06-03 08:19:13 UTC
Comments after reopening appear unrelated to this bug, which by definition can only occur on 
Latin-1 wikis. Re-closing; please file a separate bug report.
Comment 23 lɛʁi לערי ריינהארט 2005-06-10 14:49:23 UTC
Parts of comment 19 were reported as individual bugs:
bug 2342 - w:he:(aleph) is a valid interwiki link at en: but not at other wikis
bug 2372 - [[interwiki_foo:]] will generate wrong code and render incorrectly
