Last modified: 2004-08-16 05:32:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2066, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 66 - corrupted link due to URL translation
Status: CLOSED DUPLICATE of bug 65
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2004-08-16 05:29 UTC by Timwi
Modified: 2004-08-16 05:32 UTC (History)

Description Timwi 2004-08-16 05:29:01 UTC
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=978668&group_id=34373&atid=411192
Originally submitted by Ilguiz Latypov (ilgiz)  2004-06-24 04:29


This bug description may require changing the browser's
Character Encoding to UTF-8 in order to see Unicode
characters.

The Unicode link at
  http://en.wikipedia.org/wiki/Tatar_language
was entered as
  [[:tt:Şüräle]]

It is displayed correctly as:
   tt:Şüräle
and the resulting link is correctly encoded in UTF-8:
  http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

However, on clicking the link, I am redirected to a
different URL
  http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le

This URL rewriting happens on the server, as the
following log shows:

==============================
ilgiz@ei:~$ telnet tt.wikipedia.org 80
Trying 207.142.131.248...
Connected to tt.wikipedia.org.
Escape character is '^]'.
GET /wiki/%C5%9E%FCr%E4le HTTP/1.1
Host: tt.wikipedia.org

HTTP/1.0 301 Moved Permanently
Date: Thu, 24 Jun 2004 02:23:29 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.4
X-Powered-By: PHP/4.3.4
Vary: Accept-Encoding,Cookie
Expires: -1
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Thu, 24 Jun 2004 02:23:29 GMT
Location: http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le
Content-Type: text/html
X-Cache: MISS from wikipedia.org
Connection: close

==============================

It seems that the URL rewriting attempts to encode the
link names from a supposed character set into another
one.  Because the original link was UTF-8 encoded, the
additional translation corrupted the link.

------------------------- Additional comments ------------------------
Date: 2004-06-24 07:40
Sender: SF user vibber

The following link is corrupt:
http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

The Ş is encoded according to UTF-8, but the literal ü and ä
characters are encoded according to 8-bit ISO 8859-1. Thus, when
received, the wiki sees that the incoming URL is *not* valid UTF-8
and tries to upconvert it from ISO 8859-1. This correctly converts
the ü and ä, but of course the Ş is further corrupted.

To work around the inconsistency, use numerical references or
hard-coded % codes in the link.
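
The fallback described in this comment can be sketched in Python. This
is only an illustration, not MediaWiki's code: the function name
`legacy_fallback` is hypothetical, and the legacy charset is assumed to
be Windows-1252 rather than strict ISO 8859-1, because the %C5%BE in the
logged Location header is the UTF-8 form of U+017E, which is what
Windows-1252 maps byte 0x9E to (strict ISO 8859-1 would give %C2%9E).

```python
from urllib.parse import quote, unquote_to_bytes


def legacy_fallback(path: str) -> str:
    """Hypothetical sketch of the 'upconvert if not valid UTF-8' step."""
    raw = unquote_to_bytes(path)  # undo the %XX escapes, keep raw bytes
    try:
        raw.decode("utf-8")       # valid UTF-8: nothing to fix
        return path
    except UnicodeDecodeError:
        # Assumption: the legacy charset is Windows-1252 (see lead-in).
        text = raw.decode("cp1252")
    # Re-encode every character as UTF-8 and percent-escape it again.
    return quote(text.encode("utf-8"))


# The mixed-encoding URL from the report: the first character is UTF-8
# (%C5%9E), the other two are single 8-bit bytes (%FC, %E4).
print(legacy_fallback("/wiki/%C5%9E%FCr%E4le"))
# /wiki/%C3%85%C5%BE%C3%BCr%C3%A4le
```

The two bytes of the UTF-8 sequence C5 9E are each treated as a separate
8-bit character, which is why the first letter turns into two characters
in the redirect target while the other two letters come out right.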

-------------------------------------------------
Date: 2004-06-24 14:32
Sender: SF user ilgiz

Thanks for getting to the root cause.  I've changed the
rest of the link into numeric HTML entity characters, and it
worked.
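
For reference, both workaround forms (numeric references and hard-coded
% codes) can be generated rather than typed by hand. A minimal Python
sketch; the helper names `numeric_entities` and `percent_codes` are
made up for this example, and the title is the one implied by the
U+015E, U+00FC and U+00E4 characters discussed in this report.

```python
from urllib.parse import quote


def numeric_entities(title: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity."""
    return title.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def percent_codes(title: str) -> str:
    """Percent-encode the title consistently as UTF-8."""
    return quote(title, safe="")  # quote() encodes str as UTF-8 by default


title = "\u015e\u00fcr\u00e4le"  # U+015E, U+00FC, 'r', U+00E4, 'le'
print(numeric_entities(title))   # &#350;&#252;r&#228;le
print(percent_codes(title))      # %C5%9E%C3%BCr%C3%A4le
```

Either form contains only ASCII, so the result does not depend on the
character encoding of the page it is pasted into.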

I now realize the server should have treated all the
characters I entered in the edit field as Unicode only. 
This would be possible if the ISO-8859-1 character coding of
the :en: pages was replaced with UTF-8.

In trying to do the best, the server recognized that the
first Unicode character U+015E in the POST data couldn't be
kept in the page's ISO-8859-1 code page and correctly
presented it as an HTML entity &#350;.  However, the rest of
the Unicode characters in the input field (U+00FC, ü
and U+00E4, ä) could be represented by the ISO-8859-1
characters, and the server returned their ISO-8859-1
encoding in the saved page text, in agreement with the
page's character coding.
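
The saving behaviour described in this paragraph is what a "numeric
character reference" fallback produces when text is encoded into
ISO-8859-1. A minimal sketch, using Python's built-in
`xmlcharrefreplace` error handler as a stand-in; it is not the wiki's
actual code, and `save_as_latin1` is a hypothetical name.

```python
def save_as_latin1(text: str) -> bytes:
    """Encode to ISO-8859-1, falling back to &#NNN; for other characters."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


stored = save_as_latin1("\u015e\u00fcr\u00e4le")
print(stored)  # b'&#350;\xfcr\xe4le'
```

U+015E has no ISO-8859-1 byte and becomes the entity, while U+00FC and
U+00E4 are stored as the single bytes 0xFC and 0xE4, so the saved link
text ends up in two different representations.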

In fact, it could be the Firefox browser's problem triggered
by the presence of characters outside the page's character
coding in the form of numeric HTML entities.  The browser
should have correctly encoded the text of link into UTF-8,
as you mentioned, regardless of the page's character
encoding.  When I move the mouse over the link, the browser
incorrectly shows the first character as 2 ISO-8859-1
characters (probably derived from the UTF-8 representation). 
Besides, the browser keeps the ISO-8859-1 encoding even when
I click the link.

-------------------------------------------------
Date: 2004-06-24 18:40
Sender: SF user vibber

Firefox has nothing to do with it; the link is inherently wrong
and would work under no circumstances.
Comment 1 Timwi 2004-08-16 05:30:55 UTC

*** This bug has been marked as a duplicate of 65 ***
