Last modified: 2004-08-16 05:32:27 UTC
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=978668&group_id=34373&atid=411192

Originally submitted by Ilguiz Latypov (ilgiz) 2004-06-24 04:29

This bug description may require changing the browser's Character Encoding to UTF-8 in order to see Unicode characters.

The Unicode link at http://en.wikipedia.org/wiki/Tatar_language was entered as

[[:tt:Şüräle]]

It is displayed correctly as:

tt:Şüräle

and the resulting link is correctly encoded in UTF-8:

http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

However, on clicking the link, I am redirected to a different URL:

http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le

This URL rewriting happens on the server, as the following log shows:

==============================
ilgiz@ei:~$ telnet tt.wikipedia.org 80
Trying 207.142.131.248...
Connected to tt.wikipedia.org.
Escape character is '^]'.
GET /wiki/%C5%9E%FCr%E4le HTTP/1.1
Host: tt.wikipedia.org

HTTP/1.0 301 Moved Permanently
Date: Thu, 24 Jun 2004 02:23:29 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.4
X-Powered-By: PHP/4.3.4
Vary: Accept-Encoding,Cookie
Expires: -1
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Last-Modified: Thu, 24 Jun 2004 02:23:29 GMT
Location: http://tt.wikipedia.org/wiki/%C3%85%C5%BE%C3%BCr%C3%A4le
Content-Type: text/html
X-Cache: MISS from wikipedia.org
Connection: close
==============================

It seems that the URL rewriting attempts to convert the link name from one assumed character set into another. Because the original link was UTF-8 encoded, the additional translation corrupted the link.

------------------------- Additional comments ------------------------

Date: 2004-06-24 07:40
Sender: SF user vibber

The following link is corrupt:

http://tt.wikipedia.org/wiki/%C5%9E%FCr%E4le

The Ş is encoded according to UTF-8, but the literal ü and ä characters are encoded according to 8-bit ISO 8859-1.

Thus, when the URL is received, the wiki sees that it is *not* valid UTF-8 and tries to upconvert it from ISO 8859-1. This correctly converts the ü and ä, but of course the Ş is further corrupted.

To work around the inconsistency, use numerical references or hard-coded % codes in the link.

-------------------------------------------------

Date: 2004-06-24 14:32
Sender: SF user ilgiz

Thanks for getting into the root cause. I've changed the rest of the link into numeric HTML entity characters, and it worked.

I now realize the server should have treated all the characters I entered in the edit field as Unicode only. This would be possible if the ISO-8859-1 character coding of the :en: pages was replaced with UTF-8.

In trying to do its best, the server recognized that the first Unicode character, U+015E, in the POST data couldn't be kept in the page's ISO-8859-1 code page and correctly presented it as the HTML entity &#350;. However, the rest of the Unicode characters in the input field (U+00FC, ü, and U+00E4, ä) could be represented by ISO-8859-1 characters, and the server returned their ISO-8859-1 encoding in the saved page text, in agreement with the page's character coding.

In fact, it could be the Firefox browser's problem, triggered by the presence of characters outside the page's character coding in the form of numeric HTML entities. The browser should have correctly encoded the text of the link into UTF-8, as you mentioned, regardless of the page's character encoding. When I move the mouse over the link, the browser incorrectly shows the first character as 2 ISO-8859-1 characters (probably derived from the UTF-8 representation). Besides, the browser keeps the ISO-8859-1 encoding even when I click the link.

-------------------------------------------------

Date: 2004-06-24 18:40
Sender: SF user vibber

Firefox has nothing to do with it; the link is inherently wrong and would work under no circumstances.
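
The fallback described in the comments can be reproduced with a short Python sketch. This is not the wiki's actual code: normalize_title is a hypothetical helper, and Windows-1252 is assumed as the fallback charset because it reproduces the Location header in the log above (strict ISO 8859-1 would map byte 0x9E to a control character instead of ž).

```python
from urllib.parse import quote, unquote_to_bytes


def normalize_title(path_segment: str, fallback: str = "cp1252") -> str:
    """Return the path segment as percent-escaped UTF-8.

    Hypothetical helper: bytes that form valid UTF-8 are kept as they are;
    anything else is assumed to be in the legacy 8-bit charset and is
    converted to UTF-8 as a whole.
    """
    raw = unquote_to_bytes(path_segment)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # One invalid byte makes the whole string count as legacy-encoded,
        # so the already-correct UTF-8 bytes for U+015E get converted again.
        text = raw.decode(fallback)
    return quote(text, safe="")


# Mixed link from the report: UTF-8 for U+015E, single bytes for ü and ä.
mixed = "%C5%9E%FCr%E4le"
print(normalize_title(mixed))  # %C3%85%C5%BE%C3%BCr%C3%A4le

# A link that is UTF-8 throughout passes through unchanged.
consistent = quote("Şüräle", safe="")
print(consistent)                   # %C5%9E%C3%BCr%C3%A4le
print(normalize_title(consistent))  # %C5%9E%C3%BCr%C3%A4le
```

The first result matches the redirect target in the log, which is consistent with the explanation that the whole title was upconverted once the invalid byte %FC was seen.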
*** This bug has been marked as a duplicate of 65 ***