Last modified: 2011-05-15 00:53:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T13097, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 11097 - Page URLs are messed up when using accented letters (non-ascii) mixed with escaped characters.
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Low minor
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2007-08-28 15:50 UTC by Sébastien Leblanc
Modified: 2011-05-15 00:53 UTC (History)

Description Sébastien Leblanc 2007-08-28 15:50:17 UTC
Title says it all. When refreshing a page with differing escaping patterns, it gets all messed up.

Example (from first to last):

Cat%C3%A9gorie:28_ao%C3%BBt_2007 = Catégorie:28_août_2007

Cat%C3%A9gorie:28_août_2007 -> Cat%C3%83%C2%A9gorie:28_ao%C3%BBt_2007 = Catégorie:28 août 2007

Cat%C3%83%C6%92%C3%82%C2%A9gorie:28_ao%C3%BBt_2007 (bunch of garbage, Catégorie:28 août 2007)

I don't know if this is caused by MediaWiki itself or maybe the browser. Or perhaps even Apache may cause this...
Comment 1 Sébastien Leblanc 2007-08-28 15:56:47 UTC
Well it appears it is a Firefox bug, as Iexplore does not escape page URLs at all, and Opera behaves properly when escaping URLs. Maybe MediaWiki is employing an incorrect method, thus forcing Firefox to display an inappropriate behaviour? Who knows! I'll investigate this, and I'll tag the bug as invalid, if needed.
Comment 2 Platonides 2007-08-28 16:39:27 UTC
MediaWiki expects the URL to be in UTF-8. When you provide data which is not UTF-8 (in another encoding), it ''tries'' to convert it, usually assuming it's ISO-8859-1 and converting it to UTF-8. This produces the redirect from typing Catégorie:28_août_2007 to Cat%C3%A9gorie:28_ao%C3%BBt_2007.

However, if you add more ISO-8859-1 non-ASCII characters to the UTF-8 title, it's no longer a valid UTF-8 title. Thus, it performs the translation to UTF-8 again, which is not what the user wanted, but is consistent (Catégorie in UTF-8 = Catégorie in ISO-8859-1).
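
The identity in the last sentence can be checked with a few lines of Python. This is only an illustration of the byte-level effect, not MediaWiki's actual code; the variable names are made up for the example.

```python
from urllib.parse import quote

title = "Catégorie"

# The UTF-8 bytes of the title: 'é' becomes the two bytes 0xC3 0xA9.
utf8_bytes = title.encode("utf-8")

# Reading those same bytes as ISO-8859-1 turns each byte into one character.
misread = utf8_bytes.decode("iso-8859-1")

# Converting the misread text to UTF-8 again and escaping it gives the
# doubled sequence seen in the report.
reencoded = quote(misread)

print(misread)
print(reencoded)
```

Each extra conversion pass lengthens the escaped sequence, which is why the reported URLs keep growing.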
Comment 3 Sébastien Leblanc 2007-08-28 17:45:54 UTC
Well, I guess it is officially invalid.

I found a fix for Firefox, our favourite browser (yes, even yours)

 Code: network.standard-url.encode-utf8 = true

Change it in about:config.

You can also set network.standard-url.escape-utf8 to false, in order to make it behave like Iexplore (no escaped chars at all, such as %C3%BB and the like).

However, this way, you'll have a hard time typing international URLs in languages containing many accented letters, like Swedish or French. :op

Have a nice day!
Comment 4 Brion Vibber 2007-08-28 18:50:35 UTC
Yes, this depends on the browser, browser configuration, operating system, and OS language setting. :P

Generally on modern Mac and Linux systems you'll probably always see UTF-8 since that's usually the locale encoding default; on Windows, modern versions of IE will usually send UTF-8 by default, whereas Firefox will depend on the configuration.

Currently we do some autodetection on the incoming data: if it's valid UTF-8, we keep it intact. If it's not, we assume it's a language-specific alternate encoding and try converting it.
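
A minimal Python sketch of that all-or-nothing autodetection, for illustration only (MediaWiki itself is PHP, and `autodetect_title`, `round_trip` and `FALLBACK` are hypothetical names). The fallback encoding is assumed to be Windows-1252: the `%C6%92` in the last URL of the report is what byte 0x83 becomes under Windows-1252, which strict ISO-8859-1 would not produce. The raw `%FB` inputs are also an assumption about what the browser sent for an unescaped û.

```python
from urllib.parse import quote, unquote_to_bytes

FALLBACK = "cp1252"  # assumption: the wiki's language-specific alternate encoding

def autodetect_title(raw: bytes, fallback: str = FALLBACK) -> str:
    """Keep the input if it is valid UTF-8; otherwise convert ALL of it from `fallback`."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

def round_trip(url_title: str) -> str:
    """Unescape a URL title, autodetect its encoding, and escape it again as UTF-8."""
    return quote(autodetect_title(unquote_to_bytes(url_title)), safe=":_")

# Fully escaped UTF-8 survives unchanged.
print(round_trip("Cat%C3%A9gorie:28_ao%C3%BBt_2007"))

# Hybrid input: 'é' escaped as UTF-8, 'û' sent as the single local byte 0xFB.
# One invalid byte makes the whole title get converted, garbling the 'é'.
print(round_trip("Cat%C3%A9gorie:28_ao%FBt_2007"))

# Refreshing the garbled URL with a raw 'û' again compounds the damage.
print(round_trip("Cat%C3%83%C2%A9gorie:28_ao%FBt_2007"))
```

Under these assumptions the second and third calls reproduce the two garbled URLs from the description.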

With Firefox in a setting where it uses the local encoding, or other similar weird combinations, you may end up with hybrid data like the above, which don't correctly match either way. Now, at least for titles which are fairly commonly put in on the command line it _might_ make sense to try a more adaptive conversion, which would accept valid UTF-8 sequences and perform conversion only on invalid sequences. Since valid UTF-8 sequences are verrrrry rarely found in legitimate non-UTF-8 text, it's unlikely to break purely locally-encoded URLs.
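
One way such an adaptive conversion could look, sketched in Python rather than MediaWiki's PHP. It relies on the standard `codecs.register_error` hook; the handler name `fallback-cp1252`, the function `adaptive_title`, and the choice of Windows-1252 as the local encoding are assumptions made for the example.

```python
import codecs
from urllib.parse import quote, unquote_to_bytes

def _fallback_cp1252(err):
    """Decode only the bytes that are invalid as UTF-8, using Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes undefined in cp1252 become U+FFFD instead of raising.
    return bad.decode("cp1252", errors="replace"), err.end

codecs.register_error("fallback-cp1252", _fallback_cp1252)

def adaptive_title(raw: bytes) -> str:
    """Accept valid UTF-8 sequences as-is; convert only the invalid bytes."""
    return raw.decode("utf-8", errors="fallback-cp1252")

# Hybrid input from the report: UTF-8 'é' plus a raw local-encoding 'û' (0xFB).
hybrid = unquote_to_bytes("Cat%C3%A9gorie:28_ao%FBt_2007")
print(quote(adaptive_title(hybrid), safe=":_"))

# A purely locally-encoded title is still converted correctly.
print(adaptive_title(b"Cat\xe9gorie"))
```

The trade-off is the one described above: a locally-encoded title that happens to contain a valid UTF-8 byte sequence would be read as UTF-8, which should be rare in practice.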

I'll go ahead and keep the bug open.
Comment 5 Brion Vibber 2008-07-30 23:35:59 UTC
Bump -- test this with Firefox 3. Is it still using the locale encoding for URL submissions? I'd imagine it's more UTF-8 friendly now as they fairly aggressively decode UTF-8 chars for attractive display in the URL bar.
Comment 6 Platonides 2011-05-15 00:53:15 UTC
I'm pretty sure it's not a problem anymore. So closed by upstream.
