Last modified: 2011-05-15 00:53:15 UTC
Title says it all. When refreshing a page with differing escaping patterns, it gets all messed up.
Example (from first to last):
Cat%C3%A9gorie:28_ao%C3%BBt_2007 = Catégorie:28_août_2007
Cat%C3%A9gorie:28_août_2007 -> Cat%C3%83%C2%A9gorie:28_ao%C3%BBt_2007 = CatÃ©gorie:28 août 2007
Cat%C3%83%C6%92%C3%82%C2%A9gorie:28_ao%C3%BBt_2007 (bunch of garbage, CatÃƒÂ©gorie:28 août 2007)
I don't know whether this is caused by MediaWiki itself or by the browser. Apache might even be the cause...
Well it appears it is a Firefox bug, as Iexplore does not escape page URLs at all, and Opera behaves properly when escaping URLs. Maybe MediaWiki is employing an incorrect method, thus forcing Firefox to display an inappropriate behaviour? Who knows! I'll investigate this, and I'll tag the bug as invalid, if needed.
MediaWiki expects the URL to be in UTF-8. When you provide data which is not UTF-8 (i.e. in another encoding), it ''tries'' to convert it, usually assuming it's ISO-8859-1 and converting it to UTF-8. That is what produces the redirect from typing Catégorie:28_août_2007 to Cat%C3%A9gorie:28_ao%C3%BBt_2007.
However, if you add raw ISO-8859-1 (non-ASCII) characters to a title that is already UTF-8, it's no longer a valid UTF-8 title. Thus, MediaWiki applies the ISO-8859-1 to UTF-8 translation again, this time to the whole title, which is not what the user wanted, but is consistent (the UTF-8 bytes of Catégorie, read as ISO-8859-1, give CatÃ©gorie).
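To make the second step of the example concrete, here is a minimal Python sketch of that double conversion. It is not MediaWiki's code (MediaWiki is written in PHP); the helper name reinterpret_as_latin1 is made up for illustration, and it assumes the browser sent the "û" as a single raw ISO-8859-1 byte (%FB) while the "é" was already escaped as UTF-8.

```python
from urllib.parse import quote, unquote_to_bytes


def reinterpret_as_latin1(raw: bytes) -> bytes:
    """Hypothetical helper: treat every byte as ISO-8859-1 and re-encode as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


# Hybrid request: "é" is already UTF-8 (%C3%A9), "û" is a raw ISO-8859-1 byte (%FB).
mixed = unquote_to_bytes("Cat%C3%A9gorie:28_ao%FBt_2007")

# The bytes are not valid UTF-8 as a whole, so the whole title gets converted,
# including the part that was already UTF-8.
converted = reinterpret_as_latin1(mixed)
result = quote(converted, safe=":_")

print(result)                     # Cat%C3%83%C2%A9gorie:28_ao%C3%BBt_2007
print(converted.decode("utf-8"))  # CatÃ©gorie:28_août_2007
```

The "û" comes out right, but the "é" that was already correct gets encoded a second time, which matches the second line of the example in the report.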
Well, I guess it is officially invalid.
I found a fix for Firefox, our favourite browser (yes, even yours)
Code: network.standard-url.encode-utf8 = true
Change it in about:config.
You can also set network.standard-url.escape-utf8 to false, in order to make it behave like Iexplore (no escaped chars at all, such as %C3%BB and the like).
However, this way, you'll have a hard time typing international URLs in languages containing many accented letters, like Swedish or French. :op
Have a nice day!
Yes, this depends on the browser, browser configuration, operating system, and OS language setting. :P
Generally on modern Mac and Linux systems you'll probably always see UTF-8, since that's usually the default locale encoding; on Windows, modern versions of IE will usually send UTF-8 by default, whereas Firefox will depend on the configuration.
Currently we do some autodetection on the incoming data: if it's valid UTF-8, we keep it intact. If it's not, we assume it's a language-specific alternate encoding and try converting it.
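Roughly, that all-or-nothing autodetection can be sketched like this in Python. This is only an illustration of the idea, not the actual MediaWiki code; the function name normalize_title and the ISO-8859-1 default for the fallback are assumptions.

```python
def normalize_title(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Hypothetical sketch: keep valid UTF-8 intact, otherwise convert everything."""
    try:
        # Valid UTF-8 is kept as-is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything else is assumed to be entirely in the fallback encoding.
        return raw.decode(fallback)


print(normalize_title("Catégorie".encode("utf-8")))       # Catégorie
print(normalize_title("août".encode("iso-8859-1")))       # août
print(normalize_title(b"Cat\xc3\xa9gorie:28_ao\xfbt"))    # CatÃ©gorie:28_août
```

The last call is the hybrid case: one stray non-UTF-8 byte makes the whole title fall through to the fallback branch.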
With Firefox in a setting where it uses the local encoding, or other similar weird combinations, you may end up with hybrid data like the above, which don't correctly match either way. Now, at least for titles which are fairly commonly put in on the command line it _might_ make sense to try a more adaptive conversion, which would accept valid UTF-8 sequences and perform conversion only on invalid sequences. Since valid UTF-8 sequences are verrrrry rarely found in legitimate non-UTF-8 text, it's unlikely to break purely locally-encoded URLs.
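A more adaptive conversion along those lines might look like the following Python sketch. It is one possible way to do it, not something MediaWiki implements as written here: the names latin1_fallback and adaptive_decode are invented for the example, and ISO-8859-1 is assumed as the local encoding. Valid UTF-8 sequences pass through untouched, and only the bytes that fail UTF-8 decoding are converted.

```python
import codecs


def _latin1_fallback(exc: UnicodeError):
    """Error handler: decode only the offending bytes as ISO-8859-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("iso-8859-1"), exc.end


# "latin1_fallback" is a made-up handler name registered for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def adaptive_decode(raw: bytes) -> str:
    """Accept valid UTF-8 sequences; convert only invalid bytes."""
    return raw.decode("utf-8", errors="latin1_fallback")


print(adaptive_decode(b"Cat\xc3\xa9gorie:28_ao\xfbt_2007"))  # Catégorie:28_août_2007
print(adaptive_decode("Catégorie".encode("iso-8859-1")))     # Catégorie
```

The first call is the hybrid title from the report, which now decodes to what the user meant. A purely locally-encoded title still converts correctly, for the reason given above: accented ISO-8859-1 text almost never happens to form valid UTF-8 sequences.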
I'll go ahead and keep the bug open.
Bump -- test this with Firefox 3. Is it still using the locale encoding for URL submissions? I'd imagine it's more UTF-8 friendly now as they fairly aggressively decode UTF-8 chars for attractive display in the URL bar.
I'm pretty sure it's not a problem anymore. So closed by upstream.