Last modified: 2011-04-18 21:18:32 UTC
Created attachment 6637: pcap file showing the problem

I have noticed that if I request the URL http://de.wikinews.org/wiki/Nobelpreis_für_Physik_für_„die_Meister_des_Lichts“ using the text-mode browser links, I get an old (outdated) revision of the page. I have tracked this down and found that it is caused by links sending the special characters in the URL unencoded, directly as 8-bit UTF-8, and not in %xy encoding. If I change the URL to use %xy encoding (http://de.wikinews.org/wiki/Nobelpreis_f%C3%BCr_Physik_f%C3%BCr_%E2%80%9Edie_Meister_des_Lichts%E2%80%9C), I get the current revision. MediaWiki can apparently handle requests with raw UTF-8 in the URL, but for some strange reason it returns an old page revision when the page is requested that way. I will attach a pcap trace which shows first a request using links and then a request using lynx (lynx does the %xy encoding). You will notice the different page revisions returned.
Looks like a problem with the Squid cache not being purged for that encoding.
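For reference, the relationship between the two spellings of the URL can be checked with a short Python sketch. It only uses the standard library's `urllib.parse`; the variable names are illustrative and are not taken from links, lynx, or MediaWiki.

```python
from urllib.parse import quote, unquote

# Path as links sends it: raw UTF-8, no percent-encoding.
raw_path = "/wiki/Nobelpreis_für_Physik_für_„die_Meister_des_Lichts“"

# Path as lynx sends it: every non-ASCII character is encoded as the
# %xy escapes of its UTF-8 bytes ("/" is left alone).
encoded_path = quote(raw_path, safe="/")
print(encoded_path)

# Decoding the escapes gives back the original path, so both spellings
# name the same page even though the strings differ byte for byte.
assert unquote(encoded_path) == raw_path
assert encoded_path != raw_path
```

Anything that compares the two request lines as plain strings, without decoding first, will therefore see two different URLs.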
Mark, do we know whether Squid normalizes percent-encoded chars vs raw chars in URLs when determining canonical URLs for caching? MediaWiki redirects you to the canonical URL for not-quite-canonical page view URLs in order to ensure consistent caching, but I have the vaguest recollection that our detection happens after percent-decoding, so we may not be doing that right at the moment. If Squid is caching them separately, then we might need to fix that up in MediaWiki to be more aggressive about the redirecting.
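To make the "post-percent decoding" concern concrete, here is a rough Python sketch of a redirect check that looks at the request path as it arrived on the wire, before any decoding. This is not MediaWiki code: `canonical_path` and `redirect_target` are hypothetical names, and treating "decode, then re-encode as UTF-8 %xy escapes" as the canonical form is an assumption made only for illustration (it also ignores corner cases such as an escaped `/`).

```python
from typing import Optional
from urllib.parse import quote, unquote


def canonical_path(raw_path: str) -> str:
    """Hypothetical canonical form: decode all escapes, then re-encode."""
    return quote(unquote(raw_path), safe="/:")


def redirect_target(raw_path: str) -> Optional[str]:
    """Return the path to redirect to, or None if raw_path is already canonical.

    The comparison uses the undecoded request path. Comparing after
    decoding would make the raw and the encoded spelling look identical,
    and the raw one would never be redirected.
    """
    canonical = canonical_path(raw_path)
    return canonical if canonical != raw_path else None


# The raw UTF-8 request would be redirected to the encoded spelling...
print(redirect_target("/wiki/Nobelpreis_für_Physik"))
# ...while the encoded request is served directly.
print(redirect_target("/wiki/Nobelpreis_f%C3%BCr_Physik"))
```

With a check like this, only the canonical spelling would ever be served with page content, so a cache in front would hold a single copy to purge.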
Hmm, I've just noticed that Bugzilla seems to have a bug as well, as can be seen in my bug report: the URL does not get linked correctly, the trailing “ is missing...
I've just had a look at the code, and it seems that Squid does not canonicalize URLs with respect to percent-decoding. There is a function url_decode_hex() in url.c which supports this, but it's only used for Gopher (yay ;). I strongly suspect that it's caching them separately, so indeed MediaWiki may need to be adapted for that.
I'm not sure how redirecting could fix this, unless you want to redirect all URLs without percent-encoding to URLs with percent-encoding, which seems ugly. Technically both URLs are exactly the same, so in my opinion this should be fixed in Squid.
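The effect of keying a cache on the unnormalized URL can be illustrated with a small Python sketch. It is not derived from the Squid source: the dictionary stands in for the cache, and `normalize_url` is a hypothetical helper showing one possible normalization (percent-encode raw non-ASCII characters, uppercase the hex digits of existing escapes).

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_url(url: str) -> str:
    """Hypothetical cache-key normalization for the path of a URL."""
    parts = urlsplit(url)
    # "%" is marked safe so existing escapes are kept as they are; raw
    # non-ASCII characters are encoded as UTF-8 %xy escapes.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=")
    # Uppercase the hex digits so %c3 and %C3 compare equal.
    path = _ESCAPE.sub(lambda m: m.group(0).upper(), path)
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment)
    )


raw = "http://de.wikinews.org/wiki/Nobelpreis_für_Physik"
enc = "http://de.wikinews.org/wiki/Nobelpreis_f%C3%BCr_Physik"

# Keyed on the URL as received: two entries, and a purge of one spelling
# leaves the other one stale.
naive_cache = {raw: "old revision", enc: "old revision"}
naive_cache[enc] = "new revision"  # only the encoded spelling is refreshed
print(naive_cache[raw])

# Keyed on the normalized URL: both spellings share one entry.
cache = {normalize_url(enc): "new revision"}
print(cache[normalize_url(raw)])
```

Whether such normalization belongs in the cache or in the application in front of it is exactly the open question in this report; the sketch only shows why the two spellings end up with different cached revisions.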