Last modified: 2014-07-21 15:48:29 UTC
Created attachment 14056 [details] requests logged on 2012-06-09 for hour 19:00
Instead of HTML percent encodings, pages are sometimes requested through Javascript-encoded URLs. The difference is that "\x", rather than the "%" symbol, is used to indicate the start of an escape sequence. These requests are not decoded by the MediaWiki software.
For example, a request for https://en.wikipedia.org/w/index.php?title=Robinson_Can%C3%B3 is correctly decoded (the "%C3%B3" is transformed to an accented "o"), whereas a request for https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3 is not decoded and we're told the page doesn't exist.
As I noted at https://en.wikipedia.org/wiki/Wikipedia:Redirects_for_discussion/Log/2013_December_9#.5Cx22Weird_Al.5Cx22_Yankovic there's been a tremendous increase in the amount of this traffic reaching the WMF projects, from about one request per hour in September 2011 to millions of requests per day in November 2013.
Perhaps it would be desirable to transform "\x" to "%" before passing URLs to rawurldecode() so that these requests will reach the intended pages.
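For illustration, the proposed "\x" to "%" transformation could look roughly like the sketch below. It is written in Python rather than MediaWiki's PHP, `decode_title` is a hypothetical helper name, and `urllib.parse.unquote` stands in for rawurldecode(); none of this is MediaWiki code.

```python
import re
from urllib.parse import unquote

# Matches a backslash, the letter x, and exactly two hex digits, e.g. \xC3
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_title(raw_title: str) -> str:
    """Rewrite \\xHH escapes as %HH, then percent-decode as UTF-8.

    Hypothetical helper for illustration only.
    """
    percent_form = _HEX_ESCAPE.sub(r"%\1", raw_title)
    return unquote(percent_form)


if __name__ == "__main__":
    print(decode_title(r"Robinson_Can\xC3\xB3"))
    print(decode_title("Robinson_Can%C3%B3"))
```

One caveat with this approach: a title that legitimately contains a backslash followed by "x" and two hex digits would be rewritten as well, so the rewrite is lossy for such titles.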
Are you sure the requests are not being handled? Isn't it just that the log is written differently for those requests?
I mean, I see people reasoning that https://en.wikipedia.org/w/index.php?title=Robinson_Can\xC3\xB3 should be reachable through their browser. But I don't think that is correct. It is the technical representation of the input https://en.wikipedia.org/w/index.php?title=Robinson_Canó (a Unicode URL that is NOT percent-encoded). This technical representation is, however, not a valid input method in browser URL fields, if I remember correctly. I suspect people are making assumptions based on an incorrect interpretation of the logs.
In summary:
* The Apache log contains entries that look like Robinson_Can\xC3\xB3, which is a UTF-8 byte sequence (likely a representation of a non-percent-encoded request containing Robinson_Canó, possibly even an IRI request?).
* Log entries are NOT canonical on this front. A request for Robinson_Canó is logged differently than a request for Robinson_Can%C3%B3.
* The statistics of stats.grok.se might not handle these properly (collating them, ignoring them, or just not making them accessible?).
* Someone else made a tool to detect red links, which does make the \x entries accessible/visible.
* Someone is making mass redirects of \x entries to what they consider to be 'proper' entries. This seems to have an effect on the statistics, but I would say that if the statistics/tools are broken, you are most likely only influencing the statistics, not actually fixing something.
* There seems to have been a large increase in these kinds of requests (newer browsers or google/bing.com changing their defaults can easily account for this).
* You cannot input a UTF-8 byte sequence in the URL field of a browser (because there is no need for this, you would just input ó).
* People can't figure out who is wrong and who is right.
Does that sum it up a bit?
(In reply to comment #0)
> Created attachment 14056 [details]
> requests logged on 2012-06-09 for hour 19:00
If you logged something 18 months ago, why do you file a bug report now?
Created attachment 14061 [details] logged requests for titles containing "Robinson_Can" (case-insensitive), from 18 November 2013 and the first hour of 19 November 2013 (from "zcat pagecounts-2013111*z | grep -i Robinson_Can")
The first attachment is an extract from http://dumps.wikimedia.org/other/pagecounts-raw/2012/2012-06/pagecounts-20120609-190000.gz , a log provided by the WMF of incoming requests for that hour. I've uploaded another attachment, which shows how requests for Robinson_Can\xC3\xB3, Robinson_Can%C3%B3 and Robinson_Canó appear as separate entries in the logs.
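To show how the three spellings end up as separate log entries, and how a consumer of the logs might fold them back together, here is a rough Python sketch. It assumes each pagecounts line has the form "project title count bytes" separated by single spaces; `normalize_title` and `collate` are hypothetical helpers, and the counts in the usage example are made up, not taken from the attachments.

```python
import re
from collections import Counter
from urllib.parse import unquote

_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def normalize_title(title: str) -> str:
    """Map \\xHH and %HH spellings of a title to the same decoded string."""
    return unquote(_HEX_ESCAPE.sub(r"%\1", title))


def collate(lines):
    """Sum request counts per (project, decoded title).

    Assumes lines look like 'project title count bytes';
    lines that do not fit that shape are skipped.
    """
    totals = Counter()
    for line in lines:
        parts = line.strip().split(" ")
        if len(parts) != 4 or not parts[2].isdigit():
            continue
        project, title, count, _size = parts
        totals[(project, normalize_title(title))] += int(count)
    return totals


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sample = [
        r"en Robinson_Can\xC3\xB3 5 100",
        "en Robinson_Can%C3%B3 3 100",
        "en Robinson_Canó 2 100",
    ]
    for key, total in collate(sample).items():
        print(key, total)
```

This only changes how the statistics are aggregated; it says nothing about whether the underlying requests were actually served.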
Someone has put a redirect at my Robinson_Can\xC3\xB3 example page, but this bug can be confirmed by noting the "redirected from" or by comparing the responses to these two URLs:
https://commons.wikipedia.org/w/index.php?title=File:\x22Holy_Sheykh_Cotton\x22_\x281890\x29_-_TIMEA.jpg
https://commons.wikipedia.org/w/index.php?title=File:%22Holy_Sheykh_Cotton%22_%281890%29_-_TIMEA.jpg
The first brings up an error page, whereas the second gets decoded and brings up a content page.
There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto and such. It was a workaround (a hack) for browsers that used to handle Unicode poorly, as I recall. I'm reminded of it in this bug report. I'm not sure this is a valid bug.
Change 103241 had a related patch set uploaded by QChris: Add test to guard against encoding mangling of filter https://gerrit.wikimedia.org/r/103241
(In reply to comment #8)
> There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto
That's true, but not related here.
(In reply to comment #0)
> Instead of HTML percent encodings, pages are sometimes requested through
> Javascript-encoded URLs.
There are indeed some requests to \x-encoded URLs. But they mostly come from confused bots/clients. They are far from being page views, and there are really few of them. For example, in October 2013 we had 20 such requests in total in the sampled-1000 logs.
However, you are correct that we see a lot of \x-encoded URLs in webstatscollector output. Webstatscollector processes udp2log data unaltered (see comment #9). It seems \x-encoded URLs all stem from SSL endpoints, and it looks as if those SSL endpoints throw misencoded URL requests into the udp2log stream. Since that is a sufficiently different issue, I filed bug 58876 about it.
A solution of bug 58876 will not address the current call for MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs disappear from the webstatscollector output (thereby also disappear from stats.grok.se, and other consumers).
(In reply to comment #10)
> (In reply to comment #8)
>> There's a magical \x syntax hidden in Parser.php, I believe, for Esperanto
>
> That's true, but not related here.
True, I wasn't really replying to anyone in particular. I was just reminded of it here. :-)
This particular bug falls into the category of "should we try to catch various URL munging?" I think. For example, we probably get _a lot_ of requests that inappropriately omit a trailing ) or inappropriately include a trailing > or ,. Should we try to auto-correct those requests as well? Dunno.
(In reply to comment #11)
> A solution of bug 58876 will not address the current call for
> MediaWiki to decode \x-encoded URLs. But it will make \x-encoded URLs
> disappear from the webstatscollector output (thereby also disappear
> from stats.grok.se, and other consumers).
The fix for bug 58876 just went live, so \x-encoded URLs should soon mostly disappear.
Change 103241 merged by Ottomata: Add test to guard against encoding mangling of filter https://gerrit.wikimedia.org/r/103241