Last modified: 2014-01-14 19:25:36 UTC
SSL endpoints log %-encoded URLs as \x-encoded URLs

When requesting %-encoded URLs like https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4 (note: “https”), we get a log line for http://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4 (%-encoded) from the cache, but the SSL endpoint additionally adds a log entry using the URL https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4 (\x-encoded).

The latter, \x-encoded URL cannot be fetched, and it distorts the logs. I'd prefer to have no \x-encoded URLs in our logs.

Should we:
* try to fix the SSL endpoints to not log distorted URLs, or
* stop having SSL endpoints in the udp2log log stream altogether? (Currently, https requests get two entries in the log stream: one from the SSL endpoint, and one from the responding cache.)
Actual request and log entries:

* request:

  ___________________________________________________________
  christian@spencer // 0 // 21:32:57 cwd: ~/tmp/encoding-test
  LC_ALL=C wget https://ru.wikipedia.org/wiki/1092_год
  --2013-12-22 21:34:14-- https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4
  Resolving ru.wikipedia.org... 91.198.174.192
  Connecting to ru.wikipedia.org|91.198.174.192|:443... connected.
  HTTP request sent, awaiting response... 200 OK
  Length: unspecified [text/html]
  Saving to: `1092_\320\263\320\276\320\264.2'

  [ <=> ] 80,537 510K/s in 0.2s

  2013-12-22 21:34:15 (510 KB/s) - `1092_\320\263\320\276\320\264' saved [80537]

* Corresponding log entries from udp2log stream:

  amssq57.esams.wikimedia.org 4663480343 2013-12-22T20:34:15 0.596358538 $WIKIMEDIA_IP miss/200 80537 GET http://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4 - text/html; charset=UTF-8 - $MY_IP Wget/ (linux-gnu) - -

  ssl3002 454288692 2013-12-22T20:34:15.377 0.950 $MY_IP -/200 81682 GET https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4 NONE/wikimedia - - - Wget/%20(linux-gnu) - -
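To make the difference between the two log entries concrete, here is a minimal Python sketch that maps the \x-encoded form back to the %-encoded form. It assumes the SSL endpoint's distortion is limited to writing each escaped byte as \xHH instead of %HH, which matches the sample entry above but has not been checked against other URLs. The function name x_to_percent is hypothetical and not part of any existing tooling; this only illustrates how already-written log lines could be normalized, and is not the fix for the endpoint itself.

```python
import re

# Matches one \xHH escape as it appears in the SSL endpoint's log entry.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def x_to_percent(url: str) -> str:
    """Rewrite every \\xHH escape in `url` as the percent-escape %HH.

    Hypothetical helper: assumes \\xHH is the only distortion in the URL.
    """
    return _HEX_ESCAPE.sub(lambda m: "%" + m.group(1).upper(), url)


# URL taken from the ssl3002 log entry in this report.
logged = r"https://ru.wikipedia.org/wiki/1092_\xD0\xB3\xD0\xBE\xD0\xB4"
print(x_to_percent(logged))
# prints https://ru.wikipedia.org/wiki/1092_%D0%B3%D0%BE%D0%B4
```

URLs that contain no \xHH sequences pass through unchanged, so the helper could be applied to every line of a stream without first checking which host wrote it.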
The problem does not show on the sampled-1000, mobile, and zero streams, but it is visible on unsampled streams that do not filter by host, for example the edit stream and the webstatscollector output (and hence stats.grok.se). The exposure of this problem through webstatscollector seems especially problematic, as people have started to add redirects for the non-existent but seemingly requested \x-encoded URLs. :-/ (See bug 58316)
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/1351
Change 105449 had a related patch set uploaded by QChris: Log correctly encoded url with parameters for nginx https://gerrit.wikimedia.org/r/105449
Change 105449 merged by Ottomata: Log correctly encoded url with parameters for nginx https://gerrit.wikimedia.org/r/105449