Last modified: 2014-10-20 20:20:22 UTC
One of the longstanding issues with Webstatscollector is that it counts redirects at the HTTP level. So for example:
- Requesting a page with a lower case first letter [1],
- Requesting a page from the desktop site on a mobile device [2], or
- Requesting www.wikipedia.org (first part is www, not a language) [3]
causes two requests to the caches, and webstatscollector counts both, although only a single page is actually shown to the user. As a result, the reported numbers are too high.

Since we're about to deploy a new webstatscollector anyway, and this double counting should not be too hard to fix, let's get it fixed too.

(Note that redirects above the HTTP level are not affected. So for example http://en.wikipedia.org/wiki/Michael_J_Fox (no dot after the J) is, was and will be one request, although it shows the content of http://en.wikipedia.org/wiki/Michael_J._Fox (dot after the J). Such redirects at Wiki level are not affected.)

[1]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 13:13:36 // exit code: 0
cwd: ~
wget -O /dev/null 'http://en.wikipedia.org/wiki/main_page'
--2014-10-08 13:13:39-- http://en.wikipedia.org/wiki/main_page
Resolving en.wikipedia.org... 91.198.174.192
Connecting to en.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Main_page [following]
--2014-10-08 13:13:39-- http://en.wikipedia.org/wiki/Main_page
Reusing existing connection to en.wikipedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'
[ <=> ] 67,779 --.-K/s in 0.1s
2014-10-08 13:13:39 (472 KB/s) - `/dev/null' saved [67779]

[2]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 13:13:39 // exit code: 0
cwd: ~
wget -O /dev/null --user-agent 'iPhone' 'http://en.wikipedia.org/wiki/Main_Page'
--2014-10-08 13:13:44-- http://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org... 91.198.174.192
Connecting to en.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://en.m.wikipedia.org/wiki/Main_Page [following]
--2014-10-08 13:13:44-- http://en.m.wikipedia.org/wiki/Main_Page
Resolving en.m.wikipedia.org... 91.198.174.204
Connecting to en.m.wikipedia.org|91.198.174.204|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'
[ <=> ] 22,002 --.-K/s in 0.05s
2014-10-08 13:13:44 (416 KB/s) - `/dev/null' saved [22002]

[3]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 13:13:44 // exit code: 0
cwd: ~
wget -O /dev/null 'http://www.wikipedia.org/wiki/Main_Page'
--2014-10-08 13:13:49-- http://www.wikipedia.org/wiki/Main_Page
Resolving www.wikipedia.org... 91.198.174.192
Connecting to www.wikipedia.org|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://en.wikipedia.org/wiki/Main_Page [following]
--2014-10-08 13:13:49-- http://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org... 91.198.174.192
Reusing existing connection to www.wikipedia.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'
[ <=> ] 67,565 --.-K/s in 0.1s
2014-10-08 13:13:49 (471 KB/s) - `/dev/null' saved [67565]
(In reply to christian from comment #0)
> Since we're about the deploy a new webstatscollector anyways, and this
> double counting should not be too hard to fix, let's get it fixed too.

+1. https://meta.wikimedia.org/w/index.php?title=Research_talk:Page_view&oldid=10069001#Special_namespace_and_actual_problems (I'll miss stats for Special:MyLanguage, but that was a dirty trick).

Are we talking of 301 and 302 or something more?
(In reply to Nemo from comment #1)
> I'll miss
> stats for Special:MyLanguage, [...]

Yup. I'll miss stats for Special:Random :-(

> Are we talking of 301 and 302 or something more?

301, 302, and 303. 303 basically only affects bots on wikidata. But there, some requests [1] see two 303s, before content gets sent.

[1]
_________________________________________________________________
christian@spencer // jobs: 0 // time: 16:34:01 // exit code: 0
cwd: ~
wget -O /dev/null --header='Accept: text/html' 'https://www.wikidata.org/entity/Q507970'
--2014-10-08 16:34:02-- https://www.wikidata.org/entity/Q507970
Resolving www.wikidata.org... 91.198.174.192
Connecting to www.wikidata.org|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Special:EntityData/Q507970 [following]
--2014-10-08 16:34:03-- https://www.wikidata.org/wiki/Special:EntityData/Q507970
Reusing existing connection to www.wikidata.org:443.
HTTP request sent, awaiting response... 303 See Other
Location: https://www.wikidata.org/wiki/Q507970 [following]
--2014-10-08 16:34:03-- https://www.wikidata.org/wiki/Q507970
Reusing existing connection to www.wikidata.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `/dev/null'
[ <=> ] 81,443 --.-K/s in 0.1s
2014-10-08 16:34:04 (593 KB/s) - `/dev/null' saved [81443]
I'm sure we can count special page requests separately if we want them...
Oh. Counting of Special pages won't change per se. The change only affects those Special pages that happen to come with 301, 302, or 303 HTTP status codes. So for example Special:Search and Special:Export come with HTTP status code 200. They'll still be counted as usual.
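To make the rule discussed above concrete, here is a minimal sketch of the counting condition in Python. It is not the actual webstatscollector code (that lives in the Gerrit changes on this bug); the `(url, http_status)` record shape, the function name `count_page_views`, and the Special:Search URL are assumptions made only for illustration. The status codes and the other URLs are taken from the wget traces in this thread.

```python
from collections import Counter

# Status codes named in this thread as no longer being counted.
REDIRECT_STATUSES = {301, 302, 303}


def count_page_views(requests):
    """Count requests per URL, skipping HTTP-level redirects.

    `requests` is an iterable of (url, http_status) pairs. This record
    shape is a simplification for illustration, not the real log format.
    """
    counts = Counter()
    for url, status in requests:
        if status in REDIRECT_STATUSES:
            # Redirect response: no page was shown for this request.
            continue
        counts[url] += 1
    return counts


# Requests modelled on the traces in this thread, plus one Special page
# that answers with status 200 (illustrative URL).
requests = [
    ("http://en.wikipedia.org/wiki/main_page", 301),
    ("http://en.wikipedia.org/wiki/Main_page", 200),
    ("https://www.wikidata.org/entity/Q507970", 303),
    ("https://www.wikidata.org/wiki/Special:EntityData/Q507970", 303),
    ("https://www.wikidata.org/wiki/Q507970", 200),
    ("http://en.wikipedia.org/wiki/Special:Search", 200),
]
counts = count_page_views(requests)
```

With this input, six requests hit the caches but only three are counted: the lower-case request and the two-step 303 chain on wikidata each contribute a single count, and the Special page with status 200 is counted as before.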
Change 165351 had a related patch set uploaded by QChris: Release fix that stops counting [uU]ndefined and redirects https://gerrit.wikimedia.org/r/165351
Change 165631 had a related patch set uploaded by QChris: Stop counting 301, 302, 303 HTTP status codes https://gerrit.wikimedia.org/r/165631
Change 165725 had a related patch set uploaded by QChris: Stop counting 301, 302, 303 HTTP status codes https://gerrit.wikimedia.org/r/165725
Change 165748 had a related patch set uploaded by QChris: [webstatscollector] Add condition to not count redirects https://gerrit.wikimedia.org/r/165748
Change 165631 merged by jenkins-bot: Stop counting 301, 302, 303 HTTP status codes https://gerrit.wikimedia.org/r/165631
Change 165351 merged by Ottomata: Release fix that stops counting [uU]ndefined and redirects https://gerrit.wikimedia.org/r/165351
Change 165725 merged by QChris: Stop counting 301, 302, 303 HTTP status codes https://gerrit.wikimedia.org/r/165725
The fix was deployed on 2014-10-15 at ~19:01 and is effective.

The last affected files are
  http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141015-200000.gz [1]
  http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/projectcounts-20141015-200000 [1]
  http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/pagecounts-20141015-190000.gz
  http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/projectcounts-20141015-190000

The first files without the [uU]ndefined counts are
  http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141015-210000.gz [1]
  http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/projectcounts-20141015-210000 [1]
  http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/pagecounts-20141015-200000.gz
  http://dumps.wikimedia.org/other/pagecounts-all-sites/2014/2014-10/projectcounts-20141015-200000

[1] When restarting collector and filter for the C implementation of webstatscollector, there was a period (<2 minutes) during which the new collector and the old filter were both running. Hence, during this period a few redirects made it into the 20:00:00 file.
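For anyone processing these dumps in bulk, the cutoffs above can be turned into a small check. This is only a sketch: the function name `counts_http_redirects` and the regular expression are assumptions for illustration, and the only data used are the "last affected" timestamps listed in this comment. It treats every file up to and including the last affected one as affected, which glosses over the fact that the pagecounts-raw 20:00:00 file contains only a few redirects.

```python
import re
from datetime import datetime

# Timestamps of the last affected files, as listed in the comment above.
LAST_AFFECTED = {
    "pagecounts-raw": datetime(2014, 10, 15, 20, 0, 0),
    "pagecounts-all-sites": datetime(2014, 10, 15, 19, 0, 0),
}

# Matches the dataset directory and the trailing YYYYMMDD-HHMMSS stamp of
# a pagecounts or projectcounts file name (with or without ".gz").
_DUMP_URL = re.compile(
    r"/(pagecounts-raw|pagecounts-all-sites)/.*-(\d{8}-\d{6})(?:\.gz)?$"
)


def counts_http_redirects(url):
    """Return True if the dump file at `url` still includes redirect counts."""
    match = _DUMP_URL.search(url)
    if match is None:
        raise ValueError("not a recognised dump file URL: " + url)
    dataset, stamp = match.groups()
    file_time = datetime.strptime(stamp, "%Y%m%d-%H%M%S")
    return file_time <= LAST_AFFECTED[dataset]
```

Calling it on the URLs listed above returns True for the four "last affected" files and False for the four "first files" after the fix.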
Are retroactive adjustments of stats.wikimedia.org pageview stats expected?
Change 165748 merged by Ottomata: [webstatscollector] Add condition to not count redirects https://gerrit.wikimedia.org/r/165748