Last modified: 2014-03-13 11:15:44 UTC
In the example URL, "No detailed statistics for anonymous users are available for this wiki (performance reasons)". This and several other reports are perfectly ok and useful, but disabled only because the machine they've been run on for several years wasn't powerful enough. I'm sure the WMF can now easily provide Erik with a spare server no longer used elsewhere with a faster CPU, or perform whatever micro-improvements are needed to give us back these crucial stats, even before Kraken and all the other analytics beasts are freed.
The main bottleneck was memory, as the list of anon IPs was huge, and Perl hashes are pretty memory intensive. Even though stat1 has much more memory than bayes, the list of IPs has of course grown over time as well. So what I need to do is collect anon edits in a flat file and sort/aggregate after dump parsing is complete. Then this could stay an integral part of the current job. I added a task in Asana: https://app.asana.com/0/711699995949/2544338304587, but with low priority.
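For illustration only, here is a rough sketch of the "flat file first, sort/aggregate afterwards" idea. It is written in Python rather than the Perl that wikistats uses, and it is not the actual wikistats code. The file layout (one tab-separated line per anon edit, starting with the IP) and the names write_sorted_chunks, count_edits_per_ip and chunk_size are assumptions made up for this example. Only one chunk of lines is held in memory at a time, so memory use no longer grows with the number of distinct IPs.

```python
import heapq
import itertools
import os
import tempfile


def ip_of(line):
    """Return the IP, assumed to be the first tab-separated field of a line."""
    return line.split("\t", 1)[0]


def write_sorted_chunks(lines, chunk_size, tmpdir):
    """Split the flat file into chunk files, each sorted by IP in memory."""
    paths = []
    while True:
        chunk = list(itertools.islice(lines, chunk_size))
        if not chunk:
            break
        chunk.sort(key=ip_of)
        fd, path = tempfile.mkstemp(dir=tmpdir, text=True)
        with os.fdopen(fd, "w") as out:
            out.writelines(chunk)
        paths.append(path)
    return paths


def count_edits_per_ip(flat_file_path, chunk_size=100_000):
    """Yield (ip, edit_count) pairs in sorted IP order, one at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        with open(flat_file_path) as flat:
            paths = write_sorted_chunks(flat, chunk_size, tmpdir)
        files = [open(p) for p in paths]
        try:
            # Merge the sorted chunks lazily; equal IPs end up adjacent.
            merged = heapq.merge(*files, key=ip_of)
            for ip, group in itertools.groupby(merged, key=ip_of):
                yield ip, sum(1 for _ in group)
        finally:
            for f in files:
                f.close()
```

The dump parser would only have to append one line per anon edit to the flat file; the aggregation runs as a separate pass once parsing is complete. The same could probably be done with the system sort utility instead of the chunked merge shown here.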
[mass-moving wikistats reports from Wikimedia→Statistics to Analytics→Wikistats to have stats issues under one Bugzilla product (see bug 42088) - sorry for the bugspam!]
This would be awesome. I was looking for one of the anon stats last month.
Today's post by Erik made me miss them a lot. :( This bug is seriously impairing our ability to understand what's happening on our projects and what new pieces of research mean. <http://infodisiac.com/blog/2013/03/monthly-edits-on-wikimedia-wikis-still-on-the-rise/comment-page-1/#comment-2724>
Revert stats can be deduced from stub dumps, as these now contain checksums. It just hasn't happened yet. In a wider perspective: I'm hoping the stub dumps can be extended with the few metadata items still missing, which would fill in the blanks. Not exactly trivial, but then we could forget about full dumps for wikistats. Things we miss: Does the article contain an internal link (disregarding links in templates, for pragmatic reasons)? Right now we get different article counts in wikistats depending on whether we process the stub dump or the full archive dump. Word count (wikistats first strips headers, HTML and the like, and tries to be (too) smart about (some) non-Western languages, using a conversion factor to deduce word counts from glyph counts). External links and image counts (I guess we could skip both, as they are less requested than the metrics above).
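To make the checksum idea concrete, here is a rough sketch of one common way to detect "identity reverts": a revision counts as a revert if its checksum equals that of an earlier revision of the same page, with at least one other revision in between. This is not what wikistats does today (as said above, it hasn't happened yet). The function name find_identity_reverts, the input shape (rev_id, sha1) and the radius of 15 revisions are assumptions for this example.

```python
from collections import deque


def find_identity_reverts(revisions, radius=15):
    """Detect reverts for one page from revision checksums.

    revisions: iterable of (rev_id, sha1) tuples in chronological order.
    radius: how many preceding revisions to search (an arbitrary choice).
    Yields (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    recent = deque(maxlen=radius)
    for rev_id, sha1 in revisions:
        window = list(recent)
        # Search backwards so the most recent matching revision wins.
        for i in range(len(window) - 1, -1, -1):
            if window[i][1] == sha1:
                reverted = [r for r, _ in window[i + 1:]]
                # An identical consecutive revision undoes nothing; skip it.
                if reverted:
                    yield rev_id, window[i][0], reverted
                break
        recent.append((rev_id, sha1))
```

With revisions fed page by page from a stub dump, this needs only the last few checksums per page in memory. Partial reverts, where the text is not restored byte for byte, are not caught by this approach.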
Thanks for the comment, Erik. I understand that this is hard, but stat1 has something like 8 times the CPUs and 10 times the RAM bayes had, and is mostly idle. While we wait for a permanent solution, having the stats updated every 2 or 3 years would still be very nice and a great improvement.
Nemo, again it comes down to backlog in coding. I can't run the full dump and partial dump concurrently. They will overwrite each other's files. For the largest dumps, one month is not even enough to run the full dump. I'll make a list of open items for dump scripts soon, so we can prioritize.
I see. Prioritizing is good: I'm trying to suggest things that don't add to the backlog; I know that just asking MOAR is stupid. If making them progress at the same time is not possible without more coding, and another server is not available, then I say that delaying the normal updates for a month or two is an acceptable cost to pay in order to fill the last 2 or 3 years of blanks for the full stats.
Mingle feature request: https://mingle.corp.wikimedia.org/projects/analytics/cards/276
Also https://mingle.corp.wikimedia.org/projects/analytics/cards/349
I'm getting sick of this bug... Erik, if I run full counts on full dumps on my own, would I then have CSVs that you can use to fill the blanks on the main wikistats, at least for things like character count etc. (the ten empty columns in the main "Monthly counts & Quarterly rankings" tables)? I don't think I'll be able to do it soon, but if there is a prospective concrete usage I may.
Trying to understand how this is set in the code (as part of bug 62566)... It's currently a bit confusing: https://gerrit.wikimedia.org/r/#/c/118436/ If I understand correctly, almost all of it is controlled by the -e / edits_only flag, with some interaction with -u / reverts_only and some bits which have no configuration flag (yet) but are simply commented out in the code.
This bug is also addressed in bug 60826. More comments in bug 62566.