Last modified: 2014-03-13 11:15:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T44318, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 42318 - Restore WikiStats features disabled for mere performance reasons
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Wikistats
Version: unspecified
Hardware: All All
Importance: Normal major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
http://stats.wikimedia.org/EN/TablesW...
: analytics
Depends on:
Blocks:
Reported: 2012-11-21 09:24 UTC by Nemo
Modified: 2014-03-13 11:15 UTC (History)

Description Nemo 2012-11-21 09:24:02 UTC
In the example URL, "No detailed statistics for anonymous users are available for this wiki (performance reasons)". This and several other reports are perfectly ok and useful, but disabled only because the machine they've been run on for several years wasn't powerful enough.

I'm sure the WMF can now easily provide Erik with a spare server no longer used elsewhere with a faster CPU, or perform whatever micro-improvements are needed to give us back these crucial stats, even before Kraken and all the other analytics beasts are freed.
Comment 1 Erik Zachte 2012-11-22 14:47:26 UTC
The main bottleneck was memory, as the list of anon IPs was huge, and Perl hashes are pretty memory intensive. Even though stat1 has much more memory than bayes, the list of IPs has of course grown over time as well.

So what I need to do is collect anon edits in a flat file and sort/aggregate them after dump parsing is complete. Then this could stay an integral part of the current job.
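A minimal sketch of the flat-file approach described above, in Python rather than the Perl that wikistats uses. It assumes each anon edit has been reduced to one IP string per record; the function names and the chunk size are hypothetical and not taken from the wikistats code. Records are spilled to sorted chunk files, merged, and counted in a single pass, so memory use is bounded by the chunk size instead of by the number of distinct IPs.

```python
import heapq
import itertools
import os
import tempfile


def write_sorted_chunks(records, chunk_size, tmp_dir):
    """Spill records to sorted chunk files; at most chunk_size records are in memory."""
    paths = []
    iterator = iter(records)
    while True:
        chunk = sorted(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".txt", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.writelines(line + "\n" for line in chunk)
        paths.append(path)
    return paths


def read_lines(handle):
    """Yield lines without the trailing newline so merge order matches chunk order."""
    for line in handle:
        yield line.rstrip("\n")


def count_edits_per_ip(records, chunk_size=100_000):
    """Yield (ip, edit_count) pairs in sorted order without one large in-memory hash."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = write_sorted_chunks(records, chunk_size, tmp_dir)
        handles = [open(path, encoding="utf-8") for path in paths]
        try:
            # heapq.merge streams the already-sorted chunks as one sorted sequence.
            merged = heapq.merge(*(read_lines(handle) for handle in handles))
            # Equal IPs are now adjacent, so groupby can count them one run at a time.
            for ip, group in itertools.groupby(merged):
                yield ip, sum(1 for _ in group)
        finally:
            for handle in handles:
                handle.close()
```

For example, `list(count_edits_per_ip(ips, chunk_size=2))` forces several chunk files even for a handful of records, which is a cheap way to exercise the merge step.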

I added a task in Asana: https://app.asana.com/0/711699995949/2544338304587, but low priority
Comment 2 Andre Klapper 2012-12-03 13:59:45 UTC
[mass-moving wikistats reports from Wikimedia→Statistics to Analytics→Wikistats to have stats issues under one Bugzilla product (see bug 42088) - sorry for the bugspam!]
Comment 3 SJ 2012-12-14 22:52:42 UTC
This would be awesome.  I was looking for one of the anon stats last month.
Comment 4 Nemo 2013-03-09 09:31:22 UTC
Today's post by Erik made me miss them a lot. :( 
This bug is seriously impairing our ability to understand what's happening on our projects and what new pieces of research mean.
<http://infodisiac.com/blog/2013/03/monthly-edits-on-wikimedia-wikis-still-on-the-rise/comment-page-1/#comment-2724>
Comment 5 Erik Zachte 2013-03-09 16:56:27 UTC
Revert stats can be deduced from stub dumps, as these now contain checksums. It just hasn't happened yet.
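A minimal sketch of how reverts could be deduced from per-revision checksums, assuming the common "identity revert" definition: a revision reverts when its checksum matches an earlier, non-adjacent revision of the same page. This is an illustration only, not how wikistats is known to do it; the function name and the `(rev_id, sha1)` tuple layout are hypothetical.

```python
def find_identity_reverts(revisions):
    """Detect identity reverts within one page.

    revisions: iterable of (rev_id, sha1) tuples, oldest first.
    Returns a list of (reverting_rev_id, restored_rev_id, reverted_rev_ids).
    """
    last_seen = {}  # checksum -> index of the most recent revision with that content
    history = []    # rev_ids in chronological order
    reverts = []
    for index, (rev_id, sha1) in enumerate(revisions):
        history.append(rev_id)
        previous = last_seen.get(sha1)
        # A gap of 1 means two consecutive identical revisions (a null edit),
        # which undoes nothing, so only larger gaps count as reverts.
        if previous is not None and index - previous > 1:
            reverted = history[previous + 1:index]
            reverts.append((rev_id, history[previous], reverted))
        last_seen[sha1] = index
    return reverts
```

Only checksums and revision ids are needed, which is why stub dumps would be enough for this kind of count.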

In a wider perspective:

I'm hoping the stub dumps can be extended with the few pieces of metadata that are still missing, which would fill in the blanks. Not exactly trivial, but then we could forget about full dumps for wikistats.

Things we miss:

Does the article contain an internal link? (disregarding links in templates for pragmatic reasons) Now we have different article counts in wikistats when processing stub or full archive dump. 

Word count (wikistats first strips headers, HTML and the like, and tries to be (too) smart about (some) non-Western languages, using a conversion factor to deduce word counts from glyph counts).

External links, image counts (I guess we could skip both, less requested than metrics above)
Comment 6 Nemo 2013-03-09 20:59:20 UTC
Thanks for the comment, Erik. I understand that this is hard, but stat1 has something like 8 times the CPUs and 10 times the RAM bayes had, and is mostly idle. While we wait for a permanent solution, having the stats updated every 2 or 3 years would still be very nice and a great improvement.
Comment 7 Erik Zachte 2013-03-10 10:55:10 UTC
Nemo, again it comes down to a backlog in coding. I can't run the full dump and partial dump concurrently. They will overwrite each other's files. For the largest dumps, one month is not even enough to run the full dump. I'll make a list of open items for dump scripts soon, so we can prioritize.
Comment 8 Nemo 2013-03-10 11:08:01 UTC
I see. Prioritizing is good: I'm trying to suggest things that don't add to the backlog; I know that just asking MOAR is stupid.
If making them progress at the same time is not possible with more coding, and another server is not available, then I say that delaying the normal updates for a month or two is an acceptable cost to pay in order to fill the last 2/3 years of blanks for the full stats.
Comment 9 Erik Zachte 2013-03-16 12:13:14 UTC
Mingle feature request: https://mingle.corp.wikimedia.org/projects/analytics/cards/276
Comment 11 Nemo 2014-02-19 07:34:54 UTC
I'm getting sick of this bug... Erik, if I run full counts on full dumps on my own, would I then have CSVs that you can use to fill the blanks on the main wikistats, at least for things like character count etc. (the ten empty columns in the main "Monthly counts & Quarterly rankings" tables)?
I don't think I'll be able to do it soon, but if there is a prospective concrete usage I may.
Comment 12 Nemo 2014-03-13 09:32:31 UTC
Trying to understand how this is set in the code (as part of bug 62566)...
It's currently a bit confusing: https://gerrit.wikimedia.org/r/#/c/118436/
If I understand correctly, almost all of it is controlled by the -e / edits_only flag, with some interaction with -u / reverts_only and some bits which have no configuration flag (yet) but are simply commented in the code.
Comment 13 Erik Zachte 2014-03-13 11:14:59 UTC
This bug is also addressed in bug 60826

More comments in bug 62566
