Last modified: 2014-06-11 16:08:11 UTC
The current implementation of the blocks metric is slightly inconsistent. In raw requests, the metric returns an array, but the default aggregator (proportion) expects a boolean.

I recommend we split blocks into two separate metrics:

1) is_blocked (limited to indefinite blocks, always returning a boolean or an undefined value, and implemented with the same parameters as the threshold metric, with a default t=24)

2) blocks (returning a block count and the associated metadata)

is_blocked should retain proportion as its default aggregator.

Once this is done, we should also reconsider the best output format for blocks, as it currently combines different types (int, timestamp) in the same array and I am not sure this is the most useful response. The new blocks metric should return a concise summary of an account's overall block history (including both temporary and indefinite blocks); the most appropriate aggregator will need to be defined accordingly.

See also BZ ticket #48341 for an issue affecting blocks metric aggregation.
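To make the proposed split concrete, here is a rough Python sketch. It is not the UserMetrics code: the `Block` record, the function names, and the summary fields are hypothetical. It also assumes that `t` means "hours since registration", as in the threshold metric, and that "undefined" means the observation window has not yet elapsed; neither assumption is confirmed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Block:
    """Hypothetical block record; not the real ipblocks/logging schema."""
    timestamp: datetime   # when the block was placed
    indefinite: bool      # True for indefinite blocks


def is_blocked(registration: datetime, blocks: List[Block],
               now: datetime, t: int = 24) -> Optional[bool]:
    """Boolean metric: was the user indefinitely blocked within t hours
    of registration? Returns None (undefined) when the t-hour window has
    not elapsed yet. Both readings are assumptions, not the ticket's spec."""
    window_end = registration + timedelta(hours=t)
    if now < window_end:
        return None
    return any(b.indefinite and registration <= b.timestamp <= window_end
               for b in blocks)


def blocks_summary(blocks: List[Block]) -> Dict[str, object]:
    """Summary metric: counts plus metadata, covering temporary and
    indefinite blocks, with one type per field instead of a mixed array."""
    stamps = [b.timestamp for b in blocks]
    indefinite = sum(1 for b in blocks if b.indefinite)
    return {
        "count": len(blocks),
        "indefinite": indefinite,
        "temporary": len(blocks) - indefinite,
        "first": min(stamps) if stamps else None,
        "last": max(stamps) if stamps else None,
    }


def proportion(values: List[Optional[bool]]) -> Optional[float]:
    """Proportion aggregator over booleans, skipping undefined values."""
    defined = [v for v in values if v is not None]
    if not defined:
        return None
    return sum(defined) / len(defined)
```

With this shape, proportion only ever sees booleans (or undefined values it can skip), while the blocks summary keeps counts and timestamps in separately typed fields.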
(In reply to comment #0)

This makes sense to me. Looking at indefinite blocks versus current blocks is valid because the purpose of the blocks metric is to tell us how many users in a given cohort were rejected by Wikipedia.
On the same note, I'm working on a series of regular expressions that can efficiently categorise blocks. At the moment it's largely accurate for indef blocks from the ipblocks table, and is divided into four categories plus a catch-all:

- Vandalism/other bad-faith actions
- Username problems
- Spam
- Sockpuppetry
- Things not covered by the other categories ("misc")

I'm going to spend some cycles at the hackathon refining them a bit further and running them against the block log to make sure they're compatible. I think the goal after that is, at some point, to work them into UserMetrics and provide a way of accurately bucketing blocked users, giving us slightly more granular data.
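For illustration only, regex bucketing along these lines could look like the sketch below. The patterns are invented placeholders, not the regular expressions described in this comment, and the function names are hypothetical; real block reasons would need far more careful patterns.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Illustrative placeholder patterns only. Order matters: the first
# matching category wins, so more specific categories come first.
CATEGORY_PATTERNS = [
    ("sockpuppetry", re.compile(r"sock|block evasion|checkuser", re.I)),
    ("spam", re.compile(r"spam|advertis", re.I)),
    ("username", re.compile(r"user\s*name|uw-ublock", re.I)),
    ("vandalism", re.compile(r"vandal|disruptive", re.I)),
]


def categorise_block(reason: Optional[str]) -> str:
    """Return the first category whose pattern matches the block reason,
    falling back to 'misc' when nothing matches or the reason is empty."""
    text = reason or ""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(text):
            return category
    return "misc"


def bucket_counts(reasons: Iterable[Optional[str]]) -> Counter:
    """Count blocked users per category for a batch of block reasons."""
    return Counter(categorise_block(r) for r in reasons)
```

Because the first match wins, a reason that mentions both spam and a username problem lands in whichever category is listed earlier; a real implementation would have to decide how to treat such overlaps.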
(In reply to comment #2)

Adding type/reason would be a wonderful future enhancement. I know how difficult it must be to accurately parse the block log, but knowing the different types is of great use.
Actually it's pretty simple, he says after ~20 hours of work on the ipblocks table. logging WHERE log_type = 'block' will be more fun.
[moving tickets as per bug 65903]
This bug has been made invalid through the transition to Wikimetrics. Since User Metrics is no longer actively maintained, I will mark these old bugs as Invalid. Duly noted that the discussion here is interesting and should inform future work in this area.