Last modified: 2013-06-05 14:04:55 UTC
Normally only new pages are available for patrolling or autopatrol, but I see many entries in the patrol log where bots create a new claim and this is marked as autopatrolled (since bots have the autopatrol flag). An example:
10:39, 19 April 2013 VIAFbot (talk | contribs | block) automatically marked revision 27230780 of page Colin McWilliam (Q5145398) patrolled
You can see in the history of the article, http://www.wikidata.org/w/index.php?title=Q5145398&action=history, that this edit added a claim but certainly did not create the page. Here's the claim creation itself: http://www.wikidata.org/w/index.php?title=Q5145398&oldid=27230780&diff=prev
The side effect of this is that the logging table is filled with these entries. It's already up to almost 27 million log entries, the vast majority of them bots marking themselves as autopatrolled. In comparison, en wp has around 48.5 million log entries, and it's been running a whole lot longer with a much larger editor base. If there is some compelling reason for having this patrol setup, then it should at least be documented in giant letters someplace obvious.
For reference, http://bugzilla.wikimedia.org/41907 is the request for enabling RC patrol on wikidata
RC patrol is useful for anyone trying to fight vandalism since it immediately removes trusted or already checked edits. I think it would be more useful to remove useless log entries like "automatically marked revision 27230780 of page Colin McWilliam (Q5145398) patrolled" which I doubt anyone checks or cares about.
Echoing Lego's comment, if there's a way to turn off the log function for *automatic* patrolling, but not manual, that'd be great. In fact, even without any server-related concerns that'd be great, since a log entry accompanying every edit essentially makes the patrol logs impossible to navigate. (There are a handful of circumstances where it's important to know who patrolled a page, e.g. if an obviously vandalistic page has been marked as patrolled and you want to know who did that, so you can explain things to them or pull their rights if necessary.)
Related URL: https://gerrit.wikimedia.org/r/62785 (Gerrit Change Ic999454d001c38dea08746d1e8184f0163cb7330)
I'm not sure what the point is of disabling auto patrol logging entirely. That means patrolling tools will be unable to discover the log entry for a patrolled edit. If "claim" should not be subject to auto-patrolling or patrolling, then that should be disabled instead. Solving the "claim" auto patrol log problem, by disabling the logging for it entirely seems an odd way to solve the problem.
OK, I take your point. OTOH is there any real need for patrolling tools to discover lots and lots of log entries for autopatrolled bot entries? Maybe these can be excluded from the log and the rest left in.
(In reply to comment #5)
> I'm not sure what the point is of disabling auto patrol logging entirely. That means patrolling tools will be unable to discover the log entry for a patrolled edit.
I'm sorry, but I don't quite follow. When do you ever need to see the log entries for *auto*patrolled edits? Don't they just duplicate the page history?
[Speaking with my "Product Manager for Admin Tools" hat on.]
(In reply to comment #5)
> I'm not sure what the point is of disabling auto patrol logging entirely.
The point is to save Wikidata from falling over because the DB can't scale. (Note, BTW, that the proposal is only to disable autopatrol logging for Wikidata, not other wikis; you can see the default setting for MW itself in the commit.)
> That means patrolling tools will be unable to discover the log entry for a patrolled edit.
Indeed. We have lost a lot of MW core functionality over the years because of our inability to design a system that can scale arbitrarily; this is not the first, and sadly won't be the last.
> If "claim" should not be subject to auto-patrolling or patrolling, then that should be disabled instead.
The fault is not with Wikibase (which uses the entirely-reasonable concept of letting wiki users edit things in the same way as on core MW), but with MW core's design not being thought-through in terms of scalability. We already know that the revisions table's growth is a problem; patrolling logs cause a second table to also be a problem.
> Solving the "claim" auto patrol log problem, by disabling the logging for it entirely seems an odd way to solve the problem.
I appreciate that this is disruptive for users of the patrolling logs, most notably the CVU tools, but this is a change made for site stability, and we must accept it.
(In reply to comment #8)
> [Speaking with my "Product Manager for Admin Tools" hat on.]
>
> (In reply to comment #5)
> > Solving the "claim" auto patrol log problem, by disabling the logging for it entirely seems an odd way to solve the problem.
>
> I appreciate that this is disruptive for users of the patrolling logs, most notably the CVU tools, but this is a change made for site stability, and we must accept it.
Can we set up a "rolling" log instead? Like have the log entries vanish after 30 days (recentchanges table length). After that point you can't tell whether the edit was patrolled or not, so it would be pointless to know who patrolled it. This has the advantage of not breaking anything (hopefully), and being able to provide the necessary features that patrolling does, while still reducing the log table.
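A minimal sketch of the "rolling" log idea from the comment above, for illustration only. It assumes a hypothetical `patrol_log` table in an in-memory SQLite database rather than MediaWiki's real logging schema, and the names `make_demo_db`, `prune_patrol_log` and `RETENTION_DAYS` are invented for this example. It only shows that pruning by timestamp is a single bounded delete; it says nothing about how such a job would be scheduled on a production wiki.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # assumption: same window as the recentchanges table
TS_FORMAT = "%Y%m%d%H%M%S"  # fixed-width timestamps sort correctly as strings


def make_demo_db(now):
    """Build an in-memory stand-in for a patrol log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patrol_log ("
        "log_id INTEGER PRIMARY KEY, rev_id INTEGER, log_timestamp TEXT)"
    )
    rows = [
        (100, (now - timedelta(days=45)).strftime(TS_FORMAT)),  # outside window
        (200, (now - timedelta(days=2)).strftime(TS_FORMAT)),   # inside window
    ]
    conn.executemany(
        "INSERT INTO patrol_log (rev_id, log_timestamp) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def prune_patrol_log(conn, now):
    """Delete entries older than the retention window; return how many went."""
    cutoff = (now - timedelta(days=RETENTION_DAYS)).strftime(TS_FORMAT)
    cur = conn.execute(
        "DELETE FROM patrol_log WHERE log_timestamp < ?", (cutoff,)
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    now = datetime(2013, 6, 1, tzinfo=timezone.utc)
    conn = make_demo_db(now)
    print(prune_patrol_log(conn, now), "entries pruned")
```

With a retention window like this the table size is bounded by the patrol rate times the window length instead of growing without limit, which is the property the comment is after.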
(In reply to comment #9)
> Can we set up a "rolling" log instead? Like have the log entries vanish after 30 days (recentchanges table length). After that point you can't tell whether the edit was patrolled or not, so it would be pointless to know who patrolled it.
>
> This has the advantage of not breaking anything (hopefully), and being able to provide the necessary features that patrolling does, while still reducing the log table.
I'm hesitant to set up an entire separate logging structure just for patrolling. At that point we might as well just make patrolling a recentchange itself and log it to the recentchanges table.
(In reply to comment #8)
> [Speaking with my "Product Manager for Admin Tools" hat on.]
>
> (In reply to comment #5)
> > I'm not sure what the point is of disabling auto patrol logging entirely.
>
> The point is to save Wikidata from falling over because the DB can't scale. (Note, BTW, that the proposal is only to disable autopatrol logging for Wikidata, not other wikis; you can see the default setting for MW itself in the commit.)
>
> The fault is not with Wikibase (which uses the entirely-reasonable concept of letting wiki users edit things in the same way as on core MW), but with MW core's design not being thought-through in terms of scalability. We already know that the revisions table's growth is a problem; patrolling logs cause a second table to also be a problem.
How come this is a new problem with Wikidata? We have close to 1,000 wikis with many thousands of wiki admins, stewards, bots, reviewers, rollbackers, patrollers, etc., all of whom make lots of edits that are autopatrolled.
(In reply to comment #7)
> I'm sorry, but I don't quite follow. When do you ever need to see the log entries for *auto*patrolled edits? Don't they just duplicate the page history?
Yes, on a healthy wiki every revision would have a patrol entry at some point (either autopatrol or patrol by another user). This is nothing new. I can imagine this being a scalability problem, but I don't see how that only becomes a problem now. And if it is, I imagine we'll need a solution for all other wikis as well (commons, enwiki, ..). Perhaps operations thinks that could be deferred to later, but if this is as important as some people make it seem, I imagine it is as much a problem elsewhere as for wikidata and we'll need a single solution for all very soon. Is that worth boldly sacrificing the integrity of the database (log entries inconsistently missing for actions taken, when they are usually there for the same action by other users and on all other wikis)?
> > If "claim" should not be subject to auto-patrolling or patrolling, then that should be disabled instead.
>
> > Solving the "claim" auto patrol log problem, by disabling the logging for it entirely seems an odd way to solve the problem.
>
> I appreciate that this is disruptive for users of the patrolling logs, most notably the CVU tools, but this is a change made for site stability, and we must accept it.
Maybe you misunderstood, but I don't see how this relates to the cited statement. I am suggesting that if "claim" creations should not be reviewed through the patrolling system, what's stopping Wikibase from preventing the patrol entry in the first place? Perform the creation like other unpatrollable actions (such as uploads, which create an unpatrollable recentchanges entry and no autopatrol entry). I think it would be unfortunate if claims are not patrollable, but since that seems already accepted, I'm merely suggesting we don't also disable logging for autopatrols outside this area (e.g. edits to regular pages, talk pages, categories, user pages, project pages etc.)
> I am suggesting that if "claim" creations should not be reviewed through the patrolling system, what's stopping Wikibase from preventing the patrol entry in the first place?
Claim creation is a regular edit to an Item page. The RC entry is generated upon save; that is not under the control of the Wikibase extension. I suppose we could hack in and try to suppress patrolling based on some magic property of some edits. But I feel this introduces even more inconsistency (why do some edits require patrolling, and others don't?). Furthermore, Claim creation/changes by users without the autopatrol right should still be patrolled, so suppressing patrolling for this type of edit is not desired.
> Yes, on a healthy wiki every revision would have a patrol entry at some point (either autopatrol or patrol by another user). This is nothing new.
This is indeed an expectation we would break. But I don't see how, why or where this assumption is important or even relevant. Do you have an example?
So here is (part of) why the situation is different on wikidata than anywhere else.
1) Wikidata actually has more edits/sec than anywhere, including en wp.
2) Almost all of those edits are autopatrolled and wind up in the log.
3) On en wp a much tinier proportion of edits wind up in the log, since they don't use RCPatrol.
The number of large projects with RCPatrol on and with lots of bot edits in a short period of time must be, well... one, and that's the one with the issue :-D If we want RCPatrol to scale then we need to rethink the ever-expanding log; even a 30 day retention is better than what we have now. I still claim that bot edits being autopatrolled and then logged is a waste of resources.
See bug 17237 for a solution based on discussion from Amsterdam Hackathon 2013 between Daniel, Tim and Timo.
+1 from me for that approach. It covers all my concerns.
(In reply to comment #14)
> See bug 17237 for a solution based on discussion from Amsterdam Hackathon 2013 between Daniel, Tim and Timo.
So this means that, if needed, we can proceed with this as a temporary hack in the Wikibase extension before we re-work master as part of the to-be-scheduled occasional MW core re-working that Tim agreed to (what are the Ops/growth issues and how quickly can we fix bug 17237?).
(In reply to comment #16)
> So this means that, if needed, we can proceed with this as a temporary hack in the Wikibase extension
Which temporary hack are you referring to? I'm only aware of Ic999454d, which makes logging autopatrol events optional in core. I think we can and should go ahead with that. For now, the default should be to log autopatrolled events, and this should only be disabled for wikidata.org to avoid flooding the log. Once we have the patrolling info in the revision table, the log entries for autopatrol events are redundant, and might be turned off by default.
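The behaviour described in the comment above (keep logging autopatrol events by default, let one wiki switch it off, always log manual patrols) can be sketched roughly as follows. This is an illustration in Python, not the code from Ic999454d; the `PatrolLogger` class and its `log_autopatrol` flag are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PatrolLogger:
    # Hypothetical per-wiki setting; True keeps the existing behaviour.
    log_autopatrol: bool = True
    entries: list = field(default_factory=list)

    def record(self, rev_id, user, auto):
        """Record a patrol action; return True if a log entry was written."""
        if auto and not self.log_autopatrol:
            # The revision still counts as patrolled; only the log row is skipped.
            return False
        self.entries.append((rev_id, user, "auto" if auto else "manual"))
        return True


if __name__ == "__main__":
    quiet_wiki = PatrolLogger(log_autopatrol=False)
    quiet_wiki.record(1, "ExampleBot", auto=True)        # not logged
    quiet_wiki.record(2, "ExamplePatroller", auto=False)  # logged
    print(quiet_wiki.entries)
```

Manual patrols are never suppressed in this sketch, so the case raised earlier in the thread (finding out who marked a vandalistic page as patrolled) would still be answerable from the log.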
(In reply to comment #16)
> So this means that, if needed, we can proceed with this as a temporary hack in the Wikibase extension before we re-work master as part of the to-be-scheduled occasional MW core re-working that Tim agreed to (what are the Ops/growth issues and how quickly can we fix bug 17237?).
The dump-related ops issues can be worked around for now with a functional, if not awesome, hack.
(In reply to comment #17)
> (In reply to comment #16)
> > So this means that, if needed, we can proceed with this as a temporary hack in the Wikibase extension
>
> Which temporary hack are you referring to? I'm only aware of Ic999454d, which makes logging autopatroll events optional in core. I think we can and should go ahead with that.
That is indeed the temporary hack James was referring to (I was sitting next to him when he wrote that). It is temporary because as soon as the _bot and _patrolled fields are moved to the revision table we shall remove logging of autopatrol from core entirely, as I'm pretty sure there is no longer an acceptable use-case for them (especially as long as they continue to be logged with the same log_type and log_action as non-auto patrols; ergo it will fix bug 25799). Keeping it around under a feature flag seems pointless and only encourages a bad user experience for patrollers.
> For now, the default should be to log autopatrolled events, and this should only be disabled for wikidata.org to avoid flooding the log. Once we have the patrolling info in the revision table, the log entries for autopatroll events are redundant, and might be turned off per default.
As James said, this is acceptable, assuming we've considered the feasibility of adding the patrolling info to the revision table soon enough for wikidata not to explode. That it will happen has pretty much been agreed on already; the open question is whether it is worth doing this temporary hack first (thus semi-permanently losing some data about events in the database) or whether it is feasible to get this revision table change through before the problem becomes critical for wikidata.
So, just to be clear, the plan is to commit a temporary hack into core just so a single WMF wiki can shorten their logs until somebody gets around to properly fixing the bug? That doesn't sound like the cleanest solution.
(In reply to comment #20)
> So, just to be clear, the plan is to commit a temporary hack into core just so a single WMF wiki can shorten their logs until somebody gets around to properly fixing the bug? That doesn't sound like the cleanest solution.
As both James and I have said, we have a plan in place to address this in a way that is acceptable to us (software developers & product managers) and will not cause problems for users of wikidata and/or users active in the countervandalism network. In fact, it'll make things better and allow for other interesting new features. However, since making major schema changes requires a significant amount of coordination, database switches, and whatnot, it is not in our hands to make that happen. This is mostly up to platform operations. So, depending on whether our plan can be executed before wikidata explodes, we will have to settle on an intermediate solution. The solution proposed in earlier comments before mine (disabling autopatrol logging on wikidata) is in my opinion not ideal, but it could be worse. I think it is acceptable if and only if it is temporary and only until we finish the larger schema change.
Verified at the Hackathon