Last modified: 2014-08-28 15:27:54 UTC
I don't see a reason anyone would want to use the unindexed 'revision' vs 'revision_userindex'. We can rename 'revision' to revision_noindex, and revision_userindex to revision. Same for the logging table. The problem this causes is tools/users running queries on revision without reading the docs about revision_userindex, and both getting terrible performance themselves and slowing down the server as well.
Agreed. I don't see any good reason why the non-indexed thing behind "revision" should occupy such a privileged name. Is it actually useful for querying? If we can keep "revision_userindex" and make "revision" an alias to it, we can preserve backwards compatibility as well.
According to doc revisions with suppresed username (by os, steward) )are currently only available from revision and not revision_noindex. But i agree that the indexed version should be the default one because most people don't take care when writing sql queries and the special case to have really all revision available is less important because
(In reply to merl from comment #2) > According to doc revisions with suppresed username (by os, steward) )are > currently only available from revision and not revision_noindex. I think you mean revision_userindex :)
The naming convention was selected on the "least surprise" principle; there are a number of rows not available in the _%index views and those rows would be missing without explanation. That said, I'd have no fundamental objection to switching the names around if you think that will make things clearer. I wish MariaDB could be told 'use that view if the query contains a where clause on those columns' :-)
Yeah, I think the rename will actually give people the 'least surprise' - I think running a query you expect to be fast but ends up being super slow and then having to go dig around why is more surprising.
That's not what I meant (performance); currently selection on 'revision' will give you the same result rows as making that same query in production whereas use of 'revision_userindex' may not.
Let me just say that I'm very surprised that either of the tables is missing rows that the other has. Why would "userindex" suggest that some rows have been filtered?
The rows that are filtered in the _userindex views are where the user id and user named have been NULLed because of suppression. You'd never *notice* the missing rows when using _userindex because you'd use that view while having a WHERE clause on those columns and the rows wouldn't have been returned /anyways/. If, on the other hand, you have a where clause on the page columns (to recover a article history), then the rows with suppressed usernames would still show in their proper place as you'd get in production / on-wiki.
Marc, as a regular user of this data, that's very unintuitive. Also: SELECT COUNT(*) FROM revision_userindex WHERE rev_user IS NULL;
That last paragraph should have ended "... if you are using the revision table, but mysteriously missing if you are using revision_userindex".
rev_user may be NULL for reasons /other/ that suppression. :-)
(In reply to Aaron Halfaker from comment #9) > Marc, as a regular user of this data, that's very unintuitive. Indeed it is, albeit clearly documented at: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Tables_for_revision_or_logging_queries_involving_user_names_and_IDs This is what I meant by 'least surprise'; if you use the revision view, you get the expected (every row) result. If you have a where clause on rev_user or rev_user_text (that isn't IS NULL for the obvious reason) then you *also* get every row you would have gotten from revision. The unintuitive result occurs /only/ if you use revision_userindex while not also selecting on the value of rev_user. This is why the _userindex table was not made the default.
Honestly this is a case where we just need to train our users to use the correct view. Nothing should be changed on the server side. As it should reflect production on production tables. From what I have seen from tools labs our documentation/how to/support system is in need of a complete overhaul and a lot of improvements need to be made. This is an easy case where the two tables play different roles and the users who use them should know the differences. Im not sure where the indexes are (rev_user, rev_user_text, multi-colum, or dual indexes) but depending on what your looking for those indexes may or may not help you.
After careful consideration, the necessity of having revision provide the same view as in production wins over the potential confusion of having some rows missing unless you query a different view.