Last modified: 2014-06-17 08:07:16 UTC
Currently the database dumps are configured to skip certain tables altogether due to privacy concerns. For example, the user table is marked entirely private. The user table contains a number of columns that are not private (e.g., user_id, user_name, user_registration, user_editcount); these should be publicly dumped. The same is true for the ipblocks table, which has a number of columns that can be safely exposed. There may be other tables that can be safely exposed as well; I haven't done a full audit.
(In reply to comment #0)
> The user table contains a number of columns that are not private (e.g.,
> user_id, user_name, user_registration, user_editcount). These should be
> publicly dumped.

You'd think so, but given the amount of effort that goes into various mechanisms for hiding abusive user names from public lists, I can't see them implementing this. ID and registration aren't much use without name, and even editcount has caused people to complain about privacy and ask for removal when it's appeared on public lists before...
The useful info in those columns (e.g. editcount) can be derived in other ways without those columns.
(In reply to comment #2)
> The useful info in those columns (e.g. editcount) can be derived in other
> ways without those columns.

I think user.user_name and user.user_id are fairly important pieces of information (among others). Re-opening this.
(In reply to comment #3)
> I think user.user_name and user.user_id are fairly important pieces of
> information (among others). Re-opening this.

You are welcome to submit a patch to do this.
(In reply to comment #4)
> (In reply to comment #3)
>> I think user.user_name and user.user_id are fairly important pieces of
>> information (among others). Re-opening this.
>
> You are welcome to submit a patch to do this.

Yes. So is anyone else. That misses the larger point: don't close bugs as "wontfix" when the bug is valid. It doesn't matter if it's a low priority for the Wikimedia Foundation (or anyone else); there is no deadline.
Moving bug to Datasets product namespace.
Folks with the dumps are probably only interested in names of users that made at least one (non-hidden) edit. That information is available by grabbing user names and ids from the stubs dumps; while it might take a little work to grep it out, reformat it and pass it to uniq, it's not prohibitive. Alternatively, you could get it from the pages-logging XML file.

You could also get them from the API in a very straightforward way, whether or not they've ever made an edit: http://en.wikipedia.org/w/api.php?action=query&list=allusers

If I were going to produce these dumps, it wouldn't be a straight select, because we'd have to check if the user name is hidden or whether the public is permitted to see it. This means a PHP script and XML output and probably batched queries for a measly two fields that are already available through other means. So I'm really leaning towards WONTFIX.
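The list=allusers approach mentioned above boils down to a small continuation loop. A minimal sketch, assuming a caller-supplied `fetch` callable that issues the HTTP request and returns the decoded JSON (the function name, batch size, and tuple output are illustrative, not anything specified in this thread):

```python
def iter_allusers(fetch, batch_size=500):
    """Yield (userid, name) pairs by paging through list=allusers.

    `fetch` takes a dict of API parameters and returns the decoded JSON
    response; continuation parameters from the `continue` block are fed
    back into the next request until the API stops returning one.
    """
    params = {
        "action": "query",
        "list": "allusers",
        "aulimit": batch_size,
        "format": "json",
    }
    while True:
        resp = fetch(dict(params))
        for user in resp["query"]["allusers"]:
            yield (user["userid"], user["name"])
        cont = resp.get("continue")
        if not cont:
            break  # no continuation block: last page reached
        params.update(cont)  # carries 'aufrom' (and 'continue') forward
```

In real use `fetch` would be a thin wrapper around an HTTP GET against api.php; injecting it keeps the paging logic separate from the transport.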
(In reply to comment #7)
> If I were going to produce these dumps, it wouldn't be a straight select,
> because we'd have to check if the user name is hidden or whether the public
> is permitted to see it. This means a PHP script and XML output and probably
> batched queries for a measly two fields that are already available through
> other means. So I'm really leaning towards WONTFIX.

This bug covers far more than two fields. If you'd like a full list of missing tables and fields from the public dumps, I can put one together.

The English Wikipedia has over 19,000,000 registered users. While the MediaWiki web API can be used to retrieve some of this information, are we really suggesting that polling the API 3,800 times (this assumes batches of 5,000) is the best way to dump the user table? That seems kind of insane.
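The 3,800 figure above is just ceiling division on the commenter's own estimates (the numbers are quoted from the comment, not live counts):

```python
# Batch-count estimate from comment #8; both figures are the
# commenter's quoted numbers, not current values.
registered_users = 19_000_000
batch_size = 5_000  # assumes the higher per-request limit
batches = -(-registered_users // batch_size)  # ceiling division
# batches == 3800
```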
(In reply to comment #8)
> This bug covers far more than two fields.

For the user table we are talking about two fields.

> If you'd like a full list of missing tables and fields from the public
> dumps, I can put one together.

If folks want other partial tables, they should request them in separate bugs. The discussion around privacy and/or necessity for each will likely be different.

> The English Wikipedia has over 19,000,000 registered users. While the
> MediaWiki web API can be used to retrieve some of this information, are we
> really suggesting that polling the API 3,800 times (this assumes batches of
> 5,000) is the best way to dump the user table? That seems kind of insane.

With a script doing the work, as long as it respects maxlag, who cares? Just fire it up, go do other things, and come back at some point to see when it's done.

If you are concerned about multiple users doing this same set of requests instead of a single producer providing the list for download, I can see that as a legitimate complaint. But that too would be a different bug: it would be nice if users had a space where they could put data sets that they generate, for re-use by others. I think that could be managed by interested users off-site, though, without WMF intervention.
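Respecting maxlag, as suggested above, amounts to sending the maxlag parameter with each request and backing off whenever the API reports that the replicas are lagged. A sketch under assumptions: the `do_request` callable, delay value, and function name are illustrative; the `{"error": {"code": "maxlag", ...}}` response shape is how the MediaWiki API signals this condition.

```python
import time


def fetch_respecting_maxlag(do_request, params, maxlag=5,
                            retry_delay=5, sleep=time.sleep):
    """Issue an API request with maxlag set, retrying on lag errors.

    `do_request` takes a dict of parameters and returns the decoded
    JSON response. When the API answers with error code "maxlag",
    wait `retry_delay` seconds and try again; otherwise return the
    response as-is.
    """
    params = dict(params, maxlag=maxlag)
    while True:
        resp = do_request(params)
        error = resp.get("error", {})
        if error.get("code") == "maxlag":
            sleep(retry_delay)  # back off while the replicas catch up
            continue
        return resp
```

The `sleep` parameter is injected only so the retry behaviour can be exercised without real delays; production callers can leave the default.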
(In reply to comment #9)
> (In reply to comment #8)
>> This bug covers far more than two fields.
>
> For the user table we are talking about two fields.

No, as I understand it, we operate on a principle of least restriction: every field that can be safely exposed should be. In the user table, there's no reason not to expose user_id, user_name, user_editcount, and user_registration, as far as I can tell.

> If folks want other partial tables, they should request them in separate
> bugs.

Okay, I'll re-purpose this bug to focus specifically on the user table and will file separate bugs for the other tables.

> With a script doing the work, as long as it respects maxlag, who cares?
> Just fire it up, go do other things, and come back at some point to see
> when it's done.

Respectfully, this doesn't make any sense. User IDs and user names are not some minor detail. Why would we provide dumps of every other table (many of which are actually useless) and then leave out the user table, which is one of the most important tables? That doesn't make sense.

> If you are concerned about multiple users doing this same set of requests
> instead of a single producer providing the list for download, I can see
> that as a legitimate complaint. But that too would be a different bug: it
> would be nice if users had a space that they could put data sets that they
> generate, for re-use by others. I think that could be managed by interested
> users off-site though, without WMF intervention.

Again, this misses the point. Who or what is served by requiring every individual re-user to generate (or re-generate) this data themselves through time-consuming API queries?

The issue surrounding hidden user names is a red herring, in my mind. It makes absolutely no sense to keep data in the database that shouldn't be there. If there are inappropriate user names, they should be changed. Leaving them in the user table indefinitely _guarantees_ that they'll eventually be exposed. (Surely there's a law that covers this.) We should take out the trash.
We have hidden user names, hidden revisions, etc. That's not something that's likely to change, and certainly not something that's going to get addressed by me in the dumps. But again, if you feel strongly that this data should be removed permanently, it could be discussed on wikitech-l or bugzilla.

Given that this is how things are right now, and given that someone can run a script to grab the user names, ids, editcounts and any other publicly exposed fields and we wouldn't be particularly more efficient about it, this is one of those cases where the community, if it had a place to put community-generated datasets for download by others, could take care of it themselves. This is what I suggest.
(In reply to comment #10)
> Okay, I'll re-purpose this bug to focus specifically on the user table and
> will file separate bugs for the other tables.

I decided it made more sense to make this bug a tracking bug with dependencies for each of the private tables. For reference: <https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob;f=xmldumps-backup/worker.py;h=a6256263e5c7d80ee662e75bd67436adee72b3df;hb=ariel#l600>.
Considering that https://www.mediawiki.org/wiki/Research_Data_Proposals was created in 2010 and never went anywhere (probably in part because researchers never found a way to agree on what was highest priority, except one bolded item, which was done by [[m:WikiTeam]] instead), how can we proceed here? We need an easy way that is also tracked long-term; nothing is going to happen overnight anyway. The easiest solution is probably to create one bug for each item and then ask researchers on wiki-research-l and elsewhere to vote (in Bugzilla) on those they'd need/use.