Last modified: 2014-09-25 13:17:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T60196, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 58196 - Tool Labs: Provide anonymized view of the user_properties table
Tool Labs: Provide anonymized view of the user_properties table
Status: ASSIGNED
Product: Wikimedia Labs
Classification: Unclassified
tools (Other open bugs)
unspecified
All All
: High major
: ---
Assigned To: Luis Villa (WMF Legal)
:
: 56491 (view as bug list)
Depends on:
Blocks: tool-missing-ts-feat
  Show dependency treegraph
 
Reported: 2013-12-09 05:17 UTC by Kunal Mehta (Legoktm)
Modified: 2014-09-25 13:17 UTC (History)
23 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Kunal Mehta (Legoktm) 2013-12-09 05:17:15 UTC
Toolserver has a limited and anonimized view of the user_properties table available. This is used for reports like https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences

mysql> describe user_properties;
+-------------+----------------+------+-----+---------+-------+
| Field       | Type           | Null | Key | Default | Extra |
+-------------+----------------+------+-----+---------+-------+
| up_user     | int(11)        | NO   |     | 0       |       |
| up_property | varbinary(255) | YES  |     | NULL    |       |
| up_value    | blob           | YES  |     | NULL    |       |
+-------------+----------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> describe user_properties_anonym;
+-------------------------+----------------+------+-----+---------+-------+
| Field                   | Type           | Null | Key | Default | Extra |
+-------------------------+----------------+------+-----+---------+-------+
| up_property             | varbinary(255) | YES  |     | NULL    |       |
| up_value                | blob           | YES  |     | NULL    |       |
| ts_user_touched_cropped | varbinary(8)   | NO   |     |         |       |
+-------------------------+----------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

Only preferences that are considered to be public are replicated, the list from enwiki_p is:
mysql> select distinct(up_property) from user_properties;
+----------------+
| up_property    |
+----------------+
| disablemail    |
| fancysig       |
| gender         |
| language       |
| nickname       |
| skin           |
| timecorrection |
| variant        |
+----------------+
Comment 1 MZMcBride 2013-12-09 05:20:31 UTC
Coren says there's a requirement that the Wikimedia Foundation legal team sign off on changes of this nature. I'm not able to verify this requirement. The process seems to be largely undocumented, though as far as I'm aware, there was never any such requirement during the Toolserver's reign: it was merely left to root discretion.
Comment 2 Marc A. Pelletier 2013-12-19 19:09:44 UTC
Adding Luis Villa so that he can chime in.
Comment 3 Luis Villa (WMF Legal) 2013-12-19 22:30:07 UTC
I don't think we've formally set out a policy that new field replication has to run through legal, but it's a good habit to be in.

Legoktm, when you say "Only preferences that are considered to be public are replicated, the list from enwiki_p is..." is that the list of things currently considered public? Or the list of things that could be made public?
Comment 4 Kunal Mehta (Legoktm) 2013-12-20 01:36:06 UTC
(In reply to comment #3)
> Legoktm, when you say "Only preferences that are considered to be public are
> replicated, the list from enwiki_p is..." is that the list of things
> currently
> considered public? Or the list of things that could be made public?

Since it's already replicated onto the Toolserver, I assume they are currently considered to be public.

For example, if you go to Special:EmailUser, you can tell if a user has disabled email or not (disableemail). fancysig/nickname are just a user's signature, which is public as soon as a user adds ~~~~ to any page. gender is also exposed by the software and the API.
Comment 5 Luis Villa (WMF Legal) 2013-12-20 01:39:32 UTC
If a field is already exposed by the API/website, then that's fine. Just wasn't clear which of those fields that was already true for.
Comment 6 Marc A. Pelletier 2013-12-20 01:45:16 UTC
(In reply to comment #4)
> For example, if you go to Special:EmailUser, you can tell if a user has
> disabled email or not (disableemail).

That's not actually correct; you can know whether someone has disabled their email /or/ never confirmed one to begin with (i.e.: it's emailconfirmed && !disableemail).
Comment 7 Kunal Mehta (Legoktm) 2013-12-20 02:14:19 UTC
(In reply to comment #6)
> (In reply to comment #4)
> > For example, if you go to Special:EmailUser, you can tell if a user has
> > disabled email or not (disableemail).
> 
> That's not actually correct; you can know whether someone has disabled their
> email /or/ never confirmed one to begin with (i.e.: it's emailconfirmed &&
> !disableemail).

Ah, didn't know that, thanks. Though, given that the AbuseFilter has been giving out emailconfirmed timestamps (bug 20005), it's probably a moot point.
Comment 8 Marc A. Pelletier 2013-12-20 03:34:27 UTC
(In reply to comment #7)
> Ah, didn't know that, thanks. Though, given that the AbuseFilter has been
> giving out emailconfirmed timestamps (bug 20005), it's probably a moot point.

That seems to me a good reason to fix 20005 more than anything (especially since that isn't in the dumps).
Comment 9 Silke Meyer (WMDE) 2014-02-04 16:53:50 UTC
So, am I right in assuming there is no legal barrier to make that table available? 
Legoktm, which of your tools depends on this to migrate?
Comment 10 Kunal Mehta (Legoktm) 2014-02-04 18:59:07 UTC
(In reply to comment #9)

> Legoktm, which of your tools depends on this to migrate?

Database reports, specifically [[Wikipedia:Database reports/User preferences]].
Comment 11 Dan Michael Heggø 2014-03-13 21:11:06 UTC
The gender gap plots by emijrp also depends on this:

https://toolserver.org/~emijrp/wmcharts/

(they don't work atm. for some other reason)
Comment 12 Luis Villa (WMF Legal) 2014-03-13 21:31:28 UTC
Sorry, this got lost in the shuffle. To respond to Silke: I'm still not entirely clear which fields we're talking about replicating. If the list from en_wikip: 

+----------------+
| up_property    |
+----------------+
| disablemail    |
| fancysig       |
| gender         |
| language       |
| nickname       |
| skin           |
| timecorrection |
| variant        |
+----------------+

Is what we're actually talking about, then yes - no legal barrier to making that information available.
Comment 13 Krinkle 2014-03-13 22:45:51 UTC
I'm hoping we can get permission to provide this view on Tool Labs without a whitelist of keys like that and instead allow all, or at least a more liberal subset.

The main request is to provide what Toolserver provided which is:

An anonimized view of the user_properties table with an extra column that shows the date (not the time) of when the associated user was last active (which allows consumers of the data to focus on (or reject) recently active users vs. abandoned user accounts).

The raw user_properties database in Wikimedia production contains:

- up_user: The user_id of the user, this must of course not be exposed
   because that would allow consumers of the data to know which preferences
   belong to the same user, and in addition, since user ids are already public,
   to also know which user it belongs to).

- up_property: The preference names (disablemail, gender, language, skin, gadget-Foo, gadget-Bar etc.)

- up_value: The value of the preference. For most preferences these are boolean values (0 of off, 1 for on). Others are one of a finite set of values (like "language" and "skin"). And then there are some that take free numerical input (like "rclimit" which can be any number, e.g. 10 or 500, for the number of results to show by default on Special:Search or Special:RecentChanges). And lastly, there are some that take free textual input (like nickname and fancysig). This last category is usually identifying to a person. So I could understand why one might keep these hidden (though Luis already said those two were fine). I imagine it's fine because they're public (people use their signature and nickname all the time), and anonymised (not connected to other preferences of that same user).

Of these, the anonimized view would expose:

- up_property
- up_value

This will allow tools to know how popular certain preferences are (e.g. How many people use the Monobook skin, How many people change their interface language to French on German Wikipedia). I think however that this view should not be restricted to certain preferences. Especially because:

 1) Extensions and core add new preferences all the time.

 2) Gadgets are enabled through preferences as well, so if we do use a
    whitelist, it should whitelist any preference starting with "gadget-*".

 3) BetaFeatures also comes and goes with new preferences.
    The case of BetaFeatures is interesting, because it actually tracks
    and exposes the usage of its own preferences publicly (doesn't even
    require making a Tool of some kind).


In addition to these two columns, it would be very useful to have some kind of indicator to interpret these values. Toolserver did this by adding a shadow column showing a trimmed part of the timestamp of when the associated user was last active (trimmed to only show the date in YYYY-MM-DD, not the exact time of day). With this, statistics about preferences are much more useful due to the large number of abandoned accounts polluting the results.

Here's a query showing which keys are currently exposed in Toolserver's  user_properties_anonym table:

krinkle at willow.toolserver.org$ sql nlwiki_p
mysql> SELECT DISTINCT(up_property) FROM user_properties_anonym;
+-----------------------------------------+
| up_property                             |
+-----------------------------------------+
| ajaxsearch                              |
| autopatrol                              |
| ccmeonemails                            |
| cols                                    |
| contextchars                            |
| contextlines                            |
| date                                    |
| diffonly                                |
| disablemail                             |
| disablesuggest                          |
| echo-email-format                       |
| echo-email-frequency                    |
| echo-show-alert                         |
| echo-subscriptions-email-article-linked |
| echo-subscriptions-email-edit-thank     |
| echo-subscriptions-email-mention        |
| echo-subscriptions-email-reverted       |
| echo-subscriptions-web-article-linked   |
| echo-subscriptions-web-edit-thank       |
| echo-subscriptions-web-mention          |
| echo-subscriptions-web-reverted         |
| editfont                                |
| editondblclick                          |
| editsection                             |
| editsectiononrightclick                 |
| editwidth                               |
| enotifminoredits                        |
| enotifrevealaddr                        |
| enotifusertalkpages                     |
| enotifwatchlistpages                    |
| extendwatchlist                         |
| externaldiff                            |
| externaleditor                          |
| fancysig                                |
| forceeditsummary                        |
| gadget-Achtergrondkleur                 |
| gadget-Berichtbalk                      |
| gadget-Diffkleuren                      |
| gadget-Extra_bewerkingsknoppen          |
| gadget-fixhoofdbetekenis                |
| gadget-Geavanceerde_bijdragen           |
| gadget-HideFundraiser                   |
| gadget-HotCat                           |
| gadget-IProject                         |
| gadget-Navigatie_popups                 |
| gadget-purgelink                        |
| gadget-ReferenceTooltips                |
| gadget-Twinkle                          |
| gadget-UTCLiveClock                     |
| gender                                  |
| hideminor                               |
| hidepatrolled                           |
| highlightbroken                         |
| hover                                   |
| imagesize                               |
| justify                                 |
| language                                |
| math                                    |
| minordefault                            |
| newpageshidepatrolled                   |
| nocache                                 |
| norollbackdiff                          |
| numberheadings                          |
| previewonfirst                          |
| previewontop                            |
| quickbar                                |
| rcdays                                  |
| rclimit                                 |
| rememberpassword                        |
| searcheverything                        |
| searchlimit                             |
| searchNs-1                              |
| showhiddencats                          |
| showjumplinks                           |
| shownumberswatching                     |
| showtoc                                 |
| showtoolbar                             |
| skin                                    |
| stubthreshold                           |
| thumbsize                               |
| timecorrection                          |
| underline                               |
| usebetatoolbar                          |
| usebetatoolbar-cgd                      |
| useeditwarning                          |
| uselivepreview                          |
| usenewrc                                |
| variant                                 |
| vector-noexperiments                    |
| visualeditor-enable                     |
| watchcreations                          |
| watchdefault                            |
| watchdeletion                           |
| watchlistdays                           |
| watchlisthideanons                      |
| watchlisthidebots                       |
| watchlisthideliu                        |
| watchlisthideminor                      |
| watchlisthideown                        |
| watchlisthidepatrolled                  |
| watchmoves                              |
| wllimit                                 |
+-----------------------------------------+
102 rows in set


Are there any preferences we would explicitly not want to expose?
Comment 14 Luis Villa (WMF Legal) 2014-03-13 23:03:57 UTC
Ah! Thanks for explaining more clearly, Krinkle. The first comment now makes a lot more sense to me :) 

One question on the anonymization - this data is still per-wiki, right? Would last-active dates at the day-level allow deanonymization for those users?
Comment 15 Liangent 2014-03-14 06:35:23 UTC
(In reply to Luis Villa (WMF Legal) from comment #14)
> Ah! Thanks for explaining more clearly, Krinkle. The first comment now makes
> a lot more sense to me :) 
> 
> One question on the anonymization - this data is still per-wiki, right?
> Would last-active dates at the day-level allow deanonymization for those
> users?

Those are two different requests: one table (view) containing user identifiable info (user id) with fewer properties includes, and another anonymized one with more properties.
Comment 16 Luis Villa (WMF Legal) 2014-03-14 18:57:06 UTC
Retitling then :) 

For the filtered table: I am OK with this in principle, but would like to better understand (1) how we choose which fields to filter and (2) how we prevent new fields from leaking.

For the anonymized table: Again, OK in principle; it would be good (but not a must-have) to know more about the impact on users of small wikis.
Comment 17 Krinkle 2014-03-17 00:58:19 UTC
So if I understand correctly, the whitelist of properties you approved in comment 12 are in fact allowed for identifiable queries (e.g. with user_id/user name attached to it), as opposed to the anonymized view. That's great news. I wasn't expecting a non-anonymized view of the properties table at all. Having one with this limited set of properties can be useful indeed.

And when you say "OK in principle" for the anonymized view, that means all properties?

Good point on the user activity measure being potentially identifying on small wikis. I suppose a month crop of the timestamp (YYYY-MM) would still be good enough for practical applications and be a lot less likely to reveal identifying information.

Though no matter how much we crop the timestamp (even if we'd crop it to the month or year), one could in theory still retrieve the preferences of one user by performing queries in a loop until the next day, month or year starts and correlate that to the one user currently editing on that wiki.. The only way I can think of that would mitigate that is forcing the current matching crop of the date being delayed (e.g. all users active today on 2014-03-17 would be listed under 2014-03-16 until this day is over, so that all active users have had a chance to be active that day).
Comment 18 Luis Villa (WMF Legal) 2014-03-18 00:19:19 UTC
(In reply to Krinkle from comment #17)
> So if I understand correctly, the whitelist of properties you approved in
> comment 12 are in fact allowed for identifiable queries (e.g. with
> user_id/user name attached to it), as opposed to the anonymized view.

Yes, that is correct, for those categories. 

As far as implementation, we'll want to review new fields as they are added, which is why I prefer a whitelist approach (review what is released; hard to accidentally release data accidentally) to a blacklist approach (review what is not released; easy to release data accidentally by not adding it to the blacklist).

> And when you say "OK in principle" for the anonymized view, that means all
> properties?

Yes, based on my understanding that they are already available on toolserver in user_properties_anonym and/or via the API. If they aren't already public (via API or toolserver) we should probably review one-by-one, and/or do a more careful scrutiny of the anonymization procedures. Maybe that deserves a separate bug?
Comment 19 Liangent 2014-03-18 01:04:23 UTC
(In reply to Luis Villa (WMF Legal) from comment #18)
> (In reply to Krinkle from comment #17)
> > So if I understand correctly, the whitelist of properties you approved in
> > comment 12 are in fact allowed for identifiable queries (e.g. with
> > user_id/user name attached to it), as opposed to the anonymized view.
> 
> Yes, that is correct, for those categories. 
> 
> As far as implementation, we'll want to review new fields as they are added,
> which is why I prefer a whitelist approach (review what is released; hard to
> accidentally release data accidentally) to a blacklist approach (review what
> is not released; easy to release data accidentally by not adding it to the
> blacklist).
> 
> > And when you say "OK in principle" for the anonymized view, that means all
> > properties?
> 
> Yes, based on my understanding that they are already available on toolserver
> in user_properties_anonym and/or via the API. If they aren't already public
> (via API or toolserver) we should probably review one-by-one, and/or do a
> more careful scrutiny of the anonymization procedures. Maybe that deserves a
> separate bug?

anonym on Toolserver is also whitelist based. I don't have access to the original list ts_global.user_properties_anonym_whilelist, but it's possible to sniff it with select distinct up_property from user_properties_anonym; (a slow query).
Comment 20 Luis Villa (WMF Legal) 2014-03-18 17:35:40 UTC
Do we know how that whitelist was created?
Comment 21 Krinkle 2014-03-19 05:19:07 UTC
Comparison for nlwiki.

nlwiki at wmf> DISTINCT(up_property) FROM user_properties
+--------------------------------------------+
| ajaxsearch
| autopatrol
| betafeatures-auto-enroll
| betafeatures-geonotahack
| betafeatures-vector-typography-update
| ccmeonemails
| cirrussearch-default
| cols
| contextchars
| contextlines
| date
| diffonly
| disablemail
| disablesuggest
| echo-email-format
| echo-email-frequency
| echo-email-last-batch
| echo-show-alert
| echo-subscriptions-email-article-linked
| echo-subscriptions-email-edit-thank
| echo-subscriptions-email-education-program
| echo-subscriptions-email-mention
| echo-subscriptions-email-reverted
| echo-subscriptions-web-article-linked
| echo-subscriptions-web-edit-thank
| echo-subscriptions-web-education-program
| echo-subscriptions-web-mention
| echo-subscriptions-web-reverted
| editfont
| editondblclick
| editsection
| editsectionondblclick
| editsectiononrightclick
| editwidth
| enotifminoredits
| enotifrevealaddr
| enotifusertalkpages
| enotifwatchlistpages
| ep_bulkdelcourses
| ep_bulkdelorgs
| ep_showdyk
| ep_showtoplink
| extendwatchlist
| externaldiff
| externaleditor
| fancysig
| forceeditsummary
| gadget-Achtergrondkleur
| gadget-Berichtbalk
| gadget-Diffkleuren
| gadget-Direct-link-to-Commons
| gadget-Externe_links
| gadget-Extra_bewerkingsknoppen
| gadget-Geavanceerde_bijdragen
| gadget-GoogleTrans
| gadget-HideFundraiser
| gadget-HotCat
| gadget-IProject
| gadget-LocalLiveClock
| gadget-Navigatie_popups
| gadget-OldDiff
| gadget-OpenStreetMapFrame
| gadget-OpenStreetMapFrameUit
| gadget-ProtectionTemplates
| gadget-ReferenceTooltips
| gadget-Sneltoetsen
| gadget-Twinkle
| gadget-UTCLiveClock
| gadget-Uitleg_verwijderlijst
| gadget-VectorSearchNav
| gadget-Verbergding
| gadget-WhatLeavesHere
| gadget-Zeusmodus
| gadget-fixhoofdbetekenis
| gadget-hoofdbetekenis-titelwijziging
| gadget-labelfilter
| gadget-mijnKladblok
| gadget-purgelink
| gadget-searchFocus
| gadget-updatemarker
| gadget-verwijderreden
| gender
| hideminor
| hidepatrolled
| highlightbroken
| hover
| imagesize
| justify
| language
| math
| mfWatchlistFilter
| mfWatchlistView
| minordefault
| moodbar-disable
| multimedia-viewer
| newpageshidepatrolled
| nickname
| nocache
| norollbackdiff
| numberheadings
| prefershttps
| previewonfirst
| previewontop
| quickbar
| rcdays
| rclimit
| rcshowwikidata
| rememberpassword
| rows
| searchNs-1
| searchNs0
| searchNs1
| searchNs10
| searchNs100
| searchNs101
| searchNs11
| searchNs12
| searchNs13
| searchNs14
| searchNs15
| searchNs2
| searchNs3
| searchNs4
| searchNs4294967295
| searchNs446
| searchNs447
| searchNs5
| searchNs6
| searchNs7
| searchNs8
| searchNs828
| searchNs829
| searchNs9
| searcheverything
| searchlimit
| showhiddencats
| showjumplinks
| shownumberswatching
| showtoc
| showtoolbar
| skin
| stubthreshold
| thumbsize
| timecorrection
| uls-compact-links
| uls-enable
| uls-preferences
| underline
| usebetatoolbar
| usebetatoolbar-cgd
| useeditwarning
| uselivepreview
| usenewrc
| userjs-already-set-common-preferences
| variant
| vector-collapsiblenav
| vector-noexperiments
| vector-simplesearch
| visualeditor-betatempdisable
| visualeditor-enable
| visualeditor-enable-mwmath
| watchcreations
| watchdefault
| watchdeletion
| watchlistdays
| watchlisthideanons
| watchlisthidebots
| watchlisthideliu
| watchlisthideminor
| watchlisthideown
| watchlisthidepatrolled
| watchlisttoken
| watchmoves
| wllimit
| wlshowwikibase
+--------------------------------------------+
176 rows in set (4.95 sec)

nlwiki_p at toolserver> DISTINCT(up_property) FROM user_properties_anonym
+-----------------------------------------+
| up_property
+-----------------------------------------+
| ajaxsearch
| autopatrol
| ccmeonemails
| cols
| contextchars
| contextlines
| date
| diffonly
| disablemail
| disablesuggest
| echo-email-format
| echo-email-frequency
| echo-show-alert
| echo-subscriptions-email-article-linked
| echo-subscriptions-email-edit-thank
| echo-subscriptions-email-mention
| echo-subscriptions-email-reverted
| echo-subscriptions-web-article-linked
| echo-subscriptions-web-edit-thank
| echo-subscriptions-web-mention
| echo-subscriptions-web-reverted
| editfont
| editondblclick
| editsection
| editsectiononrightclick
| editwidth
| enotifminoredits
| enotifrevealaddr
| enotifusertalkpages
| enotifwatchlistpages
| extendwatchlist
| externaldiff
| externaleditor
| fancysig
| forceeditsummary
| gadget-Achtergrondkleur
| gadget-Berichtbalk
| gadget-Diffkleuren
| gadget-Direct-link-to-Commons
| gadget-Externe_links
| gadget-Extra_bewerkingsknoppen
| gadget-fixhoofdbetekenis
| gadget-Geavanceerde_bijdragen
| gadget-GoogleTrans
| gadget-HideFundraiser
| gadget-hoofdbetekenis-titelwijziging
| gadget-HotCat
| gadget-IProject
| gadget-labelfilter
| gadget-LocalLiveClock
| gadget-mijnKladblok
| gadget-Navigatie_popups
| gadget-OldDiff
| gadget-OpenStreetMapFrame
| gadget-OpenStreetMapFrameUit
| gadget-ProtectionTemplates
| gadget-purgelink
| gadget-ReferenceTooltips
| gadget-searchFocus
| gadget-Sneltoetsen
| gadget-Twinkle
| gadget-Uitleg_verwijderlijst
| gadget-updatemarker
| gadget-UTCLiveClock
| gadget-VectorSearchNav
| gadget-Verbergding
| gadget-verwijderreden
| gadget-WhatLeavesHere
| gadget-Zeusmodus
| gender
| hideminor
| hidepatrolled
| highlightbroken
| hover
| imagesize
| justify
| language
| math
| minordefault
| newpageshidepatrolled
| nocache
| norollbackdiff
| numberheadings
| previewonfirst
| previewontop
| quickbar
| rcdays
| rclimit
| rememberpassword
| searcheverything
| searchlimit
| searchNs-1
| showhiddencats
| showjumplinks
| shownumberswatching
| showtoc
| showtoolbar
| skin
| stubthreshold
| thumbsize
| timecorrection
| underline
| usebetatoolbar
| usebetatoolbar-cgd
| useeditwarning
| uselivepreview
| usenewrc
| variant
| vector-noexperiments
| visualeditor-enable
| watchcreations
| watchdefault
| watchdeletion
| watchlistdays
| watchlisthideanons
| watchlisthidebots
| watchlisthideliu
| watchlisthideminor
| watchlisthideown
| watchlisthidepatrolled
| watchmoves
| wllimit
+-----------------------------------------+
122 rows in set (51.05 sec)

INTERSECT: 

| ajaxsearch
| autopatrol
| ccmeonemails
| cols
| contextchars
| contextlines
| date
| diffonly
| disablemail
| disablesuggest
| echo-email-format
| echo-email-frequency
| echo-show-alert
| echo-subscriptions-email-article-linked
| echo-subscriptions-email-edit-thank
| echo-subscriptions-email-mention
| echo-subscriptions-email-reverted
| echo-subscriptions-web-article-linked
| echo-subscriptions-web-edit-thank
| echo-subscriptions-web-mention
| echo-subscriptions-web-reverted
| editfont
| editondblclick
| editsection
| editsectiononrightclick
| editwidth
| enotifminoredits
| enotifrevealaddr
| enotifusertalkpages
| enotifwatchlistpages
| extendwatchlist
| externaldiff
| externaleditor
| fancysig
| forceeditsummary
| gadget-Achtergrondkleur
| gadget-Berichtbalk
| gadget-Diffkleuren
| gadget-Direct-link-to-Commons
| gadget-Externe_links
| gadget-Extra_bewerkingsknoppen
| gadget-Geavanceerde_bijdragen
| gadget-GoogleTrans
| gadget-HideFundraiser
| gadget-HotCat
| gadget-IProject
| gadget-LocalLiveClock
| gadget-Navigatie_popups
| gadget-OldDiff
| gadget-OpenStreetMapFrame
| gadget-OpenStreetMapFrameUit
| gadget-ProtectionTemplates
| gadget-ReferenceTooltips
| gadget-Sneltoetsen
| gadget-Twinkle
| gadget-UTCLiveClock
| gadget-Uitleg_verwijderlijst
| gadget-VectorSearchNav
| gadget-Verbergding
| gadget-WhatLeavesHere
| gadget-Zeusmodus
| gadget-fixhoofdbetekenis
| gadget-hoofdbetekenis-titelwijziging
| gadget-labelfilter
| gadget-mijnKladblok
| gadget-purgelink
| gadget-searchFocus
| gadget-updatemarker
| gadget-verwijderreden
| gender
| hideminor
| hidepatrolled
| highlightbroken
| hover
| imagesize
| justify
| language
| math
| minordefault
| newpageshidepatrolled
| nocache
| norollbackdiff
| numberheadings
| previewonfirst
| previewontop
| quickbar
| rcdays
| rclimit
| rememberpassword
| searchNs-1
| searcheverything
| searchlimit
| showhiddencats
| showjumplinks
| shownumberswatching
| showtoc
| showtoolbar
| skin
| stubthreshold
| thumbsize
| timecorrection
| underline
| usebetatoolbar
| usebetatoolbar-cgd
| useeditwarning
| uselivepreview
| usenewrc
| variant
| vector-noexperiments
| visualeditor-enable
| watchcreations
| watchdefault
| watchdeletion
| watchlistdays
| watchlisthideanons
| watchlisthidebots
| watchlisthideliu
| watchlisthideminor
| watchlisthideown
| watchlisthidepatrolled
| watchmoves
| wllimit

DIFF 1: 
| betafeatures-auto-enroll
| betafeatures-geonotahack
| betafeatures-vector-typography-update
| cirrussearch-default
| echo-email-last-batch
| echo-subscriptions-email-education-program
| echo-subscriptions-web-education-program
| editsectionondblclick
| ep_bulkdelcourses
| ep_bulkdelorgs
| ep_showdyk
| ep_showtoplink
| mfWatchlistFilter
| mfWatchlistView
| moodbar-disable
| multimedia-viewer
| nickname
| prefershttps
| rcshowwikidata
| rows
| searchNs0
| searchNs1
| searchNs10
| searchNs100
| searchNs101
| searchNs11
| searchNs12
| searchNs13
| searchNs14
| searchNs15
| searchNs2
| searchNs3
| searchNs4
| searchNs4294967295
| searchNs446
| searchNs447
| searchNs5
| searchNs6
| searchNs7
| searchNs8
| searchNs828
| searchNs829
| searchNs9
| uls-compact-links
| uls-enable
| uls-preferences
| userjs-already-set-common-preferences
| vector-collapsiblenav
| vector-simplesearch
| visualeditor-betatempdisable
| visualeditor-enable-mwmath
| watchlisttoken
| wlshowwikibase

DIFF 2: 

(none)
Comment 22 Luis Villa (WMF Legal) 2014-03-24 16:24:47 UTC
To confirm: INTERSECT here is the whitelist; DIFF 1 is properties that never made it onto the whitelist?
Comment 23 Krinkle 2014-03-27 20:17:11 UTC
(In reply to Luis Villa (WMF Legal) from comment #22)
> To confirm: INTERSECT here is the whitelist; DIFF 1 is properties that never
> made it onto the whitelist?

That is correct.

INTERSECT are properties present in both Toolserver's anonymized view and in wikimedia production databases.

DIFF 1 are properties present in production and not visible in Toolserver anonymized view.

DIFF 2 are properties only present in Toolserver's anonymized view (none).


The Toolserver's non-anonimized filtered view has a much stricter whitelist:

nlwiki_p at toolserver> SELECT DISTINCT(up_property) FROM user_properties;
+----------------+
| up_property    |
+----------------+
| disablemail    |
| fancysig       |
| gender         |
| language       |
| nickname       |
| skin           |
| timecorrection |
| variant        |
+----------------+
8 rows in set
Comment 24 Silke Meyer (WMDE) 2014-03-28 09:16:04 UTC
Hello! What's the status on this?
Comment 25 Nemo 2014-04-04 13:05:15 UTC
*** Bug 56491 has been marked as a duplicate of this bug. ***
Comment 26 Nemo 2014-04-04 13:07:16 UTC
(In reply to silke.meyer from comment #24)
> Hello! What's the status on this?

No movement in 4 months: resetting status.
Comment 27 Marc A. Pelletier 2014-04-04 13:09:36 UTC
There's no reason to change the status of the bug, it remains pending an all-clear from Legal and a short implementation away.
Comment 28 Nemo 2014-04-04 13:22:54 UTC
ASSIGNED != pending
Comment 29 Nemo 2014-04-08 21:09:04 UTC
FYI, a "ghost infrastructure" makes this data available to select users, per http://lists.wikimedia.org/pipermail/analytics/2014-April/001768.html : folks should/can ask [[wikitech:EventLogging]] access if they need it.
Comment 30 Luis Villa (WMF Legal) 2014-04-18 22:43:39 UTC
I'm cloning this bug so that we can have two discussions - one about the anonymized data, one about the filtered data. The discussion about a filtered view (with the handful of properties described in the original bug description and in comment #23) is now in bug 64115. I will continue the other discussion in my next comment.
Comment 31 Luis Villa (WMF Legal) 2014-04-18 23:00:31 UTC
So, let's keep this bug focused on the question of the "anonymized" table; i.e., the table that currently has most (but not all) rows, with userID removed but last active timestamp available. 

The current proposal is to provide this table on labs. It has also been suggested that:

1) we remove the whitelist altogether and provide everything; and/or
2) we sample down the timestamp to YYYY-MM so that it is more difficult to map "last active" to editors on small wikis.

I still have a few concerns:

1) I still don't know how we created the original whitelist, or how we plan to create the new one. (Note that if we trust our anonymization, having a whitelist probably isn't necessary.)

2) I'm still concerned about the impact on anonymity of small wikis. Rounding to months helps, but could still be problematic in at least some cases (as Krinkle pointed out).

I do not have a great solution to these questions. Trying to brainstorm/think out loud here: maybe we could provide aggregate totals for all editors, for active editors, and for new editors (based on the standard analytics definition of active editors/new editors), instead of access to individually anonymized rows?
Comment 32 Liangent 2014-04-19 00:51:11 UTC
(In reply to Luis Villa (WMF Legal) from comment #31)
> So, let's keep this bug focused on the question of the "anonymized" table;
> i.e., the table that currently has most (but not all) rows, with userID
> removed but last active timestamp available. 
> 
> The current proposal is to provide this table on labs. It has also been
> suggested that:
> 
> 1) we remove the whitelist altogether and provide everything; and/or
> 2) we sample down the timestamp to YYYY-MM so that it is more difficult to
> map "last active" to editors on small wikis.
> 
> I still have a few concerns:
> 
> 1) I still don't know how we created the original whitelist, or how we plan
> to create the new one. (Note that if we trust our anonymization, having a
> whitelist probably isn't necessary.)

Note that some user_properties rows are not actually preferences. eg. watchlisttoken.

> 
> 2) I'm still concerned about the impact on anonymity of small wikis.
> Rounding to months helps, but could still be problematic in at least some
> cases (as Krinkle pointed out).

So ... wikis are already grouped as small, medium and large, then YYYYMMDD in large, YYYYMM in medium and YYYY in small wikis?

https://noc.wikimedia.org/conf/highlight.php?file=large.dblist
https://noc.wikimedia.org/conf/highlight.php?file=medium.dblist
https://noc.wikimedia.org/conf/highlight.php?file=small.dblist

> I do not have a great solution to these questions. Trying to
> brainstorm/think out loud here: maybe we could provide aggregate totals for
> all editors, for active editors, and for new editors (based on the standard
> analytics definition of active editors/new editors), instead of access to
> individually anonymized rows?
Comment 33 Tisza Gergő 2014-04-30 23:13:11 UTC
Is the order of rows randomized? Being able to correlate other preferences with, say, fancysig based on row number would be bad.

As for timestamp anonimization, showing the (truncated) date of the last edit of the user which did not happen in the current day(/month/year) should be pretty straightforward.
Comment 34 Luis Villa (WMF Legal) 2014-05-01 00:14:51 UTC
Sorry for the slow response, Liangent - email slipped through the cracks for some reason. Q: what's the criteria for small/medium/large? And how often is that updated? 

And Gergo's q is a good one too.
Comment 35 Tisza Gergő 2014-05-01 01:00:49 UTC
(In reply to Tisza Gergő from comment #33)
> As for timestamp anonimization, showing the (truncated) date of the last
> edit of the user which did not happen in the current day(/month/year) should
> be pretty straightforward.

Nevermind, I see this comes from user_touched, not the edit log, and we do not keep a past record of that.

Maybe instead of a live view, this table could be generated by a query which runs at the end of the day; that way no information is leaked by the timing of the change. Although that still does not solve the issue for small wikis since we probably want to run it more often than once a month.
Comment 36 Liangent 2014-05-04 13:29:59 UTC
(In reply to Luis Villa (WMF Legal) from comment #34)
> Sorry for the slow response, Liangent - email slipped through the cracks for
> some reason. Q: what's the criteria for small/medium/large? And how often is
> that updated? 

I have no idea about the criteria, but I guess generally wikis only grow larger, and if the list is not updated in time, we're just anonymizing more, not less.
Comment 37 Nemo 2014-05-04 17:08:40 UTC
Small/medium/big wikis have 10/100/1000 thousands total pages or more. You can see the lists at https://noc.wikimedia.org/conf/ The lists were originally made as rough indication of how expensive certain heavy database operations are (you can easily find the bugs in question).
Comment 38 Krinkle 2014-05-04 22:54:22 UTC
Changing the timestamp in this table based on wiki size would be a mistake in my opinion. It would make the data too arbitrary and hard to maintain. It also probably won't scale well into the MySQL mapping and relevant configuration.

The reason we don't want to provide the full timestamp is because it exposes rather detailed information that would be trivial to map to an individual on most wikis (even large wikis) based on when an edit has happened.

Shortening this to a month minimised this ability greatly, reducing it to only being able to tell that those with the current month as their last-active timestamp (YYYY-MM) are one of the users listed on Special:ActiveUsers.

In theory this could still be exploited by being a patient person and waiting for the moment one month has ended. Then the first user to become active in the new month would be the only person with a newer month timestamp.

For this particular exploit it is irrelevant how much we crop the timestamp (year, month, date, it's all the same, it only reduces how often one can make such attempt and with that the usefulness the data in the first place).

If we consider that possibility important enough to warrant a solution, then that solution will work equally for year, month and day. I don't think there is any reason to trim it further than month, nor is there any reason to bring small/medium/large into the balance.
Comment 39 Tim Landscheidt 2014-05-05 01:13:36 UTC
(In reply to Krinkle from comment #38)
> [...]

> In theory this could still be exploited by being a patient person and
> waiting for the moment one month has ended. Then the first user to become
> active in the new month would be the only person with a newer month
> timestamp.

> [...]

Could this be mitigated by excluding data for the current month?
Comment 40 Krinkle 2014-05-05 02:57:35 UTC
(In reply to Tim Landscheidt from comment #39)
> (In reply to Krinkle from comment #38)
> > [...]
> 
> > In theory this could still be exploited by being a patient person and
> > waiting for the moment one month has ended. Then the first user to become
> > active in the new month would be the only person with a newer month
> > timestamp.
> 
> > [...]
> 
> Could this be mitigated by excluding data for the current month?

Essentially, yes. See my comment 17:

(citing comment #17 from Krinkle)
> [..] no matter how much we crop the timestamp (even if we'd crop it to the
> month or year), one could in theory still retrieve the preferences of one
> user by performing queries in a loop until the next day, month or year
> starts and correlate that to the one user currently editing on that wiki..
>
> The only way I can think of that would mitigate that is forcing the current
> matching crop of the date being delayed (e.g. all users active today on
> 2014-03-17 would be listed under 2014-03-16 until this day is over, so that
> all active users have had a chance to be active that day).

However "excluding data" isn't trivial. It'll make statistics and sum queries act very counter intuitive. If you'd exclude all rows where timestamp == current YYYY-MM, that would essentially hide all values of active users, and make them re-appear next month and back and forth like that.

To avoid strange statistical variance like that, I'd recommend making timestamp of the currently running month appear as last month's. That way no data is excluded. This seems like a simple enough variance that tools should be able to work with without much hassle.
Comment 41 Tisza Gergő 2014-05-05 04:03:26 UTC
> To avoid strange statistical variance like that, I'd recommend making timestamp of the currently running month appear as last month's.

I don't see how that would help (other than making the window of opportunity smaller). You could still poll the data on the first day of the month to get an accurate edit timestamp. Since the attack is based on reading *when* the timestamp changes and correlating that with edit times, *what* the timestamp changes to is irrelevant. If you want users of the table to have a precision of T, you could protect users who edit more often than T by setting the timestamps to some deterministically chosen point within T so that changes within that interval are suppressed, but information would still leak about users who edit less often. For T = 1 day, that's hardly acceptable, even on large wikis.

We could randomize the last few bits of the date, that would make it hard to tell when it changes, but it would still be doable via statistical means... really the only reliable way to mitigate this is to use a temporary table instead of a view, and only update it once a day/week/whatever so that the timing of the field changing becomes completely disconnected from the timing of the edit.
Comment 42 Luis Villa (WMF Legal) 2014-05-07 18:16:42 UTC
(In reply to Tisza Gergő from comment #35)
> Maybe instead of a live view, this table could be generated by a query which
> runs at the end of the day; that way no information is leaked by the timing
> of the change. Although that still does not solve the issue for small wikis
> since we probably want to run it more often than once a month.

If we only provide one month granularity in the answer, then it can be run only once a month, right? Sounds like the use cases here don't need precise/fine-grained answers to this question?
Comment 43 Eran Roz 2014-06-04 11:40:15 UTC
What is the current status of the bug?
It becomes very importantly to fix it in the very soon future, as the toolserver is going to be down in 3 weeks

http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-June/000722.html
Comment 44 Helder 2014-06-04 13:27:53 UTC
Changing to major per previous comment.
Comment 45 Luis Villa (WMF Legal) 2014-06-04 18:35:30 UTC
Current status of the bug is "questions have been posed about privacy concerns". It would be helpful to have feedback from people who use the field; in particular, my question in comment 42 and response to Krinkle's last statement in comment 40 would be good.
Comment 46 Eran Roz 2014-06-04 22:11:55 UTC
Hi Luis, thanks for the update.

(In reply to Krinkle from comment #40)
> 
> To avoid strange statistical variance like that, I'd recommend making
> timestamp of the currently running month appear as last month's. That way no
> data is excluded. This seems like a simple enough variance that tools should
> be able to work with without much hassle.

I think it is good solution. The timestamp is used mainly for filtering active vs non-active users.


(In reply to Luis Villa (WMF Legal) from comment #42)
> (In reply to Tisza Gergő from comment #35)
> > Maybe instead of a live view, this table could be generated by a query which
> > runs at the end of the day; that way no information is leaked by the timing
> > of the change. Although that still does not solve the issue for small wikis
> > since we probably want to run it more often than once a month.
> 
> If we only provide one month granularity in the answer, then it can be run
> only once a month, right? Sounds like the use cases here don't need
> precise/fine-grained answers to this question?

*My use case: the user_properties_anonym table is used for generating[[meta:Gadgets]] - statistics on the most popular gadgets accross Wikimedia projects.
* (other example for use case) User preferences statistics are provided by MZMcBride in [[en:Wikipedia:Database reports/User preferences]]
So I belive one month resolution is fine for statistics use cases, and running it once a month could be enougth.
As Krinkle wrote above, month resolution (YYYY-MM) make it hard to associate it to specific user.


However for very very small wikis I think it isn't possible to provide such statistics: refering to http://stats.wikimedia.org/EN/TablesWikipediansEditsGt5.htm (Wikipedias by active users) some languages such as Tsonga or Oromo have only one active user. I think there should be a threshold on the number of active users in a wiki.
Comment 47 Nemo 2014-06-05 05:59:03 UTC
Active editors are not the only users changing preferences on a wiki, there are also many global users changing preferences everywhere (or in many places); only user_touched knows that.
Comment 48 Silke Meyer (WMDE) 2014-06-11 16:34:54 UTC
Hi Luis, do the comments 46 and 47 provide the answers you need? If not, who can help? We are ... sort of in a hurry by now.
Comment 49 Silke Meyer (WMDE) 2014-06-12 07:26:28 UTC
Summary of what was discussed during office hour (https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/20140611.txt, from [17:17:25] on):

* This will probably take longer for WMF Legal, so it's unlikely that it will be done before Toolserver shutdown.
* As far as we know, "live" tools depended on #64115 (filtered view) which has been resolved. But no live tools seem to depend on this one. Please correct me, if I am wrong!

Still, it would be interesting to get an update from Luis. :)
Comment 50 Nemo 2014-06-12 08:07:04 UTC
(In reply to Silke Meyer (WMDE) from comment #49)
> * As far as we know, "live" tools depended on #64115 (filtered view) which
> has been resolved. But no live tools seem to depend on this one. Please
> correct me, if I am wrong!

http://toolserver.org/~liangent/gadget_usage/
Comment 51 Luis Villa (WMF Legal) 2014-06-12 19:14:25 UTC
If the users are OK with the month-filtered view, and exclusion of small wikis, that seems like a workable compromise to me.
Comment 52 Liangent 2014-06-13 16:22:57 UTC
(In reply to Luis Villa (WMF Legal) from comment #51)
> If the users are OK with the month-filtered view, and exclusion of small
> wikis, that seems like a workable compromise to me.

Well to speed up the process and keep compatibility at the same time, it would be better to add the view now, with YYYYMMDD rewritten as YYYYMM01 on larger wikis, and as 19700101 on small wikis, and we can keep discussing afterwards...

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links