Last modified: 2014-06-11 20:50:44 UTC
Given the usefulness of having metadata, including the SHA, for deleted files, and given that it is available on the toolserver, it should be exposed on labs.
That information is not available to normal users on the project, and therefore requires an okay by Legal to clear. Toolserver had imperfectly sanitized replication, and there were quite a few things available there that never should have been without clearance. :-) Adding Luis to the bug so that they can opine.
Is https://www.mediawiki.org/wiki/Manual:Filearchive_table the best place to figure out what is actually in the relevant table? And do we want all fields or just some?
I would prefer as much as possible; the only field that should contain sensitive information is fa_description.
The current toolserver view seems to be everything but fa_description and fa_sha1.
* fa_description should be left out as it might contain private info
* fa_sha1 is quite recent (1.21) so I think we just never added it at the Toolserver
mysql> describe filearchive;
+----------------------+--------------------------------------------------------------------------------------------------------+------+-----+---------+-------+
| Field                | Type                                                                                                   | Null | Key | Default | Extra |
+----------------------+--------------------------------------------------------------------------------------------------------+------+-----+---------+-------+
| fa_id                | int(11)                                                                                                | NO   |     | 0       |       |
| fa_name              | varbinary(255)                                                                                         | NO   |     |         |       |
| fa_archive_name      | varbinary(255)                                                                                         | YES  |     |         |       |
| fa_storage_group     | varbinary(16)                                                                                          | YES  |     | NULL    |       |
| fa_storage_key       | varbinary(64)                                                                                          | YES  |     |         |       |
| fa_deleted_user      | int(11)                                                                                                | YES  |     | NULL    |       |
| fa_deleted_timestamp | varbinary(14)                                                                                          | YES  |     |         |       |
| fa_deleted_reason    | blob                                                                                                   | YES  |     | NULL    |       |
| fa_size              | int(8) unsigned                                                                                        | YES  |     | 0       |       |
| fa_width             | int(5)                                                                                                 | YES  |     | 0       |       |
| fa_height            | int(5)                                                                                                 | YES  |     | 0       |       |
| fa_metadata          | mediumblob                                                                                             | YES  |     | NULL    |       |
| fa_bits              | int(3)                                                                                                 | YES  |     | 0       |       |
| fa_media_type        | enum('UNKNOWN','BITMAP','DRAWING','AUDIO','VIDEO','MULTIMEDIA','OFFICE','TEXT','EXECUTABLE','ARCHIVE') | YES  |     | NULL    |       |
| fa_major_mime        | enum('unknown','application','audio','image','text','video','message','model','multipart')             | YES  |     | unknown |       |
| fa_minor_mime        | varbinary(32)                                                                                          | YES  |     | unknown |       |
| fa_user              | int(5) unsigned                                                                                        | YES  |     | 0       |       |
| fa_user_text         | varbinary(255)                                                                                         | YES  |     |         |       |
| fa_timestamp         | varbinary(14)                                                                                          | YES  |     |         |       |
| fa_deleted           | tinyint(1) unsigned                                                                                    | NO   |     | 0       |       |
+----------------------+--------------------------------------------------------------------------------------------------------+------+-----+---------+-------+
20 rows in set (0.00 sec)
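
For illustration only, here is a rough sketch of the kind of sanitized view being discussed: everything from the table except fa_description, with fa_sha1 included. It uses Python's built-in sqlite3 as a stand-in for the real MySQL replicas, trims the table to a handful of columns, and the view name filearchive_public is made up for this example. It is not how the labs views are actually defined.

```python
import sqlite3

# In-memory SQLite database standing in for a replica (assumption: the real
# setup is MySQL; only a few filearchive columns are reproduced here).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE filearchive (
        fa_id          INTEGER PRIMARY KEY,
        fa_name        TEXT NOT NULL,
        fa_description TEXT,
        fa_sha1        TEXT,
        fa_deleted     INTEGER NOT NULL DEFAULT 0
    );

    -- Hypothetical view name: exposes everything except fa_description.
    CREATE VIEW filearchive_public AS
        SELECT fa_id, fa_name, fa_sha1, fa_deleted
        FROM filearchive;
    """
)

conn.execute(
    "INSERT INTO filearchive (fa_name, fa_description, fa_sha1) VALUES (?, ?, ?)",
    ("Example.jpg", "possibly private text", "examplehash"),
)

# Column names visible through the view.
cursor = conn.execute("SELECT * FROM filearchive_public")
public_columns = [col[0] for col in cursor.description]
public_rows = cursor.fetchall()
```

Tools would then query the view rather than the underlying table, so whatever was put in fa_description never reaches them.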
I know we've seen crazy things be put in filenames before - is that oversightable? Otherwise, agree that fa_sha1 should not be problematic.
Oversight no longer exists, but pretty much anything can be rev_del'ed if that is what you are referring to. However I have never seen a case of a file name being problematic.
I think it was James who told me that there have been crazy file names in the past, but that may be a fever dream - James? With regard to fa_description: is that normally publicly visible? I.e., would sensitive information in it be rev_del'd as part of normal site moderation/oversight? Because with other sensitive fields, one option is to simply respect revdel and keep it from being propagated.
Blocks toolserver migration.
There is, IMO, a plausible issue with the SHA but I don't know whether it is relevant for legal: its primary use case is (of course) to note files which have been previously uploaded then deleted, but it therefore necessarily allows any third party to determine whether any specific file they have the hash to has been uploaded in the past. Could this be used by, say, a government agency to find who uploaded some files that they were displeased with?
Can't they already do that by simply uploading the file instead of the SHA?
At best they could tell that some file with the same /name/ existed; the SHA will confirm content. AFAIK, uploading doesn't check against deleted files' SHAs.
(In reply to Marc A. Pelletier from comment #11)
> At best they could tell that some file with the same /name/ existed; the SHA
> will confirm content. AFAIK, uploading doesn't check against deleted files'
> SHAs.
I may be wrong, but I believe it does (and tells you that the same file is uploaded at X), and I 'think' that one was deleted before, though I'd have to double check that.
*** This bug has been marked as a duplicate of bug 57697 ***
(In reply to Marc A. Pelletier from comment #11)
> AFAIK, uploading doesn't check against deleted files' SHAs.
It does. And it tells you the title. From the title, look up the (public) logs and you have that user.
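
To make the hash concern raised earlier in this thread concrete, here is a minimal sketch of how someone holding a copy of a file could test it against a list of exposed fa_sha1 values. It assumes the column stores the SHA-1 digest in base 36, zero-padded to 31 characters, which is my understanding of MediaWiki's format and is not stated anywhere in this bug. The function names and the known_hashes set are invented for the example.

```python
import hashlib

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(value: int, width: int = 31) -> str:
    """Encode a non-negative integer in base 36, left-padded with zeros."""
    chars = []
    while value:
        value, remainder = divmod(value, 36)
        chars.append(_DIGITS[remainder])
    return "".join(reversed(chars)).rjust(width, "0")


def fa_sha1_for(content: bytes) -> str:
    """SHA-1 of the file content in the assumed fa_sha1 encoding."""
    return to_base36(int(hashlib.sha1(content).hexdigest(), 16))


def was_uploaded(content: bytes, known_hashes: set) -> bool:
    """True if the content's hash appears among the exposed hashes."""
    return fa_sha1_for(content) in known_hashes


# Hypothetical data: hashes of two deleted files, as a tool might read them.
known_hashes = {fa_sha1_for(b"deleted file A"), fa_sha1_for(b"deleted file B")}
match = was_uploaded(b"deleted file A", known_hashes)
no_match = was_uploaded(b"some other file", known_hashes)
```

The point is only that the check needs nothing beyond the file itself and read access to the column; whether that matters depends on the fact, noted above, that the upload form already reveals the same match.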