Last modified: 2014-06-11 17:33:24 UTC
Original bug title: Make limited information from filearchive available to everyone

Reasoning: When it comes to identifying copyright violations and [[WP:Sock puppetry]], it is very helpful to be able to check whether a file has been uploaded before without uploading the file into the stash yourself.

Demand: title and size, filterable by SHA-1 ( fasha1=HEXHASH&faprop=title|size )

What about privacy? Not an issue. If you upload a file to the stash, you are able to obtain this information anyway.
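To make the request concrete, here is a minimal sketch of how a client might build such a lookup URL. The parameter names fasha1 and faprop and the values title|size are taken from the demand above. The endpoint URL, the function name build_filearchive_query and the use of format=json are assumptions for illustration, and whether the deployed API accepts "title" as an faprop value is not confirmed here.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; any MediaWiki api.php would be addressed the same way.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_filearchive_query(sha1_hex, props=("title", "size")):
    """Return a URL that asks list=filearchive for entries matching a SHA-1.

    build_filearchive_query is a hypothetical helper, not part of any library.
    """
    sha1_hex = sha1_hex.strip().lower()
    # A hex-encoded SHA-1 is always 40 hexadecimal characters.
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("expected a 40-character hexadecimal SHA-1")
    params = {
        "action": "query",
        "list": "filearchive",
        "format": "json",
        "fasha1": sha1_hex,          # filter named in the request
        "faprop": "|".join(props),   # properties named in the request
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

The sketch only builds the URL and validates the hash locally; sending the request and handling permissions are left out on purpose.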
Dumb question: what's the use case for this? (Ideally I'd also like to understand the use case for the existing functionality as well, but one thing at a time...)
(In reply to comment #1)
> Dumb question: what's the use case for this?

See Reasoning. Let me give you 3 examples:

1. User uploads copyright violation. Patroller marks file for deletion. Admin deletes file. User uploads same file again. Patroller can now look up by SHA-1 whether the same file existed before at https://commons.wikimedia.org/w/index.php?title=Commons:User_scripts/File_Analyzer&withJS=MediaWiki:FileAnalyzer.js and identify the user(s) who uploaded that file.
2. Bot coder and bot are not administrators. Bot uploads a batch of very large files. But some were previously deleted and should not be uploaded again. Bot could check the SHA-1 before uploading to save bandwidth.
3. File is marked for transfer from en.wikipedia to Commons. Bot/Tool could check whether this file was previously deleted at Commons and refuse the transfer.
...

Please let me know if this was convincing enough or whether you would like to get more feedback from Commons users. Or are you asking for a technical explanation of SHA-1 and that kind of stuff? Sorry, here at Bugzilla, it's always a bit difficult to get it right because I never know to whom I am talking without googling.
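The second example (a bot checking hashes before uploading) could be sketched roughly as follows. The SHA-1 is computed in chunks so that very large files never need to fit in memory. The helper names sha1_hex_of_stream and split_batch are hypothetical, and the set of previously deleted hashes is assumed to have been obtained beforehand, for instance from the lookup this bug asks for.

```python
import hashlib
import io


def sha1_hex_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def split_batch(files, deleted_hashes):
    """Split a batch into names to upload and names to skip.

    files: mapping of file name -> readable binary stream.
    deleted_hashes: set of hex SHA-1 values known to belong to deleted files
    (assumed input; how it is fetched is outside this sketch).
    """
    to_upload, to_skip = [], []
    for name, stream in files.items():
        if sha1_hex_of_stream(stream) in deleted_hashes:
            to_skip.append(name)    # identical content was deleted before
        else:
            to_upload.append(name)
    return to_upload, to_skip


batch = {"a.jpg": io.BytesIO(b"abc"), "b.jpg": io.BytesIO(b"xyz")}
known_deleted = {"a9993e364706816aba3e25717850c26c9cd0d89d"}  # SHA-1 of b"abc"
print(split_batch(batch, known_deleted))
```

Only the hash leaves the bot's machine in this scheme, which is where the bandwidth saving comes from.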
(In reply to comment #2)
> I never know to whom I am talking without googling.

I've bookmarked https://wikimediafoundation.org/wiki/Staff?showall=1 for that :)
Examples were perfect, thanks; I understand the use case much better now. I'm fine with this from a privacy perspective, as long as it respects suppression of titles (which should also be respected if you do a full file upload. I understand that isn't currently the case and have filed bug 59167 for that.)

[Also, I've tweaked my settings to say a little bit about who I am, hope that helps (though I suppose that might make you *more* likely to explain SHA-1, which I definitely don't need!)]
From a (non-sysop) bot writing perspective, it would be great to be able to get an array of previous deletions for a queried SHA-1. At the moment pywikipediabot passes back the name of one matching file, but not all matches.

I suggest that the deleted file names are passed back (incredibly useful info when these contain reference numbers from the original source, such as Flickr photo ids) *unless* there was a reason to suppress the filename from the deletion log. Other basic information (dates, uploader, editors) would be great for a bot to take action on, or make decisions about. Scenarios include a bot taking different actions based on whether it sees its own name as a past uploader, or whether upload dates fall within the dates of a recent batch upload project.

There may be privacy issues with some data elements (such as listing all past editors or uploaders). However, I think we should expect to be able to automatically distinguish between ordinary deleted material (such as copyvios) and files which were deleted due to respect/privacy concerns.
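As an illustration of the bot scenarios described here, the following sketch picks an action from a list of past deletions for one SHA-1. Everything about the record shape is an assumption: the keys "user", "timestamp" and "filehidden", the ISO 8601 timestamp format, the function name choose_action and the action labels are all hypothetical and not confirmed by this thread.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse an assumed 'YYYY-MM-DDTHH:MM:SSZ' timestamp into an aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def choose_action(matches, bot_name, batch_start, batch_end):
    """Decide what a bot should do given past deletions of the same content.

    matches: list of dicts with assumed keys 'user', 'timestamp', 'filehidden'.
    """
    if not matches:
        return "upload"               # content was never deleted before
    if any(m.get("filehidden") for m in matches):
        return "skip-suppressed"      # be conservative around privacy deletions
    for m in matches:
        ts = m.get("timestamp")
        if m.get("user") == bot_name and ts is not None:
            if batch_start <= parse_ts(ts) <= batch_end:
                return "skip-own-duplicate"   # the bot's own recent batch upload
    return "flag-for-review"          # ordinary past deletion, let a human decide


start = datetime(2014, 1, 1, tzinfo=timezone.utc)
end = datetime(2014, 1, 31, tzinfo=timezone.utc)
past = [{"user": "ExampleBot", "timestamp": "2014-01-05T10:00:00Z"}]
print(choose_action(past, "ExampleBot", start, end))
```

Treating any suppressed entry as a hard stop is one possible policy; a real bot might distinguish further once the API exposes why a file was deleted.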
Not related to 58791, removing dependency.