Last modified: 2014-01-03 15:46:40 UTC
This issue was converted from https://jira.toolserver.org/browse/DBQ-65. Summary: Find most common hashes for Commons' deleted images Issue type: Task - A task that needs to be done. Priority: Major Status: Done Assignee: Mr.Z-man <mrzmanwikimail@gmail.com> ------------------------------------------------------------------------------- From: Mike.lifeguard <mike.lifeguard+ts@gmail.com> Date: Tue, 16 Jun 2009 18:31:09 ------------------------------------------------------------------------------- I'd like to have a list of the most common sha1 hashes for Commons' deleted images. Specifically, I'm interested in vandalistic images, so the query could be limited to a) those with \b(?:vandal(?:ism)?|goatse|shock image|...)\b in the delete reason and/or b) the last ~10,000 deletions or something (if that'd actually make a difference) For output, I'd like a list of the top 100 sha1 hashes and a file name for each hash so I can find the file. So far as I know, the hashes aren't stored for deleted files.
------------------------------------------------------------------------------- From: DaB. <dab@ts.wikimedia.org> Date: Fri, 19 Jun 2009 20:49:33 ------------------------------------------------------------------------------- You can find a list at http://toolserver.org/~dab/queries/zman1.txt, the query was SELECT CONCAT("# ", fa_storage_key, "[[:image:",fa_name,"]] "), COUNT(fa_storage_key) as anzahl FROM filearchive WHERE fa_archive_name IS NULL AND fa_deleted_timestamp>"20080000000000" AND fa_storage_key!="" GROUP BY fa_storage_key ORDER BY anzahl DESC LIMIT 100; It listed the storage_key (that's a sha1-hash in a mediawikiformat), a image-name of a deleted picture with this key, and the count for all pictures that was delete after 1.1.2008. I hope it help you.
------------------------------------------------------------------------------- From: Mr.Z-man <mrzmanwikimail@gmail.com> Date: Fri, 19 Jun 2009 20:55:35 ------------------------------------------------------------------------------- The hashes are stored for deleted files, concatenated with the file extension as fa_storage_key in the filearchive table. #### Most common of all deleted files: Query: SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100; Results: http://toolserver.org/~alexz/DBQ-65/all.txt #### From the last 6 months: Query: SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_timestamp > 20090119000000 GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100; Results: http://toolserver.org/~alexz/DBQ-65/recent.txt #### Vandalism noted in deletion reason: Query: SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_reason RLIKE "[[:<:]](vandal(ism)?|goatse|shock)[[:>:]]" GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100; Resuts: http://toolserver.org/~alexz/DBQ-65/vandal.txt #### Recent vandalism deletions: Query: SELECT SUBSTRING(fa_storage_key, 1, 31) AS hash, COUNT(*), fa_name FROM filearchive WHERE fa_storage_key IS NOT NULL AND fa_storage_key != '' AND fa_deleted_timestamp > 20090119000000 AND fa_deleted_reason RLIKE "[[:<:]](vandal(ism)?|goatse|shock)[[:>:]]" GROUP BY hash ORDER BY COUNT(*) DESC LIMIT 100; Results: http://toolserver.org/~alexz/DBQ-65/recentvandal.txt
This bug was imported as RESOLVED. The original assignee has therefore not been set, and the original reporters/responders have not been added as CC, to prevent bugspam. If you re-open this bug, please consider adding these people to the CC list: Original assignee: (none) CC list: mike.lifeguard+bugs@gmail.com, wikimedia-bugzilla@dabpunkt.eu