Last modified: 2008-03-19 21:48:21 UTC
It would be very useful to be able to search for images by a hash (the exact type of hash doesn't bother me too much; md5 or sha1 would be fine). This hash should also be displayed somewhere on the image description page. The point of this is that if I see an image on Commons that says "from German Wikipedia" and the uploader has renamed it, I want to be able to find the image on the German Wikipedia.
This feature would also help with duplicate files under different names, if extended a bit. People upload a file not knowing that it's already there, because the first one wasn't categorized very well or the duplicate uploader just didn't look thoroughly enough. There is, however, no reason people should have to do this searching manually. On each upload of a file MediaWiki could:

1) Generate a hash of the uploaded file
2) Check if the generated hash is already known, i.e. if the file is a duplicate
   * This part would be the only database query needed for a hash search feature.

Then, depending on configuration based on analysis of possible false hash collisions and such, it could:

3a) Display a warning to the user that the file already exists, or
3b) Display an error to the user that the file already exists, and reject the file

This would require computing a hash, or even multiple hashes with different methods, for all revisions of all existing files. Duplicate detection would not work properly while hashes are still being generated and added to the database. Hashes for deleted files or revisions would also be useful for generating different warnings when someone uploads a file that was deleted before, but implementing that might be more complicated.
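The upload-time check proposed in this comment can be sketched roughly as below. This is only an illustration in Python, not MediaWiki code: the `HashIndex` class, `check_upload`, and the `reject_duplicates` switch are hypothetical names, and an in-memory dict stands in for the database lookup in step 2.

```python
import hashlib


def file_sha1(data: bytes) -> str:
    """Step 1: generate a hash of the uploaded file (SHA-1, hex encoded)."""
    return hashlib.sha1(data).hexdigest()


class HashIndex:
    """Hypothetical stand-in for an indexed hash column in the database."""

    def __init__(self):
        self._by_hash = {}  # hex digest -> list of file names with that content

    def check_upload(self, name, data, reject_duplicates=False):
        digest = file_sha1(data)
        # Step 2: a single lookup by hash; copy so later appends don't leak out.
        existing = list(self._by_hash.get(digest, []))
        if existing and reject_duplicates:
            # Step 3b: report an error and do not store the file.
            return ("error", existing)
        # Step 3a (or no duplicate at all): store the file, warn if needed.
        self._by_hash.setdefault(digest, []).append(name)
        return ("warning" if existing else "ok", existing)


index = HashIndex()
print(index.check_upload("Original.png", b"same bytes"))
print(index.check_upload("Renamed.png", b"same bytes"))
print(index.check_upload("Third.png", b"same bytes", reject_duplicates=True))
```

Whether a duplicate produces a warning or a hard rejection is left as a configuration switch here, matching the 3a/3b split above. Hashes for old revisions and deleted files are not modelled.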
It would also be useful to generate hashes for all thumbnails that are generated, since the kind of people who copy images without proper attribution are often the kind of people who copy a thumbnail rather than the full-resolution image.
*** This bug has been marked as a duplicate of 5763 ***
Note there is now a properly-indexed SHA-1 hash field on the image table in recent versions. I have the vague recollection that there's a way to do lookups by hash in the API, but not in the UI at present. Dupe file warnings are also not currently made.
(In reply to comment #4)
> I have the vague recollection that there's a way to do lookups
> by hash in the API, but not in the UI at present.

api.php?action=query&list=allimages&aisha1=123abc
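As a rough sketch of how a client might build that query for a local copy of a file: the Python below hashes the file contents with SHA-1 and fills in the `aisha1` parameter from the URL above. The function name and the example.org base URL are made up for illustration, and the sketch assumes `aisha1` expects the hex form of the digest.

```python
import hashlib
from urllib.parse import urlencode


def sha1_lookup_url(api_base: str, data: bytes) -> str:
    """Build an allimages query URL that looks a file up by its SHA-1 hash."""
    digest = hashlib.sha1(data).hexdigest()  # assumed: hex-encoded digest
    params = {"action": "query", "list": "allimages", "aisha1": digest}
    return api_base + "?" + urlencode(params)


# Hypothetical wiki base URL; replace with the api.php of the wiki to search.
print(sha1_lookup_url("https://example.org/w/api.php", b"file contents here"))
```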
[[Special:FileDuplicateSearch]] introduced with r32180. A link on the image description page to Special:FileDuplicateSearch/filename.ext added too. Bug 11984 filed for dupe file warning at time of upload.
And bug 13434 is also filed :)