Last modified: 2008-03-18 23:17:06 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T7763, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 5763 - Store the hash of uploaded files to allow duplicate checking, etc.
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Uploading (Other open bugs)
Version: unspecified
Platform/OS: All
Importance: Normal enhancement, 8 votes
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 1459
Reported: 2006-04-29 21:57 UTC by Platonides
Modified: 2008-03-18 23:17 UTC (History)
2 users (show)



Attachments
My patch for it (1.71 KB, patch)
2006-04-29 21:59 UTC, Platonides
Details

Description Platonides 2006-04-29 21:57:32 UTC
I have been regularly checking & tagging the file uploads at my eswiki (main
project), and thus I have seen almost all images uploaded there recently.

I realized that we do not have any good way to compare files. Sometimes you see a
file you're pretty sure has been uploaded before, but you have no way to find it.
If it was uploaded with the same name you can assume it is the same (though you
can't be sure!). But usually it has another name. Even if you saw it in the same
session, you'd need to look through the images to find it.

The same goes when an uploaded file says: "from X wikipedia". You need to go
there, download the file and look at it to see if it matches.

In conclusion, I decided we needed a file hash for uploads. Then came a new
question: where do I store it?


The image table seems a good place, creating a new (indexable) field for it.

This has two problems:
- We need to change the table fields.
- We don't record the hash of deleted images, so there's no information on reupload. :(

The final solution could be a new relation table, but I didn't want to make
drastic changes to the table design.

So I tried to keep it simple and just put the MD5 hash in the logs. Pros: it's
a minor change. Cons: it's not a big change, so we can't use all the power this
feature could give us, BUT it's more than nothing. :-)

The patch I wrote for it (against r1495) is attached. A new $md5desc variable is
defined to hold the description with the hash to stamp on the logs. That applies
to Special:Log and the page history. It also truncates the description which
appears in these logs if it's too long. It was previously truncated as well, but
I couldn't find where; was it truncated by the db?

Note that since Special:Log can't be searched by description, we can't do a
complete log search to see if an image was previously uploaded, but we can use
the browser's search feature to check recent ones, and also have a bot logging
them to make them easier to fetch.

Bots could also use this data for interwiki image comparison. On the TODO list
(awkward to do as currently implemented) there would also be a check against the
previous image's hash, to reject a new version if it is identical to the previous one.

On the SoC proposals, hashing was also requested, though in a more extensive plan.



P.S. While writing this, an image was uploaded that I'm sure had been uploaded
before. Searching, I found it had in fact been deleted three days ago (with the
same name; of course, only my memory can attest it's the same).
Comment 1 Platonides 2006-04-29 21:59:52 UTC
Created attachment 1627 [details]
My patch for it

Patch against r1495. Requires $wgGetImageMd5 to be defined in LocalSettings.php.
Comment 2 Rob Church 2006-04-29 22:09:26 UTC
I would urge adding an img_hash field, and storing such a hash of the file
there. This would facilitate use in other locations, e.g. the aforementioned
Summer of Code idea.
Comment 3 Shinjiman 2006-04-30 05:55:06 UTC
Besides the MD5 hash, other hashing methods like SHA-1 are worth considering. :)
Comment 4 Platonides 2006-04-30 10:32:41 UTC
I agree, Rob Church, but as I explained, I tried to keep it simple. If you make
changes to the tables, you'll have to change more classes, and maybe do some kind
of system migration.

Plus, you'd be more rejealous of it, and I could have made more bugs.

I did the first step, but there's still a long way to go on this.

Shinjiman, agreed. You can see I said "decided we needed a file hash". I used MD5
as it's more common (47,800,000 hits for md5 vs 18,700,000 for sha1 on Google),
but if you swap md5_file for sha1_file in the code, it'll give you the SHA-1 instead :)
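The swap the comment describes is a one-word change in PHP (md5_file vs sha1_file). A minimal sketch of the same idea in Python, where the algorithm name is a parameter (the function name and chunk size here are my own, not from the patch):

```python
import hashlib

def file_hash(path, algorithm="md5"):
    """Hash a file's contents, mirroring PHP's md5_file()/sha1_file().

    Changing the algorithm name is the only edit needed to move from
    MD5 to SHA-1, just as the comment describes for the patch.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large uploads don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `file_hash(path)` and `file_hash(path, "sha1")` then yields the two digests discussed in this thread.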
Comment 5 Rob Church 2006-04-30 15:01:19 UTC
(In reply to comment #4)
> I agree, Rob Church, but as i explained, i tried to keep it simple. If you make
> changes to the tables, you'll have to change more clasess, maybe make some kind
> of system migration.

So? Functionality should never be unreasonably sacrificed for the sake of
performance or workload.

> Plus, you'd be more rejealous of it, and i could have made more bugs.

What the hell does this mean?
Comment 6 Platonides 2006-04-30 20:19:39 UTC
(In reply to comment #5)
> What the hell does this mean?
a) More complexity => Easier to make errors (bugs) + I'm no expert in wikimedia
coding.
b) You are trusted enough by the community, I need to get the patch reviewed &
accepted.

> So? Functionality should never be unreasonably sacrificed for the sake of
> performance or workload.
I expect the above has made my reasons clearer. Take into account that this is
my first code submission.

If you think adding an img_hash field is urgent, you can add it and have it do
nothing at first (unused). Then I can try to build on it so it actually works.

Note that even the approach is subject to discussion, as it wouldn't account for
searching deleted images ^^  Maybe this should be discussed elsewhere?
Comment 7 Tim Starling 2006-05-17 23:27:02 UTC
Add an img_hash field. Change ImagePage.php to display the hash from img_hash
where appropriate. Add the MD5 hash to the log comment on upload, not to
img_description. Use "MD5" in the user interface, not "Md5", and put such
strings in the language file, don't hard-code them. An indexed hash for deleted
images can wait until we have a deleted image archive.
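Tim's plan boils down to: hash each file at upload time and store the digest in an indexed img_hash column so duplicates can be looked up cheaply. A toy sketch of that flow using an in-memory SQLite table (the table and column names imitate the proposal but are not MediaWiki's actual schema):

```python
import hashlib
import sqlite3

def make_db():
    """Miniature image table with an indexed hash column, per the proposal."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY, img_hash TEXT)")
    db.execute("CREATE INDEX img_hash_idx ON image (img_hash)")
    return db

def upload(db, name, data):
    """Record the upload's hash; return existing files with identical content."""
    digest = hashlib.md5(data).hexdigest()
    dupes = [row[0] for row in db.execute(
        "SELECT img_name FROM image WHERE img_hash = ?", (digest,))]
    db.execute("INSERT INTO image VALUES (?, ?)", (name, digest))
    return dupes
```

A second upload of byte-identical content under a different name would then come back flagged with the first file's name, which is exactly the duplicate check the bug asks for.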
Comment 8 Tim Starling 2006-05-18 02:00:08 UTC
Apparently MD5 collisions can now be found in under a minute on a desktop PC,
with any chosen IV, since March 2006. There is public source code available to
generate these collisions. It's probably time we started migrating away from it.
The author of the March 2006 paper seems to think that SHA-1 and SHA-2 may be
similarly vulnerable, but nonetheless they might be the most practical
alternatives for the time being.
Comment 9 Platonides 2006-05-18 06:56:53 UTC
We're searching for methods of detecting identical images, not guarding against
deliberately forged hashes. I doubt that the code to generate collisions still
produces valid images, but it's worth knowing about. Any link?

PHP provides a sha1_file() function too, so no problem. There's no sha2_file()
function. There are extensions that provide one, but we probably don't want to
require more PHP extensions than strictly necessary.


Tim, I guess you're listing the steps to take. Again, how is a new field added?
I could alter my own table, but it'd break everyone else's ;)

Yes, I know about language files. If I dared to hardcode it, it was because I
don't think there are _translated_ names for it. And also because it was a bit
simpler. ;)
Comment 10 Rob Church 2006-05-18 07:02:43 UTC
(In reply to comment #9)

> Php provides sha1_file() funtion too, so no problem. There's no sha2_file()
> function. There're extension that provide it, but we probably don't want to need
> more php extensions than indispensable.

So use the sha1_file() function.

> Tim, i guess you're showing the steps to do. Again, How is a new field added? I
> could touch my table myself but it'd break everyone else's ;)

1. Update the table definitions in the maintenance folder (all of them)
2. Add a patch file in SQL format to the archive folder
3. Alter maintenance/updaters.inc and add the new field as demonstrated there

This means that "everyone else" can run the update scripts and expect it to work.
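The migration Rob outlines has two halves: change the schema for new installs, and give existing installs an updater that adds the column and backfills it. A minimal sketch of the backfill half, again on a toy SQLite table (the names and the `contents` mapping standing in for the upload directory are illustrative, not MediaWiki's updater API):

```python
import hashlib
import sqlite3

def add_hash_column(db, contents):
    """Add img_hash to an existing table and backfill rows uploaded
    before the schema change. `contents` maps img_name -> file bytes.
    """
    db.execute("ALTER TABLE image ADD COLUMN img_hash TEXT")
    for (name,) in db.execute("SELECT img_name FROM image").fetchall():
        digest = hashlib.md5(contents[name]).hexdigest()
        db.execute("UPDATE image SET img_hash = ? WHERE img_name = ?",
                   (digest, name))
```

Running something like this from the update script is what lets "everyone else" get the new field without hand-editing their tables.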

> Yes, i know about about language files. If i dared to hardcode it was because i
> don't think there are _translated_ names for it. And also because it was a bit
> simpler. ;)

A poor excuse. Just add the message and leave the translators to decide whether
their language has a word for it. 'MD5' and 'SHA1' don't sound like the sort of
thing that would be translated, however.

If you're going to do it, do it properly, otherwise it's useless.
Comment 11 Platonides 2006-05-18 20:25:38 UTC
Ok, i think i should [[Wikipedia:Be Bold]] and try it.
Comment 12 Rob Church 2006-07-04 11:24:12 UTC
*** Bug 1459 has been marked as a duplicate of this bug. ***
Comment 13 Philip Ganchev 2006-08-13 08:16:08 UTC
Is it better to expose the hashes to the user, or use them only internally so
that the user only knows that images are being compared?
Comment 14 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-13 20:25:27 UTC
Another question: how about normalization?  If you're using this for image
comparison, it's unnecessarily limited to only permit comparison between
identical formats, identical sizes, and identical compression levels.  A logical
baseline would be a smallish, low-quality JPEG (obviously stripped of metadata),
since the compression artifacts would be important for comparing JPEGs to
lossless formats.  More hits is going to be better than fewer, of course, given
that we aren't looking to *prevent* anyone from saving a duplicate, just giving
the option of cancelling and/or superseding the other image(s).

(In reply to comment #13)
> Is it better to expose the hashes to the user, or use them only internally so
> that the user only knows that images are being compared?

May as well expose them, unless you're going to have some kind of encryption
step using a private key (which seems more than slightly paranoid).  This is an
open-source project, after all; anyone could just make the hashes themselves.
Comment 15 Brion Vibber 2006-08-13 20:26:30 UTC
The hash will be the filename.
Comment 16 peter green 2006-08-13 20:32:44 UTC
Best to expose the hashes; it's much easier to copy a hash from one wiki and use
it to search on another than to save and re-upload the file everywhere.
Comment 17 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-13 21:24:40 UTC
(In reply to comment #15)
> The hash will be the filename.

Storing a normalized hash for further comparison would remain useful.
Comment 18 Brion Vibber 2006-08-13 21:28:45 UTC
A "normalized hash" doesn't sound very practical when it comes to 
images. It is possible to compare similar images, but that's 
going to be something totally unrelated.
Comment 19 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-13 21:32:40 UTC
(In reply to comment #18)
> A "normalized hash" doesn't sound very practical when it comes to 
> images. It is possible to compare similar images, but that's 
> going to be something totally unrelated.

I mean "hash of a normalized image".  If you normalize the image to low-quality
fixed-size JPEG before saving, you'll be able to catch a lot of matches that
wouldn't otherwise show up due to different formats, sizes, compression levels,
even metadata.  Still not perfect, but what is?
Comment 20 Brion Vibber 2006-08-13 21:34:35 UTC
Not just not perfect, but totally impractical. You're not going 
to get cryptographic hashes to match that way, at all. It simply 
wouldn't work, as any 1-bit difference will give you a hugely 
different value.
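Brion's avalanche-effect point is easy to demonstrate: flipping a single bit of the input changes roughly half the bits of an MD5 digest, so a cryptographic hash can only ever match byte-identical files (variable names here are my own):

```python
import hashlib

a = b"the same image data"
b = bytes([a[0] ^ 1]) + a[1:]  # identical except for one flipped bit

def bit_difference(x, y):
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))

# Despite a 1-bit input difference, the digests differ in roughly
# half of MD5's 128 bits.
diff = bit_difference(hashlib.md5(a).digest(), hashlib.md5(b).digest())
```

This is why exact-hash matching and "similar image" matching are, as Brion says, totally unrelated problems.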
Comment 21 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-13 23:05:40 UTC
You're right, normalization needs to be much more extreme than just converting
to low-quality JPEG.  I achieved it with two test images, [[Image:Libertatis
Aequilibritas GFDL.jpg]] and [[Image:Libertatis Aequilibritas GFDL.png]], by
reducing both to 10-pixel-wide monochrome with no dithering, after converting
transparency to white; they were then identical except that for some reason they
were negatives of each other, presumably an artifact of the algorithm used. 
This would give a 2^-100 probability of a chance match, which isn't much worse
than the probability of a random MD5 match.
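The reduction Aryeh describes (shrink to a tiny monochrome grid, then compare) is essentially an average hash. A toy sketch on nested lists of grayscale values, so no imaging library is assumed; a real version would first downscale actual image files:

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (0-255); returns a bit string
    with one bit per pixel, thresholded against the mean brightness."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if v > mean else "0" for v in flat)

def hamming(a, b):
    """Number of differing bits between two hash strings."""
    return sum(x != y for x, y in zip(a, b))

# Two tiny "images": the second is a uniformly brighter copy, as if
# re-saved at a different quality. Their average hashes still match,
# even though the raw bytes (and any cryptographic hash) differ.
img = [[10, 200], [220, 30]]
brighter = [[v + 20 for v in row] for row in img]
```

With a 10x10 grid this gives the 100-bit signature Aryeh's 2^-100 estimate refers to; the negative-image artifact he mentions could be handled by also comparing against the inverted bit string.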
Comment 22 Gregory Maxwell 2007-08-06 17:15:52 UTC
Some useful background for anyone else looking at this issue:

Fuzzy image matching, also called image indexing, perceptual hashing, or image
fingerprinting, is an area under active research.

A paper you might want to read is: http://www.robots.ox.ac.uk/~vgg/research/affine/det_eval_files/mikolajczyk_pami2004.pdf

Beyond the difficulty of finding descriptors, fast lookup also tends to be a
problem. Good image descriptors tend to have high dimensionality, and
traditional tree-based approaches (e.g. kd-trees) fail to produce fast nearest-match
lookups on high-dimensionality data.


For the Wikimedia projects we store SHA1s for deleted images. There is now a set of IRC bots in (#commons-image-uploads2, #wikipedia-en-image-uploads) which check all new uploads against the deleted image SHA1s. They are catching a fair number of reuploads of deleted images.

I'm hoping to add a first-pass fuzzy matching support in the next couple of weeks. I'm not sure how a fuzzy image matching 'similar images' feature can be integrated into mediawiki proper.
Comment 23 Brion Vibber 2008-03-18 23:17:06 UTC
SHA1 hash field got added a while ago. Yay!
