Last modified: 2013-08-04 17:12:18 UTC
The AbuseFilter extension can block uploads based on SHA-1 hash. I don't believe the UploadBlacklist extension is still necessary and I propose removing the following lines:

from <https://noc.wikimedia.org/conf/CommonSettings.php.txt>:
---
include( $IP . '/extensions/UploadBlacklist/UploadBlacklist.php' );

# Upload spam system
// SHA-1 hashes of blocked files:
# FIXME should check file size too
$ubUploadBlacklist = array(
    // Goatse:
    'aebbf277146e497c036937d3c3d6d0cac49a37a8', // 20050901082002!Patoo.jpg

    // Spam:
    '7740dab676725bcf6ea58b03b231aa4ec6c7ff34', // Austriaflaggemodern.jpg
    '1f1c44af6ee4f6e4b6cb48b892e625fa52238bd1', // Nostalgieplattenspielerei.jpg
    'e6eb4549756b88e2c69171ffbd278be51c3e2bfe', // Patioboy.jpg
    'eeb9b16edb9b5e9c58f47a558589e7eb970f32c0', // Shoessss.jpg, 73464736474847367.jpg
    '14e4858e63b008a7e087f2b90d3f57c021ab0f78', // Vacuumbigmell.jpg
    'f989e303ef505c4706db42d5cdad67841042e2b9', // 998_pre_1.jpg

    // Ass pus:
    '27979159b13b819d1bf62e1071a0c2a54b373ed5', // Squish.png
    '7176aeddf3d7d8aada785721773ffeb7ee7b292e', // 20050905221505!Linguistics_stub.png *
    '27979159b13b819d1bf62e1071a0c2a54b373ed5', // 20050905235133!Leaf.png
    'bb3acc61413ef813453a4b0c0198e30b2cd8fcf9', // Kitty100.jpg
    '855e55c4925644aeaef262ef25dd00815761c076', // Wikipedia-logo-100px
    '203bc24e5291e543779201734c49cfd88fcb2445', // Wikipodia-logo.png
    '14d2a0c0f3081815d04493f72ab5970c51422bc7', // Bung.jpg
    '3c610bc87d0ba49467c6f2d3cfba4b3321f6b351', // Blue_morpho_butterfly_300x271.png
    '7176aeddf3d7d8aada785721773ffeb7ee7b292e', // 20050905235450!Blue_morpho_butterfly_300x271.png
    '7a7f9d7ef52ed8967cb6b0171ef8d45e2a0c68b9', // Leaf.png
    '1ecfaf883c4130e1827290ad063158d0037631e6', // Wikimedia-button1.png
    '1c73d6596685175a8af6b08508468252c4dff8e2', // Windbuchencom.jpg
    '203bc24e5291e543779201734c49cfd88fcb2445', // Leaf.png
    '95d825bcf01ca3e553f4175dd7238ff12ba1d153', // 20050915055251!New_Orleans_Survivor_Flyover.jpg
    'bbd292d917d7fa7dec9a524de77ca39bd8cdf738', // 20050915060435!New_Orleans_Survivor_Flyover.jpg

    // Some singnet guy
    'bed74eef04f5b54884dc650679e5688c7c1f74cb', // Peniscut.jpg
);
---

from <https://noc.wikimedia.org/conf/InitialiseSettings.php.txt>:
---
'UploadBlacklist' => "udp://$wmfUdp2logDest/upload-blacklist",
---

This will help reduce our technical debt.
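
For readers unfamiliar with hash-based upload blocking, the check amounts to a set lookup on the SHA-1 digest of the uploaded bytes. The following is a minimal Python sketch of that idea, not the code of either extension; `BLOCKED_SHA1`, `sha1_hex` and `is_blocked` are hypothetical names, and the single blacklist entry is derived from stand-in bytes rather than taken from the list above.

```python
import hashlib

# Hypothetical blacklist: lowercase hex SHA-1 digests of blocked files.
# The single entry is computed from stand-in bytes, not from a real file.
BLOCKED_SHA1 = {hashlib.sha1(b"stand-in for a blocked file").hexdigest()}


def sha1_hex(data: bytes) -> str:
    """Return the 40-character hex SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def is_blocked(data: bytes, blocked=BLOCKED_SHA1) -> bool:
    """True if the exact byte content matches a blacklisted digest."""
    return sha1_hex(data) in blocked


if __name__ == "__main__":
    print(is_blocked(b"stand-in for a blocked file"))   # exact match
    print(is_blocked(b"stand-in for a blocked file."))  # one extra byte
```

Because the lookup is on the digest of the exact bytes, any change to the file, however small, produces a different digest and is not matched.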
Bug 44975 is a soft dependency, not a hard dependency.
This likely needs broader discussion.
(In reply to comment #2)
> This likely needs broader discussion.

http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070796.html
Might be worth waiting for global AbuseFilters.
https://gerrit.wikimedia.org/r/76229
https://gerrit.wikimedia.org/r/76230
Sounds good to me. The hashes listed in the settings files are more than 8 years old and there are only a few entries, so I guess that proves UploadBlacklist is not useful anymore :-) I will be happy to see it gone.
(In reply to comment #1)
> Bug 44975 is a soft dependency, not a hard dependency.

I'd disagree right now. Considering these are currently blocked on *all* wikis, having to go around and add AF rules for each wiki to block these same images would be a huge waste of time.

Other than that, totally in favor of killing this.
(In reply to comment #8)
> I'd disagree right now. Considering these are currently blocked on *all*
> wikis, having to go around and add AF rules for each wiki to block these same
> images would be a huge waste of time.

Respectfully, I think you're making a fatal assumption here: the current blacklist isn't ever being hit. According to Reedy's examination of the UploadBlacklist logs, there have been 0 hits this year, as I understand it.

While the current blacklist is indeed global, there's been no evidence presented that there will be any need to go around adding AbuseFilter rules related to globally blacklisted images to any wiki. (This of course side-steps the point that many wikis disable local uploads altogether and rely on a single wiki [Wikimedia Commons].)
(In reply to comment #9)
> (In reply to comment #8)
> > I'd disagree right now. Considering these are currently blocked on *all*
> > wikis, having to go around and add AF rules for each wiki to block these same
> > images would be a huge waste of time.
>
> Respectfully, I think you're making a fatal assumption here: the current
> blacklist isn't ever being hit. According to Reedy's examination of the
> UploadBlacklist logs, there have been 0 hits this year, as I understand it.

Not hitting doesn't mean people wouldn't try if the blacklist was gone. Maybe they gave up long ago ;-)

> While the current blacklist is indeed global, there's been no evidence
> presented that there will be any need to go around adding AbuseFilter rules
> related to globally blacklisted images to any wiki. (This of course side-steps
> the point that many wikis disable local uploads altogether and rely on a single
> wiki [Wikimedia Commons].)

This is true. Blacklisting on commons would cover a great many cases.
We do indeed have a central AbuseFilter database, so I guess it would just be a matter of adding the existing hashes to a new global rule :-] I have no idea who can create the new rule, though.
Once the blocker for this bug (bug 44975) is finished then we'll add the hashes as global rules.
(In reply to comment #12)
> Once the blocker for this bug (bug 44975) is finished then we'll add the
> hashes as global rules.

No, we really won't. Creating and deploying an AbuseFilter filter (particularly a global filter) requires a demonstration of active abuse. There's no such demonstration here (cf. comment 9).
(In reply to comment #5)
> https://gerrit.wikimedia.org/r/76229

Merged and deployed.

(In reply to comment #6)
> https://gerrit.wikimedia.org/r/76230

Merged and deployed.

This bug is resolved/fixed (cf. [[Special:Version]]). Thank you, Reedy!
It's so easy to derive a variant of a spammed image by changing a few random bits in it (including within invisible embedded metadata such as camera info or the creator software version string, or by adding a randomly selected background around the bad image) that I think it is superfluous to check the SHA-1 digest to detect spammed images.

SHA-1 is the wrong method to identify spammed images. A method based on image subsampling, with some distance threshold on color plane values, ignoring all metadata fields but taking the ICC profiles into account to produce the accurate final color before subsampling, would be much better.

An image could be identified by creating identifiable bounding boxes between the most contrasting pixels, in order to eliminate the effect of image realignment with custom internal margins of variable sizes. Once this is done, the subsampling can be correctly aligned to a box of 512x512 pixels (if the image is not square, its minimum width/height will be set between 256 and 512 and its maximum will be set to a multiple of 512, creating a horizontal or vertical band of 512x512 squares). SHA-1 can then be computed on subblocks of 8x8 pixels to count the number of common subblocks, giving a score for possible copies. Above some threshold, this score would raise an alert for human inspection in a specific category or report showing the two images (the one identified as spammed or infringing a copyright, and the new image).

There are probably newer algorithms that help match comparable images. For example, Google is able to recognize people's faces or monuments automatically in any photo, using heuristic methods that can correct for differences in lighting, changes of resolution, image cropping, border decorations, slight rotations...

Many spammed images also display text (e.g. domain names or tiny URLs), and some OCR may recognize this text as an additional method to identify spam (we could also forbid the display of external URLs, notably those hosted on tiny URL providers).

Is there work somewhere on automating the recognition of image subjects, and a way to develop an extension that compares new incoming images with some well-known bad images, in a special page where the problematic images will not be publicly downloadable/reusable, so that Commons will not be the distribution vector, notably for phishing emails?

Do we monitor security alerts about phishing emails containing images that could be hosted on Commons or on another wiki?

Can we also develop identification mechanisms for other media types (notably PDF, ePUB, audio and video) without using the basic SHA-1 signature?
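
The subblock-counting step proposed above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not an implementation of the full proposal: both images are assumed to be already aligned, subsampled to the same size and converted to grayscale (a list of rows of values 0..255), and the "distance threshold" is approximated by quantizing values into `levels` buckets. The names `block_digests` and `similarity` are hypothetical.

```python
import hashlib

BLOCK = 8  # block edge in pixels, as in the proposal above


def block_digests(pixels, levels=16):
    """Return the SHA-1 digest of each 8x8 block of a grayscale image.

    Values are first quantized into `levels` buckets (an assumption of this
    sketch) so that small differences in value do not change the digest.
    """
    height, width = len(pixels), len(pixels[0])
    step = 256 // levels
    digests = []
    for top in range(0, height - BLOCK + 1, BLOCK):
        for left in range(0, width - BLOCK + 1, BLOCK):
            block = bytes(
                pixels[y][x] // step
                for y in range(top, top + BLOCK)
                for x in range(left, left + BLOCK)
            )
            digests.append(hashlib.sha1(block).hexdigest())
    return digests


def similarity(a, b):
    """Fraction of blocks at the same position whose digests are equal."""
    da, db = block_digests(a), block_digests(b)
    if not da or len(da) != len(db):
        raise ValueError("images must have the same size, at least 8x8")
    matches = sum(1 for x, y in zip(da, db) if x == y)
    return matches / len(da)


if __name__ == "__main__":
    original = [[(x * 16 + y) % 256 for x in range(16)] for y in range(16)]
    altered = [row[:] for row in original]
    altered[0][0] = 255  # large change inside the top-left block only
    print(similarity(original, altered))  # 0.75: three of four blocks match
```

Note that plain quantization is a crude stand-in for a real distance threshold: two values that are close but fall on either side of a bucket boundary still produce different digests.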
Patches welcome. Or even a link to a description of an algorithm (preferably along with an analysis of how effective the algorithm is) that can generate some sort of hash from an image that stays the same under things like resizing (or recompressing) the image, and is very efficient to compute.

It should be noted that the upload blacklist was never an antispam measure. It was meant to prevent malicious (but stupid) users from uploading very disturbingly graphic images.
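
One commonly described technique of the kind asked for here is the "average hash": shrink the image to a small grayscale grid, then record for each cell whether it is brighter than the grid's mean. The Python sketch below is only an illustration of that idea and makes no claim about its effectiveness against deliberate evasion; it assumes the image is already decoded to grayscale (a list of rows of values 0..255), and `average_hash` and `hamming` are hypothetical names.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit integer hash of a grayscale image.

    The image is reduced to a size x size grid by averaging the pixels
    that fall in each cell; each bit says whether a cell is above the mean.
    """
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    cells = []
    for i in range(size):
        y0, y1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            x0, x1 = j * width // size, (j + 1) * width // size
            total = sum(pixels[y][x] for y in range(y0, y1) for x in range(x0, x1))
            cells.append(total / ((y1 - y0) * (x1 - x0)))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    img = [[(x * x + 3 * y) % 256 for x in range(16)] for y in range(16)]
    # The same picture at twice the resolution (each pixel repeated 2x2).
    big = [[img[y // 2][x // 2] for x in range(32)] for y in range(32)]
    print(hamming(average_hash(img), average_hash(big)))  # 0
```

In practice two uploads would be treated as candidates for human review when the Hamming distance between their hashes falls below some chosen threshold, rather than requiring exact equality.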