Last modified: 2012-02-19 07:31:43 UTC
As of now, most file types are disallowed for uploads to the wikipedia. As i understand, this is due to the fact that MSIE can be tricked into executing potentially harmful javascript-code in any file, regardles of mime-type or file extension. This is the case because MSIE will interpret all files that "somehow look like" HTML as such. In the following i would like to propose a solution to this problem. It would be good to be again able to upload files in formats like MIDI and SVG - SVG ist especially tricky, as it allows JavaSCript by specification. This types of files would be hande so the data can be easily edited and re-used by others, an especially because the GFDL calls for the "transparent source" of a document, which is at the moment impossible to provide via Wikipedia. But it is also sad but true that javascript is in general not very secure, be it in HTML or in SVG. Thus, i would suggest doing the following: a) detect and reject all files that "somehow look like" HTML, emulating MSIEs guess - as i hear there is already experimental code for this in the CVS. I re-wrote a trivial version of this guess for the code presented at the URL given below. b) detect and reject all files containg javascript code. This will probably produce some false positives, but that is better than rejecting all but a few formats alltogether. The dection function in the URL below is rather crude, but should work most of the time. c) scan for viruses on upload, using an external virus scanner. The wrapper function i wrote is generic and can deal with any scanner that can be invoked from the command line. The example uses clamav. A prototype, proof-of-concept implementation can be viewed and tested here: http://area23.brightbyte.de/checkfile-test.php The source is available there, but would need some modifications to be integrated into mediawiki (i guess - i have never looked into the MW-source, and i don't plan to). For more information and suggestions, please go to http://de.wikipedia.org/w/wiki/Benutzer:Duesentrieb/checkfile I hope you like it and it's not to hard to put it in. It would be extremely helpful if we could again upload "obscure" things like MIDI and SVG to the Wikipedia. PS: Bug #667 relates to the same problem, but is very specific in that it justs requests a single format to be alowed. It also does not propose a solution.
I would like to add another suggestion: Show an extra warning on the description-page of media in a potentially dangerous format (or rather for any non-image, non-sound format). The message could read something like this: "This file may contain executable code that might damage your system. If you download this file and open it on your computer, potentially harmfult contents may be run. Please make sure you know what you are doing. The Wikimedia Foundation does not take any resposibility for the contents of this file or any harm it may do to your system." This is not only aimed at macro-viruses hidden in doc-fiels etc (which the virus scanner will hopefully find on upload), but maily at "trivial trojans" like batch files containing a "format C:" or some such. In that context, it meight also be good to block all files with the extensions .exe, .bat, .cmd, .reg, .js as well as any files starting with a she-bang (#!), regardles of the guessed mime-type. That is, there should be a list of forbidden extensions in addition to a list of forbidden mime-types (or does that already exist?), and an additional check for the she-bang. If you like, i could add that to the sample code, but it's trivial. When discussing this on wikitech-l, Rowan Collins suggested the following: That leads to an interesting thought: presumably, non-image files are already blocked from being included "inline"; but what about the magic [[Media:...]] links? I know it's pretty dumb to open a file that you don't know what it is, but some people genuinely are, and it would be nice to do everything we can to help them. In other words, perhaps it would make sense to have *three* whitelists: 1) Images: can be uploaded, linked to direct, and included inline 2) Other trusted files (e.g. sounds, videos): can be uploaded, linked to directly with [[Media:]], but not included inline with [[Image:]] 3) Potentially harmful files (e.g. .zip, .doc): can be uploaded (if they pass a virus scan), but not linked to directly; users are always directed to the description page, where they are presented with a warning message.
The Wikimedia Foundation's site (http://wikimediafoundation.org/) should allow uploading of any file immediately, since only trusted users can access it. I can see three problems with virus scanning at this time: * we like free software...but can we get a free virus scanner with decent definition updates? * some people may mistakenly think that it's foolproof * it will take a lot of processor time (expensive) The second problem can be resolved with decent warnings, I suppose. If we are relying on external hosting then we don't want to waste processor time. I don't like virus scanning...some are bound to slip through the net.
Be careful with that warning! Some old MS-DOS manuals contain some rubbish about users not being able to harm their computers whatever they did - people may not take that warning seriously, particularly in places where computers are just being introduced.
(In reply to comment #2) > The Wikimedia Foundation's site (http://wikimediafoundation.org/) > should allow uploading of any file immediately, since only trusted > users can access it. Maybe, maybe not. Trusted users can have conterminated files on their box too. > * we like free software...but can we get a free virus scanner with > decent definition updates? Yes, camav. Also, the code I designed is very felxible, it can easily interface to any scanner that can be run from the command line. > * some people may mistakenly think that it's foolproof Not more so than before - there is no reason to put up a sign "all files are scanned for viruses". The user will not notice the scanning unless a virus is found. > * it will take a lot of processor time (expensive) Not really. We see about one image per minute uploaded to the commons. Scanning a big jpg (1.8 MB) on my box (1.9 GHz P4) takes about 1 sec of CPU time. > I don't like virus scanning...some are bound to slip through the net. Yes. But our alternatives are 1) to ave no net at all and b) to limit allowed formats to a minimum. Note that there are virusses that can infeect JPGs, too (they use a bug in the windows image rendering libs and effect all browsers). Those are not detected right now. I belive may suggestion should be implemented as soon as possible, maybe even for 1.5; I already supplied code the implements the functionality, it should not be too hard to integrate it into mediawiki. Maybe I'll try to do it mayself, but i'm not sure when I get around to it. The disability to supply sources for graphics is frustrating to many users who would like to contribute SVG (we even have a renderer for that!), MIDI and similar formats that can easily be edited by others, unlike bitmap images.
oops, sorry, that should be ClamAV - see http://www.clamav.net/
Created attachment 497 [details] proof-of-concept patch implementing my suggestions. Please read the comment Created a proof of concept patch. This is just for reference - it uses an ugly bitfield logic, which has the advantage of being compatible iwth the current 1.5 DB schema. The Real Thing should however be based on three separate fields in the database, for media class (image, drawing, audio, video, etc) and major and minor mime type, as suggested by brion and tim. This patch consists of three parts (which should probably go into separate patches eventually): 1) Changes to the verification code in SpecialUpload.php: * improved mime detection / verification * detection of javascript * virus scanning 2) Changes to Image.php that allow to determine if a medium is fit to be shown inline, to be linked to directly, or is non-trusted formats that could contain malicious code (i.e. anything else). 3) Changes to ImagePage.php and Linker.php * put a warning on the description page of non-trusted files * disallow direct [[media:...]] links to non-trusted files The patch also contains changes to DefaultSettings, Defines and Languages to support the above. Please have a look at the patch and comment, so I can implement changes when I adopt this to the new and beter database schema.
Created attachment 511 [details] Full patch for this bug. Needs additional files and a database change, files follow.
Created attachment 512 [details] database patch for inserting new columns into the image table (path: maintenance/archives/patch-img_media_type.sql)
Created attachment 513 [details] mime type detection module
Created attachment 514 [details] standard mime.types file (mapping mime types to file extensions), used by MimeMagic.php (path: includes/mime.types)
Created attachment 515 [details] mime.info file defining mime type aliases and media types, used by MimeMeagic.php (path: includes/mime.info)
The attachments I just mad ned to be applied together. Note that this patch changes the columns in the image table (this is done by the patch-img_media_type.sql database patch): img_type is obsolete and replaced by img_media_type, img_major_mime and img_minor_mime. updater.inc was adjusted to apply this database patch, but it has not been tested. Note that the patch introduces several new setup variables into DefaultSettings.php. To take full advantage of the new upload behaviour, set the following in LocalSettings.php: $wgStrictFileExtensions = false; $wgVerifyMimeType= true; $wgAntivirus= "clamav"; #or "f-prot" $wgMimeDetectorCommand= "file -bi"; #on linux Note that if $wgMimeDetectorCommand is net set, the new code tries to use mime_content_type or finfo to detect the mime type, if either is available. Those extensions need to be compiled into PHP, their use has not been tested yet. This patch has been installed and tested successfully on Wegge's test wiki <http://playwiki.wegge.dk/>, you can try it out there. I hope it can be applied to the 1.5 branch soon.
Created attachment 516 [details] Danish translations for the new strings introduced by this patch Besides submitting the LanguageDa.php diff, I can tell that so far the test has given no warnings or errors from the MediaWiki installation, except for those introduced by my forgetfullness in the patching phase. If anybody experiences problems on http://playwiki.wegge.dk I'll be happy to investigate, given an approximate time.
Here's where the new files go (sorry, I should have put that into the description of each attachment). The patches are obviously applied as patches. The other files are, in order: * database patch: maintenance/archives/patch-img_media_type.sql * mime type module: includes/MimeMagic.php * mime type file: includes/mime.types * mime info file: includes/mime.info that should be it :)
There are some things you change that aren't directly related to this feature/functionality: This should probably be enabled at some point, but isn't related to this feature: """" -$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg' ); +$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg' ); """" Changing the Exif read/get functions doesn't have much to do with this: """" @@ -1146,7 +1289,10 @@ * @return array */ function retrieveExifData () { - if ( $this->type !== '2' ) return array (); + global $wgShowEXIF ; + if ( !$wgShowEXIF or !$this->fileExists) return array (); + + if ( $this->getMimeType() !== "image/jpeg" ) return array (); $exif = exif_read_data( $this->imagePath ); foreach($exif as $k => $v) { @@ -1162,11 +1308,14 @@ unset($exif[$k]); } } + return $exif; } function getExifData () { - global $wgRequest; + global $wgRequest, $wgShowEXIF; + + if (!$wgShowEXIF or !isset($this->exif)) return array(); $purge = $wgRequest->getVal( 'action' ) == 'purge'; $ret = unserialize ( $this->metadata ); @@ -1416,7 +1565,7 @@ """" Thera are also some stray whitespace changes: """" @@ -881,6 +1026,7 @@ } wfPurgeSquidServers( $urls ); } + } /** @@ -924,16 +1070,6 """" """" ); - + # Update the current image row """" """" @@ -1162,11 +1308,14 @@ unset($exif[$k]); } } + return $exif; } """" """" @@ -1443,7 +1592,6 @@ return array_key_exists( $name, $titleList ); } - """" """" } + chmod( $this->mSavedFile, 0644 ); return true; } @@ -640,153 +677,207 @@ """" """" + } ?> """" """" } + readfile( $fname ); } """" And regarding this: """" @@ -1,17 +1,20 @@ -- Extra image metadata, added for 1.5 +-- NOTE: as by patch-img_media_type.sql, the img_type +-- column is no longer used and has therefore be removed from this patch + ALTER TABLE /*$wgDBprefix*/image ADD ( img_width int(5) NOT NULL default 0, img_height int(5) NOT NULL default 0, img_bits int(5) NOT NULL default 0, - img_type int(5) NOT NULL default -1 + -- img_type int(5) NOT NULL default -1 ); ALTER TABLE /*$wgDBprefix*/oldimage ADD ( oi_width int(5) NOT NULL default 0, oi_height int(5) NOT NULL default 0, oi_bits int(3) NOT NULL default 0, - oi_type int(3) NOT NULL default 0 + -- oi_type int(3) NOT NULL default 0 ); """" if img_type and oi_type aren't used anymore please add a DELETE statement to patch-img_media_type.sql that removes them, and change maintenance/parserTests.php so that it doesn't create this obsolete table. Otherwise it looks pretty solid, I haven't really tested it on a wide range of input files yet, which I might just do;)
Some further comments after looking at it a bit more closely: * The whole MimeMagic.php file is just a row of functions most of which have to be parsed on each invocation, it should be a class that's loded on demand. * The whole wfInitMimeMaps() function just serves to make arrays like: Array ( [application/ogg] => ogg ogg [application/pdf] => pdf pdf [application/x-javascript] => js js [application/x-shockwave-flash] => swf swf [audio/midi] => mid midi kar mid midi kar [audio/mpeg] => mpga mpa mp2 mp3 mpga mp2 mp3 <snip> Array ( [OFFICE] => Array ( [0] => application/pdf [1] => application/pdf [2] => application/acrobat <snip> Array ( [] => text/rtf [application/x-javascript] => text/javascript [audio/mpeg] => audio/mp3 [application/ogg] => audio/ogg <snip> This could be simplified with something like: $mime = array( 'ogg' => array ( 'type' => AUDIO, 'mime' => array( 'audio/ogg', 'application/x-ogg', 'application/ogg', 'audio/x-ogg' ) ) ); Which would then be a native php datastructure, it would load alot faster and be a bit more understandable, the former will not really be a problem if it's rewritten as a class and invoced only when needed, but currently the whole thing is being parsed in full and arrays generated even when they're not needed such as on pages that don't have images, since Image.php is loaded even then. * There's some usage of @ to hide errors, this is generally a bad thing™ and should indicate that things aren't being done like they should.
(In reply to comment #16) I'd like to point out a few things about avar's comments. In general, he has some valid points, but with others I disagree. It would be good to have some more input on this. > * The whole MimeMagic.php file is just a row of functions most of which have to > be parsed on each invocation, it should be a class that's loaded on demand. It should indeed be loaded on demand, but being a class does not help with that (a class needs to be parsed in full, too, even if it's never instantiated). That is, the MimeMagic.php file should only be *included* on demand (ther's even a note about this in Image.php where it's included). This would require the include statement to be inside a function - I know this is possible, but brings with it other problems, like global variables defined in MimeMagic to be declared global explicitely (otherwise they would end up in the scope of the calling function). This is rather obscure, but could be done. I did not do that because I expected this to raise objections (as I belive doing includes inside functions is obscure and should be avoided). > * The whole wfInitMimeMaps() function just serves to make arrays like: ... > This could be simplified with something like: > > $mime = array( > 'ogg' => array ( > 'type' => AUDIO, > 'mime' => array( > 'audio/ogg', > 'application/x-ogg', > 'application/ogg', > 'audio/x-ogg' > ) > ) > ); > > Which would then be a native php datastructure, it would load alot faster and be > a bit more understandable, I disagree: * Complex nested array structures are less readable and harder to maintain (I tried it, it was like that initially). * the map above would have to be process to build the reverse mappings (mime to ext, mime to mime aliases, etc). Maintaining the reverse mappings by hand seperately is redundant and error prone. Building several maps from a file has the advantag of reducing redundancy and easing maintanance. * Also, mime.type files are a standard structure on *nix systems and it should be possible to use the systems standard mime map (/etc/mime.types) instead of the on provided with mediawiki. * The structure of mime.types and mime.info is much easier to understand and maintain by non-programmers, and they are also more robust against syntax errors. It is true however that native php structures would load faster. Even faster would be serialized php structures, but they are not really maintainable by hand. I belive this to be a non-issue, because those files are never parsed in normal "viewing" operation, but only under three conditions: * a new file is uploaded * an old version of a file is restored * the entry in the image table is upgraded from 1.4 format > the former will not really be a problem if it's > rewritten as a class and invoced only when needed, but currently the whole thing > is being parsed in full and arrays generated even when they're not needed such > as on pages that don't have images, since Image.php is loaded even then. That is not true: the reason the arrays are build inside a function is precisely to only do that on demand. The MimeMagic.php is included but non of the detector functions is ever called, the arrays are not initialized. Including them in the files as native structures would, among other things, cause them to always be parsed. Also, as I said before, if the function is wrapped in a class or not is irrelevant to the fact that it is always parsed in full. > * There's some usage of @ to hide errors, this is generally a bad thing™ and > should indicate that things aren't being done like they should. The @ to hide errors is used *only* for lookups in arrays when it is not known in advance if ther's a value for that key. The same could be done with isdefined, if that is preferred. Supressing errors is necessary in some situation and done throughout the code. please comment, so I can fix issues. I'm currently inclined to do the conditional include in Image.php, but i'll wait for more input for now.
Created attachment 525 [details] new version of MimeMagic.php, code is now wrapped in a class; this version matches the patch that follows. (path: includes/MimeMagic.php)
Created attachment 526 [details] new patch, using the new, OO version of MimeMagic.php New patch that hopefully takes care of avar's concerns. Main changes are * MimeMagic code is now wrapped in a class. One one global instance is needed * To get an instance of MimeMagic, use wfGetMimeMagic(). This function also includes the code file if neccessary. * I also adjusted ParserTest.php to use the new structure of the image table. enjoy!
Created attachment 527 [details] "upgrade" patch in case you already applied the first full patch (from 2005-05-10 20:46 UTC)
(In reply to comment #15) > This should probably be enabled at some point, but isn't related to this feature: > """" > -$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg' ); > +$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg' ); I removed ogg from the default whitelist because many web servers do not have a MIME type set for the .ogg extension. This makes such files more susceptible to various attacks: client-side type autodetection bugs are more easily triggered on text/plain or application/octet-stream, and server-side type detection bugs like interpreting files as executable scripts is easier to trigger when there's not a default file type for it to match. While the various other precautions make these less likely (checking for multiple extensions and blacklisting known bad ones, checking for known client-side script triggers, etc), I prefer a 'secure by default' philosophy. Admins can explicitly add it to the whitelist (as we have) if it's known to be safe in the environment they've created.
> New patch that hopefully takes care of avar's concerns. Main changes are It still changes $wgFileExtensions, the exif handling, whitespace and now has $wgUploadDescriptionPattern (although unused) in DefaultSettings.php, please make sure that the patches you send in only change things directly related to this particular feature, having other changes in there makes it harder to review.
*** Bug 667 has been marked as a duplicate of this bug. ***
*** Bug 1683 has been marked as a duplicate of this bug. ***
*** Bug 1790 has been marked as a duplicate of this bug. ***
Created attachment 542 [details] updated patch, implementing most of brions suggestions. Requires the new version of the other files, see below.
Created attachment 543 [details] MIME detection module, new version (includes/MimeMagic.php)
Created attachment 544 [details] updated mime info (includes/mime.info)
Created attachment 545 [details] updated mime types (includes/mime.types)
Created attachment 546 [details] updated database patch, now also removed the obsolete img_type column. (maintenance/archives/patch-img_media_type.sql)
I just uploded a new version of my patch. I hope I have implemented everything brion and avar suggested. Major changes are: * Databse consistence: -- automatically rebuild memcached image info -- give a descriptive error if the database schema was not updated -- delete obsolete column in the database patch * removed the code for disabling [[media:...]] links * Cleaned up changes to whitespace and code style. At least I tried. * Some changes to ImagePage.php - more info about non-image files, links to source of pre-rendered images, etc. * Tweaked the code to also render TIFF, BMP etc if ImageMagic is available. * PHP code is now detected an rejected * cleaned up mustRender/canRender/allowInlienDisplay Ther has been some more cleanup and minor fixed. Please have a look.
Created attachment 547 [details] MIME detection module, new version (includes/MimeMagic.php) The version by Daniel Kinzler had an undefined variable ($e) in guessTypesForExtension()
Created attachment 548 [details] MIME detection module, new version (includes/MimeMagic.php) Fixes a typo in guessTypesForExtension() (pregt_replace => preg_replace)
attachment 542 [details] contains alot of changes to the enotif code that tim cleaned up/removed recently, please go over the patch manually to make sure it doesn't revert any recent changes. And there's a syntax error on line 258 of MimeMagic.php, you need to add slashes around that regular expression.
Created attachment 549 [details] Changes to Language.php, should be applied to a clean tree (not directly after attachment 542 [details]) replaces the includes/Language.php part of attachment 542 [details], tones down the warning a little and removes the link to Wikipedia.
Created attachment 550 [details] Changes to ImagePage.php, should be applied to a clean tree (not directly after attachment 542 [details]) replaces the includes/ImagePage.php part of attachment 542 [details], moves the links around so that the direct link to the file is always immitiately after the TOC (as it was previously).
Created attachment 554 [details] MIME detection module (includes/MimeMagic.php), fixing bugs mentioned in #32, #33, #34
Created attachment 555 [details] new version of the FULL PATCH, fixing bugs mentioned in #32, #33, #34 Ok, I hope that's all. FYI: this patch includes things mentioned in #32, #33, #34, but not the changes by attachment 549 [details] and attachment 550 [details].
Commited to HEAD, marking this as FIXED.
*** Bug 1410 has been marked as a duplicate of this bug. ***