Last modified: 2008-10-20 18:01:16 UTC
I'm opening this for tracking and comment collection purposes; it is not yet actionable.
A recent foundation-l thread linked to some scaremongering about viruses/trojans on websites open to outside submissions. In that thread it was pointed out that the Wikimedia wikis are not sufficiently strict with uploaded content for some extensions. The risk created by this is probably fairly low, but it should be addressed.
For example, a win32 exe uploaded under a number of names:
As far as I can tell, our current detection of Ogg and MIDI files appears reliable and accurate. I don't see why we can't enforce it for those types. Can anyone provide any counterexamples or suggested test cases?
We do not appear to correctly detect valid XCFs.
I am not sure where we stand on the other formats.
It _should_ be enforcing for recognized types, but that might only be for internally-recognized ones at the moment. We need to take a peek at that...
My strong recommendation is to rip out the whole MimeDetector stuff with its unreliable external dependencies and just do our own magic detection for all types. That should be enough for many of the basics -- keep out DOS/Windows executables, detect Ogg, PDF, SVG, etc nicely.
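A minimal sketch of what "our own magic detection" could look like, in Python for illustration only (MediaWiki itself is PHP). The table layout, the MIME labels, and the function names `sniff` and `is_allowed` are hypothetical; the signature bytes are the well-known leading bytes of each format.

```python
from typing import Optional

# (signature bytes at offset 0, label). Labels are illustrative, not MediaWiki's.
SIGNATURES = [
    (b"MZ", "application/x-msdownload"),   # DOS/Windows executable
    (b"OggS", "application/ogg"),          # Ogg page capture pattern
    (b"%PDF-", "application/pdf"),
    (b"MThd", "audio/midi"),               # Standard MIDI file header chunk
    (b"gimp xcf ", "image/x-xcf"),
    (b"PK\x03\x04", "application/zip"),    # also matches ZIP-based office files
]

# Types that are rejected no matter what the extension claims.
BLOCKED = {"application/x-msdownload"}


def sniff(head: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


def is_allowed(head: bytes, expected: str) -> bool:
    """Accept only if the sniffed type is not blocked and matches `expected`."""
    detected = sniff(head)
    return detected not in BLOCKED and detected == expected
```

With this, an executable renamed to `.ogg` fails because its first bytes are `MZ`, not `OggS`. SVG has no fixed binary signature, so it would need a separate text-based check.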
I'm not sure how easy it is to cleanly detect the various office formats. The StarOffice/OpenOffice/OpenDoc ones are ZIP-based, which may make it hard to tell legit docs from other ZIP files.
ZIP-based documents are unusual enough that accepting plain ZIP files as them is not a great problem. We get many more articles uploaded as PDFs.
Note that we could easily differentiate them by reading the end of the ZIP file instead of the first KB and searching the central directory for content.xml and mimetype. We can even identify the real file type by reading the content of the mimetype entry (which is not supposed to be compressed).
As MIME detection is quite good (e.g. [[en:Bitmap file]]s with a jpg extension get the image/x-ms-bitmap MIME type), could an extension-to-MIME association be enough?
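The central-directory check described above could be sketched like this in Python. The function name `odf_mimetype` is hypothetical, and the sketch relies on the standard `zipfile` module, which locates the central directory from the end of the archive. It is an illustration under those assumptions, not a hardened validator.

```python
import io
import zipfile
from typing import Optional


def odf_mimetype(data: bytes) -> Optional[str]:
    """Return the declared ODF MIME type, or None if this is not an ODF-style ZIP."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()  # taken from the central directory
            if "mimetype" not in names or "content.xml" not in names:
                return None
            # The mimetype entry is supposed to be stored uncompressed.
            if zf.getinfo("mimetype").compress_type != zipfile.ZIP_STORED:
                return None
            return zf.read("mimetype").decode("ascii", "replace").strip()
    except zipfile.BadZipFile:
        return None
```

A plain ZIP without those two entries returns `None`, so it can be told apart from a real OpenDocument file.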
I would suggest guessing the MIME type from the extension, then using the corresponding media handler to validate it. We could fall back to some generic magic number extraction method for types with no media handler. OggHandler for instance could do some fairly complex validation on uploaded Ogg files, but it needs an appropriate entry point.
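A rough sketch of that flow in Python, for illustration only: map the extension to a MIME type, hand the file to a per-type handler if one exists, and otherwise fall back to a generic magic-number check. Every name here (`EXT_TO_MIME`, `HANDLERS`, `GENERIC_MAGIC`, `validate_ogg`, `validate_upload`) is hypothetical, and the Ogg "handler" is a stand-in for the more thorough validation a real OggHandler entry point could do.

```python
import os
from typing import Callable, Dict

# Extension -> MIME type the upload is expected to have (illustrative subset).
EXT_TO_MIME: Dict[str, str] = {
    ".ogg": "application/ogg",
    ".mid": "audio/midi",
    ".pdf": "application/pdf",
}


def validate_ogg(data: bytes) -> bool:
    """Placeholder for a media handler; only checks the Ogg capture pattern."""
    return data.startswith(b"OggS")


# MIME type -> media handler validation entry point.
HANDLERS: Dict[str, Callable[[bytes], bool]] = {"application/ogg": validate_ogg}

# Fallback magic numbers for types that have no media handler.
GENERIC_MAGIC: Dict[str, bytes] = {
    "audio/midi": b"MThd",
    "application/pdf": b"%PDF-",
}


def validate_upload(filename: str, data: bytes) -> bool:
    ext = os.path.splitext(filename)[1].lower()
    mime = EXT_TO_MIME.get(ext)
    if mime is None:
        return False  # unknown extension: reject in this sketch
    handler = HANDLERS.get(mime)
    if handler is not None:
        return handler(data)
    magic = GENERIC_MAGIC.get(mime)
    return magic is not None and data.startswith(magic)
```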
We also allow encrypted/DRMed PDFs right now. These should be denied. People are using the PDF protection to bind advertisements into documents, e.g.:
[gmaxwell@cherenkov ~]$ pdfinfo /home/syncin/wikipedia/commons/a/a1/Latin_for_Beginners.pdf
Title: Latin For Beginners
Subject: Latin Grammar
Author: Benjamin L. D'Ooge
Creator: Acrobat 5.0 Image Conversion Plug-in for Windows
Producer: Acrobat 5.0 Image Conversion Plug-in for Windows
CreationDate: Tue Oct 1 10:32:17 2002
ModDate: Mon May 24 10:26:33 2004
Encrypted: yes (print:yes copy:no change:no addNotes:no)
Page size: 612 x 792 pts (letter)
File size: 5839223 bytes
PDF version: 1.5
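For what it's worth, the "Encrypted: yes" case above could be caught at upload time with a heuristic along these lines. This is a Python sketch under a stated assumption: an encrypted PDF carries an `/Encrypt` entry (an indirect reference or an inline dictionary) in its trailer or cross-reference stream dictionary, and that entry is visible as plain bytes. The function name `looks_encrypted_pdf` is hypothetical. A byte search can be fooled in both directions, so a real check would want a proper PDF parser or a tool such as pdfinfo.

```python
import re

# "/Encrypt" followed by an indirect reference ("5 0 R") or an inline dictionary.
# Deliberately does not match other keys such as "/EncryptMetadata".
_ENCRYPT_RE = re.compile(rb"/Encrypt\s*(?:\d+\s+\d+\s+R|<<)")


def looks_encrypted_pdf(data: bytes) -> bool:
    """Heuristic: True if the file claims to be a PDF and has an /Encrypt entry."""
    if not data.startswith(b"%PDF-"):
        return False
    return _ENCRYPT_RE.search(data) is not None
```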
Will you only allow plain valid SVG then?
There's lots of Inkscape(annotated) SVG on the web ...
I very much agree with Tim, although I would suggest decoupling the handling of file formats from the handling of media types (the same player may be used for different formats, for example). I wrote about that a while back, see http://brightbyte.de/page/Media_handlers and http://brightbyte.de/page/MediaWiki_media_dreams
This has been bugging me for quite some time... if only I wasn't committed to studying right now. I kind of feel the urge to write this :)
> Will you only allow plain valid SVG then?
> There's lots of Inkscape(annotated) SVG on the web ...
Inkscape/Sodipodi SVG should certainly continue to be allowed, but that wouldn't stop us from validating that uploaded ".svg" files are valid XML which meets a number of basic tests of SVGness.
There is a big difference between an extended dialect of SVGs and a windows .exe file renamed to .svg. :)
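One possible "basic test of SVGness", sketched in Python: the file must parse as XML and its root element must be `svg` in the SVG namespace. Foreign-namespace attributes and elements (Inkscape, Sodipodi) do not affect the result. The function name `is_svg_document` is hypothetical, and the standard-library parser used here is not hardened against maliciously constructed XML, so treat this as an illustration only.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_svg_document(data: bytes) -> bool:
    """True if `data` is well-formed XML whose root element is svg:svg."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # not well-formed XML (e.g. a renamed binary)
    return root.tag == "{%s}svg" % SVG_NS
```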
Just check that SVGs start with <?xml or <svg, with an optional UTF-8 BOM at the beginning. This handles 98% of uploaded SVGs. It doesn't take into account UTF-16 SVGs (which we don't render); those would likely be the result of broken editing, so the encoding parameter of the text declaration would also be wrong.
It's great to filter out .exe files renamed as .svg.
It would be even better to make sure SVGs are valid. The W3C validator (which you can install locally) is very useful, though you might need to filter out foreign-namespace content first to allow Inkscape/Sodipodi/other annotations.
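The prefix check described above is small enough to sketch directly. This is Python for illustration; the name `looks_like_svg` is hypothetical. As in the description, it ignores UTF-16 files and does not skip leading whitespace.

```python
import codecs


def looks_like_svg(head: bytes) -> bool:
    """True if the data starts with '<?xml' or '<svg', after an optional UTF-8 BOM."""
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return head.startswith((b"<?xml", b"<svg"))
```

Note that any XML file passes the `<?xml` branch, so this only rules out obvious non-XML content such as a renamed executable.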
Ogg, PDF, MID, ODF, SVG, and XCF all have signature checks at present, as well as signature blacklists for EXE.
Old StarOffice/OpenOffice 1.x formats could be added, but should be considered deprecated at this point.