Last modified: 2014-10-11 13:40:03 UTC
Uploading a .doc file to MW 1.19.1 (which, incidentally, isn't listed in the versions box here), I get the following error:
File extension ".doc" does not match the detected MIME type of the file (application/zip).
I tried removing "doc" from application/msword in /includes/mime.types and then also adding it to application/zip (recommended at [1]). The former edit did nothing; the latter resulted in:
The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security.
I tried keeping the above modifications to /includes/mime.types and adding $wgAllowJavaUploads = true; to LocalSettings.php, as recommended at [2], and uploading of these files works correctly. However, this seems pretty hacky. Surely allowing .doc files in $wgFileExtensions should be enough, and they shouldn't be treated as zip files or jars or anything?
--
[1] http://www.mediawiki.org/wiki/Manual_talk:Mime_type_detection#Fix_for_Uploading_MS_Word_2007_.28and_greater.29_Files
[2] http://www.mediawiki.org/wiki/Thread:Talk:MediaWiki_1.18/file_upload_error/reply
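As a rough illustration of why moving "doc" between entries changes which error appears, here is a minimal Python sketch of an extension-versus-detected-type check. It is not MediaWiki's code: the table contents are illustrative, and `parse_mime_types` and `extension_matches` are hypothetical names. It only assumes the usual mime.types layout of a MIME type followed by its extensions.

```python
# Illustrative entries only; not the contents of MediaWiki's shipped mime.types.
MIME_TYPES_TEXT = """\
application/msword doc dot
application/zip zip
"""


def parse_mime_types(text):
    """Map each MIME type to the set of extensions listed after it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime, *extensions = line.split()
        table.setdefault(mime, set()).update(extensions)
    return table


def extension_matches(table, extension, detected_mime):
    """True if `extension` is listed under the MIME type that was detected."""
    return extension.lower() in table.get(detected_mime, set())


table = parse_mime_types(MIME_TYPES_TEXT)
# Content detected as application/zip, but "doc" is only listed under msword.
print(extension_matches(table, "doc", "application/zip"))
```

Under these assumptions, listing "doc" under application/zip makes the extension check pass, which would explain why the upload then proceeds to the ZIP security check and fails there instead.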
The problem is that modern .doc files ARE actually zip files. Can you attach a specific file that has this behavior and detail the environment that you are running MediaWiki on (OS, PHP version, etc.)? It's probably one of the external tools that is incorrectly identifying this as a zip file, so you will need to tweak something in the environment if you want MediaWiki to be able to properly identify the file.
(In reply to comment #1)
> The problem is that modern .doc files ARE actually zip files. Can you attach a
> specific file that has this behavior and detail the environment that you are
> running Mediawiki on (OS, PHP version etc ?)
Attached is a file that gives the "The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security" error under the following configuration: 'doc' is listed as an extension for both application/msword and application/zip and $wgAllowJavaUploads is false (well, not set in LocalSettings.php, to be precise).
Environment is: Windows Server 2008, 64-bit; PHP 5.3.10; MediaWiki 1.19.1.
> It's probably one of the external tools that is incorrectly identifying this as
> a zip file, so you will need to tweak something in the environment if you want
> Mediawiki to be able to properly identify the file.
Let me know if I can provide any more information. And thank you for your help!
Created attachment 10854: A sample MS Word document that is giving the described error on upload.
Does this also happen on the currently running version of MW on WMF sites? And does it happen on an install of master?
None of the WMF sites permit the upload of MS Word files (so far as I know). I'm installing from Git now, to let you know if it works on master.
Okay, so uploading the above file to MW 1.20alpha (7ab935b) on PHP 5.3.8 and Apache still gives: The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security.
$ unzip test.doc
Archive: test.doc
warning [test.doc]: 6034 extra bytes at beginning or within zipfile
(attempting to process anyway)
I'm guessing it's these extra bytes that are confusing our parser.
The file starts with d0 cf 11 e0 a1 b1 1a e1, which is the header for the old .doc 2003 file type. This file is probably saved in compatibility mode as a 2003 .doc file, with an internal 2010 .docx file. Apparently our code finds the .zip header before it finds the .doc header.
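To make the two signatures concrete, here is a minimal Python sketch that reports whether a byte string starts with the Compound Document header quoted above and where the first ZIP local file header appears. This is not MediaWiki's detection code; `describe_container` is a hypothetical helper, and only the magic numbers come from the file formats.

```python
CFB_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # header bytes quoted above
ZIP_LOCAL_HEADER = b"PK\x03\x04"               # ZIP local file header signature
ZIP_END_RECORD = b"PK\x05\x06"                 # ZIP end-of-central-directory signature


def describe_container(data):
    """Report which container signatures appear in `data` and where."""
    return {
        "starts_with_cfb": data.startswith(CFB_MAGIC),
        # Offset of the first ZIP local header, or -1 if there is none.
        "first_zip_header_offset": data.find(ZIP_LOCAL_HEADER),
        "has_zip_end_record": ZIP_END_RECORD in data,
    }


# Synthetic bytes for illustration; a real check would read the uploaded file.
sample = CFB_MAGIC + b"\x00" * 100 + ZIP_LOCAL_HEADER + b"rest"
print(describe_container(sample))
```

A file like the attached one would be expected to report both a Compound Document header at offset 0 and a ZIP header further in, which is consistent with the "extra bytes at beginning" warning from unzip.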
For future reference, this page has information on .doc files, including per-version signature regexps: http://beta.domd.info/category/mime-types/applicationmsword
Created attachment 10936: Patch to bypass xml document parser
A patch that bypasses the zip format detector in case the file starts with an Office Compound Document Format header.
This isn't a working patch yet, because the upload subsequently fails in the Java detector, with:
ZipDirectoryReader: Fatal error: trailing bytes after the end of the file comment
I'm not sure whether this error needs to be fatal; that will have to be checked with Tim Starling.
Thanks for the patch, Derk-Jan. You can use Developer access https://www.mediawiki.org/wiki/Developer_access to submit this as a Git branch directly into Gerrit: https://www.mediawiki.org/wiki/Git/Tutorial Putting your branch in Git makes it easier for us to review it quickly. Thanks again for your contribution!
*** Bug 34797 has been marked as a duplicate of this bug. ***
Has this patch been submitted in gerrit yet? It looks about right, so I don't think there should be a problem getting it merged. Feel free to add me as a reviewer.
Also http://beta.domd.info/pronom/fmt/40
I've got a small patch that you can look at: gerrit change If30b53dd. This WFM, but needs work.
Created attachment 12192: An example java applet that would get through with this patch
These tests are important to prevent uploading of Java applets. It's fairly easy to make a Java applet with the msword header; I've attached a "hello world" example. If you have a jar file handy, just prepend "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" to the beginning and run zip -A foo.jar (using the standard zip utility that is usually on Linux computers). If you really wanted, you could probably even make a Java applet that is a valid MS Word doc.
-----
To fix this we would probably need to validate both parts of the file independently (?). This is very similar to the issue with mixed pdf and odf files.
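The concern can be sketched in Python: a check that trusts only the leading bytes accepts a jar-like archive with the msword header prepended, while inspecting the ZIP part separately still finds the class file. This is an illustration under assumptions, not MediaWiki's ZipDirectoryReader: the helper names are hypothetical, the "jar" is a placeholder archive built in memory, and the Java check is assumed to look for .class entries.

```python
import io
import zipfile

CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # msword header bytes from the comment


def make_fake_jar():
    """Build a tiny in-memory archive with one placeholder .class entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("Hello.class", b"\xca\xfe\xba\xbe placeholder")
    return buffer.getvalue()


def header_only_check(data):
    """Naive check: classify the file by its first bytes alone."""
    return "msword" if data.startswith(CFB_MAGIC) else "unknown"


def embedded_class_files(data):
    """Independently inspect any ZIP part and list its .class entries."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [name for name in archive.namelist() if name.endswith(".class")]
    except zipfile.BadZipFile:
        return []  # no readable ZIP directory in this data


polyglot = CFB_MAGIC + make_fake_jar()
print(header_only_check(polyglot), embedded_class_files(polyglot))
```

Python's zipfile tolerates the prepended bytes here without an offset fix-up step, so both checks can run on the same data. A real fix would also need to validate the Word part, which this sketch does not attempt.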
https://gerrit.wikimedia.org/r/44379 (Gerrit Change If30b53dd9c05d92e64b893471b881ee34590ee5d) | change ABANDONED [by TheDJ]
Patch abandoned over a year ago. fwiw, bug 31930 and bug 54105 suggest there are other false positives in the 'Prevent Java' detection algorithm.