Last modified: 2009-12-06 18:48:37 UTC
Apparently, any file uploaded to Commons that has the character pair "<?" anywhere within the first 1024 bytes gets detected as having the MIME type "application/x-php". This seems a bit excessive. Some examples include: http://commons.wikimedia.org/wiki/Image:Bundesarchiv_Bild_137-002552,_Türkei,_Anatolien,_Taurus,_Feldbahn.jpg http://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2008Nov#Uploading-Error Note that I don't believe there should be any security issues with serving such files to users: I'm not aware of any user agent that would execute downloaded PHP code, and certainly not one that would use such a hair-trigger check for detecting it. (Ps. This might be a Wikimedia configuration issue: I haven't yet looked at the MIME type detection code closely enough to tell. Filing this for now as a MediaWiki bug, feel free to reclassify.)
On second thought, there might be an issue for wikis running on poorly configured webservers that might try to execute PHP scripts from within the "images" directory. I wouldn't expect that to happen unless the file had the extension ".php" (or possibly ".phpN", where N is a number), though.
Or .phtml or a couple of others. In that case, though, you'd have to check the whole file for "<?", not just the first 1024 bytes.
The overzealous check appears to be in MimeMagic::doGuessMimeType(). It appears that the check, as currently coded, will have a false positive rate of slightly over 1 in 4096 files, assuming a random distribution of octets (which should be reasonable for compressed file formats like JPEG, PNG or Ogg).
Probably needs some adjustment, yes.
What I'm wondering is whether the check really serves any purpose at all. Certainly it can't provide more than "security by obscurity", since any attacker who knows about it can circumvent it trivially. If we simply want to filter out "normal" PHP files, I suppose we could do an octet-frequency check to filter out files that appear to be binary data. Even just requiring that the "<?" be followed by a couple of bytes of printable ASCII ought to reduce the false positive rate noticeably. Anyway, anybody want to grep the logs for "doGuessMimeType: recognized .* as application/x-php" and see how often it's actually triggered (legitimately or by accident)? If my math is right, about one in every four thousand uploads we get should be failing with a mysterious error saying "Files of the MIME type "application/x-php" are not allowed to be uploaded."
I vote to just drop it. It seems to serve no useful purpose.
(In reply to comment #6) > I vote to just drop it. It seems to serve no useful purpose. > It is a real bug, triggered in real use cases. It should be fixed.
(In reply to comment #7) > It is a real bug, triggered in real use cases. It should be fixed. > I am an idiot. +1 vote to fix MimeMagic::doGuessMimeType().
(In reply to comment #1) > On second thought, there might be an issue for wikis running on poorly > configured webservers that might try to execute PHP scripts from within the > "images" directory. I wouldn't expect that to happen unless the file had the > extension ".php" (or possibly ".phpN", where N is a number), though. In the default configuration, and on all Wikimedia configurations as far as I am aware, .php and the others cannot be uploaded anyway. How about removing this check and adding a note in the docs to the effect that anyone who enables upload of such files should be careful to configure their webserver not to execute them?
Created attachment 6491 [details] patch for MimeMagic.php, r55559: check whether file is binary The attached patch fixes this issue for several known image files that were falsely identified, but still successfully detects typical php files being uploaded with an (incorrect) image extension. The patch adds a check to see whether the file header contains three null characters in a row. It's a string that should be present in nearly all binary files, but shouldn't normally be found in text files. It's imperfect and kludge-like -- but so is checking for php files based on the presence of '<?'. And there's no real difference security-wise -- if someone wants to intentionally create a php file that is not recognized by doGuessMimeType, that's already easily possible. This at least resolves the bug until someone wants to do a more thorough re-write of the code.
(In reply to comment #3) > The overzealous check appears to be in MimeMagic::doGuessMimeType(). It > appears that the check, as currently coded, will have a false positive rate of > slightly over 1 in 4096 files, assuming a random distribution of octets (which > should be reasonable for compressed file formats like JPEG, PNG or Ogg). > I encountered this bug twice when I was uploading about 3,145 PNG files. I think this is a quite a serious bug that needs to be looked into.
Problem mitigated in r58682: I just removed the check for three-byte strings ('<? ', "<?\n", "<?\t", '<?='). The shortest string it's checking for is now five bytes, so random false positives should be significantly less common -- I estimate on the order of 1/100,000,000. Is there any reason to keep the check at all, though?
Probably not a huge benefit from the check since it's only a portion of the file anyway. :)
(In reply to comment #3) > It appears that the check, as currently coded, will have > a false positive rate of slightly over 1 in 4096 files, > assuming a random distribution of octets 16 bits, not 12, and you have to multiply by 1024, which gives us a false positive rate for random files on the order of 2^-6 ~= 1.7 %. That was corroborated by a little experiment I did. I measured 1.8 % for a set of metadata-free PNGs created by resizing 1498 JPEGs from my photo collection to 160x120 and converting to PNG. The very low rate seen by azliq7@yahoo.com (comment #11) was apparently the result of large metadata headers. The original JPEG photos I used in my sample all have decidely non-random headers of ~8k.
(In reply to comment #14) > (In reply to comment #3) > > It appears that the check, as currently coded, will have > > a false positive rate of slightly over 1 in 4096 files, > > assuming a random distribution of octets > > 16 bits, not 12, and you have to multiply by 1024, which gives us a false > positive rate for random files on the order of 2^-6 ~= 1.7 %. The check which Simetrical removed in r58682 matched if the first 1024 bytes of the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t' or '='). Thus, the probability of three random bytes matching this check is 4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is approximately 1024/2^22 = 1/2^12 = 1/4096. (Taking into account the possibility of multiple matches and the fact that the last 2 out of 1024 positions can't match makes the probability about 1/4104.5. Most of the difference is due to the latter, since multiple matches are very unlikely events, occurring only for about one in every 2^24 files.) Anyway, marking the bug as fixed: r58682 should reduce the false positive rate enough that what's left (like removing the check entirely?) is mainly just code cleanup.
(In reply to comment #15) > The check which Simetrical removed in r58682 matched if the first 1024 bytes of > the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t' > or '='). Thanks for the clarification, that hadn't been clear from the discussion so far. Considering that I managed to catch that one with just a few dozen uploads, I better stay clear from games of chance. ;)
It seems that [http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug ZIP detection has a similar problem]. Would it be possible to amend error messages relating to heuristic file detection with a pointer to a manual page with information about the possibility of false positives, and how to work around them?
(In reply to comment #17) > It seems that > [http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug > ZIP detection has a similar problem]. Yes, it seems that will happen if the last 65558 bytes of the file contain the 4-byte string "PK\x05\x06", which should happen with probability ~1/65514 assuming random data. Unfortunately, making this check significantly more specific (or moving it until later in the identification process) risks allowing malicious hybrid files, such as the well known "GIFAR" exploit, to pass it. It _might_ be possible to tighten it a bit somehow, but doing so safely would require knowledge of not only the ZIP file format but also of the ways in which various common ZIP implementations parse it. (In other words, we want our check to be broad enough to catch anything that some other program might mistake for a ZIP file, even if it doesn't exactly conform to the ZIP spec.) > Would it be possible to amend error > messages relating to heuristic file detection with a pointer to a manual page > with information about the possibility of false positives, and how to work > around them? That would certainly be possible, either globally through translatewiki or locally by wiki sysops. The relevant system message seems to be [[MediaWiki:filetype-badmime]].
(In reply to comment #15) > (In reply to comment #14) > > (In reply to comment #3) > > > It appears that the check, as currently coded, will have > > > a false positive rate of slightly over 1 in 4096 files, > > > assuming a random distribution of octets > > > > 16 bits, not 12, and you have to multiply by 1024, which gives us a false > > positive rate for random files on the order of 2^-6 ~= 1.7 %. > > The check which Simetrical removed in r58682 matched if the first 1024 bytes of > the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t' > or '='). Thus, the probability of three random bytes matching this check is > 4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is > approximately 1024/2^22 = 1/2^12 = 1/4096. > > (Taking into account the possibility of multiple matches and the fact that the > last 2 out of 1024 positions can't match makes the probability about 1/4104.5. > Most of the difference is due to the latter, since multiple matches are very > unlikely events, occurring only for about one in every 2^24 files.) > > Anyway, marking the bug as fixed: r58682 should reduce the false positive rate > enough that what's left (like removing the check entirely?) is mainly just code > cleanup. > I've reopened the bug since the previous bug fix did not solve the problem completely. I encountered this bug again when I was uploading 1,000 PNG files. Perhaps we should consider removing the check?
(In reply to comment #19) > I've reopened the bug since the previous bug fix did not solve the problem > completely. > I encountered this bug again when I was uploading 1,000 PNG files. > That's probably because the bugfix isn't live on Wikipedia yet. Re-closing.
Well, obviously. I used my own wiki to test the patch.
What string did the file have in it that triggered the false positive?
Ps. Discussion about ZIP file detection should properly go under bug 20924.