Last modified: 2009-12-06 18:48:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T18583, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 16583 - MIME type detection of "application/x-php" gives false positives on any file with "<?" in it
MIME type detection of "application/x-php" gives false positives on any file ...
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
File management (Other open bugs)
unspecified
All All
: Normal normal (vote)
: ---
Assigned To: Nobody - You can work on this!
http://commons.wikimedia.org/wiki/Ima...
: patch
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-12-07 19:17 UTC by Ilmari Karonen
Modified: 2009-12-06 18:48 UTC (History)
8 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
patch for MimeMagic.php, r55559: check whether file is binary (2.28 KB, patch)
2009-08-24 19:27 UTC, nephele
Details

Description Ilmari Karonen 2008-12-07 19:17:31 UTC
Apparently, any file uploaded to Commons that has the character pair "<?" anywhere within the first 1024 bytes gets detected as having the MIME type "application/x-php".  This seems a bit excessive.  Some examples include:

http://commons.wikimedia.org/wiki/Image:Bundesarchiv_Bild_137-002552,_Türkei,_Anatolien,_Taurus,_Feldbahn.jpg
http://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2008Nov#Uploading-Error

Note that I don't believe there should be any security issues with serving such files to users: I'm not aware of any user agent that would execute downloaded PHP code, and certainly not one that would use such a hair-trigger check for detecting it.

(Ps. This might be a Wikimedia configuration issue: I haven't yet looked at the MIME type detection code closely enough to tell.  Filing this for now as a MediaWiki bug, feel free to reclassify.)
Comment 1 Ilmari Karonen 2008-12-07 19:21:58 UTC
On second thought, there might be an issue for wikis running on poorly configured webservers that might try to execute PHP scripts from within the "images" directory.  I wouldn't expect that to happen unless the file had the extension ".php" (or possibly ".phpN", where N is a number), though.
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-12-07 19:25:54 UTC
Or .phtml or a couple of others.  In that case, though, you'd have to check the whole file for "<?", not just the first 1024 bytes.
Comment 3 Ilmari Karonen 2008-12-13 03:56:05 UTC
The overzealous check appears to be in MimeMagic::doGuessMimeType().  It appears that the check, as currently coded, will have a false positive rate of slightly over 1 in 4096 files, assuming a random distribution of octets (which should be reasonable for compressed file formats like JPEG, PNG or Ogg).
Comment 4 Brion Vibber 2008-12-15 20:11:36 UTC
Probably needs some adjustment, yes.
Comment 5 Ilmari Karonen 2009-04-15 23:44:38 UTC
What I'm wondering is whether the check really serves any purpose at all.  Certainly it can't provide more than "security by obscurity", since any attacker who knows about it can circumvent it trivially.

If we simply want to filter out "normal" PHP files, I suppose we could do an octet-frequency check to filter out files that appear to be binary data.  Even just requiring that the "<?" be followed by a couple of bytes of printable ASCII ought to reduce the false positive rate noticeably.

Anyway, anybody want to grep the logs for "doGuessMimeType: recognized .* as application/x-php" and see how often it's actually triggered (legitimately or by accident)?  If my math is right, about one in every four thousand uploads we get should be failing with a mysterious error saying "Files of the MIME type "application/x-php" are not allowed to be uploaded."
Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-04-17 11:25:08 UTC
I vote to just drop it.  It seems to serve no useful purpose.
Comment 7 Firefishy 2009-07-07 16:08:11 UTC
(In reply to comment #6)
> I vote to just drop it.  It seems to serve no useful purpose.
> 

It is a real bug, triggered in real use cases. It should be fixed.
Comment 8 Firefishy 2009-07-07 16:17:40 UTC
(In reply to comment #7)
> It is a real bug, triggered in real use cases. It should be fixed.
> 
I am an idiot.

+1 vote to fix MimeMagic::doGuessMimeType().
Comment 9 Gurch 2009-07-07 17:09:25 UTC
(In reply to comment #1)
> On second thought, there might be an issue for wikis running on poorly
> configured webservers that might try to execute PHP scripts from within the
> "images" directory.  I wouldn't expect that to happen unless the file had the
> extension ".php" (or possibly ".phpN", where N is a number), though.

In the default configuration, and on all Wikimedia configurations as far as I am aware, .php and the others cannot be uploaded anyway. How about removing this check and adding a note in the docs to the effect that anyone who enables upload of such files should be careful to configure their webserver not to execute them?

Comment 10 nephele 2009-08-24 19:27:16 UTC
Created attachment 6491 [details]
patch for MimeMagic.php, r55559: check whether file is binary

The attached patch fixes this issue for several known image files that were falsely identified, but still successfully detects typical php files being uploaded with an (incorrect) image extension.

The patch adds a check to see whether the file header contains three null characters in a row.  It's a string that should be present in nearly all binary files, but shouldn't normally be found in text files.  It's imperfect and kludge-like -- but so is checking for php files based on the presence of '<?'.  And there's no real difference security-wise -- if someone wants to intentionally create a php file that is not recognized by doGuessMimeType, that's already easily possible.

This at least resolves the bug until someone wants to do a more thorough re-write of the code.
Comment 11 azliq7 2009-11-06 16:30:39 UTC
(In reply to comment #3)
> The overzealous check appears to be in MimeMagic::doGuessMimeType().  It
> appears that the check, as currently coded, will have a false positive rate of
> slightly over 1 in 4096 files, assuming a random distribution of octets (which
> should be reasonable for compressed file formats like JPEG, PNG or Ogg).
> 

I encountered this bug twice when I was uploading about 3,145 PNG files. 
I think this is a quite a serious bug that needs to be looked into.
Comment 12 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-11-06 21:08:55 UTC
Problem mitigated in r58682: I just removed the check for three-byte strings ('<? ', "<?\n", "<?\t", '<?=').  The shortest string it's checking for is now five bytes, so random false positives should be significantly less common -- I estimate on the order of 1/100,000,000.  Is there any reason to keep the check at all, though?
Comment 13 Brion Vibber 2009-11-07 15:27:03 UTC
Probably not a huge benefit from the check since it's only a portion of the file anyway. :)
Comment 14 Wolfram Schmied 2009-11-27 00:39:22 UTC
(In reply to comment #3)
> It appears that the check, as currently coded, will have 
> a false positive rate of slightly over 1 in 4096 files, 
> assuming a random distribution of octets

16 bits, not 12, and you have to multiply by 1024, which gives us a false positive rate for random files on the order of 2^-6 ~= 1.7 %. That was corroborated by a little experiment I did. I measured 1.8 % for a set of metadata-free PNGs created by resizing 1498 JPEGs from my photo collection to 160x120 and converting to PNG.

The very low rate seen by azliq7@yahoo.com (comment #11) was apparently the result of large metadata headers. The original JPEG photos I used in my sample all have decidely non-random headers of ~8k.

Comment 15 Ilmari Karonen 2009-11-27 13:19:18 UTC
(In reply to comment #14)
> (In reply to comment #3)
> > It appears that the check, as currently coded, will have 
> > a false positive rate of slightly over 1 in 4096 files, 
> > assuming a random distribution of octets
> 
> 16 bits, not 12, and you have to multiply by 1024, which gives us a false
> positive rate for random files on the order of 2^-6 ~= 1.7 %.

The check which Simetrical removed in r58682 matched if the first 1024 bytes of the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t' or '=').  Thus, the probability of three random bytes matching this check is 4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is approximately 1024/2^22 = 1/2^12 = 1/4096.

(Taking into account the possibility of multiple matches and the fact that the last 2 out of 1024 positions can't match makes the probability about 1/4104.5.  Most of the difference is due to the latter, since multiple matches are very unlikely events, occurring only for about one in every 2^24 files.)

Anyway, marking the bug as fixed: r58682 should reduce the false positive rate enough that what's left (like removing the check entirely?) is mainly just code cleanup.
Comment 16 Wolfram Schmied 2009-11-27 13:57:55 UTC
(In reply to comment #15)
> The check which Simetrical removed in r58682 matched if the first 1024 bytes of
> the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t'
> or '=').

Thanks for the clarification, that hadn't been clear from the discussion so far. Considering that I managed to catch that one with just a few dozen uploads, I better stay clear from games of chance. ;)
Comment 17 Wolfram Schmied 2009-11-30 02:05:59 UTC
It seems that [http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug ZIP detection has a similar problem]. Would it be possible to amend error messages relating to heuristic file detection with a pointer to a manual page with information about the possibility of false positives, and how to work around them? 
Comment 18 Ilmari Karonen 2009-11-30 08:24:58 UTC
(In reply to comment #17)
> It seems that
> [http://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=32560832#Odd_bug
> ZIP detection has a similar problem].

Yes, it seems that will happen if the last 65558 bytes of the file contain the 4-byte string "PK\x05\x06", which should happen with probability ~1/65514 assuming random data.  Unfortunately, making this check significantly more specific (or moving it until later in the identification process) risks allowing malicious hybrid files, such as the well known "GIFAR" exploit, to pass it.  It _might_ be possible to tighten it a bit somehow, but doing so safely would require knowledge of not only the ZIP file format but also of the ways in which various common ZIP implementations parse it.  (In other words, we want our check to be broad enough to catch anything that some other program might mistake for a ZIP file, even if it doesn't exactly conform to the ZIP spec.)

> Would it be possible to amend error
> messages relating to heuristic file detection with a pointer to a manual page
> with information about the possibility of false positives, and how to work
> around them? 

That would certainly be possible, either globally through translatewiki or locally by wiki sysops.  The relevant system message seems to be [[MediaWiki:filetype-badmime]].


Comment 19 azliq7 2009-12-04 07:15:25 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #3)
> > > It appears that the check, as currently coded, will have 
> > > a false positive rate of slightly over 1 in 4096 files, 
> > > assuming a random distribution of octets
> > 
> > 16 bits, not 12, and you have to multiply by 1024, which gives us a false
> > positive rate for random files on the order of 2^-6 ~= 1.7 %.
> 
> The check which Simetrical removed in r58682 matched if the first 1024 bytes of
> the file contained "<?" followed by one of four possible bytes (' ', '\n', '\t'
> or '=').  Thus, the probability of three random bytes matching this check is
> 4/(2^8)^3 = 1/2^22, and the probability of 1024 random bytes matching it is
> approximately 1024/2^22 = 1/2^12 = 1/4096.
> 
> (Taking into account the possibility of multiple matches and the fact that the
> last 2 out of 1024 positions can't match makes the probability about 1/4104.5. 
> Most of the difference is due to the latter, since multiple matches are very
> unlikely events, occurring only for about one in every 2^24 files.)
> 
> Anyway, marking the bug as fixed: r58682 should reduce the false positive rate
> enough that what's left (like removing the check entirely?) is mainly just code
> cleanup.
> 

I've reopened the bug since the previous bug fix did not solve the problem completely.
I encountered this bug again when I was uploading 1,000 PNG files.

Perhaps we should consider removing the check? 
Comment 20 Roan Kattouw 2009-12-04 10:14:18 UTC
(In reply to comment #19)
> I've reopened the bug since the previous bug fix did not solve the problem
> completely.
> I encountered this bug again when I was uploading 1,000 PNG files.
> 
That's probably because the bugfix isn't live on Wikipedia yet. Re-closing.
Comment 21 azliq7 2009-12-04 11:35:16 UTC
Well, obviously. I used my own wiki to test the patch.
Comment 22 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-12-04 14:43:04 UTC
What string did the file have in it that triggered the false positive?
Comment 23 Ilmari Karonen 2009-12-06 18:48:37 UTC
Ps. Discussion about ZIP file detection should properly go under bug 20924.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links