Last modified: 2014-05-30 11:10:31 UTC
Currently, requests for thumbnails are attempted, then possibly fail (e.g. due to memory limits being reached, or imagemagick crashing), then are retried until the end of time, for as long as someone keeps requesting such a URL. There are some known offenders, like a 3MB animated DNA helix image I've been seeing on the imagescalers for months. There are also incidents such as this:

-rw-r--r-- 1 apache apache 19K May 25 11:46 localcopy_0ef74f0b53b6-1.svg
-rw-r--r-- 1 apache apache 19K May 25 21:30 localcopy_ca923e0f99a6-1.svg
-rw-r--r-- 1 apache apache 19K May 26 10:48 localcopy_572f861613b9-1.svg
-rw-r--r-- 1 apache apache 19K May 26 11:18 localcopy_def9292ef643-1.svg
-rw-r--r-- 1 apache apache 19K May 26 11:19 localcopy_0572c50518be-1.svg
-rw-r--r-- 1 apache apache 19K May 26 12:01 localcopy_304cb07a6b45-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:14 localcopy_fb22d6ef358b-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:19 localcopy_16568ad8072b-1.svg
-rw-r--r-- 1 apache apache 19K May 26 13:44 localcopy_0a9a32f4877c-1.svg
-rw-r--r-- 1 apache apache 19K May 26 14:23 localcopy_813c0c9416bc-1.svg
-rw-r--r-- 1 apache apache 19K May 26 15:58 localcopy_969e57b3aec0-1.svg
-rw-r--r-- 1 apache apache 19K May 26 16:34 localcopy_de13c2c069d6-1.svg
-rw-r--r-- 1 apache apache 19K May 26 17:54 localcopy_6e9b4a4aafbc-1.svg
-rw-r--r-- 1 apache apache 19K May 27 09:09 localcopy_15e927467d8f-1.svg
-rw-r--r-- 1 apache apache 19K May 27 10:09 localcopy_8e31a32bf69a-1.svg
-rw-r--r-- 1 apache apache 19K May 27 12:37 localcopy_9fb1ac64e9d2-1.svg
-rw-r--r-- 1 apache apache 19K May 27 12:38 localcopy_7f2b88f782a3-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:46 localcopy_72abccb50bdf-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:46 localcopy_aa5f4680fab2-1.svg
-rw-r--r-- 1 apache apache 19K May 28 02:50 localcopy_8023d355393a-1.svg
-rw-r--r-- 1 apache apache 19K May 28 11:19 localcopy_f2dab828e134-1.svg
-rw-r--r-- 1 apache apache 19K May 29 16:05 localcopy_0ebbf8feb6a7-1.svg
-rw-r--r-- 1 apache apache 19K May 29 18:45 localcopy_c5f2dbd0d36d-1.svg
-rw-r--r-- 1 apache apache 19K May 29 18:58 localcopy_4a87fe4aa75f-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:40 localcopy_b9df425c39f7-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:42 localcopy_68fafb94b201-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:43 localcopy_28e26296e19c-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:51 localcopy_83c7795c9538-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:54 localcopy_923d3ff45ad6-1.svg
-rw-r--r-- 1 apache apache 19K May 29 20:55 localcopy_ff8795390f2e-1.svg
-rw-r--r-- 1 apache apache 19K May 30 05:35 localcopy_728ca2ce9d32-1.svg
-rw-r--r-- 1 apache apache 19K May 30 10:14 localcopy_3a483ff65908-1.svg
-rw-r--r-- 1 apache apache 19K May 30 16:58 localcopy_bde12d83680d-1.svg
-rw-r--r-- 1 apache apache 19K May 30 16:59 localcopy_12aa3e76ef45-1.svg
-rw-r--r-- 1 apache apache 19K May 30 17:00 localcopy_078234140633-1.svg
-rw-r--r-- 1 apache apache 19K May 31 16:44 localcopy_1b44bb384189-1.svg
-rw-r--r-- 1 apache apache 19K Jun 1 15:14 localcopy_d71bc6e4f94d-1.svg
-rw-r--r-- 1 apache apache 19K Jun 2 17:53 localcopy_c3d7a0a674a5-1.svg
-rw-r--r-- 1 apache apache 19K Jun 3 15:16 localcopy_f17379c9ae4a-1.svg

Please implement a limit on the number of rendering attempts (3-5 should be enough) in the backend. Other options, such as Swift/Varnish caching the 500 response after two or three failures, were considered and discussed, but they're uglier and harder to implement.
Having MediaWiki do it when a scaling attempt fails instead looks like the more reasonable way to go. We should also have a way of resetting such a counter, either per image or en masse, as the underlying causes of the failure could be fixed in the meantime (e.g. a newer imagemagick version, an increase of our memory limits, a MediaWiki bug getting fixed, etc.).
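As a rough illustration of what is being asked for, here is a minimal sketch assuming a memcached-style counter; the key scheme, the four-attempt limit, and all helper names are invented for this example, not taken from any actual patch:

<?php
// Minimal sketch of a per-file render-attempt limit, assuming a
// memcached-style store. All names and limits here are illustrative.

const MAX_RENDER_ATTEMPTS = 4;
const FAILURE_TTL = 3600; // how long a failure streak is remembered

function canAttemptRender( Memcached $cache, string $fileSha1 ): bool {
    $failures = $cache->get( "thumb-failures:$fileSha1" );
    return $failures === false || (int)$failures < MAX_RENDER_ATTEMPTS;
}

function recordRenderFailure( Memcached $cache, string $fileSha1 ): void {
    $key = "thumb-failures:$fileSha1";
    // Read-modify-write for clarity; a real implementation would use an
    // atomic increment to avoid losing updates under concurrency.
    $failures = (int)$cache->get( $key );
    $cache->set( $key, $failures + 1, FAILURE_TTL );
}

// Per-image reset, e.g. after a newer imagemagick fixes the cause:
function resetRenderFailures( Memcached $cache, string $fileSha1 ): void {
    $cache->delete( "thumb-failures:$fileSha1" );
}

A per-image reset is just a key deletion; an en-masse reset would need something like a version number ("epoch") folded into the key prefix that can be bumped, since memcached cannot delete keys by prefix.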
Change 107419 had a related patch set uploaded by Aaron Schulz: Limit attempts to render the same thumbnail after failures https://gerrit.wikimedia.org/r/107419
Also improved in https://gerrit.wikimedia.org/r/#/c/127642/. This can be closed.
The patchset is a good first step, but I don't think it's enough.

First of all, if I'm reading the code right, the attempt key is keyed by the original _and_ the thumb size; my impression is that scaling errors are almost always caused by properties of the original file, not by the thumb size specifically (very large, like the 100MB TIFFs, or very complex, like the animated DNA helix). I think it would be better to count hits against just the original, as otherwise we are getting DoSed by people hitting the same image in different pages, e.g. Special:NewFiles' 120px vs. Special:ListFiles' 180px.

More importantly, the code seems to add a key in memcache with a TTL of 3600. This won't actually fix the effect shown by the /tmp listing in the original description, since those images were hours, or even days, apart from each other. It will also not protect us from the general issue of "unscalable images": images that we can never render within our current thresholds but will forever try to (hundreds-of-megabytes old multipage TIFFs, for instance).

I think a short memcached-based approach is fine for short temporary errors, like protecting us from bursts, but my original thought (and one that would probably save us from today's recurring rendering outages) would be to actually store this in a flag in the image table in the database and *never* try again, unless the flag is reset by an admin or the epoch is bumped. The flag could even be set to the current date, so as to be able to quickly reset it with a query if our infrastructure goes nuts for a day and we get flooded with false positives.
(In reply to Faidon Liambotis from comment #3)
> The patchset is a good first step, but I don't think it's enough.
>
> First of all, if I'm reading the code right, the attempt key is keyed by
> the original _and_ the thumb size; my impression is that scaling errors
> are almost always caused by properties of the original file, not by the
> thumb size specifically (very large, like the 100MB TIFFs, or very
> complex, like the animated DNA helix). I think it would be better to
> count hits against just the original, as otherwise we are getting DoSed
> by people hitting the same image in different pages, e.g.
> Special:NewFiles' 120px vs. Special:ListFiles' 180px.

Sometimes thumb.php calls fail due to the requested size being too big, with smaller sizes working fine. A per-file limit would punish all renderings. We already have per-user rate limiting on thumbnails/min and "non-standard" thumbnails/min; I'd hope that helps with the DoS aspect. For the non-DoS aspect, the failure limiting should mostly be enough.

> More importantly, the code seems to add a key in memcache with a TTL of
> 3600. This won't actually fix the effect shown by the /tmp listing in
> the original description, since those images were hours, or even days,
> apart from each other. It will also not protect us from the general
> issue of "unscalable images": images that we can never render within our
> current thresholds but will forever try to (hundreds-of-megabytes old
> multipage TIFFs, for instance).
>
> I think a short memcached-based approach is fine for short temporary
> errors, like protecting us from bursts, but my original thought (and one
> that would probably save us from today's recurring rendering outages)
> would be to actually store this in a flag in the image table in the
> database and *never* try again, unless the flag is reset by an admin or
> the epoch is bumped. The flag could even be set to the current date, so
> as to be able to quickly reset it with a query if our infrastructure
> goes nuts for a day and we get flooded with false positives.

If the media handler class bound to the file knows that it cannot be rendered (based on the format, size, configured resource limits, etc.), it could store that in img_metadata and bail out quickly. There isn't really a way for thumb.php itself to make that determination, AFAIK. Maybe the Multimedia team can look into media handler tweaks.
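As a rough sketch of the kind of cheap pre-flight check a media handler could do, loosely modelled on MediaWiki's existing $wgMaxImageArea limit; the function, the metadata keys, and the pixel-area threshold are all invented for illustration and are not MediaWiki's actual media handler API:

<?php
// The handler already knows the file's dimensions (and frame count)
// from stored metadata, so it can refuse obviously unscalable files
// before ever shelling out to imagemagick.

function canBeRendered( array $metadata, int $maxArea = 25000000 ): bool {
    $width  = isset( $metadata['width'] ) ? (int)$metadata['width'] : 0;
    $height = isset( $metadata['height'] ) ? (int)$metadata['height'] : 0;
    $frames = isset( $metadata['frameCount'] ) ? (int)$metadata['frameCount'] : 1;

    if ( $width <= 0 || $height <= 0 ) {
        return false; // unparseable metadata: treat as unrenderable
    }
    // Decoding cost scales with total pixels across all frames, which is
    // what blows past memory limits on animated or multipage files.
    return $width * $height * $frames <= $maxArea;
}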