Last modified: 2014-05-30 11:10:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T51118, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 49118 - Limit number of image scaling attempts


Summary:	Limit number of image scaling attempts

Status:	REOPENED

Product:	MediaWiki
Classification:	Unclassified
Component:	File management (Other open bugs)
Version:	1.22.0
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:	perfsprint-13
Keywords:	performance, platformeng

Depends on:
Blocks:	41371
	Show dependency tree / graph

Reported:	2013-06-04 13:31 UTC by Faidon Liambotis
Modified:	2014-05-30 11:10 UTC (History)
CC List:	11 users (show)

See Also:	52045 65217
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Faidon Liambotis 2013-06-04 13:31:24 UTC

Currently, requests for thumbnails are attempted, then possibly failed (e.g. due to memory limits reached, or imagemagick crashing), then retried until the end of time, as long as someone requests such a URL. There are some known offenders, like a 3MB animated DNA Helix image I've been seeing on imagescalers for months. There are also incidents such as this:

-rw-r--r-- 1 apache apache  19K May 25 11:46 localcopy_0ef74f0b53b6-1.svg
-rw-r--r-- 1 apache apache  19K May 25 21:30 localcopy_ca923e0f99a6-1.svg
-rw-r--r-- 1 apache apache  19K May 26 10:48 localcopy_572f861613b9-1.svg
-rw-r--r-- 1 apache apache  19K May 26 11:18 localcopy_def9292ef643-1.svg
-rw-r--r-- 1 apache apache  19K May 26 11:19 localcopy_0572c50518be-1.svg
-rw-r--r-- 1 apache apache  19K May 26 12:01 localcopy_304cb07a6b45-1.svg
-rw-r--r-- 1 apache apache  19K May 26 13:14 localcopy_fb22d6ef358b-1.svg
-rw-r--r-- 1 apache apache  19K May 26 13:19 localcopy_16568ad8072b-1.svg
-rw-r--r-- 1 apache apache  19K May 26 13:44 localcopy_0a9a32f4877c-1.svg
-rw-r--r-- 1 apache apache  19K May 26 14:23 localcopy_813c0c9416bc-1.svg
-rw-r--r-- 1 apache apache  19K May 26 15:58 localcopy_969e57b3aec0-1.svg
-rw-r--r-- 1 apache apache  19K May 26 16:34 localcopy_de13c2c069d6-1.svg
-rw-r--r-- 1 apache apache  19K May 26 17:54 localcopy_6e9b4a4aafbc-1.svg
-rw-r--r-- 1 apache apache  19K May 27 09:09 localcopy_15e927467d8f-1.svg
-rw-r--r-- 1 apache apache  19K May 27 10:09 localcopy_8e31a32bf69a-1.svg
-rw-r--r-- 1 apache apache  19K May 27 12:37 localcopy_9fb1ac64e9d2-1.svg
-rw-r--r-- 1 apache apache  19K May 27 12:38 localcopy_7f2b88f782a3-1.svg
-rw-r--r-- 1 apache apache  19K May 28 02:46 localcopy_72abccb50bdf-1.svg
-rw-r--r-- 1 apache apache  19K May 28 02:46 localcopy_aa5f4680fab2-1.svg
-rw-r--r-- 1 apache apache  19K May 28 02:50 localcopy_8023d355393a-1.svg
-rw-r--r-- 1 apache apache  19K May 28 11:19 localcopy_f2dab828e134-1.svg
-rw-r--r-- 1 apache apache  19K May 29 16:05 localcopy_0ebbf8feb6a7-1.svg
-rw-r--r-- 1 apache apache  19K May 29 18:45 localcopy_c5f2dbd0d36d-1.svg
-rw-r--r-- 1 apache apache  19K May 29 18:58 localcopy_4a87fe4aa75f-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:40 localcopy_b9df425c39f7-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:42 localcopy_68fafb94b201-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:43 localcopy_28e26296e19c-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:51 localcopy_83c7795c9538-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:54 localcopy_923d3ff45ad6-1.svg
-rw-r--r-- 1 apache apache  19K May 29 20:55 localcopy_ff8795390f2e-1.svg
-rw-r--r-- 1 apache apache  19K May 30 05:35 localcopy_728ca2ce9d32-1.svg
-rw-r--r-- 1 apache apache  19K May 30 10:14 localcopy_3a483ff65908-1.svg
-rw-r--r-- 1 apache apache  19K May 30 16:58 localcopy_bde12d83680d-1.svg
-rw-r--r-- 1 apache apache  19K May 30 16:59 localcopy_12aa3e76ef45-1.svg
-rw-r--r-- 1 apache apache  19K May 30 17:00 localcopy_078234140633-1.svg
-rw-r--r-- 1 apache apache  19K May 31 16:44 localcopy_1b44bb384189-1.svg
-rw-r--r-- 1 apache apache  19K Jun  1 15:14 localcopy_d71bc6e4f94d-1.svg
-rw-r--r-- 1 apache apache  19K Jun  2 17:53 localcopy_c3d7a0a674a5-1.svg
-rw-r--r-- 1 apache apache  19K Jun  3 15:16 localcopy_f17379c9ae4a-1.svg

Please implement a way of a limit on the number of attempts (3-5 should be enough) in the backend. Other options, such as Swift/Varnish caching the 500 after two or three restarts were considered and discussed but they're uglier and harder to implement. Having MediaWiki do it when a scaling attempt fails instead looks like the more reasonable way to go.

We should also have a way of resetting such a counter either per image or en masse, as the underlying causes of the failure could be fixed in the meantime (e.g. a newer imagemagick version, an increase of our memory limits, a MediaWiki bug getting fixed etc.)

Comment 1 Gerrit Notification Bot 2014-01-14 19:37:54 UTC

Change 107419 had a related patch set uploaded by Aaron Schulz:
Limit attempts to render the same thumbnail after failures

https://gerrit.wikimedia.org/r/107419

Comment 2 Aaron Schulz 2014-04-21 16:46:31 UTC

Also improved in https://gerrit.wikimedia.org/r/#/c/127642/. This can be closed.

Comment 3 Faidon Liambotis 2014-04-21 17:51:18 UTC

The patchset is a good first step, but I don't think it's enough.

First of all, if I'm reading the code right, the attempt key is keyed by original _and_ thumb size; my impression is that almost always, scaling errors are because of properties of the original file, not the thumb size specifically (very large, like the 100MB TIFFs, or very complex, like the animated DNA helix). I think that it'd be better to just count hits with just the original, as otherwise we are getting DoSed by people hitting the same image in different pages, e.g. Special:NewFiles' 120px vs. Special:ListFiles' 180px.

More importantly, the code seems to add a key in memcache with TTL 3600. This won't actually fix the effect that the /tmp listing presented in the original description, since those images were apart hours, or even days from each other. It will also not protect us from the general issue of "unscalable images", images that we can never render with our current thresholds but we will forever try to (hundreds of megabytes old multipage TIFFs, for instance).

I think a short memcached-based approach is fine for short temporary errors, like protecting us from bursts, but my original thought -and one that would probably save us from today's recurring rendering outages- would be to actually store this in a flag the image table in the database and *never* try again, unless the flag is reset by an admin or the epoch is bumped. The flag could even be set to the current date, as to be able to quickly reset it with a query, if our infrastructure goes nuts for a day and we get flooded with false positives.

Comment 4 Aaron Schulz 2014-04-21 18:23:40 UTC

(In reply to Faidon Liambotis from comment #3)
> The patchset is a good first step, but I don't think it's enough.
> 
> First of all, if I'm reading the code right, the attempt key is keyed by
> original _and_ thumb size; my impression is that almost always, scaling
> errors are because of properties of the original file, not the thumb size
> specifically (very large, like the 100MB TIFFs, or very complex, like the
> animated DNA helix). I think that it'd be better to just count hits with
> just the original, as otherwise we are getting DoSed by people hitting the
> same image in different pages, e.g. Special:NewFiles' 120px vs.
> Special:ListFiles' 180px.
> 

Sometiles thumb.php calls fail due to the requeted size being to big, with smaller sizes working. A per-file limit would punish all renderings. We already have per-user rate limiting on thumbnails/min and "non-standard" thumbnails/min. I'd hope that would help with the DOS aspect. For the non DOS aspect, the failure limiting should mostly be enough.

> More importantly, the code seems to add a key in memcache with TTL 3600.
> This won't actually fix the effect that the /tmp listing presented in the
> original description, since those images were apart hours, or even days from
> each other. It will also not protect us from the general issue of
> "unscalable images", images that we can never render with our current
> thresholds but we will forever try to (hundreds of megabytes old multipage
> TIFFs, for instance).
> 
> I think a short memcached-based approach is fine for short temporary errors,
> like protecting us from bursts, but my original thought -and one that would
> probably save us from today's recurring rendering outages- would be to
> actually store this in a flag the image table in the database and *never*
> try again, unless the flag is reset by an admin or the epoch is bumped. The
> flag could even be set to the current date, as to be able to quickly reset
> it with a query, if our infrastructure goes nuts for a day and we get
> flooded with false positives.

If the media handler class bound the the file knows that it cannot be rendered (based on the format, size, and configured resource limits, ect...) it could store it as img_metadata and bail out quickly. There isn't really a way for thumb.php itself to make that determination afaik. Maybe the Multimedia team can look into media handler tweaks.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links