Last modified: 2014-09-16 19:36:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T67217, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 65217 - Operational issues for very large TIFFs
Status: PATCH_TO_REVIEW
Product: MediaWiki extensions
Classification: Unclassified
Component: GWToolset
Version: unspecified
Hardware: All
OS: All
Importance: High critical
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 52045 67525 65691
Blocks: 41371
Reported: 2014-05-12 10:52 UTC by dan
Modified: 2014-09-16 19:36 UTC (History)
13 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description dan 2014-05-12 10:52:02 UTC
When a bot or user visits a wiki's Special:NewFiles page, and some other pages like it, missing thumbnails are created on the fly. This can flood the server(s) with thumbnail creation jobs, which slows down the wiki or can even take out its ability to serve web pages. GWToolset has the potential to create this situation when it uploads several large media files at once; see http://lists.wikimedia.org/pipermail/glamtools/2014-May/000135.html.

During the Zürich hackathon I spoke with Aaron Schulz, Faidon Liambotis, and Brion Vibber about approaches to dealing with this issue. In summary, the idea Aaron came up with is to create initial thumbnails when the original media file is downloaded to the wiki, and to block the appearance of the title on the new files page (and anywhere else) until the thumbnails and the title creation/edit have completed. Aaron thought, and Faidon and I agree, that further throttling of GWToolset will not help resolve the issue.

I am currently looking into implementing this approach and will use this bug to track activity on it.
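
A minimal sketch of that ordering, written as standalone Python with Pillow rather than GWToolset's actual PHP code; the publish_title step, the function names, and the thumbnail sizes are illustrative assumptions only:

from PIL import Image  # pip install Pillow

STANDARD_WIDTHS = [120, 320, 800, 1280]   # illustrative sizes, not the real bucket list

def publish_title(path):
    # Hypothetical stand-in for the title creation/edit step that makes the
    # file visible on Special:NewFiles and elsewhere.
    print("title published for", path)

def import_media_file(original_path):
    img = Image.open(original_path)
    for width in STANDARD_WIDTHS:
        thumb = img.copy().convert("RGB")
        thumb.thumbnail((width, width))                 # shrinks in place, keeps aspect ratio
        thumb.save("%s.%dpx.jpg" % (original_path, width))
    publish_title(original_path)                        # only runs once every thumbnail exists

if __name__ == "__main__":
    import_media_file("example.tiff")                   # assumes a local example.tiff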
Comment 1 Andre Klapper 2014-05-12 19:54:36 UTC
According to Gergo, a workaround (not a fix) is in
https://gerrit.wikimedia.org/r/#/c/132111/
https://gerrit.wikimedia.org/r/#/c/132112/

Related: bug 49118, triggered by https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/05#Images_so_big_they_break_Commons.3F (and to a very small extent also bug 52045).
Comment 2 dan 2014-05-15 02:31:09 UTC
My initial thought on how to approach this was to use methods within thumb.php, but those are not accessible to jobs run in the job queue.

Another approach, discussed with Gilles and Gergo on IRC, involves uploading the media file to an upload stash, creating thumbnails based on that stashed file, and then creating the title for the media file. This requires re-architecting the way the job queue jobs currently run, which I don't have time to work on at the moment; I will try to get to it when time permits.
Comment 3 Tisza Gergő 2014-05-23 18:49:44 UTC
The consensus on the ops list was that https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume uploads, and bug 52045 probably would not help much. The current plan is to

* extract a large thumbnail from the file, and use that thumbnail to create smaller thumbnails (possibly in a chain, i.e. use some of those smaller thumbnails to create even smaller thumbnails)
* make this thumbnail generation happen immediately after upload
* limit the number of expensive thumbnail generations that can happen in parallel
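
A rough illustration of the chained-scaling idea, as a standalone Python/Pillow sketch rather than the MediaWiki implementation; the sizes and file naming are assumptions:

from PIL import Image

def chained_thumbnails(original_path, widths=(1280, 800, 320, 120)):
    # Derive each smaller thumbnail from the previous one, so the expensive
    # full-resolution decode of the original happens exactly once.
    source = Image.open(original_path).convert("RGB")   # the only full-size decode
    for width in sorted(widths, reverse=True):
        thumb = source.copy()
        thumb.thumbnail((width, width))
        thumb.save("%s.%dpx.jpg" % (original_path, width))
        source = thumb            # the next, smaller size is scaled from this thumbnail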
Comment 4 Bawolff (Brian Wolff) 2014-05-23 21:09:22 UTC
I recently realized that we still download the source file, even if it's above $wgMaxImageArea (e.g. https://commons.wikimedia.org/wiki/File:Map_of_New-York_Bay_and_Harbor_and_the_environs_-_founded_upon_a_trigonometrical_survey_under_the_direction_of_F._R._Hassler,_superintendent_of_the_Survey_of_the_Coast_of_the_United_States;_NYPL1696369.tiff is a 540 MB file, which takes 37 seconds just to get to the error message that says we aren't even going to attempt to thumbnail the file). I've submitted https://gerrit.wikimedia.org/r/135101 to fix this.
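
Presumably the fix amounts to consulting the stored image dimensions before fetching the original at all; a trivial standalone sketch of that check, where the 50 megapixel limit is illustrative and the real value comes from the $wgMaxImageArea setting:

MAX_IMAGE_AREA = 50_000_000   # illustrative only; the real limit is configuration

def should_fetch_original(width, height, max_area=MAX_IMAGE_AREA):
    # Decide from already-stored metadata alone, before any download happens.
    return width * height <= max_area

print(should_fetch_original(9000, 7000))   # 63 Mpx -> False, so skip the 540 MB download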

I've missed much of the events that unfolded around this situation. Looking back in the mailing list archives, I'm not even clear whether it is Swift being overloaded or the time taken to actually thumbnail the image that's the problem (or both, or something else). One of the earlier emails says:

>We just had a brief imagescaler outage today at approx. 11:20 UTC that
>was investigated and NYPL maps were found to be the cause of the outage.
>Besides the complete outage of imagescaling, Swift's (4Gbps) bandwidth
>was saturated again, which would cause slowdowns and timeouts in file
>serving as well.

So possibly (correct me if I'm off base here) it's just the Swift network connection being overloaded, which in turn causes the image scalers to have to wait longer before the original image asset is delivered to them, causing them to be overloaded. If so, the fact that we are fetching the original >100 MB source file only to not even try to scale it, and doing so repeatedly until 4 attempts at a specific file width trigger attempt-failures that stop it for an hour on that particular size only, may be a very significant contributor to the situation.

The attempt-failures thing only increments the cache key after the attempt has failed. Given that it was taking ~38 seconds just to download the file to the image scaler (in the case I tried), a lot of people could try to render that file in that time before the key is incremented (still limited by the pool counter, though). Maybe that key should be incremented at the beginning of the request. Sure, in certain situations a couple of people might get an error during the couple of seconds it takes a good file to render, but that would only last a couple of seconds and would much more quickly limit the damage a stampede of people requesting a hard-to-render file could do.
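
A toy sketch of the ordering being suggested here, with an in-memory dict standing in for the real cache key and a deliberately failing render step; all names are hypothetical and threshold/TTL handling is omitted:

ATTEMPT_LIMIT = 4
attempts = {}    # (file, width) -> count; stand-in for the real cache key

def render(file, width):
    # Hypothetical stand-in for the real scaling step; always fails here to
    # simulate a file that cannot be thumbnailed.
    raise MemoryError("simulated failure while scaling %s at %dpx" % (file, width))

def try_render(file, width):
    key = (file, width)
    if attempts.get(key, 0) >= ATTEMPT_LIMIT:
        raise RuntimeError("too many recent failed attempts, try again later")
    attempts[key] = attempts.get(key, 0) + 1     # counted up front, not after the failure
    result = render(file, width)
    attempts.pop(key, None)                      # success: clear the counter
    return result

for _ in range(6):
    try:
        try_render("big_map.tiff", 3000)
    except Exception as exc:
        print(exc)      # four simulated failures, then the stampede is cut off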
Comment 5 Bawolff (Brian Wolff) 2014-05-24 02:56:16 UTC
I was reading over the thread on multimedia - I'm not entirely sure the Special:NewFiles theory makes sense; I think it's more likely someone viewed a category of the TIFF uploads from GWToolset, or something like that.

So we have this graph of April 21, with a peak from about 2:55 to 3:20 UTC: http://lists.wikimedia.org/pipermail/multimedia/attachments/20140420/35015082/attachment-0001.png

However, when you look at the uploads from around that time, the peak in large TIFF uploads does not correspond with the peak in the graph:

MariaDB [commonswiki_p]> select
    substring(img_timestamp, 9, 3) "time",
    count(*) "# images",
    round(max(img_width*img_height/1000000)) "max Mpx",
    round(avg(img_width*img_height/1000000)) "avg mpx",
    round(avg(img_size/(1024*1024))) "avg MB",
    round(sum(img_size/(1024*1024))) "total mb",
    round(max(img_size/(1024*1024))) "max mb"
  from image
  where img_timestamp > '20140421010000'
    and img_timestamp < '20140421050000'
    and img_minor_mime = 'tiff'
    and img_user_text = 'Fæ'
  group by substring(img_timestamp, 1, 11);
+------+----------+---------+---------+--------+----------+--------+
| time | # images | max Mpx | avg mpx | avg MB | total mb | max mb |
+------+----------+---------+---------+--------+----------+--------+
| 010  |       40 |      60 |      42 |    121 |     4822 |    172 |
| 011  |       40 |      39 |      39 |    110 |     4409 |    112 |
| 012  |       19 |      60 |      42 |    120 |     2280 |    172 |
| 013  |       37 |      60 |      60 |    171 |     6328 |    173 |
| 014  |       17 |      60 |      60 |    172 |     2916 |    173 |
| 015  |       20 |      60 |      60 |    171 |     3427 |    173 |
| 020  |       35 |      60 |      60 |    171 |     5986 |    173 |
| 021  |       15 |      60 |      60 |    170 |     2555 |    172 |
| 022  |       26 |      60 |      60 |    172 |     4463 |    173 |
| 023  |       18 |      60 |      60 |    171 |     3079 |    173 |
| 030  |        6 |      60 |      59 |    170 |     1018 |    173 |
| 032  |        5 |      60 |      60 |    171 |      857 |    173 |
| 033  |        2 |      60 |      60 |    172 |      343 |    173 |
+------+----------+---------+---------+--------+----------+--------+
13 rows in set (0.01 sec)

That is, between 2:50 and 3:20 there was a total of 6 TIFF files uploaded by Fae with GWToolset (out of 141 total uploads in that time period, 4.2%), compared to, say, 1:00-1:30, which didn't have a spike but had 99 TIFF files uploaded by Fae (out of 373 total, 27%). If it was caused by viewing Special:NewFiles, I would expect the spike to come when the 99 TIFFs were uploaded rather than when the 6 TIFFs were uploaded.

Which leads me to suspect the issue was not people viewing Special:NewFiles a lot, but people viewing something else that had a lot of uncached thumbnail hits associated with it. Maybe it was the category for the batch upload, which would have up to 200 images on it - probably a lot of them over $wgMaxImageArea, so triggering what I mentioned in comment 4, and the rest might simply not have been viewed before - being viewed by several people at the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was linked on the Village pump at the time (although it had been for about a day); maybe somebody just hit reload on that page repeatedly for some unknown reason and that overloaded things. Or something.

With all that said, I guess even if it wasn't Special:NewFiles, it probably doesn't change much, as it's still related to on-demand thumbnailing.
Comment 6 Keegan Peterzell 2014-05-24 04:09:25 UTC
(In reply to Bawolff (Brian Wolff) from comment #5)
> I was reading over the thread on multimedia - I'm not entirely sure the
> Special:Newfiles theory makes sense, I think its more likely someone maybe
> viewed a category of the tiff uploads from gwtoolset or something like that.
<snip>
> 
> Which leads me to suspect the issue was not with people viewing
> Special:NewFiles a lot, but maybe viewing something else that had a lot of
> uncached thumbnail hits associated. Maybe the category for the batch upload,
> which would have up to 200 images on it, probably a lot over the
> $wgMaxImageArea so triggering what I mentioned in comment 4 - and the rest
> might simply have not been viewed before, was viewed by several someones at
> the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was
> linked in the VP at the time (although it had been for about a day), maybe
> somebody just hit reload on that page repetitively for some unknown reason
> and that overloaded things. Or something.
> 
> With all that said, I guess even if it wasn't Special:Newfiles, it probably
> doesn't change much as its still related to on-demand thumbnailing.

You could be on to something. For example, all of the thumbnails in [[commons:Category:Sanborn maps of Staten Island]] are broken when you go to view an image in full resolution. It doesn't have to be someone hitting reload repeatedly; the call for the thumb regenerates on its own once it fails. For example:

https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff.jpg

I can open that up in a background browser tab and it just keeps hitting the server over and over for thumbnail requests.
Comment 7 Keegan Peterzell 2014-05-24 04:15:06 UTC
(In reply to Keegan Peterzell from comment #6)
> https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/
> Staten_Island%2C_Plate_No.
> _12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
> 9.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No.
> _12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
> 9.tiff.jpg
> 
> I can open that up in a background browser tab and it just keeps hitting the
> server over and over for thumbnail requests.

I should clarify: my browser (Chrome 34.0.1847.137 m) gives different behaviors when I open up images from that gallery. One image failed on its own refresh call six times before halting and returning the proper error message ("There have been too many recent failed attempts (4 or more) to render this thumbnail. Please try again later."). Another image reloaded only twice before halting with no error message. Yet another image just keeps reloading without the error message.
Comment 8 Keegan Peterzell 2014-05-24 04:17:43 UTC
(In reply to Keegan Peterzell from comment #7)
> I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
> behaviors when I open up images from that gallery. One image failed upon its
> own refresh call six times before halting and returning the proper error
> message (There have been too many recent failed attempts (4 or more) to
> render this thumbnail. Please try again later.) Another image reloaded only
> twice before halting with no error message. Yet another image just keep
> reloading without the error message.

And by "without the error message", I mean that the server is leaving the field blank:

Error generating thumbnail

Error creating thumbnail:
Comment 9 Bawolff (Brian Wolff) 2014-05-24 04:44:32 UTC
(In reply to Keegan Peterzell from comment #8)
> (In reply to Keegan Peterzell from comment #7)
> > I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
> > behaviors when I open up images from that gallery. One image failed upon its
> > own refresh call six times before halting and returning the proper error
> > message (There have been too many recent failed attempts (4 or more) to
> > render this thumbnail. Please try again later.) Another image reloaded only
> > twice before halting with no error message. Yet another image just keep
> > reloading without the error message.
> 
> And by without the error message, I mean that the server is leaving the
> field blank.
> 
> Error generating thumbnail
> 
> Error creating thumbnail:

Well, the blank error message is consistent with an out-of-memory error for a TIFF file (since the process gets killed and doesn't output anything to stdout; other formats return the exit code, but TIFF doesn't). However, your web browser is not supposed to be loading the page over and over again by itself. My copy of Chrome doesn't do that.


-----

Furthermore, looking at the IRC logs - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140421.txt - the servers had issues all the way up to 14:50 UTC on April 21, which is long after Fae's uploads stopped and dropped off Special:NewFiles/Special:Listfiles. Similarly for the outage at 11:20 UTC on May 11 - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140511.txt - where [[commons:file:Bronx,_V._12,_Double_Page_Plate_No._273_%28Map_bounded_by_Whiting_Ave.,_Ewen_Ave.,_Warren_Ave.,_Hudson_River%29_NYPL2001533.tiff]] is mentioned, which is one of the images uploaded back on April 21, so definitely not on Special:NewFiles. (Also, that file is over $wgMaxImageArea, so Gerrit change #135101 would have stopped that particular file from causing a problem. Of course, the IRC log is unclear about whether that was the main file causing problems or just one example of many files being requested at the time.)
Comment 10 Bawolff (Brian Wolff) 2014-05-25 04:50:31 UTC
(In reply to Tisza Gergő from comment #3)
> The consensus on the ops list was that
> https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume
> uploads, and bug 52045 probably would not help much. The current plan is to
> 
> * extract a large thumbnail from the file, and use that thumbnail to create
> smaller thumbnails (possibly in a chain, i.e. use some of those smaller
> thumbnails to create even smaller thumbnails)

I sort of did this for tiff as part of the work to make vips work on tiffs - see Gerrit change #135289.
Comment 11 dan 2014-06-13 11:20:20 UTC
With these Gerrit patches merged and deployed to production, is it time for Fae to retry one of his large TIFF uploads?

* https://gerrit.wikimedia.org/r/#/c/107419/
* https://gerrit.wikimedia.org/r/#/c/127642/
* https://gerrit.wikimedia.org/r/#/c/132111/
* https://gerrit.wikimedia.org/r/#/c/135701/
* https://gerrit.wikimedia.org/r/#/c/135702/
* https://gerrit.wikimedia.org/r/#/c/135976/

Or do these also need to be deployed to production before we try testing large TIFFs again?

* https://gerrit.wikimedia.org/r/#/c/135703
* https://gerrit.wikimedia.org/r/#/c/135704
Comment 12 Tisza Gergő 2014-06-21 01:51:39 UTC
Sorry for the slow response, I got unCCd from this bug somehow.

(In reply to dan from comment #11)
> with these gerrit patches merged, and deployed onto production, is it time
> for fae to re-try one of his large tiff uploads?

The changes you mention don't really help:

> * https://gerrit.wikimedia.org/r/#/c/107419/
> * https://gerrit.wikimedia.org/r/#/c/127642/

These only help with thumbnails which completely fail to render, and even for those they have limited effect (as Bawolff pointed out above, the rendering would still take up time and memory until the failure threshold is hit).

Also, the first was merged long ago, and the second right after the first outage, so they did not stop the second one.

> * https://gerrit.wikimedia.org/r/#/c/132111/
> * https://gerrit.wikimedia.org/r/#/c/135701/
> * https://gerrit.wikimedia.org/r/#/c/135702/
> * https://gerrit.wikimedia.org/r/#/c/135976/

These don't really do anything without the two pending ones you mention. (Sorry to be so sluggish on this - we were distracted by troubles with the MediaViewer rollout on enwiki. Also, Gilles is on vacation next week, so unless someone else is willing to review them, not much will happen. I hope to get them merged the following week.)

Bawolff's $wgMaxImageArea patch might help somewhat:

https://gerrit.wikimedia.org/r/#/c/135101/

I'm not sure whether the files involved in the second outage were that large, though.

The multi-step scaling patches might also help, once they get merged:
https://gerrit.wikimedia.org/r/#/c/135289/
https://gerrit.wikimedia.org/r/#/c/135008/
(the second one is only for JPEGs at the moment though)
Comment 13 Tisza Gergő 2014-06-21 01:57:16 UTC
(In reply to Bawolff (Brian Wolff) from comment #4)
> The attempt-failures thing only increments the cache key after the attempt
> failed. Given it was taking ~ 38 seconds just to download the file to the
> image scalar (in the case I tried), A lot of people could try and render
> that file in that time before the key is incremented (Still limited by the
> pool counter though). Maybe that key should be incremented at the beginning
> of the request.

That would be a semaphore, basically (except that its value would decrease with failures). Isn't that what the FileRender poolcounter does already?
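
For illustration, a small standalone sketch of the semaphore behaviour in question - limiting how many expensive renders may run concurrently, roughly what the FileRender pool counter already enforces. Plain Python threading is used here purely as an analogy; PoolCounter itself is a separate network service, and the limit and timeout values are made up:

import threading, time

MAX_CONCURRENT_RENDERS = 2
render_slots = threading.Semaphore(MAX_CONCURRENT_RENDERS)

def render_thumbnail(name):
    # Fail fast instead of queueing forever when the pool is saturated.
    if not render_slots.acquire(timeout=0.1):
        print(name, "rejected: render pool is full")
        return
    try:
        time.sleep(1)          # stand-in for the real, expensive scaling work
        print(name, "rendered")
    finally:
        render_slots.release()

threads = [threading.Thread(target=render_thumbnail, args=("req-%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()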
Comment 14 Bawolff (Brian Wolff) 2014-06-21 22:16:06 UTC
(In reply to Tisza Gergő from comment #13)
> (In reply to Bawolff (Brian Wolff) from comment #4)
> > The attempt-failures thing only increments the cache key after the attempt
> > failed. Given it was taking ~ 38 seconds just to download the file to the
> > image scalar (in the case I tried), A lot of people could try and render
> > that file in that time before the key is incremented (Still limited by the
> > pool counter though). Maybe that key should be incremented at the beginning
> > of the request.
> 
> That would be a semaphore, basically (except that its value would decrease
> with failures). Isn't that what the FileRender poolcounter does already?

Yes. You're right.
Comment 15 Nemo 2014-06-26 07:31:38 UTC
CCing Sam here because I don't know where else to put this, regarding:

samwilson> one thing i've been tinkering with is a system of generating thumbnails offline and ploking them in their correct locations. that'd reduce a pile of the out-of-memory things i see on DH [DreamHost] sites.
Comment 16 Sam Wilson 2014-06-26 09:01:12 UTC
(Thanks for the heads-up re this, Nemo.)

My thing isn't really a fix! It's just a simple way for the site administrator to be told that some thumbnail is missing, and where it should go in the filesystem, so that they can generate it locally (e.g. in Gimp or whatnot) and upload it (via some easy interface, although I've not thought about that bit; scp is my usual).

So, not really a help. But good for memory-poor places like Dreamhost!
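
A bare-bones sketch of that workflow - render the missing thumbnail on a machine with enough memory, then copy it to the path the wiki expects. The host, destination path, and file naming are entirely made up:

import subprocess
from PIL import Image  # pip install Pillow

def make_and_upload_thumb(original, width, remote_dest):
    # Render the thumbnail locally, where memory limits are not a problem...
    thumb_path = "%s.%dpx.jpg" % (original, width)
    img = Image.open(original).convert("RGB")
    img.thumbnail((width, width))
    img.save(thumb_path)
    # ...then copy it to wherever the wiki expects to find it.
    subprocess.run(["scp", thumb_path, remote_dest], check=True)

# Example call (host and destination path are hypothetical):
# make_and_upload_thumb("Example.tiff", 800,
#     "user@host:/home/wiki/images/thumb/Example.tiff/800px-Example.tiff.jpg")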
Comment 17 Bawolff (Brian Wolff) 2014-06-26 16:35:43 UTC
(In reply to Sam Wilson from comment #16)
> (Thanks for the heads-up re this, Nemo.)
> 
> My thing isn't really a fix! It's just a simple way for the site
> administrator to be told that some thumbnail is missing, and where it should
> go in the filesystem, so that they can generate it locally (i.e. Gimp or
> whatnot) and upload it (via some easy interface, although I've not
> considered that bit; scp is my usual).
> 
> So, not really a help. But good for memory-poor places like Dreamhost!

[Slightly off topic] how memory poor is dreamhost?
Comment 18 Sam Wilson 2014-06-27 00:23:45 UTC
Their shared hosting: 90M. Actually, I think the imagemagick failures are also due to the processes running too long and being kissed.
Comment 19 Sam Wilson 2014-06-27 00:24:45 UTC
Agh, *killed*. Unless DH is the mafia I guess...
