Last modified: 2013-09-19 01:10:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56045, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 54045 - Wikimedia thumbnail generator often returns HTTP 500 Internal Server Error
Wikimedia thumbnail generator often returns HTTP 500 Internal Server Error
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Media storage (Other open bugs)
wmf-deployment
All All
: High critical (vote)
: ---
Assigned To: Nobody - You can work on this!
: ops
: 54188 (view as bug list)
Depends on:
Blocks: 41371
  Show dependency treegraph
 
Reported: 2013-09-11 23:17 UTC by Krinkle
Modified: 2013-09-19 01:10 UTC (History)
11 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Krinkle 2013-09-11 23:17:53 UTC
When requesting random images from Commons' MediaWiki API and requesting a thumbnail for it, I often get a HTTP 500.

For example:



GET http://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg
HTTP 500
<html><head><title>Error generating thumbnail</title></head>
<body>
<h1>Error generating thumbnail</h1>
<p>
Error creating thumbnail:
</p>
<!-- http://commons.wikimedia.org/w/thumb_handler.php/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg -->
<!-- mw1154 -->
</body>
</html>

Usually when trying again, it just works.

I've been getting these errors in many different environments:
* When reading articles and HiDPI plugin swaps the src attributes (the larger version would fail maybe)
* When opening the VisualEditor (rendering the new DOM means we re-parse the <img>tag and thus re-request it, thus making it more likely for the error to happen again)
* When working with gadgets that render image galleries through requesting file category members and the thumbnail url.

I don't think the scenario is relevant, there is either something wrong with the thumbnail generator script that is triggered by lots of images. Or there is a few faulty servers in the upload.wikimedia.org pool that cause the errors.
Comment 1 Bawolff (Brian Wolff) 2013-09-11 23:20:47 UTC
See also bug 53573. (which was mostly mw1153)
-----
>I don't think the scenario is relevant, there is either something wrong with
>the thumbnail generator script that is triggered by lots of images. Or there is
>a few faulty servers in the upload.wikimedia.org pool that cause the errors.

Thumbnail generation should be entirely independent of page view actions, so the scenario should definitely not matter
Comment 2 Sam Reed (reedy) 2013-09-11 23:21:39 UTC
Always mw1154?

See also bug 53573, mw1153 was (not sure if it's changed) seemingly causing a lot more errors than its counterparts
Comment 3 Sam Reed (reedy) 2013-09-11 23:29:37 UTC
Looks like mw1153 has stopped being so noisy, and it's now mw1154. Which I guess makes this a dupe of 53753

reedy@fluorine:/a/mw-log$ grep -c error thumbnail.log
324850
reedy@fluorine:/a/mw-log$ grep mw1153 thumbnail.log |grep -c error
22013
reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
54280
reedy@fluorine:/a/mw-log$ grep mw1155 thumbnail.log |grep -c error
22011
reedy@fluorine:/a/mw-log$ grep mw1156 thumbnail.log |grep -c error
21980
reedy@fluorine:/a/mw-log$ grep mw1157 thumbnail.log |grep -c error
21847
reedy@fluorine:/a/mw-log$ grep mw1158 thumbnail.log |grep -c error
22146
reedy@fluorine:/a/mw-log$ grep mw1159 thumbnail.log |grep -c error
21905
reedy@fluorine:/a/mw-log$ grep mw1160 thumbnail.log |grep -c error
22107
Comment 4 Bawolff (Brian Wolff) 2013-09-11 23:46:10 UTC
I just did a test on commons - visited a page with 200 images which I suspect were not previously rendered before (note: all were tiff files). 28 of them failed, all from mw1154. There's only 8 image scalars. 200/8 = 25, so its not hard to imagine this means all requests to mw1154 are failing.

Perhaps something is wrong with the cgroups config on that host (or something else that would prevent all shell commands from succeeding)

The ganglia graph seems to indicate that the CPU usage on that host has dropped off since friday - http://ganglia.wikimedia.org/latest/graph.php?g=cpu_report&z=xlarge&c=Image%20scalers%20eqiad&h=mw1154.eqiad.wmnet&l=e2ecff&v=&r=week&su=1&st=1378942982&x=0&n=0
Comment 5 Sam Reed (reedy) 2013-09-11 23:52:05 UTC
reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
55424
reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
111320
Comment 6 Bawolff (Brian Wolff) 2013-09-12 01:52:20 UTC
When you look at the 500 error message, its supposed to include all the stdout and stderr from convert. However these are empty. Perhaps https://gerrit.wikimedia.org/r/83974 would help debug the situation (including stderr from limit.sh in the 500 error message).
Comment 7 Bawolff (Brian Wolff) 2013-09-12 04:07:25 UTC
(In reply to comment #5)
> reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error
> 55424
> reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
> 111320

We pass things with \n's in them to wfDebugLog for the thumbnail log. I don't think we log successful thumbnailing events at all. So I think all that means is many of the thumbnail error log messages span several lines.
Comment 8 Bawolff (Brian Wolff) 2013-09-16 23:16:43 UTC
*** Bug 54188 has been marked as a duplicate of this bug. ***
Comment 9 Bawolff (Brian Wolff) 2013-09-16 23:24:45 UTC
Btw, good debugging step would be for someone with shell to try running the convert command by hand (with limit.sh) to see what happens (e.g. if issue with cgroup config this would give an error that wouldn't be included in thumb log)

See also Gerrit change #83974 which would give more useful error messages in certain circumstances (which may or may not help for the current situation)
Comment 10 Krinkle 2013-09-17 04:16:20 UTC
The entries in /a/mw-log/thumbnail.log are often hard to correlate since they use temporary file names. However it appears some (or all?) contain a MediaWiki File-namespace url as well.

I just got another HTTP 500 Internal Server Error and found the following entry:

URL: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Trevor_Parscal_December_2008.jpg/660px-Trevor_Parscal_December_2008.jpg

Triggered from an <img> tag on https://commons.wikimedia.org/w/index.php?title=File:Trevor_Parscal_December_2008.jpg&action=submit when previewing an edit.


[04:11 UTC] krinkle at fluorine in /a/mw-log
$ tail thumbnail.log  -n500 | ack-grep Trevor -C 10

Bytes: 0xFF 0x27 0x20 0x69" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 61 -o '/tmp/transform_5a95aeb43c6b-1.png' '/tmp/localcopy_61a6c00091e1-1.svg' 2>&1"
2013-09-17 04:10:06 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_cb6372f89daf-1.jpg'  -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8'  -rotate -0   '/tmp/transform_f22fd870a3fc-1.jpg' 2>&1"
2013-09-17 04:10:06 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_f22fd870a3fc-1.jpg". unlink() succeeded
2013-09-17 04:10:06 mw1157 commonswiki: thumbnail failed on mw1157: error 1 "convert: no decode delegate for this image format `/a/magick-tmp/magick-NyMRfqeG' @ error/constitute.c/ReadImage/532.
convert: missing an image filename `/tmp/transform_6c0a3d0fd7c4-1.jpg' @ error/convert.c/ConvertImageCommand/3011." from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=120x53 ''  -thumbnail '120x53!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Banknote_5000_rubles_(1997)_front.jpg' -depth 8 -sharpen '0x0.8'  -rotate -0   '/tmp/transform_6c0a3d0fd7c4-1.jpg' 2>&1"
..
2013-09-17 04:13:59 mw1159 commonswiki: thumbnail failed on mw1159: error 1 "Error reading SVG:Error domain 1 code 96 on line 1 column 17 of file:///tmp/localcopy_2362643f10ec-1.svg: Malformed declaration expecting version" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 96 -o '/tmp/transform_8ba6acf9d427-1.png' '/tmp/localcopy_2362643f10ec-1.svg' 2>&1"
2013-09-17 04:13:59 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_babba88e51a4-1.jpg'  -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8'  -rotate -0   '/tmp/transform_de7473e35169-1.jpg' 2>&1"
2013-09-17 04:13:59 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_de7473e35169-1.jpg". unlink() succeeded
2013-09-17 04:14:01 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_55d5b98e5f7b-1.png". unlink() succeeded
Comment 11 Bawolff (Brian Wolff) 2013-09-18 23:25:04 UTC
Could this server be taken out of rotation until the issue is resolved?
Comment 12 Sam Reed (reedy) 2013-09-18 23:38:26 UTC
Part of the question would be is it still the same machine? (based on the fact it's already changed once)
Comment 13 Bawolff (Brian Wolff) 2013-09-19 00:59:53 UTC
Ganglia graphs suggest so (mw1154 has less activity) but I haven't specificly tested.
Comment 14 Sam Reed (reedy) 2013-09-19 01:05:17 UTC
Yup, still mw1154


reedy@fluorine:/a/mw-log$ grep -c mw1153 thumbnail.log
50057
reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log
125850
reedy@fluorine:/a/mw-log$ grep -c mw1155 thumbnail.log
50451
reedy@fluorine:/a/mw-log$ grep -c mw1156 thumbnail.log
50584
reedy@fluorine:/a/mw-log$ grep -c mw1157 thumbnail.log
51015
reedy@fluorine:/a/mw-log$ grep -c mw1158 thumbnail.log
50386
reedy@fluorine:/a/mw-log$ grep -c mw1159 thumbnail.log
50134
reedy@fluorine:/a/mw-log$ grep -c mw1160 thumbnail.log
51486
Comment 15 Sam Reed (reedy) 2013-09-19 01:10:25 UTC
[02:06:08] <Reedy> Can someone depool mw1154 from the image scaler cluster?
[02:06:20] <paravoid> what's wrong with it?
[02:07:40] <Reedy> It's seemingly generating a lot more errors (by a factor of 2 or so) than any of the other scalers
[02:07:41] <Reedy> https://bugzilla.wikimedia.org/show_bug.cgi?id=54045
[02:08:32] <paravoid> root@mw1154:~# ls /sys/fs/cgroup/memory/mediawiki/job
[02:08:33] <paravoid> ls: cannot access /sys/fs/cgroup/memory/mediawiki/job: No such file or directory
[02:08:35] <paravoid> blergh
[02:09:01] <paravoid> fixed
[02:09:58] <Reedy> is that's what's up with it?
[02:10:05] <paravoid> yes, fixed
[02:10:09] <Reedy> awesome
[02:10:10] <Reedy> thanks

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links