Last modified: 2013-09-19 01:10:25 UTC
When requesting random images from Commons' MediaWiki API and requesting a thumbnail for it, I often get a HTTP 500. For example: GET http://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg HTTP 500 <html><head><title>Error generating thumbnail</title></head> <body> <h1>Error generating thumbnail</h1> <p> Error creating thumbnail: </p> <!-- http://commons.wikimedia.org/w/thumb_handler.php/f/fb/USMC-05934.jpg/400px-USMC-05934.jpg --> <!-- mw1154 --> </body> </html> Usually when trying again, it just works. I've been getting these errors in many different environments: * When reading articles and HiDPI plugin swaps the src attributes (the larger version would fail maybe) * When opening the VisualEditor (rendering the new DOM means we re-parse the <img>tag and thus re-request it, thus making it more likely for the error to happen again) * When working with gadgets that render image galleries through requesting file category members and the thumbnail url. I don't think the scenario is relevant, there is either something wrong with the thumbnail generator script that is triggered by lots of images. Or there is a few faulty servers in the upload.wikimedia.org pool that cause the errors.
See also bug 53573. (which was mostly mw1153) ----- >I don't think the scenario is relevant, there is either something wrong with >the thumbnail generator script that is triggered by lots of images. Or there is >a few faulty servers in the upload.wikimedia.org pool that cause the errors. Thumbnail generation should be entirely independent of page view actions, so the scenario should definitely not matter
Always mw1154? See also bug 53573, mw1153 was (not sure if it's changed) seemingly causing a lot more errors than its counterparts
Looks like mw1153 has stopped being so noisy, and it's now mw1154. Which I guess makes this a dupe of 53753 reedy@fluorine:/a/mw-log$ grep -c error thumbnail.log 324850 reedy@fluorine:/a/mw-log$ grep mw1153 thumbnail.log |grep -c error 22013 reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error 54280 reedy@fluorine:/a/mw-log$ grep mw1155 thumbnail.log |grep -c error 22011 reedy@fluorine:/a/mw-log$ grep mw1156 thumbnail.log |grep -c error 21980 reedy@fluorine:/a/mw-log$ grep mw1157 thumbnail.log |grep -c error 21847 reedy@fluorine:/a/mw-log$ grep mw1158 thumbnail.log |grep -c error 22146 reedy@fluorine:/a/mw-log$ grep mw1159 thumbnail.log |grep -c error 21905 reedy@fluorine:/a/mw-log$ grep mw1160 thumbnail.log |grep -c error 22107
I just did a test on commons - visited a page with 200 images which I suspect were not previously rendered before (note: all were tiff files). 28 of them failed, all from mw1154. There's only 8 image scalars. 200/8 = 25, so its not hard to imagine this means all requests to mw1154 are failing. Perhaps something is wrong with the cgroups config on that host (or something else that would prevent all shell commands from succeeding) The ganglia graph seems to indicate that the CPU usage on that host has dropped off since friday - http://ganglia.wikimedia.org/latest/graph.php?g=cpu_report&z=xlarge&c=Image%20scalers%20eqiad&h=mw1154.eqiad.wmnet&l=e2ecff&v=&r=week&su=1&st=1378942982&x=0&n=0
reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error 55424 reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log 111320
When you look at the 500 error message, its supposed to include all the stdout and stderr from convert. However these are empty. Perhaps https://gerrit.wikimedia.org/r/83974 would help debug the situation (including stderr from limit.sh in the 500 error message).
(In reply to comment #5) > reedy@fluorine:/a/mw-log$ grep mw1154 thumbnail.log |grep -c error > 55424 > reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log > 111320 We pass things with \n's in them to wfDebugLog for the thumbnail log. I don't think we log successful thumbnailing events at all. So I think all that means is many of the thumbnail error log messages span several lines.
*** Bug 54188 has been marked as a duplicate of this bug. ***
Btw, good debugging step would be for someone with shell to try running the convert command by hand (with limit.sh) to see what happens (e.g. if issue with cgroup config this would give an error that wouldn't be included in thumb log) See also Gerrit change #83974 which would give more useful error messages in certain circumstances (which may or may not help for the current situation)
The entries in /a/mw-log/thumbnail.log are often hard to correlate since they use temporary file names. However it appears some (or all?) contain a MediaWiki File-namespace url as well. I just got another HTTP 500 Internal Server Error and found the following entry: URL: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Trevor_Parscal_December_2008.jpg/660px-Trevor_Parscal_December_2008.jpg Triggered from an <img> tag on https://commons.wikimedia.org/w/index.php?title=File:Trevor_Parscal_December_2008.jpg&action=submit when previewing an edit. [04:11 UTC] krinkle at fluorine in /a/mw-log $ tail thumbnail.log -n500 | ack-grep Trevor -C 10 Bytes: 0xFF 0x27 0x20 0x69" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 61 -o '/tmp/transform_5a95aeb43c6b-1.png' '/tmp/localcopy_61a6c00091e1-1.svg' 2>&1" 2013-09-17 04:10:06 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_cb6372f89daf-1.jpg' -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_f22fd870a3fc-1.jpg' 2>&1" 2013-09-17 04:10:06 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_f22fd870a3fc-1.jpg". unlink() succeeded 2013-09-17 04:10:06 mw1157 commonswiki: thumbnail failed on mw1157: error 1 "convert: no decode delegate for this image format `/a/magick-tmp/magick-NyMRfqeG' @ error/constitute.c/ReadImage/532. convert: missing an image filename `/tmp/transform_6c0a3d0fd7c4-1.jpg' @ error/convert.c/ConvertImageCommand/3011." from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=120x53 '' -thumbnail '120x53!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Banknote_5000_rubles_(1997)_front.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_6c0a3d0fd7c4-1.jpg' 2>&1" .. 2013-09-17 04:13:59 mw1159 commonswiki: thumbnail failed on mw1159: error 1 "Error reading SVG:Error domain 1 code 96 on line 1 column 17 of file:///tmp/localcopy_2362643f10ec-1.svg: Malformed declaration expecting version" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 96 -o '/tmp/transform_8ba6acf9d427-1.png' '/tmp/localcopy_2362643f10ec-1.svg' 2>&1" 2013-09-17 04:13:59 mw1154 commonswiki: thumbnail failed on mw1154: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=660x440 '/tmp/localcopy_babba88e51a4-1.jpg' -thumbnail '660x440!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Trevor_Parscal_December_2008.jpg' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_de7473e35169-1.jpg' 2>&1" 2013-09-17 04:13:59 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_de7473e35169-1.jpg". unlink() succeeded 2013-09-17 04:14:01 mw1154 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_55d5b98e5f7b-1.png". unlink() succeeded
Could this server be taken out of rotation until the issue is resolved?
Part of the question would be is it still the same machine? (based on the fact it's already changed once)
Ganglia graphs suggest so (mw1154 has less activity) but I haven't specificly tested.
Yup, still mw1154 reedy@fluorine:/a/mw-log$ grep -c mw1153 thumbnail.log 50057 reedy@fluorine:/a/mw-log$ grep -c mw1154 thumbnail.log 125850 reedy@fluorine:/a/mw-log$ grep -c mw1155 thumbnail.log 50451 reedy@fluorine:/a/mw-log$ grep -c mw1156 thumbnail.log 50584 reedy@fluorine:/a/mw-log$ grep -c mw1157 thumbnail.log 51015 reedy@fluorine:/a/mw-log$ grep -c mw1158 thumbnail.log 50386 reedy@fluorine:/a/mw-log$ grep -c mw1159 thumbnail.log 50134 reedy@fluorine:/a/mw-log$ grep -c mw1160 thumbnail.log 51486
[02:06:08] <Reedy> Can someone depool mw1154 from the image scaler cluster? [02:06:20] <paravoid> what's wrong with it? [02:07:40] <Reedy> It's seemingly generating a lot more errors (by a factor of 2 or so) than any of the other scalers [02:07:41] <Reedy> https://bugzilla.wikimedia.org/show_bug.cgi?id=54045 [02:08:32] <paravoid> root@mw1154:~# ls /sys/fs/cgroup/memory/mediawiki/job [02:08:33] <paravoid> ls: cannot access /sys/fs/cgroup/memory/mediawiki/job: No such file or directory [02:08:35] <paravoid> blergh [02:09:01] <paravoid> fixed [02:09:58] <Reedy> is that's what's up with it? [02:10:05] <paravoid> yes, fixed [02:10:09] <Reedy> awesome [02:10:10] <Reedy> thanks