Last modified: 2013-06-17 13:43:41 UTC
On Wikimedia Commons (i.e. the version running there), the file File:Finlands Allmänna Tidning 1820-01-03.pdf doesn't show any page images. Adding ?action=purge to the URL doesn't help. No explanation is given. In an offline reader, the PDF looks fine. If the images are encoded in a way that MediaWiki can't handle, the user would be helped by an error message that gives instructions on which image encodings are supported.
On the Commons:Village_pump I was told how to view the error message (this was not trivial and leaves room for improvement). Apparently:

Error creating thumbnail: GPL Ghostscript 8.61: Unrecoverable error, exit code 1
convert: no decode delegate for this image format `/tmp/magick-XXP4reva'.

However, the offline PDF viewer "evince" that comes with Ubuntu Linux had no problem viewing this PDF (images+text), and "pdfimages" also succeeds in extracting the images, so it should be possible with free software.
The error message makes it sound slightly like a font problem, maybe(?), since according to the Ghostscript FAQ ( http://pages.cs.wisc.edu/~ghost/doc/gnu/7.05/Issues.htm ): "When CIDFont-CMap pair required by PDF file is not available GS fails with: /undefinedresource in --findresource--", and there's all sorts of font-related stuff on the operand stack. But I don't know much about PDFs, so that is a wild guess.

-------

Anyway, here's the actual output from Ghostscript when run on the command line (page 1 seems to print fine before it all blows up):

Processing pages 1 through 4.
Page 1
Substituting CID font resource/Adobe-Identity for /Arial.
Error: /undefinedresource in findresource
Operand stack:
   --nostringval-- --dict:8/17(L)-- FontU 56.41 --dict:6/6(L)-- --dict:6/6(L)-- ArialUnicodeMS-Identity-H --dict:9/12(ro)(G)-- --nostringval-- --dict:6/6(L)-- --dict:6/6(L)-- Adobe-Identity CIDFont Adobe-Identity
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2 %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push 1905 1 3 %oparray_pop 1904 1 3 %oparray_pop 1888 1 3 %oparray_pop --nostringval-- --nostringval-- 2 1 4 --nostringval-- %for_pos_int_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- false 1 %stopped_push --nostringval-- %loop_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- %array_continue --nostringval-- --nostringval-- --nostringval-- --nostringval-- --nostringval-- %loop_continue --nostringval-- 1856 13 10 %oparray_pop findresource %errorexec_pop --nostringval-- --nostringval-- --nostringval--
Dictionary stack:
   --dict:1151/1684(ro)(G)-- --dict:1/20(G)-- --dict:97/200(L)-- --dict:97/200(L)-- --dict:108/127(ro)(G)-- --dict:275/300(ro)(G)-- --dict:22/25(L)-- --dict:4/6(L)-- --dict:21/40(L)-- --dict:6/8(L)-- --dict:38/40(ro)(G)--
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.62: Unrecoverable error, exit code 1
No fonts should be needed to extract scanned images from a PDF, so maybe the use of Ghostscript is the problem, and we should use pdfimages instead?
I also have a PDF that isn't thumbnailing at Commons: File:EAA2 Mississippi River Delta.pdf. When I try to create a thumbnail, it gives "Error creating thumbnail: convert: no decode delegate for this image format `/tmp/magick-XXKuSImy' @ error/constitute.c/ReadImage/532. convert: missing an image filename `/mnt/thumbs/wikipedia/commons/thumb/9/92/EAA2_Mississippi_River_Delta.pdf/page1-557px-EAA2_Mississippi_River_Delta.pdf.jpg' @ error/convert.c/ConvertImageCommand/2970."
This is referenced by RT #1175, which is now closed. This can probably be closed, but needs verification.
Hmm, doesn't seem to be solved by the 8.71 upgrade (bug 26388), and this isn't fixed by 9.04 either ("Error: /syntaxerror in -file-GPL Ghostscript 9.04: Unrecoverable error, exit code 1"), so it doesn't seem likely that 9.05 is going to fix this (bug 36580). Someone should probably test this with the very latest version of Ghostscript, and if it's broken there, too, report a bug upstream (see http://www.ghostscript.com/ )
Still there. The PDF opens correctly on my machine and a user successfully converted it to https://commons.wikimedia.org/wiki/File:Finlands_Allm%C3%A4nna_Tidning_1820-01-03.djvu
(In reply to comment #7)
> Still there. The PDF opens correctly on my machine and a user successfully
> converted it to

Correctly on your machine with ghostscript or using some other program?
(In reply to comment #8)
> (In reply to comment #7)
> > Still there. The PDF opens correctly on my machine and a user successfully
> > converted it to
>
> Correctly on your machine with ghostscript or using some other program?

I had tried okular, but gs works too.

$ ghostscript Finlands_Allmänna_Tidning_1820-01-03.pdf
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 4.
Page 1
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
Page 2
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
Page 3
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
Page 4
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
> I had tried okular, but gs works too.

OK, that implies that the issue was fixed upstream and an upgrade to ghostscript would fix the issue. Adding keyword ops.
(In reply to comment #10)
> > I had tried okular, but gs works too.
>
> Ok, that implies that the issue was fixed upstream and an upgrade to
> ghostscript would fix the issue.

I don't think so. We're already on 9.05...
I fixed the PDF (workaround)
== Testcase in Comment 0 ==

Trying https://upload.wikimedia.org/wikipedia/commons/archive/1/19/20121126125750%21Finlands_Allm%C3%A4nna_Tidning_1820-01-03.pdf in Ghostscript 9.06 from 2012-08-08 on a Fedora 18 machine I get:

**** Warning: File has unbalanced q/Q operators (too many q's)
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

Hence I don't see any valid bug report here and nothing that could be fixed on Wikimedia's side. => Closing as INVALID. A bug report should be filed against the tool the PDF was created with (unfortunately not exposed in its metadata). If anybody thinks that GhostScript should be more forgiving, feel free to report a request at http://bugs.ghostscript.com/ .

== Testcase in Comment 4 ==

No problems reproducible, thumbnail shown, no issues in Ghostscript. Might have been a different issue that somehow disappeared.
Andre, did you read the comments? In comment 12, Marco replaced the original PDF with a modified PDF. That doesn't remove this bug, which is that Mediawiki fails to generate thumbnails or a proper error message for the original PDF. The original PDF still displays properly in other software, so it is not broken.
(In reply to comment #14)
> Andre, did you read the comments? In comment 12, Marco replaced
> the original PDF with a modified PDF.

That's why I tested with the old PDF.

> That doesn't remove this bug

This report covers a few things. One is the problem that sometimes thumbnails are not created for PDF files that Ghostscript considers to be invalid. If this report is about the aspect "Provide some error message in the browser" then it is not fixed, indeed, but I consider this aspect ("Expose readable error messages in the browser") to be covered in bug 23831 already.

> The original PDF still displays
> properly in other software, so it is not broken.

So far the issue was the missing thumbnail, not how the PDF itself displays in other software. Ghostscript says the PDF file is broken and we use Ghostscript. I can imagine that other software is more forgiving. If you know that the PDF file is not broken and hence consider the error message in GhostScript wrong or misleading, it would be best to discuss this with the GhostScript developers. See the link in comment 13.
"Ghostscript says the PDF file is broken and we use Ghostscript." With that logic, you can say "and we use Mediawiki 1.5", and stop improving anything. Why should we report bugs anymore? Already in comment 1, I suggested that perhaps we should use pdfimages (which does work) instead of ghostscript (which is overly picky). But if the file is indeed broken, then Ghostscript should be used as a validator during upload and refuse to accept this broken file.
I think there is a misunderstanding here. We are responsible for MediaWiki and this is the canonical, "upstream" bugtracker for MediaWiki, so we of course accept reports and fix bugs for it. So far I have no reason to not believe the output of GhostScript that the specific PDF file is invalid. Again, if you think that GhostScript is wrong, the GhostScript developers need to be contacted "upstream", but I haven't seen any indication that it's a bug in GS so far. We use 3rd party software in many places (like PDF handling) to not reinvent the wheel (the related term is "downstream" - just mentioning the concept here, as I don't know how much open source background you have).

> I suggested that perhaps we should use pdfimages (which does work)
> instead of ghostscript (which is overly picky).

That's worth a separate enhancement request, please file it in this Bugzilla so it can be considered.

> But if the file is indeed broken, then Ghostscript should be used
> as a validator during upload and refuse to accept this broken file.

That's another pretty good idea, and worth another separate request. :)

In general only one issue per report should be handled, and this report is about a specific PDF file testcase that does not show a thumbnail, and from all I know so far the reason is that the PDF file is broken, so there's nothing to do server-/software-side (yet) for Wikimedia developers. Hence I closed this as INVALID. This does not mean that things could not be improved in several ways via several involved parties in the long run, but that's out of scope for this specific issue.
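The upload-time validator idea discussed above could be sketched roughly as follows. This is a minimal illustration only, not how MediaWiki handles uploads: the function names, the `Verdict` type, and the choice to treat Ghostscript's "repaired or ignored" warning as a rejection are all assumptions made for the example. It relies on the standard Ghostscript switches `-dNOPAUSE`, `-dBATCH` and the `nullpage` device, which interprets every page but writes no output.

```python
import subprocess
from dataclasses import dataclass

# Messages Ghostscript printed for the files in this thread. Treating them
# as rejection signals is an assumption of this sketch.
PROBLEM_MARKERS = (
    "Unrecoverable error",
    "This file had errors that were repaired or ignored",
)


@dataclass
class Verdict:
    ok: bool
    reason: str


def build_gs_command(pdf_path, gs_binary="gs"):
    """Interpret all pages but discard the rendering (nullpage device)."""
    return [gs_binary, "-dNOPAUSE", "-dBATCH", "-sDEVICE=nullpage", pdf_path]


def classify(returncode, output):
    """Turn a Ghostscript exit code and its combined output into a verdict."""
    if returncode != 0:
        return Verdict(False, f"Ghostscript exited with code {returncode}")
    for marker in PROBLEM_MARKERS:
        if marker in output:
            return Verdict(False, marker)
    return Verdict(True, "no problems reported")


def validate_pdf(pdf_path, gs_binary="gs"):
    """Run Ghostscript on the file; hypothetical hook for an upload check."""
    proc = subprocess.run(
        build_gs_command(pdf_path, gs_binary),
        capture_output=True,
        text=True,
    )
    return classify(proc.returncode, proc.stdout + proc.stderr)
```

A caller would reject the upload, or at least show `reason` to the uploader, whenever `validate_pdf(path).ok` is false. Whether a warning that Ghostscript managed to repair should block an upload is a policy question the thread leaves open.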
We seem to have 3200 affected files in [[Category:PDF files affected by MediaWiki restrictions]]
(In reply to comment #18)
> [[Category:PDF files affected by MediaWiki restrictions]]

-> not en, Commons: https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions
Long, long ago, I was a little interested in getting Wikisource to work, and so, when I found something that didn't work, I used to file bug reports like this one (in April 2010, mind you). However, the tendency of every little problem to become huge and impossible to solve has removed most of my previous interest. Comment #17 above is one very typical example of how this happens. Three years have passed. I leave it to others to try to get Wikisource to work. I have another project to work on. Have a good life.
(In reply to comment #20)
> Long, long ago, I was a little interested in getting Wikisource
> to work, and so, when I found something that didn't work, I used
> to file bug reports like this one (in April 2010, mind you).
>
> However, the tendency of every little problem to become huge and
> impossible to solve has removed most of my previous interest.
> Comment #17 above is one very typical example of how this happens.
>
> Three years have passed. I leave it to others to try to get
> Wikisource to work. I have another project to work on.
> Have a good life.

Indeed this is very frustrating. As domas said on some other bug, it doesn't matter whose fault it is; what matters is that the site is broken for the users (and readers).

I think it's useful to discover that the problem lies in some PDF error, that might even be something users can "easily" solve themselves without waiting years for a bug fix; if we decide not to work around library restrictions, though, this doesn't make the problem disappear.

In other words, what are users supposed to do in order to fix those PDFs? Are there standard commands to do so? We could for instance run a bot on Commons (this bug would be moved to Wikimedia>General), or at least make the error more useful.
(In reply to comment #21)
> I think it's useful to discover that the problem lies in some PDF error, that
> might even be something users can "easily" solve themselves without waiting
> years for a bug fix;

How?

> if we decide not to work around library restrictions,
> though, this doesn't make the problem disappear.

Which library restriction exactly would you like to work around here, and how?

With which exact incentive was this bug report reopened? We cannot easily fix broken, damaged PDF files that were uploaded, so what is the expectation? (The feature requests in comment 16 should be separate bug reports, as I wrote in comment 17.)

As I wrote before, a Ghostscript update very likely won't fix the issue in comment 0, and the different issue in comment 4 vanished. If you are after better error message propagation etc., please make that a different enhancement request. For the testcases in comment 0 and comment 4 on this bug report, I still consider this bug report INVALID.
(In reply to comment #22)
> (In reply to comment #21)
> > I think it's useful to discover that the problem lies in some PDF error, that
> > might even be something users can "easily" solve themselves without waiting
> > years for a bug fix;
>
> How?

pdfimages works, apparently.
[editconflict] I had a look at a non-representative sample of PDF files taken from https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I encountered the following problems:

* PDF renders fine on a local machine but not on the servers, due to resource limitations. Ex: https://commons.wikimedia.org/wiki/File:Banner30a%C3%B1os.pdf & https://commons.wikimedia.org/w/index.php?title=File:Cox_and_box.pd
-> There is no "real" fix for those files. One could raise the limits, but this would have an impact on server performance.

* Corrupt PDF, with messages such as "File did not complete the page properly and may be damaged." or "Please notify the author of the software that produced this file that it does not conform to Adobe's published PDF specification." Ex: https://commons.wikimedia.org/wiki/File:Boca_y_su_historia.pdf & https://commons.wikimedia.org/wiki/File:Commons_upload_and_my_uploads_android_workflows.pdf
-> Possible fix: Repair those files by bot, or use other software that is less strict to process PDF files. Though changing the viewer could also introduce more problems or new bugs...
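One conceivable shape for the "repair by bot" option is to have Ghostscript re-write the file through its `pdfwrite` device, which re-emits whatever it managed to interpret as a fresh PDF. The sketch below is only an assumption about how such a bot step might look. The thread does not say how the affected files were actually fixed, the function names are hypothetical, and re-writing a PDF can alter it (fonts, metadata, compression), so any real bot would need human review of the results.

```python
import subprocess
from pathlib import Path


def build_rewrite_command(src, dst, gs_binary="gs"):
    """Command line that re-emits src as a new PDF at dst via pdfwrite."""
    return [
        gs_binary,
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        str(src),
    ]


def rewrite_pdf(src, dst, gs_binary="gs"):
    """Return True if Ghostscript exited cleanly and wrote a non-empty file."""
    try:
        proc = subprocess.run(
            build_rewrite_command(src, dst, gs_binary),
            capture_output=True,
            text=True,
        )
    except OSError:
        # Ghostscript is not installed or could not be started.
        return False
    out = Path(dst)
    return proc.returncode == 0 and out.is_file() and out.stat().st_size > 0
```

This does nothing for the first category above: a file that is valid but exceeds the servers' resource limits will stay just as expensive to render after a re-write.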
(In reply to comment #23)
> pdfimages works, apparently.

$: man pdfimages
Pdfimages saves images from a Portable Document Format...

pdfimages does not save a PDF file as JPEG. It only extracts images from PDF files!?
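The distinction drawn here, extracting embedded images versus rendering whole pages, can be made concrete with the two poppler command-line tools. The sketch below only builds the command lines. The helper names and the default resolution are assumptions for illustration, and nothing in the thread suggests MediaWiki uses either tool. `pdfimages -j` writes embedded DCT (JPEG) images as .jpg files, while `pdftoppm -jpeg` rasterises each page, text and vector content included.

```python
def extract_images_command(pdf_path, prefix):
    """pdfimages: dump the images embedded in the PDF, not rendered pages."""
    return ["pdfimages", "-j", pdf_path, prefix]


def render_page_command(pdf_path, prefix, page=1, dpi=150):
    """pdftoppm: rasterise one page (text and graphics included) to JPEG."""
    return [
        "pdftoppm",
        "-jpeg",
        "-r", str(dpi),
        "-f", str(page),
        "-l", str(page),
        pdf_path,
        prefix,
    ]
```

For a pure scan such as the newspaper in comment 0, each page is essentially one embedded image, so the two give similar results. For a PDF with real text or vector drawings, only the page-rendering route yields something usable as a thumbnail.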
(In reply to comment #18)
> We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I fixed 99% of all files.
(In reply to comment #26)
> (In reply to comment #18)
> > We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions
>
> I fixed 99% of all files.

Wonderful, let's consider this bug fixed (you deserve a medal!). There are two more bugs opened for some of the remaining files, which probably hit the resource limitations you mentioned. Making MediaWiki work with such files by using lpr/CUPS or whatever would also be another request.