Last modified: 2013-06-17 13:43:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T25326, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 23326 - PDF image extraction fails


Summary:	PDF image extraction fails

Status:	RESOLVED FIXED

Product:	Wikimedia
Classification:	Unclassified
Component:	General/Unknown
Version:	unspecified
Hardware:	All All

Importance:	Low normal
Target Milestone:	---
Assigned To:	Marco

URL:	https://commons.wikimedia.org/w/index...
Whiteboard:
Keywords:

Depends on:	36580
Blocks:	41037 41371

Reported:	2010-04-26 11:30 UTC by Lars Aronsson
Modified:	2013-06-17 13:43 UTC
CC List:	14 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Lars Aronsson 2010-04-26 11:30:13 UTC

On Wikimedia Commons (i.e. the version running there), the file
File:Finlands Allmänna Tidning 1820-01-03.pdf
doesn't show any page images.
Adding ?action=purge to the URL doesn't help.

No explanation is given.
In an offline reader, the PDF looks fine.

If the images are encoded in a way that MediaWiki can't handle,
the user would be helped by an error message that gives
instructions on which image encodings are supported.

Comment 1 Lars Aronsson 2010-04-26 12:08:18 UTC

On the Commons:Village_pump I was told how to view the
error message (this was not trivial and leaves room
for improvement). Apparently:

 Error creating thumbnail: GPL Ghostscript 8.61: Unrecoverable error, exit code 1
 convert: no decode delegate for this image format `/tmp/magick-XXP4reva'.

However, the offline PDF viewer "evince" that comes with
Ubuntu Linux had no problem viewing this PDF (images+text),
and "pdfimages" also succeeds in extracting the images,
so it should be possible with free software.

Comment 2 Bawolff (Brian Wolff) 2010-04-27 07:06:25 UTC

The error message makes it slightly sound like a font problem maybe(?) since according to the ghostscript faq ( http://pages.cs.wisc.edu/~ghost/doc/gnu/7.05/Issues.htm ):

When CIDFont-CMap pair required by PDF file is not available GS fails with:
/undefinedresource in --findresource--

and there's all sorts of font-related stuff on the operand stack, but I don't know much about PDFs, so that is a wild guess.

-------
Anyways, here's the actual output from ghostscript when run on the command line (page 1 seems to print fine before it all blows up):


Processing pages 1 through 4.
Page 1
Substituting CID font resource/Adobe-Identity for /Arial.
Error: /undefinedresource in findresource
Operand stack:
   --nostringval--   --dict:8/17(L)--   FontU   56.41   --dict:6/6(L)--   --dict:6/6(L)--   ArialUnicodeMS-Identity-H   --dict:9/12(ro)(G)--   --nostringval--   --dict:6/6(L)--   --dict:6/6(L)--   Adobe-Identity   CIDFont   Adobe-Identity
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1905   1   3   %oparray_pop   1904   1   3   %oparray_pop   1888   1   3   %oparray_pop   --nostringval--   --nostringval--   2   1   4   --nostringval--   %for_pos_int_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   false   1   %stopped_push   --nostringval--   %loop_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %loop_continue   --nostringval--   1856   13   10   %oparray_pop   findresource   %errorexec_pop   --nostringval--   --nostringval--   --nostringval--
Dictionary stack:
   --dict:1151/1684(ro)(G)--   --dict:1/20(G)--   --dict:97/200(L)--   --dict:97/200(L)--   --dict:108/127(ro)(G)--   --dict:275/300(ro)(G)--   --dict:22/25(L)--   --dict:4/6(L)--   --dict:21/40(L)--   --dict:6/8(L)--   --dict:38/40(ro)(G)--
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.62: Unrecoverable error, exit code 1
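
As an illustration of how output like the above could be turned into something more readable for the uploader, here is a minimal Python sketch. It pulls the PostScript error name and the operator out of a line such as "Error: /undefinedresource in findresource". The function name summarize_gs_error and the hint texts are hypothetical; they are not part of MediaWiki or Ghostscript.

```python
import re

# Matches lines such as "Error: /undefinedresource in findresource".
_GS_ERROR = re.compile(r"Error: /(\w+) in (\S+)")

# Hypothetical hints, keyed by PostScript error name.
_HINTS = {
    "undefinedresource": "a resource (often a font or CMap) required by the file was not found",
    "syntaxerror": "the interpreter could not parse part of the file",
}


def summarize_gs_error(output):
    """Return a one-line summary of the first Ghostscript error, or None."""
    match = _GS_ERROR.search(output)
    if match is None:
        return None
    name, operator = match.groups()
    hint = _HINTS.get(name, "no further hint available")
    return f"Ghostscript error '{name}' in '{operator}': {hint}"
```

Calling summarize_gs_error on the log above would report the undefinedresource error raised in findresource; output without an "Error:" line yields None.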

Comment 3 Lars Aronsson 2010-05-07 04:27:09 UTC

No fonts should be needed to extract scanned images from a PDF, so maybe the use of Ghostscript is the problem, and we should use pdfimages instead?
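
A rough sketch of what calling pdfimages from a script might look like, assuming the pdfimages tool from xpdf/poppler is installed and on the PATH. The helper names pdfimages_command and extract_images are made up for this example. Note that pdfimages extracts the images embedded in the file rather than rendering pages, so this only suits scans where each page is essentially one image.

```python
import subprocess


def pdfimages_command(pdf_path, output_prefix, first_page=None, last_page=None):
    """Build the argument list for a pdfimages run."""
    cmd = ["pdfimages", "-j"]  # -j: write DCT (JPEG) images as .jpg files
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to scan
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to scan
    cmd += [pdf_path, output_prefix]
    return cmd


def extract_images(pdf_path, output_prefix, **page_range):
    """Run pdfimages; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(pdfimages_command(pdf_path, output_prefix, **page_range), check=True)
```

For example, extract_images("scan.pdf", "page", first_page=1, last_page=1) would write the images found on page 1 to files whose names start with "page".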

Comment 4 Smallman 2010-12-25 16:39:07 UTC

I also have a PDF that isn't thumbnailing at Commons:
File:EAA2 Mississippi River Delta.pdf

When I try to create a thumbnail, it gives
"Error creating thumbnail: convert: no decode delegate for this image format `/tmp/magick-XXKuSImy' @ error/constitute.c/ReadImage/532.
convert: missing an image filename `/mnt/thumbs/wikipedia/commons/thumb/9/92/EAA2_Mississippi_River_Delta.pdf/page1-557px-EAA2_Mississippi_River_Delta.pdf.jpg' @ error/convert.c/ConvertImageCommand/2970."

Comment 5 Peter Youngmeisterarius 2011-10-14 23:21:38 UTC

This is referenced by RT #1175, which is now closed.

This can probably be closed, but needs verification.

Comment 6 Rob Lanphier 2012-05-08 01:51:07 UTC

Hmm, doesn't seem to be solved by the 8.71 upgrade (bug 26388), and this isn't fixed by 9.04 either ("Error: /syntaxerror in -file-GPL Ghostscript 9.04: Unrecoverable error, exit code 1"), so it doesn't seem likely that 9.05 is going to fix this (bug 36580).  Someone should probably test this with the very latest version of Ghostscript, and if it's broken there, too, report a bug upstream (see http://www.ghostscript.com/ )

Comment 7 Nemo 2012-10-25 09:44:48 UTC

Still there. The PDF opens correctly on my machine and a user successfully converted it to https://commons.wikimedia.org/wiki/File:Finlands_Allm%C3%A4nna_Tidning_1820-01-03.djvu

Comment 8 Bawolff (Brian Wolff) 2012-10-26 17:08:14 UTC

(In reply to comment #7)
> Still there. The PDF opens correctly on my machine and a user successfully
> converted it to

Correctly on your machine with ghostscript or using some other program?

Comment 9 Nemo 2012-10-26 17:18:22 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > Still there. The PDF opens correctly on my machine and a user successfully
> > converted it to
> 
> Correctly on your machine with ghostscript or using some other program?

I had tried okular, but gs works too.

$ ghostscript Finlands_Allmänna_Tidning_1820-01-03.pdf 
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 4.
Page 1
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
 
Page 2
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
Page 3
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<
Page 4
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.
>>showpage, press <return> to continue<<

Comment 10 Bawolff (Brian Wolff) 2012-10-26 20:42:07 UTC

>I had tried okular, but gs works too.

Ok, that implies that the issue was fixed upstream and an upgrade to ghostscript would fix the issue.

Adding keyword ops.

Comment 11 Nemo 2012-10-26 20:58:50 UTC

(In reply to comment #10)
> >I had tried okular, but gs works too.
> 
> Ok, that implies that the issue was fixed upstream and an upgrade to
> ghostscript would fix the issue.

I don't think so. We're already on 9.05...

Comment 12 Marco 2012-11-26 13:02:08 UTC

I fixed the PDF (workaround)

Comment 13 Andre Klapper 2013-03-15 11:36:28 UTC

== Testcase in Comment 0 ==

Trying https://upload.wikimedia.org/wikipedia/commons/archive/1/19/20121126125750%21Finlands_Allm%C3%A4nna_Tidning_1820-01-03.pdf in Ghostscript 9.06 from 2012-08-08 on a Fedora 18 machine I get:

   **** Warning: File has unbalanced q/Q operators (too many q's)
   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

Hence I don't see any valid bug report here and nothing that could be fixed on Wikimedia's side.  => Closing as INVALID.

A bug report should be filed against the tool the PDF was created with (unfortunately not exposed in its metadata).

If anybody thinks that GhostScript should be more forgiving, feel free to report a request at http://bugs.ghostscript.com/ .


== Testcase in Comment 4 ==

No problems reproducible, thumbnail shown, no issues in Ghostscript. Might have been a different issue that somehow disappeared.

Comment 14 Lars Aronsson 2013-03-15 12:01:51 UTC

Andre, did you read the comments? In comment 12, Marco replaced
the original PDF with a modified PDF. That doesn't remove this bug,
which is that MediaWiki fails to generate thumbnails or a proper
error message for the original PDF. The original PDF still displays
properly in other software, so it is not broken.

Comment 15 Andre Klapper 2013-03-15 12:25:13 UTC

(In reply to comment #14)
> Andre, did you read the comments? In comment 12, Marco replaced
> the original PDF with a modified PDF.

That's why I tested with the old PDF.

> That doesn't remove this bug

This report covers a few things. One is the problem that sometimes thumbnails are not created for PDF files that Ghostscript considers to be invalid.

If this report is about the aspect "Provide some error message in the browser" then it is not fixed, indeed, but I consider this aspect ("Expose readable error messages in the browser") to be covered in bug 23831 already.

> The original PDF still displays
> properly in other software, so it is not broken.

So far the issue was the missing thumbnail, not how the PDF itself displays in other software. Ghostscript says the PDF file is broken and we use Ghostscript.
I can imagine that other software is more forgiving.
If you know that the PDF file is not broken and hence consider the error message in GhostScript wrong or misleading it would be best to discuss this with the GhostScript developers. See the link in comment 13.

Comment 16 Lars Aronsson 2013-03-15 12:55:59 UTC

"Ghostscript says the PDF file is broken and we use Ghostscript."
With that logic, you can say "and we use MediaWiki 1.5", and stop
improving anything. Why should we report bugs anymore? Already in
comment 1, I suggested that perhaps we should use pdfimages
(which does work) instead of ghostscript (which is overly picky).

But if the file is indeed broken, then Ghostscript should be used
as a validator during upload and refuse to accept this broken file.
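
One way such an upload check could look, sketched in Python under the assumption that a gs binary is on the PATH. The function names are hypothetical, and the marker strings are simply the Ghostscript messages quoted elsewhere in this report; this is not how MediaWiki handles uploads.

```python
import subprocess

# Markers taken from the Ghostscript messages quoted in this report.
_DAMAGE_MARKERS = (
    "This file had errors that were repaired or ignored",
    "Unrecoverable error",
)


def gs_check_command(pdf_path):
    """Interpret the whole file without writing any page output."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=nullpage", pdf_path]


def looks_damaged(exit_code, output):
    """True if Ghostscript failed or reported that it had to repair the file."""
    return exit_code != 0 or any(marker in output for marker in _DAMAGE_MARKERS)


def validate_upload(pdf_path):
    """Return True if Ghostscript processed the file without complaint."""
    result = subprocess.run(
        gs_check_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # warnings may arrive on either stream
        universal_newlines=True,
    )
    return not looks_damaged(result.returncode, result.stdout)
```

Whether a file that Ghostscript merely repaired should be refused outright, or accepted with a warning, is a policy choice that this sketch leaves open.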

Comment 17 Andre Klapper 2013-03-15 13:19:11 UTC

I think there is a misunderstanding here. 

We are responsible for MediaWiki and this is the canonical, "upstream" bugtracker for MediaWiki, so we of course accept reports and fix bugs for it. 

So far I have no reason to not believe the output of GhostScript that the specific PDF file is invalid. Again, if you think that GhostScript is wrong, the GhostScript developers need to be contacted "upstream", but I haven't seen any indication that it's a bug in GS so far. We use 3rd party software in many places (like PDF handling) to not reinvent the wheel (the related term is "downstream" - just mentioning the concept here, as I don't know how much open source background you have).

> I suggested that perhaps we should use pdfimages (which does work)
>  instead of ghostscript (which is overly picky).

That's worth a separate enhancement request, please file it in this Bugzilla so it can be considered.

> But if the file is indeed broken, then Ghostscript should be used
> as a validator during upload and refuse to accept this broken file.

That's another pretty good idea, and worth another separate request. :)

In general only one issue per report should be handled, and this report is about a specific PDF file testcase that does not show a thumbnail, and from all I know so far the reason is that the PDF file is broken, so there's nothing to do server-/software-side (yet) for Wikimedia developers. Hence I closed this as INVALID. This does not mean that things could not be improved in several ways via several involved parties in the long run, but that's out of scope for this specific issue.

Comment 18 Jarek Tuszynski 2013-05-07 16:23:43 UTC

We seem to have 3200 affected files in [[Category:PDF files affected by MediaWiki restrictions]]

Comment 19 Marco 2013-05-07 18:02:11 UTC

(In reply to comment #18)
> [[Category:PDF files affected by MediaWiki restrictions]]
-> not en, Commons:
https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

Comment 20 Lars Aronsson 2013-05-30 00:03:25 UTC

Long, long ago, I was a little interested in getting Wikisource
to work, and so, when I found something that didn't work, I used
to file bug reports like this one (in April 2010, mind you).

However, the tendency of every little problem to become huge and
impossible to solve has removed most of my previous interest.
Comment #17 above is one very typical example of how this happens.

Three years have passed. I leave it to others to try to get
Wikisource to work. I have another project to work on.
Have a good life.

Comment 21 Nemo 2013-05-30 07:02:18 UTC

(In reply to comment #20)
> Long, long ago, I was a little interested in getting Wikisource
> to work, and so, when I found something that didn't work, I used
> to file bug reports like this one (in April 2010, mind you).
> 
> However, the tendency of every little problem to become huge and
> impossible to solve has removed most of my previous interest.
> Comment #17 above is one very typical example of how this happens.
> 
> Three years have passed. I leave it to others to try to get
> Wikisource to work. I have another project to work on.
> Have a good life.

Indeed this is very frustrating. As domas said on some other bug, it doesn't matter whose fault it is; what matters is that the site is broken for the users (and readers).

I think it's useful to discover that the problem lies in some PDF error, that might even be something users can "easily" solve themselves without waiting years for a bug fix; if we decide not to work around library restrictions, though, this doesn't make the problem disappear.
In other words, what are users supposed to do in order to fix those PDFs? Are there standard commands to do so? We could for instance run a bot on Commons (this bug would be moved to Wikimedia>General), or at least make the error more useful.

Comment 22 Andre Klapper 2013-05-30 09:06:09 UTC

(In reply to comment #21)
> I think it's useful to discover that the problem lies in some PDF error, that
> might even be something users can "easily" solve themselves without waiting
> years for a bug fix; 

How?

> if we decide not to work around library restrictions,
> though, this doesn't make the problem disappear.

Which library restriction would you exactly like to work around here and how?

What exactly was the incentive for reopening this bug report? We cannot easily fix broken or damaged PDF files that were uploaded, so what is the expectation?
(The feature requests in comment 16 should be separate bug reports, as I wrote in comment 17.) As I wrote before, a Ghostscript update very likely won't fix the issue in comment 0, and the different issue in comment 4 vanished.

If you are after better error message propagation etc., please make that a different enhancement request. For the testcases in comment 0 and comment 4 on this bug report, I still consider this bug report INVALID.

Comment 23 Nemo 2013-05-30 09:13:54 UTC

(In reply to comment #22)
> (In reply to comment #21)
> > I think it's useful to discover that the problem lies in some PDF error, that
> > might even be something users can "easily" solve themselves without waiting
> > years for a bug fix; 
> 
> How?

pdfimages works, apparently.

Comment 24 Marco 2013-05-30 09:30:34 UTC

[editconflict]

I had a look at a non-representative amount of PDF files taken from https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I encountered the following problems:
* PDF renders fine on local machine but does not on the servers due to resource limitations. Ex: https://commons.wikimedia.org/wiki/File:Banner30a%C3%B1os.pdf & https://commons.wikimedia.org/w/index.php?title=File:Cox_and_box.pd
-> There is no "real" fix for those files. One could raise the limits, but this would have an impact on server performance.
* Corrupt PDF. Such as "File did not complete the page properly and may be damaged." or "Please notify the author of the software that produced this file that it does not conform to Adobe's published PDF specification." Ex: https://commons.wikimedia.org/wiki/File:Boca_y_su_historia.pdf & https://commons.wikimedia.org/wiki/File:Commons_upload_and_my_uploads_android_workflows.pdf
-> Possible fix: Repair those files by bot, or use other software that is less strict when processing PDF files. Though changing the viewer could also introduce more problems or new bugs...
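
For the "repair by bot" idea, one commonly used approach is to let Ghostscript re-write the file through its pdfwrite device. The sketch below assumes a gs binary on the PATH; the function names are hypothetical, and nothing in this report says that this is how any file was actually repaired. Re-writing can change the file (for example fonts or image compression), so the result would need to be checked before replacing an upload.

```python
import subprocess


def gs_rewrite_command(src_path, dst_path):
    """Build a Ghostscript call that re-writes src_path as a new PDF."""
    # -o sets the output file and implies -dBATCH -dNOPAUSE;
    # pdfwrite is Ghostscript's PDF output device.
    return ["gs", "-o", dst_path, "-sDEVICE=pdfwrite", src_path]


def rewrite_pdf(src_path, dst_path):
    """Return True if Ghostscript exited cleanly; the output still needs review."""
    result = subprocess.run(gs_rewrite_command(src_path, dst_path))
    return result.returncode == 0
```

This would not help with the first class of files above, which fail because of resource limits rather than damage.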

Comment 25 Marco 2013-05-30 09:51:16 UTC

(In reply to comment #23)
> pdfimages works, apparently.

$: man pdfimages
Pdfimages  saves images from a Portable Document Format...

pdfimages does not save a PDF file as JPEG. It only extracts images from PDF files!?

Comment 26 Marco 2013-06-17 13:32:44 UTC

(In reply to comment #18)
> We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I fixed 99% of all files.

Comment 27 Nemo 2013-06-17 13:43:41 UTC

(In reply to comment #26)
> (In reply to comment #18)
> > We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions
> 
> I fixed 99% of all files.

Wonderful, let's consider this bug fixed (you deserve a medal!). There are two more bugs opened for some of the remaining files, which probably hit the resource limitations you mentioned.
Making MediaWiki work with such files by using lpr/CUPS or whatever would also be another request.
