Last modified: 2014-01-13 04:03:43 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T59278, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 57278 - Issue with PDFs downloaded from Archive.org
Status: RESOLVED WORKSFORME
Product: MediaWiki extensions
Classification: Unclassified
Component: PdfHandler
Version: unspecified
Hardware: All All
Importance: Low minor
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 41037 56295
Reported: 2013-11-20 02:42 UTC by Shiju Alex
Modified: 2014-01-13 04:03 UTC (History)

Description Shiju Alex 2013-11-20 02:42:22 UTC
Hi

I am finding some issues with PDFs downloaded from https://archive.org when we use them with the ProofreadPage extension on Wikisource.

For example, see this file at Archive.org: https://archive.org/details/pazhancholmala_gundert_1845 This file can be viewed properly and downloaded from Archive.org.

I downloaded this file and uploaded it to Commons: https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf If we download the file from Commons, we can also view and read it properly.

Now there are 2 issues with Commons/Proofread/Mediawiki

1. Inside Commons itself (for example, https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf) you can see that you cannot view the pages from this file in higher resolution. 

2. When we create an Index file in Wikisource (for example, https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try to work on a page (for example, https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.pdf/7&action=edit&redlink=1), hardly anything can be made out on the scanned page.


The second issue might be a direct consequence of issue 1. Could you please look into this?

I suspect the issue is closely related to the PDF generation method at Archive.org. But I am not sure about that either, since the PDF file as a whole is perfectly fine.
Comment 1 Sam Reed (reedy) 2013-11-20 02:52:47 UTC
I suspect that this probably isn't a ProofreadPage issue but one in either the PdfHandler extension or, more likely, the tools that render PDF pages to images on the Wikimedia image scalers, namely ghostscript and imagemagick.

Moving to PdfHandler for the time being.

Software on cluster:
reedy@tin:/a/common$ dpkg -l | grep ghostscript
ii  ghostscript                      9.05~dfsg-0ubuntu4.2                interpreter for the PostScript language and for PDF
ii  gs-cjk-resource                  1.20100103-3                        Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@tin:/a/common$ dpkg -l | grep imagemagick
ii  imagemagick                      8:6.6.9.7-5ubuntu3.2                image manipulation programs
ii  imagemagick-common               8:6.6.9.7-5ubuntu3.2                image manipulation programs -- infrastructure


I note a similar output locally too on my dev wiki

reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep ghostscript
ii  ghostscript                          9.10~dfsg-0ubuntu2                  amd64        interpreter for the PostScript language and for PDF
ii  gs-cjk-resource                      1.20100103-3                        all          Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep imagemagick
ii  imagemagick                          8:6.7.7.10-5ubuntu3                 amd64        image manipulation programs
ii  imagemagick-common                   8:6.7.7.10-5ubuntu3                 all          image manipulation programs -- infrastructure



Hopefully it can get triaged a little before being dumped onto the WMF image scaler component....
Comment 2 Shiju Alex 2013-11-20 06:16:48 UTC
I am able to reproduce the issue with another PDF downloaded from Archive.org: https://ml.wikisource.org/w/index.php?title=Page:Dharmaraja_1913.pdf/11&action=edit&redlink=1 Even though in this case we are just able to read the content (with some difficulty), it is not good enough for Wikisource digitization efforts.
Comment 3 Nemo 2013-11-20 08:42:54 UTC
(In reply to comment #0)
> 1. Inside Commons itself (for example,
> https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf) you
> can see that you cannot view the pages from this file in higher resolution. 

How is this unexpected? The PDF has low resolution (and it's only 2 MB), it's correctly displayed.

$ pdfinfo Gundert_Pazhancholmala_1845.pdf 
Title:          Pazhancholmala by Hermann Gundert 1845
Keywords:       http://archive.org/details/pazhancholmala_gundert_1845
Author:         Hermann Gundert
Creator:        Digitized by the Internet Archive
Producer:       Recoded by LuraDocument PDF v2.53
CreationDate:   Mon Sep 16 16:22:18 2013
ModDate:        Mon Sep 16 16:23:29 2013
Tagged:         no
Form:           none
Pages:          147
Encrypted:      no
Page size:      91 x 148 pts
Page rot:       0
File size:      2363482 bytes
Optimized:      yes
PDF version:    1.5

https://catalogd.archive.org/log/177773313 tells me:
Source Gundert_Pazhancholmala_1845_images.zip : "Generic Raw Book Zip"
[...]
INFO: Global image dpi: 600

It's possible that the resolution was guessed incorrectly (unless the pages of this book are very small, 147 pages at 600 dpi can't be only 35 MB): please edit the metadata to add the correct resolution at which the images were produced, see the fixed-ppi instructions at <https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive>
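
To make the size argument concrete, here is a minimal Python sketch that converts the "Page size" line from the pdfinfo output above into inches and into the pixel dimensions implied by a given dpi. The function names are made up for illustration, and the pixel figures are only what the declared page size and the logged 600 dpi would imply; the real dimensions of the embedded images are not stated in this report.

```python
import re

POINTS_PER_INCH = 72  # PDF unit: 1 pt = 1/72 inch


def parse_page_size(pdfinfo_output):
    """Return (width, height) in points from pdfinfo's 'Page size' line."""
    match = re.search(
        r"^Page size:\s+([\d.]+) x ([\d.]+) pts", pdfinfo_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Page size' line found")
    return float(match.group(1)), float(match.group(2))


def pixels_at(points, dpi):
    """Pixels needed to cover a length given in points at the given dpi."""
    return round(points / POINTS_PER_INCH * dpi)


# Lines copied from the pdfinfo output quoted in this comment.
sample = "Pages:          147\nPage size:      91 x 148 pts\nPage rot:       0\n"

width_pt, height_pt = parse_page_size(sample)
print(f"{width_pt / POINTS_PER_INCH:.2f} x {height_pt / POINTS_PER_INCH:.2f} in")
# prints "1.26 x 2.06 in"
print(pixels_at(width_pt, 600), "x", pixels_at(height_pt, 600), "px at 600 dpi")
# prints "758 x 1233 px at 600 dpi"
```

A declared page of roughly 1.26 x 2.06 inches is far smaller than a printed book page would normally be, which fits the suspicion that the dpi value was guessed too high when the PDF was derived.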

> 
> 2. When we create Index file in Wikisource (for example,
> https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try
> to work on a page (for example,
> https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.
> pdf/7&action=edit&redlink=1)
> you can see that nothing much can be seen on the scanned page. 

What is it that you don't see there? The text isn't loaded, but this is expected because, as you know very well, there is no OCR. I also see the image from the PDF correctly, in my case https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Dharmaraja_1913.pdf/page11-500px-Dharmaraja_1913.pdf.jpg which according to wget -S has Last-Modified: Wed, 20 Nov 2013 03:16:52 GMT, so it may have been created when someone else clicked the link in comment 0. Do you still not see an image there? If you don't, is it consistent on all pages?
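
The Last-Modified check can be sketched in Python as below. The header value is the one quoted above and the report time is the timestamp of the Description; the helper names and the User-Agent string are hypothetical, and the network helper is only defined here, not called.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url):
    """HEAD request returning the Last-Modified header (or None).

    Roughly what `wget -S` shows; the User-Agent value is a placeholder.
    """
    request = Request(url, method="HEAD", headers={"User-Agent": "thumb-check-sketch/0.1"})
    with urlopen(request) as response:
        return response.headers.get("Last-Modified")


def seconds_after(reference_utc, last_modified_header):
    """Seconds between a reference time and an HTTP Last-Modified value."""
    modified = parsedate_to_datetime(last_modified_header)
    return (modified - reference_utc).total_seconds()


# Time the bug was reported (Description timestamp, UTC).
reported = datetime(2013, 11, 20, 2, 42, 22, tzinfo=timezone.utc)
delta = seconds_after(reported, "Wed, 20 Nov 2013 03:16:52 GMT")
print(delta / 60)  # prints 34.5
```

So the thumbnail was rendered about 34.5 minutes after the report was filed, which is consistent with it having been generated on demand by someone following a link rather than at upload time.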
Comment 4 Andre Klapper 2014-01-09 12:27:30 UTC
Shiju Alex: Can you please answer Nemo's questions in comment 3?

> What is that you don't see there? Do you
> still not see an image there? If you don't, is it consistent on all pages?


Looking for actionable items, I currently only see this:

(In reply to comment #3 by Nemo)
> It's possible that the resolution was guessed incorrectly
Comment 5 Bawolff (Brian Wolff) 2014-01-09 12:57:39 UTC
(In reply to comment #0)
>
> 2. When we create Index file in Wikisource (for example,
> https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try
> to work on a page (for example,
> https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.
> pdf/7&action=edit&redlink=1)
> you can see that nothing much can be seen on the scanned page. 
> 
> 
>

Are you sure that the PDF has an OCR layer (you can test by opening it in a PDF viewer and seeing if you can select/copy text in the document)? PdfHandler seems to think all the pages are blank - https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Pazhancholmala_Gundert_1845.pdf (scroll down to the text property)
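
The select/copy test can also be approximated from the command line. The Python sketch below assumes poppler's pdftotext is installed and relies on it ending each page with a form feed character; the function names are invented for illustration and this is not necessarily how PdfHandler itself decides that a page is blank.

```python
import subprocess


def extract_text(pdf_path):
    """Run `pdftotext <file> -` and return everything it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def split_pages(pdftotext_output):
    """Split pdftotext output into per-page strings on the form feed."""
    pages = pdftotext_output.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def blank_page_numbers(pages):
    """1-based numbers of pages whose text is empty or only whitespace."""
    return [n for n, text in enumerate(pages, start=1) if not text.strip()]


# Made-up output for a 4-page file where pages 2 and 3 carry no text.
pages = split_pages("Title page\n\f\f  \n\fBody text\n\f")
print(blank_page_numbers(pages))  # prints [2, 3]
```

For a real file, `blank_page_numbers(split_pages(extract_text("some.pdf")))` returning every page number would mean there is no usable text layer for ProofreadPage to load.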
Comment 6 Bawolff (Brian Wolff) 2014-01-13 04:03:43 UTC
Closing as WORKSFORME.

I downloaded the file, and looked at it with various tools:
*The text layer appears to be empty. It has no OCR data, hence ProofreadPage cannot retrieve the text of the document. (ProofreadPage doesn't do OCR; it only extracts what is embedded in the document.)
*The file does have a low resolution. Other PDF tools also display it very small.

(If you think there's still a bug here, please re-open)
