Last modified: 2014-02-21 00:41:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T33775, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 31775 - Improved version of PdfHandler
Improved version of PdfHandler
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
PdfHandler (Other open bugs)
unspecified
All All
: Low enhancement (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks: Wikisource
  Show dependency treegraph
 
Reported: 2011-10-17 13:31 UTC by Vitaliy Filippov
Modified: 2014-02-21 00:41 UTC (History)
3 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Vitaliy Filippov 2011-10-17 13:31:08 UTC
I want to suggest some improvements for PdfHandler:
* Respect page rotation (fix for Bug 22194 - pdfinfo doesn't respect it) - it's very simple to get page information from PHP code, so I've done it.
* Direct rendering without ImageMagick rescaling step, also fixes possible memory/tempfile/performance issues on high DPIs.
* Ghostscript options for better rendering (-dUseCropBox -dTextAlphaBits=4 -dGraphicsAlphaBits=4)
* Thumbnails now link to correct individual pages of PDF.

I have commit access to extensions now, am I allowed to create my own branch for review? Or where should I commit my work for review? (The most dubious of all this is probably the multiple page thumbnail feature)
Comment 1 Vitaliy Filippov 2011-10-17 13:35:18 UTC
Forgotten another one:
* Do not escape $wgPdfProcessor to allow specifying 'nice -n 20 gs' for lower ghostscript process priority, for example
Comment 2 Platonides 2011-10-17 15:27:59 UTC
Yes, you can create your own branch. Or even make the changes in trunk if they are straightforward enough (I recommend you separate the changes on its own revision, don't perform a big commit fixing a dozen different things).

$wgPdfProcessor needs to be escaped. What if I have gs at C:\Program files\Ghostview\gs.exe ?
Comment 3 Vitaliy Filippov 2011-10-18 14:01:54 UTC
I think that you can specify it already escaped in the configuration ?) I.e. it's still possible to use it.
The problem is that ghostscript can create relatively big load on the server, it's very useful to run it niced, but escaping forbids it. :(
On the other hand, one could use shell script with "nice -n 20 gs $*" instead of gs itself...
...
Comment 4 Vitaliy Filippov 2011-10-21 15:59:12 UTC
I discovered that it was too naive for me to think I can easily implement page size getting which works correct on all PDFs, so I returned to using pdfinfo, but had to install a patched version on all our servers.

It's a pain to separate the changes after they are already done :( So I've committed a single r100426 to the branch... Please give me some review? :)
Comment 5 Brion Vibber 2011-10-21 23:26:23 UTC
Made a few comments on the rev -- I'd recommend breaking this into several separate bugs and treating them separately.

The page rotation looks like it requires patching third-party programs -- this would seem to mean that it won't work with a default poppler-utils install, and would make deploying it a lot harder.

There are a bunch of introductions of register_globals vulnerabilities -- do not want. :)

I'm a bit unsure what this multiple-thumbnail thing is for; it's not even mentioned in your descriptions. Not sure if this is a desirable thing, and if it is I'm not sure how much difficulty it would cause with compatibility. Thumbnails are rendered via a 404 handler on major installs like Wikimedia, and we need consistent naming that's understood on the back and front end; this may not work with that.

What's the benefits of the direct rendering from ImageMagick? Are there any reasons that wasn't done in the first place, such as file size, memory usage, etc?

What does "Thumbnails now link to correct individual pages of PDF." actually mean?
Comment 6 Vitaliy Filippov 2011-10-27 21:22:55 UTC
(In reply to comment #5)
> The page rotation looks like it requires patching third-party programs -- this
> would seem to mean that it won't work with a default poppler-utils install, and
> would make deploying it a lot harder.

I've sent a bug to poppler's bugtracker: https://bugs.freedesktop.org/show_bug.cgi?id=41867
Hope the patch will be included some day. :)

> There are a bunch of introductions of register_globals vulnerabilities -- do
> not want. :)

Oops, thank you very much! I've understood at last, why so much extensions disallow setting parameters before inclusion... =) although register_globals is off on every "self-respecting" web server :)

> I'm a bit unsure what this multiple-thumbnail thing is for; it's not even
> mentioned in your descriptions. Not sure if this is a desirable thing, and if
> it is I'm not sure how much difficulty it would cause with compatibility.
> Thumbnails are rendered via a 404 handler on major installs like Wikimedia, and
> we need consistent naming that's understood on the back and front end; this may
> not work with that.

For example, insert all slides of a presentation to a wiki page...
I agree it's a sort of specific feature.

> What's the benefits of the direct rendering from ImageMagick? Are there any
> reasons that wasn't done in the first place, such as file size, memory usage,
> etc?

Increased performance, reduced memory usage and no temp file problems.
What about temp files: we've had an issue when handling some PDFs (I can find an example if you want) - gs was creating several ~1Gb files on /tmp and then crashing. So /tmp became full some day :)
And all gs-based pdf viewers suffer - opening such PDFs takes ~4 min on modern  hardware :) at the same time, Adobe Reader displays them normally.
The problem exists mostly when these PDFs are rendered with high DPI setting. So when using imagemagick, you must lower this setting to solve the problem, but this also affects the quality of other images.

> What does "Thumbnails now link to correct individual pages of PDF." actually
> mean?

This means [[File:Something.pdf|page=2]] will link to /wiki/File:Something.pdf?page=2 - not just /wiki/File:Something.pdf
Comment 7 Vitaliy Filippov 2011-10-27 21:48:07 UTC
Reverted !isset. r101066 :)
There was also r100601...

I have a question about 404 handler:
Is it http://www.mediawiki.org/wiki/Extension:WebStore ?
How does it reduce the load? I mean, when the article which includes an image is requested, the browser is anyway expected to download the thumbnail?

And another one about file names:
Multiple page rendering allows to efficiently generate thumbnails for the whole PDF (rather than calling gs multiple times). But imagehandler checks for the existence of ONE file. What would be the correct way to make it compatible with multiple files?
Comment 8 Vitaliy Filippov 2011-10-28 12:51:08 UTC
(In reply to comment #7)
> I have a question about 404 handler:
Or is it purposed to ease distributing the load between different groups of backends?
Comment 9 Platonides 2011-10-28 21:54:05 UTC
No.
See the comments on http://www.mediawiki.org/wiki/Special:Code/MediaWiki/100535

I think you will find there pointers to pretty much everything on 404 handler.
Comment 10 Vitaliy Filippov 2011-11-12 21:28:11 UTC
In the meantime, libpoppler bug that I've filled (https://bugs.freedesktop.org/show_bug.cgi?id=41867) was fixed by developers in a different way - pdfinfo will now also show page rotation in addition to the page size. The patch is included into trunk and it will be in poppler 0.20. So we should update PdfHandler to work with it :)
Comment 11 Vitaliy Filippov 2011-11-16 11:48:57 UTC
I've added the support for newer poppler in r103313 to PdfHandler trunk, also posted a comment to bug 22194.
Comment 12 Andre Klapper 2013-02-06 20:25:32 UTC
(In reply to comment #5)
> I'd recommend breaking this into several
> separate bugs and treating them separately.


Yes! Only one issue per report, please, otherwise it's easy to lose overview, like I did here.

Vitaliy: What's the status of this request? What's still needed to get this fixed?

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links