Last modified: 2013-10-09 07:24:34 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T37122, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 35122 - pdftotext should be poppler version not xpdf version on wikisource


Summary:	pdftotext should be poppler version not xpdf version on wikisource

Status:	RESOLVED FIXED

Product:	Wikimedia
Classification:	Unclassified
Component:	Extension setup
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 1 vote
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	ops

Duplicates:	32064 34540
Depends on:
Blocks:	Wikisource 41037

Reported:	2012-03-10 11:07 UTC by Lars Aronsson
Modified:	2013-10-09 07:24 UTC
CC List:	12 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Lars Aronsson 2012-03-10 11:07:52 UTC

[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]]
or
http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
in the new version uploaded March 10, 2012,
is a PDF/A file with page images and OCR text layer, generated
from ABBYY Finereader OCR software.

The program pdftotext extracts the OCR text layer, which for the
first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n".
(This text contains a few OCR errors, such as the initial "Д", which
is a misinterpreted "А", but this is entirely normal.)

The pdftotext output, piped through "od -c" begins:
 0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
 0000020 320 271   .  \n  \n 320 242     320 240     320 220     320 223
 0000040     320 225     320 224     320 206     320 257  \n 320 222 320
 0000060 252     320 277 321 217 321 202 320 270     320 224 321 242 320
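To check that these bytes really are well-formed UTF-8, the octal values from the first line of the dump can be decoded directly. This is only a minimal sketch: the function name decode_octal_dump is made up for illustration, and the byte values are copied by hand from the dump above.

```python
# Octal byte values copied from the first line of the "od -c" dump above
# (up to, but not including, the period).
OCTAL_BYTES = (
    "320 224 320 275 320 275 321 263 320 275 "
    "321 201 320 272 321 226 320 271"
)


def decode_octal_dump(octal_values: str) -> str:
    """Turn space-separated octal byte values into text, assuming UTF-8.

    Hypothetical helper for illustration only. It raises
    UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    raw = bytes(int(value, 8) for value in octal_values.split())
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(decode_octal_dump(OCTAL_BYTES))  # prints: Дннѳнскій
```

The result matches the first word quoted from the OCR layer, so the poppler pdftotext output is already valid UTF-8 before any further processing.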

However, when the ProofreadPage extension tries to extract the text,
using the PdfHandler, the text passes through UtfNormal::cleanUp()
(line 140 of source file extensions/PdfHandler/PdfHandler.image.php),
and only the period, newline, some hyphens and digits come through.
Try this at the Russian Wikisource, by clicking the red-linked page numbers,
http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
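As a rough illustration of what a clean-up step of this kind does, the sketch below replaces invalid UTF-8 sequences with U+FFFD and converts the result to normal form C. It is a Python approximation written for this report, not the PHP code of UtfNormal::cleanUp(), and the name clean_up is an assumption.

```python
import unicodedata


def clean_up(raw: bytes) -> str:
    """Approximate a UTF-8 clean-up pass (illustrative stand-in only).

    Invalid byte sequences become U+FFFD, then the text is
    normalised to NFC.
    """
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    good = "Т Р А Г Е Д І Я".encode("utf-8")
    print(clean_up(good))        # valid Cyrillic passes through unchanged
    print(clean_up(b"\xd0."))    # a truncated sequence becomes U+FFFD
```

Under these assumptions, correctly encoded Cyrillic survives the clean-up untouched, which suggests the letters are lost before this step rather than in it.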

Pages are correctly split on \f (form feed).
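The page split can be sketched as follows. This assumes that the extractor ends every page, including the last one, with a form feed. The function name split_pages is hypothetical and is not taken from PdfHandler.

```python
from typing import List


def split_pages(extracted_text: str) -> List[str]:
    """Split extracted text into per-page strings on form feed (\\f).

    Assumption: each page ends with a form feed, so the piece after
    the final form feed is empty and is dropped.
    """
    pages = extracted_text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    sample = "page one\n\fpage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(page))
```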

Comment 1 Lars Aronsson 2012-03-10 11:26:20 UTC

I should add that I run Ubuntu Linux 11.10, where pdftotext -? says:
 pdftotext version 0.16.7
 Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
 Copyright 1996-2004 Glyph & Cog, LLC

The above version successfully extracts the text.

A different version, which fails to extract letters, is included in xpdf 3.02, which says:
 pdftotext version 3.02
 Copyright 1996-2007 Glyph & Cog, LLC
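The two banners above are enough to tell the builds apart. The sketch below classifies a banner by looking for the Poppler copyright line first, since the poppler banner also carries the Glyph & Cog line. The function names are made up for illustration, and the helper that runs the binary assumes the banner may arrive on either stdout or stderr.

```python
import subprocess


def classify_pdftotext_banner(banner: str) -> str:
    """Return "poppler", "xpdf" or "unknown" for a version banner.

    Hypothetical helper: Poppler must be checked first, because its
    banner also contains the older Glyph & Cog copyright line.
    """
    if "Poppler" in banner:
        return "poppler"
    if "Glyph & Cog" in banner:
        return "xpdf"
    return "unknown"


def installed_pdftotext_flavour() -> str:
    """Run "pdftotext -v" and classify whatever it prints."""
    result = subprocess.run(
        ["pdftotext", "-v"], capture_output=True, text=True, check=False
    )
    # Combine both streams rather than assume which one carries the banner.
    return classify_pdftotext_banner(result.stdout + result.stderr)


POPPLER_BANNER = (
    "pdftotext version 0.16.7\n"
    "Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org\n"
    "Copyright 1996-2004 Glyph & Cog, LLC\n"
)
XPDF_BANNER = "pdftotext version 3.02\nCopyright 1996-2007 Glyph & Cog, LLC\n"

if __name__ == "__main__":
    print(classify_pdftotext_banner(POPPLER_BANNER))  # prints: poppler
    print(classify_pdftotext_banner(XPDF_BANNER))     # prints: xpdf
```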

Comment 2 Marcin Cieślak 2012-03-10 11:33:53 UTC

pdftotext version 3.02

from the xpdf-3.02 package produces nice garbage, mainly spaces, dots and other ASCII punctuation.

Comment 3 Lars Aronsson 2012-03-10 13:27:51 UTC

Update: The "red-linked page numbers" mentioned above are not red links anymore.

OCR text extracted by Proofread Page, on WMF servers,
http://ru.wikisource.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf/1&oldid=704348

OCR text extracted correctly (on my local computer), and uploaded by bot,
http://ru.wikisource.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf/1&oldid=704352

Comment 4 Mark A. Hershberger 2012-03-12 11:32:30 UTC

I've updated the summary to make it clearer what is needed.  Let me know if I have that right and I'll open an RT ticket.

Comment 5 Marcin Cieślak 2012-03-12 13:17:40 UTC

Yes, it's fine. The xpdf version thing is just our theory. We have no idea which version of pdftotext is running really.

Comment 6 Sam Reed (reedy) 2012-03-12 14:26:27 UTC

(In reply to comment #5)
> Yes, it's fine. The xpdf version thing is just our theory. We have no idea
> which version of pdftotext is running really.

reedy@fenari:~$ pdftotext -v
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC

Comment 7 Mark A. Hershberger 2012-03-14 18:09:16 UTC

Interesting that the latest Ubuntu doesn't have pdftotext from xpdf;
Lucid has it in xpdf-utils.

Comment 8 Marcin Cieślak 2012-03-14 19:36:32 UTC

A newer version of pdftotext is available from poppler-utils, although its version numbers are lower (now at 0.18.4):

http://packages.ubuntu.com/precise/poppler-utils

http://poppler.freedesktop.org/

Usually you have to get rid of xpdf to use poppler.

Comment 9 Mark A. Hershberger 2012-03-14 20:06:40 UTC

mah@lucid:~$ xpdf -v    
xpdf version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
mah@lucid:~$ pdftotext -v
pdftotext version 0.12.4
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
mah@lucid:~$ dpkg -l xpdf xpdf-reader poppler-utils xpdf-utils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                           Version                        
+++-==============================-==============================-
ii  poppler-utils                  0.12.4-0ubuntu5.2              
ii  xpdf                           3.02-2ubuntu1.1                
ii  xpdf-reader                    3.02-2ubuntu1.1                
un  xpdf-utils                     <none>                         

https://rt.wikimedia.org/Ticket/Display.html?id=2631

Comment 10 Beau 2012-05-05 08:38:30 UTC

*** Bug 34540 has been marked as a duplicate of this bug. ***

Comment 11 Beau 2012-05-07 21:22:55 UTC

*** Bug 32064 has been marked as a duplicate of this bug. ***

Comment 12 Andre Klapper 2013-07-25 18:13:19 UTC

This was fixed when MediaWiki boxes were upgraded to Ubuntu Precise (which happened a few months ago). Faidon checked that on a Precise box poppler-utils is indeed installed instead of xpdf-utils.

Closing as FIXED.

Comment 13 Pikne 2013-10-06 08:19:45 UTC

Examples in bug 34540 and bug 32064 still show foreign characters as �. Any chance that the fix isn't deployed yet? Or are these other bugs not really duplicates?

Comment 14 Andre Klapper 2013-10-06 21:05:29 UTC

(In reply to comment #13)
> Examples in bug 34540 and bug 32064 still show foreign characters as �.
> Any chance that the fix isn't deployed yet? Or are these other bugs not
> really duplicates?

I don't know the implementation details of this functionality, but I'd be surprised if the text extraction wasn't cached. Hence if the text was extracted before this bug report was fixed, the cached text would still be wrong.
And now somebody please correct me if I'm wrong.

Comment 15 Nemo 2013-10-07 08:21:09 UTC

Yeah, action=purge on the file seems to have fixed it. Pikne, do you confirm?
As for what's a duplicate and what not, we can assume that poppler-utils has and/or will have bugs that xpdf doesn't, so the only way to know is to run it locally on your computer for the files you have problems with, to find out where the problem lies.
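For anyone who wants to purge other affected files, the purge link can be built like this. It is a sketch only: the /w/index.php path is assumed from the Wikisource URLs earlier in this report, the file title is a placeholder, and purge_url is a made-up helper name.

```python
from urllib.parse import urlencode


def purge_url(site: str, title: str) -> str:
    """Build an index.php URL carrying action=purge for a page title.

    Hypothetical helper. It assumes the wiki serves index.php
    under /w/, as the Wikisource links in this report do.
    """
    query = urlencode({"title": title, "action": "purge"})
    return f"{site}/w/index.php?{query}"


if __name__ == "__main__":
    # "File:Example.pdf" is a placeholder, not a file from this report.
    print(purge_url("https://commons.wikimedia.org", "File:Example.pdf"))
```

Opening the resulting link while logged in should discard the cached data for that file, so the text layer is extracted again on the next request.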

Comment 16 Pikne 2013-10-09 07:24:34 UTC

Yes, looks fine now. I didn't realize that sort of thing could be cached too.
