Last modified: 2013-10-09 07:24:34 UTC
[[Commons:File:Иннокентий Анненский - Царь Иксион, 1902.pdf]] or http://commons.wikimedia.org/wiki/File:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf in the new version uploaded March 10, 2012, is a PDF/A file with page images and an OCR text layer, generated by ABBYY FineReader OCR software. The program pdftotext extracts the OCR text layer, which for the first page begins: "Дннѳнскій.\n\nТ Р А Г Е Д І Я\nВЪ пяти ДѢЙСТВІЯХЪ\n". (This text contains a few OCR errors, such as the initial "Д", which is a misinterpreted "А", but this is entirely normal.) The pdftotext output, piped through "od -c", begins:
0000000 320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226
0000020 320 271 . \n \n 320 242 320 240 320 220 320 223
0000040 320 225 320 224 320 206 320 257 \n 320 222 320
0000060 252 320 277 321 217 321 202 320 270 320 224 321 242 320
However, when the ProofreadPage extension tries to extract the text using the PdfHandler, the text passes through UtfNormal::cleanUp() (line 140 of source file extensions/PdfHandler/PdfHandler.image.php), and only the period, newline, some hyphens and digits come through. To reproduce this, click the red-linked page numbers at the Russian Wikisource: http://ru.wikisource.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf
Pages are correctly split on \f (form feed).
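For anyone who wants to check the dump by hand, here is a minimal Python sketch. It turns the first octal bytes of the "od -c" output back into text to confirm that they are valid UTF-8, and it splits extracted text into pages on form feeds. The helper names (octal_to_text, split_pages) are made up for illustration; this is not how PdfHandler does it.

```python
# Hypothetical helpers for inspecting pdftotext output; not PdfHandler code.

def octal_to_text(octal_dump: str) -> str:
    """Turn space-separated octal byte values (as shown by od -c) into text."""
    raw = bytes(int(token, 8) for token in octal_dump.split())
    # A UnicodeDecodeError here would mean the bytes are not valid UTF-8.
    return raw.decode("utf-8")


def split_pages(text: str) -> list[str]:
    """Split extracted text into pages on form feed (\\f)."""
    pages = text.split("\f")
    # Assumption: a trailing form feed leaves one empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# First 18 bytes of the dump: offset 0000000 plus the first two bytes
# of offset 0000020.
first_word = octal_to_text(
    "320 224 320 275 320 275 321 263 320 275 321 201 320 272 321 226 320 271"
)
print(first_word)
```

The decoded bytes give the nine-letter word that opens the page (including the OCR error in the first letter), so the bytes leaving pdftotext are well-formed UTF-8 at that point.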
I should add that I run Ubuntu Linux 11.10, where pdftotext -? says:
pdftotext version 0.16.7
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
The above version successfully extracts the text. A different version, which fails to extract letters, is included in xpdf 3.02, which says:
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
pdftotext version 3.02 from the xpdf-3.02 package produces garbage, consisting mainly of spaces, dots and other ASCII punctuation.
Update: The "red-linked page numbers" mentioned above are not red links anymore.
OCR text as extracted by Proofread Page on WMF servers: http://ru.wikisource.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf/1&oldid=704348
OCR text extracted correctly (on my local computer) and uploaded by bot: http://ru.wikisource.org/w/index.php?title=%D0%A1%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0:%D0%98%D0%BD%D0%BD%D0%BE%D0%BA%D0%B5%D0%BD%D1%82%D0%B8%D0%B9_%D0%90%D0%BD%D0%BD%D0%B5%D0%BD%D1%81%D0%BA%D0%B8%D0%B9_-_%D0%A6%D0%B0%D1%80%D1%8C_%D0%98%D0%BA%D1%81%D0%B8%D0%BE%D0%BD,_1902.pdf/1&oldid=704352
I've updated the summary to make it clearer what is needed. Let me know if I have that right and I'll open an RT ticket.
Yes, it's fine. The xpdf version thing is just our theory. We have no idea which version of pdftotext is running really.
(In reply to comment #5)
> Yes, it's fine. The xpdf version thing is just our theory. We have no idea
> which version of pdftotext is running really.

reedy@fenari:~$ pdftotext -v
pdftotext version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
Interesting that the latest Ubuntu doesn't have pdftotext from xpdf; Lucid has it in xpdf-utils.
A newer version of pdftotext is available from poppler-utils, although its version numbers are lower (now at 0.18.4): http://packages.ubuntu.com/precise/poppler-utils http://poppler.freedesktop.org/ Usually you have to get rid of xpdf to use poppler.
mah@lucid:~$ xpdf -v
xpdf version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
mah@lucid:~$ pdftotext -v
pdftotext version 0.12.4
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
mah@lucid:~$ dpkg -l xpdf xpdf-reader poppler-utils xpdf-utils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                           Version
+++-==============================-==============================-
ii  poppler-utils                  0.12.4-0ubuntu5.2
ii  xpdf                           3.02-2ubuntu1.1
ii  xpdf-reader                    3.02-2ubuntu1.1
un  xpdf-utils                     <none>

https://rt.wikimedia.org/Ticket/Display.html?id=2631
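Since the open question in this thread is which pdftotext a given machine actually runs, here is a rough Python sketch that classifies the version banner. It relies only on the two banners quoted in this thread (the Poppler build names "The Poppler Developers"; the xpdf build names only Glyph & Cog). The function names are hypothetical, and other builds may print something different.

```python
import subprocess


def classify_banner(banner: str) -> str:
    """Guess the pdftotext implementation from its version banner."""
    lowered = banner.lower()
    # Poppler banners also mention Glyph & Cog, so test for Poppler first.
    if "poppler" in lowered:
        return "poppler"
    if "glyph & cog" in lowered:
        return "xpdf"
    return "unknown"


def installed_pdftotext() -> str:
    """Run `pdftotext -v` on this machine and classify whatever it prints."""
    try:
        result = subprocess.run(
            ["pdftotext", "-v"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return "unknown"
    # Combine both streams so it does not matter which one carries the banner.
    return classify_banner(result.stdout + result.stderr)


# Banners as quoted in the comments of this thread.
XPDF_BANNER = "pdftotext version 3.02\nCopyright 1996-2007 Glyph & Cog, LLC"
POPPLER_BANNER = (
    "pdftotext version 0.12.4\n"
    "Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org\n"
    "Copyright 1996-2004 Glyph & Cog, LLC"
)

print(classify_banner(XPDF_BANNER))
print(classify_banner(POPPLER_BANNER))
```

Calling installed_pdftotext() on the host in question would show which of the two implementations is first on the PATH there.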
*** Bug 34540 has been marked as a duplicate of this bug. ***
*** Bug 32064 has been marked as a duplicate of this bug. ***
This was fixed when MediaWiki boxes were upgraded to Ubuntu Precise (which happened a few months ago). Faidon checked that on a Precise box poppler-utils is indeed installed instead of xpdf-utils. Closing as FIXED.
Examples in bug 34540 and bug 32064 still show foreign characters as �. Any chance that the fix isn't deployed yet? Or are these other bugs not really duplicates?
(In reply to comment #13)
> Examples in bug 34540 and bug 32064 still show foreign characters as �.
> Any chance that the fix isn't deployed yet? Or these other bugs are not
> duplicates really?

I don't know the implementation details of this functionality, but I'd be surprised if the text extraction wasn't cached. Hence, if the text was extracted before this bug was fixed, the cached text would still be wrong. And now somebody please correct me if I'm wrong.
Yeah, action=purge on the file seems to have fixed it. Pikne, do you confirm? As for what's a duplicate and what's not, we can assume that poppler-utils has and/or will have bugs that xpdf doesn't, so the only way to know is to run it locally on your computer for the files you have problems with, to find out where the problem lies.
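For reference, a purge link for a file page can be put together like this. This is only a sketch: the index.php base URL and the file title are placeholders, the helper name is made up, and depending on the MediaWiki version and login state the wiki may ask for confirmation before it actually purges.

```python
from urllib.parse import urlencode


def purge_url(index_php: str, title: str) -> str:
    """Build an index.php URL that requests action=purge for a page title."""
    # urlencode percent-encodes the title, including non-ASCII characters.
    query = urlencode({"title": title, "action": "purge"})
    return f"{index_php}?{query}"


# Placeholder values for illustration only.
example = purge_url(
    "https://commons.wikimedia.org/w/index.php", "File:Example.pdf"
)
print(example)
```

After the purge, the text layer should be extracted again with whatever pdftotext the servers run at that time.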
Yes, looks fine now. I didn't realize that sort of thing could be cached too.