Last modified: 2014-01-05 04:30:41 UTC

Bug 6422 - Extract embedded text from PDF documents for search
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.7.x
Hardware: All All
Importance: Low enhancement with 4 votes
Target Milestone: ---
Assigned To: Nobody
Depends on: 21061
Blocks: 41037 6421
Reported: 2006-06-24 08:50 UTC by Brion Vibber
Modified: 2014-01-05 04:30 UTC
CC List: 7 users

Description Brion Vibber 2006-06-24 08:50:53 UTC
PDF files may contain a machine-readable form of the text contained 
in the represented document. It could be useful to extract this 
text and include it in the search index for the file's description 
page.

I'm pretty sure there are open-source tools for extracting text 
data from PDFs out and about, but haven't looked into it.
Comment 1 Brion Vibber 2011-12-05 19:09:43 UTC
The PdfHandler extension extracts text using the 'pdftotext' utility if $wgPdftoText is set.

Currently the extracted text is stored in the metadata blob and isn't available for search, but it may be used by Extension:ProofreadPage.
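
For anyone following along, here is a minimal sketch of the extraction step on its own, outside MediaWiki. It assumes the `pdftotext` command-line tool is installed and on the PATH. The function names are hypothetical and this is not PdfHandler's actual code; it only illustrates the kind of call involved.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, binary: str = "pdftotext") -> List[str]:
    """Build the argument list; '-' as the output file sends the text to stdout."""
    return [binary, pdf_path, "-"]


def extract_pdf_text(pdf_path: str, binary: str = "pdftotext") -> Optional[str]:
    """Return the embedded text of a PDF, or None if the tool is missing or fails.

    Hypothetical helper: the name and the None-on-failure convention are
    assumptions made for this sketch.
    """
    if shutil.which(binary) is None:
        # The extraction tool is not installed, so there is nothing to index.
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # Decode leniently so an odd byte sequence does not abort the extraction.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string is what would then need to be handed to the search index; in the setup described above it only ends up in the file's metadata.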
Comment 2 dchandler 2011-12-24 16:54:46 UTC
@Brion: Thanks so much for posting this. I have desperately been trying to add the capability of searching within PDFs. I'm definitely a non-expert though and can generally only install extensions or make modifications that are well-documented.

Have you already implemented this on a wiki, or do you know anyone who has? I've seen it suggested that FileIndexer (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer) may be another approach. Do you have any advice on which approach is easier to implement for a non-expert? Do you think that the Extension:ProofreadPage method might be easier or more stable than using the other extension?

Do you know of any step-by-step guides to doing this with pdftotext and ProofreadPage?

Thanks so much in advance for any suggestion or guidance you have.
Comment 3 DrTrigon 2013-12-29 22:06:44 UTC
As mentioned in bug 6421 (comment #3), DrTrigonBot could do text extraction and store the result in a dedicated wiki page so that it is accessible to search. But since PdfHandler does text extraction as well, this should not be needed.

As I see it, we have everything needed:
1.) text extraction (PdfHandler or DrTrigonBot)
2.) indexing for search (see bug 6421)
...so as I understand it, we should be able to finish this and close the ticket/bug, or am I wrong? Could somebody comment on this?

Thanks and Greetings
Comment 4 Chad H. 2014-01-05 04:30:41 UTC
I don't think there's anything left to do here; we index PDF/DjVu data in the new search.
