Last modified: 2014-01-05 04:30:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T8422, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 6422 - Extract embedded text from PDF documents for search


Summary:	Extract embedded text from PDF documents for search

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Search
Version:	1.7.x
Hardware:	All All

Importance:	Low enhancement with 4 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:	21061
Blocks:	41037 6421

Reported:	2006-06-24 08:50 UTC by Brion Vibber
Modified:	2014-01-05 04:30 UTC
CC List:	7 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Brion Vibber 2006-06-24 08:50:53 UTC

PDF files may contain a machine-readable form of the text contained 
in the represented document. It could be useful to extract this 
text and include it in the search index for the file's description 
page.

I'm pretty sure there are open-source tools for extracting text 
data from PDFs out and about, but haven't looked into it.

Comment 1 Brion Vibber 2011-12-05 19:09:43 UTC

The PdfHandler extension extracts text using the 'pdftotext' utility if $wgPdftoText is set.

Currently the extracted text is stored in the metadata blob and isn't available for search, but it may be used by Extension:ProofreadPage.
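
For illustration only, here is a minimal Python sketch of the kind of 'pdftotext' call described above. It is not PdfHandler's actual code (PdfHandler is a PHP extension). The function names build_pdftotext_command and extract_pdf_text are hypothetical, and the sketch assumes the poppler 'pdftotext' binary is installed and on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, binary: str = "pdftotext") -> List[str]:
    """Build the argument list for a pdftotext run (hypothetical helper).

    Passing "-" as the output file asks pdftotext to write the
    extracted text to standard output instead of a .txt file.
    """
    return [binary, pdf_path, "-"]


def extract_pdf_text(pdf_path: str, binary: str = "pdftotext") -> Optional[str]:
    """Return the embedded text of a PDF, or None if extraction is not possible."""
    # Assumption: the binary is discoverable on PATH; bail out quietly otherwise.
    if shutil.which(binary) is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, binary),
        capture_output=True,
        check=False,
    )
    # A non-zero exit code covers missing, encrypted, or damaged files.
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A caller would pass the path of an uploaded file, for example extract_pdf_text("example.pdf"), and treat a None result as "no text available" rather than as an error.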

Comment 2 dchandler 2011-12-24 16:54:46 UTC

@Brion: Thanks so much for posting this. I have desperately been trying to add the capability of searching within PDFs. I'm definitely a non-expert though and can generally only install extensions or make modifications that are well-documented.

Have you already implemented this on a wiki, or do you know anyone who has? I've seen it suggested that FileIndexer (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer) may be another approach. Do you have any advice on which approach is easier to implement for a non-expert? Do you think that the Extension:ProofreadPage method might be easier or more stable than using the other extension?

Do you know of any step-by-step guides to doing this with pdftotext and ProofreadPage?

Thanks so much in advance for any suggestion or guidance you have.

Comment 3 DrTrigon 2013-12-29 22:06:44 UTC

As mentioned in bug 6421 (comment #3), DrTrigonBot could do text extraction and store the result in a dedicated wiki page so that it is accessible to search. But since PdfHandler does text extraction as well, this should not be needed.

As I see it, we have everything needed:
1.) text extraction (PdfHandler or DrTrigonBot)
2.) indexing for search (see bug 6421)
...so as I understand it, we should be able to finish this and close the ticket/bug, or am I wrong? Could somebody comment on this?

Thanks and Greetings
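
To make the two steps above concrete, here is a toy Python sketch of step 2: feeding already-extracted text into a search index keyed by file page title. It is a generic in-memory inverted index, not how MediaWiki's search backend works. The names tokenize, build_index and search, and the sample page titles in the usage note, are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(extracted: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of page titles whose extracted text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title, text in extracted.items():
        for token in tokenize(text):
            index[token].add(title)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the titles that contain every token of the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

For example, after building an index from {"File:A.pdf": "Annual report on wiki search", "File:B.pdf": "Search index design"}, a query for "wiki search" matches only File:A.pdf, while "search" matches both.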

Comment 4 Chad H. 2014-01-05 04:30:41 UTC

I don't think there's anything left to do here; we index PDF/DjVu data in the new search.
