Last modified: 2013-12-29 22:10:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T8421, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 6421 - Extract embedded text from DjVu and PDF documents for search


Summary:	Extract embedded text from DjVu and PDF documents for search

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	DjVu
Version:	unspecified
Hardware:	All All

Importance:	Low enhancement with 2 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://djvulibre.djvuzone.org/
Whiteboard:
Keywords:

Depends on:	21061 6422
Blocks:	Wikisource 41037

Reported:	2006-06-24 08:49 UTC by Brion Vibber
Modified:	2013-12-29 22:10 UTC
CC List:	9 users

See Also:	13370
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Brion Vibber 2006-06-24 08:49:40 UTC

DjVu image files may contain a machine-readable form of the text 
contained in the represented document. It could be useful to 
extract this text and include it in the search index for the file's 
description page.

It's probably possible to extract text using the DjVuLibre library 
or tools.
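
A minimal sketch of what such an extraction step could look like, assuming the `djvutxt` command-line tool from DjVuLibre is installed. The `pdftotext` entry (from poppler-utils) is an added assumption for PDF files and is not mentioned in this report; the function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from file extension to an external extraction command.
# djvutxt ships with DjVuLibre and prints the text layer to stdout.
# pdftotext (poppler-utils) is an assumption; "-" sends its output to stdout.
EXTRACTORS = {
    ".djvu": lambda path: ["djvutxt", str(path)],
    ".pdf": lambda path: ["pdftotext", str(path), "-"],
}


def build_command(path):
    """Return the extraction command for this file, or None if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        return None
    return EXTRACTORS[suffix](path)


def extract_text(path):
    """Return the embedded text, or None if no tool is available or it fails."""
    cmd = build_command(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(
        cmd, capture_output=True, encoding="utf-8", errors="replace", check=False
    )
    return result.stdout if result.returncode == 0 else None
```

The returned string could then be handed to whatever builds the search index for the file's description page; that part is outside the scope of this sketch.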

Comment 1 sakthi 2012-03-10 20:06:57 UTC

Do you mean that the file's description page should appear in the search results when the query matches the extracted text of the DjVu file?

Comment 2 Doug 2012-04-12 22:16:27 UTC

I'm pretty sure he means that search results should include hits from within the text layer of the file scans proper. So if the scans include the text "Foobar", a search for "bar" would return the text from the scan's text layer as a result, even though the text layer had not been extracted and placed on a wiki page. There's no reason that this should be restricted to DjVu files (though there was at the time the bug was filed).

Comment 3 DrTrigon 2012-09-04 15:32:08 UTC

As far as I can see, this bug is quite old and additionally marked "low" in priority. Is this bug going to be fixed at all? In my opinion, solving it is a *must have*.

DrTrigonBot [1] does file-content-based categorization on Commons. For this purpose, embedded text from PDF (and later DjVu) files is extracted and processed. We are currently debating [2] whether to store this text on a page, so that the MediaWiki search engine can index and find the content.

Now the question is: when is this bug scheduled to be fixed? Will it be fixed at all? If not: as mentioned, DrTrigonBot could dump each file's text content to a dedicated page so that the MediaWiki search engine can handle it. This should be considered a workaround only and would not be needed at all if and when this bug is solved.

[1] http://commons.wikimedia.org/wiki/User:DrTrigonBot
[2] http://commons.wikimedia.org/wiki/User_talk:DrTrigonBot/JavaScript#PDF_content_extraction

Comment 4 Nemo 2012-09-04 15:53:54 UTC

I doubt this will be solved any time soon. There's nobody working on this or related issues, and search is a monster nobody really wants to touch AFAIK, so I'd suggest you implement whatever workaround you think is worth about 5 more years of usage.
The pages you create should probably be as hidden as possible from users. In particular, they shouldn't be indexed by external search engines, or they would e.g. "compete" with Wikisource (or even archive.org), which doesn't make any sense.

Comment 5 DrTrigon 2012-09-10 00:08:25 UTC

Is it possible to include metadata in the search (indexing) in the MediaWiki software? As I was informed, the text layer gets extracted by PdfHandler [1] and is stored in the image's (PDF's) metadata [2] (name="0").

[1] https://www.mediawiki.org/wiki/Extension:PdfHandler
[2] http://commons.wikimedia.org/w/api.php?action=query&iilimit=500&iiprop=metadata|timestamp&prop=imageinfo&titles=File:Resume-.pdf
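
As a rough illustration of the query in [2], here is a sketch that builds the same API request and pulls text out of a metadata reply. The endpoint and parameters come from the link above, with `format=json` added so the reply is machine-readable. The nested layout assumed for the reply (a "text" entry holding per-page entries named "0", "1", ...) is a guess based on the name="0" remark, and the sample data is hypothetical.

```python
from urllib.parse import urlencode

API = "http://commons.wikimedia.org/w/api.php"  # endpoint from link [2]


def metadata_query_url(title):
    """Build the imageinfo/metadata query URL for one file title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata|timestamp",
        "iilimit": 500,
        "titles": title,
        "format": "json",  # added here; not part of the original link
    }
    return API + "?" + urlencode(params)


def find_entry(metadata, name):
    """Return the value of the first {name, value} pair with this name."""
    for item in metadata or []:
        if item.get("name") == name:
            return item.get("value")
    return None


def collect_strings(value):
    """Flatten a nested list of {name, value} pairs into a list of strings."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        out = []
        for item in value:
            inner = item.get("value") if isinstance(item, dict) else item
            out.extend(collect_strings(inner))
        return out
    return []


# Hypothetical metadata shaped like the assumed reply, for illustration only.
sample = [
    {"name": "Pages", "value": "2"},
    {"name": "text", "value": [
        {"name": "0", "value": "first page"},
        {"name": "1", "value": "second page"},
    ]},
]
pages = collect_strings(find_entry(sample, "text"))
```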

Comment 6 Gerrit Notification Bot 2013-12-14 03:54:36 UTC

Change 101252 had a related patch set uploaded by Brian Wolff:
Begin indexing file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252

Comment 7 Nemo 2013-12-14 07:14:21 UTC

(In reply to comment #6)
> Change 101252 had a related patch set uploaded by Brian Wolff:
> Begin indexing file text from pdf/djvu files
> 
> https://gerrit.wikimedia.org/r/101252

Ah, wonderful. That's in CirrusSearch and the core part was already done in https://gerrit.wikimedia.org/r/#/c/99715/ , but there's nothing more specific than the DjVu component so I'm not moving this bug.

Comment 8 Gerrit Notification Bot 2013-12-26 21:41:04 UTC

Change 101252 merged by jenkins-bot:
Index and search file text from pdf/djvu files

https://gerrit.wikimedia.org/r/101252

Comment 9 Nik Everett 2013-12-26 21:44:52 UTC

Merged.  It won't take effect until a full reindex of everything in the file namespace.  That'll take a few days after the deployment.  Results will start showing up when the document is indexed.

Also, the file text results are worth 0.8 of a page text result from a scoring standpoint.

Finally, this'll work with any files from which MediaWiki is able to extract text. If a new file type is plugged in at a later date, those files will have to be reindexed for the text to be searchable.
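
To make the scoring remark concrete, here is a toy sketch of how a relative weight for file text affects ranking. Only the weighting itself is taken from the comment above; the field names, the multiply-then-sort scheme, and the sample hits are illustrative assumptions and do not reflect how CirrusSearch actually computes scores.

```python
# Relative weight per field. The file text weight comes from the comment above;
# everything else in this sketch is an assumption made for illustration.
FIELD_WEIGHTS = {"page_text": 1.0, "file_text": 0.8}


def weighted_score(raw_score, field):
    """Scale a raw match score by the weight of the field it matched in."""
    return raw_score * FIELD_WEIGHTS[field]


def rank(hits):
    """Sort (title, field, raw_score) hits by weighted score, best first."""
    return sorted(
        hits, key=lambda hit: weighted_score(hit[2], hit[1]), reverse=True
    )


# Two hypothetical hits with the same raw score: the page text match ranks first.
hits = [
    ("File:Scan.pdf", "file_text", 2.0),
    ("Some page", "page_text", 2.0),
]
ranked = rank(hits)
```

Under this scheme a sufficiently strong file text match can still outrank a weak page text match; the weight only breaks near-ties in favour of page text.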

Comment 10 DrTrigon 2013-12-28 10:42:55 UTC

Nice! Good job - thanks!

What about including metadata into the search (indexing) as well??

Comment 11 Chad H. 2013-12-28 19:47:50 UTC

(In reply to comment #10)
> Nice! Good job - thanks!
> 
> What about including metadata into the search (indexing) as well??

Not a bad idea. We have a bug filed for it somewhere?

Comment 12 DrTrigon 2013-12-29 09:44:12 UTC

(In reply to comment #11)
> Not a bad idea. We have a bug filed for it somewhere?

What about bug 21061, may be bug 13370 as well?! Or shall I create a new one?

Comment 13 Chad H. 2013-12-29 19:16:50 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > Not a bad idea. We have a bug filed for it somewhere?
> 
> What about bug 21061, may be bug 13370 as well?! Or shall I create a new one?

Those will do great :)

Comment 14 DrTrigon 2013-12-29 22:08:34 UTC

Good! I linked them.

Am I wrong or should we now be able to close bug 6422 as well?
