Last modified: 2014-10-16 19:29:17 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T59807, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 57807 - Merge proofread text back into Djvu files


Summary:	Merge proofread text back into Djvu files

Status:	UNCONFIRMED

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	ProofreadPage (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Low enhancement (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	Wikisource

Reported:	2013-12-01 16:09 UTC by vladjohn2013
Modified:	2014-10-16 19:29 UTC (History)
CC List:	13 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
simple DjVu test file (61.06 KB, image/vnd.djvu) 2014-09-21 04:30 UTC, George Orwell III	Details
resulting XML file (6.19 KB, text/xml) 2014-09-21 04:34 UTC, George Orwell III	Details

Description vladjohn2013 2013-12-01 16:09:08 UTC

Merge proofread text back into Djvu files

Wikisource, the free library, has an enormous collection of DjVu files and proofread texts based on those scans. However, while the DjVu files contain a text layer, that layer holds the original computer-generated (OCR) text, not the volunteer-proofread text. There is some previous work on merging the proofread text into pages as a blob, and on finding similar words to use as anchors for text re-mapping. The idea is to create an export tool that gets word positions and confidence levels using Tesseract and then re-maps the proofread text back into the text layer of the DjVu file. If possible, word coordinates should be kept.

    Project proposed by Micru. I have found an external mentor who could lend a hand with Tesseract; now I'm looking for a mentor who would provide assistance with MediaWiki.
    Aubrey can be a mentor, providing assistance regarding Wikisource and some of the past history of this issue. Not much, but glad to help if needed.
    Rtdwivedi is willing to be a mentor.


URL:https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
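
The anchor-based re-mapping described above could be sketched roughly as follows. This is only an illustration, not the proposed tool: it assumes the OCR words and their boxes have already been extracted, it uses Python's standard difflib in place of any Tesseract confidence data, and the names remap and union_box are hypothetical.

```python
from difflib import SequenceMatcher


def union_box(boxes):
    """Smallest box covering all boxes; each box is (xmin, ymin, xmax, ymax)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def remap(ocr_words, proofread_words):
    """Carry OCR word boxes over to proofread words.

    ocr_words: list of (text, box) pairs from the existing text layer.
    proofread_words: list of str from the proofread page.
    Returns a list of (text, box); box is None when no OCR box is available.
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old = ocr_words[i1:i2]
        new = proofread_words[j1:j2]
        if tag == "equal" or (tag == "replace" and len(old) == len(new)):
            # Identical words act as anchors; same-length corrections map 1:1.
            result.extend((text, box) for text, (_, box) in zip(new, old))
        elif tag == "replace":
            # Word counts differ: keep one merged box for the whole span.
            result.append((" ".join(new), union_box([box for _, box in old])))
        elif tag == "insert":
            # Proofread text with no OCR counterpart has no coordinates.
            result.append((" ".join(new), None))
        # "delete": OCR noise that is absent from the proofread text is dropped.
    return result


ocr = [("Tbe", (0, 0, 10, 5)), ("quick", (12, 0, 30, 5)),
       ("brovvn", (32, 0, 50, 5)), ("fox", (52, 0, 60, 5))]
fixed = remap(ocr, ["The", "quick", "brown", "fox"])
```

In this sketch the corrected words inherit the boxes of the misread words they replace, so word coordinates are kept wherever the word count is unchanged.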

Comment 1 vladjohn2013 2013-12-01 16:09:26 UTC

This proposal has been listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects and we are filing a report to gather community feedback and share updates.

Comment 2 Quim Gil 2013-12-03 18:49:59 UTC

CCing Micru and Aarti, who are proposing this project. I'm not sure who Aubrey is:

"Aubrey can be a mentor providing assistance regarding Wikisource"

Comment 3 George Orwell III 2014-09-21 04:30:46 UTC

Created attachment 16531 [details]
simple DjVu test file

A simple DjVu test file whose embedded text layer is LINE-based instead of WORD-based. Provided to illustrate XML file generation (testme.xml).

Comment 4 George Orwell III 2014-09-21 04:34:00 UTC

Created attachment 16532 [details]
resulting XML file

Command line used to generate the XML file:

C:\Program Files (x86)\DjVuLibre>djvutoxml.exe testme.djvu testme.xml
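
Once such an XML file exists, the zones and their coordinates could be read back with a few lines of Python. This is a minimal sketch under assumptions: the SAMPLE string is a hand-written, hypothetical fragment in the general shape of DjVuLibre's XML output (not the attached testme.xml), the function name extract_zones is made up, and the coords values are kept as raw integers without interpreting their order. Real output may carry a DOCTYPE or entities that need extra handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; not the attached testme.xml.
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="10,50,60,20">Merge</WORD><WORD coords="70,50,150,20">proofread</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def extract_zones(xml_text):
    """Return (tag, text, coords) for every element that has a coords attribute."""
    root = ET.fromstring(xml_text)
    zones = []
    for elem in root.iter():
        coords = elem.get("coords")
        if coords is None:
            continue  # structural element without its own box
        numbers = tuple(int(n) for n in coords.split(","))
        text = "".join(elem.itertext()).strip()
        zones.append((elem.tag, text, numbers))
    return zones


zones = extract_zones(SAMPLE)
```

Collecting every element with a coords attribute, rather than only WORD elements, is a deliberate choice here so that the same code would also pick up coarser zones if a file carries them.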

Comment 5 George Orwell III 2014-09-21 05:16:11 UTC

(In reply to vladjohn2013 from comment #0)
> Merge proofread text back into Djvu files
> 
> . . . The idea is to create an
> export tool that will get word positions and confidence levels using
> Tesseract and then re-map the text layer back into the DjVu file. If
> possible, word coordinates should be kept.

Isn't some of that already possible using DjVuLibre's built-in DjVu-to-XML scheme? (See attachments)

As far as I can tell, this method was once feasible and pursued, then "abandoned" some 7+ years ago in favor of the current 'plain-text' dump approach, due to some resource(?) issues at the time. Most of the related bits still seem (to me) to be in place, going by what is found in https://git.wikimedia.org/tree/mediawiki%2Fcore

/includes/media/DjVu.php  and;
/includes/media/DjVuImage.php

It seems (again, to me) that the first step on the path to making the proposal a reality is to see whether it's still possible to actually generate an XML file from a DjVu file using MediaWiki et al. as it stands today. I know this is possible on a vanilla x86 local install of the DjVuLibre software package (refer to the attachments again)... but all that online server, Linux, Debian, Ubuntobama stuff is beyond me, and something along those lines is what is in play here.

So: can anyone successfully generate the DjVuLibre-defined XML derivative from a .DjVu file using just the MediaWiki regime/scheme currently in place?
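
One way to try this on a server, outside MediaWiki itself, is a small script that checks whether the djvutoxml binary is installed and runs it with the same two arguments as the command line in comment 4. This is a sketch only: the function names are hypothetical, and it does not exercise the MediaWiki code paths in DjVu.php or DjVuImage.php.

```python
import shutil
import subprocess


def djvutoxml_command(djvu_path, xml_path, binary="djvutoxml"):
    """Build the argument list: djvutoxml <input.djvu> <output.xml>."""
    return [binary, djvu_path, xml_path]


def try_generate_xml(djvu_path, xml_path, binary="djvutoxml"):
    """Return True if the binary is on PATH and exits successfully."""
    if shutil.which(binary) is None:
        return False  # DjVuLibre tools are not installed or not on PATH
    completed = subprocess.run(
        djvutoxml_command(djvu_path, xml_path, binary),
        capture_output=True,
        text=True,
    )
    return completed.returncode == 0
```

Calling try_generate_xml("testme.djvu", "testme.xml") on the target machine would then show whether the conversion works there at all, before looking at how MediaWiki invokes it.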
