Last modified: 2014-10-16 19:29:17 UTC
Merge proofread text back into Djvu files

Wikisource, the free library, has an enormous collection of DjVu files and proofread texts based on those scans. However, while the DjVu files contain a text layer, this text is the original computer-generated (OCR) text and not the volunteer-proofread text. There is some previous work on merging the proofread text into pages as a blob, and also on finding similar words to use as anchors for text re-mapping.

The idea is to create an export tool that will get word positions and confidence levels using Tesseract and then re-map the text layer back into the DjVu file. If possible, word coordinates should be kept.

Project proposed by Micru. I have found an external mentor who could give a hand with Tesseract; now I'm looking for a mentor who would provide assistance with MediaWiki.

Aubrey can be a mentor providing assistance regarding Wikisource and some of the past history of this issue. Not much, but glad to help if needed.

Rtdwivedi is willing to be a mentor.

URL: https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Merge_proofread_text_back_into_Djvu_files
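To make the re-mapping idea a bit more concrete, here is a minimal sketch of the "similar words as anchors" step. It is not taken from any existing tool: the function name `remap_words` and the (text, box) data layout are assumptions made for illustration, and Python's standard `difflib.SequenceMatcher` stands in for whatever alignment method the project would actually choose. Words that match exactly act as anchors; between anchors, a proofread word inherits an OCR box only when the two runs have the same length.

```python
from difflib import SequenceMatcher


def remap_words(ocr_words, proofread_words):
    """Carry OCR bounding boxes over to proofread words.

    ocr_words:       list of (text, box) pairs from the OCR text layer
                     (hypothetical layout, box is any coordinate tuple).
    proofread_words: list of corrected words in reading order.
    Returns a list of (proofread_text, box_or_None).
    """
    ocr_texts = [text for text, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=proofread_words, autojunk=False)
    remapped = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Anchors, or a one-to-one correction between anchors:
            # keep the original word coordinates.
            for k in range(j2 - j1):
                remapped.append((proofread_words[j1 + k], ocr_words[i1 + k][1]))
        elif tag in ("replace", "insert"):
            # No reliable one-to-one mapping: keep the text, leave the
            # box unknown (a real tool might interpolate instead).
            for j in range(j1, j2):
                remapped.append((proofread_words[j], None))
        # tag == "delete": OCR noise with no proofread counterpart is dropped.
    return remapped


if __name__ == "__main__":
    ocr = [("Tbe", (0, 0, 10, 10)), ("quick", (12, 0, 30, 10)),
           ("brovvn", (32, 0, 50, 10)), ("fox", (52, 0, 60, 10))]
    for word, box in remap_words(ocr, ["The", "quick", "brown", "fox"]):
        print(word, box)
```

Confidence levels from Tesseract are left out here; they could be used to decide which OCR words are trustworthy enough to serve as anchors.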
This proposal has been listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects and we are filing a report to gather community feedback and share updates.
CCing Micru and Aarti, who are proposing this project. I'm not sure who Aubrey is: "Aubrey can be a mentor providing assistance regarding Wikisource"
Created attachment 16531
simple DjVu test file

A simple .DjVu test file where the embedded text layer is LINE based instead of WORD based. Provided to illustrate XML file generation (testme.xml).
Created attachment 16532
resulting XML file

Command line used to generate the XML file:

C:\Program Files (x86)\DjVuLibre>djvutoxml.exe testme.djvu testme.xml
(In reply to vladjohn2013 from comment #0)
> Merge proofread text back into Djvu files
>
> . . . The idea is to create an
> export tool that will get word positions and confidence levels using
> Tesseract and then re-map the text layer back into the DjVu file. If
> possible, word coordinates should be kept.

Isn't some of that already possible using DjVuLibre's built-in DjVu-to-XML scheme? (See attachments.)

As far as I can tell, this method was once feasible and pursued, then "abandoned" some 7+ years ago in favor of the current 'plain-text' dump approach, due to some resource(?) issues at the time. Most of the related bits seem (to me) to still be in place, going by what is found in https://git.wikimedia.org/tree/mediawiki%2Fcore under /includes/media/DjVu.php and /includes/media/DjVuImage.php.

It seems (again, to me) that the first step on the path to making the proposal a reality is to see if it's still possible to actually generate an XML file from a DjVu file using MediaWiki et al. as it stands today. I know this is possible on a vanilla x86 local install of the DjVuLibre software package (refer to the attachments again)... but all that online server, Linux, Debian, Ubuntobama stuff is beyond me - and something along those lines is what is in play here.

So: can anyone successfully generate the DjVuLibre-defined XML derivative from a .DjVu file using just the MediaWiki setup currently in place?
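For anyone who wants to try this outside MediaWiki first, below is a rough Python sketch of the same round trip: call djvutoxml the way the attachment comment does, then pull words and coordinates out of the result. The helper names and the inline SAMPLE document are made up for illustration; the sample is hand-written and is not the attached testme.xml. It assumes the text layer is exposed as WORD elements carrying a comma-separated coords attribute, so a LINE-based file like the attached test case may need different handling.

```python
import subprocess
import xml.etree.ElementTree as ET


def djvu_to_xml(djvu_path, xml_path):
    """Run DjVuLibre's djvutoxml, as in the attachment comment.

    Assumes the djvutoxml executable is on PATH.
    """
    subprocess.run(["djvutoxml", djvu_path, xml_path], check=True)


def extract_words(xml_text):
    """Return [(text, coords)] for every WORD element in the XML.

    coords is a tuple of ints taken from the 'coords' attribute,
    or None when the attribute is missing.
    """
    root = ET.fromstring(xml_text)
    words = []
    for word in root.iter("WORD"):
        raw = word.get("coords")
        box = tuple(int(v) for v in raw.split(",")) if raw else None
        words.append(((word.text or "").strip(), box))
    return words


# Hand-written stand-in for djvutoxml output (an assumption, not real output).
SAMPLE = """
<DjVuXML>
  <BODY>
    <OBJECT data="testme.djvu" type="image/x.djvu" height="100" width="200">
      <HIDDENTEXT>
        <PAGECOLUMN><REGION><PARAGRAPH>
          <LINE>
            <WORD coords="10,40,60,20">Hello</WORD>
            <WORD coords="70,40,130,20">world</WORD>
          </LINE>
        </PARAGRAPH></REGION></PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>
"""

if __name__ == "__main__":
    for text, box in extract_words(SAMPLE):
        print(text, box)
```

Real djvutoxml output may contain characters or entities that a strict XML parser rejects, so some cleanup before parsing could be needed.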