Last modified: 2014-10-08 16:01:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T30206, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 28206 - PDF generation does not support Complex Script Wikis (e.g. Indic languages) and needs to be re-written
Status: ASSIGNED
Product: MediaWiki extensions
Classification: Unclassified
Component: Collection
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 1 vote
Target Milestone: ---
Assigned To: C. Scott Ananian
Keywords: i18n
Duplicates: 20403 30508
Depends on: 68922
Blocks: 31552 32578 41348 56295
Reported: 2011-03-23 16:48 UTC by Mark A. Hershberger
Modified: 2014-10-08 16:01 UTC (History)
18 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Tamil Wikipedia's page rendered as pdf by ocg latex renderer (54.62 KB, application/pdf)
2014-07-30 04:31 UTC, Rahmanuddin Shaik
Bengali Wikipedia's page rendered as pdf by ocg latex renderer (36.90 KB, application/pdf)
2014-07-30 18:11 UTC, Jayanta Nath
Malayalam page (http://ml.wikipedia.org/wiki/മലയാളം) rendered as pdf by ocg latex renderer (172.33 KB, application/pdf)
2014-07-31 15:41 UTC, C. Scott Ananian
Tamil Wikipedia's page rendered as pdf by ocg latex renderer (3.85 MB, application/octet-stream)
2014-08-01 02:28 UTC, Srikanth Logic
Bengali Wikipedia's page rendered as pdf by ocg latex renderer (কমন জেন্ডার-দ্য ফিল্ম) (42.20 KB, application/pdf)
2014-08-05 18:52 UTC, Jayanta Nath

Description Mark A. Hershberger 2011-03-23 16:48:12 UTC
While support for complex script wikis (e.g. Indic languages) is lacking, developers I spoke with are trying to add that support.  Please incorporate this support.
Comment 1 Roan Kattouw 2011-03-23 16:50:37 UTC
Apparently there's this new package for building PDF that can handle Indic scripts, and someone built a proof-of-concept tool on top of it that generates PDFs from Wikipedia pages. See http://ultimategerardm.blogspot.com/2011/03/pdf-library-with-potential.html
Comment 2 Bugmeister Bot 2011-08-19 19:12:31 UTC
Unassigning default assignments. http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/54734
Comment 3 Mayur 2011-08-25 05:43:14 UTC
I have also registered a similar bug, https://bugzilla.wikimedia.org/show_bug.cgi?id=30508, for Hindi scripts.
Comment 4 Mark A. Hershberger 2011-08-25 19:03:53 UTC
Assigning this to Tomasz after triage so that he can bring this up with ErikM and, hopefully, find some resources to take care of this.
Comment 5 Mark A. Hershberger 2011-08-25 19:11:05 UTC
I should add that, during triage, we talked about opening a new bug: "pdf export should support indic languages and needs to be re-written".

I'll just update the title of this one.
Comment 6 Mark A. Hershberger 2011-08-25 21:44:42 UTC
*** Bug 30508 has been marked as a duplicate of this bug. ***
Comment 7 Srikanth Logic 2012-01-21 15:10:12 UTC
*** Bug 20403 has been marked as a duplicate of this bug. ***
Comment 8 Jayanta Nath 2012-01-21 15:26:54 UTC
https://bugzilla.wikimedia.org/show_bug.cgi?id=20403 very very old bug
Comment 9 Arjuna Rao Chavala 2012-06-02 13:41:12 UTC
The PediaPress view of the book (which can only be ordered) displays Telugu text properly on all text pages except the cover page, but the downloadable PDF still has rendering problems for Telugu.
Comment 10 វ័ណថារិទ្ធ 2013-03-20 13:07:07 UTC
I experienced the same issue with Khmer Unicode font in Khmer Wikipedia too.
Any update on this rendering issue for complex scripts?
Thank you!
Comment 11 Tomasz Finc 2013-03-20 17:45:21 UTC
Re-assigning back to default as I am not actively working on this.
Comment 12 Rahmanuddin Shaik 2013-05-25 06:15:40 UTC
Any update on this?
Comment 13 Andre Klapper 2013-05-25 13:20:59 UTC
Currently, nobody is actively working on this - See "Assigned To: Nobody" above.
In case anybody wants to / is able to help codewise, please check http://www.mediawiki.org/wiki/Developer_access
Comment 14 Rahmanuddin Shaik 2013-09-30 14:56:31 UTC
I would be interested in resolving this issue myself. Could someone please guide me?
I have checked that http://silpa.org.in/Render renders the text more correctly.
Comment 15 Jayanta Nath 2013-09-30 18:32:44 UTC
Hi Rahmanuddin, this does not render Bengali at all; it shows garbage.
Comment 16 Andre Klapper 2013-10-03 16:27:20 UTC
(In reply to comment #14)
> I would be interested in resolving this issue myself. Please someone guide me
> into it.

Does the docs/download/install info on https://www.mediawiki.org/wiki/Extension:Collection help? If not, you might want to contact the developers directly to pinpoint the relevant code area(s).
Comment 17 Siddhartha Ghai 2013-12-06 06:58:05 UTC
I just tried saving a PDF from Hindi Wikipedia. The PDF apparently contains the correct text, but the Devanagari combining marks are rendered incorrectly. Copy-pasting the text into other text processors renders it correctly.

As far as I know, the Collection extension[1] uses the PDF Writer extension[2] to create PDFs, which in turn uses the mwlib.rl library[3] for the PDF creation. The library uses the GNU FreeFont project's fonts.

I think there's possibly some problem with the GNU freefont coverage[4] which is causing the rendering issues. I may be wrong in this assessment though. It'd be nice if a developer could confirm the cause.

[1]: https://www.mediawiki.org/wiki/Extension:Collection
[2]: https://www.mediawiki.org/wiki/Extension:PDF_Writer
[3]: https://github.com/pediapress/mwlib.rl
[4]: http://www.gnu.org/software/freefont/coverage.html
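
The observation in comment 17 (correct text when copy-pasted, incorrect rendering of combining marks) can be illustrated with a small sketch. It only lists the combining marks stored in a Devanagari word; it does not confirm the cause inside mwlib.rl or FreeFont. The sample word and the helper name `combining_marks` are assumptions chosen for illustration.

```python
import unicodedata


def combining_marks(text):
    """Return (character, Unicode name, category) for every mark in text.

    Unicode general categories starting with "M" (Mn, Mc, Me) are marks
    that attach to a base character instead of standing alone.
    """
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for ch in text
        if unicodedata.category(ch).startswith("M")
    ]


# The word "Hindi" in Devanagari, written with escapes so the code points
# are explicit: HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch, name, category in combining_marks(word):
    print(f"U+{ord(ch):04X} {category} {name}")
```

The stored code points are in logical order, which is why copy-pasting works. Drawing them correctly is a separate step: for example, the vowel sign U+093F is displayed to the left of the consonant it follows in the text, and the virama U+094D joins consonants into a conjunct. That step depends on the shaping support of the renderer and the font.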
Comment 18 Rahmanuddin Shaik 2013-12-06 09:06:22 UTC
I suggest installing freely licensed fonts for each language, or at least for the prominent ones, on the server that does the rendering.
Comment 19 Andre Klapper 2013-12-06 10:29:25 UTC
Note that there are plans to rework the current code. See https://www.mediawiki.org/wiki/PDF_rendering
Comment 20 Rahmanuddin Shaik 2014-03-27 08:44:50 UTC
Any update on this?
Comment 21 C. Scott Ananian 2014-07-30 02:08:15 UTC
The new OCG renderer handles Indic scripts much better.  It is now enabled on the production wikis, although you need to use the 'Create a book' function in the sidebar to access it.
Comment 22 Rahmanuddin Shaik 2014-07-30 03:47:42 UTC
> The new OCG renderer handles Indic scripts much better.  It is now enabled on the production wikis, although you need to use the 'Create a book' function in the sidebar to access it.
No, it does not render properly! It's buggy and unreadable, with ligatures shown wide apart.
Comment 23 C. Scott Ananian 2014-07-30 04:07:24 UTC
Can you provide a specific page, along with details on the specific ligature which is incorrect?  Are you sure you are looking at the output of the "OCG latex renderer"?  (You need to use the "Create a book" function, and specifically select the "e-book (PDF, ocg latex renderer)" format.)
Comment 24 Rahmanuddin Shaik 2014-07-30 04:16:47 UTC
On Telugu Wikipedia (te.wikipedia.org), I added the home page itself as a book. An hour ago I got incorrect rendering. Now I get the following error:
Rendering failed
Generation of the document file has failed.

Status: Rendering process died with non zero code: 1
Comment 25 Rahmanuddin Shaik 2014-07-30 04:25:12 UTC
Another note: I checked on Hindi Wikipedia; it's working fine there.
Jayanta, please confirm for Bangla. I will check the remaining South Indian languages as well.
Comment 26 Rahmanuddin Shaik 2014-07-30 04:31:36 UTC
Created attachment 16101 [details]
Tamil Wikipedia's page rendered as pdf by ocg latex renderer

For Tamil, bold and heading characters are shown as question marks.
Comment 27 Andre Klapper 2014-07-30 09:41:21 UTC
In general: Please please provide exact and clear steps with URLs to reproduce problems, otherwise we might end up with misunderstandings and trying different things. Thanks :)
Comment 28 Jayanta Nath 2014-07-30 11:14:29 UTC
On Bengali Wikipedia we get the same error as on Telugu Wikipedia:
Rendering failed
Generation of the document file has failed.

Status: Rendering process died with non zero code: 1
Comment 29 C. Scott Ananian 2014-07-30 14:56:49 UTC
Please give me exact pages on the different Wikipedias; that helps me a lot and lets me add a reproducible test case.  If you just say "Tamil's wikipedia", I need to do extra work to figure out what the language code of Tamil is, and then try to find a reasonable test page without being able to read Tamil at all.

Note that the renderer currently has an issue with images in the PDF, which we are working on fixing.  "Rendering process died with non zero code: 1" seems to be that image bug.  So if you could find test pages without images on them, that would be helpful.
Comment 30 Andre Klapper 2014-07-30 15:18:50 UTC
BEFORE COMMENTING HERE: Please read comment 27, comment 29, and https://www.mediawiki.org/wiki/How_to_report_a_bug . Thank you!
Comment 31 Jayanta Nath 2014-07-30 18:11:38 UTC
Created attachment 16103 [details]
Bengali Wikipedia's page rendered as pdf by ocg latex renderer
Comment 32 Jayanta Nath 2014-07-30 18:13:00 UTC
As the attachment shows, complex words in Bengali are not rendered properly.
Comment 33 Rahmanuddin Shaik 2014-07-30 18:25:21 UTC
1. Go to https://te.wikipedia.org with Firefox version 31.
2. Pick a page with no images (I chose 1911).
3. Select Book creator in the left sidebar of the page (పుస్తకం కూర్పరిని అచేతనం చెయ్యి under the ముద్రించండి/ఎగుమతి చేయండి heading).
4. Start the book.
5. Add the 1911 page to the book.
6. Navigate to the book page, and then give it a title and subtitle.
7. Select ebook (PDF ocg latex renderer) in the dropdown under the దింపుకోండి heading.
8. Click on Export.

Expected Result : Book gets rendered and download link to pdf is given

Actual result : "Rendering failed

Generation of the document file has failed.

Status: Rendering process died with non zero code: 1 "
Comment 34 C. Scott Ananian 2014-07-30 19:07:58 UTC
Thanks for the good test case in comment 33.  We might eventually have to split this up into separate bugs for the Tamil, Bengali, and Telugu issues, but we can keep them together for now.

If you'd like to help debug the issues at a lower level, the new PDF backend consists of a "bundler" and a "renderer" portion, which are described at https://www.npmjs.org/package/mw-ocg-bundler and https://www.npmjs.org/package/mw-ocg-latexer and can be run standalone if you're brave.  I reproduced the issue described in comment 33 as follows:

$ mw-ocg-bundler -o tamil.zip -p tewiki 1911
$ mw-ocg-latexer -D -v -o tamil.pdf tamil.zip

which gave me the following error from xelatex:

! Package polyglossia Error: The current roman font does not contain the Telugu script!
(polyglossia)                Please define \telugufont with \newfontfamily.

So it looks like I need to find a good font covering Telugu.  Can you suggest one?
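
For anyone reproducing this locally, the polyglossia error quoted in comment 34 already names the font macro that is missing. The following is a minimal sketch, not part of mw-ocg-latexer, that extracts the script and macro name from such a log and builds a candidate fontspec declaration. The helper names `missing_font` and `font_declaration` are hypothetical, and "Lohit Telugu" is used only as a placeholder font name, not as the confirmed fix.

```python
import re

# Matches the two-line polyglossia error exactly as quoted in comment 34.
POLYGLOSSIA_ERROR = re.compile(
    r"! Package polyglossia Error: The current roman font does not contain "
    r"the (?P<script>\w+) script!\s*"
    r"\(polyglossia\)\s+Please define \\(?P<macro>\w+) with \\newfontfamily\."
)


def missing_font(log_text):
    """Return (script, macro) from a polyglossia font error, or None."""
    match = POLYGLOSSIA_ERROR.search(log_text)
    if match is None:
        return None
    return match.group("script"), match.group("macro")


def font_declaration(macro, font_name, script):
    """Build a fontspec \\newfontfamily line for the given macro and font."""
    return f"\\newfontfamily\\{macro}[Script={script}]{{{font_name}}}"


log = (
    "! Package polyglossia Error: The current roman font does not contain "
    "the Telugu script!\n"
    "(polyglossia)                Please define \\telugufont with "
    "\\newfontfamily.\n"
)

found = missing_font(log)
if found is not None:
    script, macro = found
    # "Lohit Telugu" is a placeholder; any installed font covering the script works.
    print(font_declaration(macro, "Lohit Telugu", script))
```

The generated line would go in the LaTeX preamble, and the named font has to be installed on the machine running xelatex.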
Comment 35 C. Scott Ananian 2014-07-30 19:09:31 UTC
And note that my commands above named the files tamil.zip and tamil.pdf, even though the bug is really about telugu support.  Whoops!
Comment 36 Gerrit Notification Bot 2014-07-30 19:27:06 UTC
Change 150634 had a related patch set uploaded by Cscott:
Use FAKESTYLES for FreeSerif.

https://gerrit.wikimedia.org/r/150634
Comment 37 C. Scott Ananian 2014-07-30 19:27:53 UTC
For Tamil I'm looking at:
https://ta.wikipedia.org/wiki/1911
and it appears that it is using the FreeSerif font for Tamil.  I've fixed the bad boldface issue -- but is there a better font for Tamil we could/should be using?
Comment 38 C. Scott Ananian 2014-07-30 19:29:22 UTC
It looks like Bengali is also using FreeSerif, which probably explains the issues with "complex words".  What font should we be using?
Comment 39 Gerrit Notification Bot 2014-07-30 19:37:03 UTC
Change 150634 merged by jenkins-bot:
Use FAKESTYLES for FreeSerif.

https://gerrit.wikimedia.org/r/150634
Comment 40 Jayanta Nath 2014-07-30 19:49:42 UTC
For Bengali Wikipedia (https://bn.wikipedia.org), you can use 'Lohit Bengali' (https://fedorahosted.org/lohit/) or Siyam Rupali (both are already available in ULS).
Comment 41 Andre Klapper 2014-07-30 19:55:58 UTC
The Language Engineering team might also have expertise in recommending fonts with sufficient support for Indic languages, based on their ULS experience. CC'ing Runa.
Comment 42 C. Scott Ananian 2014-07-30 20:49:51 UTC
It's not strictly comparable, since web typography has different constraints (file size, format, etc) which don't apply to the XeLaTeX engine -- and for the OCG servers it's important that the fonts in question are well-packaged for Ubuntu. But ULS is often a good start.
Comment 43 Rahmanuddin Shaik 2014-07-31 01:49:27 UTC
For Telugu, Lohit Telugu from the Lohit set of fonts could be used. Vemana could also be included. Both fonts are available in the ttf-telugu-fonts and ttf-indic-core-fonts packages in Ubuntu/Debian.
Comment 45 C. Scott Ananian 2014-07-31 15:40:09 UTC
Wikisource is untested.  It should work, but it's better to use Wikipedia test cases for now, so that we're dealing with one bug at a time.

The Malayalam test case I used was http://ml.wikipedia.org/wiki/Special:Redirect/revision/1852257 ; I've attached a copy of the PDF to the bug.  It uses the Rachana font.
Comment 46 C. Scott Ananian 2014-07-31 15:41:40 UTC
Created attachment 16110 [details]
Malayalam page (http://ml.wikipedia.org/wiki/മലയാളം) rendered as pdf by ocg latex renderer
Comment 47 C. Scott Ananian 2014-07-31 15:48:47 UTC
I spent most of yesterday working on this, but I had trouble finding appropriate fonts for Tamil, Telugu, and Bengali which also had coverage of the Latin code points.  When I use Lohit, for example, all the Latin-script numbers (and list bullets) render as tofu. :(

I have a good idea how this can be worked around; I've filed that as bug 68922.
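
One general way to deal with a script font that lacks Latin coverage is to split the text into runs and send each run to a font that covers it. The sketch below shows only that idea, under the assumption that membership in the Telugu Unicode block (U+0C00 to U+0C7F) is a good enough test; it is not necessarily the workaround proposed in bug 68922, and the names `needs_script_font` and `split_runs` are hypothetical.

```python
from itertools import groupby

# Telugu Unicode block: U+0C00 through U+0C7F.
TELUGU_BLOCK = range(0x0C00, 0x0C80)


def needs_script_font(ch):
    """True if the character falls inside the Telugu block."""
    return ord(ch) in TELUGU_BLOCK


def split_runs(text):
    """Group consecutive characters into ("telugu" | "fallback", run) pairs."""
    return [
        ("telugu" if is_telugu else "fallback", "".join(chars))
        for is_telugu, chars in groupby(text, key=needs_script_font)
    ]


# "1911 " followed by the word "Telugu" written in Telugu script.
sample = "1911 \u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"

for font_class, run in split_runs(sample):
    print(font_class, repr(run))
```

With runs labelled this way, digits, spaces, and list bullets can be typeset in a font with Latin coverage while the Telugu runs use the script font, so neither side falls back to tofu.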
Comment 49 praveenp 2014-07-31 18:22:11 UTC
Thanks. Malayalam rendering in the attached PDF is good with the Rachana font, although Windows users generally prefer the Anjali font. (Also, I am not able to find the Collection extension on ml.wp.)
Comment 50 Srikanth Logic 2014-08-01 02:28:41 UTC
Created attachment 16113 [details]
Tamil Wikipedia's page rendered as pdf by ocg latex renderer

Tamil now works fine, including bold and Latin characters (new file attached), although it would be better if we could find a better font. I shall check the output with Meera-Tamil. Thanks.
Comment 51 Jayanta Nath 2014-08-01 08:12:54 UTC
For Bengali Wikipedia I found the same problem: complex letters are not rendered properly, maybe due to a font issue. Could you please add Lohit Bengali or Siyam Rupali so I can test it?
Comment 52 Gerrit Notification Bot 2014-08-02 20:27:21 UTC
Change 151360 had a related patch set uploaded by Cscott:
Use Lohit fonts when possible.

https://gerrit.wikimedia.org/r/151360
Comment 53 Gerrit Notification Bot 2014-08-03 08:44:07 UTC
Change 151360 merged by jenkins-bot:
Use Lohit fonts when possible.

https://gerrit.wikimedia.org/r/151360
Comment 54 Jayanta Nath 2014-08-05 18:52:31 UTC
Created attachment 16140 [details]
Bengali Wikipedia's page rendered as pdf by ocg latex renderer (কমন জেন্ডার-দ্য ফিল্ম)

Today I checked again, but the output has the same issue: complex letters are not rendered properly. It is still using the FreeSerif font instead of Lohit Bengali.
Comment 55 Nemo 2014-09-25 11:53:12 UTC
(In reply to C. Scott Ananian from comment #38)
> It looks like Bengali is also using FreeSerif, which probably explains the
> issues with "complex words".  What font should we be using?

For FreeSerif see also these bugs
Comment 57 C. Scott Ananian 2014-09-25 15:19:40 UTC
I did a session with Indic Wikipedians at wikimedia.  I believe I have fixed all the bugs in our Indic languages, but I have not yet patched the polyglossia package in production.  (The current PDFs don't use the correct font.)  Working on it.

I don't have any expertise with Burmese, however.  If a native reader can volunteer to check rendering, I can probably fix any bugs which exist.
Comment 58 C. Scott Ananian 2014-10-08 16:00:00 UTC
This bug has grown unwieldy, and the new OCG renderer fixes most of these problems.

Please open new bugs for specific issues with specific languages, after confirming that they still exist.

After a week or so, I'll close this bug.
