Last modified: 2013-06-18 16:27:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19766, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17766 - Problematic book export to PDF/ODF for bidi documents
Status: RESOLVED DUPLICATE of bug 30326
Product: MediaWiki extensions
Classification: Unclassified
Component: Collection
Version: unspecified
Hardware: PC Windows NT
Importance: Lowest enhancement with 4 votes
Target Milestone: ---
Assigned To: PediaPress Development Team
Keywords: i18n
Duplicates: 23893
Depends on:
Blocks: 28708 24466
Reported: 2009-03-02 23:22 UTC by Itay P
Modified: 2013-06-18 16:27 UTC (History)


Attachments
PDF and ODF files of bidi document, part of a Hebrew book from the Hebrew wikibooks (167.03 KB, application/octet-stream)
2009-03-02 23:22 UTC, Itay P
RTL & pdf (31.12 KB, image/jpeg)
2009-06-02 11:32 UTC, lɛʁi לערי ריינהארט
Arabic page export from ar.wikibooks.org (189.32 KB, application/pdf)
2011-08-04 14:36 UTC, Brion Vibber
Screenshot of misplaced hebrew nukta vs web rendering (52.00 KB, image/png)
2011-08-04 23:09 UTC, Brion Vibber
Fixed screenshot (65.51 KB, image/png)
2011-08-04 23:10 UTC, Brion Vibber

Description Itay P 2009-03-02 23:22:55 UTC
Created attachment 5879 [details]
PDF and ODF files of bidi document, part of a Hebrew book from the Hebrew wikibooks

When exporting a collection of bidi documents (downloading the collection), there are problems with both the PDF and the ODF files:
- Text in PDF files is mirrored (ordered from left to right instead of right to left). For example, instead of תכנות מתקדם, it is written םדקתמ תונכת.
- Headers inside the document are displayed as blank squares (probably a font without the required glyphs was used).
- ODF files are marked as LTR documents instead of RTL.
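
A small illustration of the first symptom, added for clarity and not taken from the export code: the mirrored string quoted above is exactly the logical string with its characters in reverse order. Python 3 is assumed, and the variable names are made up for this sketch.

```python
# The title in logical order (the order in which it is read and stored).
# Escapes are used so that bidi display in an editor cannot scramble the line.
logical = "\u05ea\u05db\u05e0\u05d5\u05ea \u05de\u05ea\u05e7\u05d3\u05dd"

# Plain character-by-character reversal. This only reproduces the symptom;
# it is an assumption, not a description of what the exporter does internally.
mirrored = logical[::-1]

print(logical)
print(mirrored)
```

Reversing the mirrored string again gives back the original, which is a quick way to confirm that a sample suffers from this problem and not from a font or encoding issue.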
Comment 1 Brion Vibber 2009-03-02 23:31:48 UTC
Yeah, RTL is not currently working in the PDF export... additionally, character shaping doesn't happen for Arabic script.
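
To make "character shaping" concrete, here is a minimal sketch (Python 3, standard library only; not related to the export code). An Arabic letter is stored as one abstract code point, but it is drawn with a different glyph depending on its neighbours. Unicode happens to encode those shapes as compatibility presentation forms, which makes them easy to list.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter as stored in wiki text

# The four contextual shapes of BEH, encoded as presentation forms.
# A shaping step has to pick one of these per letter when laying out text.
forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for form in forms:
    print(f"U+{ord(form):04X}", unicodedata.name(form))

# Compatibility normalization folds every shape back to the abstract letter.
folded = {unicodedata.normalize("NFKC", form) for form in forms}
```

Without shaping, every letter is drawn in a single default shape regardless of position, which is why unshaped Arabic text looks disconnected.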

The ODF bit might actually be an easier fix, if it mainly comes to marking the document language/direction... though embedded LTR bits might be a problem.
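
As a rough sketch of what "marking the direction" could look like in ODF: the OpenDocument format has a `style:writing-mode` attribute, and `rl-tb` means right-to-left, top-to-bottom. The snippet below builds such a paragraph style with the Python 3 standard library. The style name `RtlBody` and the helper function are hypothetical, and this is not how the ODF writer in question is known to work.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF "style" vocabulary.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
ET.register_namespace("style", STYLE_NS)


def rtl_paragraph_style(name):
    """Build a paragraph style whose writing mode is right-to-left."""
    style = ET.Element(
        f"{{{STYLE_NS}}}style",
        {f"{{{STYLE_NS}}}name": name, f"{{{STYLE_NS}}}family": "paragraph"},
    )
    ET.SubElement(
        style,
        f"{{{STYLE_NS}}}paragraph-properties",
        {f"{{{STYLE_NS}}}writing-mode": "rl-tb"},
    )
    return style


xml_text = ET.tostring(rtl_paragraph_style("RtlBody"), encoding="unicode")
print(xml_text)
```

Embedded LTR runs would still need their own handling, which is the harder part mentioned above.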
Comment 2 Rochus Schirmer 2009-05-16 10:10:54 UTC
While creating a PDF version, the following error has been occurring for a few days:

POST request failed
From Wikipedia, the free encyclopedia
Jump to: navigation, search

The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed (Empty reply from server).

Return to the page Wikipedia:Hauptseite.

Comment 3 Roan Kattouw 2009-05-16 17:25:58 UTC
(In reply to comment #2)
> While creating a PDF version, the following error has been occurring for a few days:
> 
> POST request failed
> From Wikipedia, the free encyclopedia
> Jump to: navigation, search
> 
> The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed
> (Empty reply from server).
> 
> Return to the page Wikipedia:Hauptseite.
> 

This has also been filed as bug 18816
Comment 4 lɛʁi לערי ריינהארט 2009-06-02 11:32:52 UTC
Created attachment 6182 [details]
RTL & pdf

sample rendering from
http://test.wikipedia.org/w/index.php?oldid=66359
Comment 5 Siebrand Mazeland 2011-06-19 08:42:55 UTC
*** Bug 23893 has been marked as a duplicate of this bug. ***
Comment 6 Siebrand Mazeland 2011-08-02 12:14:26 UTC
Can someone from PediaPress clarify the root cause of this issue? After two and a half years I'd like to get this resolved, preferably sooner rather than later, and it is not clear to me where the issue is coming from.
Comment 7 Brion Vibber 2011-08-03 07:18:33 UTC
I believe it needs support for bidirectional text and complex scripts in the underlying ReportLab library that does the PDF output.

Some googling indicates there are at least some RTL/bidi patches around, some of which may or may not have been merged upstream. Last I saw, though, using fribidi for Arabic shaping was sufficient only for the Arabic language, not for other languages written in Arabic script, such as Farsi and Pashto.
Comment 8 Siebrand Mazeland 2011-08-03 07:47:13 UTC
OK. Since PediaPress does not appear to be interested in getting this fixed, we should just disable the Collection extension on wikis where it does not work because of missing script support. I have checked the Arabic and Hebrew Wikipedias, and Collection is not enabled there, so that appears to be fine.
Comment 9 Amir Ladsgroup 2011-08-03 07:58:14 UTC
We hid it on fa.wp using JavaScript.
Comment 10 Volker Haas 2011-08-03 08:09:57 UTC
The problem is not that PediaPress is not interested in fixing this. The problem is that the framework we are using to generate PDFs (reportlab) does not support right-to-left languages properly. I have been in discussion with the developers and some Israelis (mainly for QA) and we have made slow progress. Now it looks as if all major issues have been ironed out and the PDF export is ready for Hebrew at least. Check the sample that I rendered ~2 days ago [1]. Beware that all contents are random, since I do not speak or read Hebrew.
So far I have some feedback from Hebrew speakers who think that the quality is good and that the PDF export is ready for production use for Hebrew. As a matter of fact, we contacted the WMF a week ago and informed them that we think the PDF export is now ready to go live on the Hebrew Wikipedia; activation of the Collection extension on the Hebrew Wikipedia is scheduled for later today.

Please keep in mind that the PDF export is open source software and everybody can contribute. I have already spent a substantial amount of time improving RTL support, but as someone who neither speaks nor reads any of these languages, I make progress slowly.

Please contribute!

* http://code.pediapress.com/git/mwlib.rl (rtl_support branch)
* http://code.pediapress.com/git/?p=mwlib.ext/.git (rtl_support branch)

My last tests indicated that there are still some problems for Arabic. If anybody is willing to contribute, I'd be happy to accept patches.

[1] http://pediapress.com/files/he/sample_1.pdf
Comment 11 Volker Haas 2011-08-03 08:45:29 UTC
This bug report specifically mentioned one article [1]. I fixed the remaining issue (wrong direction in source nodes) with [2]. I updated the render servers. The article now looks correct to me (at least the squiggly lines in my browser closely resemble the ones in the PDF).

I am closing this ticket. Please open new ones for specific Hebrew issues.


[1]http://he.wikibooks.org/wiki/%D7%AA%D7%9B%D7%A0%D7%95%D7%AA_%D7%9E%D7%AA%D7%A7%D7%93%D7%9D_%D7%91-Java/%D7%96%D7%A8%D7%9E%D7%99%D7%9D
[2] http://code.pediapress.com/git/mwlib.rl?p=mwlib.rl;a=commit;h=bde6f53e9159a8869a8b1a6d529d657fd2ca3d5a
Comment 12 Brion Vibber 2011-08-03 08:52:21 UTC
I still see wrong direction in the source nodes for that page.

Preformatted sections are showing left-aligned, but in a variable-width font instead of a monospace one.

The final template on the page seems to render as a tiny table containing only '-', '"', and '-' respectively in each cell.

Other than that it looks about right; the bidi layout generally seems to match what I see in a modern browser. Hebrew is simpler than Arabic though as it doesn't have the glyph shaping / ligature things, so that needs to be clearly tested as well.
Comment 13 Volker Haas 2011-08-03 09:10:53 UTC
(In reply to comment #12)
> I still see wrong direction in the source nodes for that page.

The wrong direction in the source nodes is a caching issue. If you want to make sure that the latest version of the software is used you need to make a (one article) collection and not render by using the "download as PDF" link. That is unfortunate, but I don't know how to fix it, but the problem solves itself pretty quickly...

> 
> Preformatted sections are showing left-aligned, but in a variable-width font
> instead of a monospace one.

The problem is that the text in question is not recognized as a preformatted section at all... I'll investigate; it seems to be an mwlib parser issue.

> 
> The final template on the page seems to render as a tiny table containing only
> '-', '"', and '-' respectively in each cell.

I'll check that. To be honest, I think that table should probably not be printed at all, since it seems to be some navigational template. 

> 
> Other than that it looks about right; the bidi layout generally seems to match
> what I see in a modern browser. Hebrew is simpler than Arabic though as it
> doesn't have the glyph shaping / ligature things, so that needs to be clearly
> tested as well.
Comment 14 Volker Haas 2011-08-04 14:13:01 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > 
> > Preformatted sections are showing left-aligned, but in a variable-width font
> > instead of a monospace one.
> 
> The problem is that the text in question is not recognized as preformatted
> sections at all... I'll investigate; it seems to be an mwlib parser issue.
> 

This is now fixed with http://code.pediapress.com/git/mwlib?p=mwlib;a=commit;h=dc8311e85de779d991fe34d7f09879006801a998
Comment 15 Siebrand Mazeland 2011-08-04 14:15:50 UTC
Great. Does Wikimedia need to do anything to get this deployed, or is this all on your end?
Comment 16 Brion Vibber 2011-08-04 14:35:13 UTC
I've filed a minor additional bug with the page footer in Hebrew as bug 30223.

Arabic support however seems to still be very problematic.

I tried exporting a random page from ar.wikibooks.org:
https://secure.wikimedia.org/wikibooks/ar/wiki/%D8%B3%D9%84%D9%81%D9%86%D9%8A_3_%D8%AC%D9%86%D9%8A%D9%87:_%D8%A7%D9%84%D8%A5%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA_%D9%88%D8%A7%D9%84%D9%85%D8%AC%D8%AA%D9%85%D8%B9_%D9%81%D9%8A_%D9%85%D8%B5%D8%B1

The text positioning is completely wrong; instead of being right-justified things seem to float somewhere in the middle. This may indicate incorrect handling of font metrics?

There also appear generic box characters in a large number of places (I think where a zero-width non-joiner appears, which happens *a lot* on that page).
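
If the boxes really are zero-width non-joiners (a guess, as noted above), the offending positions can be located with a few lines of Python 3. This is only a sketch for building a minimal test case: the helper name and the sample string are made up, and ZWNJ is found here through its Unicode general category `Cf` (format character). A format character controls joining and should not be drawn as a glyph of its own.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def invisible_format_chars(text):
    """Return (index, name) for each character in general category Cf."""
    return [
        (i, unicodedata.name(ch, "?"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Made-up sample: two letters, a ZWNJ, then three more letters.
sample = "\u0645\u06cc" + ZWNJ + "\u0634\u0648\u062f"
found = invisible_format_chars(sample)
print(found)
```

Simply deleting these characters before rendering would remove the boxes but would also change how the neighbouring letters join, so that is not a real fix.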

Other than the boxes, shaping looks more or less ok but I can't read Arabic myself so I could easily be missing some additional details.

PDF attachment to follow.
Comment 17 Brion Vibber 2011-08-04 14:36:21 UTC
Created attachment 8884 [details]
Arabic page export from ar.wikibooks.org

See above comment describing rendering bugs.
Comment 18 Brion Vibber 2011-08-04 14:37:47 UTC
I can confirm that the directionality on source nodes has been fixed with a forced reload of the Hebrew page. The font fix for <pre> sections I assume hasn't been installed yet on the generator server.
Comment 19 Brion Vibber 2011-08-04 23:09:10 UTC
Created attachment 8888 [details]
Screenshot of misplaced hebrew nukta vs web rendering

Hebrew also has failures -- when combining characters for vowel markings (nukta) are used, they do not combine, but stack up as separate characters along the line.
Comment 20 Brion Vibber 2011-08-04 23:10:19 UTC
Created attachment 8889 [details]
Fixed screenshot

Not sure what broke on the previous image.
Comment 21 Volker Haas 2011-08-05 06:49:35 UTC
(In reply to comment #15)
> Great. Does Wikimedia need to do anything to get this deployed, or is this all
> on your end?

Deployment of the rendering software is done by me. Activation of the Collection extension is done by WMF.
Comment 22 Volker Haas 2011-08-05 09:39:29 UTC
I investigated the 'nukta issue'.

Some preliminary remarks:

We are using a Python library that implements the bidi algorithm. This algorithm basically reorders characters from their logical order (the order in which they are stored) to their visual order. The library uses the fribidi C library. Details can be found at [1].
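
For readers unfamiliar with the bidi algorithm: it works on a per-character property, the bidirectional class. The sketch below (Python 3, standard library only, not using fribidi) prints that class for each character of the word examined later in this comment. Letters are class `R` (right-to-left), while the vowel marks are `NSM` (non-spacing mark), and the marks are what the rest of this comment is about.

```python
import unicodedata

# The word from this comment, in logical (storage) order.
word = "\u05d7\u05b7\u05d9\u05b0\u05e4\u05b7\u05d0"

# Pair each code point (decimal, as in the listings here) with its bidi class.
classes = [(ord(ch), unicodedata.bidirectional(ch)) for ch in word]

for code, bidi_class in classes:
    print(code, bidi_class)
```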

After the tests I have done, I believe the fribidi library screws up when reordering:

word investigated:
חַיְפַא

logical ordering (this is how the string is stored)
ח 1495
 1463
י 1497
 1456
פ 1508
 1463
א 1488
ERRONEOUS transformation by fribidi
א 1488
פ 1508
 1463
י 1497
 1456
ח 1495
 1463
correct transformation (manually transformed):
א 1488
 1463
פ 1508
 1456
י 1497
 1463
ח 1495

I checked the manual transformation in the PDF and the result is as expected (same as in the browser).
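
The two orderings listed above can be reproduced without fribidi. The following is a sketch in Python 3 with made-up function names; it does not call fribidi and only mimics the two results. A plain reversal of the stored string gives the manually transformed sequence, while a reversal that keeps each mark attached after its base letter gives the sequence reported for fribidi.

```python
import unicodedata


def reverse_plain(text):
    """Reverse code point by code point; marks end up before their base."""
    return text[::-1]


def reverse_keep_marks_after_base(text):
    """Reverse base+mark clusters, so each mark stays after its base letter."""
    clusters = []
    for ch in text:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "\u05d7\u05b7\u05d9\u05b0\u05e4\u05b7\u05d0"  # logical order

plain = [ord(c) for c in reverse_plain(word)]
clustered = [ord(c) for c in reverse_keep_marks_after_base(word)]
print(plain)
print(clustered)
```

Which of the two a renderer needs depends on how it positions combining marks, so neither output is wrong in isolation.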

Minimal example in python:

first install pyfribidi: easy_install pyfribidi

then run python (or ipython):

----
In [35]: import pyfribidi2

In [36]: text = unicode('חַיְפַא', 'utf-8')

In [37]: bidi_trans = lambda t: pyfribidi2.log2vis(t, base_direction=pyfribidi2.RTL)

In [38]: for c in bidi_trans(text): print c, ord(c)
   ....: 
א 1488
פ 1508
 1463
י 1497
 1456
ח 1495
 1463
----

To me it looks as if the fribidi library needs to be fixed. Help is welcome ;)

[1] http://pypi.python.org/pypi/pyfribidi/0.10.0
Comment 23 Brion Vibber 2011-08-05 12:17:01 UTC
Hmmm.... the fribidi transformation actually looks legit to me; the combining characters should appear after their base characters in the stream, same as in Latin ("e", "combining acute" -> renders like "é")
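
The Latin comparison can be checked directly (a Python 3 sketch using only the standard library). "e" followed by a combining acute is two code points that happen to have a precomposed equivalent. A Hebrew letter followed by a vowel mark has no such precomposed form, so a renderer cannot sidestep the problem by normalizing and must position the mark itself.

```python
import unicodedata

# Latin: base letter followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# Hebrew: HET followed by POINT PATAH. NFC leaves the pair unchanged.
hebrew = "\u05d7\u05b7"
hebrew_nfc = unicodedata.normalize("NFC", hebrew)

print(len(decomposed), len(composed), len(hebrew_nfc))
```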

It looks like fribidi deliberately switched *to* keeping the combining characters logically after their base characters some years ago; here's some old threads on the subject:

http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html
http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html

It may be that something specific about how the underlying PDF library handles fonts and combining characters could be incorrectly pushing them to the right of their base characters; or it may simply not support combining characters and so is inserting them visually in the logical order as if they were their own letters...?
Comment 24 reza1615 2011-08-09 09:48:02 UTC
Hi, I checked it on fa.wikipedia and fa.wikibooks; these are some samples:
1-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=e600cdf555951c7b&writer=rl&return_to=%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86%3A%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87+%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
2-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=8070f397f7a74170&writer=rl&return_to=%D8%A7%D8%AE%D9%84%D8%A7%D9%82+%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
3-http://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=c232c1e277b29a56&writer=rl&return_to=%DA%A9%D8%A7%D9%81%DB%8C%E2%80%8C%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA
They have some bugs:
1-all of them have the LTF problem
2-case 1,3 have problem with ل.
3-in case 3 it has problem with اً in معمولاً
4-in ar.wikibooks.org they don't have center direction! where we can change PDF text direction to LTF
5-In the last page ﻫﺎﻫﺎ ﻭ ﻣﺸﺎﺭﮐﺖﻣﻨﺎﺑﻊ ﻣﻘﺎﻟﻪ is incorrect it must be
 مشارکت‌ها و منابع مقاله‌ها
6-Infobox in case 3 has problem it must be like http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D9%81%DB%8C_%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA
Comment 25 reza1615 2011-08-09 09:52:30 UTC
Excuse me, I made a mistake in my last report: LTF => RTF
Comment 26 Behdad Esfahbod 2011-08-09 12:38:27 UTC
FriBidi has an option to NOT reorder non-spacing marks.  The problem is, for correct rendering of complex text FriBidi is not enough.  You can add more heuristics, but it will never be the real thing.
Comment 27 reza1615 2011-08-09 21:25:42 UTC
(In reply to comment #24)
> Hi, I checked it on fa.wikipedia and fa.wikibooks; these are some samples:
> 1-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=e600cdf555951c7b&writer=rl&return_to=%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86%3A%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87+%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
> 2-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=8070f397f7a74170&writer=rl&return_to=%D8%A7%D8%AE%D9%84%D8%A7%D9%82+%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
> 3-http://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=c232c1e277b29a56&writer=rl&return_to=%DA%A9%D8%A7%D9%81%DB%8C%E2%80%8C%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA
> They have some bugs:
> 1-all of them have the LTF problem
> 2-case 1,3 have problem with ل.
> 3-in case 3 it has problem with اً in معمولاً
> 4-in ar.wikibooks.org they don't have center direction! where we can change PDF
> text direction to LTF
> 5-In the last page ﻫﺎﻫﺎ ﻭ ﻣﺸﺎﺭﮐﺖﻣﻨﺎﺑﻊ ﻣﻘﺎﻟﻪ is incorrect it must be
>  مشارکت‌ها و منابع مقاله‌ها
> 6-Infobox in case 3 has problem it must be like
> http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D9%81%DB%8C_%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA

۱-http://fa.wikibooks.org/wiki/%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86:%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87_%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
۲-http://fa.wikibooks.org/wiki/%D8%A7%D8%AE%D9%84%D8%A7%D9%82_%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
Comment 28 Volker Haas 2011-08-10 06:48:52 UTC
(In reply to comment #23)
> Hmmm.... the fribidi transformation actually looks legit to me; the combining
> characters should appear after their base characters in the stream, same as in
> Latin ("e", "combining acute" -> renders like "é")
> 
> It looks like fribidi deliberately switched *to* keeping the combining
> characters logically after their base characters some years ago; here's some
> old threads on the subject:
> 
> http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html
> http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html
> 
> It may be that something specific about how the underlying PDF library handles
> fonts and combining characters could be incorrectly pushing them to the right
> of their base characters; or it may simply not support combining characters and
> so is inserting them visually in the logical order as if they were their own
> letters...?

Thanks for the info Brion. If you are right then the error is indeed in the "low level rendering" done by the PDF framework. I started investigating the reportlab source code...

To everyone else:

* could you please always provide minimal examples (shortest possible markup example which exposes some problem)
* clearly describe what you expect vs. what you get
* open separate tickets for separate problems
* keep problems related to different scripts in different tickets (right now I am focusing on Hebrew; after that I'll start with Arabic)

Thanks!
Comment 29 Volker Haas 2011-08-11 11:00:14 UTC
Thanks for the info regarding the fribidi library, Behdad! If I read the relevant part of the Unicode spec correctly it has to be expected that some software can't deal with reordered non-spacing marks [1]. Therefore it seems valid to not reorder them. I made the necessary change in pyfribidi [2] and mwlib [3].

The issue Brion raised originally should be fixed. I tested the following page for correctness: [4].

I am closing this ticket now. Please open specific tickets for other issues.

[1] section 5.13 in http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
[2] http://pypi.python.org/pypi/pyfribidi
[3] http://code.pediapress.com/git/?p=mwlib.ext/.git;a=commit;h=e4bba86023c78ac800362dfe59edfecb2ff3adbb
[4] http://he.wikipedia.org/w/index.php?title=%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Volker.haas&oldid=10980308
Comment 30 Amir Ladsgroup 2011-08-11 20:40:00 UTC
fixed?
I tested it on a random page in Persian wikipedia
http://fa.wikipedia.org/wiki/%D9%BE%D9%84%D8%A7%D8%B3%D9%85%D8%A7%DB%8C_%DA%A9%D9%88%D8%A7%D8%B1%DA%A9_%DA%AF%D9%84%D9%88%D8%A6%D9%88%D9%86

The output is still LTR and has some problems with the character ل
Comment 31 reza1615 2011-08-11 20:43:04 UTC
(In reply to comment #30)
> fixed?
> I tested it on a random page in Persian wikipedia
> http://fa.wikipedia.org/wiki/%D9%BE%D9%84%D8%A7%D8%B3%D9%85%D8%A7%DB%8C_%DA%A9%D9%88%D8%A7%D8%B1%DA%A9_%DA%AF%D9%84%D9%88%D8%A6%D9%88%D9%86
> 
> The output is still LTR and has some problems with the character ل

the problem is with لا not ل
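
Some background on why this particular pair stands out, as a Python 3 sketch using only the standard library: lam followed by alef is normally drawn as a single ligature glyph, not as two joined letters. Unicode encodes that ligature as a compatibility character whose decomposition is exactly the two letters, so a renderer that handles single letters but has no ligature step would be expected to fail here first. That reading of the symptom is an assumption, not a confirmed diagnosis.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"
ligature = "\ufefb"  # presentation form for the lam-alef ligature

name = unicodedata.name(ligature)
decomp = unicodedata.decomposition(ligature)
unfolded = unicodedata.normalize("NFKC", ligature)

print(name)
print(decomp)
```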
Comment 32 Amir Ladsgroup 2011-08-11 20:52:12 UTC
Another bug is the very large file size.
The PDF version of the article "تهران" ("Tehran") is 19 MB!
Is there a way to make the PDFs smaller? Almost all readers of the Persian Wikipedia are Iranian, and the law in Iran does not let ordinary people have an internet connection faster than 128 kb/s :(
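
For scale, a back-of-the-envelope calculation in Python 3, taking the figures in this comment at face value. The unit readings are assumptions: 19 megabytes of 10^6 bytes each, and a line speed of 128 kilobits per second with no protocol overhead.

```python
size_bytes = 19 * 10**6       # assumed: 19 MB, decimal megabytes
rate_bits_per_s = 128 * 10**3  # assumed: 128 kilobits per second

# Convert the size to bits, then divide by the line rate.
seconds = size_bytes * 8 / rate_bits_per_s
minutes = seconds / 60

print(f"{seconds:.1f} s, i.e. {minutes:.1f} min")
```

Under these assumptions a single article takes close to twenty minutes to download.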
Comment 33 Siebrand Mazeland 2011-08-11 20:58:39 UTC
(In reply to comment #32)
(In reply to comment #30)
Please open a new issue for this. Comment 32 is not related to this issue, and comment 30 is an issue that is a lot smaller than what we originally started off with.
Comment 34 Amir Ladsgroup 2011-08-11 21:06:23 UTC
(In reply to comment #33)
> (In reply to comment #32)
> (In reply to comment #30)
> Please open a new issue for this. Comment 32 is not related to this issue, and
> comment 30 is an issue that is a lot smaller than what we originally started
> off with.

I filed bug 23893, which was marked as a duplicate of this bug. Please reopen this bug or bug 23893.
Comment 35 reza1615 2011-08-11 21:11:38 UTC
(In reply to comment #33)
> (In reply to comment #32)
> (In reply to comment #30)
> Please open a new issue for this. Comment 32 is not related to this issue, and
> comment 30 is an issue that is a lot smaller than what we originally started
> off with.
I opened [https://bugzilla.wikimedia.org/show_bug.cgi?id=30326 bug 30326]

*** This bug has been marked as a duplicate of bug 30326 ***
