Last modified: 2013-06-18 16:27:40 UTC
Created attachment 5879 [details]
PDF and ODF files of bidi document, part of a Hebrew book from the Hebrew wikibooks

When trying to export a collection of bidi documents (download the collection), there are problems with both the PDF and ODF files:
- Text in PDF files is mirrored (ordered from left to right instead of right to left). For example, instead of תכנות מתקדם, it is written םדקתמ תונכת.
- Headers inside the document are displayed as blank squares (probably an unsuitable font was used).
- ODF files are marked as LTR documents instead of RTL.
Yeah, RTL is not currently working in the PDF export... additionally, character shaping doesn't happen for Arabic script. The ODF bit might actually be an easier fix, if it mainly comes to marking the document language/direction... though embedded LTR bits might be a problem.
While creating a PDF version, the following error has been occurring for a few days:

POST request failed
From Wikipedia, the free encyclopedia
Jump to: navigation, search

The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed (Empty reply from server).

Return to the page Wikipedia:Hauptseite.
(In reply to comment #2)
> While creating a PDF version, the following error has been occurring for a few days:
>
> POST request failed
> From Wikipedia, the free encyclopedia
> Jump to: navigation, search
>
> The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed
> (Empty reply from server).
>
> Return to the page Wikipedia:Hauptseite.
>

This has also been filed as bug 18816
Created attachment 6182 [details] RTL & pdf sample rendering from http://test.wikipedia.org/w/index.php?oldid=66359
*** Bug 23893 has been marked as a duplicate of this bug. ***
Can someone from PediaPress clarify the root cause of this issue? After two and a half years I'd like to get this resolved, preferably sooner rather than later, and it is not clear to me where the issue is coming from.
I believe it needs support for bidirectional text and complex scripts in the underlying ReportLab library that does the PDF output. Some googling indicates there are at least some RTL/bidi patches around, some of which may or may not have gotten merged upstream. Last I saw though using fribidi for Arabic shaping was sufficient only for Arabic language, not for other languages in Arabic script like Farsi and Pashto.
OK. Since PediaPress does not appear to be at all interested in getting this fixed, we should just disable the Collection extension on wikis where it does not work because of missing script support. I have checked the Arabic and Hebrew Wikipedias, and Collection is not enabled there, so that appears to be fine.
we hid it in fa.wp by javascript
The problem is not that PediaPress is not interested in fixing this. The problem is that the framework we are using to generate PDFs (reportlab) does not support right-to-left languages properly. I have been in discussion with the developers and some Israelis (for QA mainly) and we have made slow progress.

Now it looks as if all major issues have been ironed out and that the PDF export is ready for Hebrew at least. Check the sample that I rendered ~2 days ago [1]. Beware that all contents are random since I do not speak or read Hebrew. So far I have some feedback from Hebrew-speaking persons who think that the quality is good and that the PDF export is ready for production use for Hebrew. As a matter of fact, we contacted the WMF a week ago and informed them that we think the PDF export is now ready to go live in the Hebrew Wikipedia, and we scheduled activation of the Collection extension in the Hebrew Wikipedia for later today.

Please keep in mind that the PDF export is open source software and everybody can contribute. I have already spent a substantial amount of time improving RTL support, but as someone who can neither speak nor read any of these languages I make progress slowly. Please contribute!

* http://code.pediapress.com/git/mwlib.rl (rtl_support branch)
* http://code.pediapress.com/git/?p=mwlib.ext/.git (rtl_support branch)

My last tests indicated that for Arabic there are still some problems. If anybody is willing to contribute I'd be happy to accept patches.

[1] http://pediapress.com/files/he/sample_1.pdf
This bug report specifically mentioned one article [1]. I fixed the remaining issue (wrong direction in source nodes) with [2] and updated the render servers. The article now looks correct to me (at least the squiggly lines in my browser closely resemble the ones in the PDF).

I am closing this ticket. Please open new ones for specific issues with Hebrew.

[1] http://he.wikibooks.org/wiki/%D7%AA%D7%9B%D7%A0%D7%95%D7%AA_%D7%9E%D7%AA%D7%A7%D7%93%D7%9D_%D7%91-Java/%D7%96%D7%A8%D7%9E%D7%99%D7%9D
[2] http://code.pediapress.com/git/mwlib.rl?p=mwlib.rl;a=commit;h=bde6f53e9159a8869a8b1a6d529d657fd2ca3d5a
I still see wrong direction in the source nodes for that page. Preformatted sections are showing left-aligned, but in a variable-width font instead of a monospace one. The final template on the page seems to render as a tiny table containing only '-', '"', and '-' respectively in each cell. Other than that it looks about right; the bidi layout generally seems to match what I see in a modern browser. Hebrew is simpler than Arabic though as it doesn't have the glyph shaping / ligature things, so that needs to be clearly tested as well.
(In reply to comment #12)
> I still see wrong direction in the source nodes for that page.

The wrong direction in the source nodes is a caching issue. If you want to make sure that the latest version of the software is used, you need to make a (one article) collection instead of rendering via the "download as PDF" link. That is unfortunate and I don't know how to fix it, but the problem resolves itself pretty quickly as the cache expires.

>
> Preformatted sections are showing left-aligned, but in a variable-width font
> instead of a monospace one.

The problem is that the text in question is not recognized as preformatted sections at all. I'll investigate; it seems like a mwlib parser issue.

>
> The final template on the page seems to render as a tiny table containing only
> '-', '"', and '-' respectively in each cell.

I'll check that. To be honest, I think that table should probably not be printed at all, since it seems to be some navigational template.

>
> Other than that it looks about right; the bidi layout generally seems to match
> what I see in a modern browser. Hebrew is simpler than Arabic though as it
> doesn't have the glyph shaping / ligature things, so that needs to be clearly
> tested as well.
(In reply to comment #13)
> (In reply to comment #12)
> >
> > Preformatted sections are showing left-aligned, but in a variable-width font
> > instead of a monospace one.
>
> The problem is that the text in question is not recognized as preformatted
> sections at all. I'll investigate; it seems like a mwlib parser issue.
>

This is now fixed with http://code.pediapress.com/git/mwlib?p=mwlib;a=commit;h=dc8311e85de779d991fe34d7f09879006801a998
Great. Does Wikimedia need to do anything to get this deployed, or is this all on your end?
I've filed a minor additional bug with the page footer in Hebrew as bug 30223. Arabic support however seems to still be very problematic. I tried exporting a random page from ar.wikibooks.org: https://secure.wikimedia.org/wikibooks/ar/wiki/%D8%B3%D9%84%D9%81%D9%86%D9%8A_3_%D8%AC%D9%86%D9%8A%D9%87:_%D8%A7%D9%84%D8%A5%D8%AA%D8%B5%D8%A7%D9%84%D8%A7%D8%AA_%D9%88%D8%A7%D9%84%D9%85%D8%AC%D8%AA%D9%85%D8%B9_%D9%81%D9%8A_%D9%85%D8%B5%D8%B1 The text positioning is completely wrong; instead of being right-justified things seem to float somewhere in the middle. This may indicate incorrect handling of font metrics? There also appear generic box characters in a large number of places (I think where a zero-width non-joiner appears, which happens *a lot* on that page). Other than the boxes, shaping looks more or less ok but I can't read Arabic myself so I could easily be missing some additional details. PDF attachment to follow.
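To help reduce a page like this to a minimal test case, the invisible characters can be located programmatically. The following is only a sketch under stated assumptions: it uses Python 3 and the standard `unicodedata` module, the helper name `find_format_chars` and the sample word are hypothetical and not taken from the exported page, and the link between the boxes and the zero-width non-joiner is the guess made in the comment above, not a confirmed cause.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, code point, name) for every invisible format character.

    Format characters have Unicode general category "Cf"; the zero-width
    non-joiner (U+200C) is one of them.
    """
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((index, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical sample: a Persian word containing a ZWNJ at index 2.
SAMPLE = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"

if __name__ == "__main__":
    for hit in find_format_chars(SAMPLE):
        print(hit)
```

If the reported positions line up with the boxes in the PDF, a one-word wikitext example containing a single ZWNJ would probably be enough to reproduce the problem.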
Created attachment 8884 [details] Arabic page export from ar.wikibooks.org See above comment describing rendering bugs.
I can confirm that the directionality on source nodes has been fixed with a forced reload of the Hebrew page. The font fix for <pre> sections I assume hasn't been installed yet on the generator server.
Created attachment 8888 [details] Screenshot of misplaced hebrew nukta vs web rendering Hebrew also has failures -- when combining characters for vowel markings (nukta) are used, they do not combine, but stack up as separate characters along the line.
Created attachment 8889 [details] Fixed screenshot Not sure what broke on the previous image.
(In reply to comment #15) > Great. Does Wikimedia need to do anything to get this deployed, or is this all > on your end? Deployment of the rendering software is done by me. Activation of the Collection extension is done by WMF.
I investigated the 'nukta issue'. Some preliminary remarks:

We are using a Python library which implements the bidi algorithm. This algorithm basically reorders characters from their logical ordering (the order in which they are stored) to their visual ordering. The library uses the fribidi C library. Details can be found at [1].

After the tests I have done, I believe the fribidi library screws up when reordering:

word investigated: חַיְפַא

logical ordering (this is how the string is stored):
ח 1495 1463 י 1497 1456 פ 1508 1463 א 1488

ERRONEOUS transformation by fribidi:
א 1488 פ 1508 1463 י 1497 1456 ח 1495 1463

correct transformation (manually transformed):
א 1488 1463 פ 1508 1456 י 1497 1463 ח 1495

I checked the manual transformation in the PDF and the result is as expected (same as in the browser).

Minimal example in Python: first install pyfribidi:

easy_install pyfribidi

then run python (or ipython):
----
In [35]: import pyfribidi2
In [36]: text = unicode('חַיְפַא', 'utf-8')
In [37]: bidi_trans = lambda t: pyfribidi2.log2vis(t, base_direction=pyfribidi2.RTL)
In [38]: for c in bidi_trans(text): print c, ord(c)
   ....:
א 1488 פ 1508 1463 י 1497 1456 ח 1495 1463
----

To me it looks as if the fribidi library needs to be fixed. Help is welcome ;)

[1] http://pypi.python.org/pypi/pyfribidi/0.10.0
Hmmm.... the fribidi transformation actually looks legit to me; the combining characters should appear after their base characters in the stream, same as in Latin ("e", "combining acute" -> renders like "é") It looks like fribidi deliberately switched *to* keeping the combining characters logically after their base characters some years ago; here's some old threads on the subject: http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html It may be that something specific about how the underlying PDF library handles fonts and combining characters could be incorrectly pushing them to the right of their base characters; or it may simply not support combining characters and so is inserting them visually in the logical order as if they were their own letters...?
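The two orderings under discussion can be reproduced without fribidi. This is a minimal sketch, assuming Python 3 and only the standard `unicodedata` module; the helper names are hypothetical, and a plain reversal is not the bidi algorithm, it merely stands in for it on a single run of pure Hebrew text. Reversing whole clusters (base character plus its combining marks) gives the sequence fribidi produced; reversing individual code points gives the manually transformed sequence.

```python
import unicodedata

# The word from the comment above, written with escapes so that the
# combining marks are explicit: het+patah, yod+sheva, pe+patah, alef.
WORD = "\u05d7\u05b7\u05d9\u05b0\u05e4\u05b7\u05d0"


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch  # non-zero combining class: attach to previous base
        else:
            groups.append(ch)
    return groups


def reverse_clusters(text):
    """Reverse the text while keeping marks logically after their base."""
    return "".join(reversed(clusters(text)))


def reverse_code_points(text):
    """Reverse every code point, so marks end up before their base."""
    return text[::-1]


def code_points(text):
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    print("logical:            ", code_points(WORD))
    print("cluster reversal:   ", code_points(reverse_clusters(WORD)))
    print("code point reversal:", code_points(reverse_code_points(WORD)))
```

Which of the two a renderer needs depends on whether it positions combining marks itself (cluster order) or simply draws glyphs one after another in the order given (code point order), which is the open question about the PDF library here.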
Hi, I checked it in fa.wikipedia and fa.wikibooks. These are some samples:
1-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=e600cdf555951c7b&writer=rl&return_to=%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86%3A%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87+%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C
2-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=8070f397f7a74170&writer=rl&return_to=%D8%A7%D8%AE%D9%84%D8%A7%D9%82+%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
3-http://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=c232c1e277b29a56&writer=rl&return_to=%DA%A9%D8%A7%D9%81%DB%8C%E2%80%8C%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA

They have some bugs:
1-all of them have the LTF problem
2-cases 1 and 3 have a problem with ل.
3-case 3 has a problem with اً in معمولاً
4-in ar.wikibooks.org they don't have center direction! where we can change PDF text direction to LTF
5-on the last page ﻫﺎﻫﺎ ﻭ ﻣﺸﺎﺭﮐﺖﻣﻨﺎﺑﻊ ﻣﻘﺎﻟﻪ is incorrect; it must be مشارکتها و منابع مقالهها
6-the infobox in case 3 has a problem; it must look like http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D9%81%DB%8C_%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA
Excuse me, In last report I had some mistakes LTF => RTF
FriBidi has an option to NOT reorder non-spacing marks. The problem is, for correct rendering of complex text FriBidi is not enough. You can add more heuristics, but it will never be the real thing.
(In reply to comment #24) > Hi, I checked it in fa.wikipeda and fa.wikibooks these are some samples > 1-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=e600cdf555951c7b&writer=rl&return_to=%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86%3A%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87+%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C > 2-http://fa.wikibooks.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=8070f397f7a74170&writer=rl&return_to=%D8%A7%D8%AE%D9%84%D8%A7%D9%82+%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C > 3-http://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%98%D9%87:%DA%A9%D8%AA%D8%A7%D8%A8&bookcmd=download&collection_id=c232c1e277b29a56&writer=rl&return_to=%DA%A9%D8%A7%D9%81%DB%8C%E2%80%8C%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA > they have some bugs > 1-all of them they have LTF problem > 2-case 1,3 have problem with ل. > 3-in case 3 it has problem with اً in معمولاً > 4-in ar.wikibooks.org they don't have center direction! where we can change PDF > text direction to LTF > 5-In the last page ﻫﺎﻫﺎ ﻭ ﻣﺸﺎﺭﮐﺖﻣﻨﺎﺑﻊ ﻣﻘﺎﻟﻪ is incorrect it must be > مشارکتها و منابع مقالهها > 6-Infobox in case 3 has problem it must be like > http://fa.wikipedia.org/wiki/%DA%A9%D8%A7%D9%81%DB%8C_%D8%A7%D8%B3%DA%A9%D8%B1%DB%8C%D9%BE%D8%AA ۱-http://fa.wikibooks.org/wiki/%DA%A9%D9%88%D8%AF%DA%A9%D8%A7%D9%86:%D9%85%D9%86%D8%B8%D9%88%D9%85%D9%87_%D8%AE%D9%88%D8%B1%D8%B4%DB%8C%D8%AF%DB%8C ۲-http://fa.wikibooks.org/wiki/%D8%A7%D8%AE%D9%84%D8%A7%D9%82_%D8%A7%D8%B3%D9%84%D8%A7%D9%85%DB%8C
(In reply to comment #23)
> Hmmm.... the fribidi transformation actually looks legit to me; the combining
> characters should appear after their base characters in the stream, same as in
> Latin ("e", "combining acute" -> renders like "é")
>
> It looks like fribidi deliberately switched *to* keeping the combining
> characters logically after their base characters some years ago; here's some
> old threads on the subject:
>
> http://www.mail-archive.com/linux-utf8@nl.linux.org/msg01710.html
> http://lists.freedesktop.org/archives/fribidi/2002-March/000067.html
>
> It may be that something specific about how the underlying PDF library handles
> fonts and combining characters could be incorrectly pushing them to the right
> of their base characters; or it may simply not support combining characters and
> so is inserting them visually in the logical order as if they were their own
> letters...?

Thanks for the info Brion. If you are right then the error is indeed in the "low level rendering" done by the PDF framework. I started investigating the reportlab source code...

To everyone else:
* could you please always provide minimal examples (shortest possible markup example which exposes some problem)
* clearly describe what you expect vs. what you get
* open separate tickets for separate problems
* keep problems related to different scripts in different tickets. (right now I am focusing on hebrew, after that I'll start with arabic)

Thanks!
Thanks for the info regarding the fribidi library, Behdad!

If I read the relevant part of the Unicode spec correctly, it has to be expected that some software can't deal with reordered non-spacing marks [1]. Therefore it seems valid to not reorder them. I made the necessary change in pyfribidi [2] and mwlib [3].

The issue Brion raised originally should be fixed. I tested the following page for correctness: [4]

I am closing this ticket now. Please open specific tickets for other issues.

[1] section 5.13 in http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
[2] http://pypi.python.org/pypi/pyfribidi
[3] http://code.pediapress.com/git/?p=mwlib.ext/.git;a=commit;h=e4bba86023c78ac800362dfe59edfecb2ff3adbb
[4] http://he.wikipedia.org/w/index.php?title=%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:Volker.haas&oldid=10980308
Fixed? I tested it on a random page in the Persian Wikipedia: http://fa.wikipedia.org/wiki/%D9%BE%D9%84%D8%A7%D8%B3%D9%85%D8%A7%DB%8C_%DA%A9%D9%88%D8%A7%D8%B1%DA%A9_%DA%AF%D9%84%D9%88%D8%A6%D9%88%D9%86 The output is still LTR and has some problem with the character ل
(In reply to comment #30) > fixed? > I tested it on a random page in Persian wikipedia > http://fa.wikipedia.org/wiki/%D9%BE%D9%84%D8%A7%D8%B3%D9%85%D8%A7%DB%8C_%DA%A9%D9%88%D8%A7%D8%B1%DA%A9_%DA%AF%D9%84%D9%88%D8%A6%D9%88%D9%86 > > the output is still LTR and have some problem with character ل the problem is with لا not ل
Another bug is the very large file size: the PDF version of the article "تهران" ("Tehran") is 19 MB! Is there a way to make the PDFs smaller? Almost all readers of the Persian Wikipedia are Iranian, and Iranian law does not allow ordinary users an Internet connection faster than 128 kb/s :(
(In reply to comment #32) (In reply to comment #30) Please open a new issue for this. Comment 32 is not related to this issue, and comment 30 is an issue that is a lot smaller than what we originally started off with.
(In reply to comment #33)
> (In reply to comment #32)
> (In reply to comment #30)
> Please open a new issue for this. Comment 32 is not related to this issue, and
> comment 30 is an issue that is a lot smaller than what we originally started
> off with.

I filed bug 23893, which was marked as a duplicate of this bug. Please reopen this bug or bug 23893.
(In reply to comment #33) > (In reply to comment #32) > (In reply to comment #30) > Please open a new issue for this. Comment 32 is not related to this issue, and > comment 30 is an issue that is a lot smaller than what we originally started > off with. I opened [https://bugzilla.wikimedia.org/show_bug.cgi?id=30326 bug 30326] *** This bug has been marked as a duplicate of bug 30326 ***