Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T35430, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 33430 - "Create a book" and "Download as PDF" don't wrap Chinese or Japanese lines
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: Collection (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: High major with 6 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: accessibility
Depends on:
Blocks:
Reported: 2011-12-30 06:34 UTC by Ziyuan Yao
Modified: 2012-01-19 23:50 UTC (History)
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
test script for linebreak check for mixed cjk and non-cjk text (1.17 KB, text/x-python)
2012-01-10 14:46 UTC, Volker Haas

Description Ziyuan Yao 2011-12-30 06:34:19 UTC
PROBLEM:

The Chinese and Japanese languages are the only two languages in the world that don't use spaces to separate words. Instead, they just stick all words together. So it's common to see a whole Chinese/Japanese paragraph without any spaces in it.

Because of this peculiarity, MediaWiki's "Create a book" and "Download as PDF" features don't wrap Chinese/Japanese lines in the generated PDF file, so a whole paragraph is rendered on a single line and truncated when the line runs off the right edge of the page.

HOW TO REPRODUCE THE PROBLEM:

1. Go to http://www.mediawiki.org/wiki/MediaWiki/zh-hans;
2. Open the page's "Download as PDF" link in a new browser tab;
3. You will get a PDF file that doesn't wrap a long Chinese line but instead truncates it at the right edge of the page:

"MediaWiki是一个最初用于维基百科的自由wiki程序包,用PHP语言写成。现在,非营利的维基媒体基金会的其他计划、许多其他wiki网站以及本网站(MediaWiki主页)都在使用这个程序包。"

HOW TO FIX IT

1. As a quick workaround, consider inserting special, invisible Unicode control characters into a long Chinese/Japanese line that will cause line wraps;
2. As a quick-and-dirty rule, if you can't wrap a line at a space, wrap it forcibly at the page's right margin;
3. In principle, every Chinese/Japanese character can be considered a "word" and therefore a line wrap is allowable before or after any Chinese/Japanese character;
4. Experts may have a better solution than the above.
Comment 1 Benjamin Chen 2011-12-30 07:46:54 UTC
(In reply to comment #0)
> The Chinese and Japanese languages are the only two languages in the world that
> don't use spaces to separate words. 

Not true, many other languages don't use spaces either :)

IIRC, the wrapping of Chinese paragraphs was fine around September because I used it to generate several files. Probably some changes in the config or extension itself caused this problem.
Comment 2 Ziyuan Yao 2011-12-30 07:58:28 UTC
Benjamin Chen: AFAIK, only Chinese and Japanese apply. Korean uses square-like characters but it does have spaces between words.
Comment 3 Ziyuan Yao 2011-12-30 11:47:20 UTC
Enabling the Chinese Wikipedia to provide ebook creation properly can help spread Wikipedia knowledge in China freely.
Comment 4 Christoph Kepper 2011-12-30 20:30:18 UTC
Fixing this bug will probably be only a partial success. About 18 months ago we (PediaPress) were experimenting a little bit with Japanese, but we encountered numerous problems (text direction, layout rules, lack of support both from the community and our tools) that scared us off pursuing this further. IMO it will take a lot of determination and perseverance, as well as ongoing support from native speakers/developers, to create decent ebooks.
Comment 5 Ziyuan Yao 2011-12-30 21:08:24 UTC
(In reply to comment #4)
> Fixing this bug will probably be only a partial success. About 18 month ago we
> (PediaPress) were experimenting a little bit with Japanese, but we encountered
> numerous problems (text-direction, layout rules, lack of support both from the
> community and our tools) that scared us off pursuing this further. Imo it will
> take a lot of determination and perseverance as well as ongoing support from
> native speakers/developers to create decent ebooks.

First, fixing this bug alone will improve the usefulness of Chinese/Japanese ebooks from 1% to 99.9%.

Second, I suggest MediaWiki reuse a mature HTML rendering engine (e.g. WebKit) or text rendering engine (e.g. Pango) instead of reinventing the wheel.

Third, MediaWiki can for now ignore complex formatting features such as "text-direction, layout rules" and just focus on drawing plain text lines and images correctly. "Keep it simple, stupid" for the first version.
Comment 6 Ziyuan Yao 2011-12-30 21:31:07 UTC
I have played around with some Chinese pages on mediawiki.org, and so far the only problem I have seen is "no line wrapping". I don't see the problems you mentioned, like "text-direction"; note that Chinese and Japanese use left-to-right text direction just like English. Text direction is only a problem for Middle Eastern languages like Arabic and Hebrew.

I see MediaWiki can already draw the basic stuff right: text, images and tables; the exception is line wrapping for Chinese/Japanese.

Here is a simple rule set for line wrapping:

IF there is a whitespace near the page's right margin THEN
        break the line at that whitespace;
ELSE IF there is a Chinese/Japanese character near the page's right margin THEN
        break before or after that character;
ELSE
        break forcibly at the page's right margin (and optionally draw a "soft return" character to indicate this forced break).
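
For illustration, the rule set could be coded roughly like this (a sketch with hypothetical helper names, not code from the Collection extension; widths are simplified to character counts):

import unicodedata

def is_cjk(ch):
    # Rough check: treat ideographs and kana as CJK, via Unicode names.
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED", "HIRAGANA", "KATAKANA"))

def wrap_line(text, width):
    # Greedy wrapper: prefer whitespace, then a CJK boundary, else force.
    lines = []
    while len(text) > width:
        cut = -1
        for i in range(width, 0, -1):           # scan back from the margin
            if text[i].isspace():               # rule 1: break at whitespace
                cut = i
                break
            if is_cjk(text[i]) or is_cjk(text[i - 1]):
                cut = i                         # rule 2: break at a CJK boundary
                break
        if cut <= 0:
            cut = width                         # rule 3: forced break at the margin
        lines.append(text[:cut])
        text = text[cut:].lstrip()
    lines.append(text)
    return lines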
Comment 7 Ziyuan Yao 2011-12-30 21:42:48 UTC
Although Chinese and Japanese don't use spaces to separate words, you can think of an "invisible space" existing before and after every Chinese/Japanese character, and this "invisible space" is always a valid line-wrapping point, just like a normal space.

There is actually a Unicode control character U+200B "zero-width space" (http://en.wikipedia.org/wiki/Zero-width_space) for this "invisible space" concept.

With U+200B in mind, we can also simplify our line-wrapping rule set as:

add a U+200B after every Chinese/Japanese character;
IF there is a whitespace (including U+200B) near the page's right margin THEN
        break the line at that whitespace;
ELSE
        break forcibly at the page's right margin (and optionally draw a "soft return" character to indicate this forced break).
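
In code, this simplified rule set is just a substitution before handing the text to any ordinary whitespace-based wrapper (a sketch; the character ranges are a rough approximation of CJK, not an exhaustive list):

import re

CJK_RE = re.compile(u"([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff])")

def add_zwsp(text):
    # Insert U+200B after every CJK character; a whitespace-based
    # wrapper can then break CJK runs at these points.
    return CJK_RE.sub(u"\\1\u200b", text)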
Comment 8 Ziyuan Yao 2011-12-30 21:46:06 UTC
Either of the above two rule sets can solve the line wrapping problem, although in the long run I recommend using a mature HTML-to-PDF library instead of reinventing the wheel.
Comment 9 Ziyuan Yao 2011-12-31 03:46:09 UTC
I just did a little research on what FOSS PDF libraries are available. Here's a good list:

http://en.wikipedia.org/wiki/List_of_PDF_software#Development_libraries

TCPDF (http://en.wikipedia.org/wiki/TCPDF) seems to be a good candidate.
Comment 10 Ziyuan Yao 2012-01-02 11:10:35 UTC
It seems that MediaWiki's "Collection" extension currently uses the "ReportLab" PDF library to render PDF files (http://www.mediawiki.org/wiki/Extension:PDF_Writer#Technical).

ReportLab is one of the PDF libraries listed in the above Wikipedia reference.

Maybe we should persuade ReportLab to fix this problem first.
Comment 11 Ziyuan Yao 2012-01-02 11:19:32 UTC
On ReportLab's "Samples" page (http://www.reportlab.com/software/documentation/rml-samples/), there is a "test_031_japanese.pdf" (http://www.reportlab.com/examples/rml/test/test_031_japanese.pdf) which shows that ReportLab can do Japanese text wrapping perfectly, while MediaWiki's "Download as PDF" can't wrap a long Japanese line at all. Why is that? I'm also asking this on ReportLab's mailing list (http://two.pairlist.net/mailman/listinfo/reportlab-users).
Comment 12 Ziyuan Yao 2012-01-02 12:05:17 UTC
Good news, everybody! The solution to this problem has been given by ReportLab's personnel, as follows:

On 2 January 2012 11:33, Yao Ziyuan <yaoziyuan@gmail.com> wrote:
> So now I'm confused. Is it MediaWiki's or ReportLab's fault for the
> line wrapping problem described in the above bug report
> (https://bugzilla.wikimedia.org/show_bug.cgi?id=33430)?
>

MediaWiki (actually PediaPress.de) decided to use our library a few years ago;
we did some work to improve inline images to support equations, but
they did not mention Asian line wrapping at the time and I did not know
about this limitation.

I guess they are simply not using our wordwrap=CJK option.  Our library
needs to be told "this is Japanese/Chinese, use a different algorithm";
it does not auto-detect based on the encoding.

Also, until some time last year, we could not properly handle mixed text
in the same sentence.  We have improved this now.


- Andy
Comment 13 Volker Haas 2012-01-10 14:45:01 UTC
We need to distinguish two different cases:

1) rendering a PDF from the Chinese/Japanese Wikipedia.

2) rendering a PDF from any other Wikipedia which has some Chinese/Japanese text embedded inside the article.

The example you (Ziyuan) give at the very top is case 2). Your last post suggests that this case can be handled correctly with a recent reportlab version.

I believe this is not true. I checked out the latest reportlab version from their subversion repository and made a little test script (I'll attach it). The result seems to indicate that mixed cjk and non-cjk text can't be rendered correctly: the line breaks are correct either for the cjk or for the non-cjk text. (Line-wrapping behaviour can be toggled by enabling or disabling the CJK wordWrapping.)

(I didn't bother to use a proper font for cjk text - but that should not matter, except that all cjk letters are rendered as black boxes.)

Case 1) is a different matter: this should basically work. If not please provide a minimal example / article URL.
Comment 14 Volker Haas 2012-01-10 14:46:35 UTC
Created attachment 9833 [details]
test script for linebreak check for mixed cjk and non-cjk text
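
A minimal script in the spirit of this attachment (a sketch, not the actual attachment; the output file name is made up, and without a registered cjk font the built-in Helvetica renders cjk letters as boxes, though the wrapping behaviour is still visible):

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, SimpleDocTemplate

mixed = u"MediaWiki是一个最初用于维基百科的自由wiki程序包，用PHP语言写成。" * 4

western = ParagraphStyle("western")            # default Western word wrap
cjk = ParagraphStyle("cjk", wordWrap="CJK")    # reportlab's CJK algorithm

doc = SimpleDocTemplate("linebreak_test.pdf", pagesize=A4)
doc.build([Paragraph(mixed, western), Paragraph(mixed, cjk)])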
Comment 15 Ziyuan Yao 2012-01-10 15:36:52 UTC
First, I don't have MediaWiki installed on my computer, so I can't run your test script.

If ReportLab doesn't support line wrapping for mixed cjk and non-cjk text correctly, I suggest we do the following:

Step 1: For every CJK character in the text, insert the Unicode control character U+200B "zero-width space" after it. This should cause a line wrap after a CJK character when a line is full.

Step 2: Disable CJK wordWrapping. Use Western-style word wrapping.

Step 3: Now you should see a long CJK string wrapped at the end of a line.
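
Putting the steps together (add_zwsp is the sketch from comment 7; whether reportlab actually honours U+200B as a break point is exactly what is tested below):

from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph

western = ParagraphStyle("western")   # Step 2: default Western word wrap
p1 = Paragraph(add_zwsp(u"MediaWiki是一个自由wiki程序包。"), western)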
Comment 16 Ziyuan Yao 2012-01-10 15:44:17 UTC
The line-wrapping rule for CJK/non-CJK mixed text is actually very simple: You should either wrap the line at a whitespace (as in a Western text), or after a CJK character.

So, if possible, use the above rule to pre-wrap a text before feeding it to ReportLab.
Comment 17 Ziyuan Yao 2012-01-10 15:51:36 UTC
OK, now I have installed python-reportlab on my Fedora 16 machine and can run your test script. I understand your problem. I'll test whether I can insert U+200B after every CJK character. If U+200B fails, we can insert a normal space after every CJK character instead. This will definitely wrap a line after a CJK character, but with the drawback that all CJK characters will be separated by visible spaces (instead of sticking together).
Comment 18 Ziyuan Yao 2012-01-10 16:05:04 UTC
OK. I tried. U+200B doesn't work with ReportLab:

p1 = Paragraph(u"MediaWiki\u200B是\u200B一\u200B个\u200B最\u200B初\u200B用\u200B于\u200B维\u200B基\u200B百\u200B科\u200B的\u200B自\u200B由\u200Bwiki\u200B程\u200B序\u200B包\u200B,\u200B用\u200BPHP\u200B语\u200B言\u200B写\u200B成\u200B。\u200B现\u200B在\u200B,\u200B非\u200B营\u200B利\u200B的\u200B维\u200B基\u200B媒\u200B体\u200B基\u200B金\u200B会\u200B的\u200B其\u200B他\u200B计\u200B划\u200B、\u200B许\u200B多\u200B其\u200B他\u200Bwiki\u200B网\u200B站\u200B以\u200B及\u200B本\u200B网\u200B站\u200B(\u200BMediaWiki\u200B主\u200B页\u200B)\u200B都\u200B在\u200B使\u200B用\u200B这\u200B个\u200B程\u200B序\u200B包\u200B。", s)

But normal spaces do:

p1 = Paragraph(u"MediaWiki 是 一 个 最 初 用 于 维 基 百 科 的 自 由 wiki 程 序 包 , 用 PHP 语 言 写 成 。 现 在 , 非 营 利 的 维 基 媒 体 基 金 会 的 其 他 计 划 、 许 多 其 他 wiki 网 站 以 及 本 网 站 ( MediaWiki 主 页 ) 都 在 使 用 这 个 程 序 包 。", s)
Comment 19 Ziyuan Yao 2012-01-10 16:08:13 UTC
I'll write to ReportLab's mailing list, suggesting that they create a new wordWrap option "mixed", so that ReportLab can directly support wrapping mixed text.
Comment 20 Ziyuan Yao 2012-01-11 03:34:56 UTC
ReportLab says working on this problem is not their priority. So I'm trying to fix it personally in their source code.

I found (and they told me) that their source code is actually very old (2006), from before Unicode went mainstream, which is why they don't support mixed-text wrapping well.

So, is it hard for PediaPress to switch to a more "modern" PDF library, such as TCPDF, which people say has good Unicode support?
Comment 21 Ziyuan Yao 2012-01-11 04:26:15 UTC
Cite http://en.wikipedia.org/wiki/TCPDF :

"TCPDF is currently the only PHP-based library that includes complete support for UTF-8 Unicode and right-to-left languages, including the bidirectional algorithm.[1]"
Comment 22 Ziyuan Yao 2012-01-11 06:11:47 UTC
OK, Volker Haas, I have come up with a simple way to fix all this:

We will first determine whether a wiki page is "mostly Western" (then we'll use wordwrap=Western) or not (then we'll use wordwrap=CJK).

The definition of "mostly Western" can be: the longest consecutive CJK string in the page is shorter than 10 characters.
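
One way to express this heuristic (a sketch; the CJK ranges are a rough approximation, and the threshold of 10 is the value proposed above):

import re

CJK_RUN = re.compile(u"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]+")

def is_mostly_western(text, threshold=10):
    # "Mostly Western": no run of `threshold` or more consecutive CJK chars.
    runs = CJK_RUN.findall(text)
    longest = max(map(len, runs)) if runs else 0
    return longest < threshold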
Comment 23 Volker Haas 2012-01-11 09:19:55 UTC
As you also found out, reportlab does not support zero-width space characters. I needed that for other purposes in the past as well. The best solution/hack I could come up with was to use a space and set its font size to the smallest possible value.

I implemented the following:

In all non-cjk wikis the text is checked for cjk characters. If cjk characters are found, fake zero-width-space chars are inserted. I tested this for a couple of articles and the strategy seems to make sense.
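
In reportlab terms the trick looks roughly like this (a sketch, not the actual production code): a real space shrunk to the smallest font size via reportlab's inline <font> markup is nearly invisible but still gives the wrapper a legal break point.

import re

CJK_RE = re.compile(u"([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff])")

def insert_fake_zwsp(text):
    # A tiny space after every CJK character acts like U+200B.
    return CJK_RE.sub(u'\\1<font size="1"> </font>', text)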

As for your suggestion to use another PDF framework instead of reportlab: doing this is a huge amount of work, so it is not an option at the moment.

The render servers will be updated in the next 24 hours. I'll close this as fixed.
Comment 24 Ziyuan Yao 2012-01-11 10:31:56 UTC
Great news. Can you give me a PDF that demonstrates your smallest-size spaces?
Comment 25 Ziyuan Yao 2012-01-11 10:36:18 UTC
I like your solution for non-cjk wikis (using tiny spaces). But you didn't mention what to do with cjk wikis. I assume you will use wordwrap=CJK for them, right?
Comment 26 Ziyuan Yao 2012-01-11 10:40:57 UTC
I just tried out your "tiniest space" concept in LibreOffice. Perfect! Virtually invisible spaces! You're a genius. No need to show me the PDF now.
Comment 27 Ziyuan Yao 2012-01-11 10:46:51 UTC
One more question: your tiny-space idea is a universal solution that could also apply to cjk wikis, because a cjk wiki can also contain Western words (which are better wrapped at spaces).

What is the reason you don't apply it to cjk wikis?
Comment 28 Volker Haas 2012-01-11 13:03:00 UTC
For cjk wikis the built-in cjk word wrapping of reportlab is used. This probably breaks non-cjk text that is embedded... But I am pretty sure that at least for Japanese the algorithms to break lines are more sophisticated than just splitting after any letter. I am hoping that the built-in reportlab word-wrapping function does that. But I am not sure...
Comment 29 Ziyuan Yao 2012-01-11 14:07:10 UTC
First, using ReportLab's cjk wordwrap algorithm will break English words into two lines. This is well demonstrated by your own test script.

Second, although ReportLab's cjk wordwrap algorithm can break Japanese sentences more sophisticatedly, this benefit is very small, while the drawback of cutting Western words in half is very significant.

In Chinese, some full-width punctuation marks such as ，。；” generally don't appear at the beginning of a line either, but as a Chinese speaker I consider this an expendable rule if we can keep Western words uncut.
Comment 30 Ziyuan Yao 2012-01-11 14:07:48 UTC
s/also/although
Comment 31 Ziyuan Yao 2012-01-11 14:22:29 UTC
Here are two Wikipedia links that talk about the so-called CJK wordwrap rules:

http://en.wikipedia.org/wiki/Word_wrap#Word_wrapping_in_text_containing_Chinese.2C_Japanese.2C_and_Korean

http://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_language#Line_breaking_rules_in_Japanese_text_.28Kinsoku_Shori.29

I have reviewed them all. Not a single one of them is as serious as "don't break Western words into two lines". They can be ignored altogether; most text editors and viewers don't obey these rules anyway.
Comment 32 Ziyuan Yao 2012-01-12 04:32:58 UTC
Found a problem with the tiny space approach: Chinese characters don't take up the full space of a line; there is still much space left on the right side of each line. For example: try http://www.mediawiki.org/wiki/MediaWiki/zh-hans

I guess this is caused by how ReportLab counts the length of what's already been put on a line: after placing each word, it adds that word's length plus a normal space's width. But now there are actually two kinds of space width: normal width (as between two English words) and tiny width (as between two Chinese characters). It seems ReportLab treats all spaces as having the normal width, and therefore starts a new line prematurely.

Can this be fixed? Can you let ReportLab count tiny spaces as tiny spaces, not normal spaces?
Comment 33 Ziyuan Yao 2012-01-12 05:07:13 UTC
If we can't easily modify ReportLab to distinguish tiny space widths from normal space widths, I'd rather see this arrangement:

For non-cjk wikis, insert a normal-sized space after each CJK character, and then use wordwrap=Western.

For cjk wikis, use wordwrap=cjk.
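
As code, this arrangement is a small dispatch (a sketch; CJK_RE as in the earlier sketches, and the visible normal spaces are the drawback already noted in comment 17):

def prepare(text, is_cjk_wiki):
    if is_cjk_wiki:
        return text, "CJK"                      # use wordWrap='CJK'
    return CJK_RE.sub(u"\\1 ", text), None      # normal spaces + Western wrap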
Comment 34 Volker Haas 2012-01-12 13:04:48 UTC
I just found out that the latest reportlab version seems to handle non-cjk text inside cjk text (with wordWrap='CJK') correctly. Installation of the newest reportlab version had failed earlier, and I didn't realize that.
--> Merging the latest reportlab version should therefore solve this problem. I'll see if I can do this...

One problem with non-cjk inside cjk remains: the text isn't justified correctly anymore, but I'd just ignore that...
Comment 35 Ziyuan Yao 2012-01-12 13:14:08 UTC
Great to hear that. Eager to see a sample PDF of your latest finding.
Comment 36 Ziyuan Yao 2012-01-12 16:40:16 UTC
I confirm. I downloaded and installed the latest snapshot reportlab-20120111203740 successfully and ran your test script. It does wrap both CJK and Western text correctly.
Comment 37 Volker Haas 2012-01-13 14:08:02 UTC
I updated to the latest reportlab version. The problem with mixing cjk and non-cjk text should be fixed. The render servers will be updated sometime next week.
Comment 38 Ziyuan Yao 2012-01-13 14:25:05 UTC
Volker: Appreciate your hard work!
Comment 39 Ziyuan Yao 2012-01-19 23:50:38 UTC
Are the render servers updated yet? I still see Chinese lines not taking up a page's full width (there's a lot of space left on each Chinese line's right side).
