Last modified: 2009-04-08 01:46:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19794, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17794 - Keep simplified Chinese characters out of zh-tw please
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: 1.15.x
Hardware: All All
Importance: Normal normal
Assigned To: Nobody
Reported: 2009-03-05 10:01 UTC by Dan Jacobson
Modified: 2009-04-08 01:46 UTC

Description Dan Jacobson 2009-03-05 10:01:49 UTC
Hello. There are a few non-Taiwan variant characters that have crept
into the Taiwan message files.

This is a different bug than just plain translating.

It involves adding another layer of caution to catch things even a good translator would miss.

The translations are fine; it is just that a final test must be added
to catch very similar-looking wrong characters.

They are not Unicode variants, no, but instead characters that have
never been seen before in Taiwan, as shown by the fact they don't make
the round trip to big5 and then back to Unicode. They are typos for
the common Taiwan version. For example, they are not simplified
Chinese 钩 present in GB2312, nor the common Taiwan version present in
big5, 鉤, but instead a third variant: 鈎.
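
The round trip described above can be sketched in a few lines of Python.
This is only an illustration, assuming Python's built-in "big5" codec
agrees with iconv's big5 table for these characters; the function name
survives_big5 is made up for this example.

```python
def survives_big5(ch: str) -> bool:
    """Return True if ch can be encoded to Big5 and decoded back unchanged."""
    try:
        return ch.encode("big5").decode("big5") == ch
    except UnicodeEncodeError:
        # The character has no Big5 code point, so it cannot make the trip.
        return False


# The three variants of "hook" mentioned above.
for ch in "钩鉤鈎":
    print(ch, "ok" if survives_big5(ch) else "not in big5")
```

Only a character that survives the round trip would be accepted by a
check like this.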

What I am hoping you will do is add a test to make sure such
characters don't creep in again.

The test should say "**Non Taiwan characters found in file ...;
Please pick the Taiwan versions (e.g., replace 鈎 with 鉤) before
this version of MediaWiki can be released**" die(1);

Here is the makefile I used:
d=/var/lib/mediawiki/languages/messages
v:$d/MessagesZh_hant.twdiff $d/MessagesZh_tw.twdiff
%.twdiff:%.php
	iconv -ct big5 $?|iconv -f big5|diff -U0 $? -|sed /^@@/d

Note that it is crude, in that it also catches superscript numbers
etc., though all we want to be on the lookout for is the Chinese
characters.

And here are the results. You will notice the missing characters
are the ones that didn't make the round trip to big5 and back.

No, I'm not just asking you to correct those characters and forget this
bug.

I'm saying that a test needs to be added to always catch such things
before each MediaWiki release can proceed.

Also consider extending the test to MessagesZh_classical.php etc.
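
A release-time check along these lines might look like the following
Python sketch. It is not MediaWiki code: the function names are
hypothetical, the Han filter is an approximation based on Unicode
character names, and Python's "big5" codec is assumed to be close enough
to iconv's table. It reports only Chinese characters, so superscript
numbers and the like are ignored.

```python
import sys
import unicodedata

# Name prefixes used here as a rough stand-in for Perl's \p{Han} (an assumption).
HAN_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")


def non_taiwan_chars(text: str) -> list[str]:
    """Return the sorted Han characters in text that have no Big5 encoding."""
    bad = set()
    for ch in set(text):
        if not unicodedata.name(ch, "").startswith(HAN_PREFIXES):
            continue  # skip superscripts, punctuation, other scripts
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


def check_files(paths: list[str]) -> int:
    """Print a warning per offending file; return 1 if any file fails, else 0."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            bad = non_taiwan_chars(fh.read())
        if bad:
            print(f"**Non Taiwan characters found in file {path}: "
                  f"{' '.join(bad)}**", file=sys.stderr)
            status = 1
    return status

# In a release script one could end with: sys.exit(check_files(sys.argv[1:]))
```

The list of files is left to the caller, so further message files such
as MessagesZh_classical.php could be passed in the same way.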

(Lastly, this is not a diff to be applied to anything!)

make v
iconv -ct big5 /var/lib/mediawiki/languages/messages/MessagesZh_hant.php|iconv -f big5|diff -U0 /var/lib/mediawiki/languages/messages/MessagesZh_hant.php -|sed /^@@/d
--- /var/lib/mediawiki/languages/messages/MessagesZh_hant.php	2009-03-01 23:31:04.000000000 +0800
+++ -	2009-03-05 08:07:19.263443332 +0800
-/** Traditional Chinese (中文(繁體))
+/** Traditional Chinese (中文(繁體))
-'usercssjsyoucanpreview'           => "'''提示:''' 在保存前請用'顯示預覧'按鈕來測試您新的 CSS/JS 。",
+'usercssjsyoucanpreview'           => "'''提示:''' 在保存前請用'顯示預'按鈕來測試您新的 CSS/JS 。",
-'edit-hook-aborted'                => '編輯被鈎取消。
+'edit-hook-aborted'                => '編輯被取消。
-'post-expand-template-argument-category'  => '包含着略過模板參數的頁面',
+'post-expand-template-argument-category'  => '包含略過模板參數的頁面',
-'timezonetext'              => '¹輸入當地時間與伺服器時間(UTC)的時差。',
+'timezonetext'              => '輸入當地時間與伺服器時間(UTC)的時差。',
-'timezoneoffset'            => '時差¹:',
+'timezoneoffset'            => '時差:',
-Template:消歧义
-Template:消除歧义
+Template:消歧
+Template:消除歧
-'protect-cascadeon'           => '以下的{{PLURAL:$1|一個|多個}}頁面包含着本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
+'protect-cascadeon'           => '以下的{{PLURAL:$1|一個|多個}}頁面包含本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
-'trackbackremove'   => '([$1删除])',
+'trackbackremove'   => '([$1除])',
-'version-parserhooks'              => '語法鈎',
+'version-parserhooks'              => '語法',
-'version-hooks'                    => '鈎',
+'version-hooks'                    => '',
-'version-parser-function-hooks'    => '語法函數鈎',
+'version-parser-function-hooks'    => '語法函數',
-'version-hook-name'                => '鈎名',
+'version-hook-name'                => '名',
iconv -ct big5 /var/lib/mediawiki/languages/messages/MessagesZh_tw.php|iconv -f big5|diff -U0 /var/lib/mediawiki/languages/messages/MessagesZh_tw.php -|sed /^@@/d
--- /var/lib/mediawiki/languages/messages/MessagesZh_tw.php	2009-03-01 06:04:42.000000000 +0800
+++ -	2009-03-05 08:07:19.292322989 +0800
-/** Chinese (Taiwan) (中文(台灣))
+/** Chinese (Taiwan) (中文(台灣))
- * @author לערי ריינהארט
+ * @author  
-'usercssjsyoucanpreview'    => "'''提示:''' 在保存前請用'顯示預覧'按鈕來測試您新的 CSS/JS 。",
+'usercssjsyoucanpreview'    => "'''提示:''' 在保存前請用'顯示預'按鈕來測試您新的 CSS/JS 。",
-'timezoneoffset'           => '時差¹',
+'timezoneoffset'           => '時差',
-Template:消歧义
-Template:消除歧义
+Template:消歧
+Template:消除歧
-'protect-cascadeon'           => '以下的{{PLURAL:$1|一個|多個}}頁面包含着本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
+'protect-cascadeon'           => '以下的{{PLURAL:$1|一個|多個}}頁面包含本頁面的同時,啟動了連鎖保護,因此本頁面目前也被保護,未能編輯。您可以設定本頁面的保護級別,但這並不會對連鎖保護有所影響。',
-'trackbackremove'   => '([$1删除])',
+'trackbackremove'   => '([$1除])',
Comment 1 Niklas Laxström 2009-03-05 10:16:44 UTC
There needs to be a way that does not produce lots of false positives. Otherwise, PHP's iconv function could be used.
Comment 2 Dan Jacobson 2009-03-05 10:57:25 UTC
OK, your wish is my command!

$ make v
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /var/lib/mediawiki/languages/messages/MessagesZh_hant.php|sort -u > tmpA
iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs
义 删 着 覧 鈎
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /var/lib/mediawiki/languages/messages/MessagesZh_tw.php|sort -u > tmpA
iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs
义 删 着 覧

d=/var/lib/mediawiki/languages/messages
v:$d/MessagesZh_hant.twdiff $d/MessagesZh_tw.twdiff
%.twdiff:%.php
	perl -C -plwe 's/\P{Han}//g;s/./$$&\n/g' $?|sort -u > tmpA
	iconv -ct big5 tmpA|iconv -f big5|sort -u|comm -31 - tmpA|xargs
Comment 3 Dan Jacobson 2009-03-07 22:11:21 UTC
I'm sure all the items in my Makefile could be done with PHP 'preg'
stuff and arrays.

Also you might want to add a normalization check if you don't have one
already.

Here's an example of normalization. Note it wouldn't catch the
characters mentioned earlier in this bug. Also you don't want to
convert blindly as here, but make a diff to catch them...

#!/usr/bin/perl
# use best Unicodes, at least so iconv -f utf8 -t big5
# won't hit any illegal chars.
# Copyright       : http://www.fsf.org/copyleft/gpl.html
# Author          : Dan Jacobson http://jidanni.org/
# Created On      : 2006
# Last Modified On: Wed Nov  5 08:54:53 2008
# Update Count    : 24
use strict;
use warnings FATAL => 'all';
use open qw/:std :encoding(utf8)/;
use Unicode::Normalize q(decompose);
while(<>){
    $_=decompose($_);
    s/没/沒/g;
    s/━/-/g;
    s/«/《/g; #ㄍ
    s/ / /g;
    print;
}
# Local Variables:
# compile-command: "echo 老老參參歷歷|normalize"
# End:
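
To report what normalization would change instead of converting blindly,
something like this Python sketch could be used. Note the swap: it uses
NFC from the standard unicodedata module rather than the script's
decompose(), so that precomposed Latin letters are left alone while CJK
compatibility ideographs are still mapped to their unified forms. The
function name is made up, and the hand-written substitutions from the
script above are not included.

```python
import unicodedata


def normalization_report(lines):
    """Yield (line_number, original, normalized) for each line NFC would change."""
    for number, line in enumerate(lines, start=1):
        normalized = unicodedata.normalize("NFC", line)
        if normalized != line:
            yield number, line, normalized


# U+F934 is a compatibility form of U+8001; only the first line is reported.
sample = ["\uf934\u8001", "plain ASCII line"]
for number, before, after in normalization_report(sample):
    print(f"line {number}: {before!r} -> {after!r}")
```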
 
Comment 4 Dan Jacobson 2009-03-08 12:10:24 UTC
Bug #17859 asks for removal of the current crop of Simplified (and other
non-Taiwan) Chinese characters that have accidentally entered the
Traditional translations.

That bug is in addition to this bug. This bug instead asks for permanent tests to be put in place to stop such characters from creeping in in the future.
Comment 5 Niklas Laxström 2009-03-08 12:25:55 UTC
Also, the test is useless if there is no translator to fix it.
Comment 6 Siebrand Mazeland 2009-03-08 13:12:46 UTC
I suggest you start contributing to the zh-tw localisation on http://translatewiki.net instead of ordering others to fix alleged issues. There is a fallback chain which substitutes non-localised messages. Whenever the message is localised, it will be used instead of the message from the fallback.

There are plenty of possibilities to talk to the currently active zh translators.

Closed as WONTFIX.
Comment 7 Dan Jacobson 2009-03-11 00:17:51 UTC
Never mind. I'll just run my tests at home and submit the diffs.
Comment 8 Dan Jacobson 2009-04-01 00:13:40 UTC
Thank you very much for fixing my bug.
I sent patches to the translatewiki staff and they were applied.

At one point all the simplified characters had been cleaned up.

However now some are creeping back in:

GET http://radioscanningtw.jidanni.org/index.php?title=Special:Allmessages\&uselang=zh-tw|
	w3m -dump -T text/html>/tmp/allmess
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /tmp/allmess|sort -u > /tmp/tmpA
iconv -ct big5 /tmp/tmpA|iconv -f big5|sort -u|comm -31 - /tmp/tmpA|xargs
义 着 鈎

This is caused by "holes in what a zh-tw site sees, where the
underlying simplified characters shine through."

I'm not sure if the above is an exact test. Please tell me a better
test if there is one.

This test finds much more:

GET http://translatewiki.net/wiki/Special:AllMessages?uselang=zh-tw|\
	w3m -dump -T text/html>/tmp/allmess
perl -C -plwe 's/\P{Han}//g;s/./$&\n/g' /tmp/allmess|sort -u > /tmp/tmpA
iconv -ct big5 /tmp/tmpA|iconv -f big5|sort -u|comm -31 - /tmp/tmpA|xargs
个 为 义 删 动 变 号 嘅 对 将 录 户 护 无 时 显 来 标 欢 没 着 码 称 组 讨 记 论 评 译 语 说 跃 输 过 这 选 鈎 页 项

Anyway, keep simplified Chinese characters out of zh-tw please.
I don't know if Hong Kong likes them, but they are highly inappropriate for Taiwan.

If there is a better way to report this, tell me. I have already sent many patches to translatewiki and they were applied... but then they fell back out.
Comment 9 Niklas Laxström 2009-04-01 11:59:02 UTC
Special:Allmessages also shows messages from fallback language.
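
A toy model of that fallback behaviour, in Python. Everything here is
hypothetical: the chain order, the tables and the lookup function are
illustrations only, not MediaWiki's actual data or API. It shows how a
message missing from zh-tw is served from a fallback language until
zh-tw gets its own translation.

```python
# Hypothetical fallback chain, for illustration only.
FALLBACK = {"zh-tw": "zh-hant", "zh-hant": "zh-hans", "zh-hans": "en"}

# Made-up message tables; zh-tw has no entry for 'version-hooks' yet.
MESSAGES = {
    "zh-tw": {},
    "zh-hant": {"version-hooks": "鈎"},
    "en": {"version-hooks": "Hooks"},
}


def lookup(key: str, lang: str) -> tuple[str, str]:
    """Return (text, language it came from), walking the fallback chain."""
    current = lang
    while current is not None:
        text = MESSAGES.get(current, {}).get(key)
        if text is not None:
            return text, current
        current = FALLBACK.get(current)
    raise KeyError(key)


print(lookup("version-hooks", "zh-tw"))   # served from a fallback language
MESSAGES["zh-tw"]["version-hooks"] = "鉤"  # localise the message for zh-tw
print(lookup("version-hooks", "zh-tw"))   # now served from zh-tw itself
```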
Comment 10 Dan Jacobson 2009-04-02 00:08:12 UTC
OK, then could you please make traditional versions of these?
$ egrep '义|着|鈎' /tmp/allmess
|disambiguationspage                    |Template:消除含糊 Template:消歧义
|對話                                   |Template:消除歧义 Template:消歧義
|edit-hook-aborted                      |編輯被鈎取消。它並無給出解釋。
|post-expand-template-argument-category |包含着略過模板參數的頁面
|version-hook-name                      |鈎名
|version-hooks                          |鈎
|version-parser-function-hooks          |語法函數鈎
|version-parserhooks                    |語法鈎

Use
s/义/義/g;
s/着/著/g;
s/鈎/鉤/g;
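
The same three substitutions as a small Python sketch, for anyone who
prefers it to sed/perl. The names TW_FIXES and to_taiwan are made up,
and the table covers only the three characters listed above, intended
for zh-tw text.

```python
# Character-for-character replacements taken from the three s/// lines above.
TW_FIXES = str.maketrans({"义": "義", "着": "著", "鈎": "鉤"})


def to_taiwan(text: str) -> str:
    """Replace the three non-Taiwan characters with their Taiwan forms."""
    return text.translate(TW_FIXES)


for message in ["語法函數鈎", "包含着略過模板參數的頁面", "Template:消歧义"]:
    print(message, "->", to_taiwan(message))
```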
Comment 11 Shinjiman 2009-04-07 02:13:54 UTC
Traditional Chinese (zh-Hant) is not used only in Taiwan (zh-TW), so characters like 着 and 鈎 will not be changed to 著 and 鉤 there; the latter two forms are specific to zh-TW. Both 著 and 鉤 are already used in the zh-TW localisation.
Comment 12 Dan Jacobson 2009-04-07 02:31:20 UTC
Thank you; however, they are still not getting in:
$ w3m -dump http://test.wikipedia.org/wiki/Special:Version?uselang=zh-tw |egrep '鉤|鈎|alpha'
MediaWiki$ 1.15alpha (r48811)
語法鈎
語法函數鈎
$ w3m -dump http://radioscanningtw.jidanni.org/index.php?title=特殊:版本資訊 |egrep '鉤|鈎|alpha'
MediaWiki$ 1.15alpha (r49146)
鈎
鈎名 利用於
Comment 13 Shinjiman 2009-04-07 03:16:50 UTC
Done in r49257.
Comment 14 Dan Jacobson 2009-04-08 01:46:30 UTC
Please see http://www.mediawiki.org/wiki/Special:Code/MediaWiki/49257#c2160 about the leftover problem.
