Last modified: 2011-01-25 00:08:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19020, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17020 - Chinese needs sensible fallback character encoding set
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: 1.13.x
Hardware: PC Windows Server 2003
Importance: Normal normal
Assigned To: Nobody - You can work on this!
URL: /languages/Language.php

Reported: 2009-01-14 14:32 UTC by zayoo
Modified: 2011-01-25 00:08 UTC (History)

Description zayoo 2009-01-14 14:32:37 UTC
Why is there a "$this->load();" call on line 1495, inside "function fallback8bitEncoding()", in languages/Language.php of MediaWiki 1.13.3?
I'm a Simplified Chinese user, and our default charset is gb2312. When we use Mozilla Firefox to browse a MediaWiki site and search for something in Chinese, then click into the address bar and press Enter, we get a page with a garbled title.
In other words (in IE as well as Firefox), clicking a link to "http://zh.wikipedia.org/w/index.php?title=首页" works fine, but typing that URL into the address bar and pressing Enter leads to an empty page titled "Ê×Ò³".
While trying to solve the problem I found a workaround: deleting "$this->load();" on line 1495 of languages/Language.php. After that, everything works. Is this line actually needed (for some other language, say), or can it be removed?

zayoo
Comment 1 Niklas Laxström 2009-01-14 15:12:31 UTC
It is enough to describe the problem; there is no need to guess at the cause, since guesses may mislead developers.

Now, this seems to be a problem with fallback encodings. Further, it looks like Chinese does not have any fallback encoding specified, so I guess it uses the default one, which is windows-1252. This happens when non-UTF-8 text is input, which is the case when not following links.
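
The mis-decoding described here can be reproduced directly. A minimal sketch, assuming PHP's iconv extension with GB2312 and Windows-1252 support; the byte string is the GB2312 form of 首页 implied by the garbled title in the report:

```php
<?php
// GB2312 bytes for the title 首页, as a browser with a gb2312 default
// sends them when the URL is typed into the address bar.
$rawTitle = "\xCA\xD7\xD2\xB3";

// Decoding with the default fallback (windows-1252) treats every byte
// as one Latin character, which yields the garbled title.
$garbled = iconv( 'WINDOWS-1252', 'UTF-8', $rawTitle );

// Decoding with the encoding the browser actually used recovers the title.
$correct = iconv( 'GB2312', 'UTF-8', $rawTitle );

echo $garbled, "\n"; // Ê×Ò³
echo $correct, "\n"; // 首页
```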

Can you try setting $fallback8bitEncoding to gb2312 in the appropriate Messages file and see whether that helps? Alternatively, you can configure your browser to use UTF-8 by default.
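
For reference, the setting is a plain variable assignment in the language's Messages file. A sketch only: the file name languages/messages/MessagesZh_hans.php is an assumption about where the Simplified Chinese messages live, so check your own installation.

```php
<?php
// In languages/messages/MessagesZh_hans.php (assumed file name):
// encoding used to decode request text that is not valid UTF-8.
$fallback8bitEncoding = 'gb2312';
```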
Comment 2 zayoo 2009-01-16 05:48:47 UTC
Changing the default charset did not help, and setting $fallback8bitEncoding failed (maybe I don't know how to set it).

I've found that this line converts the charset when it finds that the entered title is not UTF-8:

return $this->iconv( $this->fallback8bitEncoding(), "utf-8", $s );
(Language.php, line 1491, function checkTitleEncoding( $s ))

I don't know what $this->fallback8bitEncoding() returns here, and I can't get it printed. But when I leave the first parameter empty, that is iconv( "", "utf-8", $s );, it also works fine; maybe it then uses the server's default charset for the input.

Either deleting $this->load(); in function fallback8bitEncoding() or using return $this->iconv( "", "utf-8", $s ); works fine.
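
The conversion step under discussion can be sketched outside MediaWiki. This is a simplified stand-in, not the actual Language.php code: checkTitleEncodingSketch and its $fallback parameter are hypothetical names, and the UTF-8 check via preg_match with the /u modifier is an assumption. The empty-string charset is left out because what iconv does with it depends on the iconv implementation and the server locale.

```php
<?php
/**
 * Return $s unchanged when it is already valid UTF-8; otherwise assume it
 * is in the fallback 8-bit encoding and convert it to UTF-8.
 * Hypothetical helper for illustration, not MediaWiki's implementation.
 */
function checkTitleEncodingSketch( string $s, string $fallback ): string {
    // With the /u modifier, preg_match only returns 1 for valid UTF-8 input.
    if ( preg_match( '//u', $s ) === 1 ) {
        return $s; // Pure ASCII and valid UTF-8 both pass through.
    }
    $converted = @iconv( $fallback, 'UTF-8', $s );
    // If the bytes are not valid in the fallback encoding either, keep the input.
    return $converted === false ? $s : $converted;
}

$typed = "\xCA\xD7\xD2\xB3"; // 首页 in GB2312
echo checkTitleEncodingSketch( $typed, 'GB2312' ), "\n";       // 首页
echo checkTitleEncodingSketch( $typed, 'WINDOWS-1252' ), "\n"; // Ê×Ò³
echo checkTitleEncodingSketch( '首页', 'GB2312' ), "\n";        // 首页 (already UTF-8)
```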

I'm not able to test function load() because it's widely used and deals with the cache. I just found that when $this->load(); is present, a redirect may occur (which causes the problem) when a non-UTF-8 title is entered, and when it is deleted, the redirect disappears.

What is important: a programmer who uses only single-byte characters cannot think like those who use multi-byte characters. (haha~)

Finally, I want to say that I'm not good at English, this is my first time reporting a bug (is it one?) to corporate programmers, and I'm one of those GNU/Windows people: I can't master Linux or even PHP. So my thinking may be strange and may cause you trouble.

My wiki now has $this->load(); deleted and works fine, while other Chinese wikis (including zh.wikipedia.org) do not:
http://www.ipal.org.cn/i/
(Warning: Chinese) It can only be tested if you use Chinese.

I'll discuss this further with other Chinese programmers, and test fresh wikis in a virtual machine, either making the modification before installation or changing the language.

Yours faithfully,
zayoo
Comment 3 Brion Vibber 2009-01-31 01:38:41 UTC
The Chinese case is probably also complicated by the existence of separate simplified and traditional Chinese locales.

zayoo's testing with the unset stuff may indicate that in some cases iconv() does autodetection (and is actually working in this case!), which is spiffy if so... don't know how reliable that will be, though. Additionally, this may or may not work depending on what actual iconv or mb_string configs are set up. This'll need some research and testing...
Comment 4 Siebrand Mazeland 2009-02-08 09:35:58 UTC
Any takers? CC-ing Shinjiman and philip, as I assume they have more knowledge on the matter than us Westerners do :)
Comment 5 Shinjiman 2009-04-24 18:44:21 UTC
Just added the fallback encoding for the following message files (as r49829):

Traditional Chinese (zh-Hant): Windows-950 -> CP950 -> Big5
Simplified Chinese  (zh-Hans): Windows-936 -> CP936 -> GB2312
Hong Kong Chinese   (zh-HK)  : Big5-HKSCS

P.S. For the generic Chinese language (zh), more than one code page is in use, so using more than one encoding as a fallback is worth considering. For example, try the first encoding; if the page is found, go to it; otherwise try the second encoding, and so on.
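
The try-each-encoding idea could look roughly like this. A sketch under stated assumptions: findExistingTitle, the candidate list and the $pageExists callback are hypothetical, and the callback stands in for whatever lookup the wiki would really do.

```php
<?php
/**
 * Decode $s with each candidate encoding in turn and return the first
 * result for which $pageExists reports a page, or null if none matches.
 * Hypothetical helper; names and candidate order are assumptions.
 */
function findExistingTitle( string $s, array $candidates, callable $pageExists ): ?string {
    foreach ( $candidates as $encoding ) {
        // iconv returns false (with a notice, silenced here) on invalid input.
        $converted = @iconv( $encoding, 'UTF-8', $s );
        if ( $converted !== false && $pageExists( $converted ) ) {
            return $converted;
        }
    }
    return null;
}

// Stand-in for a real page lookup.
$knownTitles = [ '首页' ];
$pageExists = fn( string $title ): bool => in_array( $title, $knownTitles, true );

$typed = "\xCA\xD7\xD2\xB3"; // 首页 in GB2312
var_dump( findExistingTitle( $typed, [ 'BIG5', 'GB2312' ], $pageExists ) );
```

Checking for an existing page matters because a conversion that succeeds only shows the bytes are valid in that encoding, not that it is the one the user meant.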
Comment 6 Brion Vibber 2009-04-27 22:46:42 UTC
Is there a reliable way to identify whether the encoding conversion appears to be successful that can distinguish these?

Alternatively, can we make use of things like Accept-Language headers to aid in our guess?
Comment 7 Shinjiman 2009-04-28 01:17:05 UTC
The Accept-Language headers can be changed in each user's browser preferences.
But they can give a rough guess at which non-Unicode encoding the user wants.
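
Such a guess might be sketched as follows. The zh-Hant, zh-Hans and zh-HK entries follow the encodings listed for r49829 above; the function name, the zh-TW and zh-CN entries, the prefix matching and the default are assumptions.

```php
<?php
/**
 * Guess a fallback encoding from an Accept-Language header value.
 * Hypothetical helper: mapping details and the default are assumptions.
 */
function guessFallbackEncoding( string $acceptLanguage, string $default = 'GB2312' ): string {
    $map = [
        'zh-hk'   => 'BIG5-HKSCS',
        'zh-tw'   => 'BIG5',
        'zh-hant' => 'BIG5',
        'zh-cn'   => 'GB2312',
        'zh-hans' => 'GB2312',
    ];
    // Walk the comma-separated list in the order the browser sent it,
    // ignoring q-values for simplicity.
    foreach ( explode( ',', strtolower( $acceptLanguage ) ) as $item ) {
        $tag = trim( explode( ';', $item )[0] );
        foreach ( $map as $prefix => $encoding ) {
            if ( strpos( $tag, $prefix ) === 0 ) {
                return $encoding;
            }
        }
    }
    return $default;
}

echo guessFallbackEncoding( 'zh-TW,zh;q=0.8,en;q=0.5' ), "\n"; // BIG5
echo guessFallbackEncoding( 'zh-CN,zh;q=0.9' ), "\n";          // GB2312
echo guessFallbackEncoding( 'en-US,en;q=0.9' ), "\n";          // GB2312 (the default)
```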
Comment 8 zayoo 2009-04-28 04:45:24 UTC
This problem remains in 1.13.5, and may remain in 1.14.0.

An error occurred only once after $this->load(); in function fallback8bitEncoding() was deleted. I have now changed return $this->iconv( $this->fallback8bitEncoding(), "utf-8", $s ); to return $this->iconv( "", "utf-8", $s ); on two servers, and they have worked for a long time.

Further testing shows that when the server is Windows (English) and the client is Windows (Chinese), it also works fine, just as if the server were Chinese too.
Comment 9 Niklas Laxström 2010-09-13 17:58:15 UTC
What's the status of this bug in MediaWiki 1.16?
Comment 10 zayoo 2010-09-14 08:03:41 UTC
It works very well in MediaWiki 1.16 when both client and server are Chinese and the language for non-Unicode programs is set to Chinese (PRC), using both IE and Firefox. I will test it on an English system soon.

PHP warnings about undefined values occur several times, so I added this to LocalSettings.php:

// Fill in $_SERVER values that PHP reported as undefined.
if ( !isset( $_SERVER['REQUEST_URI'] ) ) {
    if ( !isset( $_SERVER['SCRIPT_NAME'] ) ) {
        $_SERVER['SCRIPT_NAME'] = '';
    }
    // Rebuild REQUEST_URI from the script name plus the query string, if any.
    $_SERVER['REQUEST_URI'] = $_SERVER['SCRIPT_NAME'];
    if ( isset( $_SERVER['QUERY_STRING'] ) ) {
        $_SERVER['REQUEST_URI'] .= '?' . $_SERVER['QUERY_STRING'];
    }
}
if ( !isset( $_SERVER['REQUEST_METHOD'] ) ) {
    $_SERVER['REQUEST_METHOD'] = 'GET';
}

By the way, when I use IIRF for URL rewriting, Chinese sometimes turns into garbled text (reproducibly for certain titles), especially when the number of Chinese characters is odd or there are ASCII letters in the title, both when clicking a link and when typing into the address bar. It occurs only when the server's language for non-Unicode programs is set to Chinese, and goes away when it is set to English. It is a bug in PHP (not MediaWiki), but it never occurs on ASP pages (why?). Fortunately, the following works for most titles. In iirf.ini:

RewriteRule ^/$ /i/index.php?title=%E9%A6%96%E9%A1%B5 [L,QSA]
RewriteRule ^/zh[/]*$ /i/index.php?title=%E9%A6%96%E9%A1%B5 [L,QSA]
RewriteRule ^/zh/(.*)[_\x20]\((.*)\)$ /i/index.php?title=$1|||$2|| [L,QSA]
RewriteRule ^/zh/(.*)$ /i/index.php?title=$1| [L,QSA]

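and in LocalSettings.php:

// Undo the placeholder markers added by the rewrite rules above.
// The order matters: "|||" must be replaced before "||", and "||" before "|".
if ( isset( $_GET['title'] ) ) {
    $_GET['title'] = str_replace( "|||", "_(", $_GET['title'] );
    $_GET['title'] = str_replace( "||", ")", $_GET['title'] );
    $_GET['title'] = str_replace( "|", "", $_GET['title'] );
}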
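
To check the marker scheme in isolation, the three replacements can be wrapped in a function. A minimal sketch; restoreRewrittenTitle is a hypothetical name and the sample titles are made up.

```php
<?php
/**
 * Undo the placeholder markers added by the rewrite rules:
 * "|||" stands for "_(", "||" for ")", and a lone "|" is padding.
 * Hypothetical wrapper around the three str_replace calls.
 */
function restoreRewrittenTitle( string $title ): string {
    // Longest marker first, so "|||" is not consumed as "||" plus "|".
    $title = str_replace( '|||', '_(', $title );
    $title = str_replace( '||', ')', $title );
    return str_replace( '|', '', $title );
}

echo restoreRewrittenTitle( 'Foo|||bar||' ), "\n"; // Foo_(bar)
echo restoreRewrittenTitle( 'Foo|' ), "\n";        // Foo
```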

I know it's not a good method. I need some help.

I would also like Extension:SpecialUploadLocal for 1.16.0. I tried but failed: there are too many classes involved in uploading. I think it would be better to make this an official (built-in) feature.

Another question: can MediaWiki ignore upper/lower case and/or Traditional/Simplified Chinese differences in titles?
Comment 11 zayoo 2010-10-05 09:10:45 UTC
Everything is OK now, and the PHP bug has even been fixed in a new PHP version. This topic can be closed.
