Last modified: 2010-09-13 18:05:01 UTC

Bug 12444 - Incorrectly truncated multibyte UTF-8 char
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Hardware: All All
Importance: Normal minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch
Reported: 2007-12-29 00:47 UTC by Alexander Sigachov
Modified: 2010-09-13 18:05 UTC

change of preg_match, preg_replace in checkTitleEncoding (817 bytes, patch)
2007-12-29 00:47 UTC, Alexander Sigachov

Description Alexander Sigachov 2007-12-29 00:47:36 UTC
Created attachment 4481
change of preg_match, preg_replace in checkTitleEncoding

Problem: some links in the Russian-language interface are very long, for example a category page link like

looks like

The "from" parameter is often truncated in the middle of a multibyte character.

The getGPCVal function in WebRequest.php uses checkTitleEncoding.

The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
                '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8 or not.

But the leftover bytes of an incorrectly truncated multibyte UTF-8 character at the end of the string do not match this regexp.

So checkTitleEncoding wrongly treats the truncated UTF-8 string as non-UTF-8 and converts it from fallback8bitEncoding.
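The failure described above can be reproduced in isolation. The following is a minimal sketch, assuming the pattern quoted in this report; the sample string and variable names are made up for illustration and are not taken from MediaWiki.

```php
<?php
// The UTF-8 validity pattern quoted in this report.
$utf8Pattern = '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
    '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/';

// Hypothetical sample: the six Cyrillic letters of "Поэзия",
// written as raw UTF-8 bytes (two bytes per letter).
$full = "\xD0\x9F\xD0\xBE\xD1\x8D\xD0\xB7\xD0\xB8\xD1\x8F";

// Cut off the last byte, leaving a lone lead byte (0xD1) at the end,
// as happens when a URL parameter is truncated mid-character.
$truncated = substr( $full, 0, strlen( $full ) - 1 );

$fullIsUtf8 = preg_match( $utf8Pattern, $full ) === 1;
$truncatedIsUtf8 = preg_match( $utf8Pattern, $truncated ) === 1;

var_dump( $fullIsUtf8 );      // bool(true)
var_dump( $truncatedIsUtf8 ); // bool(false)
```

The truncated string fails only because of its final byte; everything before it is valid UTF-8.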

As a result, the "next 200 pages" link on the following category page of Russian Wikisource works incorrectly: Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible on neither the first nor the second category page.

I suggest changing the regular expression to allow for a possible fragment of a multibyte UTF-8 character at the end of the string.
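One way such a relaxed check might look is sketched below. This is not the code from the attached patch, whose contents are not shown here; the function name isUtf8AllowingTruncatedTail and the "tail" sub-pattern are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, NOT the attached patch: accepts valid UTF-8 that may
// end with an incomplete multibyte sequence.
function isUtf8AllowingTruncatedTail( $s ) {
    // One complete UTF-8 character, same alternatives as the original check.
    $char = '[\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
        '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}';
    // A lead byte followed by fewer continuation bytes than it requires.
    $tail = '[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]?|[\xf0-\xf7][\x80-\xbf]{0,2}';
    return preg_match( '/^(' . $char . ')+(' . $tail . ')?$/', $s ) === 1;
}

// Hypothetical sample: "Поэзия" as raw UTF-8 bytes with the last byte cut off.
$truncated = "\xD0\x9F\xD0\xBE\xD1\x8D\xD0\xB7\xD0\xB8\xD1";
// Hypothetical ISO-8859-1 sample ("été"), which is not valid UTF-8.
$latin1 = "\xE9t\xE9";

var_dump( isUtf8AllowingTruncatedTail( $truncated ) ); // bool(true)
var_dump( isUtf8AllowingTruncatedTail( $latin1 ) );    // bool(false)
```

A relaxed check of this kind has a cost: an 8-bit string that is otherwise ASCII and ends in a single high byte (for example "caf\xE9" in ISO-8859-1) would also be accepted, because it cannot be told apart from truncated UTF-8.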
Comment 1 Niklas Laxström 2009-06-19 12:27:55 UTC
Why is the "from" parameter truncated? Is there some kind of limit? Wouldn't the link be broken anyway, even if the encoding were detected correctly?
Comment 2 Niklas Laxström 2010-09-13 18:05:01 UTC
I can no longer reproduce this with the example category.
