Last modified: 2010-09-13 18:05:01 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T14444, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 12444 - Incorrectly truncated multibyte UTF-8 char
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Internationalization
Version: 1.12.x
Hardware: All All
Importance: Normal minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch
Depends on:
Blocks:
 
Reported: 2007-12-29 00:47 UTC by Alexander Sigachov
Modified: 2010-09-13 18:05 UTC


Attachments
change of preg_match, preg_replace in checkTitleEncoding (817 bytes, patch)
2007-12-29 00:47 UTC, Alexander Sigachov

Description Alexander Sigachov 2007-12-29 00:47:36 UTC
Created attachment 4481
change of preg_match, preg_replace in checkTitleEncoding

Problem: some links in the Russian-language interface are very long. For example, a category page link of the form

http://ru.wikisource.org/w/index.php?title=Category:CatName&from=PageName

ends up looking like

http://ru.wikisource.org/w/index.php?title=%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%9F%D0%BE%D1%8D%D0%B7%D0%B8%D1%8F_%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B8%D0%BB%D0%B8%D0%B0%D0%BD%D0%B0_%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B8%D1%87%D0%B0_%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0%BD%D0%B0&from=%D0%9F%D1%83%D1%81%D1%82%D1%8B%D0%BD%D1%8F+%28%D0%98+%D1%8F+%D0%B1%D1%8B%D0%BB+%D1%81%D0%BE%D1%81%D0%BB%D0%B0%D0%BD+%D0%B2+%D0%B3%D0%BB%D1%83%D0%B1%D1%8C+%D1%81%D1%82%D0%B5%D0%BF%D0%B5%D0%B9+%E2%80%94+%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0

The "from" parameter is often truncated in the middle of a multibyte character.
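
To make the truncation concrete, here is a small Python sketch (Python is used only for illustration; MediaWiki itself is PHP). It percent-decodes the last few escapes of the truncated URL above and checks whether the final byte is a UTF-8 lead byte with nothing after it. The helper name `is_lead_byte` is made up for this example.

```python
from urllib.parse import unquote_to_bytes

# Last escapes of the "from" value, copied from the end of the truncated URL.
tail = "%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0"
raw = unquote_to_bytes(tail)


def is_lead_byte(b: int) -> bool:
    """True for a byte that starts a 2-, 3- or 4-byte UTF-8 sequence."""
    return 0xC0 <= b <= 0xF7


last_byte = raw[-1]
ends_mid_char = is_lead_byte(last_byte)

# Everything before the dangling lead byte still decodes cleanly.
decodable_prefix = raw[:-1].decode("utf-8")

print(hex(last_byte), ends_mid_char, decodable_prefix)
```

The final `%D0` is the first half of a two-byte Cyrillic character, so the value stops one byte short of a complete character.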

The getGPCVal function in WebRequest.php calls checkTitleEncoding.

The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
                '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8 or not.

But the leftover bytes of a multibyte UTF-8 character that was truncated at the end of the string do not match this regexp.

So checkTitleEncoding wrongly treats the truncated UTF-8 string as non-UTF-8 and converts it from fallback8bitEncoding.
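
The mismatch can be reproduced outside MediaWiki. The following is a byte-level Python port of the pattern quoted above, offered as a rough sketch rather than the actual PHP code path; the names `STRICT_UTF8` and `looks_like_utf8` are invented for the example.

```python
import re

# Byte-level port of the PHP pattern from checkTitleEncoding quoted above.
STRICT_UTF8 = re.compile(
    rb'^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|'
    rb'[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$'
)


def looks_like_utf8(s: bytes) -> bool:
    """Return True if every byte belongs to a complete UTF-8 sequence."""
    return STRICT_UTF8.match(s) is not None


complete = "Пустыня".encode("utf-8")
truncated = complete[:-1]  # drop the last byte, leaving a lone lead byte

print(looks_like_utf8(complete))   # True
print(looks_like_utf8(truncated))  # False
```

A single missing byte at the end is enough to make the whole string fail the check, even though every earlier character is well-formed UTF-8.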

As a result, the "next 200 pages" link on the following Russian Wikisource category page works incorrectly.

http://ru.wikisource.org/wiki/Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible neither on the first nor on the second category page.

I suggest changing the regular expression so that it tolerates a partial multibyte UTF-8 sequence at the end of the string.
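
One possible shape for such a pattern is sketched below in Python. This is not the attached patch, whose contents are not shown here; it is an assumption about what a tolerant check could look like, and the names `TOLERANT_UTF8` and `looks_like_utf8_tolerant` are hypothetical. It accepts any number of complete sequences followed by at most one incomplete sequence at the very end.

```python
import re

# Hypothetical tolerant variant: complete UTF-8 sequences, then optionally
# one sequence that was cut off before its last continuation byte.
TOLERANT_UTF8 = re.compile(
    rb'^(?:[\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|'
    rb'[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*'
    rb'(?:[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]{0,1}|[\xf0-\xf7][\x80-\xbf]{0,2})?\Z'
)


def looks_like_utf8_tolerant(s: bytes) -> bool:
    """Like the strict check, but allow one truncated sequence at the end."""
    # The original pattern rejects the empty string, so keep that behaviour.
    return bool(s) and TOLERANT_UTF8.match(s) is not None


complete = "Пустыня".encode("utf-8")
print(looks_like_utf8_tolerant(complete))         # True
print(looks_like_utf8_tolerant(complete[:-1]))    # True: cut at the end
print(looks_like_utf8_tolerant(b"\xcf\xf0\xe8"))  # False: 8-bit Cyrillic text
```

The trade-off is that a legacy 8-bit string ending in a single byte from the lead-byte range, with valid UTF-8 before it, would now also be accepted, so a relaxed check like this is slightly more permissive about genuinely non-UTF-8 input.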
Comment 1 Niklas Laxström 2009-06-19 12:27:55 UTC
Why is the "from" parameter truncated? Is there some kind of length limit? Wouldn't the link be broken anyway, even if the encoding were detected correctly?
Comment 2 Niklas Laxström 2010-09-13 18:05:01 UTC
Cannot reproduce anymore with the example category.
