Last modified: 2010-09-13 18:05:01 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T14444, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 12444 - Incorrectly truncated multibyte UTF-8 char


Summary:	Incorrectly truncated multibyte UTF-8 char

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Internationalization (Other open bugs)
Version:	1.12.x
Hardware:	All All

Importance:	Normal minor with 1 vote (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	patch

Depends on:
Blocks:

Reported:	2007-12-29 00:47 UTC by Alexander Sigachov
Modified:	2010-09-13 18:05 UTC (History)
CC List:	1 user (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
change of preg_match, preg_replace in checkTitleEncoding (817 bytes, patch) 2007-12-29 00:47 UTC, Alexander Sigachov	Details

Description Alexander Sigachov 2007-12-29 00:47:36 UTC

Created attachment 4481 [details]
change of preg_match, preg_replace in checkTitleEncoding

Problem: some links in the Russian-language interface are very long. For example, a category page link of the form

http://ru.wikisource.org/w/index.php?title=Category:CatName&from=PageName

looks like

http://ru.wikisource.org/w/index.php?title=%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%9F%D0%BE%D1%8D%D0%B7%D0%B8%D1%8F_%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B8%D0%BB%D0%B8%D0%B0%D0%BD%D0%B0_%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B8%D1%87%D0%B0_%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0%BD%D0%B0&from=%D0%9F%D1%83%D1%81%D1%82%D1%8B%D0%BD%D1%8F+%28%D0%98+%D1%8F+%D0%B1%D1%8B%D0%BB+%D1%81%D0%BE%D1%81%D0%BB%D0%B0%D0%BD+%D0%B2+%D0%B3%D0%BB%D1%83%D0%B1%D1%8C+%D1%81%D1%82%D0%B5%D0%BF%D0%B5%D0%B9+%E2%80%94+%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0

The "from" parameter is often truncated in the middle of a multibyte character.

The getGPCVal function in WebRequest.php calls checkTitleEncoding.

The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
                '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8.

However, the remnant of a truncated multibyte UTF-8 character at the end of the string does not match this regexp.

So checkTitleEncoding wrongly runs the truncated UTF-8 string through the fallback8bitEncoding conversion.
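
The mismatch can be reproduced in isolation. The following is a minimal sketch, not MediaWiki code: it wraps the regexp quoted above in a hypothetical helper named looksLikeUtf8 and feeds it a short Cyrillic string, once complete and once cut after the lead byte of its last character.

```php
<?php
// Hypothetical helper for illustration only; it wraps the regexp quoted above.
function looksLikeUtf8( string $s ): bool {
    return preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
        '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s ) === 1;
}

$full = "\xD0\x92\xD0\xBE";     // "Во": two complete 2-byte characters
$cut  = substr( $full, 0, 3 );  // byte-wise cut: the last character keeps only its lead byte

var_dump( looksLikeUtf8( $full ) ); // bool(true)
var_dump( looksLikeUtf8( $cut ) );  // bool(false)
```

The trailing lone lead byte (0xD0) matches none of the alternatives, so the whole string is rejected even though everything before it is valid UTF-8.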

As a result, the "next 200 pages" link on the following Russian Wikisource category page works incorrectly:
 
http://ru.wikisource.org/wiki/Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible neither on the first nor on the second category page.

I suggest changing the regular expression to allow for a possible fragment of a multibyte UTF-8 character at the end of the string.
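
One way such a relaxed check might look is sketched below. This is an assumption for illustration and not the contents of the attached patch: the function name isUtf8OrTruncatedUtf8 is hypothetical, and the pattern simply extends the original regexp with an optional incomplete sequence (a lead byte followed by too few continuation bytes) at the very end.

```php
<?php
// Hypothetical sketch, not the attached patch: accept complete UTF-8 sequences,
// optionally followed by ONE incomplete sequence at the end of the string.
function isUtf8OrTruncatedUtf8( string $s ): bool {
    return preg_match(
        // Complete sequences, same alternatives as the original check.
        '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
        '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*' .
        // Optional tail: a lead byte with fewer continuation bytes than it needs.
        '([\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]{0,1}|[\xf0-\xf7][\x80-\xbf]{0,2})?\z/',
        $s ) === 1;
}

var_dump( isUtf8OrTruncatedUtf8( "\xD0\x92\xD0" ) ); // bool(true): truncated 2-byte character at the end
var_dump( isUtf8OrTruncatedUtf8( "\xD0\x92\xE2\x80" ) ); // bool(true): truncated 3-byte character at the end
var_dump( isUtf8OrTruncatedUtf8( "\xD0a" ) ); // bool(false): incomplete sequence in the middle
```

The sketch uses \z rather than $ so that the incomplete tail is only tolerated at the true end of the string, and it uses * rather than + so that a string consisting solely of a truncated character is still accepted. A caller would presumably still need to drop the incomplete tail before using the value.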

Comment 1 Niklas Laxström 2009-06-19 12:27:55 UTC

Why is the "from" parameter truncated? Is there some kind of limit? Wouldn't it be broken anyway even if the encoding were correct?

Comment 2 Niklas Laxström 2010-09-13 18:05:01 UTC

Cannot reproduce anymore with the example category.
