Last modified: 2010-09-13 18:05:01 UTC
Created attachment 4481: change of preg_match, preg_replace in checkTitleEncoding

Problem: some links in the Russian-language interface are very long. For example, a category page link of the form

http://ru.wikisource.org/w/index.php?title=Category:CatName&from=PageName

actually looks like

http://ru.wikisource.org/w/index.php?title=%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%9F%D0%BE%D1%8D%D0%B7%D0%B8%D1%8F_%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B8%D0%BB%D0%B8%D0%B0%D0%BD%D0%B0_%D0%90%D0%BB%D0%B5%D0%BA%D1%81%D0%B0%D0%BD%D0%B4%D1%80%D0%BE%D0%B2%D0%B8%D1%87%D0%B0_%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0%BD%D0%B0&from=%D0%9F%D1%83%D1%81%D1%82%D1%8B%D0%BD%D1%8F+%28%D0%98+%D1%8F+%D0%B1%D1%8B%D0%BB+%D1%81%D0%BE%D1%81%D0%BB%D0%B0%D0%BD+%D0%B2+%D0%B3%D0%BB%D1%83%D0%B1%D1%8C+%D1%81%D1%82%D0%B5%D0%BF%D0%B5%D0%B9+%E2%80%94+%D0%92%D0%BE%D0%BB%D0%BE%D1%88%D0%B8%D0

The "from" parameter is often truncated in the middle of a multibyte character.

The getGPCVal function in WebRequest.php uses checkTitleEncoding. The checkTitleEncoding function in Language.php uses

preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' . '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );

to check whether the string is UTF-8. But the remnant of an incorrectly truncated multibyte UTF-8 character at the end of the string does not match this regexp, so checkTitleEncoding wrongly treats the truncated UTF-8 string as being in fallback8bitEncoding and converts it.

As a result, the "next 200 pages" link on the following category page of Russian Wikisource works incorrectly:

http://ru.wikisource.org/wiki/Категория:Поэзия_Максимилиана_Александровича_Волошина

Some articles in the category are visible on neither the first nor the second category page.

I suggest changing the regular expression to allow for a partial UTF-8 character sequence at the end of the string.
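For illustration only, here is a rough sketch of the kind of check suggested above. It is not the contents of attachment 4481: the function name looksLikeUtf8AllowTruncated and the $tail sub-pattern are hypothetical, and the complete-character classes are copied from the preg_match call quoted in the report.

```php
<?php
// Hypothetical helper, NOT the attached patch: accepts a byte string as UTF-8
// even when its last character was cut off in the middle of its sequence.
function looksLikeUtf8AllowTruncated( $s ) {
    // Complete characters: same classes as the check quoted in the report
    // (structural only; overlong forms and surrogates are not rejected).
    $char = '[\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|'
        . '[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}';
    // Assumed addition: a lead byte followed by fewer continuation bytes
    // than it requires, allowed once at the very end of the string.
    $tail = '[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]?|[\xf0-\xf7][\x80-\xbf]{0,2}';
    // Note: unlike the quoted check, this also accepts the empty string.
    return preg_match( '/^(?:' . $char . ')*(?:' . $tail . ')?$/D', $s ) === 1;
}

// "\xD0\x9F" is a complete Cyrillic letter; the trailing "\xD0" is the first
// byte of a two-byte character whose second byte was truncated.
var_dump( looksLikeUtf8AllowTruncated( "\xD0\x9F\xD0" ) ); // bool(true)
// A lone continuation byte is still rejected.
var_dump( looksLikeUtf8AllowTruncated( "\x80" ) );         // bool(false)
```

One trade-off of this approach: a short 8-bit string that consists only of, or ends in, a single byte in the \xc0-\xf7 range after otherwise valid UTF-8 would now be accepted as truncated UTF-8 instead of being converted from fallback8bitEncoding.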
Why is the "from" parameter truncated? Is there some kind of length limit? Wouldn't the link be broken anyway, even if the encoding were detected correctly?
Cannot reproduce anymore with the example category.