Last modified: 2014-02-13 04:41:48 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T25629, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 23629 - incorrect UTF-8 processing on output of page and section titles
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://ru.wikipedia.org/w/index.php?t...
Keywords: utf8
Reported: 2010-05-22 22:24 UTC by Innocenti Maresin
Modified: 2014-02-13 04:41 UTC (History)

Description Innocenti Maresin 2010-05-22 22:24:43 UTC
The search system used in most Wikimedia projects produces malformed output on the search results page. There is no apparent flaw in the matching algorithm, but the <span class="searchmatch"> tags are placed incorrectly when the search term contains multibyte characters and appears in the title of a wiki page or of one of its sections. Probably the matching algorithm reports substring lengths and offsets in characters (code points), which the HTML-generating engine then incorrectly interprets as byte offsets.
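
The suspected mix-up of code-point offsets and byte offsets can be illustrated with a minimal Python sketch. This is not lsearchd or MediaWiki code: the function names and the sample title are hypothetical and only show how a one-character match at offset 0 cuts a two-byte UTF-8 character in half when the offset is applied to bytes.

```python
# Illustrative sketch only; all names here are hypothetical.
OPEN = "<span class='searchmatch'>"
CLOSE = "</span>"


def highlight_buggy(title: str, start: int, length: int) -> bytes:
    """Apply code-point offsets to the UTF-8 bytes (the suspected mistake)."""
    raw = title.encode("utf-8")
    end = start + length
    return (raw[:start] + OPEN.encode("ascii") + raw[start:end]
            + CLOSE.encode("ascii") + raw[end:])


def highlight_correct(title: str, start: int, length: int) -> bytes:
    """Apply code-point offsets to the decoded string, then encode."""
    end = start + length
    marked = title[:start] + OPEN + title[start:end] + CLOSE + title[end:]
    return marked.encode("utf-8")


# Sample title starting with U+0410, which is two bytes (d0 90) in UTF-8.
title = "А.Крымов"
buggy = highlight_buggy(title, 0, 1)      # match: 1 code point at offset 0
correct = highlight_correct(title, 0, 1)

print(buggy)                     # the span closes after the lone lead byte
print(correct.decode("utf-8"))   # the span wraps the whole character
```

In the buggy variant the closing tag lands between the lead byte and its continuation byte, so the output is no longer valid UTF-8.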
Comment 1 Bugmeister Bot 2011-08-19 19:12:42 UTC
Unassigning default assignments. http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/54734
Comment 2 orenbochman 2011-12-24 15:18:57 UTC
Please attach an example query that causes this error.
Comment 3 Innocenti Maresin 2011-12-24 16:30:19 UTC
Let us fetch exactly the query I gave in Bugzilla's "URL" field and examine the resulting document.

% wget 'http://ru.wikipedia.org/w/index.php?title=Special:Search&fulltext=1&search=%D0%B0&ns4=1&uselang=en'
--20:09:28--  http://ru.wikipedia.org/w/index.php?title=Special:Search&fulltext=1&search=%D0%B0&ns4=1&uselang=en
           => `index.php?title=Special:Search&fulltext=1&search=а&ns4=1&uselang=en'
…
20:09:30 (124.34 KB/s) - `index.php?title=Special:Search&fulltext=1&search=а&ns4=1&uselang=en' stored [41804/41804]

% hexdump -C -s 0x5d90 -n 128 index.php\?title=Special:Search\&fulltext=1\&search=а\&ns4=1\&uselang=en
00005d90  d0 be d0 b2 20 7c 20 3c  73 70 61 6e 20 63 6c 61  |.... | <span cla|
00005da0  73 73 3d 27 73 65 61 72  63 68 6d 61 74 63 68 27  |ss='searchmatch'|
00005db0  3e d0 3c 2f 73 70 61 6e  3e 90 2e d0 9a d1 80 d1  |>.</span>.......|
00005dc0  8b d0 bc d0 be d0 b2 20  7c 20 32 30 30 38 2d 31  |....... | 2008-1|
00005dd0  31 2d 30 39 20 7c 20 39  37 34 35 20 7c 20 d0 9f  |1-09 | 9745 | ..|
00005de0  d0 b0 d1 82 d1 80 d1 83  d0 bb d0 b8 d1 80 d1 83  |................|
00005df0  d1 8e d1 89 d0 b8 d0 b9  2c 20 d0 be d1 82 d0 ba  |........, ......|
00005e00  d0 b0 d1 82 d1 8b d0 b2  d0 b0 d1 8e d1 89 d0 b8  |................|

Here you can see an invalid byte 0xd0 (a lead byte without its continuation byte) at offset 0x00005db1 and a misplaced continuation byte 0x90 at 0x00005db9.
Together they are U+0410, the Cyrillic letter "А", split into two parts. This is clearly visible in a browser too, as replacement characters. Is this exercise really so complicated or boring for MediaWiki programmers?
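
The two offsets can be double-checked mechanically. The following Python sketch (the helper name find_invalid_utf8 is hypothetical) scans the 32 bytes shown in the dump rows 0x5db0 and 0x5dc0 and reports every byte that Python's strict UTF-8 decoder rejects, together with its absolute offset.

```python
# Illustrative sketch; find_invalid_utf8 is a hypothetical helper name.
def find_invalid_utf8(data: bytes, base: int = 0):
    """Return (absolute offset, offending bytes) for each invalid sequence."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:]
            bad.append((base + pos + err.start,
                        data[pos + err.start:pos + err.end]))
            pos += err.end
    return bad


# Bytes copied from the hexdump rows starting at 0x5db0 and 0x5dc0.
chunk = bytes.fromhex(
    "3e d0 3c 2f 73 70 61 6e 3e 90 2e d0 9a d1 80 d1"
    "8b d0 bc d0 be d0 b2 20 7c 20 32 30 30 38 2d 31"
)

for offset, seq in find_invalid_utf8(chunk, base=0x5DB0):
    print(hex(offset), seq.hex())
```

Run against the full downloaded file with base=0, the same helper would list every place where a highlighted match splits a multibyte character.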
Comment 4 orenbochman 2011-12-25 03:23:06 UTC
Thanks for the prompt response. 

I'm fairly new to Bugzilla and missed the URL you gave. Your second response is also very helpful, since I have not had to fix problems involving multibyte Unicode characters before.

Your original bug report points to the Result Rendering Stage of search. 

I'm now trying to narrow down the source of the bug.

I have found that there are bugs in Java's (multibyte) Unicode implementation which carried over into the version of Lucene, Wikipedia's search library, that we use. While Lucene has fixed these, we are still working with the old version.
Another possibility is the highlighter code, which is being upgraded.

Anyhow, I'll also be adding some unit tests to make sure this issue does not recur once it is fixed.

I'll update here as soon as I find out more.
Comment 5 Andre Klapper 2013-03-26 11:19:19 UTC
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
Comment 6 Chad H. 2014-02-13 04:40:54 UTC
Don't have this problem with the new search engine.

Example query: https://ru.wikipedia.org/w/index.php?title=Special:Search&fulltext=1&search=%D0%B0&ns4=1&uselang=en&srbackend=CirrusSearch

Closing WONTFIX as lsearchd has been end of life'd.
