Last modified: 2009-06-26 14:39:07 UTC
Here the search string I give is "逢甲", so why does it behave as if I had merely typed "甲"?

$ w3m -dump "http://taizhongbus.jidanni.org/index.php?search=逢甲&fulltext=搜索"

Problem 1: raw $1:
有關搜索中公的更多詳情,參見$1。 [For more details about searching 中公, see $1.]
1. 大甲-龜殼村-海墘 (344字節)

Problem 2: it also matches on only one character of my two-character query:
4. 大甲-海尾子 (685字節)
5. 大甲-外埔-土城 (421字節)
12. 大甲-龜殼村-海墘 (344字節)
13. 大甲-豐原 (884字節)

The website is online, for you to test.
OK, it looks like the splitting of characters (done to compensate for the lack of word spacing in Chinese text) is happening after the boolean search query is constructed, leading to failure.

The input '逢甲' is translated to a boolean query for a single required word: '+逢甲', which then gets split up by character and encoded to compensate for encoding bugs: '+ U8e980a2 U8e794b2'. The '+' gets detached from the characters, so it has no effect, and the search back end returns results that contain either character instead of requiring both.

As a workaround, you can quote the multi-character string, which ends up encoding correctly for a phrase search: '+" U8e980a2 U8e794b2"'
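To make the failure mode concrete, a minimal sketch (my own illustrative code, not the real implementation; the real code additionally hex-encodes each character, hence the U8... forms above):

<?php
// Per-character segmentation runs after the boolean operator has already been
// attached, so the '+' ends up followed by a space and binds to nothing.
function segmentHan( $term ) {
	// Surround every Han character with spaces, then collapse runs of spaces.
	$spaced = preg_replace( '/\p{Han}/u', ' $0 ', $term );
	return preg_replace( '/ {2,}/', ' ', $spaced );
}

$input = '逢甲';
echo '+' . segmentHan( $input ), "\n";                 // + 逢 甲   -- '+' now applies to nothing
echo '+"' . trim( segmentHan( $input ) ) . '"', "\n";  // +"逢 甲"  -- phrase search requires both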
OK, comparing
> http://radioscanningtw.jidanni.org/index.php?search=學甲&ns0=1&title=特殊:搜尋&fulltext=Search
> http://radioscanningtw.jidanni.org/index.php?search='學甲'&ns0=1&title=特殊:搜尋&fulltext=Search
> http://radioscanningtw.jidanni.org/index.php?search="學甲"&ns0=1&title=特殊:搜尋&fulltext=Search
it is clear only the final form gives correct results. Could you fellows glue the + that has fallen off back on, there behind the scenes? Wouldn't that be better than Asian sites' users thinking Search is broken, or than MediaWiki needing to add instructions telling Asian users to double "quote" "every" Asian "string" they want to search?
Totally not fair: why is it fixed on Wikipedia, so that
> http://zh.wikipedia.org/w/index.php?title=Special:Search&search=%E5%AD%B8%E7%94%B2&fulltext=Search
> http://zh.wikipedia.org/w/index.php?title=Special:Search&search=%22%E5%AD%B8%E7%94%B2%22&fulltext=Search
return the same (correct) results, but on vanilla MediaWiki one needs to (tell users to) quote,
> http://radioscanningtw.jidanni.org/index.php?title=%E7%89%B9%E6%AE%8A:%E6%90%9C%E5%B0%8B&search=%E5%AD%B8%E7%94%B2&fulltext=%E6%90%9C%E5%B0%8B
> http://radioscanningtw.jidanni.org/index.php?title=%E7%89%B9%E6%AE%8A:%E6%90%9C%E5%B0%8B&search=%22%E5%AD%B8%E7%94%B2%22&fulltext=%E6%90%9C%E5%B0%8B
if one wants proper results?
Alas, I see WMF doesn't use SpecialSearch.php anymore, but these extensions instead:

$ w3m -dump http://zh.wikipedia.org/wiki/Special:Version | grep Search
MWSearch        MWSearch plugin             Brion Vibber
OpenSearchXml   OpenSearch XML interface    Brion Vibber

So the best I can do for now is put a message in MediaWiki:Searchresulttext: "If searching Chinese, try your search again with quote marks, 逢甲 -> "逢甲". Sorry."
SpecialSearch.php provides the front-end UI, and is indeed used on Wikimedia sites. MWSearch provides an alternate back-end. PostgreSQL users also have a different search back-end. Unsurprisingly, different back-ends have different properties and do not all share the same bugs.
Created attachment 6211 [details]
CJK quoter

How about this patch? It seems to work and maybe not break anything else. All I'm trying to do is type those quote marks that Brion mentioned for the user, behind the scenes, instead of asking them up front to type them in via some embarrassing message. Otherwise what is the logic of distributing a broken search without the least warning to the user?

But as Wikipedia uses a better search, repairing this worse search will be an uphill battle: without being forced to eat your own medicine, you won't have any impetus to improve it. So MediaWiki should distribute the good stuff it uses itself instead.

Anyway, note that I only patched zh-hans. This will not help the other CJK languages that already have their own languages/classes/Language*.php. Fortunately zh-tw doesn't, so it will get this fix. As far as patch quality, well, as it seems nobody cares much about this old search function, just chuck it in; better than nothing. All I know is it works for me here on MySQL, Linux, etc.
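For anyone without the attachment handy, a rough sketch of the idea only (my own illustrative code, not the attached diff):

<?php
// Wrap each run of two or more Han characters in double quotes, segmented by
// character, so the MySQL boolean-mode back end treats the run as a required
// phrase instead of a bag of optional single characters.
function quoteSegmentedHan( $text ) {
	return preg_replace_callback( '/\p{Han}{2,}/u', function ( $m ) {
		$chars = preg_split( '//u', $m[0], -1, PREG_SPLIT_NO_EMPTY );
		return '"' . implode( ' ', $chars ) . '"';
	}, $text );
}

echo quoteSegmentedHan( '逢甲' ), "\n"; // "逢 甲"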
Patch as written can result in double-quoting, causing searches to fail if quotes were used in the original search term. With no quotes in input it seems ok... should be possible to tweak it to not add double-quotes.
OK, tomorrow I will make the patch first scan to see if the user has put any double quote marks in their input, and not tamper with their input if so (a sketch of what I mean is below). Glad to know this is the right place to fix this bug, so I needn't look deeper under the hood. Other CJK languages are welcome to make similar fixes; I'll just concentrate on Zh here.
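Something along these lines (a sketch of the planned guard, not the revised patch itself):

<?php
// Leave the query untouched if the user already typed double quotes, so the
// quoter never nests quotes inside the user's own phrase quoting.
function maybeQuoteHanPhrases( $text ) {
	if ( strpos( $text, '"' ) !== false ) {
		return $text; // respect the user's own quoting
	}
	// Otherwise quote each multi-character Han run as a segmented phrase.
	return preg_replace_callback( '/\p{Han}{2,}/u', function ( $m ) {
		return '"' . implode( ' ', preg_split( '//u', $m[0], -1, PREG_SPLIT_NO_EMPTY ) ) . '"';
	}, $text );
}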
Implementation committed in r52338: Big fixup for Chinese word breaks and variant conversions in the MySQL search backend...
- removed redundant variant terms for Chinese, which forces all search indexing to canonical zh-hans
- added parens to properly group variants for languages such as Serbian which do need them at search time
- added quotes to properly group multi-word terms coming out of stripForSearch, as for Chinese where we segment up the characters. This is based on the Language::hasWordBreaks() check.
- also cleaned up LanguageZh_hans::stripForSearch() to just do segmentation and pass on the Unicode stripping to the base Language implementation, avoiding scary code duplication. Segmentation was already pulled up to LanguageZh, but was being run again at the second level. :P
- made a fix to Chinese word segmentation to handle the case where a Han character is followed by a Latin char or numeral; a space is now added after as well. Spaces are then normalized for prettiness.
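Roughly what the segmentation change means in practice (illustrative only, not the committed code):

<?php
// A space is added on both sides of each Han character, so a Han character
// followed by a Latin letter or digit is also separated; whitespace runs are
// then normalized.
function segmentForSearch( $text ) {
	$spaced = preg_replace( '/\p{Han}/u', ' $0 ', $text );
	return trim( preg_replace( '/\s+/u', ' ', $spaced ) );
}

echo segmentForSearch( '逢甲abc123' ), "\n"; // 逢 甲 abc123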
"Other CJK languages are welcome to make similar fixes, I'll just concentrate on Zh here." Not all CJK languages omit interword spaces and not all languages which omit interword spaces are CJK: * Korean does use spaces between words. Quite possibly a full-width space character rather than ASCII 0x20. * Thai and Khmer (Cambodian) do not use spaces between words. * Note that both Unicode and HTML include means of indicating invisible word breaks for such languages. Then again a quick Google seems to indicate that the HTML "WBR" tag is neither official nor interpreted to have the same semantics by everybody. Another approach would be to harvest Han compounds from souces such as EDICT, CEDICT, and the various Wiktionaries. Google does morphological analysis to determine which strings of Han characters are compounds that should be treated as words. Andrew Dunbar (hippietrail)
Glad Chinese is finally fixed. No need for any more "try Google instead" in MediaWiki:Searchresulttext!

> Another approach would be to harvest Han compounds from sources such as EDICT,

Well, my wikis' compounds are all police department and bus stop names: http://jidanni.org/comp/wiki/article-category.html