
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and preserved for historical purposes. It is not possible to log in; apart from displaying bug reports and their history, links may be broken. See T10445, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 8445 - Multiple search terms are not enforced properly for Chinese
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.16.x
Hardware: All
OS: All
Priority: Normal
Severity: normal
Keywords: patch
Assigned To: Brion Vibber
Reported: 2006-12-31 14:41 UTC by Dan Jacobson
Modified: 2009-06-26 14:39 UTC
CC: 2 users


Attachments
CJK quoter (824 bytes, patch), 2009-06-09 06:02 UTC, Dan Jacobson

Description Dan Jacobson 2006-12-31 14:41:11 UTC
Here the search string I give is "逢甲", so why does it behave as if I merely
typed "甲"?

$ w3m -dump "http://taizhongbus.jidanni.org/index.php?search=逢甲&fulltext=搜索"
Problem 1: the raw $1 placeholder appears in the results page:
有關搜索中公的更多詳情,參見$1。 (an interface message shown with $1 left unsubstituted)
 1. 大甲-龜殼村-海墘 (344字節)
Problem 2: it also matches on only one of the two characters in my query:

 4. 大甲-海尾子 (685字節)
 5. 大甲-外埔-土城 (421字節)

12. 大甲-龜殼村-海墘 (344字節)
13. 大甲-豐原 (884字節)

The website is online for you to test.
Comment 1 Brion Vibber 2008-12-23 02:12:30 UTC
Ok, it looks like the splitting of characters (done to compensate for the lack of word spacing in Chinese text) is happening after the boolean search query is constructed, leading to failure:

The input:
'逢甲'

is translated to a boolean query for a single required word:
'+逢甲"

which then gets split up by character, then encoded to compensate for encoding bugs:
'+  U8e980a2  U8e794b2'

The '+' gets detached from the characters, so it has no effect, and the search backend returns results that contain either character instead of requiring both.

As a workaround, you can quote the multi-character string, which ends up encoding correctly for a phrase search:
'+"  U8e980a2  U8e794b2"'
Comment 2 Dan Jacobson 2009-05-20 13:36:02 UTC
OK, comparing
> http://radioscanningtw.jidanni.org/index.php?search=學甲&ns0=1&title=特殊:搜尋&fulltext=Search
> http://radioscanningtw.jidanni.org/index.php?search='學甲'&ns0=1&title=特殊:搜尋&fulltext=Search
> http://radioscanningtw.jidanni.org/index.php?search="學甲"&ns0=1&title=特殊:搜尋&fulltext=Search
it is clear only the final form gives correct results.

Could you fellows glue the + that has fallen off back on, there
behind the scenes?

Wouldn't that be better than users of Asian-language sites thinking Search is broken, or MediaWiki
needing to add instructions telling Asian users to double "quote" "every" Asian "string"
they want to search for?
Comment 4 Dan Jacobson 2009-05-27 20:50:09 UTC
Alas, I see WMF doesn't use SpecialSearch.php anymore, but
these extensions instead,

$ w3m -dump http://zh.wikipedia.org/wiki/Special:Version | grep Search
MWSearch      MWSearch plugin          Brion Vibber and
OpenSearchXml OpenSearch XML interface Brion Vibber

So the best I can do for now is put a message in
MediaWiki:Searchresulttext: "If searching Chinese, try your search
again with quote marks, 逢甲 -> "逢甲" . Sorry".
Comment 5 Brion Vibber 2009-05-27 21:12:12 UTC
SpecialSearch.php provides the front-end UI, and is indeed used on Wikimedia sites.

MWSearch provides an alternate back-end. PostgreSQL users also have a different search back-end. Unsurprisingly, different back-ends have different properties and do not all share the same bugs.
Comment 6 Dan Jacobson 2009-06-09 06:02:03 UTC
Created attachment 6211 [details]
CJK quoter

How about this patch? It seems to work and maybe not break anything else.
All I'm trying to do is type those quote marks that Brion mentioned
for the user behind the scenes, instead of asking them up front to
type them in, via some embarrassing message. Otherwise what is the
logic of distributing a broken search without the least warning to the
user?
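
For context, here is a rough sketch of that kind of behind-the-scenes quoting. Attachment 6211 itself is not reproduced in this archive, and quoteCjkSearchTerm() is a hypothetical helper name, so this is an illustration of the idea rather than the patch.

<?php
// Hedged reconstruction, not the actual attachment: if the query contains
// Han characters, wrap it in double quotes so the MySQL back end treats it
// as one required phrase, i.e. type the quote marks for the user.
function quoteCjkSearchTerm( $term ) {
	// \p{Han} matches CJK ideographs (needs PCRE with Unicode properties).
	if ( preg_match( '/\p{Han}/u', $term ) ) {
		return '"' . $term . '"';
	}
	return $term;
}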

But as Wikipedia uses a better search, repairing this worse search
will be an uphill battle: without being forced to eat your own
medicine, you won't have any impetus to improve it.

So MediaWiki should distribute the good stuff it uses itself instead.

Anyway, note that I only patched zh-hans. This will not help the other
CJK languages that already have their own
languages/classes/Language*.php. Fortunately zh-tw doesn't, so it will
get this fix.

As far as patch quality, well, as it seems nobody cares much about
this old search function, just chuck it in, better than nothing.

All I know is it works for me here on MySQL Linux etc.
Comment 7 Brion Vibber 2009-06-23 23:24:44 UTC
The patch as written can result in double-quoting, causing searches to fail if quotes were used in the original search term. With no quotes in the input it seems OK... it should be possible to tweak it so it doesn't add a second set of quotes.

Comment 8 Dan Jacobson 2009-06-23 23:45:56 UTC
OK, tomorrow I will make the patch first scan to see if the user has put
any double quote marks in their input, and not tamper with their input if so.
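
A minimal sketch of that check (again hypothetical, not the revised attachment):

// Only add quotes when the user has not already quoted anything themselves.
if ( strpos( $term, '"' ) === false && preg_match( '/\p{Han}/u', $term ) ) {
	$term = '"' . $term . '"';
}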

Glad to know this is the right place to fix this bug, so I needn't look deeper
under the hood.

Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here.
Comment 9 Brion Vibber 2009-06-24 02:28:23 UTC
Implementation committed in r52338:

Big fixup for Chinese word breaks and variant conversions in the MySQL search backend...
- removed redundant variant terms for Chinese, which forces all search indexing to canonical zh-hans
- added parens to properly group variants for languages such as Serbian which do need them at search time
- added quotes to properly group multi-word terms coming out of stripForSearch, as for Chinese where we segment up the characters. This is based on Language::hasWordBreaks() check.
- also cleaned up LanguageZh_hans::stripForSearch() to just do segmentation and pass on the Unicode stripping to the base Language implementation, avoiding scary code duplication. Segmentation was already pulled up to LanguageZh, but was being run again at the second level. :P
- made a fix to Chinese word segmentation to handle the case where a Han character is followed by a Latin char or numeral; a space is now added after as well. Spaces are then normalized for prettiness.
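
As an illustration of the behaviour that commit message describes, here is a simplified sketch. segmentForSearch() and buildTerm() are stand-in names, not the real functions touched by r52338.

<?php
// Simplified sketch of the described behaviour, not the r52338 diff itself.

// Chinese segmentation: a space around every Han character, so a Han
// character followed by a Latin letter or digit also gets a space after it;
// runs of spaces are then collapsed.
function segmentForSearch( $term ) {
	$term = preg_replace( '/(\p{Han})/u', ' $1 ', $term );
	return trim( preg_replace( '/ +/', ' ', $term ) );
}

// Quoting rule: if segmentation produced several words for a language
// without word breaks, group them as one quoted phrase before applying '+'.
function buildTerm( $segmented, $hasWordBreaks ) {
	if ( !$hasWordBreaks && strpos( $segmented, ' ' ) !== false ) {
		return '+"' . $segmented . '"';
	}
	return '+' . $segmented;
}

echo segmentForSearch( '逢甲abc' ), "\n";            // 逢 甲 abc
echo buildTerm( segmentForSearch( '逢甲' ), false ); // +"逢 甲"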
Comment 10 Andrew Dunbar 2009-06-24 05:29:36 UTC
"Other CJK languages are welcome to make similar fixes, I'll just
concentrate on Zh here."

Not all CJK languages omit interword spaces and not all languages which omit interword spaces are CJK:

* Korean does use spaces between words. Quite possibly a full-width space character rather than ASCII 0x20.
* Thai and Khmer (Cambodian) do not use spaces between words.
* Note that both Unicode and HTML include means of indicating invisible word breaks for such languages. Then again a quick Google seems to indicate that the HTML "WBR" tag is neither official nor interpreted to have the same semantics by everybody.

Another approach would be to harvest Han compounds from sources such as EDICT, CEDICT, and the various Wiktionaries. Google does morphological analysis to determine which strings of Han characters are compounds that should be treated as words.

Andrew Dunbar (hippietrail)
Comment 11 Dan Jacobson 2009-06-26 14:39:07 UTC
Glad Chinese is finally fixed. No more need for a "try Google instead" note
in MediaWiki:Searchresulttext!
> Another approach would be to harvest Han compounds from sources such as EDICT,
Well my wikis' compounds are all police department and bus stop names:
http://jidanni.org/comp/wiki/article-category.html .
