Last modified: 2010-05-15 15:29:33 UTC
I'm having "issues" with searching that I'm not exactly sure how to solve, and all of these are evident at the specified URL. In essence, if I search for the word "four", I get absolutely no results. The SQL in question is roughly: SELECT * FROM searchindex WHERE MATCH (si_text) AGAINST ('+four' IN BOOLEAN MODE); (this is for MySQL 4, naturally). But if I turn around and do a decidedly MySQL 3.x query: SELECT * FROM searchindex WHERE si_text LIKE '%four%'; I get back the two entries I expect. This seems to tell me that the searchindex table is "Ok". To double-check, I dumped the table, deleted it, recreated it, and reimported the data (thus recreating the indexes). Same result. The real goal here is to show all matches for the word "EC": I don't want "suspect" to be matched, but I do want "-20 EC." and similar entries (EC is a date measurement). To let MySQL search for these smaller words, I've already modified the my.cnf and set it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK). But somewhere in the wiki code (at the very least in the display settings), searches are being done as strings, and not on word boundaries. Is there any way to force a word boundary? To make matters worse, searching for "ec" at http://gamegrene.com/wiki/ "works" (because of my edit to my.cnf) but matches on "suspect". However, searching for "ur", which should match on "procedure", doesn't return any results (but "procedure" does, as opposed to "four"). ARggGh!
These are limitations of MySQL's full text search engine. You need to adjust MySQL's stopword list (which ignores "four") and minimum word length (which ignores "EC"). Please see: http://dev.mysql.com/doc/mysql/en/Fulltext_Fine-tuning.html
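To illustrate why LIKE '%four%' finds rows that MATCH ... AGAINST does not, here is a rough Python sketch of an indexer that drops stopwords and short words before building its index. It is a toy model, not MySQL's code: the STOPWORDS set, the function name, and the sample row are assumptions made up for the example, and the minimum length of 4 mirrors MySQL's default ft_min_word_len.

```python
import re

# Assumed values for illustration only: a tiny stopword set and a
# minimum word length of 4 (MySQL's default for ft_min_word_len).
STOPWORDS = {"four", "the", "and"}
MIN_WORD_LEN = 4

def indexed_words(text, min_len=MIN_WORD_LEN, stopwords=STOPWORDS):
    """Return the set of words a toy full-text index would keep."""
    words = re.findall(r"\w+", text.lower())
    # Words that are too short or listed as stopwords never reach the index,
    # so no full-text query can find them.
    return {w for w in words if len(w) >= min_len and w not in stopwords}

row = "The four ships left in -20 EC."   # hypothetical si_text value
index = indexed_words(row)
```

In this model "four" and "ec" are absent from the index even though a plain substring test on the row still finds "four", which matches the behaviour described in the report. Lowering min_len to 2 brings "ec" back, but "four" stays hidden until the stopword list is changed.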
As mentioned in the initial report, I already have revised MySQL's fulltext index: "To let MySQL search for these smaller words, I've already modified the my.cnf and set it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK)." - otherwise, I wouldn't get any results at all for EC, which I am (as per the original report). As for "four", that I didn't know, and I'll correct that shortly.
Just to reiterate more clearly:
* I've lowered the full text minimum word length to 2 letters.
* I've rebuilt the table indexes with no success.
* I've deleted, recreated, and reimported the searchindex table.
* I want to search on word boundaries such that "EC" does not match "suspect".
* When searching for "EC" at Gamegrene, we get five pages that I know match.
* However, I don't know what exactly is matched. If MySQL MATCH() does word boundaries, then the MW display does string searching (as it always shows "suspect").
* "ur" as in "procedure" shows no matches; "procedure" does.
http://gamegrene.com/wiki/Special:Search?search=ec&fulltext=Search only returns pages which contain "EC" by itself. Can you clarify what exactly your problem is?
From three of my machines (different IPs, logged in or not), and another person's machine entirely, we're NOT seeing "EC" by itself (word boundary). We're seeing EC as a string. For instance, one of the returned results shows the below, which is matching on "effect", "secret", and "ineffective".
# Avazian Box (2331 bytes)
1: ...d quickly. This advancement came with the side effect of immense greed. Many highly advanced magnetic ...
3: ...g new magnetic propulsion technologies, formed a secret team intending to thwart the ongoing conflict.
5: ...which rendered all weapons of Avazian origin ineffective, and the absorption of the magnetic field wou...
"ec" is matched in the middle of a word. Other two character sequences are typically not matched in the middle of a word. The desired behavior is to match ec only when it is a whole word, not in the middle of words.
Can you explain what you mean by "match"? As far as I can tell, the search is *ONLY* returning pages in which "EC" appears as a distinct word when asked to search for "EC". Nothing else. No other pages are returned. So, is this about the *searching*? Or, is it about the *highlighting* of text extracts in the search results display? Can you please clarify?
Brion - exactly, that's what I don't know (from a previous entry): "When searching for "EC" at Gamegrene, we get five pages that I know match. If MySQL MATCH() does word boundaries, then the MW display does string searching (as it always shows "suspect")." If MySQL MATCH() does do word boundaries, then yeah, I guess I'm reporting a bug in the display code (specifically, showHit() in SearchEngine.php). Thanks for the patience.
Morbus, for general information on the fulltext search engine see http://dev.mysql.com/doc/mysql/en/Fulltext_Boolean.html Matches are on full words unless you use the * operator (e.g., a search for "apple*" finds "applet" and "applesauce" but a search for "apple" does not). Changed summary and sample URL to reflect the problem.
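The whole-word versus trailing * behaviour described above can be mimicked with a small Python sketch. This is only a toy model of a single boolean-mode term, not how MySQL evaluates queries; the function name and the sample strings are hypothetical.

```python
import re

def boolean_term_matches(term, text):
    """Toy model of one boolean-mode term: whole word, or prefix if it ends in '*'."""
    words = re.findall(r"\w+", text.lower())
    term = term.lower()
    if term.endswith("*"):
        # Trailing '*' turns the term into a prefix match on each word.
        prefix = term[:-1]
        return any(w.startswith(prefix) for w in words)
    # Without '*', the term must equal a complete word.
    return term in words
```

Under this model "apple" does not match text containing only "applet" and "applesauce" while "apple*" does, "ur" does not match "procedure", and "EC" matches "-20 EC." but not "suspect". That is consistent with the search itself returning only whole-word hits.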
This has not been heavily tested yet, but the following revision in SearchEngine.php:showHit() seems to do what I want: $pat1 = "/(.*)(\b" . implode( "|", $this->mSearchterms ) . "\b)(.*)/i"; The generated pattern then becomes /(.*)(\bEC\b)(.*)/i or, in the case of multiple search terms, /(.*)(\b20|EC\b)(.*)/i. This code is currently live at the provided URL, so you can test as needed.
Sorry - the correct revision is: $pat1 = "/(.*)(\b" . implode( "\b|\b", $this->mSearchterms ) . "\b)(.*)/i"; which creates a pattern like /(.*)(\b20\b|\bEC\b)(.*)/i.
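For anyone wanting to check the difference between the two patterns outside the wiki, here is a small Python sketch (Python's re module treats \b the same way for this purpose). It is not the MediaWiki code: the function name and the per_term_boundaries flag are made up for the comparison, and the re.escape call is an extra precaution that the posted patch does not include.

```python
import re

def highlight_pattern(terms, per_term_boundaries=True):
    """Build a highlight regex from search terms, in either of the two posted forms."""
    # Escaping is an addition for safety; the patch in this thread joins raw terms.
    escaped = [re.escape(t) for t in terms]
    if per_term_boundaries:
        # Corrected form: \b20\b|\bEC\b
        body = r"\b|\b".join(escaped)
    else:
        # First attempt: \b20|EC\b (boundaries only at the outer edges)
        body = "|".join(escaped)
    return re.compile(r"(.*)(\b" + body + r"\b)(.*)", re.IGNORECASE)

first = highlight_pattern(["20", "EC"], per_term_boundaries=False)
second = highlight_pattern(["20", "EC"])
```

With the first form, \b20 still matches the start of "2000" and EC\b still matches the end of "exec", because the alternation splits the pattern into "\b20" and "EC\b". The second form wraps every term in its own pair of boundaries, so only standalone "20" and "EC" are highlighted.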
Fixed in r26269 for mainline, r26271 for lucenesearch extension.