Last modified: 2010-05-15 15:29:33 UTC

Bug 278 - Search results highlight partial word matches
Product: MediaWiki
Classification: Unclassified
Component: Search
Hardware: All All
Importance: Low minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Reported: 2004-09-03 01:18 UTC by Morbus Iff
Modified: 2010-05-15 15:29 UTC (History)


Description Morbus Iff 2004-09-03 01:18:38 UTC
I'm having "issues" with searching that I'm not exactly sure how to solve, and all of these are evident at the specified URL. In 
essence, if I search for the word "four", I get absolutely no results. The SQL in question is roughly: SELECT * from searchindex 
where MATCH (si_text) AGAINST ('+four' IN BOOLEAN MODE); (this is for MySQL 4, naturally). But, if I turn around and do a 
decidedly MySQL 3.x query: SELECT * from searchindex where si_text LIKE '%four%'; I get back the two entries I expect. This 
seems to tell me that the searchindex table is "Ok". To doublecheck, I dumped the table, deleted it, recreated it, and reimported 
the data (thus recreated the indexes). Same result.

The real goal here is to show all matches for the word "EC" - I don't want "suspect" to be matched, but I want "-20 EC." and 
similar entries (EC is a date measurement). To let MySQL search for these smaller words, I've already modified the my.cnf and set 
it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK). But, somewhere in the wiki code (at the very least in 
the display settings), searches are being done as strings, and not on word boundaries. Is there any way to force a word boundary? To 
make matters worse, searching for "ec" at the specified URL "works" (because of my edit to my.cnf) but matches on 
"suspect". However, searching for "ur", which should match on "procedure", doesn't return any results (but "procedure" does, as 
opposed to "four").

Comment 1 Brion Vibber 2004-09-03 02:43:34 UTC
These are limitations of MySQL's full text search engine. You need to adjust MySQL's stopword list (which ignores "four") and 
minimum word length (which ignores "EC"). Please see: http:
Comment 2 Morbus Iff 2004-09-03 11:55:24 UTC
As mentioned in the initial report, I already have revised MySQL's fulltext index: "To let MySQL search for these smaller words, I've already modified 
the my.cnf and set it at 2 characters, and then rebuilt my index (REPAIR searchindex QUICK)." - otherwise, I wouldn't get any results at all for EC, 
which I am (as per the original report). As for "four", that I didn't know, and I'll correct that shortly.
Comment 3 Morbus Iff 2004-09-03 13:30:30 UTC
Just to reiterate more clearly:

 * I've lowered the full-text minimum word length to 2 letters.
 * I've rebuilt the table indexes with no success.
 * I've deleted, recreated, and reimported the searchindex table.
 * I want to search on word boundaries such that "EC" does not match "suspect".
 * When searching for "EC" at Gamegrene, we get five pages that I know match.
 * However, I don't know what exactly is matched. If MySQL MATCH() does word
   boundaries, then the MW display does string searching (as it always shows
   "suspect").
 * "ur" as in "procedure" shows no matches; "procedure" does.
Comment 4 Brion Vibber 2004-09-03 16:03:30 UTC
The search only returns pages which contain "EC" by itself.

Can you clarify what exactly your problem is?
Comment 5 Morbus Iff 2004-09-03 16:30:47 UTC
From three of my machines (different IPs, logged in or not), and another
person's machine entirely, we're NOT seeing "EC" by itself (word boundary).
We're seeing EC as a string. For instance, one of the returned results shows the
below, which is matching on "effect", "secret", and "ineffective".

# Avazian Box (2331 bytes)
1: ...d quickly. This advancement came with the side effect of immense 
greed. Many highly advanced magnetic ...
3: ...g new magnetic propulsion technologies, formed a secret team 
intending to thwart the ongoing conflict.
5: ...which rendered all weapons of Avazian origin ineffective, and the 
absorption of the magnetic field wou...
Comment 6 Jamesday 2004-09-03 16:52:11 UTC
"ec" is matched in the middle of a word. Other two character sequences are
typically not matched in the middle of a word. The desired behavior is to match
ec only when it is a whole word, not in the middle of words.
Comment 7 Brion Vibber 2004-09-03 16:59:29 UTC
Can you explain what you mean by "match"? As far as I can tell, the search is *ONLY* returning pages in which "EC" 
appears as a distinct word when asked to search for "EC". Nothing else. No other pages are returned.

So, is this about the *searching*?

Or, is it about the *highlighting* of text extracts in the search results display?

Can you please clarify?
Comment 8 Morbus Iff 2004-09-03 17:29:48 UTC
Brion - exactly, that's what I don't know (from a previous entry):  "When
searching for "EC" at Gamegrene, we get five pages that I know match. If MySQL
MATCH() does word boundaries, then the MW display does string searching (as it
always shows "suspect")."

If MySQL MATCH() does do word boundaries, then yeah, I guess I'm reporting a bug
in the display code (specifically, showHit() in SearchEngine.php).

Thanks for the patience.
Comment 9 Brion Vibber 2004-09-03 17:34:42 UTC
Morbus, for general information on the fulltext search engine see

Matches are on full words unless you use the * operator (eg, search for "apple*" finds "applet" and "applesauce" but search 
for "apple" does not).
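
The whole-word versus "*" behaviour described above can be modelled with a regular expression. This is only a rough sketch in PHP, assuming a hypothetical helper named termToPattern(); it imitates the rule with PCRE word boundaries and is not how MySQL's fulltext engine works internally.

```php
<?php
// Hypothetical helper (not part of MediaWiki or MySQL): turn one search term
// into a regex fragment. A trailing "*" means "prefix match"; anything else
// must match as a whole word.
function termToPattern(string $term): string
{
    if (substr($term, -1) === '*') {
        $stem = preg_quote(substr($term, 0, -1), '/');
        return '\b' . $stem . '\w*';
    }
    return '\b' . preg_quote($term, '/') . '\b';
}

$text = 'an apple, an applet, and applesauce';

// Whole-word search: only the standalone word is found.
preg_match_all('/' . termToPattern('apple') . '/i', $text, $exact);

// Prefix search: every word starting with the stem is found.
preg_match_all('/' . termToPattern('apple*') . '/i', $text, $prefix);

print_r($exact[0]);
print_r($prefix[0]);
```

With the sample sentence, the first call finds only "apple", while the second also finds "applet" and "applesauce".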

Changed summary and sample URL to reflect the problem.
Comment 10 Morbus Iff 2004-09-03 20:25:59 UTC
This has not been heavily tested yet, but the following revision
in SearchEngine.php:showHit() seems to do what I want: 

  $pat1 = "/(.*)(\b" . implode( "|", $this->mSearchterms ) . "\b)(.*)/i";

The generated pattern then becomes /(.*)(\bEC\b)(.*)/i or, in the case of
multiple search terms, /(.*)(\b20|EC\b)(.*)/i. This code is currently live 
at the provided URL, so you can test as needed.
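
A note on the multi-term pattern above: in a regular expression, alternation binds more loosely than \b, so /(\b20|EC\b)/i reads as "20 at the start of a word, or EC at the end of a word". The PHP sketch below illustrates this; $terms is a stand-in for $this->mSearchterms, and the sample strings are made up for the demonstration.

```php
<?php
// Illustration only: $terms stands in for $this->mSearchterms.
$terms = ['20', 'EC'];

// Built the same way as in the comment above, minus the (.*) wrappers.
$loose = '/(\b' . implode('|', $terms) . '\b)/i';   // /(\b20|EC\b)/i

// Each \b guards only one side of one alternative, so partial words slip through.
$looseHits = [
    '2004'    => preg_match($loose, '2004'),     // 1: "20" at the start of a word
    'spec'    => preg_match($loose, 'spec'),     // 1: "ec" at the end of a word
    '-20 EC.' => preg_match($loose, '-20 EC.'),  // 1: the intended whole-word hit
];

var_dump($looseHits);
```

Both "2004" and "spec" count as hits here even though neither contains "20" or "EC" as a separate word.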
Comment 11 Morbus Iff 2004-09-03 20:28:26 UTC
Sorry - the correct revision is:

 $pat1 = "/(.*)(\b" . implode( "\b|\b", $this->mSearchterms ) . "\b)(.*)/i";

which creates a pattern like /(.*)(\b20\b|\bEC\b)(.*)/i.
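
For reference, the same idea can be written with a single group so that \b applies to every alternative at once, and with preg_quote() so that search terms containing regex metacharacters are taken literally. This is a minimal standalone sketch, not the code committed to MediaWiki; the function name highlightWholeWords() and the <b> marker are assumptions made for the example.

```php
<?php
/**
 * Wrap whole-word occurrences of the search terms in <b>...</b>.
 * Hypothetical helper for illustration; not MediaWiki's showHit().
 */
function highlightWholeWords(array $terms, string $line): string
{
    // Escape regex metacharacters (and the "/" delimiter) in each term.
    $quoted = array_map(function ($term) {
        return preg_quote($term, '/');
    }, $terms);

    // One group around the alternation: \b(20|EC)\b rather than \b20\b|\bEC\b.
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace($pattern, '<b>$1</b>', $line);
}

echo highlightWholeWords(['20', 'EC'], 'No suspect was seen in -20 EC.'), "\n";
// prints: No suspect was seen in -<b>20</b> <b>EC</b>.
```

The "ec" inside "suspect" is left alone because there is no word boundary on either side of it.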

Comment 12 Brion Vibber 2007-10-01 13:11:51 UTC
Fixed in r26269 for mainline, r26271 for lucenesearch extension.
