Last modified: 2013-03-25 14:56:31 UTC
The current code in 1.6.3's SearchMySQL4.php is really close to allowing full Boolean logic in searches. These are things like: +ninja -pirate (to match all articles with "ninja", but not "pirate") "ninja pirate" (to match all articles with "ninja" followed immediately by "pirate") ninj* (to match ninja, ninjutsu, ninjato) However, the "special" characters +, -, *, etc. are stripped by SearchEngine's method legalSearchChars. In addition, SearchMySQL4.php currently always adds '+' before each word to be searched if no Boolean operators are present. The attached patch (against MediaWiki 1.6.3) adds two config options to change this behavior: 1) $wgSearchMysqlStrictMatch = true; This keeps the current default behavior to prepend '+' to all searches that don't otherwise specify a Boolean logic operator if set to 'true', but allows admins to set it to 'false' to default searches to Boolean 'OR' (where pages matching multiple words are ranked higher) 2) $wgSearchMysqlBooleanLogic = false; This keeps the current default behavior to strip MySQL Boolean logic characters (-, +, <, >, ~, *) if set to false. If set to true, it changes SearchMySQL4.php to override legalSearchChars() to include the Boolean logic chars. I'll attach the patch to this bug as soon as it's submitted.
Created attachment 1606 [details] Patch to enable MySQL fulltext Boolean logic search characters This patch should work against 1.6.3 or svn trunk, although I haven't tested the latter.
A config option here seems totally unnecessary; rather it should simply accept overrides as given, as the code is intended to do (but may not be doing at the moment, as it's buggy). Consider also bug 4021.
I was mainly concerned for performance implications if I enabled the overrides and/or Boolean logic by default; since they're off now, we may be hiding a potential performance problem. Since it's my first MediaWiki patch, I wanted to ensure no impact on performance without a config change -- I know search performance is critical for Wikipedia. Then again, Wikipedia uses Lucene search, not MySQL, right? As we're requiring MySQL 4.x or higher now, we can depend on the database handling the heavy lifting for the Boolean logic. Does that mean I should feel free to enable this by default? If so, I'm happy to take out the config options.
1) User-specified boolean operations are/should be allowed. 2) AND must be the default logical mode where not specified. A default of OR produces useless results most of the time and is contrary to user expectations.
OK, good comments. For 2), if we make AND the default, how can we specify "OR" mode? In MySQL, it's unfortunately the *absence* of a Boolean operator that specifies an "or" search. If we default to prepending a + if no other operator is present, we effectively disable "OR" mode search. I can think of two alternatives: 1) Use OR mode search by default, but use a second MATCH ... AGAINST to provide Lucene-style relevancy ranking, so that MySQL ranks rows with both hits highest. This prevents the 'useless' rows from showing up too high in the match. 2) Provide an alternate operator (pipe perhaps?) that tells us to strip it, but not prepend '+' in this case.
Is this still an issue with the new search?
Currently I know we support phrases ("apple pie"), wildcard suffixes (app*), and exclusions (apple -pie). I'm not sure about the priority bumping (>apple <pie), negation (apple ~pie), or parenthetical grouping. Handling logical ORs is tricky currently since OR is indicated on the backend syntax by the _lack_ of a +, which we have to add to make normal behavior make sense. What I would recommend is a nice parsing layer so that a query like this: apple OR cherry pie will translate to the MySQL backend syntax: +(apple cherry) +pie This'll allow us to do nice boolean queries like with Lucene or Google without hardcoding the non-intuitive MySQL syntax where we have to explicitly say that things are required. See full docs at http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
Comment on attachment 1606 [details] Patch to enable MySQL fulltext Boolean logic search characters Thanks for submitting a patch to MediaWiki. Unfortunately I'm going to have to mark the patch as obsolete because it no longer merges into our current code.
Should be trivial to rebase this against HEAD.
(In reply to comment #9) > Should be trivial to rebase this against HEAD. SearchMySQL4 doesn't exist in HEAD anymore.
(In reply to comment #10) > (In reply to comment #9) > > Should be trivial to rebase this against HEAD. > > SearchMySQL4 doesn't exist in HEAD anymore. Yes, but I figured the same patch could rebase against SearchMySQL.