Last modified: 2014-09-23 19:36:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T23314, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 21314 - Proposed query enhancement: regexps and Type:Text search
Proposed query enhancement: regexps and Type:Text search
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Semantic MediaWiki (Other open bugs)
unspecified
All All
: Normal enhancement (vote)
: ---
Assigned To: Markus Krötzsch
: patch, patch-need-review
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-10-27 19:55 UTC by Harold Solbrig
Modified: 2014-09-23 19:36 UTC (History)
3 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Patch for proposed query search enhancement (6.79 KB, patch)
2009-10-27 19:55 UTC, Harold Solbrig
Details

Description Harold Solbrig 2009-10-27 19:55:37 UTC
Created attachment 6727 [details]
Patch for proposed query search enhancement

The attached patch is one that we have been using locally.  It implements the following
1)      It makes _text fields queryable (which they aren’t at the moment).
2)      It changes wild card searches to case insensitive.  (If this has backwards compatibility issues, we can get more clever, but we need case insensitive regular expressions)
3)      It adds the ability to do regular expressions (mySQL REGEXP) in the wild card.

To do wild card, you first have to enable wild card searches ($smwgQComparators = ‘<|>|!|~’;)
Then, if the match pattern starts with ‘/’ and ends with either ‘/’ or ‘/i’ (e.g. [[text::~/abc*defg*/i]] then the REGEXP comparison is used in the SQL call. The "i" on the end indicates case insensitive search.

The regex thing thing is only so neat at the moment, as many of the regular expression characters get caught elsewhere in the parser, so we will want to do a bit more work – either we need to identify replacement characters for the standard RE characters of we need to borrow from the DPL2 approach and use special characters for "|", "[", and "{".

There is one issue that I need some help with – the only way that I’ve found to do a case insensitive search is to add “COLLATE utf8_general_ci “ immediately following the LIKE or REGEXP phrase.  The problem, however, is that this has to be “ COLLATE latin1_general_ci “ if the database was built as latin character set or UTF8 binary(?) and utf8 otherwise.  Lines 373 and 563 have these phrases.
Comment 1 Markus Krötzsch 2009-11-03 16:15:28 UTC
Both regular expressions and pattern matching on text blobs is generally assumed to lead to decreased performance. Did you do any checks regarding this? Does your extension allow for arbitrarily complex regular expressions (besides the syntactic constraints that you mentioned)?
Comment 2 Harold Solbrig 2009-11-03 17:11:07 UTC
(In reply to comment #1)

>> Both regular expressions and pattern matching on text blobs is generally
assumed to lead to decreased performance. Did you do any checks regarding this?

No - we needed this capability and, in our circumstances, it performed quite well. Note that this is only an issue, however, *if* you search text blobs - there is (almost) no change to the behavior if you don't.  All we've done is given the user an option that they didn't previously have. I state "almost" because the fix, as submitted, does a case insensitive search on both "traditional" wild cards and regular expressions.  This decision could be reversed, however, if there is any concern

>> Does your extension allow for arbitrarily complex regular expressions (besides the syntactic constraints that you mentioned)? 

The extension passes everything between the slashes to the MySQL REGEX operator (http://dev.mysql.com/doc/refman/5.1/en/regexp.html), so the capability is whatever is allowed here.  This is obviously less than ideal because of the database dependency, but, at the moment it seemed like the only viable option.

Of some concern is the fact that this may well be a security issue, as it is probably possible to inject SQL into the match text.  Note, however, that this situation existed prior to our patch - the wild card search had exactly the same issues.  If this is of concern, we would be happy to do what is necessary to lock both searches down.
Comment 3 p858snake 2011-04-30 00:09:07 UTC
*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*
Comment 4 Sumana Harihareswara 2011-12-23 18:38:49 UTC
Marking "need-review" basically because it sounds like there's an unfinished code review conversation here.  If I'm wrong, please feel free to replace with "reviewed."
Comment 5 Sal Quintanilla 2012-06-15 15:46:12 UTC
Late comment:  For general deployment, case insensitivity needs its own request indicator.  A case sensitive ~*Mark* is a better wildcard search for people named Mark than the case insensitive version.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links