The term search index currently uses on-the-fly conversion to UTF-8 (and then to lower case) to perform comparisons. That means a full table scan followed by a file sort on a table that is likely to contain several dozen million rows. That's likely to kill the DB server. To avoid this, there should be a dedicated search key column holding the normalized key (similar to the way a search key column is used for category sorting and for finding external links). The same normalization shall apply to the index term when it is inserted and to the search term when generating the query. In particular, the following normalization shall apply (a sketch follows below):

* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace characters)
* lower case (ideally, using the implementation from the appropriate Language class)
* optionally, apply a configurable regular expression for stripping separators (e.g. per default stripping all internal whitespace and hyphens, so "foobar" will match "foo-bar" and "foo bar")

This will provide case-insensitive matches with some flexibility regarding whitespace, etc. If only exact matches are desired, the "soft" result could be filtered programmatically before returning it to the caller.
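For illustration only, here is a minimal sketch of the proposed normalization in Python (the actual Wikibase code is PHP; the function name and separator pattern below are placeholders, not the real implementation):

import re
import unicodedata

# Configurable separator pattern, per the proposal (hypothetical default).
SEPARATOR_PATTERN = re.compile(r"[\s\-]+")

def normalize_term(term, strip_separators=True):
    """Build the search key for a stored term or an incoming search term."""
    # 1. Unicode normalization (NFC)
    key = unicodedata.normalize("NFC", term)
    # 2. Trim leading and trailing whitespace (str.strip() covers Unicode whitespace)
    key = key.strip()
    # 3. Lower case (ideally language-aware; plain str.lower() here)
    key = key.lower()
    # 4. Optionally strip internal separators so "foo-bar" matches "foo bar"
    if strip_separators:
        key = SEPARATOR_PATTERN.sub("", key)
    return key

# Both sides use the same key, so an indexed equality lookup suffices:
assert normalize_term("Foo-Bar ") == normalize_term("foo bar")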
Marking as critical because this poses a serious risk to database cluster performance.
Names are a problem because they often contain letters from other languages. It is common to strip accents and diacritics and use the base form. In normalized form it should be simple to write a function that does this for all languages. This must also be done for the search term, and it implies that the search will use several terms.
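As a rough illustration of the accent-stripping idea (not the actual implementation; a real solution would likely need language-specific exceptions), one could decompose to NFD, drop combining marks, and recompose:

import unicodedata

def strip_accents(term):
    """Reduce accented letters to their base form (hypothetical sketch)."""
    decomposed = unicodedata.normalize("NFD", term)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)

# The search would then use several keys per term, e.g. the normalized
# form plus the accent-stripped form:
print(strip_accents("Zürich"))    # -> "Zurich"
print(strip_accents("Ångström"))  # -> "Angstrom"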
https://gerrit.wikimedia.org/r/#/c/30984/
The patch in Gerrit has the status "Merged". Is there something left, or can this be closed as RESOLVED FIXED?
Verified in Wikidata demo time for sprint 21