Last modified: 2014-02-12 23:37:59 UTC
When searching with the apostrophe character U+0027 "apostrophe/single quote", which is available on most keyboards, results should also match other Unicode apostrophe-like characters, such as the preferred apostrophe U+2019.

In 2009 there was a discussion titled "Different apostrophe signs and MediaWiki internal search"; see http://www.gossamer-threads.com/lists/wiki/wikitech/169177
This doesn't seem to have been implemented. This is related to bug 36313 for autocompletion.

Basically, indexing should convert all apostrophes to U+0027, and searching should convert all apostrophes to U+0027. Articles containing U+2019, for example, would then be matched when searching with U+0027, U+2019 or any other apostrophe.

From the 2009 discussion, the list of apostrophes was:

U+0027 APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+2032 PRIME
U+00B4 ACUTE ACCENT
U+0060 GRAVE ACCENT
U+FF40 FULLWIDTH GRAVE ACCENT
U+FF07 FULLWIDTH APOSTROPHE

I would add other characters for which U+0027 is often used as an accessible substitute, such as some modifier letters and the saltillo:

U+02B9 MODIFIER LETTER PRIME
U+02BB MODIFIER LETTER TURNED COMMA
U+02BC MODIFIER LETTER APOSTROPHE
U+02BD MODIFIER LETTER REVERSED COMMA
U+02BE MODIFIER LETTER RIGHT HALF RING
U+02BF MODIFIER LETTER LEFT HALF RING
U+0384 GREEK TONOS
U+1FBF GREEK PSILI
U+A78B LATIN CAPITAL LETTER SALTILLO
U+A78C LATIN SMALL LETTER SALTILLO

WebKit-based browsers already do this kind of stripping and merge U+0027, U+2018, U+2019 and U+FF07. However, there are many cases where merging all the proposed characters would help regular keyboard input.

The proposed solution in 2009 was to use a strip function:

function stripForSearch( $string ) {
    // Replace U+2019 (UTF-8 bytes E2 80 99) with U+0027 before the parent's stripping.
    $s = preg_replace( '/\xe2\x80\x99/', '\'', $string );
    return parent::stripForSearch( $s );
}
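The proposal above (fold every apostrophe-like character to U+0027 both when indexing and when searching) can be sketched as follows. This is a minimal Python illustration, not MediaWiki code: the names `APOSTROPHE_LIKE` and `fold_apostrophes` are hypothetical, and the character set is simply the two lists given in this report.

```python
# Hypothetical sketch: fold apostrophe-like characters to U+0027.
# The character set is the combined list from this report (assumption:
# all of them should be treated as equivalent for search purposes).
APOSTROPHE_LIKE = (
    "\u2018\u2019\u201b\u2032\u00b4\u0060\uff40\uff07"  # list from the 2009 discussion
    "\u02b9\u02bb\u02bc\u02bd\u02be\u02bf"              # modifier letters
    "\u0384\u1fbf\ua78b\ua78c"                          # Greek tonos/psili, saltillo
)

# Translation table mapping each listed character to a plain apostrophe.
_FOLD_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def fold_apostrophes(text: str) -> str:
    """Return text with every listed apostrophe-like character replaced by U+0027."""
    return text.translate(_FOLD_TABLE)


# The same function is applied on both sides, so the folded forms compare equal.
indexed_form = fold_apostrophes("prince d\u2019Ithaque")  # text as stored in an article
query_form = fold_apostrophes("prince d'Ithaque")         # text as typed on a keyboard
```

Because the fold is applied identically to indexed text and to queries, a query typed with any of the listed characters reaches the same indexed form. Note that this simple fold also maps both saltillo letters to the same character, which discards their case distinction.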
At the moment, a good example of the problem is the following pair of searches on fr.w:
https://fr.wikipedia.org/w/index.php?title=Spécial%3ARecherche&profile=default&search=%22prince+d%27Ithaque%22&fulltext=Search&searchengineselect=mediawiki
https://fr.wikipedia.org/w/index.php?title=Spécial%3ARecherche&profile=default&search=%22prince+d%27Ithaque%22&fulltext=Search&searchengineselect=mediawiki
Oops, the second search is meant to be:
https://fr.wikipedia.org/w/index.php?title=Spécial%3ARecherche&profile=default&search=%22prince+d’Ithaque%22&fulltext=Search&searchengineselect=mediawiki

Another example is searching for "O'" on fr.w:
https://fr.wikipedia.org/w/index.php?search=o%27&title=Spécial%3ARecherche&fulltext=1

The article "O'" (which is a redirect to "O") https://fr.wikipedia.org/w/index.php?title=O%27&redirect=no is found as an exact match, but "O’" https://fr.wikipedia.org/wiki/O’ and "Oʻ" https://fr.wikipedia.org/wiki/Oʻ are not on the first page of the search results.
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
*** Bug 47881 has been marked as a duplicate of this bug. ***
Widening scope a tiny bit. If we're going to do this, it should be done all at once. AntiSpoof is roughly the kind of approach I have in mind here. Repurposing into a Cirrus bug, as lsearchd has been end-of-lifed and won't be fixed further.
Chad, were you thinking this should be done in Cirrus for all languages by pushing analysis configuration to Elasticsearch? Something along those lines would be pretty flexible, allowing us, for example, to boost perfect matches of the typed Unicode characters above the squashed ones. I'm not saying that is a good idea, just something that is possible.
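As a rough illustration of what "pushing analysis configuration to Elasticsearch" could mean here, the sketch below builds index settings that use Elasticsearch's `mapping` character filter to fold the characters to U+0027 before tokenizing. It is an assumption-laden example, not the CirrusSearch configuration: the names `apostrophe_fold`, `text_folded` and `build_analysis_settings` are hypothetical, and the character list is the one from the original report.

```python
import json

# Characters to fold, taken from the list in the original report.
APOSTROPHE_LIKE = (
    "\u2018\u2019\u201b\u2032\u00b4\u0060\uff40\uff07"
    "\u02b9\u02bb\u02bc\u02bd\u02be\u02bf\u0384\u1fbf\ua78b\ua78c"
)


def build_analysis_settings() -> dict:
    """Build a hypothetical Elasticsearch 'analysis' settings body.

    The 'mapping' char filter rewrites each listed character to U+0027
    before the tokenizer sees the text.
    """
    mappings = [f"{ch} => '" for ch in APOSTROPHE_LIKE]
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical filter name.
                "apostrophe_fold": {"type": "mapping", "mappings": mappings}
            },
            "analyzer": {
                # Hypothetical analyzer name; tokenizer and filter are illustrative choices.
                "text_folded": {
                    "type": "custom",
                    "char_filter": ["apostrophe_fold"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


settings = build_analysis_settings()
settings_json = json.dumps({"settings": settings}, indent=2)  # body for index creation
```

The boosting idea mentioned above would presumably need a second field analyzed without the character filter, so that a query could weight exact-character matches above folded ones; that part is not shown here.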
(In reply to comment #6)
> Chad,
>
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch? Something along those lines would be
> pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

Yeah that was pretty much my thinking.

> I'm not saying that is a good idea, just something that is possible.

I think it's a good idea, eventually. I set priority so low on purpose :)
Added see also bug. I think we should do this when we pull the unicode plugin in to Elasticsearch.
Looks like apostrophes came up on The Daily WTF: <http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx> (specifically <http://img.thedailywtf.com/images/14/q1/e95/Pic-5.jpg>).

(In reply to comment #6)
> Were you thinking this should be done in Cirrus for all languages by pushing
> analysis configuration to Elasticsearch? Something along those lines would be
> pretty flexible, allowing, for example, us to boost perfect matches of the
> typed unicode characters above the squashed ones.

We already do some input normalization at some level of the stack (for example, multiple underscores get squashed and input such as "AbrAhAm LincoLn" works if there's a redirect at "Abraham lincoln").

It's difficult to look at the provided screenshot and not think that the software has failed our readers. Unless you think these should be MediaWiki page redirects (#REDIRECT)? I think we should do better normalization for search inputs.

Any rough idea how big of a project this would be to implement?
(In reply to comment #9)
> We already do some input normalization at some level of the stack (for
> example, multiple underscores get squashed and input such as "AbrAhAm LincoLn"
> works if there's a redirect at "Abraham lincoln").

To be more explicit on these points:
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=AbrAhAm+LincoLn
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=_____AbrAhAm_____LincoLn_____

We may be able to implement apostrophe normalization at the same level.
I'll have a look at this when I can. For now I'll leave the component set to CirrusSearch. It looks like PHP implements the same normalization components that I can use in Elasticsearch (http://php.net/manual/en/class.normalizer.php), so I'll have to evaluate doing that normalization there as well. I imagine that if we do it in PHP it'll have to be optional, because the normalizer requires PHP 5 >= 5.3.0 and PECL intl >= 1.0.0.
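One point worth checking when evaluating a Unicode normalizer for this bug: standard normalization forms do not fold every character on the list. The Python sketch below uses `unicodedata` as a stand-in (the corresponding form in PHP's Normalizer class would be `Normalizer::FORM_KC`). It shows that NFKC folds U+FF07 FULLWIDTH APOSTROPHE to U+0027 but leaves U+2019 unchanged, so an explicit mapping would still be needed. The helper names are hypothetical.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def normalize_for_search(text: str) -> str:
    """Hypothetical helper: NFKC first, then an explicit fold of U+2019.

    Only U+2019 is folded here to keep the sketch short; the other
    characters from the report would need the same explicit treatment.
    """
    return nfkc(text).replace("\u2019", "'")


fullwidth_folded = nfkc("O\uff07") == "O'"                          # NFKC handles U+FF07
curly_unchanged = nfkc("d\u2019Ithaque") == "d\u2019Ithaque"        # NFKC leaves U+2019 alone
combined = normalize_for_search("d\u2019Ithaque")                   # explicit fold applied
```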
In case anyone comes to this from http://thedailywtf.com/Articles/Lightspeed-is-Too-Slow-for-MY-Luggage.aspx#Pic-5, they should have a look at Bug 59666 which should plug that particular embarrassing hole.