Last modified: 2013-10-03 00:02:15 UTC
CirrusSearch seems to stem the word "used" to "us" sometimes!

<elasticsearch>/nikwiki_general/_analyze?analyzer=text&text=used

returns

{
  "tokens": [
    {
      "token": "us",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
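For anyone reproducing this, here is a minimal Python sketch of the same check. It only builds the `_analyze` URL and pulls the tokens out of a response body; it does not contact a server. The helper names and the `http://localhost:9200` base URL are assumptions for illustration, and the sample response is the one reported above.

```python
import json
from urllib.parse import urlencode


def build_analyze_url(base_url, index, analyzer, text):
    """Build an _analyze URL in the form used in the report above."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url.rstrip('/')}/{index}/_analyze?{query}"


def extract_tokens(response_body):
    """Return the token strings from an _analyze JSON response."""
    return [entry["token"] for entry in json.loads(response_body)["tokens"]]


# Response body copied from the report; "used" comes back as "us".
SAMPLE_RESPONSE = (
    '{"tokens": [{"token": "us", "start_offset": 0, "end_offset": 4, '
    '"type": "<ALPHANUM>", "position": 1}]}'
)

if __name__ == "__main__":
    # The base URL is a placeholder; substitute your own Elasticsearch host.
    print(build_analyze_url("http://localhost:9200", "nikwiki_general", "text", "used"))
    print(extract_tokens(SAMPLE_RESPONSE))
```

Fetching the URL with any HTTP client and passing the body to `extract_tokens` should show whether the analyzer still over-stems the word.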
I might be able to fix this by switching stemmers. I'll do some more research tomorrow.
Change 86854 had a related patch set uploaded by Manybubbles: Tests for places where kstem beats porter stemmer. https://gerrit.wikimedia.org/r/86854
Switching stemmers. Implementation: https://gerrit.wikimedia.org/r/#/c/86853/ Regression tests: https://gerrit.wikimedia.org/r/#/c/86854/
Merged.
"The kstem token filter is a high performance filter for english" http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter/ So I don't need to test what the effects are of this change for other languages?
Change 86854 merged by jenkins-bot: Tests for places where kstem beats porter stemmer. https://gerrit.wikimedia.org/r/86854
Right, this only affects English.

Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English. I believe we have five options, in order of how much they increase recall and decrease precision:
1. No stemming
2. Minimal (just possessives)
3. KStem
4. Porter Stemmer
5. Porter Stemmer via Snowball

A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions. In all cases other than English at this point we use the Elasticsearch default, which is the more aggressive version.

Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.
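To make "switching to a customized version" concrete, here is a rough Python sketch of index analysis settings for a custom analyzer whose stemming filter can be swapped. This is not the actual CirrusSearch configuration. The analyzer name `text` is taken from the `_analyze` call in the report, `kstem` and `porter_stem` are Elasticsearch's built-in token filter names, and the function name and the rest of the filter chain are assumptions.

```python
import json


def english_analysis_settings(stem_filter="kstem"):
    """Return index settings for a custom analyzer with a swappable stemmer.

    Pass stem_filter=None for no stemming. The chain (standard tokenizer,
    lowercase filter, then the stemmer) is an illustrative assumption.
    """
    filters = ["lowercase"]
    if stem_filter is not None:
        filters.append(stem_filter)
    return {
        "analysis": {
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": filters,
                }
            }
        }
    }


if __name__ == "__main__":
    # Compare the less aggressive KStem with the Porter stemmer.
    print(json.dumps(english_analysis_settings("kstem"), indent=2))
    print(json.dumps(english_analysis_settings("porter_stem"), indent=2))
```

The resulting dictionary would go in the `settings` part of an index creation request; a language with a "light" stemmer could be handled the same way by naming a different filter.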
Sorry for going off-topic with my stupid questions; mainly I'd like to make a list of possible weaknesses, e.g. for Italian analysis, so that users can specifically test them a bit.

(In reply to comment #7)
> Right, this only affects English.
>
> Unfortunately (or fortunately for a small set of use cases) there aren't as
> many different options for languages other than English. I believe we have
> five options, in order of how much they increase recall and decrease
> precision:
> 1. No stemming
> 2. Minimal (just possessives)
> 3. KStem
> 4. Porter Stemmer
> 5. Porter Stemmer via Snowball
>
> A few other languages have "minimal" (or "light") stemmers in addition to
> their more aggressive versions. In all cases other than English at this
> point we use the Elasticsearch default, which is the more aggressive version.

Our default is standard, i.e. http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer/ , or the language default for those which have one ( http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/ ), so the stopwords we're using are those linked from http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer/ ?

> Switching from the Elasticsearch default to a customized version isn't hard
> and we're totally willing to do it.

Good! I guess you'll need help from native speakers and that they'll need some pointers from the docs on how to help.

30 languages < 285, so maybe, when you start expanding to many languages, cutoff_frequency can be used as a starting point to replace stopword lists where one is not available, as mentioned in https://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ ? That would be a possible enhancement to file separately.
Yeah, it is probably worth opening a new bug with specific things, but you are right about help from native speakers.

As far as stopwords go, there is a thing in Elasticsearch called a common_terms query that can be used to kind of simulate having stopwords. In some respects it is better than having stopwords, so folks can turn them off and use it instead. But getting it working with the query syntax that we use now is going to be rough.

Additionally, we probably want to turn CirrusSearch on even for languages that aren't among those 30, mostly because we're likely to be better than lucene-search. Except in Esperanto.
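As a rough illustration of the common terms idea mentioned above, the Python sketch below builds a request body for Elasticsearch's `common` query type with a `cutoff_frequency`. The field name `text`, the 0.001 cutoff, and the function name are assumptions for the example, not values CirrusSearch uses, and it is worth checking that the Elasticsearch version in use supports this query type.

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a search body using the 'common' (common terms) query.

    Terms more frequent than cutoff_frequency are treated as low
    importance, which roughly simulates a stopword list without
    needing one for the language.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


if __name__ == "__main__":
    # "text" is a hypothetical field name used only for illustration.
    print(json.dumps(common_terms_query("text", "the history of the used car"), indent=2))
```

This only covers the raw query body; wiring it into the existing query syntax is the hard part described above and is not attempted here.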
Verified on test2wiki.