Last modified: 2013-10-03 00:02:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T56022, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 54022 - CirrusSearch seems to stem the word "used" to "us"!
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Reported: 2013-09-11 16:55 UTC by Nik Everett
Modified: 2013-10-03 00:02 UTC

Description Nik Everett 2013-09-11 16:55:59 UTC
CirrusSearch seems to stem the word "used" to "us" sometimes!

<elasticsearch>/nikwiki_general/_analyze?analyzer=text&text=used returns
{
  "tokens": [
    {
      "token": "us",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
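
A minimal sketch of how the check above could be scripted, assuming the `<elasticsearch>` placeholder stands for a base URL such as `http://localhost:9200` (an assumption, not taken from the report). The helper names `analyze_url` and `tokens_from_response` are hypothetical; only the URL shape and the response layout come from the output above.

```python
import json
from urllib.parse import urlencode


def analyze_url(base, index, analyzer, text):
    """Build an _analyze URL in the same shape as the request above."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base}/{index}/_analyze?{query}"


def tokens_from_response(body):
    """Return just the token strings from an _analyze JSON response."""
    return [t["token"] for t in json.loads(body)["tokens"]]


# Response copied from the report above, so no live cluster is needed.
SAMPLE = (
    '{"tokens": [{"token": "us", "start_offset": 0, "end_offset": 4, '
    '"type": "<ALPHANUM>", "position": 1}]}'
)

# The base URL is an assumed local node, used only for illustration.
url = analyze_url("http://localhost:9200", "nikwiki_general", "text", "used")
tokens = tokens_from_response(SAMPLE)
print(url)
print(tokens)
```

Comparing the returned tokens against the input word makes it easy to spot over-stemming such as "used" becoming "us".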
Comment 1 Nik Everett 2013-10-01 01:59:51 UTC
I might be able to fix this by switching stemmers.  I'll do some more research tomorrow.
Comment 2 Gerrit Notification Bot 2013-10-01 13:57:20 UTC
Change 86854 had a related patch set uploaded by Manybubbles:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854
Comment 3 Nik Everett 2013-10-01 13:58:46 UTC
Switching stemmers.

Implementation: https://gerrit.wikimedia.org/r/#/c/86853/
Regression tests: https://gerrit.wikimedia.org/r/#/c/86854/
Comment 4 Nik Everett 2013-10-01 14:48:07 UTC
Merged.
Comment 5 Nemo 2013-10-01 14:57:03 UTC
"The kstem token filter is a high performance filter for english"
http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter/

So I don't need to test what the effects are of this change for other languages?
Comment 6 Gerrit Notification Bot 2013-10-01 15:06:23 UTC
Change 86854 merged by jenkins-bot:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854
Comment 7 Nik Everett 2013-10-01 15:10:56 UTC
Right, this only affects English.

Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English.  For English I believe we have five options, in order of how much they increase recall and decrease precision:
1.  No stemming
2.  Minimal (just possessives)
3.  KStem
4.  Porter Stemmer
5.  Porter Stemmer via Snowball

A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions.  In all cases other than English at this point we use the Elasticsearch default which is the more aggressive version.

Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.
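
As an illustration of what a customized version might look like, here is a sketch that builds index analysis settings with a swappable stemming filter. The token filter types `kstem`, `porter_stem`, `snowball`, and `stemmer` (with `possessive_english`) are Elasticsearch filters as I understand them, and the analyzer name `text` matches the one in the report. Everything else (the `english_stem` filter name, the standard tokenizer plus lowercase chain, the helper name) is an assumption for illustration and not CirrusSearch's actual configuration.

```python
import json

# English stemming choices, least to most aggressive, mapped to
# Elasticsearch token filter definitions. None means no stemming filter.
STEMMERS = {
    "none": None,
    "minimal": {"type": "stemmer", "language": "possessive_english"},
    "kstem": {"type": "kstem"},
    "porter": {"type": "porter_stem"},
    "snowball": {"type": "snowball", "language": "English"},
}


def analysis_settings(stemmer):
    """Build settings for a 'text' analyzer using the chosen stemmer."""
    filters = ["lowercase"]
    definitions = {}
    if STEMMERS[stemmer] is not None:
        # 'english_stem' is a hypothetical name for the custom filter.
        definitions["english_stem"] = STEMMERS[stemmer]
        filters.append("english_stem")
    return {
        "analysis": {
            "filter": definitions,
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": filters,
                }
            },
        }
    }


settings = analysis_settings("kstem")
print(json.dumps(settings, indent=2))
```

Keeping the stemmer as a single named filter means a per-language override only has to change one entry.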
Comment 8 Nemo 2013-10-01 17:28:58 UTC
Sorry for going off-topic with my stupid questions; mainly I'd like to make a list of possible weaknesses, e.g. for Italian analysis, so that users can specifically test them a bit.

(In reply to comment #7)
> Right, this only affects English.
> 
> Unfortunately (or fortunately for a small set of use cases) there aren't as
> many different options for languages other than English.  For English I
> believe we have five options, in order of how much they increase recall and
> decrease precision:
> 1.  No stemming
> 2.  Minimal (just possessives)
> 3.  KStem
> 4.  Porter Stemmer
> 5.  Porter Stemmer via Snowball
> 
> A few other languages have "minimal" (or "light") stemmers in addition to
> their more aggressive versions.  In all cases other than English at this
> point we use the Elasticsearch default which is the more aggressive version.

Our default is standard, i.e. http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer/ , or the language default for those languages which have one ( http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/ ), so the stopwords we're using are those linked from http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer/ ?

> 
> Switching from the Elasticsearch default to a customized version isn't hard
> and we're totally willing to do it.

Good! I guess you'll need help from native speakers and that they'll need some pointers from the docs on how to help.
30 languages < 285, so maybe, when you start expanding to many languages, cutoff_frequency can be used as a starting point to replace stopword lists where one is not available, as mentioned in https://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ ? That would be a possible enhancement to file separately.
Comment 9 Nik Everett 2013-10-01 17:39:06 UTC
Yeah, it is probably worth opening a new bug with specific things, but you are right about help from native speakers.

As far as stopwords go, there is a thing in Elasticsearch called a common_terms query that can be used to kind of simulate having stopwords.  In some respects it is better than having stopwords, so folks can turn them off and use it instead.  But getting it working with the query syntax that we use now is going to be rough.
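
For reference, a rough sketch of what such a query body might look like. To my knowledge the query type is spelled `common` in the Elasticsearch query DSL of this era and takes a `cutoff_frequency`; the field name `text`, the helper name, and the default cutoff value here are assumptions for illustration, not anything CirrusSearch does today.

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a 'common' terms query body for one field.

    Terms more frequent than cutoff_frequency are treated as
    low-importance, roughly like stopwords, without a stopword list.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


# 'text' is a hypothetical field name.
body = common_terms_query("text", "the quick brown fox")
print(json.dumps(body, indent=2))
```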

Additionally, we probably want to turn CirrusSearch on even for languages that aren't in that 30, mostly because we're likely to be better than lucene-search.  Except in Esperanto.
Comment 10 Nik Everett 2013-10-03 00:02:15 UTC
Verified on test2wiki.
