Last modified: 2014-08-13 19:53:45 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T56875, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 54875 - Automatic stopwords for the 200+ languages without their own analyzer available
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Elasticsearch_1.1
Keywords: upstream
Depends on:
Blocks:
Reported: 2013-10-02 14:57 UTC by Nemo
Modified: 2014-08-13 19:53 UTC
CC List: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Nemo 2013-10-02 14:57:09 UTC
Split from bug 54022: for languages other than the 30 currently supported, rather than use the bare default analyzer we should probably use stopwords calculated automatically while we wait for custom ones to be made.
It seems the cutoff_frequency setting and the common_terms query may be used for this purpose.

I'd say that this is currently low priority but should probably be done before expanding elasticsearch beyond the ~30 supported languages.
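A minimal sketch of the kind of request body this refers to, assuming the Elasticsearch "common" (common terms) query with its cutoff_frequency and low_freq_operator parameters. The helper name common_terms_query, the field name "text" and the 0.001 cutoff are illustrative assumptions, not something CirrusSearch is known to use.

```python
import json


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a request body for an Elasticsearch common terms query.

    Terms whose document frequency exceeds cutoff_frequency are treated
    as "high frequency" (stopword-like) and only used for scoring, so no
    hand-made stopword list is needed for the language.
    The helper name and default cutoff are hypothetical.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                    # Require all low-frequency (informative) terms to match.
                    "low_freq_operator": "and",
                }
            }
        }
    }


if __name__ == "__main__":
    # "text" is an assumed field name for the page body.
    print(json.dumps(common_terms_query("text", "the history of the city"), indent=2))
```

The body would be sent to the index's _search endpoint; the cutoff can be a fraction of documents (as here) or an absolute count when set to 1 or more.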
Comment 1 Nik Everett 2013-10-02 14:58:45 UTC
I'm not sure this should be a hard requirement before expanding beyond the ~30 languages with built-in stop words.  I certainly agree we should do it though.
Comment 3 Nik Everett 2013-10-16 21:23:49 UTC
I believe that is what Nemo was referring to.  The problem (right now) is that we use query string queries rather than term queries.  For what we do, that makes a lot of sense.  Anyway, query string queries don't play nice with common terms queries yet.  They could possibly be made to, but I'm not sure about that yet.  It'd probably make more sense to make this change in Elasticsearch and for us to just flip the switch to turn it on.
Comment 4 Nemo 2013-10-17 08:29:25 UTC
I don't know anything about implementation details, but yes, from the small hints I have gathered, that would seem the most elegant way to handle it. However, from what I understand, it may also be viable to automatically generate "standard" stopword lists for each language.
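One way such a list could be generated is by document frequency: words that appear in a large share of a language's pages are treated as stopwords. The sketch below is only an illustration under that assumption; the function name automatic_stopwords, the thresholds and the naive regex tokenizer are all hypothetical, and a real wiki corpus would need proper per-language tokenization.

```python
import re
from collections import Counter


def automatic_stopwords(documents, min_doc_fraction=0.5, max_words=100):
    """Return words that occur in at least min_doc_fraction of documents.

    Results are ordered by descending document frequency, ties broken
    alphabetically, and capped at max_words entries.
    """
    total = len(documents)
    if total == 0:
        return []

    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"\w+", doc.lower())))

    candidates = [
        (word, count)
        for word, count in doc_freq.items()
        if count / total >= min_doc_fraction
    ]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:max_words]]


if __name__ == "__main__":
    sample = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "a bird saw the cat",
    ]
    print(automatic_stopwords(sample, min_doc_fraction=0.6))
```

The resulting list could then be supplied as the stopwords of a custom analyzer for that language, in place of the bare default analyzer.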


