Last modified: 2014-06-30 21:27:15 UTC
Created attachment 8054 [details] A patch for CJKFilter.java and its test. With the language=ja setting, CJKFilter tokenizes a CJK string incorrectly if the string starts with non-CJK characters. Example: the string "abC1C2C3", where C1, C2, and C3 are CJK characters, is tokenized into the token stream (abC1, C1C2, C2C3). It should be (ab, C1C2, C2C3). This behavior causes odd snippets in search results: the token stream (abC1, C1C2, C2C3) is recombined into the word "abC1C1C2C3".
Created attachment 8060 [details] A patch for CJKFilter.java and its test. The previous patch contained a bug: tokens without a CJK character were filtered incorrectly. This patch replaces it.
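The expected behavior described in the report can be illustrated with a minimal, self-contained sketch. This is not the attached patch and not the lucene-search CJKFilter; the class name CjkBigramSketch and its methods are hypothetical, the input is assumed to be a single whitespace-free token, and the CJK check is a rough approximation based on Unicode scripts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the lucene-search CJKFilter. */
final class CjkBigramSketch {

    /** Rough CJK test (assumption): Han, Hiragana, Katakana, or Hangul script. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    /** Splits text into non-CJK runs (kept whole) and CJK runs (emitted as bigrams). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            boolean cjk = isCjk(text.codePointAt(i));
            // Advance to the end of the current run of same-kind characters.
            while (i < text.length() && isCjk(text.codePointAt(i)) == cjk) {
                i += Character.charCount(text.codePointAt(i));
            }
            String run = text.substring(start, i);
            if (cjk) {
                addBigrams(run, tokens);
            } else {
                tokens.add(run);
            }
        }
        return tokens;
    }

    /** Emits overlapping two-character tokens; a lone CJK character is emitted as-is. */
    static void addBigrams(String run, List<String> tokens) {
        int[] codePoints = run.codePoints().toArray();
        if (codePoints.length == 1) {
            tokens.add(run);
            return;
        }
        for (int k = 0; k + 1 < codePoints.length; k++) {
            tokens.add(new String(codePoints, k, 2));
        }
    }

    public static void main(String[] args) {
        // "ab" followed by three Han characters, mirroring "abC1C2C3" from the report.
        System.out.println(tokenize("ab\u65e5\u672c\u8a9e"));
    }
}
```

With this sketch, a token shaped like "abC1C2C3" yields (ab, C1C2, C2C3): the non-CJK prefix stays a separate token instead of being merged with the first CJK character.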
Unassigning default assignments. http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/54734
Jun, I'm sorry that it is taking so long for a developer to review your patch! I have added the "need-review" keyword to indicate that your patch awaits review. Thank you for the patch.
Jun, I'm asking Oren Bochman to take a look at your patch. You might also be interested in working with him more generally to improve our Lucene search extension.
If you really want to work on this I think you can try to incorporate some existing project into the extension: http://stackoverflow.com/questions/5834371/is-there-any-good-open-source-or-freely-available-chinese-segmentation-algorithm
Hi, it is welcome news that my patch might be reviewed. I have used a patched lucene-search at my site for almost a year; however, I am not sure my patch is valid. The Lucene-Search extension is a fundamental tool at my site. I don't know whether there is anything I can do, but I will study the implementation of CJK support more closely.
Jun: Thanks again for the patch. Are you interested in using developer access to submit future MediaWiki and MediaWiki extension improvements directly to our Git source control system? https://www.mediawiki.org/wiki/Developer_access https://www.mediawiki.org/wiki/Git/Workflow#How_to_submit_a_patch
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
In the meantime, lucene-search in Wikimedia has reached its end of life and will not be improved further. Jun Mizuno: It would be awesome if you could check whether the problem still exists in the CirrusSearch extension that is being worked on (it is also a Lucene-based search for MediaWiki, backed by Elasticsearch instead of Wikimedia's home-grown lsearchd).
I don't see this issue in Cirrus/Elastic. Marking WONTFIX since lsearchd is end of life but adding cirrus-fixed tag.