Last modified: 2014-06-30 21:27:15 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T28997, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 26997 - CJKFilter wrongly tokenize a CJK and non-CJK mixed string.
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: patch
Depends on:
Blocks:
Reported: 2011-01-27 21:44 UTC by Jun Mizuno
Modified: 2014-06-30 21:27 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
a patch for CJKFilter.java and its test. (5.32 KB, patch)
2011-01-27 21:44 UTC, Jun Mizuno
a patch for CJKFilter.java and its test. (5.35 KB, patch)
2011-01-28 14:20 UTC, Jun Mizuno

Description Jun Mizuno 2011-01-27 21:44:39 UTC
Created attachment 8054
a patch for CJKFilter.java and its test.

With the language=ja setting,
CJKFilter tokenizes a CJK string incorrectly
if the string starts with non-CJK characters.

Example:
A string "abC1C2C3", where C1, C2 and C3 stand for CJK characters, is tokenized into
the token stream (abC1, C1C2, C2C3).
It should be (ab, C1C2, C2C3).

This behavior produces an odd snippet in search results:
the token stream (abC1, C1C2, C2C3) is combined into the word "abC1C1C2C3".
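
The expected tokenization can be sketched as follows. This is only a minimal illustration, not the attached patch or the actual CJKFilter code: the class name CjkBigramSketch, the methods isCjk and tokenize, and the set of scripts treated as CJK are all assumptions made for this example. It takes a single whitespace-free token, splits it into runs of non-CJK and CJK characters, keeps each non-CJK run whole, and emits overlapping bigrams for each CJK run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the real CJKFilter from lucene-search-2.
class CjkBigramSketch {

    // Assumption: Han, Hiragana, Katakana and Hangul count as CJK here.
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA
            || script == Character.UnicodeScript.HANGUL;
    }

    // Splits one whitespace-free token into non-CJK runs and CJK bigrams.
    static List<String> tokenize(String token) {
        List<String> tokens = new ArrayList<>();
        int[] cps = token.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            int start = i;
            boolean cjk = isCjk(cps[i]);
            // Advance to the end of the current run (all CJK or all non-CJK).
            while (i < cps.length && isCjk(cps[i]) == cjk) {
                i++;
            }
            int length = i - start;
            if (!cjk || length == 1) {
                // A non-CJK run, or a lone CJK character, is emitted as is.
                tokens.add(new String(cps, start, length));
            } else {
                // A CJK run of two or more characters becomes overlapping bigrams.
                for (int j = start; j + 1 < i; j++) {
                    tokens.add(new String(cps, j, 2));
                }
            }
        }
        return tokens;
    }
}
```

For the token "ab" followed by three Han characters C1C2C3, tokenize returns (ab, C1C2, C2C3), so no bigram straddles the boundary between the non-CJK and CJK parts.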
Comment 1 Jun Mizuno 2011-01-28 14:20:40 UTC
Created attachment 8060
a patch for CJKFilter.java and its test.

The previous patch contained an error:
tokens without any CJK character were filtered incorrectly.
I have replaced the patch.
Comment 2 Bugmeister Bot 2011-08-19 19:12:42 UTC
Unassigning default assignments. http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/54734
Comment 3 Sumana Harihareswara 2011-11-14 16:18:37 UTC
Jun, I'm sorry that it is taking so long for a developer to review your patch!  I have added the "need-review" keyword to indicate that your patch awaits review.  Thank you for the patch.
Comment 4 Sumana Harihareswara 2011-12-22 16:05:40 UTC
Jun, I'm asking Oren Bochman to take a look at your patch.  You might also be interested in working with him more generally to improve our Lucene search extension.
Comment 5 Liangent 2011-12-23 05:02:22 UTC
If you really want to work on this I think you can try to incorporate some existing project into the extension: http://stackoverflow.com/questions/5834371/is-there-any-good-open-source-or-freely-available-chinese-segmentation-algorithm
Comment 6 Jun Mizuno 2011-12-23 06:17:38 UTC
Hi,
It is welcome news that my patch might be reviewed.
I have been running a patched lucene-search on my site for almost a year; however,
I am not sure that my patch is valid.

The Lucene-Search extension is a fundamental tool on my site.
I don't know whether there is anything I can do, but
I will study the implementation of CJK support more closely.
Comment 7 Sumana Harihareswara 2012-05-25 03:20:17 UTC
Jun: Thanks again for the patch.  Are you interested in using developer access to submit any future MediaWiki and MediaWiki extension improvements directly to our Git source control system?

https://www.mediawiki.org/wiki/Developer_access

https://www.mediawiki.org/wiki/Git/Workflow#How_to_submit_a_patch
Comment 8 Andre Klapper 2013-03-26 11:19:40 UTC
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
Comment 9 Andre Klapper 2013-11-12 13:06:58 UTC
In the meantime, lucene-search in Wikimedia has reached its end of life and will not be improved further. 
Jun Mizuno: It would be awesome if you could check whether the problem still exists in the CirrusSearch extension that is being worked on (it is also a Lucene-based search for MediaWiki, backed by Elasticsearch instead of Wikimedia's home-grown lsearchd).
Comment 10 Chad H. 2014-06-30 21:27:15 UTC
I don't see this issue in Cirrus/Elastic. Marking WONTFIX since lsearchd is end of life but adding cirrus-fixed tag.
