Last modified: 2013-03-26 11:25:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T23002, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 21002 - Wrong processing of the apostrophe by the search engine in Ukrainian
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Robert Stojnic
Depends on:
Blocks:
Reported: 2009-10-05 13:40 UTC by Yevhen Shulha
Modified: 2013-03-26 11:25 UTC (History)
CC List: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Yevhen Shulha 2009-10-05 13:40:01 UTC
In Ukrainian, the apostrophe usually appears in the middle of a word to mark the specific pronunciation of certain sounds. The problem is that the apostrophe symbol («’», U+2019) is probably treated by the search engine as a quotation mark, so the word that contains it is split into two separate words. For example, the word «xxxxx’yyyyyy» will be recognized as two words, xxxxx and yyyyyy. This makes such words impossible to find, and makes it completely impossible to give articles titles that contain the apostrophe.

This bug was never reported before because the keyboard offers several other symbols that look like the apostrophe and are used in its place: «‘», «'», «`». According to the Ukrainian typographic standard, however, the only correct symbol is «’», U+2019.

This bug doesn’t show up if the ' mark is used instead of the U+2019 symbol, which is the temporary workaround widely used in the Ukrainian Wikipedia at the moment. But to keep the Ukrainian Wikipedia in line with the rules of the language, the U+2019 apostrophe should be processed correctly as well.
Comment 1 Brion Vibber 2009-10-05 16:11:57 UTC
Assigning to Robert and moving to lucene search component.
Comment 2 Yevhen Shulha 2009-10-06 08:38:33 UTC
A possible hint toward the solution is the fact that the apostrophe in Ukrainian is never used with a space before or after it, whereas a quotation mark does have a space next to it. This may help to distinguish them.
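The heuristic suggested in this comment can be sketched as follows. This is only an illustration, not the lucene-search code: the function names `is_word_internal` and `classify_marks` are hypothetical, and "no space before or after" is approximated here as "a letter directly on both sides".

```python
def is_word_internal(text, i):
    """Return True if the character at index i has a letter directly on both sides."""
    return 0 < i < len(text) - 1 and text[i - 1].isalpha() and text[i + 1].isalpha()


def classify_marks(text, mark="\u2019"):
    """Label every occurrence of `mark` (U+2019 by default) as 'apostrophe' or 'quote'.

    Assumption for this sketch: a mark enclosed by letters is a word-internal
    apostrophe; any other occurrence is treated as a closing quotation mark.
    """
    return [
        (i, "apostrophe" if is_word_internal(text, i) else "quote")
        for i, ch in enumerate(text)
        if ch == mark
    ]


# The first U+2019 sits inside the word, the second one closes the quotation.
sample = "\u2018п\u2019ять\u2019 років"
for position, kind in classify_marks(sample):
    print(position, kind)
```

A rule like this would misclassify a closing quote that is immediately followed by a letter, so it is a hint rather than a complete solution.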
Comment 3 Robert Stojnic 2009-10-06 09:40:51 UTC
This should be easy to do; it would need extra characters added as apostrophe chars and a reindex of uk.wiki. Do you want all 4 as possible apostrophes or only the "proper" one?
Comment 4 Yevhen Shulha 2009-10-06 10:54:03 UTC
I guess we'd better quickly discuss that on the Ukrainian wiki. I'll let you know by tomorrow. Thanks!
Comment 5 Yevhen Shulha 2009-10-07 16:17:41 UTC
We decided that we need at least two symbols: «'» and «’» (U+2019), as the former is already used in many articles, and we will probably need a transition period during which both symbols are used equally. It would also be great if these symbols were interchangeable from the search engine's point of view, so that the query «xx’yy» would find both the «xx'yy» and «xx’yy» words. Thank you!
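One way to get the interchangeability requested here is to map both symbols to a single form before tokenizing, at index time and at query time alike. The sketch below is a toy in-memory index written for illustration; it is not the change made in lucene-search, and `analyze`, `build_index` and `search` are hypothetical names.

```python
import re

# Assumption for this sketch: U+2019 is folded into the ASCII apostrophe,
# so both spellings produce the same token.
_FOLD = str.maketrans({"\u2019": "'"})

# A token is a run of word characters, optionally joined by apostrophes.
_TOKEN = re.compile(r"\w+(?:'\w+)*")


def analyze(text):
    """Fold apostrophe variants, lowercase, and split into tokens."""
    return _TOKEN.findall(text.translate(_FOLD).lower())


def build_index(docs):
    """Build a token -> set of document ids mapping from {doc_id: text}."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "Комп'ютер", 2: "Комп\u2019ютер", 3: "ютер"}
index = build_index(docs)
print(search(index, "комп\u2019ютер"))  # matches documents 1 and 2
print(search(index, "комп'ютер"))       # matches the same documents
```

Because the same `analyze` function runs on documents and on queries, it does not matter which of the two symbols either side uses; document 3 is not returned because the word is no longer split at the apostrophe.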
Comment 6 Robert Stojnic 2009-10-19 23:16:30 UTC
Fixed in r57932, needs index rebuild to go live (should be done in next couple of days). 
Comment 7 Yevhen Shulha 2009-10-20 06:08:12 UTC
Thank you very much! I'll test it once the database has been reindexed and will let you know whether everything works.
Comment 8 Yevhen Shulha 2009-10-21 07:26:06 UTC
Everything is fine, thank you!
Comment 9 Andre Klapper 2013-03-26 11:25:18 UTC
[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]
