Last modified: 2013-08-29 17:01:46 UTC

Bug 1836 - Strip combining diacritical marks and convert precomposed characters when searching
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody
Keywords: i18n
Duplicates: 5752
Depends on:
Blocks:
 
Reported: 2005-04-06 20:05 UTC by Helge Hielscher
Modified: 2013-08-29 17:01 UTC (History)

Description Helge Hielscher 2005-04-06 20:05:14 UTC
Search should not take combining diacritical marks into account.
http://en.wikipedia.org/wiki/Combining_diacritical_mark

For example, searching for Александр Сергеевич Пушкин should return pages that
contain Алекса́ндр Серге́евич Пу́шкин as well as pages that contain Александр
Сергеевич Пушкин. Conversely, searching for Алекса́ндр Серге́евич Пу́шкин should
also find pages that contain only the unaccented Александр Сергеевич Пушкин.
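
The requested behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not how MediaWiki search is implemented; the function names are made up for the example, and it drops every combining mark without regard to language.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop every code point whose canonical combining class is nonzero."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


def same_ignoring_marks(a: str, b: str) -> bool:
    """Compare two strings after removing combining marks from both."""
    return strip_combining_marks(a) == strip_combining_marks(b)


plain = "Александр Сергеевич Пушкин"
# U+0301 is the combining acute accent used to mark stress in Russian.
accented = "Алекса\u0301ндр Серге\u0301евич Пу\u0301шкин"

print(same_ignoring_marks(plain, accented))
```

Folding both the query and the page text this way makes the match symmetric: it does not matter which side carries the stress marks.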
Comment 1 a.lukyanov 2005-04-08 07:16:30 UTC
Yes, it would be very useful, for instance for Russian words (which are
optionally accented with the "combining acute accent"), and also for Arabic
and Hebrew words (where vowels are optionally indicated with marks over or
under the consonant letters).
Comment 2 Minh Nguyễn 2006-04-29 06:20:48 UTC
This is essential for Vietnamese, in which most words have accent marks. New
users expect the search system to strip the diacritical marks (and also
understand Đ/đ ↔ D/d), but when it doesn't, the user is led to believe that we
don't have the article they're looking for.
Comment 3 Minh Nguyễn 2006-04-29 06:47:50 UTC
Perhaps the search function should ignore diacritics in article titles when the
user has entered a query that contains no diacritics. If the user has entered
diacritics, the software should respect them. It would also be nice if there
were a MediaWiki message in which a list of diacritics could be customized per
wiki or locale, since different languages distinguish letters and diacritics
differently.
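
A rough sketch of that matching rule in Python, assuming that "contains diacritics" means "contains a combining mark after canonical decomposition". The function names are hypothetical, and the per-wiki customizable list of diacritics is not modelled here.

```python
import unicodedata


def has_diacritics(text: str) -> bool:
    """True if the decomposed text contains any combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def strip_diacritics(text: str) -> str:
    """Decompose, drop combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def title_matches(query: str, title: str) -> bool:
    """Ignore diacritics in the title only when the query has none."""
    if has_diacritics(query):
        # The user typed diacritics, so they must match exactly.
        return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", title)
    return strip_diacritics(title) == unicodedata.normalize("NFC", query)


print(title_matches("Viet Nam", "Việt Nam"))   # query without diacritics
print(title_matches("Việt Nam", "Viet Nam"))   # query with diacritics
```

The asymmetry is deliberate: a plain query is treated as "I could not or did not type the marks", while an accented query is treated as a precise request.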

I've filed a separate Bug 5752 for the issue I described in Comment 2, since
article titles at vi: use precomposed characters, which should nonetheless be
converted to the base ASCII characters when searching.
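
For precomposed Vietnamese titles, one possible approach is to decompose first and then strip the marks. Đ/đ has no canonical decomposition, so it needs an explicit alias. The sketch below is an assumption about how this could be done, with made-up names; it is not taken from any MediaWiki code.

```python
import unicodedata

# U+0110 / U+0111 (Đ / đ) do not decompose into D / d plus a mark,
# so they are mapped by hand. This table is illustrative, not exhaustive.
EXTRA_ALIASES = str.maketrans({"\u0110": "D", "\u0111": "d"})


def fold_to_base(text: str) -> str:
    """Decompose precomposed letters, drop combining marks, apply aliases."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_ALIASES)


print(fold_to_base("\u0110\u00e0 N\u1eb5ng"))  # the city name Đà Nẵng
```

Indexing `fold_to_base(title)` alongside the original title would let an unaccented query find the accented article.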
Comment 4 Brion Vibber 2006-04-29 21:26:52 UTC
*** Bug 5752 has been marked as a duplicate of this bug. ***
Comment 5 Brion Vibber 2006-04-30 00:23:53 UTC
*** Bug 5752 has been marked as a duplicate of this bug. ***
Comment 6 Minh Nguyễn 2006-04-30 00:32:49 UTC
Changing summary to include the issue discussed at Bug 5752, which Brion wants
to merge with this bug.
Comment 7 Helge Hielscher 2006-04-30 13:16:50 UTC
About precomposed characters: Mac OS X avoids the problem by using a special
kind of Unicode for filenames (UTF-8-MAC), where precomposed characters are
always converted to their decomposed variants.
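
To illustrate what decomposition does, here is a small Python sketch. It uses standard NFD as a stand-in for the filename form mentioned above, which is an approximation; the helper name is made up.

```python
import unicodedata


def to_decomposed(text: str) -> str:
    """Return the canonical decomposition (NFD) of the text."""
    return unicodedata.normalize("NFD", text)


precomposed = "\u00e9"                   # é as a single code point
decomposed = to_decomposed(precomposed)  # e followed by U+0301

print(len(precomposed), len(decomposed))
```

Once text is decomposed, the accent is a separate code point and can be dropped independently of the base letter.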
Comment 8 Helge Hielscher 2006-04-30 13:30:52 UTC
See Unicode Standard Annex #15: Unicode Normalization Forms for details:
http://www.unicode.org/reports/tr15/
Comment 9 Brion Vibber 2006-05-01 06:52:42 UTC
We know what Unicode is, thanks. :) MediaWiki already transforms all input
to NFC and includes a built-in normalization library.
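
A short Python illustration of what NFC does and does not do, using the standard `unicodedata` module rather than MediaWiki's own library: NFC gives equivalent spellings a single form, but it keeps the accents, so on its own it does not make an unaccented query match accented text.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the canonical composition (NFC) of the text."""
    return unicodedata.normalize("NFC", text)


# Latin e + combining acute composes into the single code point U+00E9.
latin = to_nfc("e\u0301")

# Cyrillic а (U+0430) + combining acute has no precomposed form,
# so NFC leaves it as two code points.
cyrillic = to_nfc("\u0430\u0301")

print(len(latin), len(cyrillic))
```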
Comment 10 Robert Stojnic 2007-07-13 18:53:14 UTC
Fixed in Lucene Search 2. Diacritics are stripped, and Đ/đ has been set as an alias for D/d in Vietnamese. This also includes Hebrew pointing.

If you feel that stripping all diacritics is wrong for your language, reopen this bug.
