Last modified: 2013-08-29 17:01:46 UTC
The search shouldn't take combining diacritical marks into account.
E.g. searching for Александр Сергеевич Пушкин should return pages with Алекса́ндр
Серге́евич Пу́шкин as well as pages with Александр Сергеевич Пушкин.
Conversely, searching for Алекса́ндр Серге́евич Пу́шкин should also find
pages that contain only Александр Сергеевич Пушкин without accents.
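A minimal sketch of the requested behaviour, assuming Python's standard unicodedata module and a hypothetical helper name (strip_combining_marks); it only illustrates the idea and is not how the search backend works:

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop every code point with a non-zero canonical combining class."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# U+0301 is COMBINING ACUTE ACCENT, written as an escape to make it visible.
accented = "Алекса\u0301ндр Серге\u0301евич Пу\u0301шкин"
plain = "Александр Сергеевич Пушкин"

# Both spellings reduce to the same search key.
print(strip_combining_marks(accented) == plain)
print(strip_combining_marks(plain) == plain)
```

If both the indexed text and the query are passed through the same function, either spelling finds the other.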
Yes, it would be very useful — for instance, for Russian words (which are
optionally accentuated with the "combining acute accent"), and also for Arabic
and Hebrew words (where vowels are optionally indicated with marks over/under
the consonant letters).
This is essential for Vietnamese, in which most words have accent marks. New
users expect the search system to strip the diacritical marks (and also
understand Đ/đ ↔ D/d), but when it doesn't, the user is led to believe that we
don't have the article they're looking for.
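What a Vietnamese user expects could be sketched roughly as follows. The function name fold_vietnamese and the alias table are assumptions for illustration only. Because Đ/đ has no Unicode decomposition, it needs an explicit mapping on top of mark stripping:

```python
import unicodedata

# Hypothetical alias table: Đ/đ does not decompose, so map it by hand.
VI_ALIASES = str.maketrans({"Đ": "D", "đ": "d"})


def fold_vietnamese(text: str) -> str:
    """Decompose precomposed letters, drop the marks, then apply aliases."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(VI_ALIASES)


print(fold_vietnamese("Đà Nẵng"))      # Da Nang
print(fold_vietnamese("Hồ Chí Minh"))  # Ho Chi Minh
```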
Perhaps the search function should ignore diacritics in article titles when the
user has entered a query that contains no diacritics. If the user has entered
diacritics, the software should respect that. It would also be nice if there
were a MediaWiki message in which a list of diacritics could be customized per
wiki or locale, since different languages distinguish letters and diacritics
differently.
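The query-dependent rule proposed here could look roughly like this. All function names are hypothetical, and matching a whole title by equality is a simplification of what a real search index does:

```python
import unicodedata


def fold(text: str) -> str:
    """Decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def has_diacritics(text: str) -> bool:
    """True if folding changes the decomposed form of the text."""
    return fold(text) != unicodedata.normalize("NFD", text)


def title_matches(query: str, title: str) -> bool:
    if has_diacritics(query):
        # The user typed diacritics: respect them, compare normalized forms.
        return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", title)
    # Plain query: ignore diacritics in the title.
    return fold(query) == fold(title)


print(title_matches("Ha Noi", "Hà Nội"))  # True
print(title_matches("Hà Nội", "Ha Noi"))  # False
```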
I've filed a separate Bug 5752 for the issue I described in Comment 2, since
article titles at vi: use precomposed characters, which should nonetheless be
converted to the base ASCII characters when searching.
*** Bug 5752 has been marked as a duplicate of this bug. ***
Changing summary to include the issue discussed at Bug 5752, which Brion wants
to merge with this bug.
About precomposed characters: Mac OS X avoids the problem by using a special
kind of Unicode for filenames (UTF-8-MAC), where precomposed characters are
always converted to their decomposed variants.
See Unicode Standard Annex #15: Unicode Normalization Forms for details.
We know what Unicode is, thanks. :) MediaWiki already transforms all input
to NFC and includes a normalization conversion library built-in.
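For readers unfamiliar with the two forms mentioned above, a short illustration using Python's standard unicodedata module. It is only a stand-in and has nothing to do with MediaWiki's own normalization library:

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = unicodedata.normalize("NFD", precomposed)

# NFD splits the letter into a base letter plus a combining mark.
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# NFC recombines them into the precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)  # True
```

Text stored in NFC therefore still contains the diacritic, just folded into the letter's code point; stripping it for search requires decomposing first.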
Fixed in Lucene Search 2. Diacritics are stripped, and Đ/đ has been set as an alias of D/d in Vietnamese. This also includes Hebrew pointing.
If you feel that stripping all diacritics is wrong for your language, reopen this bug.
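As an illustration of when stripping everything might be wrong: in Russian, й and ё are letters in their own right, yet Unicode decomposes them into и and е plus a combining mark. The sketch below uses a hypothetical helper (fold_except) with a per-language keep set. It is an assumption about how an exception list could work, not a description of Lucene Search 2:

```python
import unicodedata


def fold_except(text: str, keep: frozenset = frozenset()) -> str:
    """Strip diacritics, leaving any precomposed letter listed in keep."""
    out = []
    # NFC first, so letters in keep are seen in precomposed form.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        for c in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(c):
                out.append(c)
    return "".join(out)


RU_KEEP = frozenset("йЙёЁ")  # hypothetical exception list for Russian

print(fold_except("йод"))                     # иод
print(fold_except("йод", RU_KEEP))            # йод
print(fold_except("Пу\u0301шкин", RU_KEEP))   # Пушкин
```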