Last modified: 2013-08-29 17:01:46 UTC
The search shouldn't take combining diacritical marks into account (see http://en.wikipedia.org/wiki/Combining_diacritical_mark). For example, searching for Александр Сергеевич Пушкин should find pages with Алекса́ндр Серге́евич Пу́шкин as well as pages with Александр Сергеевич Пушкин. Conversely, searching for Алекса́ндр Серге́евич Пу́шкин should also find pages that contain only Александр Сергеевич Пушкин, without the accents.
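A minimal sketch of the requested behaviour, assuming Python's standard `unicodedata` module rather than anything MediaWiki actually ships. The function name `strip_combining_marks` is hypothetical. It removes only marks that are present as separate combining characters, so precomposed letters such as й are left alone:

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Drop every character with a non-zero canonical combining class.

    No decomposition is performed first, so precomposed letters
    (for example Cyrillic й, U+0439) are kept unchanged.
    """
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# U+0301 is COMBINING ACUTE ACCENT, used here as the optional stress mark.
accented = "Алекса\u0301ндр Серге\u0301евич Пу\u0301шкин"
plain = "Александр Сергеевич Пушкин"

# Both spellings reduce to the same search key.
same_key = strip_combining_marks(accented) == strip_combining_marks(plain)
```

Applying the same function to both the query and the indexed text would make the match symmetric, which covers both directions described above.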
Yes, it would be very useful — for instance, for Russian words (which are optionally accentuated with the "combining acute accent"), and also for Arabic and Hebrew words (where vowels are optionally indicated with marks over/under the consonant letters).
This is essential for Vietnamese, in which most words have accent marks. New users expect the search system to strip the diacritical marks (and also understand Đ/đ ↔ D/d), but when it doesn't, the user is led to believe that we don't have the article they're looking for.
Perhaps the search function should ignore diacritics in article titles when the user has entered a query that contains no diacritics. If the user has entered diacritics, the software should respect that. It would also be nice if there were a MediaWiki message in which a list of diacritics could be customized per wiki or locale, since different languages distinguish letters and diacritics differently. I've filed a separate Bug 5752 for the issue I described in Comment 2, since article titles at vi: use precomposed characters, which should nonetheless be converted to the base ASCII characters when searching.
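The query-sensitive rule proposed here could look roughly like the sketch below. It is only an illustration under stated assumptions: `fold`, `title_matches`, and the `EXTRA_FOLDS` table are hypothetical names, and the table stands in for the per-wiki customizable list (it is not an existing MediaWiki message).

```python
import unicodedata

# Hypothetical per-wiki table for letters that have no Unicode decomposition,
# such as Vietnamese Đ/đ. A real wiki would load this from configuration.
EXTRA_FOLDS = str.maketrans({"Đ": "D", "đ": "d"})


def fold(text: str) -> str:
    """Decompose, drop combining marks, recompose, then apply the extra table."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return unicodedata.normalize("NFC", kept).translate(EXTRA_FOLDS)


def title_matches(query: str, title: str) -> bool:
    """Ignore diacritics in the title only when the query contains none."""
    query = unicodedata.normalize("NFC", query)
    title = unicodedata.normalize("NFC", title)
    if fold(query) == query:
        # Plain query: compare against the folded title.
        return fold(title) == query
    # Query already carries diacritics: respect them and compare exactly.
    return title == query
```

With this rule, a query of `Da Nang` would match the title `Đà Nẵng`, while a query of `Đà Nẵng` would not match a title spelled `Da Nang`.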
*** Bug 5752 has been marked as a duplicate of this bug. ***
Changing summary to include the issue discussed at Bug 5752, which Brion wants to merge with this bug.
About precomposed characters: Mac OS X avoids the problem by using a special variant of Unicode for filenames (UTF-8-MAC), in which precomposed characters are always converted to their decomposed variants.
See Unicode Standard Annex #15, Unicode Normalization Forms, for details: http://www.unicode.org/reports/tr15/
We know what Unicode is, thanks. :) MediaWiki already transforms all input to NFC and includes a normalization conversion library built-in.
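For readers following along, here is a small illustration of why NFC normalization alone does not settle the search question. It assumes Python's `unicodedata` module, not MediaWiki's own normalization library, and the helper name `code_points` is made up for the example. NFC composes a Vietnamese letter into one code point, but the Russian stress mark has no precomposed form and survives as a separate combining character:

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Vietnamese ế: one code point in NFC, three after NFD decomposition.
viet_nfc = "\u1ebf"
viet_nfd = unicodedata.normalize("NFD", viet_nfc)

# Cyrillic а followed by a combining acute accent: Unicode defines no
# precomposed character for this pair, so NFC leaves it as two code points.
russian = "\u0430\u0301"
russian_nfc = unicodedata.normalize("NFC", russian)

print(code_points(viet_nfc), code_points(viet_nfd))
print(code_points(russian_nfc))
```

So a search index built on NFC text still has to decide separately whether to drop or ignore the marks.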
Fixed in Lucene Search 2. Diacritics are stripped, and Đ/đ has been set as an alias for D/d in Vietnamese. This also covers Hebrew pointing. If you feel that stripping all diacritics is wrong for your language, reopen this bug.
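The general idea of folding at both index time and query time can be sketched as follows. This is not the Lucene Search 2 code. It is a toy in-memory index under stated assumptions, and `ALIASES`, `fold`, and `FoldedIndex` are hypothetical names.

```python
import unicodedata
from collections import defaultdict

# Hypothetical alias table for letters that do not decompose (Vietnamese Đ/đ).
ALIASES = str.maketrans({"Đ": "D", "đ": "d"})


def fold(text: str) -> str:
    """Strip all combining marks, apply aliases, and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return unicodedata.normalize("NFC", kept).translate(ALIASES).casefold()


class FoldedIndex:
    """Toy word index that stores and looks up words in folded form."""

    def __init__(self) -> None:
        self._titles = defaultdict(set)

    def add(self, title: str) -> None:
        for word in title.split():
            self._titles[fold(word)].add(title)

    def search(self, query: str) -> set:
        words = [fold(word) for word in query.split()]
        if not words:
            return set()
        # A title matches only if it contains every folded query word.
        results = set(self._titles.get(words[0], ()))
        for word in words[1:]:
            results &= self._titles.get(word, set())
        return results


index = FoldedIndex()
index.add("\u0110\u00e0 N\u1eb5ng")                          # Đà Nẵng
index.add("\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd")      # Hebrew word with pointing
```

Because this sketch decomposes before stripping, it also turns Russian й into и (U+0439 decomposes to U+0438 plus a combining breve). That is the kind of language-specific side effect that might make stripping all diacritics wrong for a given language.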