Last modified: 2013-08-29 17:01:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T3836, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 1836 - Strip combining diacritical marks and convert precomposed characters when searching


Summary:	Strip combining diacritical marks and convert precomposed characters when sea...

Status:	RESOLVED FIXED

Product:	MediaWiki
Classification:	Unclassified
Component:	Search
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement with 2 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n

Duplicates:	5752
Depends on:
Blocks:

Reported:	2005-04-06 20:05 UTC by Helge Hielscher
Modified:	2013-08-29 17:01 UTC
CC List:	5 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Helge Hielscher 2005-04-06 20:05:14 UTC

The search shouldn't take Combining diacritical marks into account.
http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. searching for Александр Сергеевич Пушкин should return pages with Алекса́ндр
Серге́евич Пу́шкин as well as pages with Александр Сергеевич Пушкин.
Conversely, searching for Алекса́ндр Серге́евич Пу́шкин should also find
pages that contain only the unaccented Александр Сергеевич Пушкин.
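
A minimal sketch of the requested behaviour, assuming Python and its standard unicodedata module (this is only an illustration, not MediaWiki's search code; the function names are hypothetical). It drops nonspacing combining marks from both the query and the page text before comparing, and deliberately does not decompose precomposed letters, so a letter such as й is left alone:

```python
import unicodedata

def strip_combining_marks(text: str) -> str:
    """Drop nonspacing combining marks (general category Mn); keep the rest."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def search_key(text: str) -> str:
    """Hypothetical helper: the form used to compare a query with page text."""
    return strip_combining_marks(text).casefold()

# U+0301 is the combining acute accent used for optional Russian stress marks.
accented = "Алекса\u0301ндр Серге\u0301евич Пу\u0301шкин"
plain = "Александр Сергеевич Пушкин"

# Both spellings map to the same key, so either one finds the other.
print(search_key(accented) == search_key(plain))
```

Comparing keys in both directions covers both cases above: an unaccented query finds accented text, and an accented query finds unaccented text.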

Comment 1 a.lukyanov 2005-04-08 07:16:30 UTC

Yes, it would be very useful — for instance, for Russian words (which are
optionally accentuated with the "combining acute accent"), and also for Arabic
and Hebrew words (where vowels are optionally indicated with marks over/under
the consonant letters).

Comment 2 Minh Nguyễn 2006-04-29 06:20:48 UTC

This is essential for Vietnamese, in which most words have accent marks. New
users expect the search system to strip the diacritical marks (and also
understand Đ/đ ↔ D/d), but when it doesn't, the user is led to believe that we
don't have the article they're looking for.

Comment 3 Minh Nguyễn 2006-04-29 06:47:50 UTC

Perhaps the search function should ignore diacritics in article titles when the
user has entered a query that contains no diacritics. If the user has entered
diacritics, the software should respect that. It would also be nice if there
were a MediaWiki message in which a list of diacritics could be customized per
wiki or locale, since different languages distinguish letters and diacritics
differently.
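
The rule proposed here could be sketched as follows, assuming Python's standard unicodedata module. The names fold, has_diacritics and title_matches are hypothetical, and the per-wiki list of ignorable diacritics is not modelled: every nonspacing mark is treated as ignorable.

```python
import unicodedata

def fold(text: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def has_diacritics(text: str) -> bool:
    """True if folding changes the text, i.e. it carries at least one mark."""
    return fold(text) != unicodedata.normalize("NFC", text)

def title_matches(query: str, title: str) -> bool:
    """Ignore diacritics only when the query itself contains none."""
    q = unicodedata.normalize("NFC", query)
    t = unicodedata.normalize("NFC", title)
    if has_diacritics(q):
        # The user typed diacritics, so respect them exactly.
        return q.casefold() == t.casefold()
    return fold(q).casefold() == fold(t).casefold()

print(title_matches("Pho", "Phở"))   # bare query, accented title
print(title_matches("Phở", "Pho"))   # accented query, bare title
```

With this rule a bare query is lenient and an accented query is strict, which differs from the symmetric matching asked for in the original description.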

I've filed a separate Bug 5752 for the issue I described in Comment 2, since
article titles at vi: use precomposed characters, which should nonetheless be
converted to the base ASCII characters when searching.
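
As a rough illustration of that conversion, assuming Python's standard unicodedata module (the names ALIASES and fold_vietnamese are hypothetical): most precomposed Vietnamese letters decompose canonically into a base letter plus combining marks, but Đ/đ has no decomposition and needs an explicit alias.

```python
import unicodedata

# Đ (U+0110) and đ (U+0111) have no canonical decomposition,
# so they are mapped to D/d by hand.
ALIASES = str.maketrans({"\u0110": "D", "\u0111": "d"})

def fold_vietnamese(text: str) -> str:
    """Reduce precomposed Vietnamese letters to their base ASCII letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(ALIASES)

print(fold_vietnamese("Nguyễn"))
print(fold_vietnamese("Đà Nẵng"))
```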

Comment 4 Brion Vibber 2006-04-29 21:26:52 UTC

*** Bug 5752 has been marked as a duplicate of this bug. ***

Comment 5 Brion Vibber 2006-04-30 00:23:53 UTC

*** Bug 5752 has been marked as a duplicate of this bug. ***

Comment 6 Minh Nguyễn 2006-04-30 00:32:49 UTC

Changing summary to include the issue discussed at Bug 5752, which Brion wants
to merge with this bug.

Comment 7 Helge Hielscher 2006-04-30 13:16:50 UTC

About precomposed characters: Mac OS X avoids the problem by using a special
kind of Unicode for filenames (UTF-8-MAC), where precomposed characters are
always converted to their decomposed variants.

Comment 8 Helge Hielscher 2006-04-30 13:30:52 UTC

see Unicode Standard Annex #15: Unicode Normalization Forms for details:
http://www.unicode.org/reports/tr15/

Comment 9 Brion Vibber 2006-05-01 06:52:42 UTC

We know what Unicode is, thanks. :) MediaWiki already transforms all input 
to NFC and includes a normalization conversion library built-in.
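
For illustration only, here is what NFC does and does not do for this bug, using Python's standard unicodedata module rather than MediaWiki's own normalization library (same_text is a hypothetical helper). NFC makes precomposed and decomposed spellings of the same letter compare equal, but it never removes a mark that has no precomposed form:

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Latin e + combining acute (U+0301) composes to the single code point U+00E9.
print(same_text("e\u0301", "\u00e9"))

# Cyrillic а (U+0430) + combining acute has no precomposed form,
# so NFC leaves it as two code points and the accent stays in the text.
stressed = unicodedata.normalize("NFC", "\u0430\u0301")
print(len(stressed))
```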

Comment 10 Robert Stojnic 2007-07-13 18:53:14 UTC

Fixed in Lucene Search 2. Diacritics are stripped, and Đ/đ has been set as an alias for D/d in Vietnamese. This also includes Hebrew pointing.

If you feel that stripping all diacritics is wrong for your language, reopen this bug.
