Last modified: 2014-02-20 21:16:11 UTC
User:Atitarev from Wiktionary has complained that the normalisation used by Lucene does not suit Hindi and Arabic. In the examples I have been given, combining characters such as U+093C are used to add diacritics to characters, and the resulting combinations have no composed form in Unicode. It is requested that the combining marks be stripped before search indexing is done, so that titles which differ only by the combining marks they contain can be returned in "did you mean" and autocomplete results. A list of affected characters will be given as a comment or attachment.
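A minimal sketch of the requested stripping step, in Python rather than in Lucene itself. The function name `strip_marks` and the one-entry mark list are assumptions for illustration only; the real list of affected characters is the one to be supplied on this bug. The text is decomposed first so that a precomposed letter and a base letter followed by U+093C are treated alike, and only the listed marks are removed, because removing every combining mark would also delete Devanagari vowel signs.

```python
import unicodedata

# Hypothetical example list; the actual list is to be supplied on the bug.
MARKS_TO_STRIP = frozenset({"\u093C"})  # DEVANAGARI SIGN NUKTA


def strip_marks(text, marks=MARKS_TO_STRIP):
    """Return text with the listed combining marks removed."""
    # NFD splits precomposed letters (e.g. U+0959) into base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop only the listed marks; vowel signs and other marks are kept.
    return "".join(ch for ch in decomposed if ch not in marks)


# The same word typed with a precomposed letter, with a separate nukta,
# and with no nukta at all.
precomposed = "\u0959\u0942\u0928"
with_nukta = "\u0916\u093C\u0942\u0928"
plain = "\u0916\u0942\u0928"
print(strip_marks(precomposed) == strip_marks(with_nukta) == plain)
```

Applying the same function to both the indexed title and the query would make the three spellings compare equal.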
The discussion can be seen here, but here are the diacritics and characters provided to me:

Hindi:
First of all, the pairs with nuqta (a dot underneath) and without it should be searchable the same way Roman letters with diacritics and without are searchable.
* क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ
The letters are not identical, but if a user typed खून, ख़ून should also be listed.
* Words containing the diacritics ॉ (candra), ् (virama) should be equal to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way English word entries with a space are equal to those having a hyphen (-) between them.
----
Arabic:
* Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable together, e.g. أمس and امس, etc.
* Words containing any of these diacritics should be searchable as if they don't have them, and the other way around:
ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif.
----
* ـٌ tanwiin al-Damm (تنوين الضم)
* ـٍ tanwiin al-kasr (تنوين الكسر)
* ـً tanwiin al-fatH (تنوين الفتح)
----
Persian often uses a zero-width non-joiner (U+200C) as in ویکیپدیا. People who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a misspelling, but lots of people can’t help it.

In languages like Khmer and Thai that do not use word spaces, there is often a zero-width space (U+200B) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are correct.

I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final letter ة may be typed as ه.
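The Arabic part of the request could be sketched as follows. This is only an illustration in Python, not what Lucene does; `fold_arabic` and the table names are made up for the example, and the character sets are limited to the alif forms and diacritics listed above. The zero-width characters and the ه/ة pair are left out, since the comment does not say how they should be folded.

```python
import unicodedata

ALIF = "\u0627"  # bare alif
# Hamza above, hamza below, madda and wasla forms of alif.
ALIF_VARIANTS = "\u0623\u0625\u0622\u0671"
# Tanwin (U+064B..U+064D), fatHa, Damma, kasra, shadda, sukuun, dagger alif.
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670"

# Translation table: alif variants become bare alif, diacritics are deleted.
_FOLD_TABLE = {ord(ch): ALIF for ch in ALIF_VARIANTS}
_FOLD_TABLE.update({ord(ch): None for ch in DIACRITICS})


def fold_arabic(text):
    """Fold alif variants and drop the listed diacritics (illustrative only)."""
    # NFKC maps presentation forms such as U+FE81 to the ordinary letter U+0622.
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(_FOLD_TABLE)


# The two spellings from the example above, with and without hamza on the alif.
print(fold_arabic("\u0623\u0645\u0633") == fold_arabic("\u0627\u0645\u0633"))
```

As with any folding of this kind, the same function would have to be applied at both index time and query time for the spellings to match.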
Just adding a note that stripping diacritics from Latin letters is not always the correct thing to do. It is obvious that we need to support different models for different languages.
(In reply to comment #1)
> * Words containing diacritics ॉ (candra), ् (virama) should be equal to
> those without them: चॉकलेट / चाकलेट, सन् / सन.

Actually, चॉकलेट can also be written as चौकलेट or चोकलेट. However, everything other than चॉकलेट is grammatically incorrect. If an equivalence is to be added, it should be between चॉकलेट and चौकलेट, not चाकलेट.
The reason is that a lot of unwanted equivalences would be introduced as well, like हॉल (hall) and हाल (the condition someone is in).

The handling for halant/viram is correctly stated as an equivalence. However, there is more to it. Five characters in Hindi, when followed by a halant, can be replaced by an anuswara on the preceding character. All five represent nasal sounds, which can be represented by an anuswara. For example, सङ्गीत/संगीत, सम्वत/संवत.

The five characters are ङ ञ ण न म.

But not every anuswara can be equated to each of the five, since each has a different sound. There is a grammatical rule which decides this. The rule depends on the character that follows the nasal. Case by case:
* क ख ग घ are preceded by ङ
* च छ ज झ are preceded by ञ
* ट ठ ड ढ are preceded by ण
* त थ द ध are preceded by न
* प फ ब भ are preceded by म

Note that this is similar to the UTF-8 encoding order: the four letters come, in the stated order, before the respective nasal letter.

So, if I type in सन्, I would expect संतान to show up, but not संभव.

However, this limited equivalence is the ideal case with perfect grammar. In actual usage, न् has been used in place of ङ्, ञ् and ण्, but not म्, since that is an entirely different sound. So, if I type in सन्, I would also expect संगीत, संजय and संडे to show up, but still not संभव.

I hope I have explained this clearly enough.

PS: The nuqta stuff is correct.
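The rule described above can be written down as a small lookup. This is a sketch in Python under the assumptions of this comment only: the function name `nasal_matches` and the `strict` flag are invented for the example. Consonants outside the five rows (for instance व in सम्वत) are not covered by the rule as stated, so the sketch simply returns False for them.

```python
# Each nasal letter mapped to the four consonants of its row.
ROWS = {
    "\u0919": "\u0915\u0916\u0917\u0918",  # nga: ka kha ga gha
    "\u091E": "\u091A\u091B\u091C\u091D",  # nya: ca cha ja jha
    "\u0923": "\u091F\u0920\u0921\u0922",  # nna: tta ttha dda ddha
    "\u0928": "\u0924\u0925\u0926\u0927",  # na: ta tha da dha
    "\u092E": "\u092A\u092B\u092C\u092D",  # ma: pa pha ba bha
}
NA, MA = "\u0928", "\u092E"

# Reverse lookup: consonant -> the nasal that grammatically precedes it.
EXPECTED_NASAL = {c: nasal for nasal, row in ROWS.items() for c in row}


def nasal_matches(nasal, next_consonant, strict=True):
    """Can an anuswara before next_consonant stand for nasal + halant?"""
    expected = EXPECTED_NASAL.get(next_consonant)
    if expected is None:
        return False  # consonant is outside the five rows
    if expected == nasal:
        return True  # the grammatically correct pairing
    # Loose usage: na is also written in place of nga, nya and nna, never ma.
    return not strict and nasal == NA and expected != MA


# Typing na + halant: ta matches strictly, ga only loosely, bha never.
for consonant in ("\u0924", "\u0917", "\u092D"):
    print(nasal_matches(NA, consonant), nasal_matches(NA, consonant, strict=False))
```

With `strict=False` the function reproduces the actual-usage behaviour described above, where a query with न् also finds words whose anuswara precedes ग, ज or ड.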
Bug 33548 is related to this. It's about the appearance of Devanagari diacritics in the "did you know" results.
(In reply to Dave Ross from comment #1)
> Arabic:
> * Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable
> together, e.g. أمس and امس, etc.

امس : search=امس
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=امس&fulltext=Search&uselang=en
There is a page named "امس" on this wiki.

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en
Create the page "أمس" on this wiki!

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en&srbackend=CirrusSearch
Create the page "أمس" on this wiki!
Needs reassessment with Cirrus.