Last modified: 2014-02-20 21:16:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T29055, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 27055 - Devanagari and Arabic combining character handling


Summary:	Devanagari and Arabic combining character handling

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	CirrusSearch (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n, utf8

Depends on:
Blocks:

Reported:	2011-01-31 02:20 UTC by Tim Starling
Modified:	2014-02-20 21:16 UTC (History)
CC List:	11 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Tim Starling 2011-01-31 02:20:39 UTC

User:Atitarev from Wiktionary has complained that the normalisation used by Lucene does not suit Hindi and Arabic. In the examples I have been given, combining characters such as U+093C are used to add diacritics to characters, and the resulting combinations have no composed form in Unicode. It is requested that the combining marks be stripped before search indexing is done, so that titles which differ only by the combining marks they contain can be returned in "did you mean" and autocomplete results.

A list of affected characters will be given as a comment or attachment.
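
The stripping requested above could be sketched roughly as follows. This is a minimal illustration in Python, not the Lucene normalisation code; the function name `strip_marks` and the choice to pass an explicit set of marks are assumptions made for the example. An explicit set is used because a blanket filter on Unicode category Mn would also delete Devanagari vowel signs such as U+0942, which are nonspacing marks too.

```python
import unicodedata

NUQTA = "\u093c"  # DEVANAGARI SIGN NUKTA, the mark named in the report

def strip_marks(text: str, marks: frozenset = frozenset({NUQTA})) -> str:
    """Decompose to NFD, then drop only the listed combining marks.

    Hypothetical helper for illustration; not part of any search backend.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in marks)

with_nuqta = "\u0916\u093c\u0942\u0928"  # kha + nuqta + vowel sign uu + na
precomposed = "\u0959\u0942\u0928"       # same word using precomposed U+0959
plain = "\u0916\u0942\u0928"             # the spelling without nuqta

# All three spellings reduce to the same index key.
print(strip_marks(with_nuqta) == plain)
print(strip_marks(precomposed) == plain)
print(strip_marks(plain) == plain)
```

NFD is applied first so that the precomposed nuqta letters (U+0958 to U+095F) are split into base letter plus U+093C before filtering.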

Comment 1 Dave Ross 2011-02-05 15:41:31 UTC

The discussion can be seen here, but here are the diacritics and characters provided to me:

Hindi:
First of all, the pairs with nuqta (a dot underneath) and without it should be searchable the same way Roman letters with diacritics and without are searchable.
* क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ
The letters are not identical, but they should be treated as equivalent so that if a user typed खून, ख़ून would also be listed.
* Words containing the diacritics ॉ (candra), ् (virama) should be equal to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way English word entries with a space are equal to those having a hyphen (-) between them.
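
One way to picture the Hindi request is a folded key under which titles are grouped, so that a query for one spelling also returns the others. The sketch below is an assumption-laden illustration: `hindi_key` and `build_index` are hypothetical names, and the virama is stripped everywhere as the request is worded, which also merges conjuncts with their unjoined spellings. The candra case is left out.

```python
import unicodedata
from collections import defaultdict

NUQTA = "\u093c"   # dot below
VIRAMA = "\u094d"  # halant

def hindi_key(title: str) -> str:
    """Fold a title to a key without nuqta or virama (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(ch for ch in decomposed if ch not in (NUQTA, VIRAMA))

def build_index(titles):
    """Group titles by folded key so variant spellings share one bucket."""
    index = defaultdict(list)
    for title in titles:
        index[hindi_key(title)].append(title)
    return index

titles = [
    "\u0916\u093c\u0942\u0928",  # khuun with nuqta
    "\u0916\u0942\u0928",        # khuun without nuqta
    "\u0938\u0928\u094d",        # san with final virama
    "\u0938\u0928",              # san without virama
]
index = build_index(titles)

# A query typed without nuqta finds both spellings.
print(index[hindi_key("\u0916\u0942\u0928")])
```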
----
Arabic:
* Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable together, e.g. أمس and امس, etc.
* Words containing any of these diacritics could be searchable as if they don't have them and the other way around:
ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif.
----
* ـٌ tanwiin al-Damm (تنوين الضم)
* ـٍ tanwiin al-kasr (تنوين الكسر)
* ـً tanwiin al-fatH (تنوين الفتح)
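
The Arabic folding listed above might look like the following sketch. The name `arabic_key` is hypothetical and the mapping covers only the alif forms and marks named in this comment. NFKC is applied first, as an assumption, so that presentation forms such as U+FE81 are mapped to their ordinary letters before folding.

```python
import unicodedata

ALIF = "\u0627"
ALIF_FORMS = {
    "\u0623": ALIF,  # alif with hamza above
    "\u0625": ALIF,  # alif with hamza below
    "\u0622": ALIF,  # alif with madda above
    "\u0671": ALIF,  # alif wasla
}
# fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun, dagger alif
MARKS = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670"

FOLD_TABLE = str.maketrans({**ALIF_FORMS, **{mark: None for mark in MARKS}})

def arabic_key(text: str) -> str:
    """Fold alif variants to bare alif and delete the listed diacritics."""
    return unicodedata.normalize("NFKC", text).translate(FOLD_TABLE)

# The two spellings from the example above share one key.
print(arabic_key("\u0623\u0645\u0633") == arabic_key("\u0627\u0645\u0633"))
```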
----
Persian often uses a zero-width nonjoiner (&#x200C;) as in ویکی‌پدیا. People who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a misspelling, but lots of people can’t help it.

In languages like Khmer and Thai that do not use word spaces, there is often a zero-width space (&#x200B;) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are correct.
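
A rough sketch of how the two zero-width characters could be folded before indexing is shown below. The function name `fold_zero_width` and the policy (treat the nonjoiner as a space, drop the zero-width space) are assumptions for illustration, chosen to match the two behaviours described above.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, common in Persian
ZWSP = "\u200b"  # zero-width space, common in Khmer and Thai

def fold_zero_width(text: str) -> str:
    """Drop zero-width spaces and treat the non-joiner like a plain space."""
    return text.replace(ZWSP, "").replace(ZWNJ, " ")

with_zwnj = "\u0648\u06cc\u06a9\u06cc\u200c\u067e\u062f\u06cc\u0627"   # joined with ZWNJ
with_space = "\u0648\u06cc\u06a9\u06cc \u067e\u062f\u06cc\u0627"       # typed with a space

# Both Persian spellings produce the same key.
print(fold_zero_width(with_zwnj) == fold_zero_width(with_space))
```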

I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final letter ة may be typed as ه.
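
The word-final pair could be folded along these lines. This is only a sketch: `fold_final_teh_marbuta` is a hypothetical name, and words are split on whitespace for simplicity, so a word followed directly by punctuation or a diacritic would not be caught.

```python
TEH_MARBUTA = "\u0629"
HEH = "\u0647"

def fold_final_teh_marbuta(text: str) -> str:
    """Replace a word-final teh marbuta with heh in each whitespace-separated word."""
    words = text.split()
    return " ".join(
        word[:-1] + HEH if word.endswith(TEH_MARBUTA) else word
        for word in words
    )

typed_with_heh = "\u0645\u062f\u0631\u0633\u0647"
typed_with_teh = "\u0645\u062f\u0631\u0633\u0629"

# Both ways of typing the final letter give the same key.
print(fold_final_teh_marbuta(typed_with_teh) == typed_with_heh)
```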

Comment 2 Niklas Laxström 2011-09-06 11:28:46 UTC

Just adding a note that stripping diacritics from Latin letters is not always the correct thing to do. It is obvious that we need to support different models for different languages.

Comment 3 Siddhartha Ghai 2012-01-06 06:28:55 UTC

(In reply to comment #1)
> The discussion can be seen here, but here are the diacritics and characters
> provided to me:
> 
> 
> Hindi:
> First of all, the pairs with nuqta (a dot underneath) and without it should be
> searchable the same way Roman letters with diacritics and without are
> searchable.
>     * क़/क ख़/ख ग़/ग ज़/ज फ़/फ ड़/ड ढ़/ढ 
> The letters are not identical, but they should be treated as equivalent so
> that if a user typed खून, ख़ून would also be listed.
>     * Words containing the diacritics ॉ (candra), ् (virama) should be equal
> to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way
> English word entries with a space are equal to those having a hyphen (-)
> between them.
> ----
> Arabic:
>     * Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
> together, e.g. أمس and امس, etc.
>     * Words containing any of these diacritics could be searchable as if they
> don't have them and the other way around: 
> ـَ fatHa, ـِ kasra, ـُ Damma, ـْ sukuun, ـّ shadda, ـٰ dagger 'alif. 
> ----
>     * ـٌ tanwiin al-Damm (تنوين الضم) 
>     * ـٍ tanwiin al-kasr (تنوين الكسر) 
>     * ـً tanwiin al-fatH (تنوين الفتح) 
> ----
> Persian often uses a zero-width nonjoiner (& # x200C;) as in ویکی‌پدیا. People
> who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a
> misspelling, but lots of people can’t help it.
> 
> In languages like Khmer and Thai that do not use word spaces, there is often a
> zero-width space (& # x200B;) as in តើអ្នកនិយាយភាសាអង់គ្លេសទេ. More often
> than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are
> correct.
> 
> I think Anatoli neglected to mention the word-final Arabic pair ه/ة. The final
> letter ة may be typed as ه.

Actually, चॉकलेट can also be written as चौकलेट or चोकलेट. However, everything other than चॉकलेट is grammatically incorrect. If equivalence is to be added, it should be between चॉकलेट and चौकलेट, not चाकलेट. The reason is that equating चॉकलेट with चाकलेट would introduce a lot of unwanted equivalences as well, like हॉल (hall) and हाल (the condition someone is in).

The handling for halant/viram is correctly stated as equivalence. However, there is more to it. Five characters in Hindi, when followed by halant, can be replaced by an anuswara on the preceding character. All five represent nasal sounds, which can be represented by anuswara. For example, सङ्गीत/संगीत, सम्वत/संवत.

The five characters are ङ ञ ण न म

But not every anuswara can be equated to each of the five, since each has a different sound.
There is a grammatical rule which decides this. The rule depends on the character that follows the nasal. Case by case:

क ख ग घ are preceded by ङ
च छ ज झ are preceded by ञ
ट ठ ड ढ are preceded by ण
त थ द ध are preceded by न
प फ ब भ are preceded by म

Note that this is similar to the Unicode code point order. The four letters come in the stated order before the respective nasal letter.

So, if I type in सन्, I would expect संतान to show up, but not संभव.

However, this limitation of equating is an ideal case with perfect grammar. In actual usage, न् has been used in place of ङ् ञ् and ण् but not म् since it is an entirely different sound. So, if I type in सन्, I would also expect संगीत, संजय, संडे to show up, but still not संभव. Hope I have clarified this clearly enough.

PS: The nuqta stuff is correct.
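
The rule described in this comment could be sketched as a title normalisation step like the one below. It is only an illustration under stated assumptions: `fold_nasal_conjuncts` and the `loose` flag are hypothetical names, the table is taken from the five rows listed above, and the sketch rewrites a nasal plus halant into an anuswara only when a listed consonant follows. It does not cover prefix queries such as a bare सन्, which would need expanding in the opposite direction.

```python
import re

VIRAMA = "\u094d"
ANUSVARA = "\u0902"
NGA, NYA, NNA, NA, MA = "\u0919", "\u091e", "\u0923", "\u0928", "\u092e"

# Each nasal maps to the four consonants it may precede (rows listed above).
VARGA_NASALS = {
    NGA: "\u0915\u0916\u0917\u0918",  # ka kha ga gha
    NYA: "\u091a\u091b\u091c\u091d",  # ca cha ja jha
    NNA: "\u091f\u0920\u0921\u0922",  # tta ttha dda ddha
    NA: "\u0924\u0925\u0926\u0927",   # ta tha da dha
    MA: "\u092a\u092b\u092c\u092d",   # pa pha ba bha
}

def fold_nasal_conjuncts(text: str, loose: bool = False) -> str:
    """Rewrite nasal + halant as anuswara before a consonant of its row.

    With loose=True, na + halant is also accepted before the rows of
    nga, nya and nna, reflecting the actual usage described above.
    """
    rules = dict(VARGA_NASALS)
    if loose:
        rules[NA] = "".join(VARGA_NASALS[n] for n in (NGA, NYA, NNA, NA))
    for nasal, followers in rules.items():
        pattern = f"{nasal}{VIRAMA}(?=[{followers}])"
        text = re.sub(pattern, ANUSVARA, text)
    return text

strict_spelling = "\u0938\u0919\u094d\u0917\u0940\u0924"  # written with nga + halant
loose_spelling = "\u0938\u0928\u094d\u0917\u0940\u0924"   # written with na + halant
folded = "\u0938\u0902\u0917\u0940\u0924"                 # written with anuswara

print(fold_nasal_conjuncts(strict_spelling) == folded)
print(fold_nasal_conjuncts(loose_spelling, loose=True) == folded)
```

Keeping the strict and loose rules behind one flag makes it easy to compare how many extra equivalences the loose rule introduces.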

Comment 4 Siddhartha Ghai 2012-01-06 06:32:59 UTC

Bug 33548 is related to this. It's about the appearance of Devanagari diacritics in the "did you know" results.

Comment 5 Andre Klapper 2014-02-13 23:19:41 UTC

(In reply to Dave Ross from comment #1)
> Arabic:
>     * Different forms of alif: ا, أ‎, إ‎, ﺁ‎ and ٱ‎‎ should be searchable
> together, e.g. أمس and امس, etc.

امس : search=امس
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=امس&fulltext=Search&uselang=en
There is a page named "امس" on this wiki.

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en
Create the page "أمس" on this wiki!

أمس :
https://ar.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5%3A%D8%A8%D8%AD%D8%AB&profile=default&search=أمس&fulltext=Search&uselang=en&srbackend=CirrusSearch
Create the page "أمس" on this wiki!

Comment 6 Chad H. 2014-02-14 18:11:09 UTC

Needs reassessment with Cirrus.
