Last modified: 2009-11-23 10:19:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T20764, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 18764 - Search in yi: should ignore diacritics and identify ligatures


Summary:	Search in yi: should ignore diacritics and identify ligatures

Status:	RESOLVED FIXED

Product:	Wikimedia
Classification:	Unclassified
Component:	lucene-search-2 (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement (vote)
Target Milestone:	---
Assigned To:	Robert Stojnic

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-05-11 10:48 UTC by Percy Mett
Modified:	2009-11-23 10:19 UTC (History)
CC List:	3 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Percy Mett 2009-05-11 10:48:14 UTC

This relates specifically to all projects using Yiddish (yi).

Yiddish has a number of ligatures. When a search term includes such a ligature it should be able to identify the corresponding term spelt fully without using the ligature.

Likewise searches should ignore the presence of diacritics which some writers use.

Currently Wikimedia projects fail to make this identification. As a result it is necessary to set up numerous synonyms for pages to catch alternative (but essentially identical) spellings of the same word. This applies to almost every word in the language.

By way of comparison, Google search makes the correct identifications. [English Wikimedia projects successfully convert upper-case letters in the middle of words.]

I can supply a list of Unicode codes to be identified.

Comment 1 Robert Stojnic 2009-05-11 11:09:38 UTC

We currently use Unicode decomposition to strip all diacritics, but from what you're saying I gather that it doesn't do the job for Yiddish. A table mapping each Unicode character to its other form for Yiddish would be very useful.

Comment 2 Percy Mett 2009-06-15 09:32:04 UTC

A simple test by entering a search term with diacritics shows that they are not stripped.


The following should be ignored
HEBREW POINT PATAH  05B7
HEBREW POINT QAMATS  05B8
HEBREW POINT DAGESH OR MAPIQ  05BC
HEBREW POINT RAFE 05BF


The following should be identified with their decomposed forms
HEBREW LIGATURE YIDDISH DOUBLE VAV 05F0  = 05D5 05D5
HEBREW LIGATURE YIDDISH VAV YOD    05F1  = 05D5 05D9
HEBREW LIGATURE YIDDISH DOUBLE YOD 05F2  = 05D9 05D9
HEBREW LIGATURE YIDDISH YOD YOD PATAH FB1F = 05D9 05D9
HEBREW LETTER YOD WITH HIRIQ  FB1D  = 05D9


These are the most common ones
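
The table above can be sketched as a character-level normalization step applied to both indexed text and search queries. This is a minimal Python sketch under stated assumptions, not the lucene-search-2 code: the names `IGNORED_POINTS`, `LIGATURE_EXPANSIONS`, and `normalize_yiddish` are hypothetical, and U+FB1D and U+FB1F are mapped to plain yods as the table requests.

```python
# Hypothetical normalizer illustrating the requested mappings.
# Not taken from lucene-search-2; all names here are illustrative.

# Points to be ignored: PATAH, QAMATS, DAGESH OR MAPIQ, RAFE.
IGNORED_POINTS = {"\u05B7", "\u05B8", "\u05BC", "\u05BF"}

# Ligatures and presentation forms mapped to their spelled-out letters.
LIGATURE_EXPANSIONS = {
    "\u05F0": "\u05D5\u05D5",  # YIDDISH DOUBLE VAV -> VAV VAV
    "\u05F1": "\u05D5\u05D9",  # YIDDISH VAV YOD    -> VAV YOD
    "\u05F2": "\u05D9\u05D9",  # YIDDISH DOUBLE YOD -> YOD YOD
    "\uFB1F": "\u05D9\u05D9",  # YIDDISH YOD YOD PATAH -> YOD YOD
    "\uFB1D": "\u05D9",        # YOD WITH HIRIQ     -> YOD
}


def normalize_yiddish(text: str) -> str:
    """Drop the listed points and expand the listed ligatures."""
    out = []
    for ch in text:
        if ch in IGNORED_POINTS:
            continue  # skip diacritics that some writers use
        out.append(LIGATURE_EXPANSIONS.get(ch, ch))
    return "".join(out)
```

If both the index and the query pass through the same function, a term typed with a ligature or with points compares equal to the fully spelled, unpointed form.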

Comment 3 Brion Vibber 2009-07-13 19:15:58 UTC

Assigning to Robert for followup.

Comment 4 Robert Stojnic 2009-10-20 22:55:20 UTC

The decomposed forms you suggest are not part of the unicode standard. 

Can you give us some sample search terms with and without diacritics, so that we have something to test with?

Comment 5 Percy Mett 2009-11-18 11:39:36 UTC

For example

פּאריז  --> פאריז
װאנט --> וואנט

Comment 6 Robert Stojnic 2009-11-22 02:45:29 UTC

OK, I've added the exceptions you requested on yi projects. Since these are not part of the Unicode standard and I don't know Yiddish, you would need to tell us explicitly which further exceptions you want.

Comment 7 Robert Stojnic 2009-11-22 02:47:14 UTC

It would be even better if you could provide patches like r59327, so these don't need to be retyped.

Comment 8 Percy Mett 2009-11-22 09:14:08 UTC

Thanks for this. When will it become effective?

Will try to provide patches in future.

I am not clear what is not part of the Unicode standard. Is U+05F0 (in װאנט) not a Unicode point? U+05BC?

Comment 9 Robert Stojnic 2009-11-22 17:00:23 UTC

It is deployed on yiwiki/wiktionary/wikisource ... Unicode has a way of decomposing characters into simpler characters, e.g. to remove accents, but your custom decomposition rules are not part of it.
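
To illustrate the distinction: standard Unicode normalization defines decompositions for the presentation forms U+FB1D and U+FB1F, and it separates combining points so they can be dropped, but it defines no decomposition for the ligatures U+05F0 to U+05F2. The Python sketch below uses the standard `unicodedata` module to show this; `strip_marks` is a hypothetical helper for illustration and is not how lucene-search-2 is implemented.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop all combining marks (category M*)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    # PE + DAGESH: the dagesh is a combining mark and is removed.
    print(strip_marks("\u05E4\u05BC") == "\u05E4")
    # YIDDISH DOUBLE VAV has no standard decomposition and is left as-is.
    print(strip_marks("\u05F0") == "\u05F0")
    # An empty string means the character has no decomposition mapping.
    print(repr(unicodedata.decomposition("\u05F0")))
```

Under this generic approach the double-vav ligature never becomes two vavs, which is why a per-language exception table is needed for the yi projects.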

Comment 10 Percy Mett 2009-11-22 23:27:52 UTC

I am puzzled by this. I typed 
נאװעמבער
into the search box in yiwiki. It does not find the article named
נאוועמבער

However, if I type
 
NoVember

in the search box in the English Wikipedia, it does find the article named

November

I am not clear what this has to do with Unicode decomposition.

Comment 11 Robert Stojnic 2009-11-23 09:09:35 UTC

I get identical search results for both. But it looks like you want "Go" to go directly to the article ... In that case we would need to modify the TitleKey extension in non-trivial ways, and if you want linking to work, then also MediaWiki internals, again in non-trivial ways.

Comment 12 Percy Mett 2009-11-23 10:19:02 UTC

You are right. I am sorry that I did not express myself sufficiently clearly in the original message.

I hadn't realized that this is so complicated to implement. The strange thing is that it **does** work the other way round! If I type a word with diacritics, the Go box will produce a dropdown list with the corresponding terms which do contain diacritics.

Thank you for your assistance (and for your patience).
