
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T52167, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 50167 - VisualEditor: Support Unicode equivalence for client-side text searches
Status: ASSIGNED
Product: VisualEditor
Classification: Unclassified
Component: Data Model (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Editing team bugs – take if you're interested!
Depends on:
Blocks: ve-multi-lingual
Reported: 2013-06-25 11:02 UTC by Ed Sanders
Modified: 2014-02-28 23:41 UTC (History)
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Ed Sanders 2013-06-25 11:02:08 UTC
Planned features, such as searching for an existing reference by content, will require us to implement some form of Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence).

We will probably want to use NFKD ("Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.") to catch cases such as 'ﬀ' (U+FB00) === 'ff', and we will probably want to strip combining characters (i.e. all accents), so that 'Amelie' === 'Amélie'.

https://github.com/walling/unorm looks like a good library for the job. We may want to fork it into UnicodeJS.
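
For illustration, a minimal sketch of that pipeline, assuming unorm's documented nfkd() function; the normalizeForSearch name and the single combining-marks range are ours, not an existing API:

var unorm = require( 'unorm' );

function normalizeForSearch( text ) {
	// NFKD compatibility decomposition: 'ﬀ' (U+FB00) becomes 'ff',
	// 'é' becomes 'e' + U+0301
	var decomposed = unorm.nfkd( text );
	// Strip combining diacritical marks. Only U+0300–U+036F is covered
	// here; a complete solution would also strip the other combining
	// blocks (e.g. U+1AB0–U+1AFF, U+1DC0–U+1DFF, U+20D0–U+20FF).
	return decomposed.replace( /[\u0300-\u036f]/g, '' );
}

normalizeForSearch( 'Amélie' ) === normalizeForSearch( 'Amelie' ); // true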
Comment 1 D Chan 2013-06-25 12:30:02 UTC
We probably shouldn't strip down beyond NFKD. For some languages, 'ä' should be equivalent to 'a'; for others, it shouldn't be equivalent to anything; for still others, it should be equivalent to 'ae'.

Will it be feasible to implement language-specific search on top of this?
Comment 2 Ed Sanders 2013-06-25 14:04:31 UTC
I don't see why not. We may want to add mappings like 'ß' => 'ss' in German, or handle final vs. non-final sigma in Greek (https://en.wikipedia.org/wiki/Sigma#Character_Encodings).
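
Illustratively, that could sit on top of the generic normalisation as a per-language fold table; the table contents and function name below are hypothetical:

var foldTables = {
	de: { 'ß': 'ss' },
	el: { 'ς': 'σ' } // fold final sigma to medial sigma
};

function languageFold( text, lang ) {
	var table = foldTables[ lang ] || {};
	// Replace each character via the language's table, if present
	return text.replace( /[\s\S]/g, function ( ch ) {
		return table.hasOwnProperty( ch ) ? table[ ch ] : ch;
	} );
}

languageFold( 'Straße', 'de' ); // 'Strasse'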
Comment 3 D Chan 2013-06-28 11:47:00 UTC
It's worth noting that in most software, many common grapheme clusters are displayed more correctly when encoded as a single Unicode character than when encoded with combining characters. For example, 'sgrîn' ("sgr\u00EEn") displays correctly in my version of Firefox on Linux, but the equivalent decomposed string 'sgrîn' ("sgri\u0302n") shows up with the dot still on the i and the accent in the wrong place (either uncentered over the i, or over the n, depending on the font).

Therefore, while we may want to search and process text using decomposed forms, we should probably use the composed forms for display.
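
A sketch of that split, again assuming unorm's nfc()/nfkd():

var unorm = require( 'unorm' );

var stored = 'sgri\u0302n'; // decomposed: i + combining circumflex
var forSearch = unorm.nfkd( stored ); // keep decomposed for matching
var forDisplay = unorm.nfc( stored ); // 'sgr\u00EEn' renders reliably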
Comment 4 Ed Sanders 2013-06-28 12:54:09 UTC
Agreed. You're likely to do that naturally when displaying search results, but it would be a consideration if you try to highlight the matching substring in the result (a non-trivial problem when normalisation is involved).
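
One hedged approach to the highlighting problem: normalise character by character and record, for each code unit of the normalised output, the original index it came from. This ignores combining-mark reordering across character boundaries and treats each UTF-16 code unit separately (so astral characters would need extra care); a sketch, not a full solution:

var unorm = require( 'unorm' );

function buildOffsetMap( text ) {
	var i, j, chunk,
		normalized = '',
		map = []; // map[ n ] = index in `text` of normalized[ n ]
	for ( i = 0; i < text.length; i++ ) {
		chunk = unorm.nfkd( text.charAt( i ) );
		for ( j = 0; j < chunk.length; j++ ) {
			map.push( i );
		}
		normalized += chunk;
	}
	return { normalized: normalized, map: map };
}

// A match at [start, end) in the normalised text maps back to
// [ map[ start ], map[ end - 1 ] + 1 ) in the original for highlighting.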
Comment 5 D Chan 2013-07-11 14:39:54 UTC
So, if I'm understanding correctly, when the user starts a search we want to generate a normalised copy of the entire document in NFKD. (Otherwise we've got a problem keeping two copies in sync.) Is this acceptable efficiency-wise?

Could we force the characters in the document model to be in NFC? Could Parsoid provide the article text in NFC? (This is partially off-topic, but we probably want to consider different normalisation issues together.)
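
On the efficiency question, a sketch of a lazily cached normalised copy; the document event and accessor names ('transact', getText()) are assumptions about the integration point, and normalizeForSearch is the hypothetical helper sketched in the description:

function SearchIndex( doc ) {
	this.doc = doc;
	this.cache = null;
	// Invalidate whenever the document changes
	doc.on( 'transact', this.invalidate.bind( this ) );
}

SearchIndex.prototype.invalidate = function () {
	this.cache = null;
};

SearchIndex.prototype.getNormalizedText = function () {
	// Normalise the whole document at most once per change
	if ( this.cache === null ) {
		this.cache = normalizeForSearch( this.doc.getText() );
	}
	return this.cache;
};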
