
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T52167, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 50167 - VisualEditor: Support Unicode equivalence for client-side text searches
Status: ASSIGNED
Product: VisualEditor
Classification: Unclassified
Component: Data Model (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Editing team bugs – take if you're interested!
Depends on:
Blocks: ve-multi-lingual
Reported: 2013-06-25 11:02 UTC by Ed Sanders
Modified: 2014-02-28 23:41 UTC (History)
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Ed Sanders 2013-06-25 11:02:08 UTC
Planned features, such as searching for an existing reference by content, will require us to implement some form of Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence).

We will probably want to use NFKD ("Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.") to catch cases such as 'ﬀ' (U+FB00) === 'ff', and we will probably want to strip combining characters (i.e. all accents), so that 'Amelie' === 'Amélie'.

https://github.com/walling/unorm looks like a good library for the job. We may want to fork it into UnicodeJS.
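
For illustration, a minimal sketch of that pipeline, assuming unorm's documented nfkd() function; the normalizeForSearch name and the single combining-marks range are ours, not an existing API:

var unorm = require( 'unorm' );

function normalizeForSearch( text ) {
	// NFKD compatibility decomposition: 'ﬀ' (U+FB00) becomes 'ff',
	// 'é' becomes 'e' + U+0301
	var decomposed = unorm.nfkd( text );
	// Strip combining diacritical marks. Only U+0300–U+036F is covered
	// here; a complete solution would also strip the other combining
	// blocks (e.g. U+1AB0–U+1AFF, U+1DC0–U+1DFF, U+20D0–U+20FF).
	return decomposed.replace( /[\u0300-\u036f]/g, '' );
}

normalizeForSearch( 'Amélie' ) === normalizeForSearch( 'Amelie' ); // true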
Comment 1 D Chan 2013-06-25 12:30:02 UTC
We probably shouldn't strip down beyond NFKD. For some languages, 'ä' should be equivalent to 'a'; for others, it shouldn't be equivalent to anything; for still others, it should be equivalent to 'ae'.

Will it be feasible to implement language-specific search on top of this?
Comment 2 Ed Sanders 2013-06-25 14:04:31 UTC
I don't see why not. We may want to add mappings like 'ß' => 'ss' in German, or handle final vs. non-final sigma in Greek (https://en.wikipedia.org/wiki/Sigma#Character_Encodings).
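
Illustratively, that could sit on top of the generic normalisation as a per-language fold table; the table contents and function name below are hypothetical:

var foldTables = {
	de: { 'ß': 'ss' },
	el: { 'ς': 'σ' } // fold final sigma to medial sigma
};

function languageFold( text, lang ) {
	var table = foldTables[ lang ] || {};
	// Replace each character via the language's table, if present
	return text.replace( /[\s\S]/g, function ( ch ) {
		return table.hasOwnProperty( ch ) ? table[ ch ] : ch;
	} );
}

languageFold( 'Straße', 'de' ); // 'Strasse'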
Comment 3 D Chan 2013-06-28 11:47:00 UTC
It's worth noting that in most software, many common grapheme clusters are displayed more correctly when encoded as a single Unicode character than when encoded with combining characters. For example, 'sgrîn' ("sgr\u00EEn") displays correctly in my version of Firefox on Linux, but the equivalent decomposed string 'sgrîn' ("sgri\u0302n") shows up with the dot still on the i and the accent in the wrong place (either uncentered over the i, or over the n, depending on the font).

Therefore, while we may want to search and process text using decomposed forms, we should probably use the composed forms for display.
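
A sketch of that split, again assuming unorm's nfc()/nfkd():

var unorm = require( 'unorm' );

var stored = 'sgri\u0302n'; // decomposed: i + combining circumflex
var forSearch = unorm.nfkd( stored ); // keep decomposed for matching
var forDisplay = unorm.nfc( stored ); // 'sgr\u00EEn' renders reliably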
Comment 4 Ed Sanders 2013-06-28 12:54:09 UTC
Agreed. You're likely to do that naturally when displaying search results, but it would be a consideration if you try to highlight the matching substring in the result (a non-trivial problem when normalisation is involved).
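
One hedged approach to the highlighting problem: normalise character by character and record, for each code unit of the normalised output, the original index it came from. This ignores combining-mark reordering across character boundaries and treats each UTF-16 code unit separately (so astral characters would need extra care); a sketch, not a full solution:

var unorm = require( 'unorm' );

function buildOffsetMap( text ) {
	var i, j, chunk,
		normalized = '',
		map = []; // map[ n ] = index in `text` of normalized[ n ]
	for ( i = 0; i < text.length; i++ ) {
		chunk = unorm.nfkd( text.charAt( i ) );
		for ( j = 0; j < chunk.length; j++ ) {
			map.push( i );
		}
		normalized += chunk;
	}
	return { normalized: normalized, map: map };
}

// A match at [start, end) in the normalised text maps back to
// [ map[ start ], map[ end - 1 ] + 1 ) in the original for highlighting.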
Comment 5 D Chan 2013-07-11 14:39:54 UTC
So, if I'm understanding correctly, when the user starts a search we want to generate a normalised copy of the entire document in NFKD. (Otherwise we've got a problem keeping two copies in sync.) Is this acceptable efficiency-wise?

Could we force the characters in the document model to be in NFC? Could Parsoid provide the article text in NFC? (This is partially off-topic, but we probably want to consider different normalisation issues together.)
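
On the efficiency question, a sketch of a lazily cached normalised copy; the document event and accessor names ('transact', getText()) are assumptions about the integration point, and normalizeForSearch is the hypothetical helper sketched in the description:

function SearchIndex( doc ) {
	this.doc = doc;
	this.cache = null;
	// Invalidate whenever the document changes
	doc.on( 'transact', this.invalidate.bind( this ) );
}

SearchIndex.prototype.invalidate = function () {
	this.cache = null;
};

SearchIndex.prototype.getNormalizedText = function () {
	// Normalise the whole document at most once per change
	if ( this.cache === null ) {
		this.cache = normalizeForSearch( this.doc.getText() );
	}
	return this.cache;
};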
