Last modified: 2013-04-08 17:05:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T46085, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 44085 - VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCII word characters as boundaries
Status: RESOLVED FIXED
Product: VisualEditor
Classification: Unclassified
Component: Data Model
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: VE-deploy-2013-04-01
Assigned To: Ed Sanders
Keywords: i18n
Depends on:
Blocks: ve-multi-lingual
Reported: 2013-01-18 00:30 UTC by Trevor Parscal
Modified: 2013-04-08 17:05 UTC
CC List: 4 users

Description Trevor Parscal 2013-01-18 00:30:47 UTC
See http://inimino.org/~inimino/blog/javascript_cset for some work in this area.
Comment 1 Roan Kattouw 2013-03-12 02:13:32 UTC
Bit of clarification:

When the user clicks the link button in the toolbar and they haven't selected any text, we expand the selection in both directions from the cursor position and select the word the cursor is in, make that a link, then show the link inspector. The code that expands the selection to a full word is in ve.dm.SurfaceFragment, and apparently treats non-ASCII characters as word boundaries. The practical bug that this leads to is that if you put the cursor in the middle of "Möckernbrücke" (or "égalité", if you prefer French) and click the link button, only "ckernbr" (or "galit", respectively) will be selected and linkified. Obviously this is a problem for i18n in languages using an extended Latin alphabet like German, French and Polish, but it's a total nightmare for non-Latin languages like Russian, Hebrew and Japanese.
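
The behaviour described above can be reproduced with a minimal sketch. This is not the actual ve.dm.SurfaceFragment code: the helper name expandToWord, its signature and the ASCII-only character class are assumptions made for illustration. The helper grows a collapsed selection outwards for as long as the neighbouring character matches a word-character pattern.

```javascript
// Hypothetical helper, not the real VisualEditor API.
// Expands a cursor offset to the surrounding run of "word" characters.
function expandToWord( text, offset, wordChar ) {
	var start = offset,
		end = offset;
	// Walk left while the character before the selection is a word character
	while ( start > 0 && wordChar.test( text.charAt( start - 1 ) ) ) {
		start--;
	}
	// Walk right while the character after the selection is a word character
	while ( end < text.length && wordChar.test( text.charAt( end ) ) ) {
		end++;
	}
	return text.slice( start, end );
}

// Assumed ASCII-only word class: every non-ASCII letter acts as a boundary
var asciiWordChar = /[a-zA-Z0-9]/;

// "Möckernbrücke" with the cursor between "e" and "r"
console.log( expandToWord( 'M\u00f6ckernbr\u00fccke', 5, asciiWordChar ) ); // "ckernbr"
// "égalité" with the cursor between "a" and "l"
console.log( expandToWord( '\u00e9galit\u00e9', 3, asciiWordChar ) ); // "galit"
```

Because the pattern is a parameter, the same helper selects the whole word once it is given a character class that also covers non-ASCII letters.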
Comment 2 Ed Sanders 2013-03-13 12:05:12 UTC
Actually, Chinese & Japanese don't mark word boundaries at all. The only way to detect them is with a dictionary. We'll need a special case for these languages so we don't end up selecting entire sentences.
Comment 3 Ed Sanders 2013-03-13 12:43:01 UTC
http://xregexp.com/ has Unicode character class support. We may be able to pick out the data we need from it instead of using the whole library.
Comment 4 Ed Sanders 2013-03-13 13:49:56 UTC
To begin with, here is a patch that adds some test structure and fixes what we have already: https://gerrit.wikimedia.org/r/#/c/53564
Comment 5 D Chan 2013-03-13 16:31:23 UTC
If you're going to do lexicon-based word boundary detection in Chinese, maybe you could use a word list stored in a client-side Bloom Filter. 
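
To make the Bloom filter idea concrete, here is a rough sketch of the data structure. Everything in it is an assumption for illustration: the BloomFilter name, the bit count, the number of hashes, the seeded FNV-1a style hash and the three sample words. A real word list would need far more bits, and a lookup can return false positives, though never false negatives.

```javascript
// Minimal Bloom filter sketch; sizes and hash function are illustrative only.
function BloomFilter( bitCount, hashCount ) {
	this.bitCount = bitCount;
	this.hashCount = hashCount;
	this.bits = new Uint8Array( Math.ceil( bitCount / 8 ) );
}

// FNV-1a style string hash varied by a seed; returns an unsigned 32-bit integer
BloomFilter.prototype.hash = function ( word, seed ) {
	var i,
		h = ( 2166136261 ^ seed ) >>> 0;
	for ( i = 0; i < word.length; i++ ) {
		h ^= word.charCodeAt( i );
		h = Math.imul( h, 16777619 ) >>> 0;
	}
	return h;
};

// Sets one bit per hash function
BloomFilter.prototype.add = function ( word ) {
	var k, bit;
	for ( k = 0; k < this.hashCount; k++ ) {
		bit = this.hash( word, k ) % this.bitCount;
		this.bits[ bit >> 3 ] |= 1 << ( bit & 7 );
	}
};

// false means "definitely not in the list", true means "probably in the list"
BloomFilter.prototype.mightContain = function ( word ) {
	var k, bit;
	for ( k = 0; k < this.hashCount; k++ ) {
		bit = this.hash( word, k ) % this.bitCount;
		if ( !( this.bits[ bit >> 3 ] & ( 1 << ( bit & 7 ) ) ) ) {
			return false;
		}
	}
	return true;
};

// Sample lexicon: 中文, 维基, 百科
var sampleWords = [ '\u4e2d\u6587', '\u7ef4\u57fa', '\u767e\u79d1' ];
var lexicon = new BloomFilter( 1024, 3 );
sampleWords.forEach( function ( word ) {
	lexicon.add( word );
} );
console.log( lexicon.mightContain( '\u4e2d\u6587' ) ); // true
```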

I don't know if it's as much of a problem in Japanese; you could probably use (?<=\P{Han})(?=\p{Han}) as a good start (i.e. there is a word break be.
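
The pattern above is written in a PCRE/Java-like syntax that JavaScript engines did not accept when this bug was open. As a sketch only, assuming an ES2018 engine (lookbehind plus Unicode property escapes, where the script is spelled \p{Script=Han}), the same zero-width match could be tried out like this. The sample sentence is an assumption chosen for illustration, and the result is a heuristic, not real Japanese word segmentation.

```javascript
// Zero-width match at a position preceded by a non-Han character
// and followed by a Han character. Requires an ES2018 engine.
var hanStart = /(?<=\P{Script=Han})(?=\p{Script=Han})/u;

// 私は東京に行く: kanji words followed by hiragana particles or endings
var sentence = '\u79c1\u306f\u6771\u4eac\u306b\u884c\u304f';
var parts = sentence.split( hanStart );
console.log( parts ); // 私は, 東京に, 行く
```

Each chunk starts at a kanji run and keeps the kana that follows it, so a selection built this way would still include trailing particles.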
Comment 6 Ed Sanders 2013-03-13 17:19:08 UTC
As an incremental improvement I've expanded the letters and numbers groups to their Unicode categories: https://gerrit.wikimedia.org/r/#/c/53583/
We still need to think about which punctuation categories to add.
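
For illustration, the effect of treating the Unicode letter and number categories as word characters can be sketched with property escapes. This syntax needs an ES2018 engine and was not available in browsers when this bug was open, so it is not how the linked patch works; the sample strings are assumptions.

```javascript
// Word characters as Unicode letters and numbers (general categories L and N).
// Property escapes require the "u" flag.
var wordRun = /[\p{L}\p{N}]+/gu;

// Möckernbrücke is now a single run
console.log( 'M\u00f6ckernbr\u00fccke'.match( wordRun ) );
// Москва 2013: a Cyrillic word and a number
console.log( '\u041c\u043e\u0441\u043a\u0432\u0430 2013'.match( wordRun ) );
// The apostrophe is punctuation (category Po), so l'égalité still splits in two
console.log( "l'\u00e9galit\u00e9".match( wordRun ) );
```

The last case shows why the choice of punctuation categories matters: whether an apostrophe or similar mark belongs to the word decides what gets selected.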
Comment 7 Ed Sanders 2013-03-16 11:35:28 UTC
The Unicode standard has a fair amount to say on the matter. Ideally we would implement their standard.

http://www.unicode.org/reports/tr29/#Word_Boundaries
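
As a point of comparison, current JavaScript engines expose word segmentation through Intl.Segmenter, which is typically based on the word boundary rules of this report. It did not exist when this bug was open, so the sketch below only shows what standard-based segmentation yields for the example from comment 1. The locale and the sample text are assumptions.

```javascript
// Word-granularity segmenter; needs an engine that ships Intl.Segmenter
var segmenter = new Intl.Segmenter( 'de', { granularity: 'word' } );

// "Station Möckernbrücke, Berlin" with the cursor inside the second word
var text = 'Station M\u00f6ckernbr\u00fccke, Berlin';
var cursor = text.indexOf( 'kern' );

// containing() returns the segment that covers the given code unit index
var hit = segmenter.segment( text ).containing( cursor );
console.log( hit.segment, hit.index, hit.isWordLike );
```

The isWordLike flag separates words from the whitespace and punctuation segments that the segmenter also reports.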
Comment 8 Ed Sanders 2013-03-18 16:48:05 UTC
Like this: https://gerrit.wikimedia.org/r/#/c/54480 (well, apart from non-BMP characters...)
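
One likely reason non-BMP characters need separate handling is that JavaScript strings are indexed by UTF-16 code units, so a letter outside the BMP occupies two of them. The sketch below only illustrates that effect; it assumes an ES2018 engine for the property escape, and the sample character is arbitrary.

```javascript
// U+10330 GOTHIC LETTER AHSA lies outside the BMP:
// one letter, stored as a surrogate pair of two UTF-16 code units
var gothic = '\uD800\uDF30';

console.log( gothic.length ); // 2
// A single code unit is a lone surrogate, which is not a letter
console.log( /\p{L}/u.test( gothic.charAt( 0 ) ) ); // false
// Reading the full code point gives the letter back
var letter = String.fromCodePoint( gothic.codePointAt( 0 ) );
console.log( /\p{L}/u.test( letter ) ); // true
```

Code that steps through text one code unit at a time would therefore see a boundary in the middle of such a letter unless it moves by whole code points.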
