Last modified: 2013-04-08 17:05:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T46085, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 44085 - VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCII word characters as boundaries


Summary:	VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCI...

Status:	RESOLVED FIXED

Product:	VisualEditor
Classification:	Unclassified
Component:	Data Model
Version:	unspecified
Hardware:	All All

Importance:	High major
Target Milestone:	VE-deploy-2013-04-01
Assigned To:	Ed Sanders

URL:
Whiteboard:
Keywords:	i18n

Depends on:
Blocks:	ve-multi-lingual

Reported:	2013-01-18 00:30 UTC by Trevor Parscal
Modified:	2013-04-08 17:05 UTC
CC List:	4 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Trevor Parscal 2013-01-18 00:30:47 UTC

See http://inimino.org/~inimino/blog/javascript_cset for some work in this area.

Comment 1 Roan Kattouw 2013-03-12 02:13:32 UTC

Bit of clarification:

When the user clicks the link button in the toolbar and they haven't selected any text, we expand the selection in both directions from the cursor position and select the word the cursor is in, make that a link, then show the link inspector. The code that expands the selection to a full word is in ve.dm.SurfaceFragment, and apparently treats non-ASCII characters as word boundaries. The practical bug that this leads to is that if you put the cursor in the middle of "Möckernbrücke" (or "égalité", if you prefer French) and click the link button, only "ckernbr" (or "galit", respectively) will be selected and linkified. Obviously this is a problem for i18n in languages using an extended Latin alphabet like German, French and Polish, but it's a total nightmare for non-Latin languages like Russian, Hebrew and Japanese.
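
The behaviour described above can be reproduced with a minimal sketch. This is not the actual ve.dm.SurfaceFragment code: `expandToWord` and `asciiWordChar` are hypothetical names, and the ASCII-only character class is an assumption about the kind of pattern that produces the symptom.

```javascript
// Hypothetical helper, not the real VisualEditor implementation.
// Grows a collapsed cursor position outwards while the neighbouring
// characters count as word characters, then returns the covered text.
function expandToWord(text, offset, isWordChar) {
  let start = offset;
  let end = offset;
  while (start > 0 && isWordChar(text[start - 1])) {
    start--;
  }
  while (end < text.length && isWordChar(text[end])) {
    end++;
  }
  return text.slice(start, end);
}

// Assumed ASCII-only definition of a word character: anything outside
// A-Z, a-z, 0-9 and underscore is treated as a boundary.
const asciiWordChar = (ch) => /[A-Za-z0-9_]/.test(ch);

console.log(expandToWord("Möckernbrücke", 4, asciiWordChar)); // "ckernbr"
console.log(expandToWord("égalité", 3, asciiWordChar)); // "galit"
```

With the cursor inside either word, expansion stops at the first accented letter on each side, which matches the truncated selections reported here.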

Comment 2 Ed Sanders 2013-03-13 12:05:12 UTC

Actually Chinese & Japanese don't have any word boundaries at all. The only way to detect them is with a dictionary. We'll need a special case for these languages so we don't end up selecting entire sentences.

Comment 3 Ed Sanders 2013-03-13 12:43:01 UTC

http://xregexp.com/ has Unicode character class support. We may be able to pick out the data we need from it instead of using the whole library.

Comment 4 Ed Sanders 2013-03-13 13:49:56 UTC

To begin with, here is a patch that adds some test structure and fixes what we have already: https://gerrit.wikimedia.org/r/#/c/53564

Comment 5 D Chan 2013-03-13 16:31:23 UTC

If you're going to do lexicon-based word boundary detection in Chinese, maybe you could use a word list stored in a client-side Bloom Filter. 
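
As a rough illustration of the Bloom filter idea, the sketch below stores a word list in a fixed-size bit array. Everything in it is an assumption made for illustration: the class name `BloomFilter`, the seeded FNV-1a hash, the sizes, and the three-word sample list. A real lexicon would need far more bits and tuned parameters.

```javascript
// Minimal Bloom filter sketch (hypothetical, not from any VisualEditor patch).
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // FNV-1a over UTF-16 code units; the seed yields several hash functions.
  hash(word, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < word.length; i++) {
      h ^= word.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h % this.bitCount;
  }

  add(word) {
    for (let k = 0; k < this.hashCount; k++) {
      const bit = this.hash(word, k);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // false means "definitely not in the list"; true means "probably in the list".
  mightContain(word) {
    for (let k = 0; k < this.hashCount; k++) {
      const bit = this.hash(word, k);
      if ((this.bits[bit >> 3] & (1 << (bit & 7))) === 0) {
        return false;
      }
    }
    return true;
  }
}

// Tiny sample lexicon, for illustration only.
const lexicon = new BloomFilter(1024, 3);
for (const word of ["中国", "北京", "大学"]) {
  lexicon.add(word);
}
console.log(lexicon.mightContain("北京")); // true
```

A Bloom filter never reports a stored word as missing, but it can report an unknown string as present, so a word-boundary routine built on it would have to tolerate occasional false positives.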

I don't know if it's as much of a problem in Japanese; you could probably use (?<=\P{Han})(?=\p{Han}) as a good start (i.e. there is a word break be.
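
The pattern quoted above is written in Perl/PCRE-style syntax. A rough JavaScript equivalent is sketched below, assuming an engine that supports lookbehind and Unicode property escapes (both arrived years after this discussion). The function name `splitBeforeHan` and the sample sentence are illustrative only, and the sketch shows what the pattern matches rather than a complete Japanese segmenter.

```javascript
// Zero-width match at every position where a non-Han character is
// followed by a Han character (the quoted pattern, in JavaScript syntax).
const beforeHan = /(?<=\P{Script=Han})(?=\p{Script=Han})/u;

// Hypothetical helper: cut the text at each such position.
function splitBeforeHan(text) {
  return text.split(beforeHan);
}

console.log(splitBeforeHan("私は東京に行く"));
// [ '私は', '東京に', '行く' ]
```

Each chunk starts at a kanji run and carries the kana that follow it, which is only a heuristic: it cannot separate adjacent kanji words, and it splits nothing in text written purely in kana.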

Comment 6 Ed Sanders 2013-03-13 17:19:08 UTC

As an incremental improvement I've expanded the letters and numbers groups to their Unicode categories: https://gerrit.wikimedia.org/r/#/c/53583/
We still need to think about which punctuation categories to add.
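
A minimal sketch of word matching based on Unicode categories follows. It assumes a modern engine with `\p{...}` property escapes; these did not exist in 2013, so the linked patch could not have been written this way, and `wordAt` is a hypothetical name.

```javascript
// Hypothetical helper: return the run of letters and numbers
// (Unicode general categories L and N) that contains the given offset.
function wordAt(text, offset) {
  const wordPattern = /[\p{L}\p{N}]+/gu;
  for (const match of text.matchAll(wordPattern)) {
    const start = match.index;
    const end = start + match[0].length;
    if (offset >= start && offset <= end) {
      return match[0];
    }
  }
  return "";
}

console.log(wordAt("Möckernbrücke", 4)); // "Möckernbrücke"
console.log(wordAt("привет мир", 2)); // "привет"
console.log(wordAt("don't", 1)); // "don"
```

The last call shows why the punctuation question remains open: with letters and numbers alone, an apostrophe inside a word still acts as a boundary.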

Comment 7 Ed Sanders 2013-03-16 11:35:28 UTC

The Unicode standard has a fair amount to say on the matter. Ideally we would implement their standard.

http://www.unicode.org/reports/tr29/#Word_Boundaries
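
For comparison, current JavaScript engines expose locale-aware word segmentation through `Intl.Segmenter`. That API was not available when this bug was discussed and is not what any patch in this thread uses; the sketch below, with the hypothetical helper `wordAtOffset`, only shows what selecting the word under a cursor looks like when the engine does the segmentation.

```javascript
// Hypothetical helper built on the standard Intl.Segmenter API.
// Returns the word-like segment containing the given code unit offset,
// or an empty string if the offset falls on whitespace or punctuation.
function wordAtOffset(text, offset, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const part = segmenter.segment(text).containing(offset);
  return part && part.isWordLike ? part.segment : "";
}

console.log(wordAtOffset("Möckernbrücke ist schön", 4, "de")); // "Möckernbrücke"
console.log(wordAtOffset("Möckernbrücke ist schön", 13, "de")); // ""
```

Offset 13 is the space after the first word, so nothing word-like is returned there.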

Comment 8 Ed Sanders 2013-03-18 16:48:05 UTC

Like this: https://gerrit.wikimedia.org/r/#/c/54480 (well, apart from non-BMP characters...)
