Last modified: 2013-10-02 19:15:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T43151, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 41151 - Linktrail&prefix is wrongly applied to CJK characters
Status: RESOLVED FIXED
Product: Parsoid
Classification: Unclassified
Component: token-stream transforms
Version: unspecified
Hardware: All All
Importance: Low normal
Target Milestone: ---
Assigned To: C. Scott Ananian
URL:
Depends on:
Blocks: ve-nonenglish 43332
Reported: 2012-10-18 07:18 UTC by Liangent
Modified: 2013-10-02 19:15 UTC
CC List: 6 users

Description Liangent 2012-10-18 07:18:12 UTC
Wikitext:

[[汉]]字

Parsoid:

<p data-parsoid="{&quot;dsr&quot;:[0,6]}"><a rel="mw:WikiLink" href="./汉" data-parsoid="{&quot;tsr&quot;:[0,6],&quot;bsp&quot;:[0,6],&quot;a&quot;:{&quot;href&quot;:&quot;./汉&quot;},&quot;sa&quot;:{&quot;href&quot;:&quot;汉&quot;},&quot;stx&quot;:&quot;simple&quot;,&quot;tail&quot;:&quot;字&quot;,&quot;dsr&quot;:[0,6]}">汉字</a></p>

PHP Parser <https://www.mediawiki.org/w/index.php?title=Project:Sandbox&oldid=594926>:

<p><a href="/w/index.php?title=%E6%B1%89&amp;action=edit&amp;redlink=1" class="new" title="汉 (page does not exist)">汉</a>字</p>
Comment 1 Gabriel Wicke 2012-10-19 15:46:19 UTC
We don't support the localized link trail regexp yet and default to the English one, so in this case the trail is not matched as such.

Currently our focus is to make Parsoid safe for the English Wikipedia first, so that we can release it as a demo in December. After that release the plan is to shift the focus to C++, which will also enable us to call back into PHP. That should allow us to reuse the existing localizations and message systems, so we don't spend too much time reinventing the wheel in a JavaScript prototype.
Comment 2 Gabriel Wicke 2012-10-19 16:28:24 UTC
[09:19] <liangent> $linkTrail = '/^([a-z]+)(.*)$/sD'; is in MessagesEn.php
[09:20] <liangent> this shouldn't include CJK characters, right?
[09:20] <gwicke> I would think so
[09:20] <liangent> but parsoid includes CJK chars in linktrail..
[09:21] <gwicke> interesting- I guess we approximate the regexp to something more liberal right now
[09:22] <gwicke> there is no i18n support yet, so we don't use the localized regexps
[09:23] <gwicke> we currently have tail:( ![A-Z \t(),.:\n\r-] tc:text_char { return tc } )* 
[09:24] <gwicke> I think the idea was to be very liberal about tails in the tokenizer, and to convert/validate based on language in token stream transforms
[09:24] <gwicke> invalid tails can then be converted back to a text token
[09:25] <gwicke> the A-Z might be a bit fishy in that context though..
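To make the mismatch discussed in this log concrete, here is a minimal Python sketch (not Parsoid or MediaWiki code). It renders the English `$linkTrail` pattern and an approximation of the quoted tokenizer rule as regular expressions and applies both to the text that follows a link. The names `LINK_TRAIL_EN`, `LIBERAL_TAIL` and `split_tail` are made up for illustration, and the sketch assumes that `text_char` accepts any character not excluded by the negative class.

```python
import re

# English $linkTrail from MessagesEn.php as quoted above.
# PHP's /s flag maps to re.DOTALL; /D (dollar matches only at the very end) maps to \Z.
LINK_TRAIL_EN = re.compile(r"^([a-z]+)(.*)\Z", re.DOTALL)

# Approximation of the quoted tokenizer rule: keep consuming characters until
# one of the excluded characters shows up.
# ASSUMPTION: text_char accepts every character outside the negative class.
LIBERAL_TAIL = re.compile(r"^([^A-Z \t(),.:\n\r-]*)(.*)\Z", re.DOTALL)


def split_tail(pattern, text_after_link):
    """Return (tail, rest); tail is '' when the text does not start with a tail."""
    m = pattern.match(text_after_link)
    if m is None:
        return "", text_after_link
    return m.group(1), m.group(2)


if __name__ == "__main__":
    for text in ("s are fun", "字"):
        print(repr(text),
              "english:", split_tail(LINK_TRAIL_EN, text),
              "liberal:", split_tail(LIBERAL_TAIL, text))
```

Under these assumptions both patterns agree on an ASCII tail such as "s", but only the liberal rule pulls "字" into the link, which matches the behavior reported in the description.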
Comment 3 Gabriel Wicke 2013-01-28 22:56:16 UTC
This specific example is working now as we have extended our negative char class to include something very close to the union of the complements of per-language character classes. This is a bit of a departure from default MediaWiki behavior, but might not be noticeable in practice. 

We'd get consistent link trail behavior across languages if this works ok. If it does not, we'd have to revert to traditional per-language regexps.

Liangent, how well does the new regexp work for Chinese?
Comment 4 Liangent 2013-01-29 01:46:46 UTC
(In reply to comment #3)
> Liangent, how well does the new regexp work for Chinese?

[[]]cjk[[]]

Here "cjk" gets treated as a link trail, which is usually not wanted. I believe there are real-world cases of this on zhwiki, as we usually don't put a space between Chinese text and embedded English words.
Comment 5 Gabriel Wicke 2013-02-19 19:07:53 UTC
We have switched to use per-language link trail (and prefix) regexps now with https://gerrit.wikimedia.org/r/#/c/48589/. Our HTML form defaults to English settings, but testing on a test page from the Chinese Wikipedia has a good chance of working. Can you verify that the test cases above are now fixed?
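As a rough illustration of per-language link trails (a sketch only, not the code from the Gerrit change), the Python snippet below looks up a trail pattern by language and splits a liberally tokenized tail into the part that belongs to the link and the part that goes back to plain text. The `en` pattern is the one quoted in comment 2. The empty-trail `zh` entry, the fallback to English for unknown languages, and the names `LINK_TRAILS` and `revalidate_tail` are assumptions made for this example.

```python
import re

# Per-language link trail patterns (PHP /s -> re.DOTALL, /D -> \Z).
LINK_TRAILS = {
    # Pattern quoted from MessagesEn.php in comment 2.
    "en": re.compile(r"^([a-z]+)(.*)\Z", re.DOTALL),
    # ASSUMPTION for illustration: a language whose trail is always empty.
    "zh": re.compile(r"^()(.*)\Z", re.DOTALL),
}


def revalidate_tail(lang, link_text, liberal_tail):
    """Return (link text including the valid trail, leftover plain text)."""
    # Hypothetical fallback: unknown languages use the English pattern.
    pattern = LINK_TRAILS.get(lang, LINK_TRAILS["en"])
    m = pattern.match(liberal_tail)
    if m is None:
        # No valid trail at all: the whole tail goes back to plain text.
        return link_text, liberal_tail
    return link_text + m.group(1), m.group(2)


if __name__ == "__main__":
    print(revalidate_tail("en", "bug", "s"))   # trail joins the link text
    print(revalidate_tail("en", "汉", "字"))    # CJK tail stays plain text
    print(revalidate_tail("zh", "中", "cjk"))   # assumed zh rule: no trail
```

With a table like this, the tokenizer can stay liberal while a later pass decides per language which characters really extend the link.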
Comment 6 Liangent 2013-02-24 12:19:31 UTC
(In reply to comment #5)
> We have switched to use per-language link trail (and prefix) regexps now with
> https://gerrit.wikimedia.org/r/#/c/48589/. Our HTML form defaults to English
> settings, but testing on a test page from the Chinese Wikipedia has a good
> chance of working. Can you verify that the test cases above are now fixed?

Can you make that HTML form accept an extra "language" option?
Comment 7 Gabriel Wicke 2013-02-26 01:03:33 UTC
We could do so, but it is pretty low priority for us. Note that you can also point the web service to a page in the user namespace like this:

http://parsoid.wmflabs.org/zh/User:Liangent/Test
Comment 8 Liangent 2013-02-26 11:40:25 UTC
(In reply to comment #7)
> We could do so, but it is pretty low priority for us. Note that you can also
> point the web service to a page in the user namespace like this:
> 
> http://parsoid.wmflabs.org/zh/User:Liangent/Test

Now linkprefix is applied incorrectly. Still the same use case: [[]]cjk[[]]

Expected: [[]]cjk[[]]

Actual: [[]][[|cjk文]]
Comment 9 Mark Holmquist 2013-02-27 18:24:33 UTC
There's still a pending patch that depends on a core patch. It's not testable on zhwiki until the core patch is deployed. I guess that will happen at wmf11, which is (at most) four weeks away from zhwiki. But it may have fixed this.

If someone could test it locally, it's at https://gerrit.wikimedia.org/r/50814 and the core patch is in the latest mediawiki (from git).
Comment 10 C. Scott Ananian 2013-10-02 18:47:20 UTC
This should be fixed now.  The core patch has been merged for some time.  Liangent, can you retest?
Comment 11 Liangent 2013-10-02 19:15:19 UTC
This seems fixed, but I found bug 54891 when testing it. Not sure whether this should be considered a bug on the VisualEditor side or in the Parsoid serializer.
