Last modified: 2014-10-14 21:53:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65642, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63642 - Parsoid outputs <small> as child of <figure>
Parsoid outputs <small> as child of <figure>
Status: REOPENED
Product: Parsoid
Classification: Unclassified
General (Other open bugs)
unspecified
All All
: Unprioritized major
: ---
Assigned To: Gabriel Wicke
:
Depends on:
Blocks: 54844
  Show dependency treegraph
 
Reported: 2014-04-07 18:13 UTC by Roan Kattouw
Modified: 2014-10-14 21:53 UTC (History)
6 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Roan Kattouw 2014-04-07 18:13:04 UTC
On http://parsoid.wmflabs.org/enwiki/User:Tamravidhir?oldid=603142295 , I see two images that look like:

<figure><small><a href="..."><img src="..." /></a><figcaption>Caption text</figcaption></small></figure>

You can find them by opening that link in a browser and typing document.querySelectorAll('figure>small') in a console, then clicking on the result (Firebug) or right-clicking the result and choosing "Reveal in elements panel" (Chrome).

Putting a <small> inside a <figure> tag but wrapped around a <figcaption> tag is a violation of the image spec (in addition to just being insane), which causes VisualEditor to get very confused and corrupt the image, which causes Parsoid to output [[undefined|NaNxNaNpx|right|link=]], see https://en.wikipedia.org/w/index.php?title=User:Tamravidhir&diff=603142575&oldid=603142295 . This is reminiscent of bug 62805 (NaNxNaNpx).

The wikitext that triggers this is fairly complicated, and I haven't been able to isolate a test case locally. It seems to be something like:

* '''Text'''
<small>
[[File:Foo|right|200px|Caption]]

Text

[[File:Foo|right|200px|Caption]]
</small>
* '''Text'''
<small>
etc.
Comment 1 C. Scott Ananian 2014-04-07 18:34:17 UTC
I'm fairly certain that LinkHandler never outputs <small>, so my guess is that this is a foster-parenting bug.
Comment 2 C. Scott Ananian 2014-04-07 18:36:04 UTC
(02:34:55 PM) RoanKattouw: This structure is in a table cell BTW
(02:35:00 PM) RoanKattouw: It's like the 104th child of a <td>
(02:35:08 PM) RoanKattouw: Mismatched tags seems plausible
(02:35:15 PM) RoanKattouw: There's a +</div> elsewhere in the dirty diff
Comment 3 ssastry 2014-04-07 19:24:34 UTC
Reduced test case: Every newline in there is important to reproduce this.

{|
|
<small>
[[File:Foo.jpg|right|Test]]
</small>
|}
Comment 4 ssastry 2014-04-08 17:06:55 UTC
node parse --trace html --dump dom:post-builder reveals that Parsoid generates a <p><small></p><figure>..</figure></small> (imagine them to be html tokens input to the HTML5 treebuilder) which the treebuilder helpfully fixes up to <p><small></small></p><figure><small>..</small></figure>

But, the odd thing is that the tree builder does this only when this is nested inside the table. Not sure if that is a feature or a bug. To be tested in Firefox and Chrome.

Replace <small> with <b>/<i>/.. and you see the same behavior. But, I think this is the same class of bugs around formatting tags wrapping certain kind of block tags and getting fixed up by HTML5 parsing semantics. In the case of unclosed tags, we have bug reports where a <s> or a <ul> formatting tag continues to the rest of the document (bug 50536, bug 57196). I am recording this info here for now. Unclear what the exact solution is for this here, but clearly, we need a generic solution for misnested formatting tags.
Comment 5 ssastry 2014-04-11 18:27:35 UTC
Reports from Roan on IRC:

* Not very common, but seen on a few user pages.
* This corrupts wikitext on save in pages where it is seen.
* Community liasion wants this fixed in some fashion.
* More important to protect content than provide editability.

If possible, we should probably wrap such images with uneditable protection (mw:Placeholder) and deal with the larger issue later.
Comment 6 C. Scott Ananian 2014-04-14 15:53:18 UTC
Firefox fixes this up as follows:

> div=document.createElement('div')
> div.innerHTML = "<p><small></p><figure><figcaption>foo</figcaption></figure</small>"
> div.outerHTML
"<div><p><small></small></p><figure><figcaption><small>foo</small></figcaption></figure></div>"

But:
> div.innerHTML = "<p><small></p><figure><a href='xxx'><img></a><figcaption>foo</figcaption></figure></small>"
> div.outerHTML
"<div><p><small></small></p><figure><small><a href="xxx"><img></a><figcaption>foo</figcaption></small></figure></div>"

So it seems to be the <a> tag in particular which causes the treebuilder to resurface the <small> tag.  The same thing happens if <small> is <b> or <i> or <s> or <u>, but not <a>:

> div.innerHTML = "<p><a href=yyy></p><figure><a href='xxx'><img></a><figcaption>foo</figcaption></figure></a>"
> div.outerHTML
"<div><p><a href="yyy"></a></p><figure><a href="xxx"><img></a><figcaption>foo</figcaption></figure></div>"

We might be able to handle this with a focused post-processing step which looks for <small>/<b>/<i>/<s>/<u> as the direct child of a <figure> and hoists the formatting safely inside the <figcaption>.  In theory we should have a similar check in WTS in case we get invalid HTML as input.
Comment 7 C. Scott Ananian 2014-04-25 19:48:04 UTC
For the WTS case, it's probably sufficient to tweak the code that looks for the various bits of <figure> and ensure it skips over any extraneous <small>/<b>/<i>/<s>/<u> as direct children of <figure>.  That part wouldn't require a separate fixup pass.
Comment 8 Gerrit Notification Bot 2014-04-25 20:39:04 UTC
Change 129818 had a related patch set uploaded by Cscott:
WIP: WTS: Handle tag fixup that floats inside <figure>.

https://gerrit.wikimedia.org/r/129818
Comment 9 Gerrit Notification Bot 2014-04-28 17:37:18 UTC
Change 130109 had a related patch set uploaded by Subramanya Sastry:
WIP: (Bug 63642): Cleanup treebuilder fixup of formatting elts.

https://gerrit.wikimedia.org/r/130109
Comment 10 Gerrit Notification Bot 2014-05-05 09:12:26 UTC
Change 130109 merged by jenkins-bot:
(Bug 63642): Cleanup treebuilder fixup of formatting elts.

https://gerrit.wikimedia.org/r/130109
Comment 11 ssastry 2014-05-05 14:01:23 UTC
This will be deployed on Wednesday.
Comment 12 ssastry 2014-05-06 23:00:25 UTC
Will close this .. Please reopen if this is not fixed post deploy tomorrow.
Comment 13 C. Scott Ananian 2014-06-17 22:34:33 UTC
See bug 66749 which appears to be a dup?

Also, the WTS side of this ( https://gerrit.wikimedia.org/r/129818 ) never got finished/merged.
Comment 14 Roan Kattouw 2014-10-14 20:04:44 UTC
This is still causing problems. See https://ru.wikipedia.org/?diff=66125825 for instance.

$ echo "<small>[[File:Foo.jpg|right|300px]]</small>" | node tests/parse.js --wt2wt

<small>[[File:Foo.jpg|right|300px|link=]]

This seems to happen because of something similar to what I described in bug 66749 comment 7: there is a <small> inside a <figure>, which violates Parsoid's HTML spec, and causes a round-trip failure. Additionally, there's a frivolous <p><small></small></p> before and <p></p> after the image.
Comment 15 ssastry 2014-10-14 20:14:17 UTC
Any thoughts about https://gerrit.wikimedia.org/r/#/c/162816/?
Comment 16 Roan Kattouw 2014-10-14 20:18:44 UTC
(In reply to ssastry from comment #15)
> Any thoughts about https://gerrit.wikimedia.org/r/#/c/162816/?

I don't understand the patch very well, but the changes to the test cases look promising.
Comment 17 ssastry 2014-10-14 21:53:56 UTC
That patch also fixes your example from Comment 14:

[subbu@earth lib] echo "<small>[[File:Foo.jpg|right|300px]]</small>" | node parse.js --wt2wt
<small>[[File:Foo.jpg|right|300px]]</small>

Okay, will revive it and try to get a resolution on it soon enough. I need to study behavior on more convoluted test cases and see what Tidy does and what we do now and see if I can fold that into this patch.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links