Last modified: 2014-01-03 17:49:19 UTC
Input HTML <p>Foo<ref>Bar</ref>Baz</p> Expected wikitext Foo<nowiki><ref>Bar</ref></nowiki>Baz Actual wikitext Foo<ref>Bar</ref>Baz This is rather problematic if a user is documenting MediaWiki operations. :-(
Hmm ... I guess somehow this one fell through the cracks or we have had regressions (which means our wt-escaping tests have holes in them). I was sure this was being handled already, and a cursory glance through the escaping code shows that there are checks for extensions, etc. Anyway, we will investigate next week.
So, actually, this is a slightly different use case than what is currently supported. Try entering any of the extension tags in VE and Review changes. Parsoid escapes the ext-tag like text as long as VE sends them back to Parsoid as text, i.e. <p>Foo&ot;ref>Bar</ref>Baz</p>. I verified this behavior in my local install of VE. This is similar to entering <div> or any other HTML tag as text in VE. If VE does not escape those tag strings to text, Parsoid will not be able to escape them. Currently, Domino parses "<ref>foo</ref>" as a valid XML tree which is the same behavior as the browser (verified in the web console). So, while Parsoid could check the node name against the set of valid HTML5 tags and nowiki the others (since we are not serializing a generic XML domtree), why would VE not entity-escape "<" and ">" chars entered as text before sending them to Parsoid, i.e. is this a valid use case in the context of VE editing?
Sounds a lot like a VE bug then. James, can you verify that you are sending those tags over the wire with correct escaping?
(In reply to comment #2) > So, actually, this is a slightly different use case than what is currently > supported. > > Try entering any of the extension tags in VE and Review changes. Parsoid > escapes the ext-tag like text as long as VE sends them back to Parsoid as > text, > i.e. <p>Foo&ot;ref>Bar</ref>Baz</p>. I verified this behavior in my > local install of VE. This is similar to entering <div> or any other HTML tag > as > text in VE. If VE does not escape those tag strings to text, Parsoid will not > be able to escape them. > That's how it should work, but on current master I get: $ echo "<p><ref></p>" | node js/tests/parse.js --html2wt <ref> $ echo "<p><ref> <code><ref></code></p>" | node js/tests/parse.js --html2wt <ref> <code><ref></code>
Ah, I see .. The first example is correct behavior since isolated "<ref>" or "</ref>" are always parsed back as text and hence there is no need to nowiki-escape them. However, the escaping code does need an update to deal with the second example. It currently only handles matching opening/closing extension tags in the same text node. [subbu@earth tests] echo "<ref>foo</ref>" | node parse --html2wt <nowiki><ref>foo</ref></nowiki> [subbu@earth tests] echo "<ref>foo" | node parse --html2wt <ref>foo [subbu@earth tests] echo "<ref>foo<p></ref></p>" | node parse --html2wt <ref>foo </ref> Unlike escaping for link/heading tags, where we only need to worry about wikitext on a single-line, the opening and closing tags for extensions can be split anywhere in the DOM. So, a conservative approach would be to always escape individual <ref> or </ref> tags which, in some cases, might lead to excess nowiki escaping, but will always be safe.
(In reply to comment #5) > So, a conservative approach would be to always > escape individual <ref> or </ref> tags which, in some cases, might lead to > excess nowiki escaping, but will always be safe. That seems like a sane approach to me.
Change 103598 had a related patch set uploaded by Subramanya Sastry: (Bug 57469) Fix for nowiki escaping of ext-tag like text https://gerrit.wikimedia.org/r/103598
Change 103598 merged by jenkins-bot: (Bug 57469) Fix for nowiki escaping of ext-tag like text https://gerrit.wikimedia.org/r/103598
Stray </ref> tags on these pages leads to dirty diffs (but acceptable, IMO) because of the fix: * http://parsoid.wmflabs.org/_rt/enwiki/Ken_Curtis * http://parsoid.wmflabs.org/_rt/enwiki/Koshi_Barrage * http://parsoid.wmflabs.org/_rt/arwiki/%D8%AD%D8%A7%D8%B6%D9%86%D8%A7%D8%AA_%D8%A7%D9%84%D8%A3%D8%B9%D9%85%D8%A7%D9%84 and a bunch of other pages. Adding this info here for the record in case we need to dig up pages later.
Looks like stray </ref> tags are not that uncommon .. A few tens of dirty diffs in rt-testing are all probably because of this. Another source of nowiki-ed dirty diffs seem to be because of buggy ref tags. That is captured in bug 59598 Together, these two categories of buggy ref-wikitext account for ~50 dirty diffs in rt-testing (out of about 160K pages tested). Still a small number overall, and selser will hide most of this in production, but could show up once in a while, but nothing that warrants more complex code in Parsoid to hide them.
(In reply to comment #10) > Still a small number > overall, and selser will hide most of this in production, but could show up > once in a while, but nothing that warrants more complex code in Parsoid to > hide > them. Agreed. These diffs are also classified as syntactical in our rt test setup, so should not distract too much from serious semantic diffs.