Last modified: 2010-05-15 15:29:17 UTC
If you use the following table, the "Weird invisible text" will be invisible when you preview the page before saving it. It shows up normally if you save the page and only affects the page preview. If you remove <!--bar-->, it all works normally. {| |- | valign="top" | <!--foo--> Weird invisible text |} <!--bar-->
I can reproduce the disappearing text, but I don't see it showing "normally if you save the page".
http://en.wikipedia.org/w/wiki.phtml?title=User:Angela/Sandbox&oldid=5879312 is a saved version of it. The text is not invisible there (which is what I meant by showing normally). It's only when you preview the page that it's invisible.
It seems to stay invisible if there's any following text: http://meta.wikimedia.org/w/wiki.phtml?title=Sandbox&oldid=60172 Perhaps the preview/editing-related stuff triggers the same thing.
Created attachment 39 [details] Make preview more consistent with displayed article The fact that the preview behaves differently from viewing the article is due to the way newlines are appended to the preview text (I'm not sure why). If this patch is applied, "Weird invisible text" shows up in the preview, as expected. But if you add <p><p> to the end of the testcase, you can reproduce the problem both in the preview and in the saved article, as Brion said.
The patch may be a good thing (please check that it produces correct spacing in the output with both preview before & preview after edit box options set), but the root problem is the vanishing text itself...
Indeed, the root problem is this line in Parser.php: # Remove HTML comments $text = preg_replace( '/(\\n *<!--.*--> *(?=\\n)|<!--.*-->)/sU', '$2', $text ); The regular expression is supposed to remove comments, but it transforms this: #### BEGIN text {| |- | valign="top" | <!--foo--> Weird invisible text |} <!--bar--> <p> #### END text into this: #### BEGIN text {| |- | valign="top" | <p> #### END text The desired result of the regex is: #### BEGIN text {| |- | valign="top" | Weird invisible text |} <p> #### END text The regex will need to be fixed; I'll try to take a look at this more later.
replacing it with $text = preg_replace( '/(\\n *<!--.*--> *|<!--.*?-->)/sU', '', $text ); works for me. I don't see why this should be anchored to an end-of-line. Tested it with some combinations, and didn't break any trivial example. Committed to CVS HEAD for further testing.
The problem with that is if you have something like abc <!-- comment --> def your code eats the newline and gives "abc def", while the original code gives abc def and the second seems more correct to me. This is really tricky.
abc def would start preformatted text. But that should only happen if the line starts with a blank. It doesn't. It starts with a <!--. So at least the space should also be eaten: abc def But, abc <!-- comment --> def should not become abc def since that would add a paragraph, which is not intended. But: * abc <!-- comment --> def currently becomes * abc def and this probably should not happen
I was just pointing out that your code represents a change in behavior from the original. In addition to the problem you mentioned, what about this: abc <!-- comment --> def That looks more like preformatted text, but your regexp would still make "abc def", generating only normal text. The only thing I have been able to convince myself is correct so far is this: $text = preg_replace( '/\\n *<!--.*--> *(?=\\n)/sUe', '(strstr("$1", "-->")) ? "$0" : ""', $text ); $text = preg_replace( '/<!--.*-->/sU', '', $text ); it's slow and quite ugly, but I think at least does what we want.
Err, I meant $text = preg_replace( '/\\n *<!--(.*)--> *(?=\\n)/sUe', '(strstr("$1", "-->")) ? "$0" : ""', $text ); $text = preg_replace( '/<!--.*-->/sU', '', $text ); This gets your example right: abc <!-- comment --> def But another constraint is that abc <!-- comment --> <!-- comment --> def should give the same thing; at least, that's how it works now. To make that work you can't capture the newline following the first comment, because then the second comment won't match the pattern and have its newline removed. But without capturing the newline at the end of a comment, how can you know whether to remove trailing whitespace? In short, I'm not sure there's a good way to do this without either changing the current behavior, or doing two separate passes as I did above.
Created attachment 40 [details] Custom function to remove comments I got tired of trying to make a regexp do the right thing, so I wrote a custom function to strip out the comments by hand. I hope this will be more robust than the previous methods, and in my tests, it was faster. :) Let me know what you think.
I just realized the change JeLuF committed is very broken. Now in the text abc <!-- test --> blah blah blah blah <!-- test --> all the "blah"s get removed. The "U" modifier makes all quantifiers ungreedy by default, so the ? acually enables greediness. Oops! I also think the problem with the problem with lists and preformatted text must be addressed. My second patch would fix this, but if it's too ugly, I can look for another solution....
Sorry for the spam, but I think this will do the trick: $text = preg_replace_callback( '/(\\n)??( *)<!--.*-->( *(?=\\1))??/sU', removeComment, $text ); function removeComment($m) { if ($m[1] === "\n") return ''; else return $m[2] . $m[3]; } although the custom function still seems faster for larger inputs.
Committed the custom function to CVS HEAD.