Last modified: 2014-09-23 23:53:32 UTC
The nesting of definition lists like ";; x :: y" produces awful html. Moreover, the parser outputs different html for 2 dls with the common structre. The only difference between these lists is thet one of them is single-line and the other is not. The simple example: ;; x :: y ;; x :: y Output: <dl><dt> x </dt><dd><dl><dt></dt><dd> y </dd></dl> </dd></dl> <p><br> </p> <dl><dt></dt><dl><dt> x </dt><dd> y </dd></dl> IMHO, single-line dl parcing is not quite right. The emply <dt><dt> should stay before '<dt> x </dt>', like in multi-line variant. I've discussed the problem on #mediawiki. TimStarling suggested to treat the second semicolon as literal semicolon. It can be archived by adding new line: $oLine = preg_replace( '/;(;)+/', ';<nowiki>$1</nowiki>', $oLine ); after $preOpenMatch = preg_match('/<pre/i', $oLine ); PS. If there are more the one colon on the line (like in the first example), all colons, starting from 2nd will be also treaten as literals. Cause "; x :: y" acts in the same way.
Created attachment 2053 [details] Improved Parser.php - treats 2nd semicolon as literal Solves the problem of nested definition lists as described in the bug #6569.
Please include patches as diffs, not as entirely new files. To apply your changes, the devs would have to guess at what version you were working from, diff them themselves, and only then would they be able to apply the diff to the current version.
Created attachment 4063 [details] Patch that applies the above change patch for r25328
Single-line definition lists handling currently appears to be wildly inconsistent: http://www.mediawiki.org/wiki/User:GWicke/Definitionlists My personal preference would be to treat a '; x : y' pair as a syntactic unit, so that *; bla : blub produces <ul> <li> <dl> <dt>bla </dt> <dd>blub</dd> </dl> </li> </ul> and *; bla :: blub results in <ul> <li> <dl> <dt>bla </dt> <dd>: blub</dd> </dl> </li> </ul> This would make it different from *; bla :: blub which imo should result in <ul> <li> <dl><dt>bla </dt></dl> </li> </ul> <dl> <dd><dl><dd>blub</dd></dl> </dl> to stay consistent with general nested-list handling. This is also how lists are currently interpreted in the prototype PEG parser and HTML serializer we are currently working on: http://www.mediawiki.org/wiki/Future/Parser_plan.
From IRC conversation with Gabriel just now -- the patch might be technically fine, but it appears to be inconsistent with general nested list behaviour, and Gabriel it makes more sense to treat ; bla : blub as a unit. So the patch needs more discussion on https://lists.wikimedia.org/mailman/listinfo/wikitext-l . It could be that this patch is obviated by the new parser being developed ( https://www.mediawiki.org/wiki/Future ).
Adding the newparser keyword so we keep this issue in mind for it.
Additional information from http://lists.wikimedia.org/pipermail/wikitext-l/2011-November/000483.html. Nested definition lists are rare enough to allow us to decide on a new standard without breaking too many pages: > Can we deconstruct the current parser's processing steps and build a set > of rules that must be followed? I think the commonly-used structures are quite clearly defined, but the behaviour of these strange permutations is quite unspecified. The parser output for the case reported in the bug already changed in the meantime.. > I think we need to get a dump of English Wikipedia and start using a > simple PEG parser to scan through it looking for patterns and figuring > out how often certain things are used - if ever. I just ran an en-wiki article dump through a zcat/tee/grep pipeline: pattern count example ------------------------------------------------------------------ ^ 548498738 (total number of lines) ^; 681495 ^;[^:]+: 153997 ; bla : blub ^[;:*#]+;[^:]+: 3817 *; bla : blub ^;; 2332 ^[:;*#]*;[^:]*:: 41 most probably ;:: ^[;:*#]*;[^:]+:: 17 ;; bla :: blub Nested definition lists are not exactly common. Lines starting with ';;' often appear as comments in code listings. The most common other application appears to be indentation and emphasis. Any change in the produced structure that keeps indentation and bolding should thus avoid breaking pages.
(In reply to comment #7) Dan, I'm marking this patch reviewed per Gabriel's comments; it would be great if you could reply, revise, and resubmit. Thanks!
*** Bug 11894 has been marked as a duplicate of this bug. ***
We added several parser tests documenting Parsoid's behavior in parserTests.txt, but disabled them for the PHP parser for now. Please test the patch against those. The expected output might need whitespace adjustment to match the PHP parser output. The Parsoid parser test runner renormalizes whitespace, so should still pass after those changes.