Last modified: 2014-09-23 23:53:32 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T8569, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 6569 - Avoid nested definition lists
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Parser
Hardware: All All
Importance: Low normal with 1 vote
Target Milestone: ---
Assigned To: Gabriel Wicke
Keywords: need-parsertest, newparser, patch, patch-reviewed
Duplicates: 11894
Depends on:
Reported: 2006-07-06 12:37 UTC by Shtriter Andrew
Modified: 2014-09-23 23:53 UTC
CC List: 4 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Improved Parser.php - treats 2nd semicolon as literal (144.61 KB, text/plain)
2006-07-06 12:45 UTC, Shtriter Andrew
Patch that applies the above change (562 bytes, patch)
2007-08-31 12:22 UTC, Dan Collins

Description Shtriter Andrew 2006-07-06 12:37:41 UTC
Nesting definition lists, as in ";; x :: y", produces awful HTML.
Moreover, the parser outputs different HTML for two definition lists with the same structure.
The only difference between these lists is that one of them is written on a single line and the other
is not.
A simple example:
 ;; x :: y
 ;; x
 :: y

 <dl><dt> x&nbsp;</dt><dd><dl><dt></dt><dd> y
 <dl><dt></dt><dl><dt> x
 </dt><dd> y

IMHO, single-line dl parsing is not quite right. The empty <dt></dt> should come before
'<dt> x&nbsp;</dt>', as in the multi-line variant.

I've discussed the problem on #mediawiki. TimStarling suggested treating the second
semicolon as a literal semicolon. This can be achieved by adding a new line:
 $oLine = preg_replace( '/;(;)+/', ';<nowiki>$1</nowiki>', $oLine );

 $preOpenMatch = preg_match('/<pre/i', $oLine );
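
To make the effect of that substitution concrete, here is a minimal sketch that transliterates the same regular expression to Python's re module. The function name escape_extra_semicolons is hypothetical and is not part of Parser.php; the pattern and replacement are the ones from the proposed line.

```python
import re

def escape_extra_semicolons(line):
    # Same pattern as the proposed preg_replace: a semicolon followed by
    # one or more further semicolons. The repeated group captures only its
    # last repetition, so exactly one semicolon ends up inside <nowiki>.
    return re.sub(r';(;)+', r';<nowiki>\1</nowiki>', line)

if __name__ == "__main__":
    print(escape_extra_semicolons(";; x :: y"))
    print(escape_extra_semicolons(";;; x"))
```

Because a repeated capturing group keeps only its last match, a run of three or more semicolons collapses to one list marker plus one literal semicolon. A pattern such as ';(;+)' would keep every extra semicolon; that variant is a suggestion here, not something proposed in this report.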

PS. If there is more than one colon on the line (as in the first example), all colons
from the second onward will also be treated as literals. Because "; x :: y" acts in the same 
Comment 1 Shtriter Andrew 2006-07-06 12:45:40 UTC
Created attachment 2053 [details]
Improved Parser.php - treats 2nd semicolon as literal

Solves the problem of nested definition lists as described in the bug #6569.
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-17 04:46:58 UTC
Please include patches as diffs, not as entirely new files.  To apply your
changes, the devs would have to guess at what version you were working from,
diff them themselves, and only then would they be able to apply the diff to the
current version.
Comment 3 Dan Collins 2007-08-31 12:22:13 UTC
Created attachment 4063 [details]
Patch that applies the above change

patch for r25328
Comment 4 Gabriel Wicke 2011-11-10 14:14:47 UTC
Single-line definition lists handling currently appears to be wildly inconsistent:

My personal preference would be to treat a '; x : y' pair as a syntactic unit, so that

*; bla : blub




*; bla :: blub

results in

<dd>: blub</dd>

This would make it different from

*; bla
:: blub

which imo should result in


to stay consistent with general nested-list handling. This is also how lists are interpreted in the prototype PEG parser and HTML serializer we are currently working on.
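
The idea of treating '; x : y' as one syntactic unit can be sketched as follows. This is only an illustration of the splitting rule described in comment 4, not code from the PHP parser or the PEG prototype; the function name split_definition_line is hypothetical, and the sketch assumes any outer list prefix such as '*' has already been stripped from the line.

```python
def split_definition_line(line):
    """Split a single-line '; term : definition' item at the first colon.

    Returns (term, definition), or None if the line is not a ';' item.
    The definition is None when the line contains no colon at all.
    """
    if not line.startswith(";"):
        return None
    # Everything after the leading ';' is split at the FIRST colon only,
    # so any further colons stay in the definition as literal text.
    term, sep, definition = line[1:].partition(":")
    if not sep:
        return term.strip(), None
    return term.strip(), definition.strip()

if __name__ == "__main__":
    print(split_definition_line("; bla : blub"))
    print(split_definition_line("; bla :: blub"))
```

With this rule the second colon in '; bla :: blub' ends up inside the definition text, which corresponds to the '<dd>: blub</dd>' result given in the comment.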
Comment 5 Sumana Harihareswara 2011-11-10 14:24:46 UTC
From IRC conversation with Gabriel just now -- the patch might be technically fine, but it appears to be inconsistent with general nested list behaviour, and Gabriel thinks it makes more sense to treat ; bla : blub as a unit.  So the patch needs more discussion.  It could be that this patch is obviated by the new parser being developed.
Comment 6 Gabriel Wicke 2011-11-10 15:33:45 UTC
Adding the newparser keyword so we keep this issue in mind for it.
Comment 7 Gabriel Wicke 2011-11-14 13:45:23 UTC
Additional information: nested definition lists are rare enough to allow us to decide on a new standard without breaking too many pages:

> Can we deconstruct the current parser's processing steps and build a set
> of rules that must be followed?

I think the commonly-used structures are quite clearly defined, but the
behaviour of these strange permutations is quite unspecified. The parser
output for the case reported in the bug already changed in the meantime.

> I think we need to get a dump of English Wikipedia and start using a
> simple PEG parser to scan through it looking for patterns and figuring
> out how often certain things are used - if ever.

I just ran an en-wiki article dump through a zcat/tee/grep pipeline:

pattern                 count       example
^                       548498738   (total number of lines)
^;                      681495
^;[^:]+:                153997      ; bla : blub
^[;:*#]+;[^:]+:         3817        *; bla : blub
^;;                     2332
^[:;*#]*;[^:]*::        41          most probably ;::
^[;:*#]*;[^:]+::        17          ;; bla :: blub

Nested definition lists are not exactly common. Lines starting with ';;'
often appear as comments in code listings. The most common other
application appears to be indentation and emphasis. Any change in the
produced structure that keeps indentation and bolding should thus avoid
breaking pages.
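
The counts in comment 7 came from a zcat/tee/grep pipeline whose exact command line is not given. As a rough equivalent, the same tallies can be sketched in Python, assuming the input is an iterable with one wikitext line per string. The name count_patterns and the sample lines are made up for illustration; only the patterns themselves are taken from the table.

```python
import re

# Patterns copied from the table in comment 7.
PATTERNS = [
    r"^",                 # every line (total count)
    r"^;",
    r"^;[^:]+:",
    r"^[;:*#]+;[^:]+:",
    r"^;;",
    r"^[:;*#]*;[^:]*::",
    r"^[;:*#]*;[^:]+::",
]

def count_patterns(lines):
    """Return a dict mapping each pattern to the number of matching lines."""
    compiled = [(p, re.compile(p)) for p in PATTERNS]
    counts = {p: 0 for p in PATTERNS}
    for line in lines:
        for pattern, regex in compiled:
            if regex.search(line):
                counts[pattern] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "; bla : blub",
        "*; bla : blub",
        ";; bla :: blub",
        ";; comment in a code listing",
        "plain text",
    ]
    for pattern, count in count_patterns(sample).items():
        print(f"{pattern:<22}{count}")
```

A line can match several patterns at once (for example ';; bla :: blub' matches six of the seven), so the rows overlap rather than partition the dump.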
Comment 8 Sumana Harihareswara 2012-01-23 19:55:06 UTC
(In reply to comment #7)
Dan, I'm marking this patch reviewed per Gabriel's comments; it would be great if you could reply, revise, and resubmit.  Thanks!
Comment 9 Gabriel Wicke 2012-06-27 14:34:18 UTC
*** Bug 11894 has been marked as a duplicate of this bug. ***
Comment 10 Gabriel Wicke 2012-06-27 14:40:09 UTC
We added several parser tests documenting Parsoid's behavior in parserTests.txt, but disabled them for the PHP parser for now. Please test the patch against those. The expected output might need whitespace adjustment to match the PHP parser output. The Parsoid parser test runner renormalizes whitespace, so the tests should still pass there after those changes.
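
To illustrate why whitespace-only differences in the expected output need not break a test runner that renormalizes whitespace, here is a minimal sketch. It is not Parsoid's actual normalization, whose rules are not described in this report; normalize_html_whitespace and same_modulo_whitespace are hypothetical helpers that assume collapsing whitespace runs and dropping whitespace between tags is sufficient.

```python
import re

def normalize_html_whitespace(html):
    # Assumption: collapse every run of whitespace to a single space ...
    collapsed = re.sub(r"\s+", " ", html).strip()
    # ... and drop whitespace that sits only between two tags.
    return re.sub(r">\s+<", "><", collapsed)

def same_modulo_whitespace(expected, actual):
    """Compare two HTML strings after the simple normalization above."""
    return normalize_html_whitespace(expected) == normalize_html_whitespace(actual)

if __name__ == "__main__":
    a = "<dl>\n<dt> x\n</dt>\n</dl>"
    b = "<dl><dt> x </dt></dl>"
    print(same_modulo_whitespace(a, b))
```

A comparison like this treats line breaks and indentation as insignificant while still catching structural differences such as a missing or extra <dt>.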
