Last modified: 2014-08-23 15:20:28 UTC
As an alternative solution to Bug 3461, MediaWiki should add non-breaking spaces automatically in appropriate places when it renders a page:
* After numbers, so that "100 km" stays together
* Before dashes (like http://en.wikipedia.org/wiki/Template:Ndash )
Don't worry too much about false positives, since an extra non-breaking space won't cause any serious problems unless many of them occur on the same line.
Clarifying that this requests an addition to the existing automatic rules, rather than creating a new feature.
(In reply to comment #1)
> Clarifying that this requests an addition to the existing automatic
> rules, rather than creating a new feature.
Is there any documentation for the existing rules?
Documentation? Don't be silly, this is MediaWiki! ;)
You can find the current rules in Parser::parse(), though:
# Clean up special characters, only run once, next-to-last before doBlockLevels
$fixtags = array(
# french spaces, last one Guillemet-left
# only if there is something before the space
'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2',
# french spaces, Guillemet-right
'/(\\302\\253) /' => '\\1&nbsp;',
'/&nbsp;(!\s*important)/' => ' \\1', #Beware of CSS magic word !important, bug #11874.
);
(In reply to comment #3)
> Documentation? Don't be silly, this is MediaWiki! ;)
I wasn't expecting a book. :) Just a link to a mailing list or a prior bug report.
> You can find the current rules in Parser::parse(), though:
Ok, so currently all it does is:
* Changes "some : word" into "some&nbsp;: word" and likewise for ? : ; ! % »
* Changes "« " into "«&nbsp;"
* Breaks things inside HTML tags :)
So adding one before dashes is easy enough. Just add a hyphen and the codes for en and em dashes to the ?|:|;|!|% regexp.
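As a rough illustration of that change, here is a minimal Python sketch of the existing "space before punctuation" rule with the hyphen, en dash and em dash added. It is not the parser code: the names NO_BREAK_BEFORE and nbsp_before_punctuation are made up for this example, and it works on Unicode strings rather than the UTF-8 byte escapes the PHP uses.

```python
import re

NBSP = "\u00a0"

# Existing set (? : ; ! % and the closing guillemet) plus the proposed
# additions: hyphen-minus, en dash (U+2013) and em dash (U+2014).
NO_BREAK_BEFORE = "?:;!%\u00bb" + "-\u2013\u2014"

# (.) requires some character before the space, as the existing rule does.
_BEFORE = re.compile(r"(.) (?=[" + re.escape(NO_BREAK_BEFORE) + r"])")

# Undo the replacement in front of the CSS keyword "!important".
_IMPORTANT = re.compile(NBSP + r"(!\s*important)")


def nbsp_before_punctuation(text: str) -> str:
    """Replace the space before the listed characters with a no-break space."""
    text = _BEFORE.sub(lambda m: m.group(1) + NBSP, text)
    return _IMPORTANT.sub(lambda m: " " + m.group(1), text)
```

With this sketch, "1914 – 1918" keeps the dash attached to "1914", while "red !important" is left with an ordinary space.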
I'd like it to also add a &nbsp; for anything like "10 kiloohm" or "100 MW". We could either write a huge regular expression covering every unit and prefix that exists (http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js), or we could just apply the rule whenever a number is followed by a space that is followed by a letter. The Manual of Style actually recommends as much:
Active MoS editors generally believe that something along these lines would be great. If you want to simplify the rule to "number space letter gets replaced by no-break space", then the MoS editors believe that additional markup would be useful for the no-break space, probably a double-comma (,,) (that is, the double-comma would be typed and shown in the edit window, and would be rendered as a hard space in the text). The reason is that we don't want automatically-inserted invisible characters to start multiplying in the text, as additions and deletions are made; we want to be able to see them, and easily insert and delete them. On the other hand, if you use very specific rules to insert no-break spaces exactly where most style manuals want them inserted (and I like http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js as a good start), then perhaps the double-comma markup is not necessary; we'll be happy to take anything you can give us and try it out.
I should add: I'm talking about en.wikipedia.org. It's my sense that GA and FA article reviewers are included in the long list of people who have approved the idea; if it makes a difference, I'll be happy to survey their opinions.
(In reply to comment #5)
> If you want to simplify the rule to "number space letter gets replaced
> by no-break space", then the MoS editors believe that additional markup would
> be useful for the no-break space
No. This is about adding a non-breaking space automatically when the page is rendered. Please don't add even more markup to the already cluttered and confusing syntax. Wiki markup is not like HTML, where you have to specify formatting and detail every little thing. The whole point of a wiki is that you enter semantic information, and it takes care of all the formatting and other little details for you.
> The reason is that we don't want
> automatically-inserted invisible characters to start multiplying in the text,
> as additions and deletions are made
They won't be multiplying over time and they won't be visible in the edit box. This wouldn't affect the code in the edit box at all. It would only affect the HTML of the final rendered article.
Thanks for the explanation; I agree that's more elegant if the wizards can do it. Would anyone like me to survey among article reviewers and MoS editors to see if they see potential problems from a broad rule such as "number space letter never wraps"?
(In reply to comment #8)
> Would anyone like me to survey among article reviewers and MoS editors to
> see if they see potential problems from a broad rule such as "number space
> letter never wraps"?
Absolutely. The Manual of Style recommends adding a non-breaking space for this case (not just units), but there are certainly a few cases where it shouldn't be added. False positives won't cause much of a problem, though, since they just prevent line wrapping at that point, and it can't happen multiple times in a row to create a page-widening attack. ("1 a 1 a 1 a" --> "1&nbsp;a 1&nbsp;a 1&nbsp;a")
Oh wait. :) "a1 a1 a1 a1" --> "a1&nbsp;a1&nbsp;a1&nbsp;a1"
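To make the two cases above concrete, here is a small Python sketch of the broad "number space letter" rule, plus a helper that measures the longest stretch of text a browser could no longer wrap. Both function names are invented for this illustration, and "letter" is assumed to mean any Unicode letter.

```python
import re

NBSP = "\u00a0"


def nbsp_number_letter(text: str) -> str:
    """Turn the space in "digit space letter" into a no-break space."""
    # [^\W\d_] matches a word character that is neither a digit nor "_",
    # i.e. a letter.
    return re.sub(r"(?<=\d) (?=[^\W\d_])", NBSP, text)


def longest_unbreakable_run(text: str) -> int:
    """Length of the longest chunk that contains no ordinary space."""
    return max(len(chunk) for chunk in text.split(" "))
```

For "1 a 1 a 1 a" every second space survives, so the longest run stays at three characters. For "a1 a1 a1 a1" every space is replaced and the whole string becomes a single unbreakable run.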
Why worry about spacing here? You can just write aaaaaaaaaaaaa... and widen to your heart's content. :)
I'm surveying the WP:MOSNUM people now and I gave them the http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js list to tweak. Not wrapping at "number space letter" is a non-starter. More than 90% of the time, that will be something we want to wrap, such as "the 1969 Mets World Series" or "9999 bottles of beer".
(In reply to comment #12)
> More than
> 90% of the time, that will be something we want to wrap, such as "the 1969 Mets
> World Series" or "9999 bottles of beer".
Why would we want those to wrap? MOSNUM currently recommends that they don't.
Why would we want to keep them from wrapping? MOSNUM is nonsense, recommending non-breaking spaces in places where they are not needed, and not recommending them in places where they are needed. It is also vague and ambiguous, arguably recommending a nonbreaking space at the star in "Ninety-nine*bottles of beer", and in the first space but not saying anything about the second space in a paper weight of "75 g m<sup>−2</sup>"; if that breaks, it should be between the 5 and the g, NOT between the g and the m, which is not only what the MoS rule says, but it is ALSO what we would get if this bug/feature request were implemented.
(In reply to comment #14)
> Why would we want to keep them from wrapping?
Why wouldn't we? See:
There is no consensus for not wrapping "9999 bottles of beer". If the letter and number are long, it may well produce clumsy final text; the key question is whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be turned off when it does cause trouble.
(In reply to comment #16)
> There is no consensus for not wrapping "9999 bottles of beer".
Please discuss at http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla
then we can come back here and tell the devs what we want
(In reply to comment #16)
> the key question is
> whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be
> turned off when it does cause trouble.
It does. Try a long string of « word »« word »« word »« word » vs <nowiki>« word »« word »« word »« word »</nowiki>
Why not create a MediaWiki: message with a space, comma, or whatever you want, separated list of units.
Then take that message, quote (escape) it, and convert the separators into |'s so that it becomes a proper regex alternation.
Then just add a &nbsp; with the regex [/(\d+) (<quoted | list here>)/S, "\\1&nbsp;\\2"]
That way only real units have the nbsp added, and additionally wikis may localize the units, and also add any newer or custom units such as fake units which apply only to their wiki. Or instead they can just replace the message with a - and have the whole thing disabled if they don't want it.
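A minimal Python sketch of that idea follows. It assumes the message body is a plain string of units separated by whitespace or commas, and that a body of "-" disables the rule; compile_unit_rule and apply_unit_rule are hypothetical names, not MediaWiki functions.

```python
import re
from typing import Optional

NBSP = "\u00a0"


def compile_unit_rule(message: str) -> Optional[re.Pattern]:
    """Build a "number space unit" pattern from a separated list of units."""
    message = message.strip()
    if message == "-":
        return None  # a lone "-" switches the feature off
    units = [u for u in re.split(r"[\s,]+", message) if u]
    if not units:
        return None
    # Escape each unit so characters such as "%" or "." are taken literally.
    alternation = "|".join(re.escape(u) for u in units)
    # (?!\w) stops "k" from matching the start of "km" or "kilometres".
    return re.compile(r"(\d) (?=(?:" + alternation + r")(?!\w))")


def apply_unit_rule(text: str, rule: Optional[re.Pattern]) -> str:
    """Insert a no-break space between a number and a listed unit."""
    if rule is None:
        return text
    return rule.sub(lambda m: m.group(1) + NBSP, text)
```

For example, with the list "km, MW kiloohm" the text "100 km" is joined, while "9999 bottles" keeps its ordinary space.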
Please don't discuss the merits of various ideas here, discuss them on-wiki and report on consensus. Bugzilla is an even worse discussion forum than talk pages. :)
We've had localizable regexes before that were part of the parser, like linktrail, but AFAIK those have been disabled as too scary. They can still be localized per-language, but only in the PHP files, not in the MW-namespace messages.
(In reply to comment #19)
> Why not create a MediaWiki: message with a space, comma, or whatever you want,
> separated list of units.
Why not create a MediaWiki: message where one could add regular expressions and their replacements? Then every language (this discussion here is very en-focused) could add its own rules, test them, and so on...
For German and many related languages, the "digit space letter" rule would be wrong too often, I believe.
A few examples translated to English, using "_" to represent the non-breaking space:
1) word space digit rules:
the year 1960 and ==> year_1960 and
a class 23354 consumer good ==> a class_23354 consumer good
laid down in ISO 4711 and not in ==> in ISO_4711 and
an ASA 22 film ==> an ASA_22 film
this is in paragraph 16 of the law on ==> in paragraph_16 of
but article 3 in the constitution ==> but article_3 in
king Henry 8 did ==> king Henry_8 did
2) more complex:
the years 1970 and 71 ==> years 1970_and_71
is 17 and a half miles from home ==> is 17_and_a_half_miles from home
was 18 miles and three eighth until ==> was 18_miles and three_eighth until
my 22 years old sister ==> my 22_years_old sister
took 23 years until ==> took 23_years until
I doubt that this can be done in a language-independent way. We would still have quite a few false positives, such as:
found the article 19 feet behind the
went in that year 1999 soldiers to
according to ISO 1234 people in Spain
(Note that English word order and comma rules make English much less prone to some of these.)
Currencies and their abbreviations can appear both before and after the figures they relate to, so we should have both a " currency space [+-] digit " and a " digit space currency " rule, and probably tolerate " In week 17 € 1500 were spent " unless we can make a " 'week' space digit " rule consume the 17 on its own, hiding it from the currency rules.
Also, there are style rules like these:
we saw 1 young man ==> saw a young man / saw one young man
not even 7 sailors ==> not even seven sailors
when 12 candles ==> when twelve candles
with 13 grumps ==> with 13_grumps
So I suggest a language specific, or language group specific, kind of treatment.
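One way to picture such per-language treatment is a rule table keyed by language code. The Python sketch below covers only the simple "word space digit" case from the examples above. The word lists are illustrative placeholders (the "de" entries in particular are guesses, not an agreed rule set), and the function name is hypothetical.

```python
import re

NBSP = "\u00a0"

# Words that should stay attached to a following number, per language.
# Purely illustrative; a real list would come from each wiki community.
WORD_BEFORE_NUMBER = {
    "en": ["year", "class", "ISO", "ASA", "paragraph", "article"],
    "de": ["Jahr", "Klasse", "ISO", "ASA", "\u00a7", "Artikel"],
}


def apply_language_rules(text: str, lang: str) -> str:
    """Apply the "word space digit" rule for one language, if it has one."""
    words = WORD_BEFORE_NUMBER.get(lang)
    if not words:
        return text  # no rules defined for this language
    alternation = "|".join(re.escape(w) for w in words)
    # (?<!\w) keeps "class" from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(" + alternation + r") (?=\d)")
    return pattern.sub(lambda m: m.group(1) + NBSP, text)
```

The same table also reproduces the false positives listed above: "found the article 19 feet behind the" gets a no-break space after "article", although the number belongs to "feet".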
*** Bug 18443 has been marked as a duplicate of this bug. ***
(In reply to comment #3)
> Documentation? Don't be silly, this is MediaWiki! ;)
Heh. I've created https://meta.wikimedia.org/wiki/Help:Newlines_and_spaces#Non-breaking_spaces
(In reply to comment #20)
> Please don't discuss the merits of various ideas here, discuss them on-wiki and
> report on consensus. Bugzilla is an even worse discussion forum than talk
> pages. :)
Perhaps we can summarize on that Meta page (and even discuss in its talk)?
> We've had localizable regexes before that were part of the parser, like
> linktrail, but AFAIK those have been disabled as too scary. They can still be
> localized per-language, but only in the PHP files, not in the MW-namespace
This still holds true, so I suppose this is the way here too, and I've written it in the above page. I'm not going to summarize anything else from these two bugs because they're too long, but feel free if you find something consensual. :-)
fyi: Because of bug #18443 I already started a discussion at w:de concerning German typography.
At https://de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen there's an unfinished table called 'regexps' which will resolve bug #18443 and this bug at least for w:de.
That table is still under construction. If it's finished I'll inform you here.
I’d like to point out one approach that was discussed in w:de some years ago (the discussion fell asleep back then):
Use of underscores for thin- and non-breaking-spaces within the wiki-code:
One underscore for thin-space: _ ⇒ “&thinsp;”
Two underscores for n-b-space: __ ⇒ “&nbsp;”
Underscores are hardly ever used, except for links (there a filter can easily be implemented). In those rare remaining cases, the nowiki-tag should be used.
This would allow every user with minimal experience to use correct typography, avoid long lists of common abbreviations such as the one started on the German project page, and ensure that copy-paste errors involving spaces are easy to detect.
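A minimal Python sketch of how such a filter might look, assuming that "thin space" means U+2009, that only [[...]] links and <nowiki> sections need protecting, and that the function name is free to choose (none of this is existing MediaWiki behaviour):

```python
import re

THIN_SPACE = "\u2009"
NBSP = "\u00a0"

# Spans that must be left alone: wiki links and <nowiki> sections.
_PROTECTED = re.compile(r"(\[\[.*?\]\]|<nowiki>.*?</nowiki>)", re.DOTALL)


def expand_underscores(wikitext: str) -> str:
    """Replace "__" with a no-break space and "_" with a thin space."""
    parts = _PROTECTED.split(wikitext)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            out.append(part)  # captured link or nowiki span, unchanged
        else:
            # Handle the double underscore first so it is not read as two
            # single ones.
            out.append(part.replace("__", NBSP).replace("_", THIN_SPACE))
    return "".join(out)
```

A real implementation would also have to leave magic words such as __TOC__ and bare URLs untouched, which this sketch does not attempt.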
(In reply to comment #27)
> Use of underscores for thin- and non-breaking-spaces within the wiki-code:
This is bug 3461, please continue there.
It would be helpful to fix this bug at least for numbers followed by SI units, and perhaps for some widely used non-SI units (such as ft, kn/kt, mph, sm/nm).
Thanks Matthias. Would really be nice to see movement on this after all these years ... it would make VE so much prettier too if we didn't have to deal with some nbsp-equivalent in VE.