Last modified: 2014-08-23 15:20:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T15619, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 13619 - Add non-breaking spaces in additional places automatically
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Low enhancement with 6 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/wiki/Wikipedi...
Duplicates: 18443
Reported: 2008-04-05 20:06 UTC by Omegatron
Modified: 2014-08-23 15:20 UTC (History)

Description Omegatron 2008-04-05 20:06:01 UTC
As an alternative solution to bug 3461, MediaWiki should add non-breaking spaces automatically on page render in appropriate places:

* After numbers, so that "100 km" stays together
* Before dashes (like http://en.wikipedia.org/wiki/Template:Ndash )
* etc.

Don't worry too much about false positives, since an extra non-breaking space won't cause any serious problems unless many of them occur on the same line.
Comment 1 Brion Vibber 2008-04-07 21:38:19 UTC
Clarifying that this requests an addition to the existing automatic &nbsp; rules, rather than creating a new feature.
Comment 2 Omegatron 2008-04-07 21:42:05 UTC
(In reply to comment #1)
> Clarifying that this requests an addition to the existing automatic &nbsp;
> rules, rather than creating a new feature.

Is there any documentation for the existing rules?
Comment 3 Brion Vibber 2008-04-08 23:31:25 UTC
Documentation? Don't be silly, this is MediaWiki! ;)

You can find the current rules in Parser::parse(), though:

# Clean up special characters, only run once, next-to-last before doBlockLevels
$fixtags = array(
	# french spaces, last one Guillemet-left
	# only if there is something before the space
	'/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&nbsp;\\2',
	# french spaces, Guillemet-right
	'/(\\302\\253) /' => '\\1&nbsp;',
	'/&nbsp;(!\s*important)/' => ' \\1', #Beware of CSS magic word !important, bug #11874.
);
Comment 4 Omegatron 2008-04-09 01:19:52 UTC
(In reply to comment #3)
> Documentation? Don't be silly, this is MediaWiki! ;)

I wasn't expecting a book. :) Just a link to a mailing list or prior bug report.

> You can find the current rules in Parser::parse(), though:

Ok, so currently all it does is:
* Changes "some : word" into "some&nbsp;: word" and likewise for ? : ; ! % »
* Changes "« " into "«&nbsp;"
* Breaks things inside HTML tags :)
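
As a rough illustration of the three rules summarized above, here is a minimal Python sketch. It is a transcription for experimentation, not the PHP code in Parser::parse(); the names `FIXTAGS` and `fix_tags` are hypothetical, and it assumes that what the rules insert is the literal entity `&nbsp;`.

```python
import re

NBSP = "&nbsp;"  # assumption: the rules emit the literal HTML entity

# Hypothetical transcription of the three quoted rules; they run in order.
FIXTAGS = [
    # A space before ? : ; ! % or », only if some character precedes the space.
    (re.compile(r"(.) (?=\?|:|;|!|%|»)"), r"\1" + NBSP),
    # A space after «.
    (re.compile(r"(«) "), r"\1" + NBSP),
    # Undo the first rule in front of the CSS keyword "!important".
    (re.compile(r"&nbsp;(!\s*important)"), r" \1"),
]


def fix_tags(text: str) -> str:
    """Apply every rule in order and return the rewritten text."""
    for pattern, replacement in FIXTAGS:
        text = pattern.sub(replacement, text)
    return text
```

Because the substitutions run over the whole string, a space in front of "?" inside an HTML attribute value is rewritten as well, which is the kind of breakage the last bullet refers to.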

So adding one before dashes is easy enough.  Just add a hyphen and the codes for en and em dashes to the ?|:|;|!|% regexp.
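
A sketch of that change, in Python rather than the parser's PHP: the lookahead alternation is extended with a hyphen, an en dash and an em dash. The name `nbsp_before` is made up for the example, and inserting `&nbsp;` is an assumption about what the existing rules emit.

```python
import re

# Existing lookahead set (? : ; ! % ») plus hyphen, en dash and em dash.
BEFORE_MARK = re.compile(r"(.) (?=\?|:|;|!|%|»|-|–|—)")


def nbsp_before(text: str) -> str:
    """Replace the space in front of any listed mark with &nbsp;."""
    return BEFORE_MARK.sub(r"\1&nbsp;", text)
```

Note that a plain hyphen in the set also catches a space before a minus sign, as in "at -5 degrees", so it is somewhat broader than "before dashes".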

I'd like it to also add a &nbsp; for anything like "10 kiloohm" or "100 MW".  We could either write a huge regular expression for every unit and prefix that exists (http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js), or we could just apply the rule whenever a number is followed by a space and then a letter.  The Manual of Style actually recommends as much:

http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style#Non-breaking_spaces
http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_%28dates_and_numbers%29#Non-breaking_spaces
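
The simpler of the two options, "a number, a space, then a letter", could look roughly like this in Python. This is only a sketch: the names are hypothetical, the output entity `&nbsp;` is an assumption, and `[^\W\d_]` is used here as a way of saying "any letter".

```python
import re

# A digit, one space, then any letter ([^\W\d_] matches letters only).
NUMBER_SPACE_LETTER = re.compile(r"(\d) (?=[^\W\d_])")


def nbsp_after_number(text: str) -> str:
    """Join a number to the word that follows it with &nbsp;."""
    return NUMBER_SPACE_LETTER.sub(r"\1&nbsp;", text)
```

The rule needs no list of units, but it also joins a number to any following word, so "3 apples" is treated exactly like "100 MW".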
Comment 5 Dan Kindsvater 2008-04-10 23:00:26 UTC
Active MoS editors generally believe that something along these lines would be great.  If you want to simplify the rule to "number space letter gets replaced by no-break space", then the MoS editors believe that additional markup would be useful for the no-break space, probably a double-comma (,,) (that is, the double-comma would be typed and shown in the edit window, and would be rendered as a hard space in the text).  The reason is that we don't want automatically-inserted invisible characters to start multiplying in the text, as additions and deletions are made; we want to be able to see them, and easily insert and delete them.  On the other hand, if you use very specific rules to insert no-break spaces exactly where most style manuals want them inserted (and I like http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js as a good start), then perhaps the double-comma markup is not necessary; we'll be happy to take anything you can give us and try it out.
Comment 6 Dan Kindsvater 2008-04-10 23:04:13 UTC
I should add: I'm talking about en.wikipedia.org.  It's my sense that GA and FA article reviewers are included in the long list of people who have approved the idea; if it makes a difference, I'll be happy to survey their opinions.
Comment 7 Omegatron 2008-04-10 23:34:12 UTC
(In reply to comment #5)
> If you want to simplify the rule to "number space letter gets replaced
> by no-break space", then the MoS editors believe that additional markup would
> be useful for the no-break space

No.  This is about adding a non-breaking space automatically when the page is rendered.  Please don't add even more markup to the already cluttered and confusing syntax.  Wiki markup is not like HTML, where you have to specify formatting and detail every little thing.  The whole point of a wiki is that you enter semantic information, and it takes care of all the formatting and other little details for you.

> The reason is that we don't want
> automatically-inserted invisible characters to start multiplying in the text,
> as additions and deletions are made

They won't be multiplying over time and they won't be visible in the edit box.  This wouldn't affect the code in the edit box at all.  It would only affect the HTML of the final rendered article.
Comment 8 Dan Kindsvater 2008-04-11 02:48:28 UTC
Thanks for the explanation; I agree that's more elegant if the wizards can do it.  Would anyone like me to survey among article reviewers and MoS editors to see if they see potential problems from a broad rule such as "number space letter never wraps"?
Comment 9 Omegatron 2008-04-11 03:00:55 UTC
(In reply to comment #8)
> Would anyone like me to survey among article reviewers and MoS editors to
> see if they see potential problems from a broad rule such as "number space
> letter never wraps"?

Absolutely.  It's recommended in the manual of style to add a non-breaking space for this case (not just units), but there are certainly a few cases where it shouldn't be added.  False positives won't cause much of a problem, though, since it will just prevent things from line wrapping, and it can't happen multiple times in a row to create a page-widening attack.  ("1 a 1 a 1 a" --> "1&nbsp;a 1&nbsp;a 1&nbsp;a")
Comment 10 Omegatron 2008-04-11 03:06:58 UTC
Oh wait.  :)  "a1 a1 a1 a1" --> "a1&nbsp;a1&nbsp;a1&nbsp;a1"

Maybe we need to worry about that in some rare cases?  Or apply the rule only to numbers with no letters in them?  In JavaScript the pattern would be something like: \s[,.0-9]+
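
One way to read that suggestion is to require the number to be a whole whitespace-delimited token made only of digits, commas and periods, so that "a1" no longer counts as a number. A minimal Python sketch under that assumption (the lookbehind, the `&nbsp;` output and the name `nbsp_after_plain_number` are illustrative, not taken from the comment):

```python
import re

# (?<!\S)      the number starts at the beginning of the text or after whitespace
# [0-9][0-9.,]* digits with optional separators, e.g. "1,000" or "2.5"
# (?=[^\W\d_]) the space must be followed by a letter
PLAIN_NUMBER = re.compile(r"(?<!\S)([0-9][0-9.,]*) (?=[^\W\d_])")


def nbsp_after_plain_number(text: str) -> str:
    """Add &nbsp; after stand-alone numbers only."""
    return PLAIN_NUMBER.sub(r"\1&nbsp;", text)
```

With this pattern "a1 a1 a1 a1" is left alone, and in "1 a 1 a 1 a" only every other space becomes non-breaking, so a long unbreakable run cannot be built from it.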
Comment 11 Brion Vibber 2008-04-11 18:49:10 UTC
Why worry about spacing here? You can just write aaaaaaaaaaaaa... and widen to your heart's content. :)
Comment 12 Dan Kindsvater 2008-04-12 03:46:27 UTC
I'm surveying the WP:MOSNUM people now and I gave them the http://en.wikipedia.org/wiki/User:Bobblewik/monobook.js/unitformatter.js list to tweak.  Not wrapping at "number space letter" is a non-starter.  More than 90% of the time, that will be something we want to wrap, such as "the 1969 Mets World Series" or "9999 bottles of beer".
Comment 13 Omegatron 2008-04-12 03:49:29 UTC
(In reply to comment #12)
> More than
> 90% of the time, that will be something we want to wrap, such as "the 1969 Mets
> World Series" or "9999 bottles of beer".

Why would we want those to wrap?  MOSNUM currently recommends that they don't.
Comment 14 Gene Nygaard 2008-04-12 16:12:07 UTC
Why would we want to keep them from wrapping?  MOSNUM is nonsense, recommending non-breaking spaces in places where they are not needed, and not recommending them in places where they are needed.  It is also vague and ambiguous, arguably recommending a nonbreaking space at the star in "Ninety-nine*bottles of beer", and in the first space but not saying anything about the second space in a paper weight of "75 g m<sup>−2</sup>"; if that breaks, it should be between the 5 and the g, NOT between the g and the m, which is not only what the MoS rule says, but it is ALSO what we would get if this bug/feature request were implemented.
Comment 15 Omegatron 2008-04-12 16:17:10 UTC
(In reply to comment #14)
> Why would we want to keep them from wrapping?

Why wouldn't we?  See:

http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla
Comment 16 PMAnderson 2008-04-12 16:33:34 UTC
There is no consensus for not wrapping "9999 bottles of beer". If the letter and number are long, it may well produce clumsy final text; the key question is whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be turned off when it does cause trouble. 
Comment 17 Omegatron 2008-04-12 16:37:14 UTC
(In reply to comment #16)
> There is no consensus for not wrapping "9999 bottles of beer".

Please discuss at http://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style#No-break_spaces_discussion_continues_at_bugzilla

then we can come back here and tell the devs what we want
Comment 18 Omegatron 2008-04-12 16:39:22 UTC
(In reply to comment #16)
> the key question is
> whether "9999<nowiki> </nowiki>bottles" will disable this feature, so it can be
> turned off when it does cause trouble. 

It does.  Try a long string of « word »« word »« word »« word » vs <nowiki>« word »« word »« word »« word »</nowiki>
Comment 19 Daniel Friesen 2008-04-13 03:24:04 UTC
Why not create a MediaWiki: message containing a list of units, separated by spaces, commas, or whatever you want?

Then take that message, quote each unit with proper regex escaping, and convert the separators into |'s to turn it into a regex alternation.

Then just add a &nbsp; with the regex [/(\d+) (<quoted | list here>)/S, "\\1&nbsp;\\2"]

That way only real units have the nbsp added, and additionally wikis may localize the units, and also add any newer or custom units such as fake units which apply only to their wiki. Or instead they can just replace the message with a - and have the whole thing disabled if they don't want it.
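
The steps above can be sketched as follows, in Python rather than PHP. Everything named here is hypothetical: the message text would come from a wiki-editable system message, and `build_unit_rule` and `add_unit_nbsp` are invented for the example. The trailing `(?!\w)` is an addition of this sketch and not part of the regex proposed above.

```python
import re
from typing import Optional


def build_unit_rule(message: str) -> Optional[re.Pattern]:
    """Turn a space- or comma-separated unit list into one compiled regex.

    Returns None when the message is "-" or contains no units,
    which this sketch treats as "feature disabled".
    """
    message = message.strip()
    if message in ("", "-"):
        return None
    units = [u for u in re.split(r"[\s,]+", message) if u]
    if not units:
        return None
    # Longest first, so longer unit names are tried before shorter ones.
    units.sort(key=len, reverse=True)
    alternation = "|".join(re.escape(u) for u in units)
    # (?!\w) is added here so that the unit "m" does not match in "3 men".
    return re.compile(r"(\d+) (" + alternation + r")(?!\w)")


def add_unit_nbsp(text: str, rule: Optional[re.Pattern]) -> str:
    """Insert &nbsp; between a number and a listed unit."""
    if rule is None:
        return text
    return rule.sub(r"\1&nbsp;\2", text)
```

For example, with the message "km, m, MW, kiloohm" the text "100 km and 100 MW" becomes "100&nbsp;km and 100&nbsp;MW", while "3 men" is left untouched.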
Comment 20 Aryeh Gregor (not reading bugmail, please e-mail directly) 2008-04-13 15:01:34 UTC
Please don't discuss the merits of various ideas here, discuss them on-wiki and report on consensus.  Bugzilla is an even worse discussion forum than talk pages.  :)

We've had localizable regexes before that were part of the parser, like linktrail, but AFAIK those have been disabled as too scary.  They can still be localized per-language, but only in the PHP files, not in the MW-namespace messages.
Comment 21 Christian Thiele 2008-12-19 12:21:36 UTC
(In reply to comment #19)
> Why not create a MediaWiki: message with a space, comma, or whatever you want,
> separated list of units.

Why not create a MediaWiki: message where one could add regular expressions and their replacements? Then every language (this discussion here is very en-focused) could add its own rules, test them, and so on...
Comment 22 Purodha Blissenbach 2011-02-19 12:03:45 UTC
For German and many related languages, the "digit space letter" rule would be wrong too often, I believe.
A few examples, translated to English, using "_" to represent the non-breaking space:

1) word space digit rules:
  the year 1960 and  ==> year_1960 and
  a class 23354 consumer good ==> a class_23354 consumer good
  laid down in ISO 4711 and not in ==> in ISO_4711 and
  an ASA 22 film ==> an ASA_22 film
  this is in paragraph 16 of the law on ==> in paragraph_16 of
  but article 3 in the constitution ==> but article_3 in
  king Henry 8 did ==> king Henry_8 did

2) more complex:
  the years 1970 and 71 ==> years 1970_and_71
  is 17 and a half miles from home ==> is 17_and_a_half_miles from home
  was 18 miles and three eighth until ==> was 18_miles and three_eighth until
  my 22 years old sister ==> my 22_years_old sister
  took 23 years until ==> took 23_years until

I doubt that this can be done in a language-independent way. We would still have quite a few false positives, such as:

  found the article 19 feet behind the 
  went in that year 1999 soldiers to
  according to ISO 1234 people in Spain

(Note that English word order and comma rules make English much less prone to some of those.)

Currencies and their abbreviations can appear both in front of and after the figures they relate to, so we should have both a " currency space [+-] digit " and a " digit space currency " rule, and probably tolerate " In week 17 € 1500 were spent " unless we can make a " 'week' space digit " rule eat the 17 on its own, hiding it from the currency rules.

Also, there are style rules like these:

  we saw 1 young man ==> saw a young man / saw one young man
  ...
  not even 7 sailors ==> not even seven sailors
  ...
  when 12 candles ==> when twelve candles
  with 13 grumps ==> with 13_grumps

So I suggest a language specific, or language group specific, kind of treatment.
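
A language-specific treatment could be organized as a table of rules per language code. The Python sketch below is only an illustration: the `RULES` table, its word lists and the `&nbsp;` output are assumptions, and it uses the English translations from the examples above rather than real German rules.

```python
import re

# Hypothetical per-language rule table: (pattern, replacement) pairs.
RULES = {
    "en": [
        # "word space digit": designator followed by a number, e.g. "ISO 4711".
        (re.compile(r"\b(ISO|ASA|year|class|paragraph|article) (?=\d)"), r"\1&nbsp;"),
        # "digit space word": a number followed by "years", e.g. "23 years".
        (re.compile(r"(\d) (?=years\b)"), r"\1&nbsp;"),
    ],
}


def apply_language_rules(text: str, lang: str) -> str:
    """Apply the rules registered for lang; unknown languages pass through."""
    for pattern, replacement in RULES.get(lang, []):
        text = pattern.sub(replacement, text)
    return text
```

Even such a table reproduces the false positives listed above: "found the article 19 feet behind the" comes out as "found the article&nbsp;19 feet behind the", because a word list cannot tell that "19 feet" belongs together.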
Comment 23 Bawolff (Brian Wolff) 2012-07-29 17:53:06 UTC
*** Bug 18443 has been marked as a duplicate of this bug. ***
Comment 24 Nemo 2012-07-29 20:28:27 UTC
(In reply to comment #3)
> Documentation? Don't be silly, this is MediaWiki! ;)

Heh. I've created https://meta.wikimedia.org/wiki/Help:Newlines_and_spaces#Non-breaking_spaces

(In reply to comment #20)
> Please don't discuss the merits of various ideas here, discuss them on-wiki and
> report on consensus.  Bugzilla is an even worse discussion forum than talk
> pages.  :)

Perhaps we can summarize on that Meta page (and even discuss in its talk)? 

> We've had localizable regexes before that were part of the parser, like
> linktrail, but AFAIK those have been disabled as too scary. They can still be
> localized per-language, but only in the PHP files, not in the MW-namespace
> messages.

This still holds true, so I suppose this is the way here too, and I've written it in the above page. I'm not going to summarize anything else from these two bugs because they're too long, but feel free if you find something consensual. :-)
Comment 25 seth 2012-08-05 08:48:19 UTC
fyi: Because of bug #18443 I already started a discussion at w:de concerning German typography.

At https://de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen there's an unfinished table called 'regexps' which will resolve bug #18443 and this bug at least for w:de.
That table is still under construction. If it's finished I'll inform you here.
Comment 26 seth 2012-08-24 23:23:11 UTC
https://de.wikipedia.org/wiki/WD:TYP#automatische_leerzeichen
moved to
https://de.wikipedia.org/wiki/Wikipedia:Typografie/Automatische_Leerzeichen
Comment 27 Sebastian Werk 2013-01-02 14:53:34 UTC
I’d like to point out one approach that was discussed on w:de some years ago (the discussion fell asleep back then):

Use of underscores for thin- and non-breaking-spaces within the wiki-code:

One underscore for thin-space: _ ⇒ “ ”
Two underscores for n-b-space: __ ⇒ “ ”

Underscores are hardly ever used, except for links (there a filter can easily be implemented). In those rare remaining cases, the nowiki-tag should be used.

This would allow every user with minimal experience to use the correct typography, avoid long lists of common abbreviations like the one started on the German project page, and ensure that copy-paste errors involving spaces are easily detectable.
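
A minimal Python sketch of such an underscore filter, under stated assumptions: the set of protected spans (nowiki sections, internal links, bare URLs, and double-underscore magic words such as `__TOC__`) is a guess at what would need to be excluded, and the function names are hypothetical.

```python
import re

THIN_SPACE = "\u2009"
NBSP = "\u00a0"

# Spans that are copied through unchanged (an assumed, incomplete list).
PROTECTED = re.compile(
    r"<nowiki>.*?</nowiki>|\[\[.*?\]\]|https?://\S+|__[A-Z]+__",
    re.DOTALL,
)


def _convert(chunk: str) -> str:
    """Two underscores become a no-break space, a single one a thin space."""
    return chunk.replace("__", NBSP).replace("_", THIN_SPACE)


def underscores_to_spaces(wikitext: str) -> str:
    """Convert underscores everywhere except inside protected spans."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(wikitext):
        out.append(_convert(wikitext[pos:match.start()]))
        out.append(match.group(0))
        pos = match.end()
    out.append(_convert(wikitext[pos:]))
    return "".join(out)
```

Double underscores are replaced first, so "100__km" yields a single no-break space rather than two thin spaces.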
Comment 28 Nemo 2013-01-02 16:33:57 UTC
(In reply to comment #27)
> Use of underscores for thin- and non-breaking-spaces within the wiki-code:

This is bug 3461, please continue there.
Comment 29 Matthias Becker 2014-08-23 12:46:27 UTC
It would be helpful to fix this bug at least for numbers and SI units, and perhaps for some widely used non-SI units (such as ft, kn/kt, mph, sm/nm).
Comment 30 Dan Kindsvater 2014-08-23 15:20:28 UTC
Thanks Matthias. Would really be nice to see movement on this after all these years ... it would make VE so much prettier too if we didn't have to deal with some nbsp-equivalent in VE.

Dan
