Last modified: 2012-07-29 17:53:06 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T20443, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 18443 - auto-insert of non-breaking whitespace where appropriate
Status: RESOLVED DUPLICATE of bug 13619
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Low enhancement with 6 votes
Target Milestone: ---
Assigned To: Nobody
Keywords: i18n
Reported: 2009-04-12 22:36 UTC by seth
Modified: 2012-07-29 17:53 UTC
CC List: 6 users

Description seth 2009-04-12 22:36:00 UTC
Problem:
There are several places where (thin) non-breaking spaces should be inserted, e.g., between numbers and units. Of course, there exist different spacing rules in different countries.
So far, inserting thin spaces still causes problems: "nbsp" is generally too wide, and "thinsp" and "U+202f" are not rendered as intended in Opera, and so on; see e.g. [http://de.wikipedia.org/wiki/Wikipedia:Meinungsbilder/Typographie_(Zwischenr%C3%A4ume)#Browser-Unterst.C3.BCtzung] (German).
Thin non-breaking spaces can be displayed with some HTML/CSS tricks, see [http://de.wikipedia.org/wiki/Schmales_Leerzeichen#.C3.9Cbergangsl.C3.B6sungen]. But that would complicate the source of articles too much.

Solution:
At http://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia/Archiv/2009/Woche_01#.26nbsp.3B (German, but with PHP source code) it was suggested to use (localized) regexps to insert whitespace automatically in some cases. With this modification we could easily auto-insert even sophisticated things like [http://de.wikipedia.org/wiki/Schmales_Leerzeichen#.C3.9Cbergangsl.C3.B6sungen] without obfuscating the article source code.

But:
Maybe such a thing would slow down the parsing of wikitext, so I guess it would be best to implement the idea on the test wiki first. Somebody should profile the parser after those changes. I could help by writing some fast regexps.
Comment 1 Matthias Becker 2010-12-13 11:14:31 UTC
Why not add a non-breaking space after a number in every case? That wouldn't do any harm, I think.
Comment 2 seth 2010-12-13 22:34:33 UTC
(In reply to comment #1)
> Why not add a non-breaking space after a number in every case? That wouldn't do
> any harm, I think.

This would not solve the problem, because not all cases contain numbers (like the German abbreviation "z.(thin non-breaking space)B.")

Apart from that there are several false positives like "In the year 2525 and ..." where there shouldn't be a non-breaking space after the "2525".

So I don't think this would be a good solution. I still believe that the already mentioned discussion at w:de (http://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia/Archiv/2009/Woche_01#.26nbsp.3B) gives a possible and good solution.
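
To make the two objections above concrete, here is a minimal PHP sketch of the rule suggested in comment 1. The function name naiveNbspAfterNumber and its pattern are assumptions made up for illustration; nothing like this is part of MediaWiki.

```php
<?php
// Hypothetical naive rule: make the space after every digit non-breaking.
function naiveNbspAfterNumber( string $text ): string {
    return preg_replace( '/(\d) /', '$1&#160;', $text );
}

// Wanted: number and unit stay together.
echo naiveNbspAfterNumber( '100 m' ), "\n";
// False positive: the space after "2525" becomes non-breaking too.
echo naiveNbspAfterNumber( 'In the year 2525 and later' ), "\n";
// Missed case: the abbreviation contains no digit, so nothing changes.
echo naiveNbspAfterNumber( 'z. B.' ), "\n";
```

The second and third calls show why a digit-only rule is both too broad and too narrow.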
Comment 3 Bawolff (Brian Wolff) 2010-12-13 23:25:36 UTC
The proposed solution on de.wiki (if I understand it right; Google Translate for German sucks very very badly) is to add: wfMsgForContent( 'nbsp-before-word' ) => '\\1 \\2' to the $fixtags array at around line 302 of Parser::parse in includes/parser/Parser.php. In other words, have a system message with a regex to tell MediaWiki where to put the non-breaking spaces.

This seems like a bad idea. First someone is bound to put an invalid regex in there (that could probably be worked around by checking for validity). Allowing the users to add an arbitrary regex that gets executed on all text when parsing seems like begging for someone to put something evil in there. Regexes are powerful, you can do quite computationally intensive thingies with them, sometimes without meaning to.

Additionally, mistakes could cause quite a mass of confusion. If someone for example set nbsp-before-word to be /./ say (or anything where they forgot the brackets), that would make the parser output only nonbreaking spaces, and break the entire site which would be quite disruptive.
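
A rough PHP sketch of both failure modes described in this comment. The helper isValidPattern is a hypothetical name, and the replacement string assumes the proposed message would insert &#160; between two captured groups. The validity check only catches patterns that fail to compile; it does nothing against valid but expensive or overly broad ones.

```php
<?php
// Hypothetical validity check: preg_match() returns false if the pattern does not compile.
function isValidPattern( string $pattern ): bool {
    return @preg_match( $pattern, '' ) !== false;
}

var_dump( isValidPattern( '/(\d) (?=m\b)/' ) ); // compiles
var_dump( isValidPattern( '/(\d (?=m\b)/' ) );  // unbalanced parenthesis, does not compile

// A valid pattern without capture groups: \1 and \2 are empty, so every
// character of the input is replaced by a lone non-breaking space.
$broken = preg_replace( '/./', '\\1&#160;\\2', 'abc' );
echo $broken, "\n";
```

So a compile check would address the first concern, while the second one (a valid pattern such as /./ wiping out the page text) would still get through.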
Comment 4 seth 2011-01-09 21:56:41 UTC
(In reply to comment #3)
> The proposed solution on de.wiki [...] is to add: wfMsgForContent(
> 'nbsp-before-word' ) => '\\1 \\2' to the $fixtags array in ~ line 302 of
> Parser::parse in includes/parser/Parser.php.

Right. Or even better:

  wfMsgForContent('auto-thinspace') => '\\1<span style="margin-left:0.167em"><span style="display:none">&nbsp;</span></span>\\2'

This leads to thin spaces which are compatible with all common browsers, see http://de.wikipedia.org/wiki/user:Raphael_Frey/Labor#Browser-Unterst.C3.BCtzung (the span-solution is the column called "Übergangslösung")
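
As a sketch of how that replacement would behave, assuming a made-up pattern for a number followed by one of a few units (the name autoThinSpace, the constant THIN_NBSP_SPAN and the unit list are illustrative only; the span markup is the one quoted above):

```php
<?php
// Markup from the proposal above: a narrow margin plus a hidden &nbsp;.
const THIN_NBSP_SPAN = '<span style="margin-left:0.167em"><span style="display:none">&nbsp;</span></span>';

// Hypothetical rule: a digit, a space, then a unit from a short fixed list.
function autoThinSpace( string $text ): string {
    $pattern = '/(\d) (km|kg|m)\b/';
    return preg_replace( $pattern, '\\1' . THIN_NBSP_SPAN . '\\2', $text );
}

echo autoThinSpace( 'a 100 m sprint' ), "\n";
// Unchanged, because "and" is not in the unit list:
echo autoThinSpace( 'In the year 2525 and later' ), "\n";
```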

Regarding the problems Bawolff mentioned, this is very similar to other regexp-based extensions like the spam-blacklist, the title-blacklist and the abuse filter (aka edit filter).
Of course only admins should be allowed to edit the regexps. And they have to be very careful, that's true; at least as careful as if they were editing the sbl, tbl or af.
Comment 5 Bawolff (Brian Wolff) 2012-07-27 12:06:02 UTC
(In reply to comment #4)
....
> 
> Regarding the problems Bawolff mentioned, this is very similar to other
> regexp-based extensions like the spam-blacklist, the title-blacklist and the
> abuse filter (aka edit filter).
> Of course only admins should be allowed to edit the regexps. And they have to be
> very careful, that's true; at least as careful as if they were editing the sbl,
> tbl or af.

Abuse filter/spam blacklist mistakes are easier to fix than what is proposed here. I think collecting a list of wanted rules and hardcoding those into MW is much more likely to succeed than letting admins dynamically add such rules. (As it stands, of course, one could already do this on the JS side, but that is really icky.)


(Note MW does have some rules for adding nbsp in certain contexts. The rules just aren't all that complex)
Comment 6 Nemo 2012-07-27 17:18:13 UTC
(In reply to comment #5)
> (Note MW does have some rules for adding nbsp in certain contexts. The rules
> just aren't all that complex)

What are they, by the way? I think this is not documented anywhere, but it would be important to keep it consistent if we add such a new rule.
Right now I can remember only the separators for digits, used by formatnum, which are defined in the MessagesXx files and can be modified only there.

Moreover, some such rules are defined by the [[International System of Units]] itself IIRC, and are not that easy to find, but may be included in some library already? The reporter/voters should probably do some investigation.
Comment 7 Bawolff (Brian Wolff) 2012-07-27 17:35:13 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > (Note MW does have some rules for adding nbsp in certain contexts. The rules
> > just aren't all that complex)
> 
> What are they, by the way? I think this is not documented anywhere, but it
> would be important to keep it consistent if we add such a new rule.
> Right now I can remember only the separators for digits, used by formatnum,
> which are defined in the MessagesXx files and can be modified only there.
> 
> Moreover, some such rules are defined by the [[International System of Units]]
> itself IIRC, and are not that easy to find, but may be included in some library
> already? The reporter/voters should probably do some investigation.

They're run towards the end of the parsing process (the original proposal linked in comment 0 actually refers to them).

Specifically, they are:

    # Clean up special characters, only run once, next-to-last before doBlockLevels
    $fixtags = array(
            # french spaces, last one Guillemet-left
            # only if there is something before the space
            '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
            # french spaces, Guillemet-right
            '/(\\302\\253) /' => '\\1&#160;',
            '/&#160;(!\s*important)/' => ' \\1', # Beware of CSS magic word !important, bug #11874.
    );
    $text = preg_replace( array_keys( $fixtags ), array_values( $fixtags ), $text );

In English they say:

*If you have a character (any character except a newline, including spaces), followed by a space, followed by any of the following characters: ?,:,;,!,% or » (U+BB), the space gets replaced with a non-breaking space.
*If you have a « (U+AB) followed by a space, that space is replaced by a non-breaking space.
*As an exception to these rules, if you have a non-breaking space followed by "!important", the non-breaking space is turned back into a normally breaking space. This is to prevent messing up CSS style attributes. (This isn't perfect, there's an open bug somewhere about css styles being messed up by this in edge cases).

Based on the guillemet characters, I imagine this is meant for the typographic rules of French.
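
The three rules can be tried in isolation with a small PHP sketch. The patterns are copied from the excerpt above; the wrapper function applyFixTags is a made-up name and not part of the parser, and the sample strings are assumed to be UTF-8.

```php
<?php
// Hypothetical wrapper around the three $fixtags patterns quoted above.
function applyFixTags( string $text ): string {
    $fixtags = [
        // space before ? : ; ! % or » becomes non-breaking
        '/(.) (?=\\?|:|;|!|%|\\302\\273)/' => '\\1&#160;',
        // space after « becomes non-breaking
        '/(\\302\\253) /' => '\\1&#160;',
        // undo the first rule in front of the CSS keyword !important
        '/&#160;(!\s*important)/' => ' \\1',
    ];
    return preg_replace( array_keys( $fixtags ), array_values( $fixtags ), $text );
}

echo applyFixTags( 'Quoi ?' ), "\n";                // Quoi&#160;?
echo applyFixTags( '« Bonjour »' ), "\n";           // «&#160;Bonjour&#160;»
echo applyFixTags( 'color: red !important' ), "\n"; // color: red !important
```

In the last call the first rule does insert &#160; before "!important", and the third rule then turns it back into a plain space.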
Comment 8 seth 2012-07-27 19:11:22 UTC
(In reply to comment #5)

Yes, a hardcoded solution would be OK. But at least in the beginning there should be an easy channel of communication (between admins and devs) for changes to those hardcoded rules.

The typographic rules[1] in Germany are quite complicated:
there should be a _narrow_ _non-breaking_ space inside of
* abbreviations (like 'z. B.', 'i. d. R.', 'u. a.')
* abbreviations with numbers (like '§ 315', 'Abs. 3', 'S. 78 ff')
* dates like '1. Mai'
* between numbers and units (like '100 m', '5 kg')

If I got an "OK" here, so that some dev would insert those hardcoded rules for w:de (and probably for all other de projects, too), then I could create some regexps.

[1] actually "rule" is not the right word here. "typographic sugar" would be a better description.
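
A rough idea of what hardcoded rules for the four cases listed in this comment might look like, as a PHP sketch. Every pattern here is an assumption written for illustration (fixed abbreviation and unit lists, the name germanThinSpaces, and &#8239; for U+202F NARROW NO-BREAK SPACE as output); none of it is an agreed rule set, and real rules would need far more checking for false positives.

```php
<?php
const NNBSP = '&#8239;'; // U+202F NARROW NO-BREAK SPACE as a numeric entity

// Hypothetical rule set; input is assumed to be UTF-8.
function germanThinSpaces( string $text ): string {
    $rules = [
        // abbreviations from a fixed list: 'z. B.', 'u. a.', 'i. d. R.'
        '/\b(z\.) (?=B\.)/u' => '\\1' . NNBSP,
        '/\b(u\.) (?=a\.)/u' => '\\1' . NNBSP,
        '/\b(i\.) (d\.) (?=R\.)/u' => '\\1' . NNBSP . '\\2' . NNBSP,
        // abbreviations with numbers: '§ 315', 'Abs. 3', 'S. 78'
        '/(§|\bAbs\.|\bS\.) (?=\d)/u' => '\\1' . NNBSP,
        // dates such as '1. Mai'
        '/(\d\.) (?=(?:Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\b)/u' => '\\1' . NNBSP,
        // numbers and units such as '100 m', '5 kg'
        '/(\d) (?=(?:km|kg|m)\b)/u' => '\\1' . NNBSP,
    ];
    return preg_replace( array_keys( $rules ), array_values( $rules ), $text );
}

echo germanThinSpaces( 'z. B. 100 m' ), "\n";
echo germanThinSpaces( '§ 315 Abs. 3' ), "\n";
echo germanThinSpaces( 'am 1. Mai' ), "\n";
```

Using explicit lists instead of generic patterns keeps the rules narrow, at the cost of having to extend the lists by hand.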
Comment 9 Bawolff (Brian Wolff) 2012-07-27 19:16:41 UTC
I imagine we'd want to change these rules so they're handled in the i18n files instead of in the parser itself (since we'd want them to vary per language). CC'ing Niklas to see if he has any thoughts on the i18n aspects.

>The typographic rules[1] in Germany are quite complicated:

One of the scary things about this type of scheme is that it's invisible to the user. If there are exceptions to the rules, the user cannot override them (well, maybe they could do things like insert &#32;, but it's not obvious to the user how to do that, and it's very difficult for them). Hence we'd want to make the rules have effectively no false positives.
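
A small PHP sketch of the &#32; workaround mentioned above, using only the first existing $fixtags pattern. The function name frenchSpaceRule is made up for this example. The pattern looks for a literal space character, so text that uses the entity instead is left alone.

```php
<?php
// First $fixtags pattern from the parser excerpt in comment 7.
function frenchSpaceRule( string $text ): string {
    return preg_replace( '/(.) (?=\\?|:|;|!|%|\\302\\273)/', '\\1&#160;', $text );
}

// Literal space before ":" is made non-breaking.
echo frenchSpaceRule( 'Ratio 3 : 1' ), "\n";
// The entity is not a literal space, so the rule does not fire.
echo frenchSpaceRule( 'Ratio 3&#32;: 1' ), "\n";
```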
Comment 10 seth 2012-07-27 19:52:08 UTC
(In reply to comment #9)
> Hence we'd want to make
> the rules have effectively no false positives.

I fully agree with that.
(And actually that was one of the reasons why I asked for a management system where admins can quickly change regexps: it's quite easy to overlook such false positives a priori.)

However, cases like "123 %" have shown that we don't have to fear false positives too much.
Comment 11 Nemo 2012-07-27 21:12:49 UTC
Is bug 13619 a duplicate?
Comment 12 Bawolff (Brian Wolff) 2012-07-29 17:53:06 UTC
Yes. Since that bug is older, let's continue the discussion over there.

*** This bug has been marked as a duplicate of bug 13619 ***
