Last modified: 2013-02-12 16:43:58 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T3485, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 1485 - Automatic hyphens to (localized?) dashes
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Lowest enhancement with 6 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 1782 6402 7125 14795
Depends on:
Blocks:
Reported: 2005-02-07 04:36 UTC by Mark Pellegrini
Modified: 2013-02-12 16:43 UTC (History)
9 users (show)



Attachments
Replaces certain sequences with UTF-8 codes for dashes (1.50 KB, patch)
2005-02-10 12:55 UTC, Nathan Hamblen
Replace dash sequences with HTML codes rather than UTF-8 (1.75 KB, patch)
2005-02-11 17:28 UTC, Nathan Hamblen
applies new dash rules, excludes math sections (1.91 KB, patch)
2005-03-07 15:42 UTC, Nathan Hamblen

Description Mark Pellegrini 2005-02-07 04:36:18 UTC
The en manual of style has long been promising that the software would
automatically convert -- (a double dash) into the HTML entity &ndash;. This would
keep ugly HTML out of our articles and make editing more accessible for the
HTML-impaired. When is it coming?
Comment 1 Nathan Hamblen 2005-02-10 12:55:12 UTC
Created attachment 279 [details]
Replaces certain sequences with UTF-8 codes for dashes

I've written a patch that I think is fairly well placed since it's adjacent to
the existing code that inserts non-breaking spaces between guillemets. This
method would make a lot of people happy, and it promotes compliance to the
Manual of Style as much as is possible. Here's how it works:

1. Replace any ' -- ' with the UTF-8 sequence equivalent to ' – '
2. Replace any '--' between numbers with '–' alone.
3. Replace any ' --- ' with the UTF-8 sequence equivalent to ' — '
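
The three rules above can be sketched as follows. This is a minimal Python illustration, not the attached patch: the patch itself is PHP and its source is not reproduced in this report, so the function name `convert_dashes` and the exact patterns are assumptions.

```python
import re

EN_DASH = "\u2013"  # en dash
EM_DASH = "\u2014"  # em dash


def convert_dashes(text: str) -> str:
    """Hypothetical helper applying the three proposed rules in order."""
    # Rule 1: a spaced double hyphen becomes a spaced en dash.
    text = text.replace(" -- ", f" {EN_DASH} ")
    # Rule 2: a double hyphen directly between two digits becomes a bare en dash.
    text = re.sub(r"(?<=\d)--(?=\d)", EN_DASH, text)
    # Rule 3: a spaced triple hyphen becomes a spaced em dash.
    text = text.replace(" --- ", f" {EM_DASH} ")
    return text
```

For example, `convert_dashes("January -- March")` gives the month range with a spaced en dash, while an unspaced `a--b` between letters is left alone because none of the three rules covers it.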
Comment 2 Brion Vibber 2005-02-11 11:25:42 UTC
Don't use raw UTF-8 here; numeric character references will be compatible with Latin-1 wikis as well. Test to make sure this 
doesn't break things interestingly.

Also, there's no need to use the 'i' regex modifier on an expression that contains no letters.
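
A small sketch of why numeric character references help on a Latin-1 wiki, written in Python for illustration only. The function name `convert_dashes_ncr` is hypothetical and only the two spaced rules are shown; the point is that the references are plain ASCII, whereas the raw dash characters cannot be encoded in Latin-1 at all.

```python
import html

EN_DASH_REF = "&#8211;"  # numeric character reference for U+2013
EM_DASH_REF = "&#8212;"  # numeric character reference for U+2014


def convert_dashes_ncr(text: str) -> str:
    """Hypothetical variant that emits references instead of raw characters."""
    text = text.replace(" --- ", f" {EM_DASH_REF} ")
    text = text.replace(" -- ", f" {EN_DASH_REF} ")
    return text


out = convert_dashes_ncr("January -- March")
# The output is pure ASCII, so it survives a Latin-1 encoding step.
latin1_bytes = out.encode("latin-1")

try:
    "\u2013".encode("latin-1")
    raw_dash_fits_latin1 = True
except UnicodeEncodeError:
    # The raw en dash has no Latin-1 code point.
    raw_dash_fits_latin1 = False
```

A browser (or `html.unescape` here) turns the reference back into the real dash when the page is displayed.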
Comment 3 Nathan Hamblen 2005-02-11 17:28:34 UTC
Created attachment 285 [details]
Replace dash sequences with HTML codes rather than UTF-8

Hey, that's great... I hadn't thought we would be able to do it with HTML
entities, having a distant memory of a prior dash fix causing problems for
exactly that reason. But, the guillemet replace string uses &nbsp; so, duh, of
course we can. I had copied the /insensitive from the guillemet string —
which doesn't need it either — so this patch removes it from both places.

By the way, I filed bug #1513 to do similar work for quotes and ellipses (and
dashes) in a separate function. I used UTF-8 for it because I don't see it
going in before 1.5 when everything's UTF-8 anyway. (Whether or not people even
want that feature is, of course, up for debate.)
Comment 4 Michael Zajac 2005-02-11 18:53:03 UTC
This is excellent.  But a million typists already habitually use two hyphens to represent a parenthetical dash (em dash), usually 
spaced, but often not.  There's a very strong usability case to make things work the way people expect.
Comment 5 Nathan Hamblen 2005-02-11 19:12:26 UTC
(In reply to comment #4)
I agree, it would be nice to make -- do — since that's what many people already
use for it, but I'm not sure how we could do that and still allow for the (also
very common) shorter dash used in ranges (i.e. January -- March becomes January
– March).

More people than I would expect are familiar with the triple-hyphen from TeX,
and the idea of doing likewise was debated on [[Wikipedia talk:Manual
of Style (dashes)]] and didn't meet with tremendous opposition. I think that if
something is finally put into place, people will adapt to it quickly and fix
pages in short order (there are some pretty serious typographers out there!).
Comment 6 Brion Vibber 2005-02-11 20:33:03 UTC
Agree with Michael; I can't imagine ever intending to write an en-dash with '--'. 
Virtually all existing cases will be meant as em-dashes.
Comment 7 Nathan Hamblen 2005-02-11 20:55:03 UTC
From that talk page I keep mentioning: "When the automatic conversion was
briefly turned on, a - remained unaffected, -- turned into a dash (an n dash I
assume) and --- turned into a longer dash (an m dash I assume)."

My thinking was that if this was ok once, it will be ok again (especially since
it won't break tables this time!) The question could be raised once again on the
talk page, but from what I can tell it's a technical problem (how to allow for
both length dashes) with only one proposed solution.
Comment 8 Mark Pellegrini 2005-02-12 03:59:00 UTC
[[en:user:Curps]] asked me to post this: 

It would be nice to accommodate minus-sign as well, and could probably easily be
done.

The Unicode minus-sign character is approved in [[Wikipedia:Manual of Style
(dashes)]].

In addition to the three rules already proposed, anything of the form '
-[0123456789]' (space followed by hyphen followed by a digit) should get
converted to −
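
One possible reading of this rule, sketched in Python. It assumes that only the hyphen is replaced and that the leading space and the digit are kept; the comment does not spell that out, and the name `convert_minus` is hypothetical.

```python
import re

MINUS = "\u2212"  # Unicode minus sign


def convert_minus(text: str) -> str:
    """Replace a hyphen that follows a space and precedes a digit (assumed reading)."""
    # The lookahead keeps the digit in place; only the hyphen is swapped.
    return re.sub(r" -(?=[0-9])", " " + MINUS, text)
```

Under this reading, a hyphenated word such as `well-known` is untouched, and so is a spaced subtraction such as `3 - 2`, because there the hyphen is followed by a space rather than a digit.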


Comment 9 Nathan Hamblen 2005-02-12 17:00:37 UTC
(In reply to comment #8)
But shouldn't the minus sign also apply to subtraction?

And we'd need to make sure that <math> sections aren't affected. My test setup
doesn't have the right parts installed to render them so I'm not sure; if we're
lucky, <math> is turned into a reference to a graphic before it gets to the
patch's code.
Comment 10 Garth Wallace 2005-02-17 00:52:53 UTC
I suggest having "--" become an en dash, and " -- " (spaces and all) become an
em dash. This is the usage recommended by many typewriter style manuals, and it
has carried through to modern computing. "---" as an em dash is obvious to TeX
users, but not to the general populace.
Comment 11 Baylink@en 2005-02-17 00:59:13 UTC
I can just barely agree with Garth's comment, above.  But any code that converts
-- into anything but an em-dash will be surgically pruned out of any wiki's *I*
run; that violates the Principle Of Least Astonishment with *unusual* violence.

It's bad enough no one thinks that we can reasonably parse the traditional
ASCII-7 'escape sequences' for *bold* and _italics_ (as the typographical
special case of underlining).

No one *needs* an en-dash, anyway.
Comment 12 Nathan Hamblen 2005-02-17 02:38:22 UTC
For me it would be a little "astonishing" to prohibit spaces around en-dashes,
since those spaces are prescribed in our style guide. And please don't dismiss
en-dashes out of hand; there's a mob on wikipedia that wants shortcuts to both
kinds of dashes. (Please do read for yourself.)

There's another proposal on the dash talk page: " -- " goes to em-dash and " - "
goes to en-dash. I'm a little concerned that it would affect <math> code. Can
someone confirm that?
Comment 13 Nathan Hamblen 2005-03-07 15:42:28 UTC
Created attachment 346 [details]
applies new dash rules, excludes math sections

I got math parsing going on my install and found that the old patch did affect
math sections if they were simple enough to be rendered in HTML. That would
pose problems, especially if we convert ' - ' to endashes. To exclude the math
markup, I moved the replace function to be between the strip() and unstrip()
functions. That worked, then I updated the regular expressions to the new
proposed format.

Have a look at the source yourself to be sure. Here's what the expressions do
in words: 
1) replace a hyphen surrounded by spaces with an endash preceded by a
nonbreaking space and followed by a regular space 
2) replace a hyphen between two numeric characters (a range) with an endash.
3) replace a double-hyphen surrounded by spaces with an emdash preceded by a
nonbreaking space and followed by a regular space
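
The approach described in this comment (hide the math sections, convert, then restore them) can be sketched roughly as below. This is a Python illustration under stated assumptions, not the attached PHP patch: `strip_math`, `unstrip_math`, `apply_dash_rules` and `render` are hypothetical names, and the placeholder format is invented for the example rather than taken from MediaWiki's strip() and unstrip().

```python
import re

NBSP, EN_DASH, EM_DASH = "\u00a0", "\u2013", "\u2014"
MATH_RE = re.compile(r"<math>.*?</math>", re.DOTALL)


def strip_math(text):
    """Swap each <math> section for a placeholder and remember the original."""
    stash = []

    def hide(match):
        stash.append(match.group(0))
        return f"\x00{len(stash) - 1}\x00"

    return MATH_RE.sub(hide, text), stash


def unstrip_math(text, stash):
    """Put the remembered <math> sections back in place of their placeholders."""
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], text)


def apply_dash_rules(text):
    # Spaced double hyphen: nonbreaking space, em dash, regular space.
    text = text.replace(" -- ", f"{NBSP}{EM_DASH} ")
    # Spaced single hyphen: nonbreaking space, en dash, regular space.
    text = text.replace(" - ", f"{NBSP}{EN_DASH} ")
    # Single hyphen between two digits (a range): bare en dash.
    text = re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)
    return text


def render(text):
    stripped, stash = strip_math(text)
    return unstrip_math(apply_dash_rules(stripped), stash)
```

With this arrangement the ` - ` inside `<math>a - b</math>` is never seen by the dash rules, while the same sequence in ordinary text is converted.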
Comment 14 JeLuF 2005-03-13 19:51:44 UTC
Fixed in CVS HEAD. Scheduled for Release 1.5
Comment 15 Brion Vibber 2005-03-31 01:22:13 UTC
*** Bug 1782 has been marked as a duplicate of this bug. ***
Comment 16 Brion Vibber 2005-06-20 05:00:59 UTC
I've removed this from 1.5 as it has a nasty tendency to break legitimate markup in addition to 
generally being inconsistent in when it activated.
Comment 17 Nathan Hamblen 2005-06-22 15:10:22 UTC
(In reply to comment #16)
> I've removed this from 1.5 as it has a nasty tendency to break legitimate
> markup in addition to generally being inconsistent in when it activated.

Could we have some more information? I'm happy to play with the regular
expression some more to fix whatever's breaking.
Comment 18 Brion Vibber 2005-06-22 20:15:50 UTC
* conversion must not happen in markup
* conversion must not happen in markup
* conversion must not happen in markup
* conversion should happen in text regardless of surrounding markup
* conversion must not happen in markup

and, let's not forget:
* conversion must not happen in markup

not to mention:
* nobody agrees on what should actually be converted when to what

A regex is unlikely to get this right very easily.
Comment 19 Brion Vibber 2005-06-22 21:13:03 UTC
Nathan asked for more details. Here are the existing bug reports for the issues I mentioned above. 
Some had been worked around, others not:

bug 2021: Corruption of markup (wikilinks)
bug 2462: Corruption of markup (URLs)
bug 2122: Consistency of application when there is surrounding markup
bug 2109: Is this just consistency or does it break date conversion too?
bug 1937: Was this just consistency or did it break functioning of ISBN links too?
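
To make the markup problem concrete, here is a small Python sketch. It is an illustration only: `naive` applies the digit-range rule blindly, and `guarded` skips wikilinks and bare URLs using a deliberately simple pattern. Both names and the `PROTECTED` pattern are assumptions, and real wikitext has many more constructs than these two.

```python
import re

EN_DASH = "\u2013"


def naive(text):
    """Digit-range rule applied everywhere, markup included."""
    return re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)


# Very rough notion of "markup": [[...]] links and http(s) URLs.
PROTECTED = re.compile(r"(\[\[.*?\]\]|https?://\S+)")


def guarded(text):
    """Apply the rule only to the text between protected spans."""
    parts = PROTECTED.split(text)
    # With one capturing group, re.split puts the protected spans at odd indexes.
    return "".join(p if i % 2 else naive(p) for i, p in enumerate(parts))
```

`naive("[[1914-1918]]")` rewrites the link target itself, which is the kind of corruption the reports above describe. `guarded` leaves the target and the URL alone, but it also skips link labels, so it does not satisfy the requirement that conversion happen in text regardless of surrounding markup.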
Comment 20 Nathan Hamblen 2005-07-15 12:40:46 UTC
How about using this SmartyPants implementation on PHP:
http://www.michelf.com/projects/php-smartypants/ . I tried hooking it up to
mediawiki and it works fine. SmartyPants is used on all kinds of web sites and
doesn't do dumb things like changing hyphens inside URLs, and it won't even
touch MathML. It does conversion "in the markup," but it's battle tested.

It also does quotes and ellipses. (bug #1513)

Downside is it doesn't offer exactly the conversion syntax we sort-of agreed to,
- to ndash and -- to mdash. From discussions here I would say the best
configuration for it is -- to mdash, --- to ndash (backwards and weird) or ndash
disabled entirely. People were pretty hostile to the idea of having to use ---
for the very common mdash, which is its default.
Comment 21 Omegatron 2005-08-09 03:56:04 UTC
"in the markup" meaning it converts -- into &mdash; when you save?  Like
converting ~~~ into signature?  That's bad.  It needs to *render* -- as &mdash;,
but leave the markup as --
Comment 22 Mark Pellegrini 2005-08-09 04:07:06 UTC
Erm, yes, sorry if I wasn't clear about that. Yes, I meant the conversion should
occur at page-render time, not at save time. 
Comment 23 peter green 2005-08-16 22:48:19 UTC
mmm another option would be to convert on save but put the dash itself in the
wikitext rather than an HTML entity.
Comment 24 Omegatron 2005-08-16 22:56:49 UTC
(In reply to comment #23)
> mmm another option would be to convert on save but put the dash itself in the
> wikitext rather than a html entity.

Do all browsers support them in edit boxes, though?  Or will some convert them
back into hyphens?
Comment 25 Omegatron 2005-12-01 02:18:00 UTC
(In reply to comment #24)
> Do all browsers support them in edit boxes, though?  Or will some convert them
> back into hyphens?

There is a workaround for old browsers and dashes can now be entered directly
into the unicode wikitext with no problems.  I've written a user script that
automatically converts the HTML entities, double hyphens, and so on into their
unicode characters. 
Comment 26 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-15 02:51:03 UTC
(In reply to comment #13)
> Have a look at the source yourself to be sure. Here's what the expressions do
> in words: 
> 1) replace a hyphen surrounded by spaces with an endash preceded by a
> nonbreaking space and followed by a regular space 
> 2) replace a hyphen between two numeric characters (a range) with an endash.
> 3) replace a double-hyphen surrounded by spaces with an emdash preceded by a
> nonbreaking space and followed by a regular space

You forgot 4: replace a double-hyphen not surrounded by spaces with a lone em
dash.  (Obviously the attachment is most likely so old as to be worthless at
this point, so this is just a note to future implementers.)
Comment 27 Ævar Arnfjörð Bjarmason 2006-06-22 12:50:38 UTC
*** Bug 6402 has been marked as a duplicate of this bug. ***
Comment 28 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-22 23:44:35 UTC
Thinking about it, I don't think that a hyphen between two numbers should be
converted to en dash.  Consider the text "Type Alt-0-1-5-0 to get an en
dash"—those are supposed to be hyphens, I believe, not en dashes.  More
generally, there's no legitimate use of two consecutive hyphens in English other
than as a dash, and I certainly can't think of a legitimate use for " - " other
than as a dash, but I get the nagging feeling that there will be a nontrivial
number of non-ranges/subtractions that will look like them.  I'd drop point 2
and go for 1, 3, and 4 instead.
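
The false positive is easy to reproduce. The Python sketch below is illustrative only: `rule_2` is the digit-range rule on its own, and `rules_1_3_4` is one possible reading of the remaining points. The function names are hypothetical, and leaving runs of three or more hyphens untouched is an assumption made for the example.

```python
import re

NBSP, EN_DASH, EM_DASH = "\u00a0", "\u2013", "\u2014"


def rule_2(text):
    """Point 2 alone: hyphen between two digits becomes an en dash."""
    return re.sub(r"(?<=\d)-(?=\d)", EN_DASH, text)


def rules_1_3_4(text):
    """Points 1, 3 and 4, with point 2 dropped."""
    # Point 3: spaced double hyphen -> nonbreaking space, em dash, space.
    text = text.replace(" -- ", f"{NBSP}{EM_DASH} ")
    # Point 4: unspaced double hyphen -> lone em dash.
    # Assumption: runs of three or more hyphens are left alone.
    text = re.sub(r"(?<!-)--(?!-)", EM_DASH, text)
    # Point 1: spaced single hyphen -> nonbreaking space, en dash, space.
    text = text.replace(" - ", f"{NBSP}{EN_DASH} ")
    return text
```

`rule_2` turns the key sequence `Alt-0-1-5-0` into one with en dashes between the digits, which is not what the writer meant. `rules_1_3_4` leaves that sequence as typed and still converts the spaced and double-hyphen forms.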
Comment 29 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-27 01:46:00 UTC
*** Bug 7125 has been marked as a duplicate of this bug. ***
Comment 30 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-08-27 01:52:15 UTC
Please note that this should really be localized.  Whether to use phrases
(presumably very slow, but easy for i18n people to manage) or switch statements
(as fast as is possible, but slightly icky) I leave to people who know about
server load.
Comment 31 yonidebest 2006-08-29 08:26:24 UTC
I would like to note that I would like this feature to *replace* the -- and --- sign 
into another sort of hyphen (like the replacement of ~~~ to sig) and not just display 
the text in another way. I want the Wiki code itself to change and display another 
sort of hyphen - i.e. I wouldn't like to see Wiki code with -- and --- everywhere.
Comment 32 Omegatron 2006-08-29 11:49:57 UTC
(In reply to comment #31)
> I would like to note that I would like this feature to *replace* the -- and
> --- sign into another sort of hyphen (like the replacement of ~~~ to sig) and
> not just display the text in another way. I want the Wiki code itself to
> change and display another sort of hyphen - i.e. I wouldn't like to see Wiki
> code with -- and --- everywhere.


I would like to note that I want the opposite.  :-)  -- should be a wikicode and
rendered by the software as an em dash, in the right circumstances.  If you just
want a double-hyphen to unicode dash converter, one can be made in javascript.
Comment 33 yonidebest 2006-08-29 17:18:24 UTC
Thanks for the idea Omegatron. We will see if it is worth using javascript locally, 
but I do think that text conversions should be handled by the server. If there is a 
demand to keep the -- and --- as is in wikicode, perhaps the developers can create an 
option for those who would like the -- and --- converted. At these times I wish I 
knew programming...
Comment 34 Danny B. 2008-07-20 09:48:41 UTC
*** Bug 14795 has been marked as a duplicate of this bug. ***
Comment 35 Brion Vibber 2008-07-30 23:25:44 UTC
De-assigning since not under active development atm.
Comment 36 Niklas Laxström 2009-06-21 08:45:42 UTC
Marking this as wontfix for now. It is too hard to get it right and the existing automatic conversions already cause us trouble. If you mean something, type it. There is already enough assistance and methods to do so even if your keyboard layout is missing characters which are needed to type typographically correct and good looking text in your language.

(In reply to comment #16)
> I've removed this from 1.5 as it has a nasty tendency to break legitimate
> markup in addition to 
> generally being inconsistent in when it activated.


(In reply to comment #18)
> * nobody agrees on what should actually be converted when to what
Comment 37 Dan Jacobson 2009-06-23 02:05:00 UTC
Having used mailing lists, Usenet, and Mediawiki etc. for years, I was
aghast at the gall of WordPress meddling with what the user entered
(mainly quote marks), and am glad that Mediawiki will not be stepping
over that fine line.
