Last modified: 2013-02-12 16:43:58 UTC
The en manual of style has long been promising that the software would automatically convert -- (a double dash) into the HTML entity &ndash;. This would keep ugly HTML out of our articles and make editing more accessible for the HTML-impaired. When is it coming?
Created attachment 279 [details] Replaces certain sequences with UTF-8 codes for dashes

I've written a patch that I think is fairly well placed, since it's adjacent to the existing code that inserts non-breaking spaces between guillemets. This method would make a lot of people happy, and it promotes compliance with the Manual of Style as much as is possible. Here's how it works:

1. Replace any ' -- ' with the UTF-8 sequence equivalent to ' &ndash; '
2. Replace any '--' between numbers with '&ndash;' alone.
3. Replace any ' --- ' with the UTF-8 sequence equivalent to ' &mdash; '
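For readers without the attachment, the three rules can be sketched roughly as follows. This is a Python illustration, not the patch itself (which lives in the MediaWiki parser); the function name `convert_dashes` and the constants are made up for the example, and the output uses the literal Unicode characters, as the first version of the patch did.

```python
import re

EN_DASH = "\u2013"  # en dash
EM_DASH = "\u2014"  # em dash


def convert_dashes(text: str) -> str:
    """Hypothetical sketch of the three proposed rules; not the actual patch."""
    # Rule 1: a spaced double hyphen becomes a spaced en dash.
    text = text.replace(" -- ", f" {EN_DASH} ")
    # Rule 2: a double hyphen directly between two digits becomes a bare en dash.
    text = re.sub(r"(?<=\d)--(?=\d)", EN_DASH, text)
    # Rule 3: a spaced triple hyphen becomes a spaced em dash.
    text = text.replace(" --- ", f" {EM_DASH} ")
    return text


if __name__ == "__main__":
    print(convert_dashes("January -- March, 1914--1918, and so --- it seems"))
```

A single hyphen, spaced or not, is left alone by all three rules in this sketch.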
Don't use raw UTF-8 here; numeric character references will be compatible with Latin-1 wikis as well. Test to make sure this doesn't break things interestingly. Also, there's no need to use the 'i' regex modifier on an expression that contains no letters.
Created attachment 285 [details] Replace dash sequences with HTML codes rather than UTF-8

Hey, that's great... I hadn't thought we would be able to do it with HTML entities, having a distant memory of a prior dash fix causing problems for exactly that reason. But the guillemet replace string uses &nbsp; so, duh, of course we can. I had copied the /insensitive from the guillemet string (which doesn't need it either), so this patch removes it from both places. By the way, I filed bug #1513 to do similar work for quotes and ellipses (and dashes) in a separate function. I used UTF-8 for it because I don't see it going in before 1.5 when everything's UTF-8 anyway. (Whether or not people even want that feature is, of course, up for debate.)
This is excellent. But a million typists already habitually use two hyphens to represent a parenthetical dash (em dash), usually spaced, but often not. There's a very strong usability case to make things work the way people expect.
(In reply to comment #4) I agree, it would be nice to make -- produce an em dash since that's what many people already use for it, but I'm not sure how we could do that and still allow for the (also very common) shorter dash used in ranges (e.g. January -- March becomes January – March). More people than I would expect are familiar with the triple hyphen from TeX, and the idea of doing likewise was debated on [[Wikipedia talk:Manual of Style (dashes)]] and didn't meet with tremendous opposition. I think that if something is finally put into place, people will adapt to it quickly and fix pages in short order (there are some pretty serious typographers out there!).
Agree with Michael; I can't imagine ever intending to write an en-dash with '--'. Virtually all existing cases will be meant as em-dashes.
From that talk page I keep mentioning: "When the automatic conversion was briefly turned on, a - remained unaffected, -- turned into a dash (an n dash I assume) and --- turned into a longer dash (an m dash I assume)." My thinking was that if this was OK once, it will be OK again (especially since it won't break tables this time!). The question could be raised once again on the talk page, but from what I can tell it's a technical problem (how to allow for dashes of both lengths) with only one proposed solution.
[[en:user:Curps]] asked me to post this: It would be nice to accommodate the minus sign as well, and that could probably be done easily. The Unicode minus-sign character is approved in [[Wikipedia:Manual of Style (dashes)]]. In addition to the three rules already proposed, anything of the form ' -[0123456789]' (space followed by hyphen followed by a digit) should have its hyphen converted to − (the minus sign).
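As a rough illustration of the proposed minus-sign rule, here is a small Python sketch. The name `convert_minus` is made up for the example, and the rule is implemented exactly as worded above (space, hyphen, digit), so a hyphen at the very start of a line is not covered.

```python
import re

MINUS_SIGN = "\u2212"  # Unicode minus sign


def convert_minus(text: str) -> str:
    """Hypothetical sketch: ' -<digit>' becomes ' <minus sign><digit>'."""
    # The lookahead keeps the digit in place; only the hyphen is swapped.
    return re.sub(r" -(?=\d)", " " + MINUS_SIGN, text)


if __name__ == "__main__":
    print(convert_minus("The low was -5 degrees, a well-known record."))
```

Hyphens inside words, and hyphens with no space in front of them (as in "pages 3-5"), are left untouched by this sketch.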
(In reply to comment #8) But shouldn't the minus sign also apply to subtraction? And we'd need to make sure that <math> sections aren't affected. My test setup doesn't have the right parts installed to render them so I'm not sure; if we're lucky, <math> is turned into a reference to a graphic before it gets to the patch's code.
I suggest having "--" become an en dash, and " -- " (spaces and all) become an em dash. This is the usage recommended by many typewriter style manuals, and it has carried through to modern computing. "---" as an em dash is obvious to TeX users, but not to the general populace.
I can just barely agree with Garth's comment, above. But any code that converts -- into anything but an em-dash will be surgically pruned out of any wikis *I* run; that violates the Principle Of Least Astonishment with *unusual* violence. It's bad enough no one thinks that we can reasonably parse the traditional ASCII-7 'escape sequences' for *bold* and _italics_ (as the typographical special case of underlining). No one *needs* an en-dash, anyway.
For me it would be a little "astonishing" to prohibit spaces around en-dashes, since those spaces are prescribed in our style guide. And please don't dismiss en-dashes out of hand; there's a mob on Wikipedia that wants shortcuts to both kinds of dashes. (Please do read for yourself.) There's another proposal on the dash talk page: " -- " goes to em-dash and " - " goes to en-dash. I'm a little concerned that it would affect <math> code. Can someone confirm that?
Created attachment 346 [details] applies new dash rules, excludes math sections

I got math parsing going on my install and found that the old patch did affect math sections if they were simple enough to be rendered in HTML. That would pose problems, especially if we convert ' - ' to endashes. To exclude the math markup, I moved the replace function to sit between the strip() and unstrip() functions. That worked, so I then updated the regular expressions to the new proposed format. Have a look at the source yourself to be sure. Here's what the expressions do in words:

1) replace a hyphen surrounded by spaces with an endash preceded by a nonbreaking space and followed by a regular space
2) replace a hyphen between two numeric characters (a range) with an endash
3) replace a double-hyphen surrounded by spaces with an emdash preceded by a nonbreaking space and followed by a regular space
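The strip-then-replace-then-unstrip idea can be sketched in Python roughly like this. It is only an illustration under stated assumptions: `strip_math`, `unstrip_math`, `apply_dash_rules` and `render` are hypothetical stand-ins, not the parser's real strip() and unstrip(), and the output uses numeric character references on the assumption that the patch followed the earlier advice to avoid raw UTF-8.

```python
import re

MATH_RE = re.compile(r"<math>.*?</math>", re.DOTALL)


def strip_math(text):
    """Swap each <math> section for a placeholder; return new text and saved sections."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"\x00MATH{len(saved) - 1}\x00"

    return MATH_RE.sub(stash, text), saved


def unstrip_math(text, saved):
    """Put the saved <math> sections back in place of their placeholders."""
    return re.sub(r"\x00MATH(\d+)\x00", lambda m: saved[int(m.group(1))], text)


def apply_dash_rules(text):
    # Rule 3: spaced double hyphen -> nbsp + em dash + space.
    text = text.replace(" -- ", "&#160;&#8212; ")
    # Rule 1: spaced single hyphen -> nbsp + en dash + space.
    text = text.replace(" - ", "&#160;&#8211; ")
    # Rule 2: hyphen directly between two digits -> en dash.
    text = re.sub(r"(?<=\d)-(?=\d)", "&#8211;", text)
    return text


def render(text):
    stripped, saved = strip_math(text)
    return unstrip_math(apply_dash_rules(stripped), saved)


if __name__ == "__main__":
    print(render("<math>a - b</math> holds for 1990-1995 -- mostly"))
```

Because the dash rules only ever see the placeholders, whatever sits inside a `<math>` section comes back unchanged.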
Fixed in CVS HEAD. Scheduled for Release 1.5
*** Bug 1782 has been marked as a duplicate of this bug. ***
I've removed this from 1.5 as it has a nasty tendency to break legitimate markup in addition to generally being inconsistent in when it activated.
(In reply to comment #16)
> I've removed this from 1.5 as it has a nasty tendency to break legitimate markup in addition to
> generally being inconsistent in when it activated.

Could we have some more information? I'm happy to play with the regular expression some more to fix whatever's breaking.
* conversion must not happen in markup
* conversion must not happen in markup
* conversion must not happen in markup
* conversion should happen in text regardless of surrounding markup
* conversion must not happen in markup

and, let's not forget:

* conversion must not happen in markup

not to mention:

* nobody agrees on what should actually be converted when to what

A regex is unlikely to get this right very easily.
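To make the "not in markup" requirement concrete, here is a deliberately naive Python sketch that splits the text into markup and non-markup pieces and converts only the latter. The pattern list is a made-up, incomplete assumption (wikilinks, bare URLs, HTML-style tags), and `convert_outside_markup` is a hypothetical helper; real wikitext has far more cases, which is the point of the comment above.

```python
import re

# Hypothetical and incomplete: spans that should never be touched.
MARKUP_RE = re.compile(
    r"(\[\[.*?\]\]"    # wikilinks such as [[Foo--bar]]
    r"|https?://\S+"   # bare URLs
    r"|<[^>]*>)"       # HTML-style tags
)


def convert_outside_markup(text, convert):
    """Apply `convert` to plain-text pieces only, leaving matched markup as-is."""
    parts = MARKUP_RE.split(text)
    # With a single capturing group, re.split alternates:
    # even indices are plain text, odd indices are the matched markup.
    return "".join(
        convert(part) if index % 2 == 0 else part
        for index, part in enumerate(parts)
    )


if __name__ == "__main__":
    to_em_dash = lambda s: s.replace("--", "\u2014")
    print(convert_outside_markup("see [[Foo--bar]] -- or http://x.org/a--b", to_em_dash))
```

Even this small sketch has to pick a side on open questions, for example a dash sequence that straddles a markup boundary is never converted, so it should be read as an illustration of the difficulty rather than a fix.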
Nathan asked for more details. Here are the existing bug reports for the issues I mentioned above. Some had been worked around, others not:

bug 2021: Corruption of markup (wikilinks)
bug 2462: Corruption of markup (URLs)
bug 2122: Consistency of application when there is surrounding markup
bug 2109: Is this just consistency or does it break date conversion too?
bug 1937: Was this just consistency or did it break functioning of ISBN links too?
How about using this SmartyPants implementation in PHP: http://www.michelf.com/projects/php-smartypants/ . I tried hooking it up to MediaWiki and it works fine. SmartyPants is used on all kinds of web sites and doesn't do dumb things like changing hyphens inside URLs, and it won't even touch MathML. It does conversion "in the markup," but it's battle tested. It also does quotes and ellipses (bug #1513). The downside is that it doesn't offer exactly the conversion syntax we sort-of agreed to, - to ndash and -- to mdash. From discussions here I would say the best configuration for it is -- to mdash, --- to ndash (backwards and weird) or ndash disabled entirely. People were pretty hostile to the idea of having to use --- for the very common mdash, which is its default.
"in the markup" meaning it converts -- into — when you save? Like converting ~~~ into signature? That's bad. It needs to *render* -- as —, but leave the markup as --
Erm, yes, sorry if I wasn't clear about that. Yes, I meant the conversion should occur at page-render time, not at save time.
mmm another option would be to convert on save but put the dash itself in the wikitext rather than a html entity.
(In reply to comment #23)
> mmm another option would be to convert on save but put the dash itself in the
> wikitext rather than a html entity.

Do all browsers support them in edit boxes, though? Or will some convert them back into hyphens?
(In reply to comment #24)
> Do all browsers support them in edit boxes, though? Or will some convert them
> back into hyphens?

There is a workaround for old browsers, and dashes can now be entered directly into the Unicode wikitext with no problems. I've written a user script that automatically converts the HTML entities, double hyphens, and so on into their Unicode characters.
(In reply to comment #13)
> Have a look at the source yourself to be sure. Here's what the expressions do
> in words:
> 1) replace a hyphen surrounded by spaces with an endash preceded by a
> nonbreaking space and followed by a regular space
> 2) replace a hyphen between two numeric characters (a range) with an endash.
> 3) replace a double-hyphen surrounded by spaces with an emdash preceded by a
> nonbreaking space and followed by a regular space

You forgot 4: replace a double-hyphen not surrounded by spaces with a lone em dash. (Obviously the attachment is most likely so old as to be worthless at this point, so this is just a note to future implementers.)
*** Bug 6402 has been marked as a duplicate of this bug. ***
Thinking about it, I don't think that a hyphen between two numbers should be converted to en dash. Consider the text "Type Alt-0-1-5-0 to get an en dash"—those are supposed to be hyphens, I believe, not en dashes. More generally, there's no legitimate use of two consecutive hyphens in English other than as a dash, and I certainly can't think of a legitimate use for " - " other than as a dash, but I get the nagging feeling that there will be a nontrivial number of non-ranges/subtractions that will look like them. I'd drop point 2 and go for 1, 3, and 4 instead.
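A small Python sketch of the reduced rule set (points 1, 3 and 4, with point 2 dropped) shows the effect on the "Alt-0-1-5-0" example. The function name `convert_reduced` is hypothetical, and reading point 4 as "any run of exactly two hyphens left over after point 3" is an assumption made for this illustration.

```python
import re

NBSP = "\u00a0"     # non-breaking space
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def convert_reduced(text: str) -> str:
    """Hypothetical sketch of points 1, 3 and 4 only (no digit-range rule)."""
    # Point 3: spaced double hyphen -> nbsp + em dash + space.
    text = text.replace(" -- ", f"{NBSP}{EM_DASH} ")
    # Point 4: any remaining run of exactly two hyphens -> lone em dash.
    text = re.sub(r"(?<!-)--(?!-)", EM_DASH, text)
    # Point 1: spaced single hyphen -> nbsp + en dash + space.
    text = text.replace(" - ", f"{NBSP}{EN_DASH} ")
    return text


if __name__ == "__main__":
    print(convert_reduced("Type Alt-0-1-5-0 to get an en dash -- really"))
```

With the digit-range rule gone, the hyphens in the key sequence stay hyphens, at the cost of ranges such as "1990-1995" no longer being converted.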
*** Bug 7125 has been marked as a duplicate of this bug. ***
Please note that this should really be localized. Whether to use phrases (presumably very slow, but easy for i18n people to manage) or switch statements (as fast as is possible, but slightly icky) I leave to people who know about server load.
I would like to note that I would like this feature to *replace* the -- and --- sign into another sort of hyphen (like the replacement of ~~~ to sig) and not just display the text in another way. I want the Wiki code itself to change and display another sort of hyphen - i.e. I wouldn't like to see Wiki code with -- and --- everywhere.
(In reply to comment #31)
> I would like to note that I would like this feature to *replace* the -- and --- sign
> into another sort of hyphen (like the replacement of ~~~ to sig) and not just display
> the text in another way. I want the Wiki code itself to change and display another
> sort of hyphen - i.e. I wouldn't like to see Wiki code with -- and --- everywhere.

I would like to note that I want the opposite. :-) -- should be a wikicode and rendered by the software as an em dash, in the right circumstances. If you just want a double-hyphen to Unicode dash converter, one can be made in JavaScript.
Thanks for the idea, Omegatron. We will see if it is worth using JavaScript locally, but I do think that text conversions should be handled by the server. If there is a demand to keep the -- and --- as is in wikicode, perhaps the developers can create an option for those who would like the -- and --- converted. At these times I wish I knew programming...
*** Bug 14795 has been marked as a duplicate of this bug. ***
De-assigning since not under active development atm.
Marking this as wontfix for now. It is too hard to get it right and the existing automatic conversions already cause us trouble. If you mean something, type it. There is already enough assistance and methods to do so even if your keyboard layout is missing characters which are needed to type typographically correct and good looking text in your language.

(In reply to comment #16)
> I've removed this from 1.5 as it has a nasty tendency to break legitimate
> markup in addition to
> generally being inconsistent in when it activated.

(In reply to comment #18)
> * nobody agrees on what should actually be converted when to what
Having used mailing lists, Usenet, MediaWiki, etc. for years, I was aghast at the gall of WordPress meddling with what the user entered (mainly quote marks), and am glad that MediaWiki will not be stepping over that fine line.