Last modified: 2012-06-17 22:00:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and, except for displaying bug reports and their history, links might be broken. See T8458, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 6458 - Dashes are parsed as part of autodetected URLs
Product: MediaWiki
Classification: Unclassified
Component: Parser
Hardware: All All
Importance: Low minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: newparser, patch, patch-need-review
Reported: 2006-06-27 03:55 UTC by Aryeh Gregor (not reading bugmail, please e-mail directly)
Modified: 2012-06-17 22:00 UTC (History)
6 users (show)


Patch (2.28 KB, patch)
2006-07-31 00:04 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Patch (2.23 KB, patch)
2006-07-31 00:06 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)

Description Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-27 03:55:45 UTC
The provided URL demonstrates that the parser interprets any (or almost any)
character other than linebreak, tab, space, or nbsp as being part of a preceding
URL.  Only alphanumerics and -_.~!*'();:@&=+$,/?%#[] can appear in a URI of any
Comment 1 Platonides 2006-06-28 11:11:11 UTC
Agree, but Unicode URLs shown in 'nice form' (without % escaping) could be
affected. This shouldn't be allowed, but as several people have asked here not to
escape URLs in [[ ]], it's probable that some URLs will break. Should we care?
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-28 18:04:31 UTC
Hrm, good point.  MediaWiki does, of course, escape those automatically if input
as an actual URL (so em dash would be included but automatically translated to
%E2%80%95 for the href of the link).  So this might actually be a feature, not a
bug.  Note, however, that it only affects autodetected URLs, not single- or
double-bracketed links; it would be correct for those two to assume that invalid
characters are part of the link, because you know that the link only ends on a
space/] or pipe/]], respectively.
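
As a quick check of the escaping described above, here is a minimal Python sketch using the standard library's urllib.parse.quote. It only illustrates UTF-8 percent-encoding and is not MediaWiki's own escaping code; the variable names are made up for the example. Note that %E2%80%95 is the encoding of U+2015 (HORIZONTAL BAR), while U+2014 (EM DASH) encodes as %E2%80%94.

```python
from urllib.parse import quote

HORIZONTAL_BAR = "\u2015"  # encodes to the sequence quoted in the comment
EM_DASH = "\u2014"         # the em dash proper, one code point earlier

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded_bar = quote(HORIZONTAL_BAR)
encoded_em = quote(EM_DASH)

print(encoded_bar)  # %E2%80%95
print(encoded_em)   # %E2%80%94
```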

Regardless, it would still be better to just selectively exclude a few more
common punctuation marks (like em and en dash, the former of which is where I
noticed this at
and continue to parse, e.g., foreign-language characters as part of autodetected
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:04:44 UTC
Created attachment 2185

This patch treats trailing *or* internal 0xA0, 0x2000-0x200B, 0x200D-0x2015
(various Unicode spaces/dashes) as not being part of free links.  Bracketed
links are unaffected.
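
To make the proposed rule concrete, here is a rough Python sketch. The regex is a hypothetical stand-in for free-link detection (MediaWiki's parser is written in PHP and its real pattern differs), the function names are invented for illustration, and example.com is a placeholder URL. Only the excluded code points are taken from the description above.

```python
import re

# Code points listed in the patch description:
# U+00A0, U+2000-U+200B, U+200D-U+2015.
PATCH_EXCLUDED = "\u00a0\u2000-\u200b\u200d-\u2015"


def free_link_pattern(extra_excluded=""):
    # Assumed baseline rule: a free link runs until ASCII whitespace or a
    # bracket. Characters in extra_excluded also end the link.
    return re.compile(r"https?://[^ \t\r\n\[\]<>" + extra_excluded + "]+")


def find_free_links(text, extra_excluded=""):
    """Return the URLs that the sketched rule would autodetect in text."""
    return free_link_pattern(extra_excluded).findall(text)


sample = "I like http://example.com/page\u2015do you?"

# Baseline: the dash and the following word are swallowed into the link.
print(find_free_links(sample))
# With the patch's exclusions: the link ends before the dash.
print(find_free_links(sample, PATCH_EXCLUDED))
```

The sketch does not model bracketed links at all, which matches the patch only in the sense that the patch leaves them alone.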
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:06:52 UTC
Created attachment 2186

Fix a couple of trivial mistakes I just noticed in previous patch.
Comment 5 Brion Vibber 2006-09-11 13:01:11 UTC
This seems a weird exception. Aren't there thousands of other punctuation and control characters?
Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-12 00:57:36 UTC
Yes, but many will either a) routinely occur in URLs (in encoded form) in any
language where they'd be likely to occur in text, or b) never legitimately occur
immediately after something intended to be a URL.  If there are any other
characters to which neither of those points apply, those should probably be
added as well.  U+2018 through U+2026 would probably be good candidates, and »

(On the other hand, U+200D shouldn't be included, or at least I don't know what
precisely it does but have some vague idea it's common in some languages.  I
must have included it by mistake.)
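
For reference, the official Unicode names of the code points under discussion can be looked up with Python's standard unicodedata module. This is just a lookup aid; the helper name describe and the selection of code points are choices made for this example. U+200D turns out to be ZERO WIDTH JOINER, which is used in some scripts (Indic ones, for instance) to control how adjacent letters join.

```python
import unicodedata


def describe(code_point):
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return f"U+{code_point:04X} {name}"


# Edges of the ranges from the patch, the dashes, and the suggested additions.
CODE_POINTS = (0x00A0, 0x200B, 0x200C, 0x200D,
               0x2013, 0x2014, 0x2015,
               0x2018, 0x2026, 0x00BB)

for cp in CODE_POINTS:
    print(describe(cp))
```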
Comment 7 Dan Collins 2011-07-12 02:59:54 UTC
This clearly has not been applied. The patch is rather restrictive and does not address many of the characters listed in the example; however, many of those characters (the hyphen, certainly, as well as the exclamation point and parentheses) are found in URLs. Further, I can't think of any negative effect of allowing a rather liberal character set in URLs, especially since the single-bracket syntax lets the user explicitly delimit the link, and there is no visual difference with that workaround. Taking the initiative to close this as WONTFIX.
Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-12 22:21:35 UTC
I explicitly said in comment 2 what the reason is for the patch.  People will write stuff like "I like―do you?" with an unspaced em dash, and the autolinking will go too far.  We already stop autolinking at spaces and brackets, there's no reason not to add a few more characters to the list.

Of course, you can always use brackets instead of a free link.  But that goes both ways: if you have a URL that contains a dash and is parsed incorrectly with the change, you can always use square brackets to make the link work.  The question is, will the change correctly guess more or fewer URLs than at present?
Comment 9 au 2012-06-17 13:24:42 UTC
Hi Aryeh, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project:

Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches:

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its URL as a comment.

If you run into any issues with the patch process, please feel free to ask on #wikimedia-dev and the wikitext-l mailing list. Thank you!
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2012-06-17 13:43:46 UTC
The patch is trivial, and doesn't really need review for correctness, but I never committed it because I was unsure if it would fix more pages than it broke.  If someone else wants to try, they should feel free.
Comment 11 Platonides 2012-06-17 21:49:23 UTC
It's a good point to stop URL autodetection on dashes, but I think it'd be preferable to leave the rule simply as "URLs run until whitespace" (current behavior), closing this as WONTFIX.
Comment 12 Krinkle 2012-06-17 21:55:57 UTC
I agree with Platonides. How would a user make a link to an actual page with an em dash in the title (not counting "per cent encoding" as a viable alternative, since "nobody" knows about that)?

For example, the Wikipedia article about en-dash:–

Requiring whitespace between the URL and the next part of the sentence seems like a sane requirement to me. And if, in the rare exception, somebody prefers to have them next to each other without a space (I don't know whether that is even grammatically correct), then one can always use the full syntax:

"I like []―do you?"
Comment 13 Gabriel Wicke 2012-06-17 22:00:52 UTC
I agree with Platonides: better to have simple and relatively predictable rules, and a simple workaround for the complex cases. Adding exceptions for some characters while leaving out others would make the behavior harder to predict.
