Last modified: 2012-06-17 22:00:52 UTC

Bug 6458 - Dashes are parsed as part of autodetected URLs
Status: RESOLVED WONTFIX
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Low minor with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://en.wikipedia.org/wiki/User:Sim...
Keywords: newparser, patch, patch-need-review
Depends on:
Blocks:
Reported: 2006-06-27 03:55 UTC by Aryeh Gregor (not reading bugmail, please e-mail directly)
Modified: 2012-06-17 22:00 UTC
CC List: 6 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Patch (2.28 KB, patch)
2006-07-31 00:04 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)
Patch (2.23 KB, patch)
2006-07-31 00:06 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)

Description Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-27 03:55:45 UTC
The provided URL demonstrates that the parser interprets any (or almost any)
character other than linebreak, tab, space, or nbsp as being part of a preceding
URL.  Only alphanumerics and -_.~!*'();:@&=+$,/?%#[] can appear in a URI of any
kind.
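
As a rough illustration of the character set listed above, the following Python sketch finds the first character in a string that falls outside it. The names URI_CHARS, is_uri_char and first_invalid are hypothetical and not part of MediaWiki, and limiting "alphanumerics" to ASCII letters and digits is an assumption.

```python
import string

# Character set from the description: ASCII alphanumerics (an assumption)
# plus the punctuation listed there. Illustrative names, not MediaWiki code.
URI_CHARS = set(string.ascii_letters + string.digits + "-_.~!*'();:@&=+$,/?%#[]")


def is_uri_char(ch: str) -> bool:
    """Return True if ch is in the character set listed in the description."""
    return ch in URI_CHARS


def first_invalid(text: str) -> int:
    """Return the index of the first character outside the set, or -1."""
    for i, ch in enumerate(text):
        if not is_uri_char(ch):
            return i
    return -1
```

A detector built on this set would end a free link at the index first_invalid returns, for example at a dash such as U+2014 that directly follows the URL.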
Comment 1 Platonides 2006-06-28 11:11:11 UTC
Agreed, but Unicode URLs shown in 'nice form' (without % escaping) could be
affected. This shouldn't be allowed, but as several people asked here not to
escape URLs in [[ ]], it's probable that some URLs will break. Should we care?
Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-28 18:04:31 UTC
Hrm, good point.  MediaWiki does, of course, escape those automatically if input
as an actual URL (so em dash would be included but automatically translated to
%E2%80%95 for the href of the link).  So this might actually be a feature, not a
bug.  Note, however, that it only affects autodetected URLs, not single- or
double-bracketed links; it would be correct for those two to assume that invalid
characters are part of the link, because you know that the link only ends on a
space/] or pipe/]], respectively.
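
For reference, the percent-encoding mentioned above can be reproduced with a short Python sketch. It uses the standard library's urllib.parse.quote; the helper name percent_encode is hypothetical. Note that %E2%80%95 is the UTF-8 encoding of U+2015 (horizontal bar), while U+2014 (em dash) encodes to %E2%80%94.

```python
from urllib.parse import quote


def percent_encode(ch: str) -> str:
    """Percent-encode a string using its UTF-8 bytes."""
    return quote(ch, safe="")


for ch in ("\u2014", "\u2015"):  # EM DASH, HORIZONTAL BAR
    print(f"U+{ord(ch):04X} -> {percent_encode(ch)}")
# prints:
# U+2014 -> %E2%80%94
# U+2015 -> %E2%80%95
```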

Regardless, it would be still better to just selectively exclude a few more
common punctuation marks (like em and en dash, the former of which is where I
noticed this at
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Image_copyright_tags&diff=prev&oldid=60640000#Other_public_domain_images>)
and continue to parse, e.g., foreign-language characters as part of autodetected
URLs.
Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:04:44 UTC
Created attachment 2185 [details]
Patch

This patch treats trailing *or* internal 0xA0, 0x2000-0x200B, 0x200D-0x2015
(various Unicode spaces/dashes) as not being part of free links.  Bracketed
links are unaffected.
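
The effect of the excluded ranges can be sketched in Python. This is a simplified model, not the attached PHP patch: the pattern names and the free_link helper are hypothetical, and the baseline assumes free links end only at linebreak, tab, space, or nbsp, as the description states.

```python
import re

# Assumed baseline: a free link runs until linebreak, tab, space, or nbsp.
BASELINE = re.compile(r"https?://[^ \t\r\n\u00a0]+")

# Sketch of the patched rule: also stop at U+2000-U+200B and U+200D-U+2015.
PATCHED = re.compile(r"https?://[^ \t\r\n\u00a0\u2000-\u200b\u200d-\u2015]+")


def free_link(pattern, text):
    """Return the first autodetected URL in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


text = "I like http://example.com/path/\u2015do you?"
print(free_link(BASELINE, text))  # the link swallows the dash and "do"
print(free_link(PATCHED, text))   # the link ends before the dash
```

Under the same sketch, a URL that legitimately contains one of these characters would be cut short by the patched pattern.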
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:06:52 UTC
Created attachment 2186 [details]
Patch

Fix a couple of trivial mistakes I just noticed in previous patch.
Comment 5 Brion Vibber 2006-09-11 13:01:11 UTC
This seems a weird exception. Aren't there thousands of other punctuation and control characters?
Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-12 00:57:36 UTC
Yes, but many will either a) routinely occur in URLs (in encoded form) in any
language where they'd be likely to occur in text, or b) never legitimately occur
immediately after something intended to be a URL.  If there are any other
characters to which neither of those points applies, those should probably be
added as well.  U+2018 through U+2026 would probably be good candidates, and »
(U+00BB).

(On the other hand, U+200D shouldn't be included, or at least I don't know what
precisely it does but have some vague idea it's common in some languages.  I
must have included it by mistake.)
Comment 7 Dan Collins 2011-07-12 02:59:54 UTC
This clearly has not been applied. The patch is rather restrictive and does not address many of the characters listed in the example; however, many of those characters (the hyphen, certainly, and the exclamation point and parentheses) are found in URLs. Further, I can't think of any negative effect of allowing a rather liberal character set in URLs, especially since the single-bracket syntax allows the user to explicitly delimit the link and since there is no visual difference in that workaround. Taking the initiative to close this as WONTFIX.
Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-12 22:21:35 UTC
I explicitly said in comment 2 what the reason is for the patch.  People will write stuff like "I like http://example.com/path/―do you?" with an unspaced em dash, and the autolinking will go too far.  We already stop autolinking at spaces and brackets; there's no reason not to add a few more characters to the list.

Of course, you can always use brackets instead of a free link.  But that goes both ways: if you have a URL that contains a dash and is parsed incorrectly with the change, you can always use square brackets to make the link work.  The question is, will the change correctly guess more or fewer URLs than at present?
Comment 9 au 2012-06-17 13:24:42 UTC
Hi Aryeh, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project:

   https://www.mediawiki.org/wiki/Parsoid
   https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches:

   https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a comment.

If you run into any issues with the patch process, please feel free to ask on irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!
Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2012-06-17 13:43:46 UTC
The patch is trivial, and doesn't really need review for correctness, but I never committed it because I was unsure if it would fix more pages than it broke.  If someone else wants to try, they should feel free.
Comment 11 Platonides 2012-06-17 21:49:23 UTC
It's a good point to stop URL autodetection on dashes, but I think it'd be preferable to leave the rule simply as "URLs run until whitespace" (current behavior). Closing this as WONTFIX.
Comment 12 Krinkle 2012-06-17 21:55:57 UTC
I agree with Platonides. How would a user link to an actual page with an em dash in the title? (Not counting "percent-encoding" as a viable alternative, since "nobody" knows about that.)

For example, the Wikipedia article about en-dash:

https://en.wikipedia.org/wiki/–

Requiring whitespace between the URL and the rest of the sentence seems like a sane requirement to me. And in the rare case where somebody prefers to have them next to each other without a space (whether that is even grammatically correct, I don't know), one can always use the full syntax:

"I like [http://example.com/path/ http://example.com/path]―do you?"
Comment 13 Gabriel Wicke 2012-06-17 22:00:52 UTC
I agree with Platonides: better to have simple and relatively predictable rules, and a simple workaround for the complex cases. Adding exceptions for some characters while leaving out others would make the behavior harder to predict.
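
The bracketed workaround mentioned in the comments above can be sketched as follows. This is a simplified Python model of the single-bracket syntax, assuming the URL ends at the first whitespace or closing bracket; the BRACKETED pattern and bracketed_links helper are hypothetical names, not MediaWiki code.

```python
import re

# Simplified single-bracket link: [URL optional label].
# Assumption: the URL ends at the first whitespace or closing bracket.
BRACKETED = re.compile(r"\[(https?://[^\s\]]+)(?: ([^\]]*))?\]")


def bracketed_links(text):
    """Return (url, label) pairs for each bracketed link found in text."""
    return [(m.group(1), m.group(2)) for m in BRACKETED.finditer(text)]


text = "I like [http://example.com/path/ http://example.com/path]\u2015do you?"
for url, label in bracketed_links(text):
    print(url, "|", label)
```

Because the closing bracket delimits the link explicitly, the dash that follows it never becomes part of the URL, whichever autodetection rule is in force.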
