Last modified: 2012-06-17 22:00:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T8458, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 6458 - Dashes are parsed as part of autodetected URLs


Summary:	Dashes are parsed as part of autodetected URLs

Status:	RESOLVED WONTFIX

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser
Version:	unspecified
Hardware:	All All

Importance:	Low minor with 1 vote
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://en.wikipedia.org/wiki/User:Sim...
Whiteboard:
Keywords:	newparser, patch, patch-need-review

Depends on:
Blocks:

Reported:	2006-06-27 03:55 UTC by Aryeh Gregor (not reading bugmail, please e-mail directly)
Modified:	2012-06-17 22:00 UTC
CC List:	6 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Patch (2.28 KB, patch) 2006-07-31 00:04 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details
Patch (2.23 KB, patch) 2006-07-31 00:06 UTC, Aryeh Gregor (not reading bugmail, please e-mail directly)	Details

Description Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-27 03:55:45 UTC

The provided URL demonstrates that the parser interprets any (or almost any)
character other than linebreak, tab, space, or nbsp as being part of a preceding
URL.  Only alphanumerics and -_.~!*'();:@&=+$,/?%#[] can appear in a URI of any
kind.
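
The character set listed above can be turned into a quick check. The following is a minimal Python sketch, not MediaWiki code: the names URI_CHARS and first_non_uri_char are hypothetical, and the set is copied from the report rather than from any specification.

```python
import string

# Characters the report lists as able to appear in a URI:
# alphanumerics plus -_.~!*'();:@&=+$,/?%#[]
# (copied from the bug description; an assumption, not a spec reference)
URI_CHARS = set(string.ascii_letters + string.digits + "-_.~!*'();:@&=+$,/?%#[]")


def first_non_uri_char(text):
    """Return (index, char) of the first character outside URI_CHARS, or None."""
    for i, ch in enumerate(text):
        if ch not in URI_CHARS:
            return i, ch
    return None


# An unspaced em dash (U+2014) directly after a URL is the first character
# that falls outside the listed set.
print(first_non_uri_char("http://example.com/path/\u2014more"))
```

Under this reading, a free link would end at the first character for which the check reports a hit.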

Comment 1 Platonides 2006-06-28 11:11:11 UTC

Agree, but Unicode URLs shown in 'nice form' (without % escaping) could be
affected. This shouldn't be allowed, but as several people have asked here not
to escape URLs in [[ ]], it's probable that some URLs will break. Should we care?

Comment 2 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-28 18:04:31 UTC

Hrm, good point.  MediaWiki does, of course, escape those automatically if input
as an actual URL (so em dash would be included but automatically translated to
%E2%80%95 for the href of the link).  So this might actually be a feature, not a
bug.  Note, however, that it only affects autodetected URLs, not single- or
double-bracketed links; it would be correct for those two to assume that invalid
characters are part of the link, because you know that the link only ends on a
space/] or pipe/]], respectively.

Regardless, it would be still better to just selectively exclude a few more
common punctuation marks (like em and en dash, the former of which is where I
noticed this at
<http://en.wikipedia.org/w/index.php?title=Wikipedia:Image_copyright_tags&diff=prev&oldid=60640000#Other_public_domain_images>)
and continue to parse, e.g., foreign-language characters as part of autodetected
URLs.
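
The automatic escaping mentioned above can be reproduced outside MediaWiki. This is a small Python sketch using the standard library's urllib.parse.quote, offered only as an illustration of UTF-8 percent-encoding; the helper name percent_encode is hypothetical. Note that the sequence %E2%80%95 quoted in this comment is the encoding of U+2015 (horizontal bar), while U+2014 (em dash) encodes one byte lower.

```python
from urllib.parse import quote


def percent_encode(ch):
    """Percent-encode the UTF-8 bytes of a string, leaving no character 'safe'."""
    return quote(ch, safe="")


print(percent_encode("\u2014"))  # U+2014 EM DASH        -> %E2%80%94
print(percent_encode("\u2015"))  # U+2015 HORIZONTAL BAR -> %E2%80%95
print(percent_encode("\u2013"))  # U+2013 EN DASH        -> %E2%80%93
```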

Comment 3 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:04:44 UTC

Created attachment 2185
Patch

This patch treats trailing *or* internal 0xA0, 0x2000-0x200B, 0x200D-0x2015
(various Unicode spaces/dashes) as not being part of free links.  Bracketed
links are unaffected.
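
For readers without access to the attachment, the rule described in this comment might look roughly like the Python sketch below. It is not the attached patch, which is written against MediaWiki's PHP parser and is not reproduced here; the pattern FREE_LINK and the function free_links are hypothetical names, and the scheme match is deliberately simplified to http/https.

```python
import re

# Ranges named in the comment: 0xA0, 0x2000-0x200B, 0x200D-0x2015.
# Hypothetical rule: a free link runs until whitespace or one of these.
FREE_LINK = re.compile(r"https?://[^\s\u00A0\u2000-\u200B\u200D-\u2015]+")


def free_links(text):
    """Return every autodetected URL in text under the sketched rule."""
    return FREE_LINK.findall(text)


# The em dash (U+2014) and the no-break space (U+00A0) both end the link.
print(free_links("See http://example.com/a\u2014b and http://example.com/c\u00a0d"))
# U+200C is outside the listed ranges, so it stays part of the link.
print(free_links("http://example.com/x\u200cy"))
```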

Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-07-31 00:06:52 UTC

Created attachment 2186
Patch

Fix a couple of trivial mistakes I just noticed in previous patch.

Comment 5 Brion Vibber 2006-09-11 13:01:11 UTC

This seems a weird exception. Aren't there thousands of other punctuation and control characters?

Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-12 00:57:36 UTC

Yes, but many will either a) routinely occur in URLs (in encoded form) in any
language where they'd be likely to occur in text, or b) never legitimately occur
immediately after something intended to be a URL.  If there are any other
characters to which neither of those points apply, those should probably be
added as well.  U+2018 through U+2026 would probably be good candidates, and »
(U+BB).

(On the other hand, U+200D shouldn't be included, or at least I don't know what
precisely it does but have some vague idea it's common in some languages.  I
must have included it by mistake.)

Comment 7 Dan Collins 2011-07-12 02:59:54 UTC

This clearly has not been applied. The patch is rather restrictive and does not address many of the characters listed in the example; however, many of those characters (hyphen, certainly, and exclamation point, and parenthesis) are found in URLs. Further, I can't think of any negative effect of allowing a rather liberal character set in URLs, especially since the single-bracket syntax allows the user to explicitly delimit the link and since there is no visual difference in that workaround. Taking the initiative to close this as WONTFIX.

Comment 8 Aryeh Gregor (not reading bugmail, please e-mail directly) 2011-07-12 22:21:35 UTC

I explicitly said in comment 2 what the reason is for the patch.  People will write stuff like "I like http://example.com/path/―do you?" with an unspaced em dash, and the autolinking will go too far.  We already stop autolinking at spaces and brackets; there's no reason not to add a few more characters to the list.

Of course, you can always use brackets instead of a free link.  But that goes both ways: if you have a URL that contains a dash and is parsed incorrectly with the change, you can always use square brackets to make the link work.  The question is, will the change correctly guess more or fewer URLs than at present?
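
The trade-off behind that question can be made concrete with a small Python sketch. Both patterns are simplified models assumed for illustration, not MediaWiki's actual regexes: UNTIL_WHITESPACE stands in for the current behavior, and STOP_AT_DASHES for a variant that also ends a free link at U+2010 through U+2015. The second sample URL, whose path is a single en dash, is an illustrative case of a link that the stricter rule would cut short.

```python
import re

# Simplified model of "URLs run until whitespace" (assumption, not the real parser).
UNTIL_WHITESPACE = re.compile(r"https?://\S+")
# Hypothetical variant that also stops at the dashes U+2010 through U+2015.
STOP_AT_DASHES = re.compile(r"https?://[^\s\u2010-\u2015]+")


def autolink(pattern, text):
    """Return the first URL the given rule would autodetect, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


sentence = "I like http://example.com/path/\u2015do you?"
title_url = "https://en.wikipedia.org/wiki/\u2013"

for name, rule in (("until whitespace", UNTIL_WHITESPACE),
                   ("stop at dashes", STOP_AT_DASHES)):
    print(name, "|", autolink(rule, sentence), "|", autolink(rule, title_url))
```

The first rule swallows the dash and the following word in the sentence but keeps the dash-titled URL intact; the second does the opposite. Which rule guesses right more often depends on how frequently each kind of text occurs in practice, which the sketch cannot answer.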

Comment 9 au 2012-06-17 13:24:42 UTC

Hi Aryeh, thank you for the patch!

As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project:

   https://www.mediawiki.org/wiki/Parsoid
   https://www.mediawiki.org/wiki/Visual_editor

Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches:

   https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch

Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together.

Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a comment.

If you run into any issues with the patch process, please feel free to ask on irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!

Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2012-06-17 13:43:46 UTC

The patch is trivial, and doesn't really need review for correctness, but I never committed it because I was unsure if it would fix more pages than it broke.  If someone else wants to try, they should feel free.

Comment 11 Platonides 2012-06-17 21:49:23 UTC

Stopping URL autodetection on dashes is a fair point, but I think it'd be preferable to leave the rule simply as "URLs run until whitespace" (current behavior). Closing this as WONTFIX.

Comment 12 Krinkle 2012-06-17 21:55:57 UTC

I agree with Platonides. Otherwise, how would a user make a link to an actual page with an em dash in the title (not counting percent-encoding as a viable alternative, since "nobody" knows about that)?

For example, the Wikipedia article about en-dash:

https://en.wikipedia.org/wiki/–

Requiring whitespace between the URL and the next part of the sentence seems like a sane requirement to me. And if, in the rare exception, somebody prefers to have them next to each other without a space (even if that is grammatically correct, I don't know), one can always use the full syntax:

"I like [http://example.com/path/ http://example.com/path]―do you?"

Comment 13 Gabriel Wicke 2012-06-17 22:00:52 UTC

I agree with Platonides: better to have simple and relatively predictable rules, and a simple workaround for the complex cases. Adding exceptions for some characters while leaving out others would make the behavior harder to predict.
