Last modified: 2012-06-17 22:00:52 UTC
The provided URL demonstrates that the parser interprets any (or almost any) character other than linebreak, tab, space, or nbsp as being part of a preceding URL. Only alphanumerics and -_.~!*'();:@&=+$,/?%#[] can appear in a URI of any kind.
Agree, but unicode URLs showed in 'nice form' (without % escaping) could be affected. This shouldn't be allowed, but as several people asked here to not escape urls on [[ ]], it's probably that some urls will break. Should we care?
Hrm, good point. MediaWiki does, of course, escape those automatically if input as an actual URL (so em dash would be included but automatically translated to %E2%80%95 for the href of the link). So this might actually be a feature, not a bug. Note, however, that it only affects autodetected URLs, not single- or double-bracketed links; it would be correct for those two to assume that invalid characters are part of the link, because you know that the link only ends on a space/] or pipe/]], respectively. Regardless, it would be still better to just selectively exclude a few more common punctuation marks (like em and en dash, the former of which is where I noticed this at <http://en.wikipedia.org/w/index.php?title=Wikipedia:Image_copyright_tags&diff=prev&oldid=60640000#Other_public_domain_images>) and continue to parse, e.g., foreign-language characters as part of autodetected URLs.
Created attachment 2185 [details] Patch This patch treats trailing *or* internal 0xA0, 0x2000-0x200B, 0x200D-0x2015 (various Unicode spaces/dashes) as not being part of free links. Bracketed links are unaffected.
Created attachment 2186 [details] Patch Fix a couple of trivial mistakes I just noticed in previous patch.
This seems a weird exception. Aren't there thousands of other punctuation and control characters?
Yes, but many will either a) routinely occur in URLs (in encoded form) in any language where they'd be likely to occur in text, or b) never legitimately occur immediately after something intended to be a URL. If there are any other characters to which neither of those points apply, those should probably be added as well. U+2018 through U+2026 would probably be good candidates, and » (U+BB). (On the other hand, U+200D shouldn't be included, or at least I don't know what precisely it does but have some vague idea it's common in some languages. I must have included it by mistake.)
This clearly has not been applied. The patch is rather restrictive and does not address many of the characters listed in the example, however many of those characters (hyphen, certainly, and exclamation point, and parenthesis) are found in URLs. Further, I can't think of any negative effect of allowing a rather liberal character set in urls, especially since the single bracket syntax allows the user to explicitly delimit the link and since there is no visual difference in that workaround. Taking the initiative to close this wontfix.
I explicitly said in comment 2 what the reason is for the patch. People will write stuff like "I like http://example.com/path/―do you?" with an unspaced em dash, and the autolinking will go too far. We already stop autolinking at spaces and brackets, there's no reason not to add a few more characters to the list. Of course, you can always use brackets instead of a free link. But that goes both ways: if you have a URL that contains a dash and is parsed incorrectly with the change, you can always use square brackets to make the link work. The question is, will the change correctly guess more or fewer URLs than at present?
Hi Aryeh, thank you for the patch! As you may already know, MediaWiki is currently revamping its PHP-based parser into a "Parsoid" prototype component, to support the rich-text Visual Editor project: https://www.mediawiki.org/wiki/Parsoid https://www.mediawiki.org/wiki/Visual_editor Folks interested in enhancing the parser's capabilities are very much welcome to join the Parsoid project, and contribute patches as Git branches: https://www.mediawiki.org/wiki/Git/Tutorial#How_to_submit_a_patch Compared to .diff attachments in Bugzilla tickets, Git branches are much easier for us to review, refine and merge features together. Each change set has a distinct URL generated by the "git review" tool, which can be referenced in Bugzilla by pasting its gerrit.wikimedia.org URL as a comment. If you run into any issues with the patch process, please feel free to ask on irc.freenode.net #wikimedia-dev and the wikitext-l mailing list. Thank you!
The patch is trivial, and doesn't really need review for correctness, but I never committed it because I was unsure if it would fix more pages than it broke. If someone else wants to try, they should feel free.
It's a good point to stop URL autodetection on dashes, but I think it'd be preferible to leave the rule simply as "urls run until whitespace" (current behavior), closing this as WONTFIX.
I agree with Platonides. How would a user make a link to the actual page with an em-dash in the title (not counting "per cent encoding" as a viable alternative since "nobody" knows about that). For example, the Wikipedia article about en-dash: https://en.wikipedia.org/wiki/– Requiring whitespace to be between the url and another part of the sentence seems like a sane requirement to me. And if in the weird exception somebody prefers to have them next to each other without a space (even if that is grammatically correct, I don't know), then one can always use the full syntax: "I like [http://example.com/path/ http://example.com/path]―do you?"
I agree with Platonides- better to have simple and relatively predictable rules, and a simple workaround for the complex cases. Adding exceptions for some characters while leaving out others would make the behavior harder to predict.