Last modified: 2006-05-01 20:23:38 UTC

See T2361, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 361 - URL inside a URL breaks parsing
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Normal normal with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: parser
Duplicates: 1031 1111 1129 1301
Reported: 2004-09-03 03:31 UTC by Timwi
Modified: 2006-05-01 20:23 UTC (History)

Description Timwi 2004-09-03 03:31:16 UTC
BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=583234&group_id=34373&atid=411192
Originally submitted by Nobody/Anonymous - nobody  2002-07-18 08:48


When a URL contains another full and unescaped URL
within its query string, it is correctly parsed as a
single big URL when placed directly into the text.

However, if put in brackets as [URL] or [URL
description], the URL-in-a-bracket parsing breaks. The
brackets, URL, and description appear as plain text,
and the sub-URL gets reparsed as a standalone hyperlink.

Example:
http://www.unausa.org/newindex.asp?place=http://www.unausa.org/programs/mun.asp
appears correctly as a big link to the correct, full URL.

[http://www.unausa.org/newindex.asp?place=http://www.unausa.org/programs/mun.asp]
should display as "[1]" being a link, but instead
appears with brackets and full URL intact as text, but
the portion "http://www.unausa.org/programs/mun.asp" is
a linked URL.


Workaround: Replacing the : with %3A fixes the parsing
problem. (Of course, in this particular case only the
shorter URL is actually needed, as it will be
dynamically redirected to the longer URL.)
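
The workaround above can be sketched in a few lines of Python. This is only an illustration of the manual edit, not MediaWiki code: the function name escape_nested_schemes is hypothetical, and it assumes that every "://" after the first one belongs to a nested URL.

```python
def escape_nested_schemes(url: str) -> str:
    """Percent-encode the ':' of every scheme separator after the first.

    Hypothetical helper illustrating the %3A workaround; it assumes any
    later '://' marks the start of a nested URL.
    """
    head, sep, tail = url.partition("://")
    if not sep:
        # No scheme separator at all: nothing to escape.
        return url
    # Keep the outer 'scheme://' intact and hide the inner ones.
    return head + sep + tail.replace("://", "%3A//")


if __name__ == "__main__":
    print(escape_nested_schemes(
        "http://www.unausa.org/newindex.asp"
        "?place=http://www.unausa.org/programs/mun.asp"
    ))
```

With the inner colon encoded, only the outer URL still looks like a URL to a parser that scans for "http://".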

------------------------- Additional comments ------------------------
Date: 2002-07-19 21:06
Sender: SF user lcrocker

I'm lowering the priority on this since there's an easy 
workaround and a fix would require messing with some pretty 
stable and pretty important code. It would be nice to have,
though, so I'll leave it open.

-------------------------------------------------
Date: 2002-07-30 20:44
Sender: SF user vibber

I'm raising the priority because I've come across a case the
workaround doesn't work for.

(See http://www.wikipedia.com/wiki/Wikipedia%3AVillage_pump )

If the main URL is http and the sub-URL is *ftp*, the %3A
fix doesn't work: all ftp URLs are parsed /after/ all http
URLs, and somehow the %3A gets transformed back into a : in
the 'title' field of the link... this triggers the ftp
URL-checker, so:

[http://promo.net/cgi-promo/pg/t9.cgi?entry=120&full=yes&
ftpsite=ftp%3A//ibiblio.org/pub/docs/books/gutenberg/
Gutenberg text]

is parsed into the horrific:

<a
href='http://promo.net/cgi-promo/pg/t9.cgi?entry=120&full=yes
&ftpsite=ftp%3A//ibiblio.org/pub/docs/books/gutenberg/'
class='external'
title="http://promo.net/cgi-promo/pg/t9.cgi?entry=120&am
p;full=yes&ftpsite=<a
href="ftp://ibiblio.org/pub/docs/books/gutenberg/
class='external'
title="ftp://ibiblio.org/pub/docs/books/gutenberg/">
ftp://ibiblio.org/pub/docs/books/gutenberg/</a>">Gu
tenberg
text</a>

A simple partial fix would be to *not* unescape URL-encoded
bytes when producing the 'title' attribute for the link, so
it remains %3A and doesn't trigger the link converter.
Alternatively, find a way to not check for URLs inside HTML
tags.
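
The second alternative (not checking for URLs inside HTML tags) could look roughly like the Python sketch below. It is not the parser's actual code: the names linkify_outside_tags, TAG and FTP_URL are hypothetical, the tag pattern is deliberately naive (it assumes no ">" inside attribute values), and it does not try to skip text that is already inside an <a> element.

```python
import re

# Naive tag matcher (assumption: attribute values never contain '>').
TAG = re.compile(r"(<[^>]*>)")
# Simplified ftp URL matcher, for illustration only.
FTP_URL = re.compile(r"ftp://[^\s<>]+")


def linkify_outside_tags(html: str) -> str:
    """Turn ftp:// URLs into links, but only in text between tags."""
    pieces = TAG.split(html)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd positions hold the captured tags: leave attributes alone.
            out.append(piece)
        else:
            # Even positions hold plain text: safe to link URLs here.
            out.append(FTP_URL.sub(
                lambda m: '<a href="%s">%s</a>' % (m.group(0), m.group(0)),
                piece,
            ))
    return "".join(out)


if __name__ == "__main__":
    print(linkify_outside_tags('<span title="ftp://y/">see ftp://z/f</span>'))
```

Because the title attribute is never scanned, an ftp URL inside it cannot trigger a second link, whatever escaping was applied to it.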
-------------------------------------------------
Date: 2003-01-23 00:02
Sender: SF user nichtich

URLs inside all kinds of links should be treated as text.

For instance:

[[Sandbox|http://de.wikipedia.org]]

produces a link to http://de.wikipedia.org and not to 
[[Sandbox]] !
-------------------------------------------------
Date: 2003-03-17 07:44
Sender: nobody
Logged In: NO 

Wiki has other problems dealing with URLs that contain certain characters,
such as '*', or that contain part of another URL.
Examples:

http://example.com/*/foo/bar
[http://example.com/redir/http://www.prwatch.org some link]

As the original bug report indicated, URL escaping can be used
as a workaround:

http://example.com/%2A/foo/bar
[http://example.com/redir/%68ttp://www.prwatch.org some link]

--Sheldon Rampton (sheldon.rampton@verizon.net)
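
A quick way to check that an escaped URL of this kind still denotes the original one is to percent-decode it, as in the Python sketch below. The helper name same_after_decoding is hypothetical, and whether a given server treats the escaped and unescaped forms identically is an assumption that depends on that server.

```python
from urllib.parse import unquote


def same_after_decoding(escaped: str, original: str) -> bool:
    """Return True if percent-decoding `escaped` yields `original`."""
    return unquote(escaped) == original


if __name__ == "__main__":
    print(same_after_decoding(
        "http://example.com/%2A/foo/bar",
        "http://example.com/*/foo/bar",
    ))
    print(same_after_decoding(
        "http://example.com/redir/%68ttp://www.prwatch.org",
        "http://example.com/redir/http://www.prwatch.org",
    ))
```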
-------------------------------------------------
Date: 2004-08-07 20:30
Sender: SF user timstarling

All of these problems are now fixed except nichtich's. I also 
added URL-encoding; it seemed to me to be more user-friendly 
to allow users to paste URLs in directly. Brion's 
example:

[http://promo.net/cgi-promo/pg/t9.cgi?
entry=120&full=yes&
ftpsite=ftp%3A//ibiblio.org/pub/docs/books/gutenberg/
Gutenberg text]

This needs to become:

[http://promo.net/cgi-promo/pg/t9.cgi?
entry=120&full=yes&ftpsite=ftp%
3A//ibiblio.org/pub/docs/books/gutenberg/ Gutenberg text]

This is not backwards-compatible so may require automated 
conversion.
Comment 1 Mischa Krilov 2004-09-16 18:26:37 UTC
Had this happen to me when trying to add a link from the Internet Archive. I
replaced the "h" in the second "http" with %68; works for me.

References:
http://en.wikipedia.org/wiki/Smoot
http://web.archive.org/web/19970806205154/%68ttp://web.mit.edu/museum/fun/smoots.html
Comment 2 Rowan Collins [IMSoP] 2004-09-16 18:29:39 UTC
Just for reference, I discovered an even simpler workaround for that particular
problem earlier: you can leave out the second "http://" altogether from the
archive.org URL, and it will still function correctly. Just in case anyone was
looking for the easiest workaround.
Comment 3 JeLuF 2004-12-07 20:13:05 UTC
*** Bug 1031 has been marked as a duplicate of this bug. ***
Comment 4 Shane King 2004-12-08 06:45:45 UTC
This seems to be fixed in both 1.4 and HEAD. Can someone else please verify this,
in case I'm misunderstanding the problem?
Comment 5 JeLuF 2004-12-08 06:50:02 UTC
It's marked as fixed-in-cvs (see keywords at the top). 
It's not yet closed since the site is still running 1.3.x
Comment 6 Shane King 2004-12-08 07:09:14 UTC
Gah, I'll get the hang of the weird way bugzilla is used here someday. I assumed
if it was fixed it would be marked as FIXED, and then CLOSED once the fix is
released. So very confusing.
Comment 7 JeLuF 2004-12-08 07:20:55 UTC
If it were marked as "fixed", it wouldn't show up in search. So we couldn't yell
at people who open duplicates.
Comment 8 Marc Bejarano 2004-12-08 08:12:02 UTC
that's because the default query in the currently running b.w.o version of
bugzilla is poor.  see https://bugzilla.mozilla.org/show_bug.cgi?id=194116 for
more discussion than you ever wanted on this issue.
Comment 9 Brion Vibber 2004-12-16 05:08:03 UTC
*** Bug 1111 has been marked as a duplicate of this bug. ***
Comment 10 Ta bu shi da yu 2004-12-18 16:49:59 UTC
*** Bug 1129 has been marked as a duplicate of this bug. ***
Comment 11 Zigger 2005-01-10 22:20:39 UTC
*** Bug 1301 has been marked as a duplicate of this bug. ***
Comment 12 Zigger 2005-01-10 22:33:42 UTC
Changed summary to include the problem as at 1.4beta4, where:

URLURL             splits link and displayed text
[URLURL]           ok
[URLURL URLURL]    ok
Comment 13 Brion Vibber 2005-03-10 03:38:32 UTC
Removing fixed-in-cvs keyword, as URLURL form is still kinda broken.
Comment 14 Brion Vibber 2005-04-22 12:23:40 UTC
Added a parser test case for the failing case.
Comment 15 Jelle Zijlstra 2005-08-25 14:45:04 UTC
This bug still exists; see for example
http://nl.wikipedia.org/w/index.php?title=Afbeelding:Rob_van_de_Meeberg.jpg&diff=0&oldid=1876874

Jelle Zijlstra/Ucucha
Comment 16 JiangXin 2005-11-05 16:51:30 UTC
this patch works:

Index: ../includes/Parser.php
===================================================================
RCS file: /user/jiangxin/project/wiki/mediawiki/src/mediawiki/includes/Parser.php,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- ../includes/Parser.php	5 Nov 2005 09:06:37 -0000	1.2
+++ ../includes/Parser.php	5 Nov 2005 16:42:35 -0000	1.3
@@ -1127,6 +1127,17 @@
 		while ( $i < count( $bits ) ){
 			$protocol = $bits[$i++];
 			$remainder = $bits[$i++];
+			/* Fix BUG 361: URL within URL. (by johnson@worldhello.net) */
+			while ( !preg_match('/[\s]+$/', $remainder) ) {
+				if( $i < count( $bits) )
+				{
+					$remainder .= $bits[$i++];
+				}
+				else
+				{
+					break;
+				}				
+			}
 
 			if ( preg_match( '/^('.EXT_LINK_URL_CLASS.'+)(.*)$/s', $remainder, $m ) ) {
 				# Found some characters after the protocol that look promising
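
The merging loop added by this patch can be restated as a small Python sketch. It is a paraphrase under stated assumptions, not the MediaWiki implementation: the PROTOCOLS pattern is a stand-in for the parser's real protocol list, split_free_links is a hypothetical name, and the later step that trims the remainder with EXT_LINK_URL_CLASS is left out.

```python
import re

# Assumed protocol list; the real parser builds its own pattern.
PROTOCOLS = re.compile(r"(\b(?:https?:|ftp:))")


def split_free_links(text):
    """Split text into (protocol, remainder) pairs, merging nested URLs.

    As in the patch, a remainder that does not end in whitespace is
    glued to the following pieces, so a URL embedded in another URL
    stays part of the outer one.
    """
    bits = PROTOCOLS.split(text)
    leading = bits.pop(0)  # text before the first protocol
    pairs = []
    i = 0
    while i < len(bits):
        protocol = bits[i]
        remainder = bits[i + 1]
        i += 2
        # Keep appending pieces until the remainder ends in whitespace
        # or the input is exhausted.
        while not re.search(r"\s$", remainder) and i < len(bits):
            remainder += bits[i]
            i += 1
        pairs.append((protocol, remainder))
    return leading, pairs


if __name__ == "__main__":
    print(split_free_links("http://a/?u=http://b/ and http://c/ end"))
```

In this sketch the first pair keeps "http://b/" inside its remainder, while "http://c/" starts a new pair because the text before it ends in whitespace.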
Comment 17 Sherool 2006-04-28 17:17:38 UTC
I've come across something that might be related (not sure whether this belongs here or in a new bug): some "illegal" 
URLs can break not just the link parsing but the rendering of the whole page. Check this revision, for 
example:

http://en.wikipedia.org/w/index.php?title=Image:S%2BS.jpg&oldid=49244647

I've been able to reproduce it by copying the messed-up URL to other pages and previewing them. Sometimes the 
page breaks and sometimes everything parses OK, which is odd. I guess the order in which certain things appear on the 
page affects it somehow.
Comment 18 Antoine "hashar" Musso (WMF) 2006-05-01 19:57:34 UTC
Comment 17 can be summarized as:

http://www/?http://www/ really

==header==

Which renders incorrectly as:

<p><a href="http://www/?http://www/ really
</p><p>&lt;h2&gt;header&lt;/h2&gt;" class='external free'
title="http://www/?http://www/ really
</p><p>&lt;h2&gt;header&lt;/h2&gt;" rel="nofollow">http://www/?http://www/ really
</p><p>&lt;h2&gt;header&lt;/h2&gt;</a> really

</p>
Comment 19 Antoine "hashar" Musso (WMF) 2006-05-01 20:23:38 UTC
All occurrences above are fixed in current trunk. The case in comment 17
is fixed by r14008.

Closing bug.
