Last modified: 2010-05-15 15:33:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2289, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 289 - ">"-token in URL-tail parsed wrongly
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Hardware: All All
Importance: Normal normal with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: parser
Duplicates: 308
Depends on:
Reported: 2004-09-03 03:07 UTC by Timwi
Modified: 2010-05-15 15:33 UTC (History)

Description Timwi 2004-09-03 03:07:52 UTC
Originally submitted by Roger Persson (rogper)  2004-05-21 07:00

By coincidence I noticed that a greater-than (>) character in a URL is
rendered wrongly if it occurs as the last character of the URL.

Check this extra semicolon<hello> in the 
Check this<hello&gt strange thing


------------------------- Additional comments ------------------------
Date: 2004-05-28 09:35
Sender: SF user vibber

The HTML output is:

It looks like the HTML stripping is being done before external links, so
the "<" and ">" have become "&lt;" and "&gt;". Semicolons are legal in
links; the _final_ punctuation (not followed by linkable chars) is
stripped, but the bits in the middle are considered fair game as part of
a link, so it extends up to the "&gt" but not the final ";" (or the
other ";" that follows, which is extraneous).

Correct behavior would be to have the link cover the URL itself, then cut
off at the <. This will require parsing for external links before
stripping HTML; perhaps another placeholder step would be useful here
(might also help the longstanding URL-within-URL bug).

Bug is present in both 1.2 and current 1.3.
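
The pass-ordering problem described above can be reproduced in a few lines. This is only a rough Python sketch, not the actual Parser.php code: the URL pattern, the trailing-punctuation set, the function name and the example.org address are all assumptions made for illustration.

```python
import html
import re

# Hypothetical, simplified URL pattern: a scheme followed by any run of
# non-whitespace characters. The real parser's pattern is more involved.
URL_RE = re.compile(r"https?://\S+")

# Assumed set of punctuation that is stripped from the end of a link.
TRAILING_PUNCT = ",;.:!?"


def find_link_after_escaping(wikitext):
    """Escape HTML first, then look for an external link (the buggy order)."""
    escaped = html.escape(wikitext, quote=False)  # "<" -> "&lt;", ">" -> "&gt;"
    match = URL_RE.search(escaped)
    if match is None:
        return None
    # Only the final punctuation is stripped; "&" and ";" in the middle are
    # legal URL characters, so the escaped entities stay inside the link.
    return match.group(0).rstrip(TRAILING_PUNCT)


print(find_link_after_escaping("Check http://example.org/foo<hello> here"))
```

With this input the link runs up to "&gt" and loses only the final ";", which matches the behaviour reported in this bug; the desired result would be the link ending right before the "<".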
Comment 1 Timwi 2004-09-03 19:41:14 UTC
*** Bug 308 has been marked as a duplicate of this bug. ***
Comment 2 Brion Vibber 2004-10-10 13:08:14 UTC
Still present; added a test case to parserTests.
Comment 3 Wil Mahan 2004-10-11 00:30:52 UTC
According to RFC 2396, '<' and '>' are disallowed within URIs, and hence I added 
them to the list of prohibited characters.
Comment 4 Brion Vibber 2004-10-11 00:32:05 UTC
Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has already been done when we do the external link parsing, and & and ; _are_ allowed in URLs.
Comment 5 Wil Mahan 2004-10-11 17:05:35 UTC
(In reply to comment #4)
> Wil, right. The problem is that the conversion of < and > to &lt; and &gt; has
> already been done when we do the external link parsing, and & and ; _are_
> allowed in URLs.

Oh, I see. This should now be fixed in HEAD (Parser.php revision 1.323).
Rather than replacing external links before stripping HTML tags as
you suggested before, I just added a check for '&lt;' and '&gt;'
within external links. It's not an especially elegant solution, but
I think it will fix this without meddling with the order of
parser passes.
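
A rough illustration of this kind of check, written in Python rather than PHP and not taken from Parser.php: the pass order stays as it is, and the matched link is simply cut at the first escaped angle bracket. The pattern, the punctuation set, the function name and the example.org address are assumptions.

```python
import re

# Hypothetical, simplified URL pattern applied to already-escaped text.
URL_RE = re.compile(r"https?://\S+")

# Escaped forms of "<" and ">" that must end an external link.
ESCAPED_ANGLE_RE = re.compile(r"&(?:lt|gt);")

# Assumed set of punctuation that is stripped from the end of a link.
TRAILING_PUNCT = ",;.:!?"


def extract_link(escaped_text):
    """Find an external link in HTML-escaped text, stopping at &lt; or &gt;."""
    match = URL_RE.search(escaped_text)
    if match is None:
        return None
    url = match.group(0)
    cut = ESCAPED_ANGLE_RE.search(url)
    if cut is not None:
        url = url[:cut.start()]  # drop everything from the escaped bracket on
    return url.rstrip(TRAILING_PUNCT)


print(extract_link("Check http://example.org/foo&lt;hello&gt; here"))
print(extract_link("see http://example.org/a?x=1&amp;y=2; ok"))
```

Other entities such as "&amp;" are left alone, so "&" and ";" remain usable inside a link.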
Comment 6 Brion Vibber 2004-10-11 18:13:49 UTC
Added more test cases.
Comment 7 Wil Mahan 2004-10-11 19:01:30 UTC
(In reply to comment #6)
> Added more test cases.

Fixed one by adding '<' and '>' back to the list of disallowed chars
(I added them earlier, but then I got nervous and undid the change.)

The two cases that still fail are due to the way disallowed
characters are treated as part of the link description; if that's
a bug, it's separate from this one, IMHO.
Comment 8 Antoine "hashar" Musso (WMF) 2005-08-24 11:52:26 UTC
Issue fixed in HEAD and 1.5, all parsertests in HEAD passed successfully.
