Last modified: 2013-06-08 15:19:50 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T26918, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 24918 - Do not allow #, %, [, ], nbsp in fragment identifiers
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: html
Reported: 2010-08-24 04:27 UTC by entlinkt
Modified: 2013-06-08 15:19 UTC (History)


Attachments
Testcase (648 bytes, text/html)
2010-08-24 04:27 UTC, entlinkt
Extended testcase (771 bytes, text/html)
2010-08-24 04:58 UTC, entlinkt

Description entlinkt 2010-08-24 04:27:54 UTC
Created attachment 7648
Testcase

The characters "#", "%", "[" and "]" as well as any Unicode whitespace characters (no-break space etc.) should be banned in HTML5 IDs because they trigger a validation error if used in an href attribute. At least "#" and "%" also cause problems in practice (and not just in IE6 as the comment from r62134 suggests); see attached testcase.
Comment 1 entlinkt 2010-08-24 04:58:01 UTC
Created attachment 7649
Extended testcase

It seems that percent-encoding (the only way to avoid the validation error) does not work at all in any IE version and is implemented inconsistently in other browsers.
Comment 2 entlinkt 2010-08-24 05:59:30 UTC
It seems that the disallowed characters are based on section 2.2 of RFC 3987: ":", "/", "?", "#", "[", "]", "@" (gen-delims) minus "/" and "?" (explicitly allowed for ifragment) minus ":" and "@" (explicitly allowed for ipchar) plus "%" (special case) gives "#", "%", "[" and "]" in the end.
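
The set arithmetic described above can be checked with a short Python sketch. The variable names are illustrative only, and the character sets are simply the ones quoted in this comment, not an independent reading of RFC 3987:

```python
# Character sets as quoted in the comment above (RFC 3987, section 2.2).
gen_delims = set(":/?#[]@")

# Characters the comment says are explicitly allowed again.
allowed_in_ifragment = set("/?")
allowed_in_ipchar = set(":@")

# "%" is the special case: it is only meaningful as the start of an escape.
special_case = set("%")

disallowed = (gen_delims - allowed_in_ifragment - allowed_in_ipchar) | special_case
print(sorted(disallowed))  # ['#', '%', '[', ']']
```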

I don't know why Unicode whitespace characters aren't allowed, but the HTML5 validator complains about <a href="#&nbsp;"></a>, <a href="#&thinsp;"></a> and the like.
Comment 3 entlinkt 2010-08-24 07:19:09 UTC
See also section 3.1 of RFC 3987: Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs [...]. Please note that the number sign ("#"), the percent sign ("%"), and the square bracket characters ("[", "]") are not part of the above list and MUST NOT be converted. [...]
Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-25 17:58:18 UTC
I'm not really worried about us not following the spec, since the spec can be changed if it's unreasonable, or ignored.  If we can't reliably link to these characters, though, we should strip them.  Your attachment only illustrates behavior with "#", which is already stripped -- do "%", "[", and "]" also not work in practice?  I tested with [[mw:User:Simetrical/Id test]] and they seemed to work fine.  I didn't test percent-escaping exhaustively, though -- in particular, since you point it out, things like "%3F" are likely to be interpreted differently by different browsers, and I didn't test that in all browsers.

Stripped "%" in r71636.  Does anything else cause browsers to misbehave?  If not, I'll look into filing spec or validator bugs where possible.
Comment 5 entlinkt 2010-08-25 22:15:01 UTC
Attachment 7649 also shows inconsistent behaviour with "%". IE does not seem to support percent-encoding in fragments at all; it takes the "%" sign literally even if the two characters that follow could be hex digits. (If it did support percent-encoding in fragments, this would all be moot, since we could just percent-encode these characters.)

Other browsers seem to try to guess how "%" was meant, but do it differently: Chrome prefers to take it literally, Mozilla and Opera prefer taking it as a hex number. location.hash is different again: Mozilla decodes it, but Chrome and Opera don't.

I have not found any practical issues with "[" and "]" so far.
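
To illustrate why a literal "%" in a fragment is ambiguous, here is a minimal Python sketch of the two readings described above. It does not reproduce any browser's actual algorithm; the `readings` helper is hypothetical and uses the standard library's `urllib.parse.unquote` as a stand-in for "take it as a hex escape":

```python
from urllib.parse import unquote


def readings(fragment: str) -> dict:
    """Return the two plausible interpretations of a raw fragment."""
    return {
        "literal": fragment,           # "%" taken as-is
        "decoded": unquote(fragment),  # "%XX" taken as a hex escape
    }


# "%3F" is ambiguous: it can name an ID "%3F" or an ID "?".
# "100%" has no valid escape, so both readings agree.
for frag in ["%3F", "100%", "50%25"]:
    print(frag, readings(frag))
```

Whenever the two readings differ, a link can only work in browsers that happen to pick the reading the page author intended.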
Comment 6 entlinkt 2010-08-26 00:27:18 UTC
MediaWiki has a funny handling of these characters in external links that is exactly the other way round. The wikitext

[http://example.com/#&#x23;&#x25;&#x5B;&#x5D;]

gives this HTML:

<a href="http://example.com/##%%5B%5D">

So it lets the more problematic characters through unencoded and encodes the less problematic ones. Why that?
Comment 7 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-26 19:43:02 UTC
So the only remaining problem is that the validator complains about things like <a href="#&nbsp;"></a>?  If so, I'll look into reporting that as a spec or validator bug, and mark this FIXED.

(In reply to comment #6)
> MediaWiki has a funny handling of these characters in external links that is
> exactly the other way round. The wikitext
> 
> [http://example.com/#&#x23;&#x25;&#x5B;&#x5D;]
> 
> gives this HTML:
> 
> <a href="http://example.com/##%%5B%5D">
> 
> So it lets the more problematic characters through unencoded and encodes the
> less problematic ones. Why that?

I don't know.  I glanced at the code but didn't see an obvious reason.  It's a separate bug.
Comment 8 entlinkt 2010-08-26 23:44:05 UTC
> So the only remaining problem is that the validator complains about things like
> <a href="#&nbsp;"></a>?

Not quite. It's unclear why the HTML5 validator complains about Unicode whitespace like nbsp etc.; the RFCs give no clue. But unencoded "[" and "]" are clearly non-compliant. RFC 3987 says "... square bracket characters ... MUST NOT be converted" and then RFC 3986 says "A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]").  This is the only place where square bracket characters are allowed in the URI syntax."

> I don't know.  I glanced at the code but didn't see an obvious reason.  It's a
> separate bug.

Separate, but related. There is apparently no way to write links to sections with "[" and "]" in the title as external links (this includes permalinks) without getting them percent-encoded (more than that, it's hard to write them at all, as they clash with wiki markup).

Other than that, I'm not sure if stripping the most problematic characters is the right approach at all. It doesn't solve all compatibility issues. I've just noticed the following: Paste http://example.com/#< into Firefox' address bar. Copy from there and paste into an arbitrary text editor. You'll get http://example.com/#%3C (tested in a current Firefox 4.0 nightly), which doesn't work in IE. This happens with some funny ASCII characters like "<" and ">", but also - and that's far worse - non-ASCII characters that occur in natural language.

So Firefox users will create links that don't work in IE as long as IE doesn't understand percent encoding. Maybe we should therefore allow all characters in IDs, percent-encode where necessary (that is, just 4 ASCII characters which rarely occur in natural language anyway) and accept that this minor detail doesn't work in IE. That's at least compliant; the whole attempt to allow arbitrary Unicode characters isn't interoperable with Firefox enforcing percent encoding and IE not supporting it.
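
A minimal sketch of that proposal in Python, assuming the four characters are "#", "%", "[" and "]" as derived in comment 2. The function name `encode_fragment` is hypothetical and this is not MediaWiki code; it only shows that everything else, including non-ASCII text, would pass through unchanged:

```python
from urllib.parse import unquote

# The four ASCII characters that would need escaping (assumption from comment 2).
RESERVED_IN_FRAGMENT = "#%[]"


def encode_fragment(section_id: str) -> str:
    """Percent-encode only the reserved characters; leave the rest untouched."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in RESERVED_IN_FRAGMENT else ch
        for ch in section_id
    )


encoded = encode_fragment("C#_[draft]_100%")
print(encoded)           # C%23_%5Bdraft%5D_100%25
print(unquote(encoded))  # C#_[draft]_100%
```

A browser that decodes percent-escapes in fragments would recover the original ID; one that takes "%" literally (as IE is reported to do above) would not find the target.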
Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-08-27 17:22:32 UTC
(In reply to comment #8)
> But unencoded "[" and "]" are
> clearly non-compliant. RFC 3987 says "... square bracket characters ... MUST
> NOT be converted" and then RFC 3986 says "A host identified by an Internet
> Protocol literal address, version 6 [RFC3513] or later, is distinguished by
> enclosing the IP literal within square brackets ("[" and "]").  This is the
> only place where square bracket characters are allowed in the URI syntax."

Hmm.  We could strip those too, but it seems silly if all browsers accept them.  If the spec requires something that not all browsers support, and prohibits something equivalent that all browsers do support, the spec is broken.

> Separate, but related. There is apparently no way to write links to sections
> with "[" and "]" in the title as external links (this includes permalinks)
> without getting them percent-encoded (more than that, it's hard to write them
> at all, as they clash with wiki markup).

The sensible thing would be to urldecode() anchors automatically in external links, if that's what it takes for IE to accept them . . . if that's necessary for the links to actually work but specs prohibit it, the specs are wrong.  But that's a separate issue from a development perspective, as I said, although conceptually related.

> Other than that, I'm not sure if stripping the most problematic characters is
> the right approach at all. It doesn't solve all compatibility issues. I've just
> noticed the following: Paste http://example.com/#< into Firefox' address bar.
> Copy from there and paste into an arbitrary text editor. You'll get
> http://example.com/#%3C (tested in a current Firefox 4.0 nightly), which
> doesn't work in IE. This happens with some funny ASCII characters like "<" and
> ">", but also - and that's far worse - non-ASCII characters that occur in
> natural language.
> 
> So Firefox users will create links that don't work in IE as long as IE doesn't
> understand percent encoding.

This seems like a minor enough failure.  At worst, the very small number of people who this happens to won't make it to the right section.  Not the end of the world.

I've reported the issue to Microsoft, after verifying that it still exists in IE9PP4:

https://connect.microsoft.com/IE/feedback/details/590087/percent-encoding-fragments-hashes-anchors-does-not-work

(A [free] Microsoft Live account is needed to view.)
