Last modified: 2009-01-05 16:31:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T6515, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 4515 - The Sanitizer doesn't validate the contents of id attributes


Summary:	The Sanitizer doesn't validate the contents of id attributes

Status:	RESOLVED DUPLICATE of bug 9530

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal minor with 3 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://www.w3.org/TR/html401/types.ht...
Whiteboard:
Keywords:	need-parsertest

Duplicates:	4687 6301 13832 (view as bug list)
Depends on:	10025 4461 10009
Blocks:	html
	Show dependency tree / graph

Reported:	2006-01-07 09:04 UTC by Ævar Arnfjörð Bjarmason
Modified:	2009-01-05 16:31 UTC (History)
CC List:	7 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Ævar Arnfjörð Bjarmason 2006-01-07 09:04:39 UTC

id attributes MUST (RFC 2119) have a value that begins with a letter [A-Za-z],
MediaWiki doesn't enforce this.

== Steps to reproduce
1. Save the text "<div id=9></div>" on a page

== Actual output
<div id="9"></div>

== Expected output
No idea how we should properly go about this...

Comment 1 Ævar Arnfjörð Bjarmason 2006-01-07 09:12:03 UTC

This bug has a parsertest called "Sanitizer: Validating the contents of the id
attribute"

Comment 2 Zigger 2006-01-07 10:05:15 UTC

(In reply to comment #0)
>...
> No idea how we should properly go about this...
What would Tidy do?

Comment 3 Antoine "hashar" Musso (WMF) 2006-01-07 13:26:01 UTC

<div id="nine"></div> ? :o)

Comment 4 Ævar Arnfjörð Bjarmason 2006-01-08 02:05:04 UTC

(In reply to comment #2)
> (In reply to comment #0)
> >...
> > No idea how we should properly go about this...
> What would Tidy do?

$ echo Parser::Tidy( "<div id=9>" );
<div id="9"></div>

I.e. it has the same issue.

Comment 5 Brion Vibber 2006-06-14 06:37:50 UTC

*** Bug 6301 has been marked as a duplicate of this bug. ***

Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-06-14 06:52:04 UTC

As I understand it, Tidy explicitly does not fix invalid ids, because that
wouldn't necessarily fix the issue; invalid ids indicate a systematic error that
will inherently screw up, e.g., anchors, no matter what Tidy does.  It has to be
fixed with intelligence.

That said, a hack to kill multiple ids after Tidy tidies might be a good idea. 
If multiple or invalid ids are detected, any id after the first/any invalid id
should perhaps just be commented out with an error message in the comment.  If
someone tries to save something that explicitly (rather than via template)
contains tags that collectively contain one or more ids, a warning should be
thrown up on attempting to save.

There are rather more pressing things to worry about.  (On the non-dev side,
stub templates should use classes, not ids.)

Comment 7 Antoine "hashar" Musso (WMF) 2006-07-11 18:46:17 UTC

From 6301 : should also enforce id uniqueness

Comment 8 WulfTheSaxon 2006-09-10 20:36:45 UTC

I agree with Simetrical, stubs should use classes, not ids…

Comment 9 Philippe Verdy 2007-01-24 18:23:52 UTC

why not fixing it by adding a default compliant prefix for such ids?
So just prepend "id-" and you're done!
-> id="9" becomes id="id-9"

Hmm. in fact the standard is a bit more restrictive about which Unicode
characters are usable as identifiers, either at a start position, or at a
continuing position.

Note that there are currently *two* sets of Unicode properties (starting by
"ID_" or "XID_") but Unicode is about to make them equivalent : the only
difference is whever the MIDDLE DOT should be managed identically as a possible
continuing character, because there's an issue with the compatibility
equivalence of Catalan character L WITH MIDDLE DOT, which is a letter and valid
both as a start character or a continuation character in ids, but not its in its
equivalent when encoded as a valid L start charter, and a MIDDLE DOT continue
character).

See the ongoing Unicode Public Review about this issue and the proposed change
that would make the existing two sets of character properties for identifiers
completely equivalent.

The area where it could cause problems is possibly with HTML and XML strict
conformance for identifier names, or other programming languages (e.g. Java or
C#) that uses one of the affected set of character properties, and this
justifies the ongoing Unicode public review (because this may requiring those
language definitions to be adjusted to preserve the compatibility with their
strict conformance verifiers)

But even if we forget this very limited case, we can at least adopt full
conformance for the other characters usable in identifiers, without causing
strict HTML/XML parsers to fail and adopt automatically the old legacy
compatiblity profile support instead (which affects the rendering of pages).

Currently, it does not seem that either IE or Mozilla/Firefox do enforce the
strict validity of identifier names, and not even their uniqueness in the
document (both can return an array of elements from GetElementById(), if there
are multiple occurences of the same id, instead of failing, or returning null,
or returning a single element chosen arbitrarily), so this is not a demonstrated
major issue... for now...

(I don't know what other browsers are doing to enforce identifiers validity)

Comment 10 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-01-24 22:26:42 UTC

(In reply to comment #9)
> why not fixing it by adding a default compliant prefix for such ids?
> So just prepend "id-" and you're done!
> -> id="9" becomes id="id-9"

Fine by me.  We also have to fix anchor link generators (like for the TOC, section edit redirect, 
history, etc.), of course.

Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-04-24 16:38:03 UTC

*** Bug 4687 has been marked as a duplicate of this bug. ***

Comment 12 David 2007-10-12 17:20:06 UTC

Why not just run it through static method escapeId() in the Sanitizer class (includes/Sanitizer.php)?  If you set the $flag argument to Sanitizer::NONE, it will make sure the ID starts with a letter.  In fact, the header anchors are created with this method (though the $flag argument is defaulted to Sanitizer::INITIAL_NONLETTER).

Comment 13 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-10-12 21:08:00 UTC

Doesn't help with id uniqueness, which is a much trickier problem.  It would be a start, though.

Comment 14 David 2007-10-12 22:27:04 UTC

I know.  I've thrown in my $0.02 about that, too.   See bug #7356.

Comment 15 David 2007-12-07 21:11:04 UTC

One thing to keep in mind is that whatever mechanism is used to ensure valid id attribute values, the wiki section links must be parsed using a similar method.  If the code converts

 <span id="0 9"/>

into

 <span id="x0_9"/>

then the wiki markup

 [[Page#0 9|text]]

must result in

 <a href="/wiki/Page#x0_9" title="Page">text</a>

Comment 16 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-08 23:04:11 UTC

Which will only work if the built-in id's aren't broken.  So if we ditch id's starting with numbers, we'd better do it everywhere at once, including in hardcoded id's (although I can't think of any of those that begin with numbers).

Comment 17 David 2007-12-09 01:52:20 UTC

Since the built-in ids can't be set by some random user like the ones in an article body, I think it would be reasonable the code to throw an exception whenever an invalid id is encountered.  Just have to make sure that the code can distinguish the built-in ids from the user-supplied ones.  Put the onus of fixing them on the admin, and in fact if the built-in ones are in violation, they can be caught before an official release.

Comment 18 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-12-09 02:02:43 UTC

Hardcoded id's don't get run through the Sanitizer anyway (why would they? check them by eye . . .), so there's nowhere to throw an exception.  I think they're all valid, though, thinking about it, except that maybe I recall the Cite.php ones aren't.

Comment 19 David 2007-12-09 03:41:10 UTC

> Hardcoded id's don't get run through the Sanitizer anyway (why would they?
> check them by eye . . .)

Ah, but you're forgetting the discussion on bug 7356, to which this bug is closely related.

Additionally, though all the built-in IDs are valid out-the-box, someone who skins their own wiki might make the mistake of using an invalid ID, in which case it'd be useful to have some sort of notification.

Comment 20 Brion Vibber 2008-04-24 00:23:39 UTC

*** Bug 13832 has been marked as a duplicate of this bug. ***

Comment 21 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-01-05 16:31:09 UTC

The current summary, "The Sanitizer doesn't validate the contents of id attributes", is misleading.  The Sanitizer does validate id attributes, just not fully.  There are several cases I know of where we might have id validity problems:

1) Id's can start with invalid characters, like digits.  This is bug 9530.

2) The software itself produces duplicate id's, say in the interface.  I'm not aware of anywhere this happens in core.  Any case where this occurs should be filed as a separate bug.

3) Section id's and manually-specified id's ("manually-specified" here means something like <span id="foo">, as opposed to == foo ==) can both conflict with interface id's.  This is bug 13926 and can be dealt with there.

4) Manually-specified id's can conflict with each other and with section id's.  This is covered by bug 7356.

5) It just occurred to me that section id's can conflict with each other in obscure cases, e.g.,

== a ==
== a ==
== a_2 ==

This is fairly unlikely to occur, but might be worth fixing anyway.  I've filed it as bug 16891.


All the individual cases are covered, so I don't see any reason for a general-purpose bug to remain open.  I don't think we need a tracking bug here.  I'll mark this duplicate of bug 9530 (which is hopefully about to be FIXED), since that's what the original bug was about.

*** This bug has been marked as a duplicate of bug 9530 ***

Note You need to log in before you can comment on or make changes to this bug.

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links