Last modified: 2014-01-28 05:53:03 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T9356, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 7356 - User-specified HTML IDs can be the same as interface IDs


Summary:	User-specified HTML IDs can be the same as interface IDs

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Low normal with 4 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	http://en.wikipedia.org/wiki/User:Sim...
Whiteboard:
Keywords:

Duplicates:	7662 11625 13926 17650 21440 21856 22587 24285 29049 29480 (view as bug list)
Depends on:
Blocks:	html

Reported:	2006-09-17 19:48 UTC by Yuri Astrakhan
Modified:	2014-01-28 05:53 UTC (History)
CC List:	16 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Yuri Astrakhan 2006-09-17 19:48:53 UTC

If any header or subheader is given as == content ==, Firefox 1.5.0.7 draws
a semi-complete dashed box next to it.

Repro:
Create a page with the following text:

==content==

Preview or save, and observe the result.

Comment 1 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-17 20:20:12 UTC

I don't see anything.  Does it happen if you log out?  Does it happen at the URL
I just added to this bug?

Comment 2 Yuri Astrakhan 2006-09-18 05:31:30 UTC

That's because you capitalized the word "Content". It must be all lower case.

Comment 3 Dan Li 2006-09-18 05:41:12 UTC

The heading generates an anchor with name=id=content, which collides with the
id=content div. :(

Comment 4 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-09-18 05:56:36 UTC

Ouch.  That's nasty.  The only solution I can see would be to move all header
id's to stuff like #h-content instead of #content.  (You could also special-case
the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

Comment 5 Dan Li 2006-10-22 02:30:50 UTC

*** Bug 7662 has been marked as a duplicate of this bug. ***

Comment 6 Aryeh Gregor (not reading bugmail, please e-mail directly) 2006-10-29 18:57:50 UTC

(In reply to comment #4)
> Ouch.  That's nasty.  The only solution I can see would be to move all header
> id's to stuff like #h-content instead of #content.  (You could also special-case
> the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

Better solution: prefix all interface id's with "mw-" and then ban that from
non-interface id's.  Should be pretty simple to fix, although it will
unfortunately be slightly disruptive.

Comment 7 David 2007-10-10 19:33:14 UTC

Even if the aforementioned solutions are applied, someone could just as easily edit/create a page with the following:

    ==content==

    <span id="content">text</span>

and the same problem would exist.  Also, if you don't allow user-supplied ids/anchor names (or derived ids/anchor names from user-supplied content) to have the prefix "mw-", how would you deal with the following:

    ==mw-content==

Let's not forget templates.  If a page includes a template, it's possible that both pages use the same id/anchor name, even though within each page individually, the ids/anchor names are unique.  And I've found a similar problem with extensions that generate their own ids/anchor names like Cite. (see bug #11625)

One thing I've noticed is that if a tag is created with an ID that has characters not allowed, the parser is smart enough to single out the id and swap out the invalid characters with valid ones.

What if the parser kept a running list of all the ids and anchor names already in use?  When it replaces the invalid id/anchor name characters, it can check against the list to make sure the id/anchor name in question is not already in use.  Duplicates would be resolved the same way headers with the same text are resolved.
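The running list proposed above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the class name `IdRegistry` and its `claim` method are hypothetical, and the `_2`, `_3` suffix is assumed to mirror the way duplicate headers are numbered.

```python
class IdRegistry:
    """Hypothetical per-page registry of ids and anchor names already in use."""

    def __init__(self, taken=()):
        # Ids known before parsing starts (for example, interface ids).
        self._taken = set(taken)

    def claim(self, wanted):
        """Return `wanted` if it is free, otherwise `wanted_N` for the first free N >= 2."""
        candidate = wanted
        n = 2
        while candidate in self._taken:
            candidate = f"{wanted}_{n}"
            n += 1
        self._taken.add(candidate)
        return candidate


registry = IdRegistry()
heading_id = registry.claim("content")  # id derived from ==content==
span_id = registry.claim("content")     # id requested by <span id="content">
```

With both the header-derived id and the explicit `id` attribute going through the same registry, the second request is renamed instead of colliding with the first.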

The only issue I can see at the moment is when extensions create links to destination anchors yet to be rendered.  Let's take Cite for example.  Given the following:

    I like cheese<ref>It's true!</ref>.

    ...

    <references/>

when the "ref" tag gets rendered, a link must be created to a destination anchor that doesn't yet exist, so two things have to happen:  (a) an id/anchor name must be created on the spot, so it can be linked to the footnote (even though the footnote itself has not been created yet), and (b) all other destination anchors must be prevented from using the generated id/anchor name, without preventing the "references" tag from using it, too.
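One way to picture requirements (a) and (b) is a reservation table that records who owns each id. The sketch below is hypothetical: `AnchorTable`, the owner tokens, and the id `footnote-1` are invented for illustration and are not how Cite actually works.

```python
class AnchorTable:
    """Hypothetical table mapping each reserved id to the component that owns it."""

    def __init__(self):
        self._owner = {}  # id -> owner token, e.g. "cite" or "user"

    def reserve(self, anchor_id, owner):
        """Reserve `anchor_id` for `owner`; return True if `owner` now holds it."""
        # setdefault keeps an existing reservation and only records a new one
        # when the id was still free.
        holder = self._owner.setdefault(anchor_id, owner)
        return holder == owner


table = AnchorTable()
# (a) The <ref> tag reserves the target id before the footnote exists.
ref_ok = table.reserve("footnote-1", "cite")
# (b) Any other source asking for the same id is refused...
user_ok = table.reserve("footnote-1", "user")
# ...while <references/> (same owner) may still emit the anchor later.
references_ok = table.reserve("footnote-1", "cite")
```

A refused caller would then fall back to whatever renaming rule is used for ordinary duplicates.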

Comment 8 Roan Kattouw 2007-10-11 16:13:20 UTC

*** Bug 11625 has been marked as a duplicate of this bug. ***

Comment 9 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-10-14 01:42:49 UTC

(In reply to comment #7)
> What if the parser kept a running list of all the ids and anchor names already
> in use?  When it replaces the invalid id/anchor name characters, it can check
> against the list to make sure the id/anchor name in question is not already in
> use.  Duplicates would be resolved the same way headers with the same text are
> resolved.

Something broadly like that is, of course, the only way to fix this bug.  To begin with, though, much of the interface isn't run through the Sanitizer, so we'd have to manually (!) keep track of every single one of the hundreds of id's used in the software, which tend not to follow any rhyme or reason.  It's still doable, certainly.

Comment 10 David 2007-10-15 01:10:46 UTC

Sounds like it might be a tedious task, but not necessarily a difficult one.  Worst case scenario is that all the IDs and anchor names outside the actual article body are hard-coded into the list.  A better option is to have the surrounding HTML completely assembled before the article body is, and pass it into a method that extracts every id and anchor name and adds it to the list.
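The extraction step could look something like the sketch below, which uses Python's standard `html.parser` module. The names `IdCollector` and `collect_ids` and the sample markup are assumptions made for illustration; real skin output is far larger.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute and every <a name="..."> value."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.ids.add(value)


def collect_ids(html):
    """Return the set of ids and anchor names found in an HTML string."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Made-up stand-in for the HTML that surrounds the article body.
skin_html = '<div id="content"><a name="top"></a><div id="bodyContent"></div></div>'
interface_ids = collect_ids(skin_html)
```

The resulting set would be the seed for the list of ids that article content may not reuse.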

Comment 11 Aryeh Gregor (not reading bugmail, please e-mail directly) 2007-10-15 01:19:28 UTC

Patches are appreciated.

Comment 12 David 2009-05-07 15:15:02 UTC

*** Bug 13926 has been marked as a duplicate of this bug. ***

Comment 13 Chad H. 2009-07-29 19:00:54 UTC

*** Bug 17650 has been marked as a duplicate of this bug. ***

Comment 14 P.Copp 2009-11-08 23:11:58 UTC

*** Bug 21440 has been marked as a duplicate of this bug. ***

Comment 15 P.Copp 2009-12-15 16:24:50 UTC

*** Bug 21856 has been marked as a duplicate of this bug. ***

Comment 16 merl 2009-12-16 00:36:46 UTC

Because the heading can start with a non-ASCII letter, an invalid id is created which starts with a dot.
According to the XHTML 1.0 specification, an id has to start with [A-Za-z]. Digits and some other characters (e.g. a dot) are only allowed in the following characters.

== Überschrift ==
creates
<span class="mw-headline" id=".C3.9Cberschrift">Überschrift</span>

So a prefix to the id should solve this problem because mw-.C3.9Cberschrift would be a valid id.
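The effect can be reproduced with a small sketch. It assumes the id is formed by percent-encoding the UTF-8 bytes of the heading and swapping "%" for ".", which matches the output shown above but is only an approximation, not MediaWiki's actual routine; `legacy_anchor` and `is_valid_xhtml1_id` are hypothetical names.

```python
import re
from urllib.parse import quote

# XHTML 1.0 ids: a letter, then letters, digits, "-", "_", ":" or ".".
XHTML1_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")


def legacy_anchor(heading):
    """Approximate the dot-encoded id for a heading (assumed behaviour)."""
    escaped = quote(heading.strip().replace(" ", "_"), safe="")
    return escaped.replace("%", ".")


def is_valid_xhtml1_id(value):
    """True if `value` matches the XHTML 1.0 pattern for ids."""
    return XHTML1_ID.fullmatch(value) is not None


anchor = legacy_anchor("Überschrift")
bare_valid = is_valid_xhtml1_id(anchor)
prefixed_valid = is_valid_xhtml1_id("mw-" + anchor)
```

Here `anchor` starts with a dot and fails the check, while the prefixed form starts with a letter and passes.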

Comment 17 Aryeh Gregor (not reading bugmail, please e-mail directly) 2009-12-16 00:54:39 UTC

MediaWiki no longer outputs XHTML1 by default, but HTML5.  id's in HTML5 can be any nonempty string that doesn't contain whitespace:

http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute

Comment 18 Danny B. 2009-12-16 02:09:50 UTC

(In reply to comment #17)
> MediaWiki no longer outputs XHTML1 by default, but HTML5.  id's in HTML5 can be
> any nonempty string that doesn't contain whitespace:
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
> 

But still can (and on WMF wikis does) output XHTML1, so the solution must count with that DTD.

Comment 19 Chad H. 2010-02-22 22:08:40 UTC

*** Bug 22587 has been marked as a duplicate of this bug. ***

Comment 20 Chad H. 2010-07-06 15:00:18 UTC

*** Bug 24285 has been marked as a duplicate of this bug. ***

Comment 21 The Evil IP address 2010-09-20 20:05:20 UTC

Can't we do it here the way we do it with duplicate sections? For example, 

== Heading ==
bla bla...
== Heading ==
bla bla...

becomes 

id="Heading"
bla bla bla...
id="Heading_2"
bla bla bla...

In this case,

== content ==

should simply become id="content_2".

Comment 22 Aryeh Gregor (not reading bugmail, please e-mail directly) 2010-09-20 20:24:49 UTC

Basically, yes.  What we have to do is make a list of all the id's used by the software and blacklist them for section titles and other user-provided id's.  This is feasible to maintain if we adopt a strict policy of prefixing all software-generated id's with "mw-", which we often do already, although we're not very strict about it.  Then we can just blacklist the "mw-" prefix, in addition to a hopefully-not-expanding list of legacy unprefixed id's.

We can't feasibly check the list of interface id's used on the current page on the fly, while parsing.  This works for things the parser generates, but parser output can't depend on UI output.  The same cached parser output is stuck into a variety of skins, plus no skin at all (action=raw, API output, etc.).  So we need to get a list of all id's used anywhere in the software and ban them in all pages.

Comment 23 Krinkle 2010-09-20 20:30:30 UTC

Both sound needed (the "mw-" interface prefix, and upcounting in the headings).

By upcounting I mean what The Evil IP address mentioned above: "mw-content" would be treated like a duplicate heading. 

So that the following
== something ==
== something ==
== content ==
== mw-content ==

would become

id="something"
id="something_2"
id="content_2"
id="mw-content_2"
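
The mapping above can be sketched as follows. This is a hypothetical Python illustration: `is_reserved`, `assign_heading_ids`, and the one-entry legacy list are assumptions. A reserved id is treated as if it had already been used once, so the suffixed result may still begin with "mw-", exactly as in the example; a strict ban on the prefix would need a different rewrite.

```python
RESERVED_PREFIX = "mw-"
LEGACY_INTERFACE_IDS = {"content"}  # hypothetical, deliberately tiny list


def is_reserved(anchor_id):
    """True if the id belongs to the interface under the assumed policy."""
    return anchor_id.startswith(RESERVED_PREFIX) or anchor_id in LEGACY_INTERFACE_IDS


def assign_heading_ids(headings):
    """Assign ids in order, upcounting duplicates and reserved ids alike."""
    used = set()
    result = []
    for heading in headings:
        base = heading.strip().replace(" ", "_")
        candidate = base
        n = 2
        # A reserved base counts as already taken; suffixed forms only
        # need to be unique among the ids handed out so far.
        while candidate in used or (candidate == base and is_reserved(base)):
            candidate = f"{base}_{n}"
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result


ids = assign_heading_ids(["something", "something", "content", "mw-content"])
```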

Comment 24 Brion Vibber 2011-06-02 20:02:08 UTC

*** Bug 29049 has been marked as a duplicate of this bug. ***

Comment 25 Purodha Blissenbach 2011-06-02 21:14:20 UTC

We also have the problem that with section editing, we get ids in previews which differ from the ids in the full page. That is at least bewildering, and at worst may lead to wrong ids being copied and used elsewhere.

Editing a page closer to the beginning may lead to ids further down being renumbered. References to ids from elsewhere, e.g. via links having a fragment identifier, should ideally not break in such cases.

Comment 26 Purodha Blissenbach 2011-06-02 21:21:19 UTC

In bug 29049, it has been suggested that editors be warned when a page is saved
with duplicate id values, and that duplicates simply be accepted on a second
save, as is done for empty "Summary" fields. A toggle in Special:Preferences,
similar to the one for the handling of empty "Summary" fields, might even be
considered for the id= value checking.

Comment 27 Dan Barrett 2011-06-03 00:56:00 UTC

A warning on Save does not seem like the right approach. The ID problem is an internal, technical shortcoming of MediaWiki. Exposing this to non-technical editors would just be confusing to them.

Comment 28 Roan Kattouw 2011-06-18 20:26:06 UTC

*** Bug 29480 has been marked as a duplicate of this bug. ***

Comment 29 Daniel Friesen 2011-10-10 18:25:38 UTC

(In reply to comment #18)
> (In reply to comment #17)
> > MediaWiki no longer outputs XHTML1 by default, but HTML5.  id's in HTML5 can be
> > any nonempty string that doesn't contain whitespace:
> > 
> > http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
> > 
> 
> But still can (and on WMF wikis does) output XHTML1, so the solution must count
> with that DTD.

WMF only uses XHTML because of some bots and scripts that haven't updated yet. Eventually WMF WILL be using HTML5. And as this is a pure validation thing (browsers are not going to care if you use an XHTML doctype but actually follow HTML5's rules) we don't care about XHTML rules.
