Last modified: 2008-12-30 02:20:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T2337, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 337 - inconsistent treatment of character entities and invalid characters in titles/links


Summary:	inconsistent treatment of character entities and invalid characters in title...

Status:	RESOLVED INVALID

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 2 votes (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Duplicates:	631 2096 5731 6932 (view as bug list)
Depends on:
Blocks:	unicode

Reported:	2004-09-03 03:23 UTC by Timwi
Modified:	2008-12-30 02:20 UTC (History)
CC List:	10 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Timwi 2004-09-03 03:23:56 UTC

BUG MIGRATED FROM SOURCEFORGE
http://sourceforge.net/tracker/index.php?func=detail&aid=830206&group_id=34373&atid=411192
Originally submitted by Luc Van Oostenryck (looxix)  2003-10-25 20:39


On fr: there is an article (a stub, in fact) with the
name [[Fonction &delta; de Dirac]].
It's impossible to rename it, and worse, the software
doesn't detect that the renaming failed, so
the redirection page is still created with a bad name,
[[Fonction %CE%B4 de Dirac]].

-- Looxix
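
The bad redirect name reported above is the percent-encoded UTF-8 form of the Greek letter that the entity in the article name stands for. A minimal Python sketch, using only the standard library and not taken from MediaWiki, makes the relationship explicit:

```python
import html
from urllib.parse import quote, unquote

# The article name as reported, with a named character entity.
entity_title = "Fonction &delta; de Dirac"

# Decoding the entity yields the Greek small letter delta (U+03B4).
decoded_title = html.unescape(entity_title)

# Percent-encoding the UTF-8 bytes of that letter gives %CE%B4.
# Spaces are kept literal only to match the form quoted in the report.
encoded_title = quote(decoded_title, safe=" ")

print(decoded_title)
print(encoded_title)
print(unquote(encoded_title) == decoded_title)
```

So the two names in the report are the same title in two different encodings, which is presumably why the rename could not tell them apart cleanly.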

------------------------- Additional comments ------------------------
Date: 2003-12-10 11:36
Sender: SF user vibber

Copying text of #856267, marked as duplicate of this:

 There are several ways to write a wikilink with a
superscript-2 in the destination article text:

[[User:Finlay McWalter:sandbox:m²]]

[[User:Finlay_McWalter:sandbox:m%26sup2]]

[[User:Finlay_McWalter:sandbox:m%26sup2;]]

[[User:Finlay_McWalter:sandbox:m%26sup2%3b]]

Of these, the top two resolve to the same page, and
each of the latter two resolves to a brand new page.
All three have the same article title, despite being
different articles as far as the database is concerned.

So creating the two latter pages in the above list
produced the following watchlist fragment:

NM 15:09 User:Finlay McWalter:sandbox:m² (cur; hist) .
. Finlay McWalter (Talk) (another tmp page)
M 15:08 Current events (cur; hist) . . Menchi (Talk)
(typo)
NM 15:08 User:Finlay McWalter:sandbox:m² (cur; hist) .
. Finlay McWalter (Talk) (created (superscript in URLs
thing))

So it sure looks like the "new article" code should
resolve the escaping of characters to produce the
canonical article name.

I'm [[User:Finlay McWalter]] on the English Wikipedia.
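
As an illustration of what "resolving the escaping to produce the canonical article name" could mean, here is a hedged Python sketch. The function `canonical_title` is hypothetical and is not MediaWiki's implementation; it simply percent-decodes, decodes character entities, and treats underscores as spaces, using the four link targets listed above.

```python
import html
from urllib.parse import unquote

def canonical_title(link_target):
    """Hypothetical canonicaliser: resolve escapes, then unify spacing."""
    text = unquote(link_target)     # %26 -> &, %3b -> ;
    text = html.unescape(text)      # &sup2; (or legacy &sup2) -> superscript two
    return text.replace("_", " ")   # underscores and spaces are equivalent

targets = [
    "User:Finlay McWalter:sandbox:m²",
    "User:Finlay_McWalter:sandbox:m%26sup2",
    "User:Finlay_McWalter:sandbox:m%26sup2;",
    "User:Finlay_McWalter:sandbox:m%26sup2%3b",
]

# Collect the distinct canonical names produced by the four spellings.
canonical = {canonical_title(t) for t in targets}
print(canonical)
```

Under these assumptions all four spellings collapse to a single name, which is the behaviour the report asks for when a new article is created.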

Comment 1 Brion Vibber 2004-09-12 09:45:09 UTC

*** Bug 462 has been marked as a duplicate of this bug. ***

Comment 2 Alan Barrett 2004-09-12 15:31:23 UTC

See test cases at [[:test:Bug462]]

Comment 3 Brion Vibber 2004-10-03 00:40:33 UTC

*** Bug 631 has been marked as a duplicate of this bug. ***

Comment 4 Wil Mahan 2004-10-13 04:36:59 UTC

(In reply to comment #2)
> See test cases at [[:test:Bug462]]

I fixed your self links example in HEAD. It looks like all your other
examples either have been fixed, or are arguably expected behavior. I
think I disagree "Foo bar" and "Foo_bar" should ever refer to different
articles.

Comment 5 Alan Barrett 2004-10-13 17:54:28 UTC

(In reply to comment #4)
> (In reply to comment #2)
> > See test cases at [[:test:Bug462]]
> 
> I fixed your self links example in HEAD. It looks like all your other
> examples either have been fixed, or are arguably expected behavior. I
> think I disagree "Foo bar" and "Foo_bar" should ever refer to different
> articles.

Most of the bugs described at [[:test:Bug462]] are still present.  I have
updated the page in an attempt to make it clearer.

I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different.  I do want
http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
with the latter page being accessible via "[[Foo&#95;bar]]".

Comment 6 Brion Vibber 2004-10-13 18:00:04 UTC

(In reply to comment #5)
> I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different.  I do want
> http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
> with the latter page being accessible via "[[Foo&#95;bar]]".

Would you mind explaining the logic behind this? I'm quite boggled.

Comment 7 Alan Barrett 2004-10-13 18:23:12 UTC

=== Suggested fix ===

1. The parser should first examine the raw wikitext, looking for links in square brackets.
2. For each link, the canonicalisation algorithm should be performed (ignore leading and trailing 
spaces, treat space and underline as the same, etc.).
3. After that canonicalisation step, HTML entities (&amp;, &#123;, etc.) should be mapped to 
the corresponding Unicode characters.

The existing observed behaviour is consistent with step 3 being done first instead of last.
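
The ordering suggested in this comment can be sketched in Python as follows. This is only an illustration under stated assumptions: `canonicalise` is a hypothetical stand-in covering just the two rules named in step 2, and `proposed_order` is not an existing MediaWiki function.

```python
import html

def canonicalise(title):
    """Hypothetical stand-in for step 2 (only the rules named in the comment)."""
    title = title.strip()            # ignore leading and trailing spaces
    return title.replace("_", " ")   # treat space and underline as the same

def proposed_order(link_target):
    """Step 2 first, then step 3: decode entities only after canonicalising."""
    return html.unescape(canonicalise(link_target))

for target in ("Foo bar", "Foo_bar", "Foo&#95;bar"):
    print(target, "->", proposed_order(target))
```

With this ordering "Foo bar" and "Foo_bar" still name the same page, while "Foo&#95;bar" yields a title containing a literal underscore, which is the distinction asked for in comment 5.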

Comment 8 Brion Vibber 2004-10-13 18:25:44 UTC

Entity to unicode conversion must come before canonicalization on internal links in order to perform whitespace matching 
and case conversion.
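
To see why the order matters for whitespace matching and case conversion, consider this hedged Python sketch. The `canonicalise` function is a hypothetical simplification (trim, collapse whitespace, capitalise the first letter) and the sample targets are made up for illustration; neither comes from the MediaWiki source.

```python
import html

def canonicalise(title):
    """Hypothetical canonicalisation: unify spacing, trim, capitalise."""
    title = title.replace("_", " ").strip()
    title = " ".join(title.split())        # collapse runs of whitespace
    return title[:1].upper() + title[1:]   # first letter is upper-cased

def decode_first(link_target):
    """Entities become characters before canonicalisation sees them."""
    return canonicalise(html.unescape(link_target))

def decode_last(link_target):
    """Canonicalisation runs on the raw entity text; decoding happens after."""
    return html.unescape(canonicalise(link_target))

for target in ("&eacute;cole", "Foo&#32;&#32;bar"):
    print(repr(decode_first(target)), repr(decode_last(target)))
```

When decoding happens last, the capitalisation step only sees the "&" of "&eacute;" and the whitespace step never sees the spaces hidden in "&#32;", so the resulting titles escape both rules.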

Comment 9 Alan Barrett 2004-10-13 18:37:12 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different.  I do want
> > http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
> > with the latter page being accessible via "[[Foo&#95;bar]]".
> 
> Would you mind explaining the logic behind this? I'm quite boggled.

Major premise: All characters should be allowed in page names,
    even if it is difficult to use some characters.
Minor premise: Numeric entity refs are a good way of referring to
    characters that are otherwise difficult to include in a page name.

Almost all my other arguments follow from that.

Comment 10 Rowan Collins [IMSoP] 2004-10-13 19:31:27 UTC

One of the major arguments for "%20" being treated the same as "_" (and this may
apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name
of a page" into the address bar of a web browser will be converted, by the
browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be
99.9999999% sure that what the user was after was the page
"en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if
they typed [[Name of a page]] in the text of an article; thus, it's pretty clear
to me that we should never have an article whose literal title is
"Name%20of%20a%20page".

As far as I can see, the treatment of spaces and underscores is currently a)
completely consistent; and b) consistent in a very useful manner: it is
impossible to create an article whose title looks different from the only way of
actually linking to it. Such an article would be an absolute nightmare to
maintain (page moves, deletion, just plain trying to link there and not
happening to use the same escape sequence as the original author). In my
opinion, this goes for the other "problem" characters too: if they're illegal in
titles, they should be illegal in titles; but I grant that some, like leading
'/' or '#' could conceivably be useful. It seems to me, though, that having to
use some unnatural escape sequence whenever you need to refer to an article is
going to create more head-aches than it will solve (think newbies...).
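
The address-bar case described above can be sketched in Python. This is an illustration only: `title_from_path` is a hypothetical helper, not a MediaWiki function, and it assumes that percent-decoding followed by treating underscores as spaces is all that is needed.

```python
from urllib.parse import quote, unquote

typed = "Name of a page"

# What a browser sends when the title is typed with spaces.
browser_path = quote(typed)
# What a wiki link to the same title uses as the word separator.
wiki_path = typed.replace(" ", "_")

def title_from_path(path):
    """Hypothetical helper: decode the path, then unify spacing."""
    return unquote(path).replace("_", " ")

print(browser_path)
print(wiki_path)
print(title_from_path(browser_path) == title_from_path(wiki_path))
```

Both paths map back to one title, so a user who types spaces and a user who follows a wiki link end up on the same page.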

Re-casting the problem, I wonder if a mechanism to display the page's title (in
the HTML output) as something different from its name (in the database) could be
created, which showed the real name (as needed for linking to the article)
underneath:
<h1>C#</h1>
<p><small>[Article title: C_sharp]</small></p>
Except I'm not sure how to label the second line so that it would make sense to
inexperienced users. My thought is that this could be a magic word at the
beginning of the article: '#TITLE C#'; similarly, one could use '#TITLE h2g2' to
display the lower-case leading letter on a wiki where this was otherwise not
possible.

Comment 11 Alan Barrett 2004-10-13 22:03:32 UTC

(In reply to comment #10)
> One of the major arguments for "%20" being treated the same as "_" (and this may
> apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name
> of a page" into the address bar of a web browser will be converted, by the
> browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be
> 99.9999999% sure that what the user was after was the page
> "en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if
> they typed [[Name of a page]] in the text of an article; thus, it's pretty clear
> to me that we should never have an article whose literal title is
> "Name%20of%20a%20page".

OK, I see your point, but I would expect to get an error if I attempted
to browse to the wrong URL by using %20 instead of underline as a word
separator.

> Re-casting the problem, I wonder if a mechanism to display the page's title (in
> the HTML output) as something different from its name (in the database) could be
> created, which showed the real name (as needed for linking to the article)
> underneath:

Yes, that would be fine.  If, in the wikitext for http://en.wikipedia.org/wiki/C_plus_plus
and http://en.wikipedia.org/wiki/H2gh,
I could say "#TITLE C++" and "#TITLE h2gh", and if that modified the 
<TITLE> and <H1> elements of the HTML output, then I wouldn't mind that
the articles are filed in the database under slightly incorrect names.

> <h1>C#</h1>
> <p><small>[Article title: C_sharp]</small></p>
> Except I'm not sure how to label the second line so that it would make sense to
> inexperienced users.

Perhaps "To link to this article, use [[C sharp]]."  Put it as close to the H1
heading as possible, and use a stylesheet to hide it in print media.

See [[:en:Template:Wrongtitle]] and [[:en:Wikipedia:Naming conventions (technical restrictions)]]
(and the corresponding talk pages) for relevant discussion.

Comment 12 Brion Vibber 2004-10-13 22:07:49 UTC

Please continue the alternate title display discussion at bug 496, where it is on-topic.

Comment 13 Rowan Collins [IMSoP] 2004-10-14 15:39:23 UTC

In the discussion for bug 707, someone spotted that (in 1.3.x) one can use links
such as [[foo<nowiki>+</nowiki>bar]], and they will be treated as valid links,
with the characters in question not being escaped in any way. This is rather
handy for interwiki-links (as discussed there) but it hints at something rather
odd going on, and creates strange behaviour for an internal link:
[[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance.
What's more, the version running on the test server doesn't deal at all well
with this markup, leaving un-replaced placeholders: see
http://test.wikipedia.org/wiki/Bug707 

I know this isn't exactly the same as what we've been talking about so far, but
it's certainly a related issue: how *should* such markup be treated?

(In reply to comment #11)
> OK, I see your point, but I would expect to get an error if I attempted
> to browse to the wrong URL by using %20 instead of underline as a word
> separator.

But that's a developer's way of seeing it, not a user's: as far as the user is
concerned, words are separated by spaces in links, and so they will type them
separated by spaces in the URL. They may never notice that in one " " becomes
"_" and in the other " " becomes "%20", and certainly don't care; they have no
conception that they are "using %20 instead of underline as a word separator."

(In reply to comment #12)
> Please continue the alternate title display discussion at bug 496, where it is
> on-topic.

My apologies: I should have thought to search for existing bugs relating to this
suggestion; I've copied those comments there.

Comment 14 Wil Mahan 2004-10-15 17:54:59 UTC

(In reply to comment #13)
> [[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance.
> What's more, the version running on the test server doesn't deal at all well
> with this markup, leaving un-replaced placeholders: see
> http://test.wikipedia.org/wiki/Bug707 

This should be fixed in HEAD; thanks for pointing that out.

Comment 15 Wikipedia:en:User:Paddu 2004-11-09 21:11:41 UTC

http://test.wikipedia.org/wiki/Bug707 currently produces this HTML:

<ul>
<li>[[foo+bar]]</li>
<li>[[C++]]</li>
<li><!--IWLINK 0--></li>
<li>[[meta:foo+bar]]</li>
</ul>

The third line is obviously a bug irrespective of how the others are treated.
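
One plausible reading of an "un-replaced placeholder", offered here as an assumption rather than a description of the actual MediaWiki parser, is that links are temporarily swapped for numbered comments and restored in a later pass. A minimal Python sketch with hypothetical function names:

```python
import re

PLACEHOLDER = re.compile(r"<!--IWLINK (\d+)-->")

def stash_links(wikitext, stash):
    """First pass: replace each [[prefix:target]] link with a numbered placeholder."""
    def _stash(match):
        stash.append(match.group(0))
        return "<!--IWLINK %d-->" % (len(stash) - 1)
    return re.sub(r"\[\[\w+:[^\]]*\]\]", _stash, wikitext)

def restore_links(text, stash):
    """Second pass: swap every placeholder back for its stored link."""
    return PLACEHOLDER.sub(lambda m: stash[int(m.group(1))], text)

stash = []
stashed = stash_links("see [[meta:foo+bar]]", stash)
print(stashed)                        # placeholder form
print(restore_links(stashed, stash))  # original link restored
```

If the second pass is skipped for some item, the placeholder comment is what ends up in the HTML, which would match the third list item shown above.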

Comment 16 Wikipedia:en:User:Paddu 2004-11-09 21:31:10 UTC

[[en:User:SirJective/Parenthesis]] has another example of a problematic
link/title. I've described the workarounds in the talk page. IMHO the user
shouldn't've been allowed to create the page:
	613 commandments ( ''mitzvot'' )
in the first place. Having a page with such a title which must be linked only as:
	613 commandments %28 %27%27mitzvot%27%27 %29
or similar is undesirable.
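
The escaped form quoted above is plain percent-encoding of the parentheses and apostrophes. A short Python sketch (standard library only; keeping spaces literal is a choice made here to match the quoted text) reproduces it:

```python
from urllib.parse import quote, unquote

title = "613 commandments ( ''mitzvot'' )"

# Percent-encode everything except unreserved characters and spaces.
escaped = quote(title, safe=" ")

print(escaped)
print(unquote(escaped) == title)
```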

PS: probably my previous comment adds nothing more to what was already said
(though I couldn't understand what "unreplaced placeholder" meant). Sorry about
that.

Comment 17 Wikipedia:en:User:Paddu 2004-11-09 21:36:16 UTC

Oops! Another goof up & another spam from me :(. The link is
[[en:User:SirJective/Parenthesis/other]]. Bugzilla should also have a preview
feature like mediawiki :).

Comment 18 Brion Vibber 2005-05-07 05:52:54 UTC

*** Bug 2096 has been marked as a duplicate of this bug. ***

Comment 19 lɛʁi לערי ריינהארט 2005-10-12 14:35:23 UTC

This bug is still open:

See [[en:User:Gangleri/tests/bugzilla:00337]] about [[&rlm;]] (this is
[[&amp;rlm;]]) and generates http://en.wikipedia.org/wiki/%E2%80%8F .

Comment 20 lɛʁi לערי ריינהארט 2005-10-12 14:49:43 UTC

&lrm; &rlm; &#8234; &#8235; &#8236; &#8237; &#8238; alone does not make much
sense for titles. I would say this is more or less "whitespace".
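
A hedged Python sketch of the idea that these marks behave like whitespace in a title. The helper `is_effectively_blank` is hypothetical and is not part of MediaWiki; the set of characters is taken from the entities listed above.

```python
import html
from urllib.parse import quote

# Direction marks and embedding/override controls named in the comment.
DIRECTION_CONTROLS = {html.unescape(e) for e in (
    "&lrm;", "&rlm;", "&#8234;", "&#8235;", "&#8236;", "&#8237;", "&#8238;",
)}

def is_effectively_blank(title):
    """Hypothetical check: nothing but whitespace or direction controls."""
    return all(ch.isspace() or ch in DIRECTION_CONTROLS for ch in title)

rlm_title = html.unescape("&rlm;")
print(quote(rlm_title))                 # percent-encoded RIGHT-TO-LEFT MARK
print(is_effectively_blank(rlm_title))
```

A title check along these lines could reject [[&rlm;]] in the same way an all-space title is rejected.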

Regards Reinhardt [[user:gangleri]]

Comment 21 lɛʁi לערי ריינהארט 2005-12-11 20:04:23 UTC

changed Component to "Page rendering"
bug 462: numeric entity references for problematic characters
is no longer a duplicate of this bug

opened an unsolved issue at
bug 4250: Escaped generation of [[foo|bar]] does not render properly
Please read comments about it at bug 462 comment 2.

best regards reinhardt [[user:gangleri]]

Comment 22 lɛʁi לערי ריינהארט 2005-12-13 03:23:22 UTC

(In reply to comment #10)

> ... Such an article would be an absolute nightmare to
> maintain (page moves, deletion, just plain trying to link there and not
> happening to use the same escape sequence as the original author). In my
> opinion, this goes for the other "problem" characters too: if they're illegal in
> titles, they should be illegal in titles; but I grant that some, like leading
> '/' or '#' could conceivably be useful. It seems to me, though, that having to
> use some unnatural escape sequence whenever you need to refer to an article is
> going to create more head-aches than it will solve (think newbies...).

I agree: "would be an *absolute* *nightmare* to *maintain* (page moves,
deletion, just plain trying to link there and not happening to use the same
escape sequence as the original author)."
Regarding "illegal characters", see below.

I agree: "if they're illegal in titles, they should be illegal in titles; but I
grant that some, like leading '/' or '#' could conceivably be useful."

> Re-casting the problem, I wonder if a mechanism to display the page's title (in
> the HTML output) as something different from its name (in the database) could be
> created, which showed the real name (as needed for linking to the article)
> underneath:

I can not (I do not like to) provide / propose a "markup" here but with the
examples from below it should be possible to solve this with <charinsert>.

----

I made some testcase for the original links from comment 0 at
http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2
.

During the various previews I made in order to generate the testcase I realised
that it *is* possible to generate titles containing characters which are *not*
allowed in titles. I know also the method to create them as first character (and
also to generate titles starting with lowercase letters).

Please do not misunderstand me. I do not like to *hack* MediaWiki - I only
want to report what I have seen. I also want to refer to various requests (can
not find all bug numbers now)
- allow titles starting with lowercase letters
-- bug 496: Override title text and formatting from page markup
-- bug 2118: patch to let mediawiki display the title lowercase in
wgCapitalLinks mode
- allow titles containing the characters which are *not* allowed in titles
These are requests made by others, not by me.

Before describing the method I want to point at two issues:
1) Would the "normalisation function" be stable enough to be applied multiple
times because of how the code / implementation of the whole package is *now*?
Else changing and maintaining the code would be a *nightmare* as Rowan said.
2) What benefit would the users have if there is a tricky way to generate titles
that they want (all using %nn coding) but they would not have the keyboard /
knowledge / skills to generate these easily and / or to refer / link to them easily?
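
The stability question in issue 1 can be made concrete with a small Python sketch. The `normalise` function is hypothetical (percent-decode, unify spacing, trim) and is not MediaWiki's; it only shows how a normalisation that decodes escapes can give a different result when applied twice.

```python
from urllib.parse import unquote

def normalise(title):
    """Hypothetical normalisation: percent-decode, unify spacing, trim."""
    return unquote(title).replace("_", " ").strip()

def is_stable(title):
    """True if normalising twice gives the same result as normalising once."""
    once = normalise(title)
    return normalise(once) == once

print(is_stable("Foo%20bar"))   # decodes to "Foo bar", nothing left to decode
print(is_stable("100%2520"))    # first pass leaves "100%20", second pass changes it
```

Any code path that might normalise an already normalised title has to tolerate the second case, or the decoding step has to be confined to a single well-defined place.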


The *new* issue for me was that %nn is a method to generate the characters which
are not allowed in titles. &nn alone would not work as "first characters" but you
/ we could use for example *one* and *only* one leading Unicode Character ZERO
WIDTH SPACE - U+200B
http://www.fileformat.info/info/unicode/char/200b/index.htm
HTML Entity (decimal) &#8203; (hex) &#x200b;
UTF-8 (hex) 0xE2 0x80 0x8B (e2808b) %E2%80%8B %e2%80%8b
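
The representations listed above can be cross-checked with a few lines of Python (standard library only; nothing here is MediaWiki-specific):

```python
import html
from urllib.parse import quote, unquote

zwsp = "\u200b"  # ZERO WIDTH SPACE

print(ord(zwsp))                    # decimal code point
print(zwsp.encode("utf-8").hex())   # UTF-8 bytes as hex
print(quote(zwsp))                  # percent-encoded form

# Both entity spellings and both percent-encoded spellings decode to it.
same = (html.unescape("&#8203;") == html.unescape("&#x200b;") == zwsp
        and unquote("%E2%80%8B") == unquote("%e2%80%8b") == zwsp)
print(same)
```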

There are requirements (bug reports) to disallow certain characters. If ZERO
WIDTH SPACE would be disallowed also, it might be wise to allow it *only*
a) before the characters which are *not* allowed in titles
b) before a lower case letter
These are simple rules.

Made some tests at http://test.leuksman.com/view/Category:Bugzilla/00337 .
The titles there "look" like "/", starting "?", starting ":" etc.
Was not able to find a way to generate a title that "looks" like "/".

best regards reinhardt [[user:gangleri]]

Comment 23 Ævar Arnfjörð Bjarmason 2005-12-13 03:24:26 UTC

changed summary: "illegal" => "invalid", the characters in question are invalid,
they are not a violation of the law.

Comment 24 lɛʁi לערי ריינהארט 2005-12-13 04:06:05 UTC

(In reply to comment #22)
> I made some testcase for the original links from comment 0 at
>
http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2

http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6904#m.C2.B2

> Was not able to find a way to generate a title that "looks" like "/".

Was not able to find a way to generate a title that "looks" like "#".
There is an example which should *not* break apache's using "&#8203;/" [[&#8203;/]].

Comment 25 Brion Vibber 2006-04-27 17:27:11 UTC

*** Bug 5731 has been marked as a duplicate of this bug. ***

Comment 26 Brion Vibber 2006-08-12 04:08:15 UTC

*** Bug 6932 has been marked as a duplicate of this bug. ***

Comment 27 Brion Vibber 2008-12-30 02:20:30 UTC

I think all the relevant bits got separated out to other bugs (and most if not all fixed) over the years. The core premise of this bug seems to have been a request to do things in the *opposite* order from what we want to be doing (comment 7, 9).

Resolving INVALID.
