Last modified: 2008-12-30 02:20:30 UTC
BUG MIGRATED FROM SOURCEFORGE http://sourceforge.net/tracker/index.php?func=detail&aid=830206&group_id=34373&atid=411192

Originally submitted by Luc Van Oostenryck (looxix) 2003-10-25 20:39

On fr: there is an article (a stub, in fact) with the name [[Fonction δ de Dirac]]. It's impossible to rename it and, worse, the software doesn't detect that the renaming failed, so the redirection page is still created with a bad name: [[Fonction %CE%B4 de Dirac]].

-- Looxix

------------------------- Additional comments ------------------------

Date: 2003-12-10 11:36
Sender: SF user vibber

Copying text of #856267, marked as duplicate of this:

There are several ways to write a wikilink with a superscript-2 in the destination article text:

[[User:Finlay McWalter:sandbox:m²]]
[[User:Finlay_McWalter:sandbox:m%26sup2]]
[[User:Finlay_McWalter:sandbox:m%26sup2;]]
[[User:Finlay_McWalter:sandbox:m%26sup2%3b]]

Of these, the top two resolve to the same page, and each of the latter two resolves to a brand new page. All three have the same article title, despite being different articles as far as the database is concerned. So creating the two latter pages in the above list produced the following watchlist fragment:

NM 15:09 User:Finlay McWalter:sandbox:m² (cur; hist) . . Finlay McWalter (Talk) (another tmp page)
M 15:08 Current events (cur; hist) . . Menchi (Talk) (typo)
NM 15:08 User:Finlay McWalter:sandbox:m² (cur; hist) . . Finlay McWalter (Talk) (created (superscript in URLs thing))

So it sure looks like the "new article" code should resolve the escaping of characters to produce the canonical article name.

I'm [[User:Finlay McWalter]] on the English Wikipedia.
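A minimal sketch of what "resolve the escaping of characters" could mean for the four link spellings above, written in Python rather than taken from MediaWiki's code. The helper name `resolve_title` is hypothetical, and the sketch assumes that percent-escapes are undone before HTML entities are decoded.

```python
import html
from urllib.parse import unquote

def resolve_title(link_target: str) -> str:
    """Hypothetical helper: undo percent-escapes, then HTML entities."""
    # "%26" becomes "&" and "%3b" becomes ";"
    decoded = unquote(link_target)
    # "&sup2;" (or the legacy form without the semicolon) becomes "²"
    return html.unescape(decoded)

variants = ["m²", "m%26sup2", "m%26sup2;", "m%26sup2%3b"]
titles = {resolve_title(v) for v in variants}
```

Under these assumptions all four spellings collapse to the single title "m²", so only one page would ever be created.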
*** Bug 462 has been marked as a duplicate of this bug. ***
See test cases at [[:test:Bug462]]
*** Bug 631 has been marked as a duplicate of this bug. ***
(In reply to comment #2)
> See test cases at [[:test:Bug462]]

I fixed your self links example in HEAD. It looks like all your other examples either have been fixed, or are arguably expected behavior. I think I disagree that "Foo bar" and "Foo_bar" should ever refer to different articles.
(In reply to comment #4)
> (In reply to comment #2)
> > See test cases at [[:test:Bug462]]
>
> I fixed your self links example in HEAD. It looks like all your other
> examples either have been fixed, or are arguably expected behavior. I
> think I disagree "Foo bar" and "Foo_bar" should ever refer to different
> articles.

Most of the bugs described at [[:test:Bug462]] are still present. I have updated the page in an attempt to make it clearer.

I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different, with the latter page being accessible via "[[Foo_bar]]".
(In reply to comment #5)
> I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want
> http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
> with the latter page being accessible via "[[Foo_bar]]".

Would you mind explaining the logic behind this? I'm quite boggled.
=== Suggested fix ===
1. The parser should first examine the raw wikitext, looking for links in square brackets.
2. For each link, the canonicalisation algorithm should be performed (ignore leading and trailing spaces, treat space and underline as the same, etc.).
3. After that canonicalisation step, HTML entities (&, {, etc.) should be mapped to the corresponding Unicode characters.

The existing observed behaviour is consistent with step 3 being done first instead of last.
Entity to unicode conversion must come before canonicalization on internal links in order to perform whitespace matching and case conversion.
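To make the ordering question concrete, here is a small Python sketch, not MediaWiki code. The `canonicalize` function is a simplified, hypothetical stand-in for the real title normalisation (trim, treat space and underscore alike, uppercase the first letter). It shows that the two orders give different results for a link written with numeric entities.

```python
import html
import re

def canonicalize(title: str) -> str:
    """Simplified stand-in for title canonicalisation (an assumption, not MediaWiki's code)."""
    # Treat runs of spaces/underscores as one underscore and trim the ends.
    title = re.sub(r"[ _]+", "_", title).strip("_")
    # Uppercase the first character, as on a wiki with capitalised links.
    return title[:1].upper() + title[1:]

def decode_then_canonicalize(link: str) -> str:
    return canonicalize(html.unescape(link))

def canonicalize_then_decode(link: str) -> str:
    return html.unescape(canonicalize(link))

link = "&#102;oo&#32;bar"  # "foo bar" written with numeric entities
first = decode_then_canonicalize(link)   # "Foo_bar"
second = canonicalize_then_decode(link)  # "foo bar"
```

Decoding first lets whitespace matching and case conversion see the real characters. Canonicalising first only ever sees the "&" and "#" of the escapes, so the decoded space and lowercase "f" slip through untouched.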
(In reply to comment #6)
> (In reply to comment #5)
> > I don't want "[[Foo bar]]" and "[[Foo_bar]]" to be different. I do want
> > http://server/wiki/Foo_bar and http://server/wiki/Foo%20bar to be different,
> > with the latter page being accessible via "[[Foo_bar]]".
>
> Would you mind explaining the logic behind this? I'm quite boggled.

Major premise: All characters should be allowed in page names, even if it is difficult to use some characters.
Minor premise: Numeric entity refs are a good way of referring to characters that are otherwise difficult to include in a page name.
Almost all my other arguments follow from that.
One of the major arguments for "%20" being treated the same as "_" (and this may apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name of a page" into the address bar of a web browser will be converted, by the browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be 99.9999999% sure that what the user was after was the page "en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if they typed [[Name of a page]] in the text of an article; thus, it's pretty clear to me that we should never have an article whose literal title is "Name%20of%20a%20page".

As far as I can see, the treatment of spaces and underscores is currently a) completely consistent; and b) consistent in a very useful manner: it is impossible to create an article whose title looks different from the only way of actually linking to it. Such an article would be an absolute nightmare to maintain (page moves, deletion, just plain trying to link there and not happening to use the same escape sequence as the original author). In my opinion, this goes for the other "problem" characters too: if they're illegal in titles, they should be illegal in titles; but I grant that some, like leading '/' or '#' could conceivably be useful. It seems to me, though, that having to use some unnatural escape sequence whenever you need to refer to an article is going to create more headaches than it will solve (think newbies...).

Re-casting the problem, I wonder if a mechanism to display the page's title (in the HTML output) as something different from its name (in the database) could be created, which showed the real name (as needed for linking to the article) underneath:
<h1>C#</h1>
<p><small>[Article title: C_sharp]</small></p>
Except I'm not sure how to label the second line so that it would make sense to inexperienced users.
My thought is that this could be a magic word at the beginning of the article: '#TITLE C#'; similarly, one could use '#TITLE h2g2' to display the lower-case leading letter on a wiki where this was otherwise not possible.
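Returning to the "%20" point at the start of this comment, the behaviour being defended can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MediaWiki's implementation; `title_from_url_path` is a hypothetical helper name.

```python
from urllib.parse import unquote

def title_from_url_path(path_segment: str) -> str:
    """Hypothetical helper: map the part of the URL after /wiki/ to a stored title."""
    # The browser turns typed spaces into %20; decode them back first.
    text = unquote(path_segment)
    # Spaces and underscores name the same page, so store underscores.
    return text.replace(" ", "_")

typed = title_from_url_path("Name%20of%20a%20page")   # what the address bar sends
linked = title_from_url_path("Name_of_a_page")        # what [[Name of a page]] links to
```

Both inputs yield the same stored title, so no page with the literal title "Name%20of%20a%20page" can come into existence.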
(In reply to comment #10)
> One of the major arguments for "%20" being treated the same as "_" (and this may
> apply for some other examples, too) is that typing "en.wikipedia.org/wiki/Name
> of a page" into the address bar of a web browser will be converted, by the
> browser, to "en.wikipedia.org/wiki/Name%20of%20a%20page". Now we can be
> 99.9999999% sure that what the user was after was the page
> "en.wikipedia.org/wiki/Name_of_a_page", since that is what they would get if
> they typed [[Name of a page]] in the text of an article; thus, it's pretty clear
> to me that we should never have an article whose literal title is
> "Name%20of%20a%20page".

OK, I see your point, but I would expect to get an error if I attempted to browse to the wrong URL by using %20 instead of underline as a word separator.

> Re-casting the problem, I wonder if a mechanism to display the page's title (in
> the HTML output) as something different from its name (in the database) could be
> created, which showed the real name (as needed for linking to the article)
> underneath:

Yes, that would be fine. If, in the wikitext for http://en.wikipedia.org/wiki/C_plus_plus and http://en.wikipedia.org/wiki/H2gh, I could say "#TITLE C++" and "#TITLE h2gh", and if that modified the <TITLE> and <H1> elements of the HTML output, then I wouldn't mind that the articles are filed in the database under slightly incorrect names.

> <h1>C#</h1>
> <p><small>[Article title: C_sharp]</small></p>
> Except I'm not sure how to label the second line so that it would make sense to
> inexperienced users.

Perhaps "To link to this article, use [[C sharp]]." Put it as close to the H1 heading as possible, and use a stylesheet to hide it in print media.

See [[:en:Template:Wrongtitle]] and [[:en:Wikipedia:Naming conventions (technical restrictions)]] (and the corresponding talk pages) for relevant discussion.
Please continue the alternate title display discussion at bug 496, where it is on-topic.
In the discussion for bug 707, someone spotted that (in 1.3.x) one can use links such as [[foo<nowiki>+</nowiki>bar]], and they will be treated as valid links, with the characters in question not being escaped in any way. This is rather handy for interwiki-links (as discussed there) but it hints at something rather odd going on, and creates strange behaviour for an internal link: [[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance. What's more, the version running on the test server doesn't deal at all well with this markup, leaving un-replaced placeholders: see http://test.wikipedia.org/wiki/Bug707

I know this isn't exactly the same as what we've been talking about so far, but it's certainly a related issue: how *should* such markup be treated?

(In reply to comment #11)
> OK, I see your point, but I would expect to get an error if I attempted
> to browse to the wrong URL by using %20 instead of underline as a word
> separator.

But that's a developer's way of seeing it, not a user's: as far as the user is concerned, words are separated by spaces in links, and so they will type them separated by spaces in the URL. They may never notice that in one " " becomes "_" and in the other " " becomes "%20", and certainly don't care; they have no conception that they are "using %20 instead of underline as a word separator."

(In reply to comment #12)
> Please continue the alternate title display discussion at bug 496, where it is on-topic.

My apologies: I should have thought to search for existing bugs relating to this suggestion; I've copied those comments there.
(In reply to comment #13)
> [[foo<nowiki>+</nowiki>bar]] produces an edit link to [[Foo_bar]], for instance.
> What's more, the version running on the test server doesn't deal at all well
> with this markup, leaving un-replaced placeholders: see
> http://test.wikipedia.org/wiki/Bug707

This should be fixed in HEAD; thanks for pointing that out.
http://test.wikipedia.org/wiki/Bug707 currently produces this HTML:
<ul>
<li>[[foo+bar]]</li>
<li>[[C++]]</li>
<li><!--IWLINK 0--></li>
<li>[[meta:foo+bar]]</li>
</ul>
The third line is obviously a bug irrespective of how the others are treated.
[[en:User:SirJective/Parenthesis]] has another example of a problematic link/title. I've described the workarounds in the talk page. IMHO the user shouldn't've been allowed to create the page: 613 commandments ( ''mitzvot'' ) in the first place. Having a page with such a title which must be linked only as: 613 commandments %28 %27%27mitzvot%27%27 %29 or similar is undesirable. PS: probably my previous comment adds nothing more to what was already said (though I couldn't understand what "unreplaced placeholder" meant). Sorry about that.
Oops! Another goof up & another spam from me :(. The link is [[en:User:SirJective/Parenthesis/other]]. Bugzilla should also have a preview feature like mediawiki :).
*** Bug 2096 has been marked as a duplicate of this bug. ***
This bug is still open: See [[en:User:Gangleri/tests/bugzilla:00337]] about [[‏]] (this is [[&rlm;]]) and generates http://en.wikipedia.org/wiki/%E2%80%8F .
‎ ‏ ‪ ‫ ‬ ‭ ‮ alone does not make much sense for titles. I would say this is more or less "whitespace". Regards Reinhardt [[user:gangleri]]
Changed Component to "Page rendering".
Bug 462 (numeric entity references for problematic characters) is no longer a duplicate of this bug.
Opened an unsolved issue at bug 4250: Escaped generation of [[foo|bar]] does not render properly.
Please read the comments about it at bug 462 comment 2.
best regards reinhardt [[user:gangleri]]
(In reply to comment #10)
> ... Such an article would be an absolute nightmare to
> maintain (page moves, deletion, just plain trying to link there and not
> happening to use the same escape sequence as the original author). In my
> opinion, this goes for the other "problem" characters too: if they're illegal in
> titles, they should be illegal in titles; but I grant that some, like leading
> '/' or '#' could conceivably be useful. It seems to me, though, that having to
> use some unnatural escape sequence whenever you need to refer to an article is
> going to create more head-aches than it will solve (think newbies...).

I agree: "would be an *absolute* *nightmare* to *maintain* (page moves, deletion, just plain trying to link there and not happening to use the same escape sequence as the original author)." Regarding "illegal characters", see below.

I agree: "if they're illegal in titles, they should be illegal in titles; but I grant that some, like leading '/' or '#' could conceivably be useful."

> Re-casting the problem, I wonder if a mechanism to display the page's title (in
> the HTML output) as something different from its name (in the database) could be
> created, which showed the real name (as needed for linking to the article)
> underneath:

I cannot (I do not like to) provide / propose a "markup" here, but with the examples below it should be possible to solve this with <charinsert>.

----

I made some test cases for the original links from comment 0 at http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2 . During the various previews I made in order to generate the test cases I realised that it *is* possible to generate titles containing characters which are *not* allowed in titles. I also know the method to create them as first character (and also to generate titles starting with lowercase letters). Please do not misunderstand me. I do not like to *hack* MediaWiki - I only want to report what I have seen.
I also want to refer to various requests (I cannot find all the bug numbers now):
- allow titles starting with lowercase letters
-- bug 496: Override title text and formatting from page markup
-- bug 2118: patch to let mediawiki display the title lowercase in wgCapitalLinks mode
- allow titles containing the characters which are *not* allowed in titles
These are requests made by others, not by me.

Before describing the method I want to point out two issues:
1) Would the "normalisation function" be stable enough to be applied multiple times, given how the code / implementation of the whole package is *now*? Otherwise changing and maintaining the code would be a *nightmare*, as Rowan said.
2) What benefit would users have if there is a tricky way to generate the titles they want (all using %nn coding) but they do not have the keyboard / knowledge / skills to generate these easily and / or to refer / link to them easily?

The *new* issue for me was that %nn is a method to generate the characters which are not allowed in titles. &nn alone would not work as "first characters", but you / we could use, for example, *one* and *only* one leading
Unicode Character ZERO WIDTH SPACE - U+200B
http://www.fileformat.info/info/unicode/char/200b/index.htm
HTML Entity (decimal) &#8203; (hex) &#x200b;
UTF-8 (hex) 0xE2 0x80 0x8B (e2808b)
%E2%80%8B %e2%80%8b

There are requirements (bug reports) to disallow certain characters. If ZERO WIDTH SPACE were also disallowed, it might be wise to allow it *only*
a) before the characters which are *not* allowed in titles
b) before a lower case letter
These are simple rules.

I made some tests at http://test.leuksman.com/view/Category:Bugzilla/00337 . The titles there "look" like "/", starting "?", starting ":" etc. I was not able to find a way to generate a title that "looks" like "/".

best regards reinhardt [[user:gangleri]]
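The different spellings of ZERO WIDTH SPACE listed in this comment can be reproduced with a few lines of Python. This is only an illustration of the encodings; it makes no claim about how MediaWiki handles the character.

```python
from urllib.parse import quote

ZWSP = "\u200b"  # ZERO WIDTH SPACE, U+200B

utf8_hex = ZWSP.encode("utf-8").hex()   # UTF-8 bytes as a hex string
percent = quote(ZWSP)                   # percent-encoded form used in URLs
decimal_entity = f"&#{ord(ZWSP)};"      # decimal numeric character reference
hex_entity = f"&#x{ord(ZWSP):x};"       # hexadecimal numeric character reference
```

All four values name the same invisible character, which is why a title that begins with it looks identical to one that does not.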
changed summary: "illegal" => "invalid", the characters in question are invalid, they are not a violation of the law.
(In reply to comment #22) > I made some testcase for the original links from comment 0 at > http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6893#m.C2.B2 http://yi.wiktionary.org/w/index.php?title=project:bugzilla/00337&oldid=6904#m.C2.B2 > Was not able to find a way to generate a title that "looks" like "/". Was not able to find a way to generate a title that "looks" like "#". There is an example which should *not* break apache's using "​/" [[​/]].
*** Bug 5731 has been marked as a duplicate of this bug. ***
*** Bug 6932 has been marked as a duplicate of this bug. ***
I think all the relevant bits got separated out to other bugs (and most if not all fixed) over the years. The core premise of this bug seems to have been a request to do things in the *opposite* order from what we want to be doing (comment 7, 9). Resolving INVALID.