Last modified: 2014-06-27 14:11:49 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T37746, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 35746 - {{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}} unusable



Status:	RESOLVED WONTFIX

Product:	MediaWiki
Classification:	Unclassified
Component:	Templates (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal with 1 vote (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	need-parsertest, parser

Duplicates:	61407 (view as bug list)
Depends on:
Blocks:	16474 67196 35628

Reported:	2012-04-06 01:49 UTC by Danny B.
Modified:	2014-06-27 14:11 UTC (History)
CC List:	8 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Danny B. 2012-04-06 01:49:04 UTC

{{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}} unusable.

{{#ifeq:{{PAGENAME}}|Q & A|true|false}} returns false on page with title "Q & A" because & is converted to &#38;

Obviously same wrong behavior with ' and " in page names.

Same goes with {{FULLPAGENAME}}.

Comment 1 Beta16 2012-04-06 07:27:08 UTC

See also bug 16474 and bug 35628

Comment 2 Niklas Laxström 2012-04-06 07:34:44 UTC

Would you rather have broken texts when '' in page name triggers italic in middle of message? That's the reason why it does escaping.

Not critical because there are easy workarounds starting from {{PAGENAME:Q & A}}.

Comment 3 Mark A. Hershberger 2012-04-06 14:30:43 UTC

Already discussed on Bug #35628

*** This bug has been marked as a duplicate of bug 35628 ***

Comment 4 Danny B. 2012-04-07 23:27:21 UTC

Although discussed in bug 35628, this is a bit different.

That bug wants to escape parser functions, this bug wants to unescape magic words.

Comment 5 Bawolff (Brian Wolff) 2012-04-07 23:42:28 UTC

(In reply to comment #0)
> {{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}}
> unusable.
> 
> {{#ifeq:{{PAGENAME}}|Q & A|true|false}} returns false on page with title "Q &
> A" because & is converted to &#38;
> 
> Obviously same wrong behavior with ' and " in page names.
> 
> Same goes with {{FULLPAGENAME}}.

I disagree. I think #ifeq et al should unescape their args.

(I suppose that would make & == &amp; but I don't entirely think that is a bad thing).
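
A minimal Python sketch of this proposal, assuming "unescape their args" means decoding HTML entities and trimming whitespace before comparing. The helper name `ifeq_unescaped` is hypothetical and is not MediaWiki code; it only models the idea.

```python
import html


def ifeq_unescaped(left: str, right: str) -> bool:
    """Compare two strings after decoding HTML entities and trimming whitespace."""
    return html.unescape(left).strip() == html.unescape(right).strip()


# What {{PAGENAME}} is reported to produce on the page "Q & A".
escaped_title = "Q &#38; A"

print(escaped_title == "Q & A")                # plain string comparison
print(ifeq_unescaped(escaped_title, "Q & A"))  # comparison after decoding
print(ifeq_unescaped("&", "&amp;"))            # the side effect noted above
```

Under this model "&" and "&amp;" compare equal, which is the trade-off mentioned in the comment.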

Comment 6 Danny B. 2012-04-07 23:53:32 UTC

(In reply to comment #5)
> I disagree. I think #ifeq et al should unescape their args.

Well, that's third approach. Wanna submit a new bug about it so later on it can be decided which approach is to be taken and other bugs can be closed in favour of that one?

Comment 7 Bawolff (Brian Wolff) 2012-04-08 00:31:28 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > I disagree. I think #ifeq et al should unescape their args.
> 
> Well, that's third approach. Wanna submit a new bug about it so later on it can
> be decided which approach is to be taken and other bugs can be closed in favour
> of that one?

I'd prefer we just kept discussion on one bug. Bugs should be about problems, not the solutions imho.

The reason I prefer to keep escaping in {{PAGENAME}} is that the escaping was introduced to work around the problem of a page named "*foo" starting a list when you put {{PAGENAME}} in a page.

Comment 8 Philippe Verdy 2014-01-29 13:22:46 UTC

The safest way to compare page names is to pass BOTH of them through {{PAGENAME:pagename}}, or BOTH through {{PAGENAMEE:pagename}}. If you also want to compare their namespaces, pass both page names as parameters to {{FULLPAGENAME:pagename}} so that the namespace of the given page name is not parsed and removed.

Note that these functions will also resolve relative paths in subpages and FULLPAGENAME(E) will also resolve the namespace.

So:
    {{#ifeq: {{PAGENAME}}|Q & A|true|false}}
will always be false on every page, but the following will work:
    {{#ifeq: {{PAGENAME}}|{{PAGENAME:Q & A}}|true|false}}
as it will return "true" on the expected page.

With full page names where you also check the namespace:
    {{#ifeq: {{FULLPAGENAME}}|{{FULLPAGENAME:Q & A}}|true|false}}
will also return true, but only in the main namespace (it will be false on a Category page named "Category:Q & A", because the second parameter of "#ifeq" gets the full page name of the page "Q & A" in the main namespace).
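
The workaround can be modelled in a few lines of Python. This is only a sketch: `escape_title` is a hypothetical stand-in that replaces just the three characters named in this report (&, ' and ") with numeric character references, whereas the real escaping in MediaWiki may cover more characters.

```python
def escape_title(title: str) -> str:
    """Hypothetical stand-in for the escaping applied by {{PAGENAME}}."""
    # The ampersand must be replaced first so that the entities
    # introduced by the later replacements are not escaped again.
    out = title.replace("&", "&#38;")
    out = out.replace("'", "&#39;").replace('"', "&#34;")
    return out


current_page = escape_title("Q & A")  # models {{PAGENAME}} on page "Q & A"

# Comparing the escaped name with a literal string fails ...
print(current_page == "Q & A")
# ... but passing the literal through the same escaping succeeds.
print(current_page == escape_title("Q & A"))
```

The point is that both sides of the comparison go through the same transformation, so the exact escaping rules no longer matter.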

-----

In summary:

* {{(FULL|BASE|SUB)PAGENAMEE:...}} return URL-encoded names
* {{(FULL|BASE|SUB)PAGENAME:...}} return HTML-encoded names

There's NO function in MediaWiki that returns the raw pagename.

-----

But note:
    {{(FULL|BASE|SUB)PAGENAMEE:...}}
is also different from
    {{URLENCODE:{{(FULL|BASE|SUB)PAGENAME:...}}}}

Because in the latter case, URLENCODE takes an HTML-encoded name as its parameter, so the result will be double-encoded: the HTML entities (containing the characters & # ;) and SPACEs will be URL-encoded using %nn and +.

But in the first case the MediaWiki-specific URL-encoding performed by PAGENAMEE is different than standard URL-encoding (it does not generate "+" for spaces, but generates underscores).

So:

1. "{{PAGENAMEE:Q & A}}"
   returns in fact "Q_%26_A"
2. "{{PAGENAME:Q & A}}"
   returns in fact "Q &#38; A"
3. "{{URLENCODE:{{PAGENAME:Q & A}}}}"
   returns in fact at least this: "Q+%26%2338;+A"
   I don't know if URLENCODE also recodes the semicolon;
   if so, the result will instead be: "Q+%26%2338%3B+A"
   In all cases this will be different from the result of case 1!
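
The difference between cases 1 and 3 can be reproduced with Python's standard library. This is a sketch under assumptions: `wiki_style_encode` is a hypothetical approximation of the PAGENAMEE encoding (spaces become underscores, everything else is percent-encoded), and `quote_plus` stands in for the default URLENCODE style. Note that a semicolon percent-encodes as %3B.

```python
from urllib.parse import quote, quote_plus


def wiki_style_encode(title: str) -> str:
    """Hypothetical approximation of PAGENAMEE: underscores for spaces."""
    return quote(title.replace(" ", "_"), safe="")


raw_title = "Q & A"
html_encoded = raw_title.replace("&", "&#38;")  # models {{PAGENAME:Q & A}}

case1 = wiki_style_encode(raw_title)  # URL-encoding of the raw title
case3 = quote_plus(html_encoded)      # URL-encoding of the HTML-encoded title

print(case1)
print(case3)
print(case1 == case3)
```

The second value is double-encoded: the entity `&#38;` itself has been percent-encoded, so it can never match the first.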

-----

This strange behavior means that there are some characters "permitted" in URLs to MediaWiki sites that are transformed in a very strange way, such as:

1. http://www.mediawiki.org/wiki/Q & A

      not directly a valid URL, but the browser transforms it to
      URL-encoding of UTF-8 and requests:

   http://www.mediawiki.org/wiki/Q%20&%20A

       the server accepts this and loads the page named "Q & A"

2. http://www.mediawiki.org/wiki/Q+%26%2338%3B+A

       the server parses this URL as containing a URL-encoded pagename,
       so it first URL-decodes it as:

            Q &#38; A

       the server will then parse the URL and will think it contains an
       anchor, it will attempt to load a page named only "Q &",
       with the anchor "38; A" dropped !

3. Valid page names may contain isolated ampersands as valid characters (internally they are HTML-encoded if you query their {{PAGENAME}}), but some sequences will generate errors,
such as "&amp;", whereas "a amp;" will be accepted...
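
The decoding steps described in item 2 can be traced with a short Python sketch. It assumes the path segment percent-encodes the title "Q &#38; A" (semicolon as %3B) and models the anchor handling as a simple split on "#". This follows the behavior as described in this comment; it is not taken from MediaWiki's actual request handling.

```python
from urllib.parse import unquote_plus

# Path segment that percent-encodes the HTML-encoded title.
path_title = "Q+%26%2338%3B+A"

# Step 1: URL-decoding turns "+" into spaces and %nn into characters.
decoded = unquote_plus(path_title)

# Step 2: the "#" that appears after decoding is treated as the start
# of an anchor, so the title is cut at that point.
page, _, anchor = decoded.partition("#")

print(repr(decoded))
print(repr(page))
print(repr(anchor))
```

Under this model the requested page becomes "Q &" and the remainder "38; A" is discarded as an anchor.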

All this is completely inconsistent, but this time it does not occur in parser functions; it occurs at the server API level when handling incoming HTTP(S) requests that may, or may not, be HTML-encoded, even though the HTTP standard says that URLs should ONLY be URL-encoded! The server also performs such double-decoding when resolving requests.

Comment 9 Philippe Verdy 2014-01-29 13:51:10 UTC

See also bug 35628 about the weird way the various parser functions interpret (or not) their input (URL-decoding, HTML-decoding, sometimes mixed up!), and how they may or may not reencode their output.

As if this were not already complex enough within ASCII only, it becomes a nightmare with non-ASCII characters, not because they are UTF-8 encoded (this is a convention) but because of non-ASCII bytes, which may represent UTF-8 sequences of a single character... or not, because MediaWiki accepts invalid Unicode characters such as U+FFFF when they are pseudo-encoded as UTF-8 and then URL-encoded using %nn hex sequences! At the API level, any %xx-encoded byte is accepted, but the UTF-8 encoding is in fact not enforced.

The server just treats *raw* sequences of bytes, filtering only some ASCII characters, but not restricting at all the range of bytes in 0x80 to 0xFF, and not later restricting the range of 16-bit code units in the full range 0x0020 to 0xFFFF (when they are used in various libraries working with UTF-16 instead of real 21-bit code points).
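
For illustration, a strict check of the kind this comment says is missing could look like the following Python sketch. The function `decode_title_strict` is hypothetical: it percent-decodes a title to bytes, requires valid UTF-8, and additionally rejects Unicode noncharacters such as U+FFFF. It says nothing about what MediaWiki actually does.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_title_strict(encoded: str) -> Optional[str]:
    """Return the decoded title, or None if it is not clean UTF-8 text."""
    raw = unquote_to_bytes(encoded)
    try:
        text = raw.decode("utf-8")  # strict: invalid byte sequences raise
    except UnicodeDecodeError:
        return None
    for ch in text:
        cp = ord(ch)
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of
        # every plane (U+xxFFFE and U+xxFFFF).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return None
    return text


print(decode_title_strict("Q%20%26%20A"))  # well-formed title
print(decode_title_strict("%FF%FE"))       # bytes that are not valid UTF-8
print(decode_title_strict("%EF%BF%BF"))    # UTF-8 form of U+FFFF
```

The last two inputs are rejected for different reasons: the first is not decodable at all, while the second decodes but yields a noncharacter.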

I wonder how this inconsistency could defeat some security restrictions, such as violating access rights on blocked pages. It is possible that one could create some weird page names via the HTTP API that will later not be accessible from any other MediaWiki page, or by wiki administrators with their online tools, and someone could maliciously create those weird page names to fill in a category or some generated MediaWiki pages that list pages in categories.

Possibly a user could also create a user account with such weird name and have his user page name inaccessible from standard blocking tools.

And CheckUser admins may have difficulty reading logs and finding the relevant users.

Comment 10 Bawolff (Brian Wolff) 2014-02-15 01:47:05 UTC

*** Bug 61407 has been marked as a duplicate of this bug. ***

Comment 11 Ryan Kaldari 2014-02-15 01:57:37 UTC

Philippe, is there any workaround for:
{{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}

This is currently broken for https://en.wikipedia.org/wiki/Template:Clickable_button_2 and I haven't come up with any way to fix it. {{URLENCODE}} doesn't always work since URL encoding isn't the same as the escaping that {{FULLPAGENAMEE}} does (apparently).

Comment 12 MZMcBride 2014-02-15 01:59:31 UTC

{{urlencode:}} has various options, as I recall. One of them probably works.

Comment 13 Bawolff (Brian Wolff) 2014-02-15 02:04:53 UTC

(In reply to Ryan Kaldari from comment #11)
> Philippe, is there any workaround for:
> {{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}

{{#ifeq:{{FULLPAGENAME:{{{1}}}}}|{{FULLPAGENAME}}|...}}

---

/me is working on a proper patch for this bug

Comment 14 Bawolff (Brian Wolff) 2014-02-15 02:47:37 UTC

(In reply to Bawolff (Brian Wolff) from comment #13)
> (In reply to Ryan Kaldari from comment #11)
> > Philippe, is there any workaround for:
> > {{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}
> 
> {{#ifeq:{{FULLPAGENAME:{{{1}}}}}|{{FULLPAGENAME}}|...}}
> 
> ---
> 
> /me is working on a proper patch for this bug

Ok, so these bugs are kind of convoluted. I submitted a fix for bug 35628 (Unencode the arguments to #ifeq:). This bug is technically asking for {{PAGENAME}} to not output encoded stuff (whatever happened to bugs are for problems not solutions?), which is not going to happen per comment 2. So closing this wontfix

Comment 15 Philippe Verdy 2014-02-19 13:57:09 UTC

{{URLENCODE:...}} supports three styles of encoding.

{{PAGENAMEE}} uses the deprecated "WIKI" style; but still with its own differences!

See [[mw:Manual:PAGENAMEE encoding]] for extensive details.

What a mess !

And yes, Bawolff (Brian Wolff) is correct about the way to fix things when comparing pagenames: you have to consistently use {{PAGENAME:...}} or {{PAGENAMEE:...}} on all texts to compare with #ifeq: and #switch. This trick should also continue working after the proposed patch of #ifeq: and #switch, which is meant to decode HTML entities (in addition to trimming them) in their parameters before comparing strings, even if they continue to return strings with HTML entities.
