Last modified: 2011-01-25 01:19:57 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T19031, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 17031 - Sanitizer rejects valid xhtml attributes
Sanitizer rejects valid xhtml attributes
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Parser (Other open bugs)
1.15.x
All All
: Normal enhancement with 1 vote (vote)
: ---
Assigned To: Sean Colombo
: need-parsertest, patch, patch-need-review
: 24749 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-01-14 23:57 UTC by Garrett
Modified: 2011-01-25 01:19 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Sanitizer.php patch for bug 17031 (581 bytes, patch)
2009-01-15 19:35 UTC, Garrett
Details
Test Extension (stest) (726 bytes, application/x-httpd-php)
2009-01-16 22:03 UTC, Garrett
Details
Maintenance-script test-driver for parser test. (2.94 KB, application/octet-stream)
2010-08-11 07:08 UTC, Sean Colombo
Details
Output of the parser testing. (2.21 KB, text/plain)
2010-08-11 07:12 UTC, Sean Colombo
Details

Description Garrett 2009-01-14 23:57:52 UTC
Sanitizer will reject attributes found in tags as invalid even though they are technically valid.

Example:
<tag hyphen-name="value"></tag>

When parserHook() is called for <tag>, the $args parameter will be an empty array because 'hyphen-name' is rejected.

To my knowledge, attribute names must start with [a-zA-Z] followed by zero or more from the set [a-zA-Z0-9._-] (is this accurate?). Currently, only attributes of 1 or more [a-zA-Z0-9] are accepted. This is also a minor problem, because attribute names cannot start with a number.
Comment 1 Garrett 2009-01-14 23:59:31 UTC
The following patch to Sanitizer.php might resolve this issue.


Index: Sanitizer.php
===================================================================
--- Sanitizer.php	(revision 45716)
+++ Sanitizer.php	(working copy)
@@ -40,10 +40,11 @@
  * Allows some... latitude.
  * Used in Sanitizer::fixTagAttributes and Sanitizer::decodeTagAttributes
  */
-$attrib = '[A-Za-z0-9]';
+$attrib_first = '[A-Za-z]';
+$attrib = '[A-Za-z0-9._-]';
 $space = '[\x09\x0a\x0d\x20]';
 define( 'MW_ATTRIBS_REGEX',
-	"/(?:^|$space)($attrib+)
+	"/(?:^|$space)({$attrib_first}{$attrib}*)
 	  ($space*=$space*
 		(?:
 		 # The attribute value: quoted or alone
Comment 2 Brion Vibber 2009-01-15 00:06:54 UTC
On a quick peek at the XML 1.0 spec, I believe attributes are limited to the 'Name' type:

http://www.w3.org/TR/REC-xml/#NT-Name

NameStartChar	   ::=   	":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

NameChar	   ::=   	NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Name	   ::=   	NameStartChar (NameChar)*
Comment 3 Garrett 2009-01-15 00:12:53 UTC
Thanks for the quick reply, it looks like indeed more attribute names are valid than are accepted by sanitizer. Should this be rectified?
Comment 4 Brion Vibber 2009-01-15 17:45:04 UTC
Probably yes. :)

While the original approximation is sufficient for *HTML* sanitization, going ahead and accepting a larger set could be nice for extensions which might want to be more flexible.
Comment 5 Garrett 2009-01-15 19:33:24 UTC
Here is an example of how an extension could benefit from accepting a larger set of attributes:

The extension FBConnect (http://www.mediawiki.org/wiki/Extension:FBConnect) enables wiki users to insert XFBML, an extension to XHTML designed by Facebook, by reconstructing the tags and attributes as they are received from the Sanitizer. With the current narrower set [A-Za-z0-9], attributes like facebook-logo (http://wiki.developers.facebook.com/index.php/Fb:profile-pic#Attributes) are dropped from the attributes array in the parser hook.

To fix this issue, the following code

$attrib = '[A-Za-z0-9]';
$space = '[\x09\x0a\x0d\x20]';
define( 'MW_ATTRIBS_REGEX',
       "/(?:^|$space)($attrib+)
...

should be modified to be

$attrib_first = '[:A-Z_a-z]';
$attrib = '[:A-Z_a-z-.0-9]';
$space = '[\x09\x0a\x0d\x20]';
define( 'MW_ATTRIBS_REGEX',
	"/(?:^|$space)({$attrib_first}{$attrib}*)
...

I've attached this difference in patch form.
Comment 6 Garrett 2009-01-15 19:35:18 UTC
Created attachment 5679 [details]
Sanitizer.php patch for bug 17031
Comment 7 Brion Vibber 2009-01-16 17:35:45 UTC
Looks like it'll work; be good to have some parser test cases for it though. (Would need a test extension I guess.)
Comment 8 Garrett 2009-01-16 22:03:27 UTC
Created attachment 5687 [details]
Test Extension (stest)

Test extension for bug 17031. Adds a test parser hook that simply rebuilds the original tag with the attribute pairs and inner text and passes it through htmlspecialchars.
Comment 9 Garrett 2009-01-16 22:36:24 UTC
Some test cases:

<stest facebook-logo="true" a:b="c" z_-.Z="val" A:B-c.1_2:3="val" _1="2">These attribs should be passed through</stest>

<stest -a="no" .b="no" 0c="no" 9d="no" don't="no" a=b="c" a%="no" hi"="no" a$="no" a@="no" ^a="no" a*="no" a(b)="no">Denied</stest>

Output:

<stest facebook-logo="true" a:b="c" z_-.z="val" a:b-c.1_2:3="val" _1="2">These attribs should be passed through</stest>

<stest>Denied</stest>
Comment 10 Siebrand Mazeland 2009-02-02 16:43:15 UTC
Assign to brion for review.
Comment 11 Garrett 2010-03-23 01:29:24 UTC
Any word on this? I noticed today that Sanitizer still contains this bug.
Comment 12 Garrett 2010-08-11 01:39:42 UTC
*** Bug 24749 has been marked as a duplicate of this bug. ***
Comment 13 Sean Colombo 2010-08-11 07:08:58 UTC
Created attachment 7619 [details]
Maintenance-script test-driver for parser test.

Script based on Garrett's test-extension which runs several test-cases.

This was run both before and after the fix was applied.
Comment 14 Sean Colombo 2010-08-11 07:12:48 UTC
Created attachment 7620 [details]
Output of the parser testing.

This output shows exactly what we would expect (assuming that we expected the fix to work ;) ).
Comment 15 Sean Colombo 2010-08-11 07:27:18 UTC
Tests went swimmingly, so I committed Garret's changes.

Fixed in r70849 and r70850.

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/70849
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/70850
Comment 16 Garrett 2010-08-11 20:00:12 UTC
Finally lol. you guys rock!

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links