Last modified: 2011-01-25 01:19:57 UTC
Sanitizer will reject attributes found in tags as invalid even though they are technically valid. Example: <tag hyphen-name="value"></tag> When parserHook() is called for <tag>, the $args parameter will be an empty array because 'hyphen-name' is rejected. To my knowledge, attribute names must start with [a-zA-Z] followed by zero or more from the set [a-zA-Z0-9._-] (is this accurate?). Currently, only attributes of 1 or more [a-zA-Z0-9] are accepted. This is also a minor problem, because attribute names cannot start with a number.
The following patch to Sanitizer.php might resolve this issue. Index: Sanitizer.php =================================================================== --- Sanitizer.php (revision 45716) +++ Sanitizer.php (working copy) @@ -40,10 +40,11 @@ * Allows some... latitude. * Used in Sanitizer::fixTagAttributes and Sanitizer::decodeTagAttributes */ -$attrib = '[A-Za-z0-9]'; +$attrib_first = '[A-Za-z]'; +$attrib = '[A-Za-z0-9._-]'; $space = '[\x09\x0a\x0d\x20]'; define( 'MW_ATTRIBS_REGEX', - "/(?:^|$space)($attrib+) + "/(?:^|$space)({$attrib_first}{$attrib}*) ($space*=$space* (?: # The attribute value: quoted or alone
On a quick peek at the XML 1.0 spec, I believe attributes are limited to the 'Name' type: http://www.w3.org/TR/REC-xml/#NT-Name NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] Name ::= NameStartChar (NameChar)*
Thanks for the quick reply, it looks like indeed more attribute names are valid than are accepted by sanitizer. Should this be rectified?
Probably yes. :) While the original approximation is sufficient for *HTML* sanitization, going ahead and accepting a larger set could be nice for extensions which might want to be more flexible.
Here is an example of how an extension could benefit from accepting a larger set of attributes: The extension FBConnect (http://www.mediawiki.org/wiki/Extension:FBConnect) enables wiki users to insert XFBML, an extension to XHTML designed by Facebook, by reconstructing the tags and attributes as they are received from the Sanitizer. With the current narrower set [A-Za-z0-9], attributes like facebook-logo (http://wiki.developers.facebook.com/index.php/Fb:profile-pic#Attributes) are dropped from the attributes array in the parser hook. To fix this issue, the following code $attrib = '[A-Za-z0-9]'; $space = '[\x09\x0a\x0d\x20]'; define( 'MW_ATTRIBS_REGEX', "/(?:^|$space)($attrib+) ... should be modified to be $attrib_first = '[:A-Z_a-z]'; $attrib = '[:A-Z_a-z-.0-9]'; $space = '[\x09\x0a\x0d\x20]'; define( 'MW_ATTRIBS_REGEX', "/(?:^|$space)({$attrib_first}{$attrib}*) ... I've attached this difference in patch form.
Created attachment 5679 [details] Sanitizer.php patch for bug 17031
Looks like it'll work; be good to have some parser test cases for it though. (Would need a test extension I guess.)
Created attachment 5687 [details] Test Extension (stest) Test extension for bug 17031. Adds a test parser hook that simply rebuilds the original tag with the attribute pairs and inner text and passes it through htmlspecialchars.
Some test cases: <stest facebook-logo="true" a:b="c" z_-.Z="val" A:B-c.1_2:3="val" _1="2">These attribs should be passed through</stest> <stest -a="no" .b="no" 0c="no" 9d="no" don't="no" a=b="c" a%="no" hi"="no" a$="no" a@="no" ^a="no" a*="no" a(b)="no">Denied</stest> Output: <stest facebook-logo="true" a:b="c" z_-.z="val" a:b-c.1_2:3="val" _1="2">These attribs should be passed through</stest> <stest>Denied</stest>
Assign to brion for review.
Any word on this? I noticed today that Sanitizer still contains this bug.
*** Bug 24749 has been marked as a duplicate of this bug. ***
Created attachment 7619 [details] Maintenance-script test-driver for parser test. Script based on Garrett's test-extension which runs several test-cases. This was run both before and after the fix was applied.
Created attachment 7620 [details] Output of the parser testing. This output shows exactly what we would expect (assuming that we expected the fix to work ;) ).
Tests went swimmingly, so I committed Garret's changes. Fixed in r70849 and r70850. http://www.mediawiki.org/wiki/Special:Code/MediaWiki/70849 http://www.mediawiki.org/wiki/Special:Code/MediaWiki/70850
Finally lol. you guys rock!