Last modified: 2014-10-16 11:32:38 UTC

Bug 71386 - the Sanitizer allows only ASCII and some punctuation in extension tag attributes
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: i18n
Depends on:
Blocks: 28980
Reported: 2014-09-27 17:54 UTC by Amir E. Aharoni
Modified: 2014-10-16 11:32 UTC

Description Amir E. Aharoni 2014-09-27 17:54:55 UTC
I took a stab at resolving Bug 28980 and adding a non-ASCII tag name.

This was surprisingly easy until I tried to make the tag's parameters non-ASCII as well. Apparently, the Sanitizer only allows ASCII Latin letters, digits and a bit of punctuation in attribute names, but no non-ASCII characters.

Would it be disastrous to add support for non-ASCII characters?
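
To make the symptom concrete, here is a minimal Python sketch (not MediaWiki's PHP code). The names `ASCII_ATTR` and `parse_attribs` are hypothetical, and the character class is only an assumption that approximates "ASCII letters, digits and a bit of punctuation"; it is not the regex the Sanitizer actually uses.

```python
import re

# Hypothetical ASCII-only attribute pattern (an assumption for illustration,
# not the real Sanitizer regex): name = "value"
ASCII_ATTR = re.compile(r'([A-Za-z0-9:_.\-]+)\s*=\s*"([^"]*)"')

def parse_attribs(text):
    """Return the name/value pairs that the ASCII-only pattern recognises."""
    return dict(ASCII_ATTR.findall(text))

# The Hebrew-named attribute never matches, so it silently disappears.
print(parse_attribs('lang="he" שם="ערך"'))  # prints {'lang': 'he'}
```

With a pattern like this nothing is rejected loudly; the non-ASCII attribute is simply never matched, which would be consistent with attribute names vanishing without an error.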
Comment 1 Bawolff (Brian Wolff) 2014-09-28 17:15:41 UTC
Hmm, as far as I can tell, even XML allows them (and even if it didn't, I'm not sure that we necessarily should require our XML-like tags to be conformant to XML). For reference, the relevant code in MW land is Sanitizer::getAttribsRegex()

From http://www.w3.org/TR/REC-xml/#NT-Name :

[41]  Attribute      ::=  Name Eq AttValue

[4]   NameStartChar  ::=  ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]  NameChar       ::=  NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name           ::=  NameStartChar (NameChar)*


Which presumably is enough of Unicode for your purposes (although it has some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random to exclude, but we don't need them). It also excludes a whole bunch of combining accents, but is OK with precomposed forms (which we normalize to anyway, but a couple of obscure things that don't have precomposed forms may be excluded).
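
For anyone who wants to poke at the productions quoted above, here is a rough Python sketch that transcribes [4], [4a] and [5] into a regular expression. The names `NAME_START`, `NAME_CHAR`, `XML_NAME` and `is_xml_name` are made up for this illustration; this is not what Sanitizer::getAttribsRegex() does, only a way to test which characters the XML `Name` production admits.

```python
import re

# Ranges transcribed from production [4] (NameStartChar).
NAME_START = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Production [4a] (NameChar) adds "-", ".", digits and a few more ranges.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# Production [5]: Name ::= NameStartChar (NameChar)*
XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")

def is_xml_name(s):
    """True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(s) is not None

print(is_xml_name("שם"))   # Hebrew letters fall inside #x37F-#x1FFF
print(is_xml_name("×"))    # U+00D7 sits in the gap between #xD6 and #xD8
```

Under this transcription Hebrew names are accepted, while ×, ÷ and ∀ are rejected because they fall into the gaps between the listed ranges.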
Comment 2 Amir E. Aharoni 2014-09-28 17:21:04 UTC
(In reply to Bawolff (Brian Wolff) from comment #1)
> Hmm, as far as I can tell, even XML allows them (and even if it didn't, I'm
> not sure that we necessarily should require our XML-like tags to be
> conformant to XML). For reference, the relevant code in MW land is
> Sanitizer::getAttribsRegex()

Yes, that's precisely where I found it while debugging core, trying to understand where on Earth my perfectly good Hebrew attribute names disappear :)

> 
> From http://www.w3.org/TR/REC-xml/#NT-Name :
> 
> [41]   	Attribute	   ::=   	 Name  Eq  AttValue  
> 
> [4]   	NameStartChar	   ::=   	":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> [4a]   	NameChar	   ::=   	NameStartChar | "-" | "." | [0-9] | #xB7 |
> [#x0300-#x036F] | [#x203F-#x2040]
> [5]   	Name	   ::=   	NameStartChar (NameChar)*
> 
> 
> Which presumably is enough of Unicode for your purposes (although it has
> some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random
> to exclude, but we don't need them). It also excludes a whole bunch of
> combining accents, but is OK with precomposed forms (which we normalize to
> anyway, but a couple of obscure things that don't have precomposed forms
> may be excluded).

Yes, sounds kinda OK, unless people do very funky things with accents :)
I don't need more than simple letters from languages. Adding all Unicode letter ranges would be OK for my purposes.
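
As a rough sketch of what "all Unicode letters" could mean in practice, the Python snippet below checks names by Unicode general category instead of by hand-written ranges. Everything here is an assumption for illustration: the helper `is_attr_name`, the choice to allow marks and decimal digits after the first character, and the extra punctuation sets are not taken from MediaWiki.

```python
import unicodedata

# Assumed extra punctuation, loosely mirroring the XML Name rules.
EXTRA_START = set("_:")
EXTRA_REST = set("_:-.")

def _is_letter(ch):
    # General categories Lu, Ll, Lt, Lm, Lo all start with "L".
    return unicodedata.category(ch).startswith("L")

def _is_name_char(ch):
    cat = unicodedata.category(ch)
    # Letters, combining marks (M*), decimal digits, or allowed punctuation.
    return cat[0] in ("L", "M") or cat == "Nd" or ch in EXTRA_REST

def is_attr_name(name):
    """True if name starts with a letter (or _ :) and continues with name chars."""
    if not name:
        return False
    first = name[0]
    if not (_is_letter(first) or first in EXTRA_START):
        return False
    return all(_is_name_char(ch) for ch in name[1:])

print(is_attr_name("שם"))  # plain Hebrew letters
print(is_attr_name("×"))   # a math symbol, not a letter
```

Allowing the mark categories after the first character means pointed Hebrew and other accented names that lack precomposed forms would pass as well, at the cost of a looser rule than the XML production.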

So again, does anybody think that this can get us into any trouble?
Comment 3 Andre Klapper 2014-10-16 11:32:38 UTC
(In reply to Amir E. Aharoni from comment #2)

> So again, does anybody think that this can get us into any trouble?
