Last modified: 2014-10-16 11:32:38 UTC
I took a stab at resolving Bug 28980 and adding a non-ASCII tag name. This was surprisingly easy until I tried to make the tag's parameters non-ASCII as well. Apparently, the Sanitizer only allows ASCII Latin letters, digits and a bit of punctuation, but no Unicode characters. Would it be disastrous to add support for non-ASCII characters?
Hmm, as far as I can tell, even XML allows them (and even if it didn't, I'm not sure we should necessarily require our XML-like tags to conform to XML). For reference, the relevant code in MW land is Sanitizer::getAttribsRegex().

From http://www.w3.org/TR/REC-xml/#NT-Name :

[41] Attribute ::= Name Eq AttValue
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

Which presumably is enough of Unicode for your purposes. It has some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random to exclude, but we don't need them. It also excludes a whole bunch of combining accents, but is OK with precomposed forms (which we normalize to anyway), so a couple of obscure things that don't have precomposed forms may be excluded.
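To make the productions above concrete, here is a minimal sketch in Python (not MediaWiki's PHP, and not the actual Sanitizer code) that turns the NameStartChar and NameChar ranges into a regex and checks a few candidate attribute names. The names NAME_START_RANGES, NAME_EXTRA_RANGES, char_class_body and is_xml_name are made up for illustration; only the code point ranges come from the XML spec quoted above.

```python
import re

# Code point ranges from XML 1.0 production [4] NameStartChar.
NAME_START_RANGES = [
    (0x3A, 0x3A),  # ":"
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # "_"
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra code points that production [4a] NameChar allows after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2D),  # "-"
    (0x2E, 0x2E),  # "."
    (0x30, 0x39),  # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def char_class_body(ranges):
    """Render (first, last) code point pairs as the inside of a regex character class."""
    parts = []
    for first, last in ranges:
        lo = "\\U{:08X}".format(first)
        hi = "\\U{:08X}".format(last)
        parts.append(lo if first == last else lo + "-" + hi)
    return "".join(parts)


_START = char_class_body(NAME_START_RANGES)
_EXTRA = char_class_body(NAME_EXTRA_RANGES)

# Production [5]: Name ::= NameStartChar (NameChar)*
NAME_RE = re.compile("[{0}][{0}{1}]*".format(_START, _EXTRA))


def is_xml_name(candidate):
    """Return True if the whole string matches the XML 1.0 Name production."""
    return NAME_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    for name in ["class", "שם", "data-שם", "1abc", "×", "a\u0301", "\u0301a"]:
        print(repr(name), is_xml_name(name))
```

Under this sketch a Hebrew name such as "שם" is accepted because the Hebrew block falls inside [#x37F-#x1FFF], while "×" (U+00D7) is rejected. A combining mark from [#x0300-#x036F] is accepted after a base letter but not as the first character.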
(In reply to Bawolff (Brian Wolff) from comment #1)
> Hmm, as far as I can tell, even XML allows them (And even if it didn't, I'm
> not sure that we necessarily should require our xml-like tags to be
> conformant to XML). For reference, the relevant code in MW land is
> Sanitizer::getAttribsRegex()

Yes, that's precisely where I found it while debugging core, trying to understand where on Earth my perfectly good Hebrew attribute names disappear to :)

> From http://www.w3.org/TR/REC-xml/#NT-Name :
>
> [41] Attribute ::= Name Eq AttValue
>
> [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
> [#x0300-#x036F] | [#x203F-#x2040]
> [5] Name ::= NameStartChar (NameChar)*
>
> Which presumably is enough of unicode for your purposes (Although it has
> some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random
> to exclude, but we don't need them. It also excludes a whole bunch of
> combining accents, but is ok with precomposed forms (Which we normalize to
> anyways, but a couple of obscure things that don't have pre-composed forms
> may be excluded).

Yes, sounds kinda OK, unless people do very funky things with accents :)

I don't need more than simple letters from languages. Adding all Unicode letter ranges would be OK for my purposes.

So again, does anybody think that this can get us into any trouble?
(In reply to Amir E. Aharoni from comment #2)
> So again, does anybody think that this can get us into any trouble?