Last modified: 2014-10-16 11:32:38 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T73386, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 71386 - the Sanitizer allows only ASCII and some punctuation in extension tag attributes


Summary:	the Sanitizer allows only ASCII and some punctuation in extension tag attri...

Status:	NEW

Product:	MediaWiki
Classification:	Unclassified
Component:	Parser
Version:	unspecified
Hardware:	All All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:	i18n

Depends on:
Blocks:	28980

Reported:	2014-09-27 17:54 UTC by Amir E. Aharoni
Modified:	2014-10-16 11:32 UTC
CC List:	4 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Amir E. Aharoni 2014-09-27 17:54:55 UTC

I took a stab at resolving Bug 28980 and adding a non-ASCII tag name.

This was surprisingly easy until I tried to make the tag's parameters non-ASCII as well. Apparently, the Sanitizer only allows ASCII Latin letters, digits and a bit of punctuation, but no Unicode characters.

Would it be disastrous to add support for non-ASCII characters?
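
The symptom described above can be reproduced outside MediaWiki with a small Python sketch. The pattern below is only an assumed approximation of an ASCII-only attribute-name rule (ASCII letters, digits and a few punctuation characters); it is not copied from the Sanitizer, and `parse_attributes` is a hypothetical helper made up for this illustration.

```python
import re

# Assumption: an ASCII-only attribute-name rule, similar in spirit to what the
# report describes. The real pattern in MediaWiki's Sanitizer may differ.
ASCII_ATTR_NAME = r"[:A-Z_a-z][:A-Z_a-z0-9.\-]*"

# A name="value" pair whose name starts at the beginning of the text or
# right after whitespace.
ATTR_PAIR = re.compile(r'(?<!\S)(' + ASCII_ATTR_NAME + r')\s*=\s*"([^"]*)"')


def parse_attributes(text):
    """Return a dict of the name="value" pairs whose names match the rule."""
    return dict(ATTR_PAIR.findall(text))


# The ASCII-named attribute survives; the Hebrew-named one is silently dropped.
print(parse_attributes('lang="he" שפה="עברית"'))
```

With a rule like this, a non-ASCII attribute name does not raise an error; it simply never matches, which would explain attributes vanishing without a trace.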

Comment 1 Bawolff (Brian Wolff) 2014-09-28 17:15:41 UTC

Hmm, as far as I can tell, even XML allows them (and even if it didn't, I'm not sure that we necessarily should require our XML-like tags to be conformant to XML). For reference, the relevant code in MW land is Sanitizer::getAttribsRegex()

From http://www.w3.org/TR/REC-xml/#NT-Name :

[41]  Attribute      ::=  Name Eq AttValue

[4]   NameStartChar  ::=  ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]  NameChar       ::=  NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name           ::=  NameStartChar (NameChar)*


Which presumably is enough of Unicode for your purposes. It has some weird exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random to exclude, but we don't need them. It also excludes a whole bunch of combining accents, but is OK with precomposed forms (which we normalize to anyway, although a couple of obscure things that don't have precomposed forms may be excluded).
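
To make the quoted productions concrete, here is a minimal Python sketch that transcribes NameStartChar, NameChar and Name from the XML grammar above into a regular expression. The names `XML_NAME` and `is_xml_name` are made up for this illustration, and nothing here is taken from MediaWiki's Sanitizer.

```python
import re

# Production [4] NameStartChar, written as the inside of a character class.
NAME_START_CHAR = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# Production [4a] NameChar: NameStartChar plus "-", ".", digits, #xB7,
# the combining diacritics block and #x203F-#x2040.
NAME_CHAR = NAME_START_CHAR + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# Production [5] Name: one NameStartChar followed by any number of NameChar.
XML_NAME = re.compile(f"[{NAME_START_CHAR}][{NAME_CHAR}]*")


def is_xml_name(s):
    """Return True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(s) is not None


for candidate in ["name", "שם", "÷", "×", "⭐", "1abc"]:
    print(repr(candidate), is_xml_name(candidate))
```

Hebrew letters fall inside [#x37F-#x1FFF], so a Hebrew attribute name passes, while ÷ (#xF7) and × (#xD7) sit exactly in the gaps between the Latin-1 ranges.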

Comment 2 Amir E. Aharoni 2014-09-28 17:21:04 UTC

(In reply to Bawolff (Brian Wolff) from comment #1)
> Hmm, as far as I can tell, even XML allows them (and even if it didn't, I'm
> not sure that we necessarily should require our XML-like tags to be
> conformant to XML). For reference, the relevant code in MW land is
> Sanitizer::getAttribsRegex()

Yes, that's precisely where I found it while debugging core, trying to understand where on Earth my perfectly good Hebrew attribute names were disappearing :)

> 
> From http://www.w3.org/TR/REC-xml/#NT-Name :
> 
> [41]   	Attribute	   ::=   	 Name  Eq  AttValue  
> 
> [4]   	NameStartChar	   ::=   	":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
> [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
> [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> [4a]   	NameChar	   ::=   	NameStartChar | "-" | "." | [0-9] | #xB7 |
> [#x0300-#x036F] | [#x203F-#x2040]
> [5]   	Name	   ::=   	NameStartChar (NameChar)*
> 
> 
> Which presumably is enough of Unicode for your purposes. It has some weird
> exclusions, such as ÷, ×, ⬀, ⭐, ∀, ✀, which seem kind of random to exclude,
> but we don't need them. It also excludes a whole bunch of combining
> accents, but is OK with precomposed forms (which we normalize to anyway,
> although a couple of obscure things that don't have precomposed forms may
> be excluded).

Yes, sounds kinda OK, unless people do very funky things with accents :)
I don't need more than simple letters from languages. Adding all Unicode letter ranges would be OK for my purposes.
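
As a rough illustration of the "all Unicode letters" idea, the following Python sketch accepts any character whose Unicode general category is a letter (L*), plus the ASCII digits and punctuation mentioned in the report. `is_attr_name` and `EXTRA_CHARS` are hypothetical names for this example only; this is a sketch of the proposal, not what MediaWiki does.

```python
import unicodedata

# Assumption: non-letter characters allowed after the first position.
EXTRA_CHARS = set(":_-.0123456789")


def is_letter(ch):
    """True for any character in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_attr_name(name):
    """Accept a letter, ':' or '_' first, then letters or EXTRA_CHARS."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if not (is_letter(first) or first in ":_"):
        return False
    return all(is_letter(ch) or ch in EXTRA_CHARS for ch in rest)


decomposed = "e\u0301"  # "e" followed by a combining acute accent
print(is_attr_name("שם"))
print(is_attr_name(decomposed))
print(is_attr_name(unicodedata.normalize("NFC", decomposed)))
```

A letters-only rule rejects combining marks, so a decomposed "é" fails until it is normalized to its precomposed form. Scripts that depend on combining marks (Devanagari vowel signs, for example) would need the mark categories (M*) allowed as well.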

So again, does anybody think that this can get us into any trouble?

Comment 3 Andre Klapper 2014-10-16 11:32:38 UTC

(In reply to Amir E. Aharoni from comment #2)

> So again, does anybody think that this can get us into any trouble?
