Last modified: 2013-01-26 04:33:58 UTC
User input collected via AFTv5 should be sanitized: all Javascript or potentially malicious code should be stripped before we store a new AFT record in the DB. We had three JS injection attempts in aft_article_answer so far. Even if the output in Special:ArticleFeedbackv5 is sanitized [1], we should not allow users to create new records containing malicious code. Using AbuseFilter could be a long term solution, but if we want to start surfacing the full text of posts collected via AFTv5 we need to have an immediate solution in place. This data is immediately available on the toolserver and if it's not sanitized community-developed tools will be subject to JS injection attempts.I temporarily disabled my own live comment stream on the toolserver until we have a workable solution in place. [1] http://en.wikipedia.org/wiki/Special:ArticleFeedbackv5/Research_Works_Act/68401
I think this is missing the point. We strip this sort of stuff in wikitext because you can actually use some HTML tags in wikitext, but we don't allow that in comments at all. Trying to remove malicious things is much more work and much more error-prone than just escaping your output. Recommend WONTFIX.
Agreed it's not a security issue for Wikipedia itself. However if we serve this data to third parties via DB replicas (including trusted DB like the toolserver) we're creating a major opportunity for abuse that we should be aware of (particularly as a large part of the community blindly trusts whoever runs scripts on the toolserver). If we can avoid the need of trusting developers downstream by removing obviously dangerous contents upstream I don't see why we shouldn't do it. Simply documenting that there is no user input sanitization in AFT (while we do have abuse filters in place for revisions) is not sufficient to avoid this risk in my opinion.
I agree with Roan here. User comments could contain a wide range of content. They may also be reporting script related bugs and as such I don't think anything should be stripped from a comment. The output is where escaping happens and that can be as easy as a single call to htmlspecialchars() in PHP before outputting it, that covers 99% if not 100% of problems with user input, and it doesn't destroy anything the user put in. If a toolserver user is outputting data from the database without escaping it, they should be reported to a ts-admin as soon as possible to have their account blocked until further notice (okay maybe not like that, but it would be a serious consideration whether they know what they're doing).
Then you should probably have my account blocked as I was temporarily providing tools to browse/access AFT CSV dumps ;) We should have a broader discussion on this. Much as we have abuse filters in place we should consider automatic filters for all feedback functionality which is likely to attract a high amount of abusive/malicious stuff and we simply cannot control what end users are doing with our data. Fabrice and his team are working on requirements to implement this in AFTv5. An additional (although unrelated) reason why we should filter junk at the input not the output is for analytics purposes. Scripted attempts to add malicious code (as well as any scripted spam attempts) will make the feedback data useless for analysis if no user input throttling or filtering is in place. We can make a decision to filter this data on the consumer end (in which case this should fall under the competence of the analytics team), but blocking obvious junk at the input (while separately logging blocks) would certainly help alleviate this problem.
I lowered the priority of this request and moved it to the next release, because: 1) we couldn't find an existing filter we could adapt for this; 2) we couldn't find community volunteers to write one for us; 3) we only plan to write a few filters ourselves, since this is driven by the community; 4) the severity of this issue is not as high as we had initially thought, after consulting with Roan and others. But thanks so much for bringing it up! We'll keep tracking it for the next release.
Assigning to Fabrice until we're ready to move on this.
(In reply to comment #4) > ..providing tools to browse/access AFT CSV dumps ;) > And you didn't escape the output to the browser? That doesn't only impose security issues (malicious javascript may be the least of your worries), it is also incorrect and can cause content to be incorrectly displayed. No worries, simple mistake to make. But do understand it! Per "Good security through lazyness", escaping is not related to security, meaning you must not consider it to be a security measure. It is a content measure. If a user comments that they: > would like something to be <than 500 but>100 It should be outputted like that, just like bugzilla did. And it will when you properly escape it. If you don't escape it, you'll get this: > like foo to 100 Because: > <p>like foo to <than 500="" but="">100</than></p> Omitting bogus entries is good and useful for analytical purposes. Even better to keep track of how many were bogus and how many real. But stripping characters from the comments because they are dangerous as HTML, is a whole other thing. It will corrupt the data. The problem is that you must not consider the contents HTML, they're text. Output and use them as such and there is no problem whatsoever. Unless repurposing this bug to find a way to omit bogus entires entirely (in a way more advanced than stripping something that might look like javascript), please wontfix.
Not escaping the output was an obvious mistake, this request is now obsolete as AbuseFilter has been introduced months ago to take care of random junk posted to AFT by bots. Closing it as wontfix.