Last modified: 2014-04-03 10:10:47 UTC
With a fix for bug 5309, such as the one discussed at <https://gerrit.wikimedia.org/r/121255/>, it’s entirely possible that a user might get a CAPTCHA with illegible diacritics. Diacritics in Latin alphabets can look identical to one another when distorted, for example i í ì ỉ, or ó ơ. For better usability, ConfirmEdit should display a CAPTCHA containing diacritics but require the user to enter the characters without diacritics. There’s a third-party module called Unidecode that does a decent job of accent folding. One tradeoff would be that such CAPTCHAs might be easier for a bot to crack. There’s also the issue that a character like Ê might be considered a base letter in one language (as in Vietnamese) but a letter with a diacritic in another (Portuguese).
I'm not sure about the "only" part: for usability it's better if the system is completely agnostic to such details; otherwise I might enter all the diacritics correctly and still have my solution rejected for no reason. When implementing this we're probably going to use some standard Unicode solution for case folding and diacritic/accent folding.
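To make the "agnostic" comparison concrete, here is a minimal sketch in Python, assuming canonical decomposition (NFD) plus removal of combining marks is an acceptable stand-in for accent folding. It is not ConfirmEdit code, and the names `fold` and `answers_match` are made up for illustration. Both the displayed text and the typed answer are folded, so an answer is accepted with or without diacritics and in either case.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip combining marks and case-fold (hypothetical helper)."""
    # NFD splits a precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks; drop those.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


def answers_match(displayed: str, typed: str) -> bool:
    """Compare the CAPTCHA text and the user's answer after folding both."""
    return fold(displayed) == fold(typed)
```

One limitation of this approach: letters such as 'đ' have no canonical decomposition, so they pass through unchanged and would need a transliteration table (or something like Unidecode) on top.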
Yes, this is absolutely necessary. Not only might the diacritics be illegible, but some users may also not have a keyboard that can enter them. I am not sure how to implement the folding, and it may even be language-dependent. For example, users may enter 'ö' as 'o' or as 'oe', or 'đ' as 'đ', 'ð', 'd' or 'dj'. One possibility is simply to avoid words with diacritics, which should be feasible for most languages. In the future, when non-Latin CAPTCHAs are implemented, the same should apply to other alphabets (e.g. it should be possible to enter a Cyrillic CAPTCHA in the Latin alphabet).
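If several spellings of one letter have to be accepted, one option is to build a pattern that allows any of them at each position. The following Python sketch is only an illustration of that idea: the `EXTRA_SPELLINGS` table is hypothetical and seeded with just the two examples above, and the function names are invented. A real table would need per-language review.

```python
import re
import unicodedata

# Hypothetical table of extra accepted spellings, seeded only with the
# examples from this comment; not an exhaustive or authoritative mapping.
EXTRA_SPELLINGS = {
    "ö": ("o", "oe"),
    "đ": ("d", "dj", "ð"),
}


def strip_marks(ch: str) -> str:
    """Return ch without combining marks (unchanged if it has none)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accepted_pattern(displayed: str) -> re.Pattern:
    """Build a regex accepting every allowed spelling of each character."""
    parts = []
    for ch in unicodedata.normalize("NFC", displayed.casefold()):
        # The letter itself, its mark-free form, and any listed alternatives.
        options = {ch, strip_marks(ch), *EXTRA_SPELLINGS.get(ch, ())}
        parts.append("(?:" + "|".join(re.escape(o) for o in sorted(options)) + ")")
    return re.compile("".join(parts))


def is_accepted(displayed: str, typed: str) -> bool:
    """True if the typed answer is one of the allowed spellings."""
    typed = unicodedata.normalize("NFC", typed.casefold())
    return accepted_pattern(displayed).fullmatch(typed) is not None
```

With this table, a CAPTCHA showing "Köln" would accept "köln", "koln" or "koeln". Using a regex lets the matcher backtrack when one alternative is a prefix of another ('o' versus 'oe').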