Last modified: 2014-11-17 11:10:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T7309, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 5309 - Localize captcha images


Summary:	Localize captcha images

Status:	PATCH_TO_REVIEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	ConfirmEdit (CAPTCHA extension)
Version:	unspecified
Hardware:	All All

Importance:	High normal with 29 votes
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:	https://www.mediawiki.org/wiki/CAPTCHA
Whiteboard:
Keywords:	i18n, patch, patch-need-review

Duplicates:	19229
Depends on:	32695
Blocks:	63216

Reported:	2006-03-21 20:41 UTC by Minh Nguyễn
Modified:	2014-11-17 11:10 UTC
CC List:	37 users

See Also:	41675 62960 http://code.google.com/p/googlefontdirectory/issues/detail?id=297 https://bugzilla.osafoundation.org/show_bug.cgi?id=13081
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
(naive) patch to make captcha.py work with unicode (1.32 KB, patch) 2010-11-01 18:44 UTC, Bawolff (Brian Wolff)

Description Minh Nguyễn 2006-03-21 20:41:36 UTC

The captcha software should generate captchas in languages other than English at
non-English projects, depending on the locale. I've seen some generated captchas
at the Vietnamese Wikipedia that would definitely confuse Vietnamese-speakers
(can't remember the words exactly), because of things like r's and n's smooshed
up right next to each other, so it looks like an m, except to an English user
who happens to know a word that has "rn" instead. The user might have to *guess*
because the English words really don't follow Vietnamese spelling rules. We've
recently had users complaining to the sysops of not being able to read captcha
images, presumably for this reason.

An advantage to localizing the captchas would be that it might reduce the impact
of spambots at non-English projects. As far as I know, there isn't yet a
captcha-defeating bot that understands Vietnamese or Basque or Quechua.

For now, I'm only proposing localizing for most languages that use the Latin
alphabet, because requiring users to respond to a captcha in Thai or Arabic
would exclude a lot of legitimate interwiki users. And users of other scripts
tend to have the means of entering in Latin-based characters. Also, for
languages that use diacritical marks, we should generate the words with or
without the marks (not sure which) and modify
[[MediaWiki:Captcha-createaccount]], asking the user to enter in the word
without diacritical marks of any kind.

Once Latin-based alphabets are out of the way, it'd be a good idea to localize
for other writing systems as well, but provide a Latin-based alternative, per
Neil Harris' suggestion [1].

These localized captcha strings should *not* be stored in the MediaWiki:
namespace, nor anywhere easily accessible to the public, because bot writers
could easily write language-aware bots using such information. For wordlists, we
could start by using open-source lexicons, such as OpenOffice.org's [2]. We
should also contact ambassadors of non-English projects, asking them for help
compiling sufficiently long lists of their own.

[1] http://mail.wikimedia.org/pipermail/wikien-l/2006-March/042263.html
[2] http://lingucomponent.openoffice.org/spell_dic.html

Comment 1 mimouni 2006-06-20 12:43:59 UTC

If the captcha source code is written in PHP, the functions that generate
images from character strings only support ANSI encoding. That means that
Arabic characters, for example, cannot be displayed.

Comment 2 Adam Dziura 2007-04-20 21:00:17 UTC

I think that Polish users of Wikipedia want localized captcha images. It is 
better for new users.

Comment 3 Eugene N 2008-05-01 21:22:34 UTC

This would be very useful in Russian Wikipedia too. Of course the words have to be in Cyrillic alphabet.

Comment 4 Guillaume Paumier 2009-12-28 21:26:30 UTC

Removed URL since it was not relevant to this bug (probably due to a rebuilding of the archives)

Comment 5 Amir E. Aharoni 2010-11-01 14:55:19 UTC

I am surprised that it came up only now, but now there is demand for this in the Hebrew Wikipedia, too.

Comment 6 Bawolff (Brian Wolff) 2010-11-01 18:44:31 UTC

Created attachment 7775
(naive) patch to make captcha.py work with unicode

(In reply to comment #1)
> If the captcha source code is written in PHP, the functions that generate
> images from character strings only support ANSI encoding. That means that
> Arabic characters, for example, cannot be displayed.

With some very minor changes to the script this is not true. For example, I just generated a bunch of Hebrew captchas (just taking random words off the main page of [[he:]]) and some Ukrainian captchas (because it was the only non-Latin language that had a word list that's just an apt-get away).

My very minor changes included disabling the regex check that words don't match /[^a-z]/. Presumably other languages would need equivalent checks, and checks to avoid words with diacritical marks (since those would, I presume, be hard to see in captchas).

p.s. I don't know python, so the very minor changes in my example might not be the "proper python way".
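A script-independent replacement for the /[^a-z]/ check described above could look like the following. This is only a minimal sketch using Python 3's standard library, not the code from the attached patch; the function name `is_plain_word` is hypothetical. It accepts lowercase or caseless letters in any script and rejects letters that carry a diacritic. One known limitation: letters that Unicode treats as precomposed, such as Cyrillic "й", are rejected too.

```python
import unicodedata

def is_plain_word(word):
    """Return True if word consists only of lowercase/caseless letters without diacritics."""
    if not word:
        return False
    for ch in word:
        # Reject digits, punctuation, spaces and stand-alone combining marks.
        if not ch.isalpha():
            return False
        # Reject uppercase letters; caseless scripts such as Hebrew pass.
        if ch != ch.lower():
            return False
        # A letter whose canonical decomposition has more than one code
        # point carries a diacritic (for example "í" is "i" + U+0301).
        if len(unicodedata.normalize("NFD", ch)) > 1:
            return False
    return True
```

A word list would then be filtered with something like `[w for w in words if is_plain_word(w)]` instead of the ASCII-only regex.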

Comment 7 Bawolff (Brian Wolff) 2010-11-01 18:53:55 UTC

*** Bug 19229 has been marked as a duplicate of this bug. ***

Comment 8 Bergi 2011-11-06 18:28:56 UTC

(In reply to comment #0)
> For now, I'm only proposing localizing for most languages that use the Latin
> alphabet, because requiring users to respond to a captcha in Thai or Arabic
> would exclude a lot of legitimate interwiki users.

We also could use the uselang attribute (and user setting) instead of locale, then this wouldn't be a problem. But I guess the bigger problem then is to find a captcha generator for exotic alphabets.

Comment 9 l3o 2011-12-12 19:00:52 UTC

I have an idea how this problem could be solved:

MediaWiki should have a default stock of words, in case the wiki doesn't contain enough words (e.g. 500 words).
Then, every time a captcha has to be displayed, a script fetches a random article and picks two random words from it. These words will be in the target language, because they come from articles in the language the user wants.
Then a script would place those two words onto an image, make them a bit unreadable and display them to the user.
The user would now have the task of solving the captcha.

But there are some problems:
- As mentioned: if the wiki does not have enough words, it can't create really random captchas. So a default stock of words should probably be included, but this could be a design problem.
- There would also be a problem with non-Unicode characters. It should probably be coded from scratch, instead of merging five million totally different existing solutions.
- For big pages this could possibly be a performance problem.
- And the biggest problem: it would take some time to create all this new code. Also, I don't know if that would really be better than the existing solution.

And there would be one design issue: this would only be a good solution for big wikis, because there it would be hard to predict the words selected for the captcha, as might be possible with smaller wikis.
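The word-picking step of this proposal can be sketched roughly as follows. This is an illustration under stated assumptions, not existing ConfirmEdit code: the function name `pick_captcha_words`, the four-letter minimum and the way the article text is obtained are all hypothetical, and only the 500-word threshold comes from the comment.

```python
import random
import re

def pick_captcha_words(article_text, fallback_words, rng=random,
                       min_unique=500, count=2):
    """Pick `count` random words from the text, or from the fallback list."""
    # [^\W\d_]+ matches runs of letters in any script (Python 3 str patterns).
    words = {w.lower() for w in re.findall(r"[^\W\d_]+", article_text)
             if len(w) >= 4}
    # Use the default word stock when the text offers too few distinct words.
    pool = sorted(words) if len(words) >= min_unique else list(fallback_words)
    return rng.sample(pool, count)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible for testing; a production version would want a cryptographically stronger source such as `random.SystemRandom()`.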

Comment 10 Sumana Harihareswara 2012-05-25 03:16:15 UTC

Adding i18n keyword.

Comment 11 Everton Zanella Alvarenga 2012-05-30 15:43:15 UTC

Hi, while working for WMF for the Wikipedia Education Program, I've seen a lot of new editors, most of them students, facing a lot of difficulties while editing the CAPTCHA in English.

I think this is a very important issue for Wikipedia in other  languages. I've changed its importance to "high".

Comment 12 555 2012-06-01 02:40:11 UTC

It's a shame that even single implementations are very backlogged.

The developers team really thinks that Vector skin and a WYSIWYG editing interface will be the most relevant to help on editors retention?

Somewhere I've recently said that the language barrier was solved on Wikimedia, resting only the non-Wikipedia projects issue. But unfortunately I was very wrong.

On the bug opening, the Wikimedia paid staff was very small. Now it's a bit larger. But still no single word from any tech-guys, neither the volunteers one...

[[:m:User:555]]

Comment 13 matanya 2012-07-24 12:59:54 UTC

Where does this stand?

Comment 14 Rainer Rillke @commons.wikimedia 2012-07-26 11:30:33 UTC

(In reply to comment #12)
Yes, WYSIWYG, article feedback and MoodBar are far more important than some key-issues. The reason? Here it is: Jimmy and the remaining board and Sue are native English speakers so it isn't prioritized. It's not what they are seeing when they are editing Wikipedia. We prefer designing a nice new en.wp main page investing thousands of dollars into questionable campus ambassadors, ...

So even lots of other simple bugs will be never fixed.

Comment 15 Oliver Keyes 2012-07-26 13:56:18 UTC

I'm terribly sorry to see the delay with this :(.

Well, just to be clear, we've not designed a nice new en.wp page - that's a community decision! - and of the 10 board members, half are ESL speakers. Localisation and services to non-enlang projects are things we're focusing more and more on; we've got a dedicated internationalisation team, for example.

On the rest of your examples - I think there's some confusion here as to who does what. Localisation and bug-fixing the "core" software is divided between the internationalisation team and the "Platform" sub-department of Engineering. Things like the visual editor or the feedback tool are the responsibility of the Features Engineering team. So there isn't really one set of things being prioritised by staffers over the other, because they're each handled by different sets of people :).

A more likely issue is that, well, things get lost in Bugzilla :(. Furthermore, there are a lot more bugs than there are developer hours to deal with them - take a look at https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=365 to see what I mean. Compared to the profile of the software, we really don't have a massive engineering team overall - and that's not down to the board, that's down to our comparatively small budget organisation-wide, which they can't really do anything about.

However! if you'll look above you'll see that Sumana (our awesome Engineering Community Manager) has added the localisation keyword, which should bring this problem to the attention of the localisation team, and I'm going to do my best to make sure they're reached - either to deal with the request, provide some kind of ETA on dealing with it or, if they can't solve the issue, explain what the problem is. They're great people, and I'm confident as both a staffer and a long-term editor that this will get resolved one way or another :).

Comment 16 Al-Scandar Solstag 2012-07-26 17:58:13 UTC

Ni!

Thanks for the very informative message Oliver.

I think it is good for us to get really upset when a bug that ***affects the
experience of every single new editor*** in many Wikipedias has had no
meaningful progress after 6 years since being reported, despite several
comments here and even face-to-face to staff members.

At the same time, it is important for us to get the fact straight about who is
responsible for what, like you described.

However, "bugs get lost" is also not a good explanation for what goes on here.
It's not even an explanation at all.

There are only 17 Mediawiki bugs with equal or more votes than this one, and
that number only grows to 25 if considering every product on this bugzilla:
https://bugzilla.wikimedia.org/buglist.cgi?votes_type=greaterthaneq&query_format=advanced&list_id=132785&votes=24&resolution=---&resolution=LATER&resolution=DUPLICATE&product=MediaWiki

Some of those 17 don't even count as they are already solved or have equivalent
functionality implemented, but some partial issue keeps them from going away.

Yet some of those are, similar to this one, also in a completely stalled state
for no good reason, despite a lot of people contributing to point out how
important they are and suggest solutions. Red interwiki links is probably my
favorite (Bug #11).

Wikimedia's tech team needs to improve how they prioritize work based
on community input.

And the board is also at fault for not requiring or developing themselves a
clear policy about that.

My impression is that they might be comfortable relying mainly on commissioned
studies of usability and participation, overlooking that most of those are
statistically questionable or based on unrealistic assumptions. Not meaning
they are not useful, they are useful and necessary, just limited. They won't
reveal the whole story by themselves, and sometimes not even the crucial facts.

So here we are, despite continuous community input, six years into a relatively
simple bug that affects every single new editor of Wikipedia in several
languages.

Thanks again Oliver for replying and looking after, and Sumana, now let us
hope the right people get to read this.

Hugs,

Ni!

Comment 17 Oliver Keyes 2012-07-26 18:07:28 UTC

This is actually one of my prime concerns; that we prioritise primarily based on "how big a deal, technically, a bug is" rather than the potential impact on the community. Bugzilla has one metric, and it's largely used for technical importance. But I'm confident the new Bugmeister, whomever they will be, can start making progress in this area :). At the moment we're without a bugmeister completely (which may go some way to explaining how even highly-voted bugs are falling through the cracks, although I appreciate this is older than the bugmeister position).

Comment 18 Nemo 2012-07-26 21:34:29 UTC

Adding bug 32695 as blocker because it might be the solution, by fetching the correct Wikisource.

Comment 19 Sumana Harihareswara 2012-07-26 22:09:17 UTC

(In reply to comment #16)
> Wikimedia's tech team needs to improve how they prioritize work based
> on community input.

Yes, the WMF absolutely does need to do better at incorporating community input into our work prioritization.  Guillaume Paumier, Rob Lanphier, and I presented a talk about this a few weeks ago: https://wikimania2012.wikimedia.org/wiki/Submissions/Transparency_and_collaboration_in_Wikimedia_engineering and I know Oliver and other folks have talked about and worked on it as well, but there's a ways to go.

On that more general topic, I strongly recommend that you join the wikitech-ambassadors mailing list https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors and bring up your concerns from comment # 16, so we can talk about them in a group that includes more community members and Foundation folks, including Guillaume and Rob.

But on this particular issue (localising CAPTCHAs) I'm cc'ing Alolita Sharma, Director of Engineering (Internationalization and R&D), and Siebrand Mazeland, Product Manager of the Localisation team, hoping for their input.

Thanks, Al-Scandar Solstag!

Comment 20 Bawolff (Brian Wolff) 2012-07-27 12:42:23 UTC

Hmm I like the idea of comment 9.

Some issues:
*Swear words - people get angsty when "fuck", etc is in their captcha. (This is probably a minor consideration)
*complex characters - Unicode characters in and of themselves are not a problem. (Some wikis have words not in their native script, but that's the minority, and can be resolved with a "request new captcha") More concerning is Diacritics. Diacritics are small, and may be hard to see when messed with by the captcha algorithm (although a native speaker might know what the word is and be able to fill in the diacritics). I'm doubtful that a captcha of ɓ b will look very different.

However, with that said, perhaps we should just do some testing to see if that's really an issue. Maybe it's less of an issue to a non-native speaker than using English captchas is.

*Actual coding - we'd need to be able to generate captchas from php, presumably in real time. Not a major issue, but requires coding efforts. (Or I suppose we could get the word list once, and generate the captchas one off with the current script)

-----
We should also evaluate the effectiveness of our captchas. The captcha program was written a while ago. Since then there's been advances in getting text out of images. Lots of third party wikis report captchas not being all that effective against spam. Perhaps our captchas aren't actually doing anything.

Comment 21 Nemo 2012-07-27 17:22:34 UTC

(In reply to comment #20)
> We should also evaluate the effectiveness of our captchas. The captcha program
> was written a while ago. Since then there's been advances in getting text out
> of images. Lots of third party wikis report captchas not being all that
> effective against spam. Perhaps our captchas aren't actually doing anything.

AFAIK it's already proven to be completely broken, see http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html (maybe while implementing the proposed new method we could also get it to use the right dictionaries).
There's quite a chance that our captchas are discouraging only good faith editors, especially non-English speaking.

Comment 22 Pau Giner 2012-08-03 06:42:32 UTC

As part of an email conversation related to this topic, I made some mockups to illustrate some captcha ideas that could be less problematic for non-English speakers, improve the general UX, and rely on images from Commons. 

* Panorama captcha: http://commons.wikimedia.org/wiki/File:Panorama-captcha-idea.png 
Based on tagging parts of a panorama picture with the appropriate word (in the UI language or Basic English words).

* 'Who is who' captcha: http://commons.wikimedia.org/wiki/File:Find-all-captcha-idea.png
Based on finding from a set of similar images the ones that fit a specific criteria (with an image describing also the criteria).

* 'Find the different' captcha: http://commons.wikimedia.org/wiki/File:Find-the-different-captcha-idea.png 
Based on finding the image that is different from a set of images.


These captchas will probably generate new problems for the technical side, require adjustments to reduce the chance of a machine to solve them, or may just be unfeasible to generate, but I wanted to provide these ideas in case anybody else may use it as a base for improve on any technical weakness they may have and make them at least as hard to solve for a machine as text-based captchas are.

A page at Mediawiki has been created to gather ideas and feedback: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA

Comment 23 Bawolff (Brian Wolff) 2012-08-03 11:50:57 UTC

As others have said on the mailing list, I fear such captchas would not only be easier for bots to solve than the current solution (once they've had a little time to adjust), but also would be harder to localize unless the number of such captcha challenges were extremely small.

Comment 24 Nikola Smolenski 2013-01-08 10:45:57 UTC

Note that some users may not have appropriate keyboard to enter the captcha in their language. Aside from captcha generation in various languages, fuzzy comparison with the answer is needed as well.
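One possible form of the fuzzy comparison mentioned here is to ignore case and diacritics when checking the answer. The following is a minimal sketch using only Python's standard library; the names `fold` and `answers_match` are hypothetical, and this is not what ConfirmEdit does. Letters that have no Unicode decomposition, such as "đ", are not folded to a base letter and would need an explicit mapping.

```python
import unicodedata

def fold(text):
    """Lowercase the text and strip combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    # unicodedata.combining() is non-zero for combining marks such as U+0301.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def answers_match(expected, given):
    """Compare a captcha answer while ignoring case and diacritics."""
    return fold(expected) == fold(given)
```

With this, a user without a Vietnamese or Czech keyboard could type "tieng" for "tiếng". The trade-off is a smaller answer space, which makes guessing slightly easier for a bot.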

Comment 25 Quim Gil 2013-04-01 20:00:41 UTC

fyi there is a proposal from the Language team for a mentored project about 

Multilingual, usable and effective captchas
http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

I have some reservations about featuring that project to Google Summer of Code or Outreach Program for Women participants, but I'm willing to be proven wrong. Reasons:

* Unclear buy-in from the community or the maintainers. The whole CAPTCHA topic is messy, with several discussion threads, a RFC, a prototype, and potential plans. We don't have a clear plan for captchas. There hasn't been enough feedback about captchas based purely on images without any text, as this project proposes.

* Bug 32695 - Review and Deploy Wikicaptcha. It is there, waiting for feedback.

* I'm not a CS anything and I could be perfectly wrong, but the project feels too ambitious for three months, both with the amount of work required and the skills needed.

With all this I see the risk of failure bigger than wished for a GSOC project, either because students will most likely lack the time/skills or because even a complete GSOC project would have a hard time ending up merged in our codebase.

Feedback welcome.

Comment 26 Bawolff (Brian Wolff) 2013-04-01 21:20:07 UTC

The key word in the gsoc proposal that I like is research. My problem with most captcha proposals is that they promote someone's pet idea without any citations to back up their theory.

This does seem to be much more research oriented than most gsoc projects.

Comment 27 Quim Gil 2013-04-02 23:33:50 UTC

Sure, research is great. But before proposing someone to do a 3 month research on this subject I would like to have confidence that this research is welcome and there is an interest from the MediaWiki / ConfirmEdit maintainers in changing the status quo.

Reading the feedback in various channels it is easier to find a disbelief on captchas as a solution altogether.

Comment 28 MZMcBride 2014-02-13 17:31:25 UTC

Possibly related: https://gerrit.wikimedia.org/r/113122

Comment 29 Aalekh Nigam 2014-03-20 20:25:03 UTC

Over the past few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014. But currently it seems that there is no primary technical mentor for the project. I therefore ask all members to consider becoming part of this project as primary technical mentor.

Comment 30 Quim Gil 2014-03-22 18:14:23 UTC

Let's move the GSoC 2014 discussion to 

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile

Comment 31 Gerrit Notification Bot 2014-03-29 08:02:46 UTC

Change 121255 had a related patch set uploaded by Nemo bis:
Make captcha.py produce images in arbitrary language

https://gerrit.wikimedia.org/r/121255

Comment 32 Nemo 2014-03-29 08:33:09 UTC

Plans for the ultimate solution are being discussed at bug 62960.

In the meanwhile, as workaround, we're testing making images in all languages with words taken from Wiktionary. For technical details please read and comment on https://gerrit.wikimedia.org/r/121255
You can see samples at https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas

In my testing the images seem rather good, mainly depending on the availability of a good font. DejaVu is a well known high quality font covering most languages and DejaVuSans-Bold seems to work well for the languages it covers: https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt

Caveats:
* We still have to handle RTL languages. Results probably don't make any sense now.
* We've not yet made the blacklist multilingual but it's not too hard, ignore the bad words if any.
* We still have to figure out how to exclude confusable words. It's not impossible, there is a Unicode library for that (but not for python perhaps). See bug 63216.
* Of 165 languages for which Amgine gave me "big" dictionaries, 20 were not in DejaVu and for 10 I used FreeSerif instead. Those are lower quality. We may end up using [[mw:ULS]] font repo with some hacks, if many languages need it; or we could just skip them: I wonder if a captcha in e.g. Gujarati or Japanese will ever make sense.
* Security fixes demanded by http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf will be in a separate patch. They're several small things that someone familiar with PIL can do easily enough in the existing code. One of them is "printing" each letter separately with some aspect variations, which may solve some problems with ligatures too.

Comment 33 Yusuke Matsubara 2014-03-30 10:16:25 UTC

(In reply to Nemo from comment #32)
> Plans for the ultimate solution are being discussed at bug 62960.
> 
> In the meanwhile, as workaround, we're testing making images in all
> languages with words taken from Wiktionary. For technical details please
> read and comment on https://gerrit.wikimedia.org/r/121255
> You can see samples at
> https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas
> 
> In my testing the images seem rather good, mainly depending on the
> availability of a good font. DejaVu is a well known high quality font
> covering most languages and DejaVuSans-Bold seems to work well for the
> languages it covers:
> https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt
> 

I think zh-* (Chinese) variants are mistakenly included. They are not claimed to be covered by the font, and many substitute squares (tofus) appear in your samples.

Comment 34 Nemo 2014-03-30 10:29:53 UTC

(In reply to Yusuke Matsubara from comment #33)
> I think zh-* (Chinese) variants are mistakenly included.

Indeed; deleted. If someone thinks a captcha in CJK locales makes sense and/or has ideas on how to support them, please share.

Comment 35 Siebrand Mazeland 2014-03-30 10:45:32 UTC

ISO code got does not make sense, tofu at https://www.dropbox.com/sh/i2af7xvn4y593gc/a6Kz0eSXZ4/captchas/got#f:image_5edb52ac_e04341dd3d25c8f8.png

Comment 36 Nemo 2014-03-30 11:01:09 UTC

(In reply to Siebrand Mazeland from comment #35)
> ISO code got does not make sense, tofu at
> https://www.dropbox.com/sh/i2af7xvn4y593gc/a6Kz0eSXZ4/captchas/got#f:image_5edb52ac_e04341dd3d25c8f8.png

Right. Maybe https://www.gnu.org/software/freefont/coverage.html lies? I'm getting more and more inclined to only use DejaVu. For the languages it doesn't support we'd need to ensure native speakers like the font (e.g. by using ULS fonts) but it's also quite hard to design image distortions that make sense with those scripts.
If you know one of the following languages please speak up!

* bn Bengali
* chr Cherokee
* gu Gujarati
* hi Hindi (Devanagari script)
* mr Marathi (Devanagari script)
* sa Sanskrit (Devanagari script)
* ml Malayalam
* si Sinhala/Sinhalese
* ta Tamil
* th Thai 1%

Missing in FreeFont too:

* am Amharic
* bo Tibetan
* ja Japanese
* km Central Khmer
* kn Kannada
* ko Korean
* my Burmese (Myanmar)
* pa Panjabi/Punjabi
* te Telugu
* ug Uyghur 87%
* ur Urdu 92%

Comment 37 Nemo 2014-03-30 11:07:56 UTC

Sorry for the double message; another idea I had is that some of those languages don't have an OCR, as Wikisource folks painfully know (for instance Malayalam). Maybe for such languages we could just disable distortions, given bots are unlikely to parse them on their own anyway.
Cf. http://finereader.abbyy.com/recognition_languages/

Comment 38 Niharika 2014-03-30 12:41:18 UTC

I went through the pictures for CAPTCHAs in Hindi. They're mostly understandable except for in a few of the images it's impossible to distinguish the character. Hindi has quite a few similar-looking characters differing just by a small line or a dot. 

For example, the middle character is not-recognizable in https://www.dropbox.com/sh/i2af7xvn4y593gc/050a6S-21C/captchas/hi#lh:null-image_76947daa_e5d5575a79755d28.png

But mostly they read just fine.

Comment 39 Mormegil 2014-03-30 13:40:11 UTC

I must say the Czech (cs) version is better than I’d expect. The only issue seems to be diacritics: especially the difference between i/í is practically indistinguishable after the distortion. For most words, you can probably tell from context, but in some cases, both versions would make correct words (e.g. https://www.dropbox.com/sh/i2af7xvn4y593gc/bSYQyGEMBH/captchas/cs#lh:null-image_6d80659d_e7c8421a61605559.png can be both “dobyti” and “dobytí”). Removing all words with “í” would probably be enough, ignoring the difference between “í” and “i” would be perfect, but I guess having some (low) nonzero expected error rate would be acceptable as well.
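The first option above can be narrowed: instead of removing every word containing "í", one could drop only words that collide with another dictionary word once diacritics are ignored, such as "dobyti" and "dobytí". The following is a rough sketch using Python's standard library; `strip_marks` and `drop_ambiguous` are hypothetical names, not part of captcha.py.

```python
import unicodedata
from collections import defaultdict

def strip_marks(word):
    """Return the word with combining marks (diacritics) removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def drop_ambiguous(words):
    """Keep only words that stay unique after diacritics are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_marks(word)].add(word)
    # A group with more than one member means two real words differ only
    # by diacritics, so neither is safe to show in a distorted image.
    return [w for w in words if len(groups[strip_marks(w)]) == 1]
```

Words with diacritics that have no such twin, for example "dům", stay in the list.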

Comment 40 Minh Nguyễn 2014-04-01 08:35:30 UTC

For Vietnamese, 27 of the images contain a piece of tofu instead of a second word; 2 images contain more than one piece. It’s odd, because this font clearly supports the Vietnamese half of Latin Extended Additional. The high distortion is problematic and probably unnecessary, because Vietnamese OCR is still pretty rudimentary, with little support for diacritics. As it is, though, a different font may help with many of the following legibility challenges:

ú or ủ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/v_jkVCy5Xg/captchas/vi/image_6876ca11_cc6e08a95ea5b935.png

ẽ or ế?
https://www.dropbox.com/sh/i2af7xvn4y593gc/rfc7TwizAo/captchas/vi/image_432bfc9d_d02d9707bcb0a02b.png

If I didn’t know this font used two-story a’s, I’d see ã instead of ỗ:
https://www.dropbox.com/sh/i2af7xvn4y593gc/CGUSde4hfC/captchas/vi/image_5cbb4b12_976cd14e4e332a23.png

ú or ứ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Svdiq4ZLS5/captchas/vi/image_c90d4c3d_6b4e7a877b3e79dc.png

d or đ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Ah4yviImWT/captchas/vi/image_ae9020dd_0a618ab7494104fd.png

Comment 41 Nemo 2014-04-01 10:03:43 UTC

Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the dictionary. As with Malayalam issues reported on mailing list, I'm unsure how to handle such "extraneous" "words" for all languages; though in this and the Serbian's case we could "just" check the dictionary is in the main language's script (if we know the language code...).

About vi, I was reading earlier this morning on Gentium: «version of the font with redesigned diacritics (flatter ones) to make it more suitable for use with stacking diacritics, and for languages such as Vietnamese». <http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da>
How many languages have such complex diacritics and is there some generic enough font? I doubt we can exclude words with diacritics, we'd only have 500 left out of thousands in vi's case. I'm uploading a new attempt with Arimo font, please check if it's any better.
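The script check mentioned in the first paragraph of this comment could be approximated as below. Python's standard library does not expose the Unicode Script property, so this sketch uses the first word of each character's Unicode name as a stand-in. That is a heuristic and an assumption of this example, and the names `word_scripts` and `in_script` are hypothetical.

```python
import unicodedata

def word_scripts(word):
    """Return the set of (approximate) script names used in the word."""
    scripts = set()
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch):
            continue  # stacked diacritics belong to their base letter
        name = unicodedata.name(ch, "")
        # e.g. "LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE" -> "LATIN"
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def in_script(word, script):
    """True if every base character of the word belongs to the given script."""
    return word_scripts(word) == {script}
```

A Vietnamese dictionary could then be filtered with `in_script(word, "LATIN")`, which drops entries such as 裘 and 意見. Words containing hyphens or apostrophes are also rejected by this heuristic, so a real implementation would need a small allow-list or a proper script database.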

Comment 42 Minh Nguyễn 2014-04-01 10:21:09 UTC

(In reply to Nemo from comment #41)
> Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the
> dictionary. As with Malayalam issues reported on mailing list, I'm unsure
> how to handle such "extraneous" "words" for all languages; though in this
> and the Serbian's case we could "just" check the dictionary is in the main
> language's script (if we know the language code...).

Yep, that’s what’s required for Vietnamese then.

> About vi, I was reading earlier this morning on Gentium: «version of the
> font with redesigned diacritics (flatter ones) to make it more suitable for
> use with stacking diacritics, and for languages such as Vietnamese».
> <http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da>
> How many languages have such complex diacritics and is there some generic
> enough font?

Among Latin alphabets that we’ll be displaying, Vietnamese is a bit of a special case for stacking diacritics. GentiumAlt’s flatter diacritics allow it to fit Vietnamese on a standard-height line at the cost of some legibility. If anything, we need more exaggerated diacritics that can survive the distortions.

> I doubt we can exclude words with diacritics, we'd only have
> 500 left out of thousands in vi's case.

Right, the whole point of this exercise is to include the diacritics. :-)

Comment 43 Amgine 2014-04-01 13:49:36 UTC

Comments regarding sinitic captcha's in third paragraph of this revision: https://en.wiktionary.org/w/index.php?title=User_talk%3AWyang&action=historysubmit&diff=26075006&oldid=26066452

Comment 44 Nasir Khan Saikat 2014-04-02 09:03:05 UTC

Hi,
Bengali (bn) text is not displayed properly. All the conjuncts are misplaced, which is why almost none of the images represents an actual word. In some images (example: https://www.dropbox.com/sh/i2af7xvn4y593gc/7fTaoyiaSb/captchas/bn#lh:null-image_a060ec4f_d15f04bc689bb980.png) parts of the characters are missing because of the padding/border.

I am not sure whether this is a font problem. If you can tell me the name of the font, I can test it.

--
Nasir Khan Saikat

Comment 45 Minh Nguyễn 2014-04-02 09:11:52 UTC

(In reply to Nemo from comment #41)
> I'm uploading a new attempt with
> Arimo font, please check if it's any better.

Yes, it’s better. The only severe ambiguity I ran into was:

h or n? Knowing the word, it’s n, but it sure looks like h:
https://www.dropbox.com/sh/i2af7xvn4y593gc/-GesxDHeX9/captchas/vi-arimo/image_03be064f_70c0338194b8dca2.png

Another issue for Vietnamese: the  ̃ and  ̉ diacritics can look like each other when stacked over  ̂ and distorted. The southern dialect merges the two tones into  ̉, so southerners won’t always be able to rely on the words they know to resolve the ambiguity. I’ve asked the Vietnamese Wikipedia community for feedback on this issue: [[vi:Wikipedia:Thảo luận#Việt hóa các hình CAPTCHA]].

Finally, many Vietnamese Wikipedia users rely on an IME script embedded via a gadget, but gadgets are disabled at [[Special:UserLogin/signup]]. We’d need to port the (rather complex) IME to ULS to keep the signup form accessible. Otherwise, as others have mentioned on the mailing lists, there will have to be an option to fall back to an English CAPTCHA.

Comment 46 Nikola Smolenski 2014-04-02 11:19:44 UTC

Suggestion regarding Bengali and similar scripts: they do not have to be distorted as much. This is because OCR for these scripts is less developed than OCR for the Latin alphabet, and I doubt spammers will be willing to bother so much for relatively small Wikipedias. If we notice that the captchas are being bypassed, more distortion could be added.

Comment 47 Sumana Harihareswara 2014-04-02 16:21:19 UTC

Also see comments at http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/thread.html#644 about Swedish, French, Bengali, Romanian, and Catalan.

Comment 48 Željko Filipin 2014-04-03 08:47:42 UTC

Croatian (hr) is _completely_ broken. For example, 80% or so of the images are completely or partly in Cyrillic. Some older people will be able to read them, but almost nobody in Croatia will be able to enter Cyrillic text, since it is not an official script here.

Comment 49 Željko Filipin 2014-04-03 08:55:51 UTC

Serbian (sr) is strange too. I think both Latin and Cyrillic are official there, but isn't it strange to ask people to change input method (from Latin to Cyrillic, and vice versa) in the middle of a captcha, like here?

https://www.dropbox.com/sh/i2af7xvn4y593gc/ocKv1yBPuf/captchas/sr#lh:null-image_e373536e_2ecbd37b76d67185.png

The above is not the only example.

Comment 50 Željko Filipin 2014-04-03 09:08:05 UTC

Bosnian (ba) has completely Cyrillic CAPTCHAs, as far as I can see, but according to Wikipedia "Standard Bosnian uses a Latin alphabet."[1]

Željko
--
1: https://en.wikipedia.org/wiki/Bosnian_language

Comment 51 Nemo 2014-04-03 09:19:40 UTC

ba is not Bosnian https://translatewiki.net/wiki/Portal:Ba

Thanks for these comments, but we're already aware of the mixed/wrong script issues: it was the first thing people brought to our attention, so there is no need for more examples.
http://thread.gmane.org/gmane.org.wikimedia.mediawiki.i18n/846

As previously said (see comment 41), we'll rely on the ICU interface to Unicode data to remove mixed-script words and (where possible) words in secondary/wrong scripts for each language. Problems with the source dictionary (en.wiktionary.org) should be dealt with by editing said wiki.
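To illustrate the kind of filtering meant here, below is a minimal sketch in Python. It does not use ICU: as a stand-in it approximates a character's script by the first word of its Unicode character name from the standard `unicodedata` module, which is only a rough proxy for the real Unicode Script property. The function names (`rough_script`, `word_scripts`, `filter_by_script`) and the sample words are hypothetical and are not part of captcha.py.

```python
import unicodedata


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is a crude proxy for the Unicode Script property,
    which ICU exposes properly; it is used here only for illustration.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # code point without a name
        return "UNKNOWN"


def word_scripts(word):
    """Return the set of (approximate) scripts used by the letters of a word.

    Combining marks and punctuation are skipped, since str.isalpha() is
    only true for characters in the letter categories.
    """
    return {rough_script(ch) for ch in word if ch.isalpha()}


def filter_by_script(words, expected):
    """Keep only words written entirely in the expected script."""
    return [w for w in words if word_scripts(w) == {expected}]


# Hypothetical sample: Latin, Cyrillic, mixed-script and CJK entries.
candidates = ["kuća", "кућа", "kuћa", "裘"]
latin_only = filter_by_script(candidates, "LATIN")
cyrillic_only = filter_by_script(candidates, "CYRILLIC")
```

With this sample, the mixed-script entry and the CJK entry are dropped from both lists. Mapping a language code to its expected script is left out here; it would need data that this sketch does not have.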

Comment 52 Minh Nguyễn 2014-04-03 11:03:14 UTC

So far, the general sentiment from the Vietnamese Wikipedia community has been that the added difficulty of distinguishing diacritics vastly outweighs any readability improvements from using actual Vietnamese words instead of English words or random letters. Moreover, there is skepticism that the wiki even has a problem with CAPTCHA-solving bots. These are gut feelings rather than hard data, of course, but I can imagine a couple changes that would mitigate the community's concerns:

1a. Minimize or eliminate distortions in Vietnamese. High-quality OCR solutions like Google's already have enough difficulty with clear, undistorted Vietnamese text.
1b. Alternatively, strip diacritics *before* display and accept diacritic-less input. There would likely be no change in difficulty for bots, but Vietnamese users would still be able to employ their knowledge of Vietnamese spelling patterns.
2. Provide an option to solve a standard English CAPTCHA. (Not sure what the default should be.) Many websites that require CAPTCHAs offer some alternative for accessibility; Vietnamese CAPTCHAs with diacritics would be insurmountable to those with declining eyesight.
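Option 1b could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how ConfirmEdit compares answers: `strip_diacritics` and `answers_match` are hypothetical names, and the comparison rule (case-insensitive, diacritics ignored on both sides) is an assumption. The explicit mapping for đ/Đ is needed because that letter has no canonical decomposition.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # đ/Đ do not decompose into d + mark, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


def answers_match(expected, typed):
    """Hypothetical comparison: ignore diacritics and case on both sides."""
    return (
        strip_diacritics(expected).casefold()
        == strip_diacritics(typed).casefold()
    )


# Text that would be rendered in the image if diacritics were stripped
# before display.
display_text = strip_diacritics("đường")
```

Under this rule, a user who confuses the two tone marks mentioned in comment 45 (for example typing "nghỉa" for "nghĩa") would still be accepted, since both reduce to "nghia".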

Comment 53 Sorawee Porncharoenwase 2014-04-07 13:51:32 UTC

IMHO, for the Thai language, the pictures are very blurred. Although some can be guessed easily, the rest need a lot of effort. In some cases, it is impossible to determine the correct word at all.

Comment 54 Sorawee Porncharoenwase 2014-04-08 15:24:44 UTC

Results from [[th:WP:HELPDESK#CAPTCHA]] from Thai Wikipedia: S: 0, O: 6, N: 0

Comments:

Nullzero: See the above comment

G(x): Too hard to read

Taweetham: (1) Too hard to read (2) Contains swear words (3) Not convenient for interwiki users (4) The Thai language is complex; he doesn't know whether the software will generate words that are impossible to enter

BlackKoro: Unable to read

Lerdsuwa: Can't distinguish between "ท" and "ห", "ล" and "ส"

Aristitleism: (1) Very hard to read (2) Contains swear words (3) Contains some obsolete characters that no one uses anymore, such as "ฦ". These obsolete characters are also hard to find on a Thai keyboard.
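Two of the complaints above (swear words and obsolete characters) could in principle be handled when the word list is prepared. The following Python sketch shows one possible filter; the names `is_acceptable` and `clean_wordlist`, the sample words, and the placeholder blocklist entry are all hypothetical. Only "ฦ" is taken from the feedback above; which other characters to exclude would be for the Thai community to decide.

```python
def is_acceptable(word, excluded_chars, blocklist):
    """Reject blocklisted words and words containing an excluded character."""
    if word in blocklist:
        return False
    return not any(ch in excluded_chars for ch in word)


def clean_wordlist(words, excluded_chars, blocklist):
    """Return the words that pass is_acceptable, in their original order."""
    return [
        w for w in words if is_acceptable(w, excluded_chars, blocklist)
    ]


# Hypothetical inputs: "ฦ" comes from the feedback above; the blocklist
# entry is a harmless placeholder standing in for real swear words.
excluded = {"ฦ"}
blocked = {"คำหยาบ"}
sample = ["บ้าน", "ฦๅ", "คำหยาบ"]
kept = clean_wordlist(sample, excluded, blocked)
```

This does nothing for the readability complaints, which concern the font and the amount of distortion rather than the word list.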

Comment 55 Bryce Glover 2014-05-02 18:45:39 UTC

(In reply to Minh Nguyễn from comment #52)
> Moreover, there is skepticism that the wiki even has a problem with
> CAPTCHA-solving bots. These are gut feelings rather than hard data, of
> course, but I can imagine a couple changes that would mitigate the
> community's concerns:
> 
> …

Could some hard data be found on whether or not the Vietnamese Wikipedia has ever had any problems with CAPTCHA-solving bots?

