Last modified: 2014-03-22 18:13:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T34695, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 32695 - Review and Deploy Wikicaptcha
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: ConfirmEdit (CAPTCHA extension)
Version: unspecified
Hardware: All All
Importance: Low enhancement with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: https://wikimania2012.wikimedia.org/w...
Keywords: design
Depends on:
Blocks: 5309 31235 Wikisource 38640
Reported: 2011-11-28 22:28 UTC by Sumana Harihareswara
Modified: 2014-03-22 18:13 UTC (History)
CC List: 18 users

Description Sumana Harihareswara 2011-11-28 22:28:34 UTC
Idea: Write a version of reCAPTCHA (for use by ConfirmEdit) that uses document images that have been processed by MediaWiki's ProofreadPage extension for Wikisource. In other words, a CAPTCHA that feeds data to ProofreadPage to augment its OCR processing. Some existing code to build on: http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/thread.html#56121 (Neil Harris & ConfirmEdit)
Comment 1 Nemo 2011-11-29 18:37:29 UTC
This has been discussed a few times and a proof of concept was produced: http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.html
If I remember correctly, starting from a properly mapped DjVu it's not so difficult to identify the words which need to be checked, extract the corresponding (portion of) image and put the new text back in the DjVu.
It's way less obvious how to translate the activity on a Page: to the corresponding DjVu page and vice versa.
Comment 2 Sumana Harihareswara 2012-08-30 18:31:51 UTC
Alex, is wikicaptcha, in its current form, ready for a deployment review? Or is it still in an experimental/prototype phase? It would probably be good to clarify that in the README (https://github.com/CristianCantoro/wikicaptcha).

Am cc'ing Andrea Zanni (Aubrey).

Thanks for working on this!
Comment 3 Sumana Harihareswara 2012-09-17 03:13:57 UTC
Alex, it looks like WikiCAPTCHA awaits a design review https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we can move forward with deploying it on Wikimedia sites.  Just wanted to let you know.  Thanks.
Comment 4 Quim Gil 2013-03-25 00:49:33 UTC
This is a very nice idea! What is the status? Would a Google Summer of Code project help get a MediaWiki extension running and polished, ready to be used on any MediaWiki-enabled site?

Another question is whether this extension would be put to use on Wikimedia sites.

If the idea makes sense and there is at least one mentor available I would like to push it as a candidate to 

http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects

and move it to https://www.mediawiki.org/wiki/Summer_of_Code_2013#Project_ideas
Comment 5 Bawolff (Brian Wolff) 2013-04-01 21:09:57 UTC
(In reply to comment #3)
> Alex, it looks like WikiCAPTCHA awaits a design review
> https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we
> can move forward with deploying it on Wikimedia sites. Just wanted to let you
> know. Thanks.

The code looks to be an early prototype. I only did a five-minute read-through, but it looks to be a proof of concept, not a feature-complete implementation.

Open questions about this whole idea:
*How would data propagate back to Wikisource?
*Is this even effective as a captcha?
**The dataset used to generate the images is publicly available. It is unclear whether the dataset is large enough to keep someone from simply downloading the entire thing.
**An attacker could add entries to the dataset. I'm not sure how exploitable that is, but it is concerning.
**It's unclear this will actually prevent spam. Computers do not get bored. Even with 1% getting through, it would not be effective. This is using text that OCR software marked as low confidence, which sounds significantly weaker than what reCAPTCHA does according to Wikipedia, and I've heard rumours that reCAPTCHA is not entirely effective (not sure if this is true).
Comment 6 Nikola Smolenski 2013-06-23 11:38:08 UTC
To answer the open questions:

(In reply to comment #5)
> *how would data propagate back to wikisource.

I don't see that it is practically possible to propagate data back to Wikisource. Rather, this would be used to perform initial OCR for Wikisource, perhaps primarily for works where machine-based OCR would be ineffective.

> *is this even effective as a captcha

I don't see that it would be any less effective than the current captcha.

> **the dataset used to generate the images are publicly available. It is
> unclear that the dataset is large enough that someone downloading the entire
> thing wouldn't happen.

The actual dataset used on Wikipedia doesn't need to be publicly available.

> **an attacker could add entries to the dataset. Im not sure how exploitable
> that is, but its something that is concerning

I don't see how an attacker could add entries to the dataset. The actual dataset used on Wikipedia would probably be tightly controlled.

> **its unclear this will actually prevent spam. Computers do not get bored.
> Even with 1% getting through, it would not be effective. This is using texts that

I don't see that it would be any less effective than the current captcha.
Comment 7 Nemo 2013-06-24 11:32:26 UTC
(In reply to comment #6)
> > **its unclear this will actually prevent spam. Computers do not get bored.
> > Even with 1% getting through, it would not be effective. This is using texts that
> 
> I don't see that it would be any less effective than the current captcha.

Anything below the current 25% failure rate would be an improvement, though a captcha is considered broken once the rate is over 1% (according to the paper on [[mw:CAPTCHA]]).
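
To make these percentages concrete, here is a minimal Python sketch of the arithmetic. It assumes each automated attempt gets through independently with a fixed rate, which is a simplification. The attempt counts (10,000 and 500) are made-up illustration values; only the 25% and 1% rates come from this discussion.

```python
def expected_passes(attempts: int, solve_rate: float) -> float:
    """Expected number of automated submissions that get past the captcha."""
    return attempts * solve_rate


def prob_at_least_one(attempts: int, solve_rate: float) -> float:
    """Probability that at least one of `attempts` independent tries gets through."""
    return 1.0 - (1.0 - solve_rate) ** attempts


# Hypothetical workload: a bot submitting 10,000 forms, or only 500.
for rate in (0.25, 0.01):
    print(
        f"rate={rate:.0%}",
        f"expected passes in 10,000 tries: {expected_passes(10_000, rate):.0f}",
        f"P(at least one pass in 500 tries): {prob_at_least_one(500, rate):.3f}",
    )
```

Under these assumptions, lowering the rate shrinks the volume of spam proportionally, but a bot that simply retries still gets through eventually, which is the "computers do not get bored" point from comment 5.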
Comment 8 Jared Zimmerman (WMF) 2013-06-27 21:34:49 UTC
This is a low-priority roadmap feature; the Product and Design teams would welcome community support.

Please contact me for a design review when a prototype is ready for the UX team to review.
Comment 9 Alessandro Brollo 2013-06-27 22:33:01 UTC
I'm exploring a new and, IMHO, interesting path: ignoring the DjVu text layer and instead parsing the abbyy.xml file, both to extract the bare text layer and to pull out some interesting parameters. This file (really heavy and discouraging at first glance) is published by the Internet Archive in its file download area.

The interesting thing is that this heavy file contains both the coordinates of words and a 'wordPenalty' parameter, something like an "uncertainty score" for the whole word; there is also a character-by-character "certainty score".

I'm sharing scripts with http://www.mediawiki.org/wiki/User:Rtdwivedi, who is MUCH more skilled than me, since the idea is to upload the text layer from the abbyy.xml file and to wrap uncertain words in a span tag, making them easy to fix with VisualEditor. A test output of the extraction scripts can be seen on any page of http://it.wikisource.org/wiki/Indice:Ricordi_di_Londra.djvu, where words with a wordPenalty > 0 are red; unfortunately VisualEditor doesn't currently run on Wikisource, but you can test the resulting code with VisualEditor in a Wikipedia sandbox.

I presume that similar scripts could extract lists of uncertain words and their images from the abbyy.xml file and the related scans, and feed a CAPTCHA engine.

My suggestion is to ask Rtdwivedi for comments; personally I consider myself curious, bold and sometimes lucky, but very far from a "programmer".
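
The extraction described in this comment could look roughly like the following Python sketch. It is not the script mentioned above. The sample XML is a hand-made, simplified stand-in: real abbyy.xml files are namespaced and far larger, and the assumption here is that each character is a charParams element carrying l/t/r/b coordinates, with wordPenalty set on the character marked wordStart="true". The function name uncertain_words and its min_penalty parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; attribute layout is an assumption.
SAMPLE = """\
<page>
  <line>
    <charParams l="10" t="5" r="18" b="20" wordStart="true" wordPenalty="0">t</charParams>
    <charParams l="19" t="5" r="27" b="20">o</charParams>
    <charParams l="28" t="5" r="34" b="20"> </charParams>
    <charParams l="35" t="5" r="44" b="20" wordStart="true" wordPenalty="36">L</charParams>
    <charParams l="45" t="5" r="53" b="20">o</charParams>
    <charParams l="54" t="5" r="62" b="20">n</charParams>
    <charParams l="63" t="5" r="71" b="20">d</charParams>
    <charParams l="72" t="5" r="80" b="20">r</charParams>
    <charParams l="81" t="5" r="89" b="20">a</charParams>
  </line>
</page>
"""


def uncertain_words(xml_text, min_penalty=1):
    """Return words whose penalty is at least min_penalty, with bounding boxes."""
    words = []
    for line in ET.fromstring(xml_text).iter("line"):
        current = None
        for ch in line.iter("charParams"):
            text = ch.text or ""
            if not text.strip():
                current = None  # whitespace ends the current word
                continue
            l, t, r, b = (int(ch.get(k)) for k in ("l", "t", "r", "b"))
            if current is None or ch.get("wordStart") == "true":
                penalty = int(ch.get("wordPenalty", "0"))
                current = {"text": "", "penalty": penalty, "box": [l, t, r, b]}
                words.append(current)
            else:
                # Grow the word's box so it covers this character too.
                box = current["box"]
                current["box"] = [min(box[0], l), min(box[1], t),
                                  max(box[2], r), max(box[3], b)]
            current["text"] += text
    return [w for w in words if w["penalty"] >= min_penalty]


for word in uncertain_words(SAMPLE):
    print(word["text"], word["penalty"], word["box"])
```

The returned box is what a CAPTCHA engine would need to crop the word image out of the scan; the default threshold of 1 mirrors the "wordPenalty > 0" rule used for the red words.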
Comment 10 Aarti Dwivedi 2013-07-07 06:07:00 UTC
Hi everyone,

   As Alessandro said, the words taken from the DjVu layer for the CAPTCHA should be chosen on the basis of their confidence level. The confidence level of words would be decided by the ProofreadPage extension itself. Words with a high penalty would be used for the CAPTCHA. I would also suggest not using the words in their complete form, but mixing two high-penalty words together. At present, the ProofreadPage extension doesn't have the facilities to do so. The spell checker (which would use the word penalty) would be implemented after the integration with VisualEditor has been done.
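
One possible reading of "mixing two high-penalty words" is to show two such words together in a single challenge. The Python sketch below illustrates only that selection step, under that assumption. The word records (dicts with "text" and "penalty"), the function name pick_challenge and the threshold value are all hypothetical, and nothing here reflects actual ProofreadPage code.

```python
import random


def pick_challenge(words, min_penalty, rng=random):
    """Pick two distinct high-penalty words to show together, or None."""
    candidates = [w for w in words if w["penalty"] >= min_penalty]
    if len(candidates) < 2:
        return None  # not enough uncertain words to build a challenge
    return rng.sample(candidates, 2)


# Hypothetical OCR output: higher penalty means less confident recognition.
ocr_words = [
    {"text": "the", "penalty": 0},
    {"text": "Londra", "penalty": 36},
    {"text": "ricordi", "penalty": 52},
    {"text": "nebbia", "penalty": 70},
]

pair = pick_challenge(ocr_words, min_penalty=30, rng=random.Random(0))
print([w["text"] for w in pair])
```

Because neither word in such a pair has a confirmed reading, a real service would still need some way to validate answers, for example agreement between several users; that part is not sketched here.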
Comment 11 Greg Grossmeier 2013-08-29 18:29:04 UTC
Hello, this is a quasi-automated-but-not-really message:

I am reviewing all tracking bugs for extensions to review and deploy to WMF servers. See the list here:
https://bugzilla.wikimedia.org/showdependencytree.cgi?id=31235&hide_resolved=1

The [[mw:Review queue]] page lists the steps necessary to complete the review. I have copied them below and done some initial filling out based on what I can easily glean from this bug and any obvious linked sources. If I miss something or state something false, please do correct me.

Also, if you haven't yet done so, please review the information on and linked to from:
https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment


== TODO/Check list ==
Extension page on mediawiki.org: no?
Bugzilla component: no?
Extension in Gerrit: on GitHub, please transfer to Gerrit
Design Review: not yet done (see comment 8)
Architecture/Performance Review: some
Security Review: no?
Screencast (if applicable): no
Community support: seems to be in its initial beginnings (at least in some of the tech community)

Other than the obvious things above that are 'no's, what else can I/WMF help with here to move it along?
Comment 12 Jared Zimmerman (WMF) 2013-08-30 23:56:43 UTC
Is there a "working" prototype somewhere, where the functionality can be tested (without setting up a development environment) and that Design can evaluate?
Comment 13 Bawolff (Brian Wolff) 2013-08-31 05:35:51 UTC
(In reply to comment #12)
> Is there a "working" prototype that the functionality can be testing
> somewhere (without setting up a development environment) that Design can evaluate.

I'm not exactly sure why a design review would be needed at this stage. The design is probably going to look very much like the current captcha, since the proposal is mostly about replacing the back end, not the front end.

/me still thinks my questions in comment 5 aren't sufficiently answered. I'd like answers to the tune of "we know this will be a good idea because of X", not "we think we couldn't possibly do worse than the current system, because the current system sucks so much" (which I wouldn't bet on). Heck, I'd even settle for a concrete description (something that could actually be evaluated) of what the folks working on this even plan to do.
Comment 14 Nemo 2013-08-31 11:05:02 UTC
I'm not sure whether that was clear enough, but there isn't any developer working on this project. The contributions Cristian and Alex can make are what they have already done and mentioned: making a proof of concept and investigating specifications for interaction with Wikisource, DjVu and so on.
Comment 15 vladjohn2013 2013-12-01 15:46:09 UTC
Hi, this project is still listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

Should this project still be listed on that page? If not, please remove it. If it still makes sense, then it could be moved to the "Featured projects" section if it has community support and mentors.
Comment 17 Aalekh Nigam 2014-02-17 18:03:53 UTC
Hello,
I have been a frequent contributor to MediaWiki, and as part of that I am looking for a project for the upcoming Google Summer of Code 2014. As mentioned in bug 32695 and the prototype developed by Pginer, my idea for the project is:

"Develop a captcha service using Wikimedia Commons images. The service will consist of a custom question built from two random keywords. Once random keyword 1 is selected, a question will be generated along with images fetched from the Commons database. There will also be a few images for keyword 2, which are not related to the question. We can also make use of image annotations as mentioned here (https://commons.wikimedia.org/wiki/File:Vitraux_de_la_basilique_Notre-Dame,_Genève_23.jpg)."

I therefore ask you all to comment on my idea, as I am really interested in working on this challenging but amazing project :) .

Eagerly waiting for your reply.
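
As a rough illustration of the two-keyword idea, the Python sketch below builds one challenge from in-memory image lists. It is only a sketch under stated assumptions: the pools dictionary stands in for images that would be fetched from Commons (no Commons API is called), and the names build_image_challenge and check, the keywords and the file titles are all hypothetical.

```python
import random


def build_image_challenge(pools, target, distractor,
                          n_target=3, n_distractor=3, rng=random):
    """Mix images for the target keyword with unrelated distractor images."""
    matching = rng.sample(pools[target], n_target)
    unrelated = rng.sample(pools[distractor], n_distractor)
    images = matching + unrelated
    rng.shuffle(images)  # hide which images belong to which keyword
    return {
        "question": f"Select every image that shows: {target}",
        "images": images,
        "answer": set(matching),
    }


def check(challenge, selected):
    """The response passes only if exactly the matching images were selected."""
    return set(selected) == challenge["answer"]


# Hypothetical keyword -> image title pools.
pools = {
    "bridge": ["Bridge_1.jpg", "Bridge_2.jpg", "Bridge_3.jpg", "Bridge_4.jpg"],
    "cat": ["Cat_1.jpg", "Cat_2.jpg", "Cat_3.jpg"],
}

challenge = build_image_challenge(pools, "bridge", "cat", rng=random.Random(1))
print(challenge["question"])
print(challenge["images"])
```

The answer set stays on the server side in this sketch; only the question and the shuffled image list would be shown to the user.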
Comment 18 Andre Klapper 2014-02-17 18:57:31 UTC
(In reply to Aalekh Nigam from comment #17)
> I therefore request you all to place comment into my idea regarding the
> project as i am really interested to work for this challenging but amazing
> project :) .

This should probably go to a wikipage where you explain your idea and where people could comment. Bugzilla might not be the best place for a lengthy discussion. Feel free to paste a link here as a comment.
Comment 19 Nemo 2014-02-17 19:26:41 UTC
Also, Aalekh, this bug is about Wikisource (scanned books) images; a CAPTCHA from Commons images would need a separate Bugzilla report.
Comment 20 Aalekh Nigam 2014-02-17 19:40:59 UTC
Actually, this was a simple idea for a way to handle the project. Since Commons is part of Wikimedia, my idea is that it might act as a database for various captcha options, as mentioned by Pginer in http://pauginer.tumblr.com/post/33445896205/captcha-ideas
Comment 21 Quim Gil 2014-03-13 14:41:08 UTC
Aalekh, your proposal is still missing from Google Melange. Please submit it there as a draft linking to your wiki page. In any case, we will evaluate your proposal on mediawiki.org. Thank you!
Comment 22 Aalekh Nigam 2014-03-20 15:10:18 UTC
Over the past few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014, but it currently seems that there is no primary technical mentor for the project. I therefore ask all members to please consider becoming part of this project as primary technical mentor.
Comment 23 Quim Gil 2014-03-22 18:13:44 UTC
Let's move the GSoC 2014 discussion to 

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile
