CAPTCHAs, those pesky challenge-response tests that many web sites use to determine whether you are human or a spambot, are an annoyance to many users. According to a report in Science (subscription required), users now solve about 100 million CAPTCHAs a day. ReCAPTCHA, a project based at Carnegie Mellon University, has found an ingenious way to harness all this work and, according to the findings published in Science this week, CAPTCHAs could be used to transcribe printed texts at the rate of 160 books a day.
The current implementation of reCAPTCHA is being used by over 40,000 web sites. The basic idea behind reCAPTCHA is that optical character recognition (OCR), even though it is constantly improving, is still unable to cope with texts where the print has faded or a page is slightly damaged. While humans can transcribe a text with about 99% accuracy, OCR software often doesn't get beyond 80% when dealing with a slightly damaged text.

reCAPTCHA combines traditional OCR with an approach similar to Amazon's Mechanical Turk. Every text is analyzed by two different OCR programs, and whenever those two programs disagree on a word, it is marked as 'suspicious.' Those suspicious words are then fed into reCAPTCHA, which creates a CAPTCHA with both the suspicious word and a known control word. Once a certain number of users have solved the suspicious word with the same result, it becomes a control word itself.
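The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the names `find_suspicious` and `WordPool`, the agreement threshold of 3, and the case-insensitive comparison are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical value: the article only says "a certain number of users".
AGREEMENT_THRESHOLD = 3


def find_suspicious(ocr_a, ocr_b):
    """Return the positions where two OCR outputs disagree on a word."""
    return [i for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


class WordPool:
    """Collects human guesses for suspicious words (illustrative only)."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # word id -> tally of guesses
        self.resolved = {}                 # word id -> accepted transcription

    def submit(self, control_answer, control_truth, word_id, guess):
        """Count a guess only if the user got the known control word right.

        Returns True if the user passed the challenge, False otherwise.
        """
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # failed the control word, so the guess is ignored
        if word_id in self.resolved:
            return True   # already settled; nothing more to record
        normalized = guess.strip().lower()
        self.votes[word_id][normalized] += 1
        # Once enough users agree, the word could serve as a control word.
        if self.votes[word_id][normalized] >= self.threshold:
            self.resolved[word_id] = normalized
        return True
```

Here a guess only counts when the same user also typed the control word correctly, which is what lets the system trust answers for a word it cannot check directly.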
Overall, reCAPTCHA achieves an accuracy of 99.1%, which is on par with the accuracy achieved by having two humans type the text and then verify the results.
While it is mostly a proof of concept right now, reCAPTCHA's developers calculate that the system can be used to transcribe the equivalent of 160 books a day.
The most fascinating aspect of this idea is that it turns mental energy, which would otherwise be wasted, into something useful. Other projects like fold.it, which turns protein folding into a game, or Google's Image Labeler take a similar approach, but the user has to actively decide to play a game. reCAPTCHA, on the other hand, turns a chore into a useful project.
Comments
You need to check the site's archives before writing a new post, Frederic. This service has been covered 3 or 4 times already.
Posted by: Dan Grossman | September 24, 2008 12:25 PM
Dan - we did indeed cover it when it first launched. However, this is the first time we have seen actual results and data from the project.
Posted by: Frederic Lardinois | September 24, 2008 12:33 PM
Although I am not the biggest fan of CAPTCHAs (they significantly lower completion rates), I think it's great that we are helping transcribe books in the process by using reCAPTCHA. What a fantastic and original idea. You go!
Posted by: Sergio | September 24, 2008 12:37 PM
Excellent coverage Frederic. I would like to see more projects turn wasted mental energy into something useful.
Posted by: Brady Brim-DeForest | September 24, 2008 12:37 PM
Sometimes bloggers get accidentally penalized for spamming when they don't
Posted by: "www.ShawnDrewry.com" | September 24, 2008 12:57 PM
Wisdom of the crowds? This is a very intelligent approach. Hurrah.
Posted by: Allan | September 24, 2008 2:39 PM
Unfortunately, by using reCAPTCHA, you're turning your users into workhorses and asking them to decipher a potentially indecipherable word while they're most likely trying to fill out a form and register for a site.
This system seems like a noble idea, but discourages many users from completing the registration process.
Don't even get me started on their audio option...
Posted by: mc | September 24, 2008 3:09 PM
I am currently using reCAPTCHA, but I am looking for an alternative solution.
Some words are just very hard to read. I constantly need to enter them 3 or 4 times in order to post.
Posted by: Alex | September 24, 2008 3:10 PM
It's annoying that the Science article can't be read without a subscription :/
But this is a great use for the human processing power that is put into solving CAPTCHAs.
As with Mechanical Turk, if certain problems can be solved efficiently and accurately by humans, that work can contribute significantly to important efforts like Archive.org's transcription program.
Still it's important to use CAPTCHAs carefully, like making only anonymous users fill one out, or just requiring one at the time of registration. For preventing spam, are there any other methods that work as well?
Posted by: mike | September 24, 2008 11:36 PM