Wednesday, February 4, 2009

CAPTCHA and a proposed alternative, SAPTCHA

Introduction

(Skip to the next section if you are familiar with the concept of CAPTCHA.)

CAPTCHA stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart. [Wikipedia / Captcha]

Simply put, CAPTCHA is a set of methods commonly used to block automated account registration and similar mass abuse by making it costlier for the spammer. The most common type is the visual CAPTCHA, which tests image recognition. At the moment, though, computers running good software are no worse than humans at single-character image recognition (source); fortunately, spammers don't bother to use such software yet.

Most likely you have already been tested by CAPTCHAs: they are those images of distorted and obstructed letters that you must type into a text field to complete registration of an email account or to post a reply to a blog.

A verbal CAPTCHA would not discriminate against the vision impaired, but a computer can only generate a very limited subset of questions, so such a CAPTCHA would be relatively simple to defeat. Audio CAPTCHAs are uncommon because a visual one would still be needed, which would double the effort.



CAPTCHAs have numerous problems (see the Wikipedia article linked above for a good overview): there are existing methods of character recognition, and it is often possible to defeat a CAPTCHA by knowing the algorithm it uses.

Intuitively, while computers are not smart enough to pass a true Turing test, they may be smart enough to fool other computers.

In some CAPTCHAs, the image is obscured in a way that makes it harder for a human to read but has no effect on a computer. For example, a computer has no problem at all filtering out a colored background, but such a background can confuse a human (especially a colorblind one).

Often a human doesn't know how many letters should be there, and random lines may look like yet another distorted letter, confusing the human but not a computer that knows how many letters to expect. Some letters in common fonts differ too little to be reliably recognized by a human when distorted (such as 0, O; I, l, i, !, j; vv, w and so on). Humans recognize heavily distorted letters in handwriting from the context, but letters in CAPTCHAs lack context. Last but not least, such methods unnecessarily discriminate against disabled people who cannot see the image.

SAPTCHA

SAPTCHA stands for Semi Automatic Public Turing Test to Tell Computers and Humans Apart.

The key concept is the same as with CAPTCHA: the user is presented with a test question or instructions and must give the correct answer to use the resource. The main difference is that the computer does not try to automatically generate "unique" test questions on each query; only verification of the answer is automatic. Instead, a unique test question and its answer[s] are set by the moderator or owner when SAPTCHA is installed, and should be easy to change if needed.
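
To make the idea concrete, here is a minimal sketch in Python of what "only verification of the answer is automatic" could look like. The names Saptcha, normalize, check and replace are made up for this illustration and are not part of any existing software; the normalization rule (ignore letter case and extra spaces) is also just an assumption about what a forgiving check might do.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial typing differences do not fail a human."""
    return " ".join(text.lower().split())


@dataclass
class Saptcha:
    """A question written by the site owner plus the answers accepted for it (hypothetical type)."""
    question: str
    answers: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Store the accepted answers in normalized form once, at installation time.
        self.answers = {normalize(a) for a in self.answers}

    def check(self, reply: str) -> bool:
        """The only automatic part: compare the reply with the stored answers."""
        return normalize(reply) in self.answers

    def replace(self, question: str, answers: Iterable[str]) -> None:
        """Swap in a new question and answers, e.g. after the old one is compromised."""
        self.question = question
        self.answers = {normalize(a) for a in answers}


saptcha = Saptcha("Write 'i am human' in reverse", {"namuh ma i"})
print(saptcha.check("Namuh ma i"))   # accepted despite the capital letter
print(saptcha.check("i am human"))   # rejected
```

Nothing here generates questions; the question text is whatever the owner typed, which is the whole point.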

SAPTCHA is proposed as a more accessible alternative to CAPTCHA that may replace CAPTCHA in services such as most blogs and forums. SAPTCHA works as a lightweight CAPTCHA.

The concept follows from the observation that there are many cases where automated generation of a unique test question or image does not add much to the prevention of abuse: a spammer does not need to pass the test more than once on the same forum or blog anyway. Often there is no human spammer interacting with the website at all [who wouldn't love to think that his site is so important that it is spammed personally :-)]; in such cases a static question is no worse at stopping a bot than a dynamic one. Human-generated questions have much broader diversity and are thus harder for a computer to answer. It must also be noted that CAPTCHA itself is not really "completely automatic": a human has to write and maintain the test software, which will not change often but is costly to develop.

Example questions: the user is given an instruction like "write [no i'm not a computer!] in this text field" or "write 'i'm human' in reverse" or "write [or copy-paste] the web address of this page there". (Please don't use anything too similar to these. There should be no default questions and answers; think up something yourself. Don't try to be clever. The question should be no more complex to understand and answer than the rest of the registration instructions and the use of the resource itself, and thus shouldn't decrease the website's accessibility(!). It's better if the answer is more than 1 character long, or if there is a delay or block for bots that "try again".)
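
The "delay or block" for bots that try again could be sketched like this. It is only an illustration under assumptions: the class name RetryThrottle, the doubling rule and the 2-second base delay are all invented here, and a real site would have to decide how to identify a client (IP address, session and so on).

```python
import time
from typing import Callable, Dict


class RetryThrottle:
    """Blocks a client for a growing delay after each wrong answer (hypothetical helper)."""

    def __init__(self, base_delay: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.base_delay = base_delay
        self.clock = clock  # injectable so the behaviour can be checked without waiting
        self._failures: Dict[str, int] = {}
        self._blocked_until: Dict[str, float] = {}

    def allowed(self, client: str) -> bool:
        """True if this client may submit an answer right now."""
        return self.clock() >= self._blocked_until.get(client, float("-inf"))

    def record(self, client: str, correct: bool) -> None:
        """Remember the outcome of one attempt."""
        if correct:
            self._failures.pop(client, None)
            self._blocked_until.pop(client, None)
            return
        failures = self._failures.get(client, 0) + 1
        self._failures[client] = failures
        # The delay doubles with each consecutive failure: base, 2*base, 4*base, ...
        delay = self.base_delay * 2 ** (failures - 1)
        self._blocked_until[client] = self.clock() + delay


throttle = RetryThrottle()
throttle.record("203.0.113.5", correct=False)
print(throttle.allowed("203.0.113.5"))  # blocked right after a wrong answer
```

A human who mistypes once waits a couple of seconds; a bot guessing blindly runs into delays that grow with every miss.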

Bots can try to understand text written by a human in normal language (a very hard problem in AI), or try to guess (some delay can make that pointless), or try some common test answers if there are any (but common test questions and answers will quickly disappear).

A spammer has to answer the question manually to start spamming. This is exactly the same problem as with CAPTCHA at registration. As with CAPTCHA at registration, human intervention is necessary to stop the spam: the account must be banned and, for SAPTCHA, the question must be changed (if a bot can reuse the answer automatically).

In a way, SAPTCHA can be viewed as a lightweight, disposable CAPTCHA test that is cheap to replace when it gets compromised.

Comparison

Sample use scenarios

SAPTCHA

s.0) A normal user comes across your blog. If he can answer the question, he can post a reply, unless you wrote a bad question or bad instructions. If the user can't read your question, he probably can't read your blog either, so the SAPTCHA shouldn't make it less accessible.

s.1) A spammer bot comes across your blog. No spamming happens. Bots can't understand human language yet.

s.2) A human spammer comes across your blog/forum, answers the question, registers an account, and possibly adds the answer and account to a spambot database or proceeds to spam manually. You are spammed. It will take a moderator to ban the spammer and stop the spam; the banning form may also ask the moderator for a new question and answer, which need to be provided if the spamming was done by a bot that "knows" the answer to the question.
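
The banning form from scenario s.2 might be handled roughly as below. This is a sketch, not a description of any real forum software: Site and ban_spammer are hypothetical names, and the rule that the new question must differ from the old one is an assumption added for the example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Site:
    """Minimal stand-in for a blog or forum protected by one SAPTCHA question (hypothetical)."""
    question: str
    answers: Set[str]
    banned: Set[str] = field(default_factory=set)


def ban_spammer(site: Site, account: str,
                new_question: Optional[str] = None,
                new_answers: Optional[Iterable[str]] = None) -> None:
    """Ban an account; optionally replace the question if a bot has learned the answer."""
    replacement = None
    if new_question is not None or new_answers is not None:
        answers = set(new_answers or [])
        if not new_question or not answers:
            raise ValueError("a new question needs at least one answer, and vice versa")
        if new_question == site.question:
            raise ValueError("the new question must differ from the compromised one")
        replacement = (new_question, answers)

    # Validation passed, so apply the ban and the question change together.
    site.banned.add(account)
    if replacement is not None:
        site.question, site.answers = replacement


site = Site("Write the name of this blog", {"my blog"})
ban_spammer(site, "spambot42", "Which month comes after June?", ["july"])
print(site.question)  # Which month comes after June?
```

If the spam was posted by hand, the moderator simply leaves the question fields empty and only the account is banned.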

CAPTCHA

s.0) A normal user comes across your blog/forum. If he can see and the CAPTCHA is simple, he can post a reply with a small hassle, provided he doesn't have to pass the CAPTCHA every time he replies. If the CAPTCHA is "unbreakable" or uses bad colors, he will need a few tries and is going to get annoyed, especially if he needs to pass it for every reply. If he is blind or otherwise can't see it, there is no way.

s.1) A spammer bot comes across your blog. You might get spammed if the bot can recognize the image (this is possible if you are using a popular CAPTCHA), but most likely you won't.

s.2) A human spammer comes across your blog/forum. He can answer the question, register an account, and add it to a spambot database. You are spammed. It will take a moderator to ban the bot and delete the spam [assuming that spam filters alone don't suffice without CAPTCHA], so you still need human intervention on your side. As has been said before, if you asked users to pass a CAPTCHA for every message, it would be too annoying for normal users as well.

Comparison of SAPTCHA versus CAPTCHA features

Advantages of SAPTCHA over CAPTCHA:

  1. SAPTCHA software is much easier to implement than CAPTCHA.
  2. Textual SAPTCHA does not discriminate against disabled people who can use the internet. [Audio CAPTCHA plus visual CAPTCHA would double the effort and is thus very uncommon in practice.]
  3. There are methods for breaking image-based CAPTCHAs. If you use a popular CAPTCHA, you may still get spammed by an entirely automatic bot. SAPTCHAs can be much more varied, and there won't be a common method of breaking them until computers can interpret human instructions in normal human language.

Advantages of CAPTCHA over SAPTCHA (disadvantages of SAPTCHA):

  1. With SAPTCHA, when banning a spammer, the moderator must enter a new question and answer. With CAPTCHA, though, there is point 1 of the SAPTCHA advantages above (and CAPTCHA code won't remain useful forever either), so for websites that are not extremely popular it seems highly unlikely that CAPTCHA would save work even in the long run.
  2. If SAPTCHA is used to protect registration, it is easier to register many accounts at once than with CAPTCHA; this may matter for popular email services.
  3. Verbal SAPTCHA is problematic for a multi-language resource that needs frequent question changes.
  4. For something like a photo gallery, a visual CAPTCHA is all right, as it doesn't add to inaccessibility.

Conclusion:

SAPTCHA can be a viable alternative to CAPTCHA for web resources like forums and blogs, and in other situations where the spammer cannot afford to target resources individually. With textual resources, SAPTCHA does not lessen the accessibility of the resource.

It is suggested that forum and blogging software should offer support for SAPTCHA in addition to existing support for CAPTCHA, thus allowing the administrator to use SAPTCHA and switch to CAPTCHA only when and if SAPTCHA is found to be really inadequate in a given situation (which is expected to happen only on very popular web resources). By its method of operation, SAPTCHA can give only limited protection against account registration abuse when the abuser is willing to solve the SAPTCHA once and then run a bot that registers a great many accounts (e.g. to use email as storage), which would be prevented by a CAPTCHA on every registration.

Live example of question

John had one thousand apples and five oranges. He ate as many of his apples as there are letters in the word "apple". Also he ate two bananas :-). How many apples does John have?

Your answer:


If you are annoyed by CAPTCHA, think about alternatives and discuss the concept of SAPTCHA with others. Make the best meme win.

Source: http://dmytry.pandromeda.com/