Chris Wilson over at Slate posted an article calling for a move away from CAPTCHA tests and toward algorithms that observe a user’s interaction with a webpage to verify their humanity.

The problem, of course, is that observing the user’s interactions requires Javascript. Javascript requires a browser, and spammers aren’t using browsers to fill out forms on the web, are they? Sure, if interaction-based verification systems were varied enough, they might slow the average spammer down, but overcoming them would still be quite trivial.

No matter how you arrange it, in the end the spammer can not only read the Javascript code to see what’s going on, but also mimic its responses to the server.

Let’s say this hypothetical spambot detector told the server its result in a hidden HTML field. Easy: the spammer fills the field with the expected value. OK, let’s say the detector sent an AJAX call to the server once it was sure this wasn’t a spammer. But the spammer can send AJAX calls too.
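To make this concrete, here’s a minimal sketch of that bypass in Python. The endpoint, field names, and token are all invented for this example; the point is that the bot posts whatever the page’s script would have posted, without ever running it.

```python
# Minimal sketch: bypass a client-side "humanity" check by posting the
# value its script would have produced. Endpoint and field names are
# hypothetical.
import urllib.parse
import urllib.request

form_data = urllib.parse.urlencode({
    "name": "Totally A Person",
    "message": "Buy my pills",
    "humanity": "human-ok",  # token the page's Javascript would have computed
}).encode()

request = urllib.request.Request(
    "http://example.com/comment",  # hypothetical form handler
    data=form_data,                # supplying data makes this a POST
)
urllib.request.urlopen(request)    # accepted, and no browser was involved
```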

The reason CAPTCHAs are effective is that the server withholds a bit of information that the user must figure out from the image. With any sort of behavior-based (and thus user-side script-based) CAPTCHA, there really isn’t any information you can hold back. Worse yet, how would the spambot detector vary the situation for each unique page access? It couldn’t.

There’s a reason the brightest minds in the CS field have only been tweaking the current model: it’s the most effective (and possibly the only) way to stop spam, even if only temporarily. So I don’t think we’ll be able to get rid of “human checks” of some form or another any time soon.

That being said, the field has been shifting substantially toward knowledge-based CAPTCHAs like the ones mentioned earlier in the article. Case in point: the forums for the open source game Cube, where you are asked a question like “what color is the sky?” Without being prepared for this particular question, a computer would not know the answer was “blue” unless it was capable of understanding the concept of a sky, as well as being familiar with the sky itself. With a sufficient number of questions chosen uniformly at random, this is a good way to stop spam. Better yet, the effectiveness of this model grows with the number of questions your system can ask: if you have 10,000 questions (and thus 10,000 correct answers), the spammer will need to prepare his bot to answer a great many of them.
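For illustration, here’s a minimal sketch of such a scheme. The question pool and function names are invented for this example, and the pool would of course need to stay server-side and secret.

```python
# Sketch of a knowledge-based CAPTCHA: a secret server-side question pool,
# sampled uniformly at random for each challenge.
import random

QUESTIONS = {
    "What color is the sky?": "blue",
    "How many legs does a dog have?": "4",
    # ...ideally thousands more entries
}

def new_challenge():
    """Pick a question uniformly at random from the pool."""
    return random.choice(list(QUESTIONS))

def check_answer(question, answer):
    """Compare the response against the stored answer, ignoring case."""
    return answer.strip().lower() == QUESTIONS[question].lower()
```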

Furthermore, unlike with letter-based CAPTCHAs, you can assume a user is a bot on much less evidence. Honestly, assuming the user can read the language of the questions and that they are truly common-knowledge questions, a user shouldn’t need to request a new question more than twice, right? So why not limit it to two? This would mean that out of those 10,000 questions in your system, the spammer would have to prepare answers to roughly 5,000 known questions to have a reasonable chance (about 75% across two draws) of getting through.
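A quick back-of-the-envelope check of that figure, using nothing but the numbers above:

```python
# If the pool holds N questions and the bot knows answers to K of them,
# what is the chance it is served at least one known question in two draws?
N, K = 10_000, 5_000

# Two draws without replacement: the bot fails only if both are unknown.
p_fail = ((N - K) / N) * ((N - K - 1) / (N - 1))
p_pass = 1 - p_fail
print(f"{p_pass:.1%}")  # ~75.0%, i.e. "a reasonable chance"
```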

Naturally, this model depends on the secrecy of the questions. Any open source CAPTCHA system of this kind would be easy to crack, since the spammer could just grep through the source to pull the questions and corresponding answers out of the code. This really is the downfall of all set-based (or non-random) CAPTCHAs.

The article also mentions something I hadn’t heard about before: using a hidden field to trick spambots into filling it, since no human could fill a box they can’t actually see. The author dismisses this too quickly. It would be trivial to come up with a system where the server randomizes the names and order of the hidden field and the genuine message field. The server would not indicate inside the web page which was the correct one, but would instead keep that information in each user’s server-side session. The main CSS file for the page would be dynamic, providing a rule matching the field recorded in the session, so that the hidden style is applied only to the field that is supposed to be hidden for any given request.
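Here’s a rough sketch of that idea, using Flask purely for illustration; the routes, field generation, and markup are all invented for this example.

```python
# Sketch of a randomized honeypot: two identical-looking fields with random
# names, where only a per-session dynamic stylesheet knows which one to hide.
import secrets
from flask import Flask, Response, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # required for server-side sessions

@app.route("/form")
def comment_form():
    real, decoy = secrets.token_hex(4), secrets.token_hex(4)
    session["real_field"], session["decoy_field"] = real, decoy
    fields = [real, decoy]
    secrets.SystemRandom().shuffle(fields)  # randomize field order too
    inputs = "".join(f'<input type="text" name="{n}">' for n in fields)
    return (f'<link rel="stylesheet" href="/style.css">'
            f'<form method="post" action="/submit">{inputs}</form>')

@app.route("/style.css")
def dynamic_css():
    # The only hint that one field is hidden lives in this per-session CSS.
    rule = f'input[name="{session["decoy_field"]}"] {{ display: none; }}'
    return Response(rule, mimetype="text/css")

@app.route("/submit", methods=["POST"])
def submit():
    if request.form.get(session["decoy_field"]):
        return "Spam detected", 403  # a human could never see that field
    return f'Got: {request.form.get(session["real_field"], "")}'
```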

This can be circumvented, but not without quite a bit of work. The spambot would have to parse the CSS and apply it to the input fields to determine which ones are hidden. Add a few more layers of obfuscation: serve that rule from one of 3-4 different CSS files, make the system capable of doing this with _any_ CSS file, and perhaps use multiple hidden fields. Do all that and you’ve got a pretty tough system.
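To give a feel for the bot’s side of that arms race, here is a deliberately naive sketch. A real bot would need a full CSS parser and cascade logic; this regex handles only the single-rule form generated in the sketch above, and the URL is hypothetical.

```python
# Naive counter-move: fetch the stylesheet and avoid any field whose name
# appears in a display:none rule. Handles only the simple rule shape above.
import re
import urllib.request

css = urllib.request.urlopen("http://example.com/style.css").read().decode()
hidden = re.findall(r'input\[name="([^"]+)"\]\s*\{[^}]*display\s*:\s*none', css)
print("fields the bot should leave empty:", hidden)
```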

I think the *real* key to reducing illicit CAPTCHA solutions is to make the process as varied as possible among sites. Just as operating system monoculture promotes the spread of computer viruses, CAPTCHA monoculture promotes the spread of illicit solutions.
