Comment spam: wake up and smell the Hashcash coffee
22 May 2007 - 18:29
This blog (and others on exaflop.org) uses weblog software called Pivot. A number of versions back, Pivot implemented a system called hashcash to defeat the comment spambots that are the scourge of the blogosphere.
Let's compare hashcash to the alternatives:-
- Keyword blocking: this is as ineffective in blogging as it was in the email world before it. It'll catch many spams, but it will also let many through. The blog owner has to keep looking through the spams and adding more and more keywords to the block list. Eventually you have to stop adding keywords or nobody will be able to add anything to the blog!
- IP blocking: this is ineffective as well because spamming is typically performed by zombie botnets (arrays of PCs that are infected with malware and follow the instructions of remote users while appearing to their owners to work normally). The spams arrive from all manner of different IP addresses and besides, you still need to delete all the spams by hand with this method.
- CAPTCHAs - Completely Automated Public Turing Test to tell Computers and Humans Apart. These show the commenter an image that contains a string of letters and numbers. The user has to type this string into a field on the comment form along with their other details. There are two problems with CAPTCHAs. The first is that they are a usability nightmare - nobody likes having to pass a test like this every time they submit a form and for people with sight problems it can be impossible to pass. The second problem is that OCR (optical character recognition) techniques can be used to defeat the test. As a result of the second problem, CAPTCHAs have been made progressively more and more difficult to pass, exacerbating the first.
- Bayesian filtering: This is the same as the most popular method of email filtering. A mathematical analysis is made to try and recognise if the comment looks like a spam in the same sort of way that we would recognise many spams as spam without even reading them - we recognise many different signs such as strings of garbage characters, implausible names and other features. This works about as well as it does with email spam, i.e. it produces some false positives (real comments tagged as spam) and false negatives (spams that get through).
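To make the Bayesian idea a little more concrete, here is a minimal sketch of a naive Bayes word filter. It is only an illustration under my own assumptions: the function names (`train`, `spam_probability`) and the tiny training set are made up for this example, and real filters use far more data and cleverer tokenising. It does show why false positives and false negatives are inevitable: the filter only ever produces a probability, never a certainty.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. `messages` is an iterable of (text, is_spam) pairs."""
    words = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        words[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return words, totals

def spam_probability(text, words, totals):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocab = set(words[True]) | set(words[False])
    spam_total = sum(words[True].values())
    ham_total = sum(words[False].values())
    # Start from the prior odds of spam versus genuine comments.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (words[True][word] + 1) / (spam_total + len(vocab))
        p_ham = (words[False][word] + 1) / (ham_total + len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data, purely for illustration.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("great post thanks for sharing", False),
    ("thanks I agree with this post", False),
]
words, totals = train(training)
print(spam_probability("buy cheap pills", words, totals))
print(spam_probability("thanks for the post", words, totals))
```

A comment is then flagged when the probability crosses some threshold, and wherever you put that threshold, some genuine comments land above it and some spams land below it.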
As you can see from the above list, two of the four main alternative approaches are completely useless and the other two have serious problems of usability and effectiveness. Hashcash by comparison has been completely effective. That needs more emphasis really. In a year I have had fewer than 10 spam comments appear on my blog. That is few enough for me to believe that the spams that did get through were entered manually. That is something I can live with.
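For anyone wondering what hashcash actually involves, the general idea is proof of work: the commenter's machine has to burn a little CPU time finding a value whose hash meets some condition, and the server can check the answer almost for free. I haven't described how Pivot's own implementation works, so treat the following as a generic sketch under my own assumptions: the challenge string, the difficulty of 12 bits and the use of SHA-1 are all illustrative choices, not Pivot's.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # For a non-zero byte, 8 - bit_length() is its number of leading zero bits.
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash has `difficulty` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha1(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty

# Hypothetical challenge that a server might embed in a comment form.
challenge = "comment-form-42"
nonce = mint(challenge, difficulty=12)
print(verify(challenge, nonce, difficulty=12))
```

The asymmetry is the point: at 12 bits the client needs about 4,000 hash attempts on average while the server needs one. That is unnoticeable for a person posting a single comment, but it adds up for a bot trying to post thousands.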
Akismet is a Bayesian filter system used by many bloggers including those on WordPress. It uses a 'hive mind' approach that combines spam data from many users to improve effectiveness. Even so, lately I am seeing several bloggers (most notably Robert Scoble) complaining about Akismet either not filtering out all the spam, or catching too many genuine comments in its filter. Apparently there is (or at least has been in the past) a Hashcash plugin for WordPress. I would strongly suggest people check out this option. Akismet is a nice idea, but it is clearly not as effective as it should be. Hashcash is effective. I don't doubt that Scoble gets more attempts on his blog than I do on mine, but the results should scale because 100% effective scales.
The final issue then is the objection that spammers will someday build spambots that can defeat Hashcash. This is a completely bogus reason not to use Hashcash on your blog now. It is possible that one day hashcash will not be enough to stop spambots. But at the moment the picture is far better for those using hashcash than it is for those relying on CAPTCHAs and Bayesian filtering. Make hay while the sun shines I say!