In the last two weeks spam has run rampant in the comments of the blog. Even though I have not linked it from anywhere, Google managed to find it, and that is probably the gateway for the bots. I decided not to include a reCAPTCHA (again) in the comment submission form because it is quite annoying, so it was time to set up a spam filter.
I came up with the idea of using a statistical, content-based approach, and very little googling led me to Paul Graham's "A Plan for Spam", which convinced me to go with this method.
I have coded a simple Bayesian classifier that takes into account which words appear in spam comments and which do not. So far it has been effective at filtering spam (no false negatives), but since I have no readers yet (this site is taking too long to publish), I cannot speak for false positives.
Anyway, I am sure that as false positives arise (if they do) and I flag them as not spam, the classifier will learn to tell spam from legitimate comments.
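To give an idea of what such a classifier can look like, here is a minimal sketch in Python. It is not the code running on this blog, and it uses a plain naive Bayes with Laplace smoothing rather than the exact formula from Graham's essay. The names `SpamFilter`, `tokenize`, the `"spam"`/`"ham"` labels, and the 0.9 threshold are all assumptions made up for the illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens (assumed token rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class SpamFilter:
    """Hypothetical naive Bayes comment filter; labels are "spam" and "ham"."""

    def __init__(self):
        # How many comments of each class contained a given word.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        # How many comments of each class have been seen.
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Each word counts once per comment, however often it repeats.
        self.word_counts[label].update(set(tokenize(text)))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        total = sum(self.doc_counts.values())
        if total == 0:
            return 0.5  # nothing learned yet, so stay neutral
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        words = set(tokenize(text)) & vocab  # words never seen are ignored
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total + 2))
            for word in words:
                # Smoothed P(word appears | class); +1/+2 avoids zero probabilities.
                seen = self.word_counts[label][word]
                score += math.log((seen + 1) / (self.doc_counts[label] + 2))
            log_score[label] = score
        # Turn the two log scores into a probability without underflow.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)

    def is_spam(self, text, threshold=0.9):
        # A high threshold favors letting spam through over blocking real comments.
        return self.spam_probability(text) > threshold


spam_filter = SpamFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches online now", "spam")
spam_filter.train("great post thanks for sharing", "ham")
spam_filter.train("I think the classifier idea is great", "ham")

print(spam_filter.is_spam("buy cheap pills"))
print(spam_filter.is_spam("great post"))

# Flagging a wrongly blocked comment as not spam is just more training.
spam_filter.train("where can I buy this book", "ham")
```

Correcting a false positive is the same `train` call with the other label, which is why the filter should improve as flagged comments come in. The threshold is deliberately high because blocking a real reader is worse than letting one spam comment slip through.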