Spam statistics are live!

I've blogged about spam a few times before, and as you might have guessed defending against it and analysing the statistics thereof is a bit of a hobby of mine. Since I first installed the comment key system (and then later upgraded) in 2015, I've been keeping a log of all the attempts to post spam comments on my blog. Currently it amounts to ~27K spam attempts, which is about ~14 comments per day overall(!) - so far too many to sort out manually!

This tracking system is based on mistakes. I have a number of defences in place, and each time that defence is tripped it logs it. For example, here are some of the mistake codes for some of my defences:

Code	Meaning
website	A web address was entered (you'll notice you can't see a website address field in the comment form below - it's hidden to regular users)
shortcomment	The comemnt was too short
invalidkey	The comment key)
http10notsupported	The request was made over HTTP 1.0 instead of HTTP 1.1+
invalidemail	The email address entered was invalid

These are the 5 leading causes of comment posting failures over the past month or so. Until recently, the system would only log the first defence that was tripped - leaving other defences that might have been tripped untouched. This saves on computational resources, but doesn't help the statistics I've been steadily gathering.

With the new system I implemented on the 12th June 2020, a comment is checked against all current defences - even if one of them has been tripped already, leading to some interesting new statistics. I've also implemented a quick little statistics calculation script, which is set to run every day via cron. The output thereof is public too, so you can view it here:

Failed comment statistics

Some particularly interesting things to note are the differences in the mistake histograms. There are 2 sets thereof: 1 pair that tracks all the data, and another that only tracks the data that was recorded after 12th June 2020 (when I implemented the new mistake recording system).

From this, we can see that if we look at only the first mistake made, invalidkey catches more spammers out by a landslide. However, if we look at all the mistakes made, the website check wins out instead - this is because the invalidkey check happens before the website check, so it was skewing the results because the invalidkey defence is the first line of defence.

Also interesting is how comment spam numbers have grown over time in the spam-by-month histogram. Although it's a bit early to tell by that graph, there's a very clear peak around May / June, which I suspect are malicious actors attempting to gain an advantage from people who may not be able to moderate their content as closely due to recent happenings in the world.

I also notice that the overall amount of spam I receive has an upwards trend. I suspect this is due to more people knowing about my website since it's been around for longer.

Finally, I notice that in the average number of mistakes (after 2020-06-12) histogram, most spammers make at least 2 mistakes. Unfortunately there's also a significant percentage of spammers who make only a single mistake, so I can't yet relax the rules such that you need to make 2 or more mistakes to be considered a spammer.

Incidentally, it would appear that the most common pair of mistakes to make are shortcomment and website - perhaps this is an artefact of some specific scraping / spamming software? If I knew more in this area I suspect that it might be possible to identify the spammer given the mistakes they've made and perhaps their user agent.

This is, of course, a very rudimentary analysis of the data at hand. If you're interested, get in touch and I'm happy to consider sharing my dataset with you.

Comments

(no comments yet)

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog