Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression conference conferences containerisation css dailyprogrammer data analysis debugging defining ai demystification distributed computing dns docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics guide hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs latex learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js open source operating systems optimisation outreach own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering research resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt

Spam statistics are live!

I've blogged about spam a few times before, and as you might have guessed defending against it and analysing the statistics thereof is a bit of a hobby of mine. Since I first installed the comment key system (and then later upgraded) in 2015, I've been keeping a log of all the attempts to post spam comments on my blog. Currently it amounts to ~27K spam attempts, which is about ~14 comments per day overall(!) - so far too many to sort out manually!

This tracking system is based on mistakes. I have a number of defences in place, and each time that defence is tripped it logs it. For example, here are some of the mistake codes for some of my defences:

Code Meaning
website A web address was entered (you'll notice you can't see a website address field in the comment form below - it's hidden to regular users)
shortcomment The comemnt was too short
invalidkey The comment key)
http10notsupported The request was made over HTTP 1.0 instead of HTTP 1.1+
invalidemail The email address entered was invalid

These are the 5 leading causes of comment posting failures over the past month or so. Until recently, the system would only log the first defence that was tripped - leaving other defences that might have been tripped untouched. This saves on computational resources, but doesn't help the statistics I've been steadily gathering.

With the new system I implemented on the 12th June 2020, a comment is checked against all current defences - even if one of them has been tripped already, leading to some interesting new statistics. I've also implemented a quick little statistics calculation script, which is set to run every day via cron. The output thereof is public too, so you can view it here:

Failed comment statistics

Some particularly interesting things to note are the differences in the mistake histograms. There are 2 sets thereof: 1 pair that tracks all the data, and another that only tracks the data that was recorded after 12th June 2020 (when I implemented the new mistake recording system).

From this, we can see that if we look at only the first mistake made, invalidkey catches more spammers out by a landslide. However, if we look at all the mistakes made, the website check wins out instead - this is because the invalidkey check happens before the website check, so it was skewing the results because the invalidkey defence is the first line of defence.

Also interesting is how comment spam numbers have grown over time in the spam-by-month histogram. Although it's a bit early to tell by that graph, there's a very clear peak around May / June, which I suspect are malicious actors attempting to gain an advantage from people who may not be able to moderate their content as closely due to recent happenings in the world.

I also notice that the overall amount of spam I receive has an upwards trend. I suspect this is due to more people knowing about my website since it's been around for longer.

Finally, I notice that in the average number of mistakes (after 2020-06-12) histogram, most spammers make at least 2 mistakes. Unfortunately there's also a significant percentage of spammers who make only a single mistake, so I can't yet relax the rules such that you need to make 2 or more mistakes to be considered a spammer.

Incidentally, it would appear that the most common pair of mistakes to make are shortcomment and website - perhaps this is an artefact of some specific scraping / spamming software? If I knew more in this area I suspect that it might be possible to identify the spammer given the mistakes they've made and perhaps their user agent.

This is, of course, a very rudimentary analysis of the data at hand. If you're interested, get in touch and I'm happy to consider sharing my dataset with you.

How to run a successful blog: My experiences so far

About 2 and a half years (and 207 posts!) ago today, I took the advice of Rob Miles and started this blog. I've learned a lot since I started from some wonderful people, and the quality of my posts on here has improved immensely. I've been meaning to write a post on what I've learnt about running this blog, but haven't gotten around to it until now :-)

The first, and most important, thing you can do is to post often. By posting often you keep people coming back to read more. If you leave it too long between posts, people may forget about your blog and may not come back.

Of course, while having regular content is good, having a posting schedule is considered by some to be even better. While I'm not particularly good at this yet, if you do manage to achieve it you can ensure that people know when they can come back and expect another post.

It's also important that you post about something you find interesting. Doing so improves the quality of the post, and it makes for a much more interesting read. If you're passionate about what you're writing about, people will generally come back for more.

Next, blogging is something that you have to persevere with. You certainly won't acquire readers overnight. It's worth bearing in mind that most of your readers will be silent - only a very small proportion of readers will actually leave a comment, no matter how easy you make the process (I hardly ever get comments).

Also, don't set out with the goal of getting $x$ number of regular readers, or earning $y$ amount of money. They are bad motivators to have, and you will surely be disappointed. Rather, you should set out to just write about something you love, without expecting anyone to read what you've written. Then you'll be pleasantly surprised :D

That's all I can think of right now. There are surely other things that I haven't mentioned. If you have any tips you've learnt from running a blog of your own, post about them in the comments :D

Art by Mythdael