Spam statistics are live!

I've blogged about spam a few times before, and as you might have guessed, defending against it and analysing the statistics thereof is a bit of a hobby of mine. Since I first installed the comment key system (and then later upgraded it) in 2015, I've been keeping a log of all the attempts to post spam comments on my blog. Currently it amounts to ~27K spam attempts, which works out to ~14 spam comments per day overall(!) - far too many to sort through manually!
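
As a quick sanity check on that per-day figure (dates approximate - early 2015 to mid-2020 is roughly 5 years):

echo $(( 27000 / (5 * 365) ))
# prints 14 (integer division, but close enough for a rough estimate)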

This tracking system is based on mistakes. I have a number of defences in place, and each time one of them is tripped, the mistake is logged. For example, here are the mistake codes for some of my defences:

Code Meaning
website A web address was entered (you'll notice you can't see a website address field in the comment form below - it's hidden from regular users)
shortcomment The comment was too short
invalidkey The comment key was invalid
http10notsupported The request was made over HTTP 1.0 instead of HTTP 1.1+
invalidemail The email address entered was invalid

These are the 5 leading causes of comment posting failures over the past month or so. Until recently, the system would only log the first defence that was tripped - leaving any other defences that might also have been tripped unrecorded. This saves on computational resources, but doesn't help the statistics I've been steadily gathering.

With the new system I implemented on the 12th June 2020, a comment is checked against all current defences - even if one of them has been tripped already, leading to some interesting new statistics. I've also implemented a quick little statistics calculation script, which is set to run every day via cron. The output thereof is public too, so you can view it here:
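
For reference, scheduling a script like that only takes a single crontab entry - something like this (the script path here is hypothetical):

# m h dom mon dow  command
0 4 * * *  /usr/local/bin/calculate-comment-stats.sh

This would run the statistics script every day at 4am.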

Failed comment statistics

Some particularly interesting things to note are the differences in the mistake histograms. There are 2 pairs thereof: one that tracks all the data, and another that only tracks data recorded after the 12th June 2020 (when I implemented the new mistake recording system).

From this, we can see that if we look at only the first mistake made, invalidkey catches more spammers out by a landslide. However, if we look at all the mistakes made, the website check wins out instead - the invalidkey check happens before the website check, so being the first line of defence was skewing the results.

Also interesting is how comment spam numbers have grown over time in the spam-by-month histogram. Although it's a bit early to tell from that graph, there's a very clear peak around May / June, which I suspect is due to malicious actors attempting to take advantage of people who may not be able to moderate their content as closely due to recent happenings in the world.

I also notice that the overall amount of spam I receive trends upwards. I suspect this is because more people know about my website now that it's been around for longer.

Finally, I notice that in the average number of mistakes (after 2020-06-12) histogram, most spammers make at least 2 mistakes. Unfortunately there's also a significant percentage of spammers who make only a single mistake, so I can't yet relax the rules such that you need to make 2 or more mistakes to be considered a spammer.

Incidentally, it would appear that the most common pair of mistakes to make are shortcomment and website - perhaps this is an artefact of some specific scraping / spamming software? If I knew more in this area I suspect that it might be possible to identify the spammer given the mistakes they've made and perhaps their user agent.
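
If I wanted to dig into those pairs, something like this would do as a rough sketch - assuming (hypothetically) that the new system writes all the mistakes made as a comma-separated list in the final field of the log:

awk 'BEGIN { FS="|" } { gsub(/ *mistake: */, "", $NF); print $NF }' failedcomments.log | sort | uniq -c | sort -nr | head

This counts how often each combination of mistakes occurs, most common first.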

This is, of course, a very rudimentary analysis of the data at hand. If you're interested, get in touch and I'm happy to consider sharing my dataset with you.

Where in the world does spam come from?

Answer: The US, apparently. I was having a discussion with someone recently, and since I have a rather extensive log of comment failures for debugging & analysis purposes (dating back to February 2015!) they suggested that I render a map of where the spam is coming from.

It was such a good idea that I ended up doing just that - and somehow also writing this blog post :P

First, let's start off by looking at the format of said log file:

[ Sun, 22 Feb 2015 07:37:03 +0000] invalid comment | ip: a.b.c.d | name: Nancyyeq | articlepath: posts/015-Rust-First-Impressions.html | mistake: longcomment
[ Sun, 22 Feb 2015 14:55:50 +0000] invalid comment | ip: e.f.g.h | name: Simonxlsw | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey
[ Sun, 22 Feb 2015 14:59:59 +0000] invalid comment | ip: x.y.z.w | name: Simontuxc | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey

Unfortunately, I didn't think about parsing it programmatically when I designed the log file format.... Oops! It's too late to change it now, I suppose :P

Anyway, as output we want a list of countries in 1 column, and a count of the number of IP addresses in another. First things first - we need to extract those IP addresses. awk is ideal for this. I quickly cooked this up:

BEGIN {
    FS="|"                    # Split each line into fields on the solid bar character
}

{
    gsub(" ip: ", "", $2);    # Strip the " ip: " label from the second field...
    print $2;                 # ...leaving just the IP address
}

This basically tells awk to split lines on the solid bar character (|), extract the IP address bit (ip: p.q.r.s), and strip out the ip: label.

With this done, we're ready to lookup all these IP addresses to find out which country they're from. Unfortunately, IP addresses can change hands semi-regularly - even across country borders, so my approach here isn't going to be entirely accurate. I don't anticipate the error generated here to be all that big though, so I think it's ok to just do a simple lookup.

If I was worried about it, I could probably investigate cross-referencing the IP addresses with a GeoIP database from the date & time I recorded them. The effort here would be quite considerable - and this is a 'just curious' sort of thing, so I'm not going to do that here. If you have done this, I'd love to hear about it though - post a comment below.

Actually doing a GeoIP lookup is fairly easy. While for the odd IP address here and there I usually use ipinfo.io, when there are lots of lookups to be done (10,479 to be exact! Wow.), it's best to utilise a local database. A quick bit of research reveals that Ubuntu Server has a package called geoip-bin that should do the job:


sudo apt install geoip-bin
(....)
geoiplookup 1.1.1.1 # CloudFlare's 1.1.1.1 DNS service
GeoIP Country Edition: AU, Australia

Excellent! We can now lookup IP addresses automagically via the command line. Let's plug that in to the little command chain we got going on here:

cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup

It doesn't look like geoiplookup supports multiple IP addresses at once, which is a shame. In that case, the above will take a while to execute for 10K IP addresses.... :P
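
If the runtime bothered me, GNU xargs' -P flag could run several lookups in parallel - a sketch (the output order isn't guaranteed with -P, but that doesn't matter here since everything gets sorted later anyway):

cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 -P4 geoiplookup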

Next up, we need to remove the annoying label there. That's easy with sed:

(...) | sed -E 's/^[A-Za-z: ]+, //g'

I had some trouble here getting sed to accept a regular expression. At some point I'll have to read the manual pages more closely and write myself a quick reference guide. Come to think about it, I could use such a thing for awk too - their existing reference guide appears to have been written by a bunch of mathematicians who like using single-letter variable names everywhere.

Anyway, now that we've got our IP address list, we need to strip out any errors, and then count them all up. The first part is somewhat awkward, since geoiplookup doesn't send errors to the standard error stream for some reason, but we can cheese it with grep -v:

(...) | grep -iv 'resolve hostname'

The -v here tells grep to instead remove any lines that match the specified string, instead of showing us only the matching lines. This appeared to work at first glance - I simply copied a part of the error message I saw and worked with that. If I have issues later, I can always look at writing a more sophisticated regular expression with the -P option.

The counting bit can be achieved in bash with a combination of the sort and uniq commands. sort will, umm, sort the input lines, and uniq will de-duplicate multiple consecutive input lines, whilst optionally counting them. With this in mind, I wound up with the following:

(...) | sort | uniq -c | sort -n

The first sort call sorts the input to ensure that all identical lines are next to each other, ready for uniq.

uniq -c does the de-duplication, but also inserts a count of the number of duplicates for us.

Lastly, the final sort call with the -n argument sorts the completed list via a natural sort, which means (in our case) that it handles the numbers as you'd expect it to. I'd recommend you read the Wikipedia article on the subject - it explains it quite well. This should give us an output like this:

      1 Antigua and Barbuda
      1 Bahrain
      1 Bouvet Island
      1 Egypt
      1 Europe
      1 Guatemala
      1 Ireland
      1 Macedonia
      1 Mongolia
      1 Saudi Arabia
      1 Tuvalu
      2 Bolivia
      2 Croatia
      2 Luxembourg
      2 Paraguay
      3 Kenya
      3 Macau
      4 El Salvador
      4 Hungary
      4 Lebanon
      4 Maldives
      4 Nepal
      4 Nigeria
      4 Palestinian Territory
      4 Philippines
      4 Portugal
      4 Puerto Rico
      4 Saint Martin
      4 Virgin Islands, British
      4 Zambia
      5 Dominican Republic
      5 Georgia
      5 Malaysia
      5 Switzerland
      6 Austria
      6 Belgium
      6 Peru
      6 Slovenia
      7 Australia
      7 Japan
      8 Afghanistan
      8 Argentina
      8 Chile
      9 Finland
      9 Norway
     10 Bulgaria
     11 Singapore
     11 South Africa
     12 Serbia
     13 Denmark
     13 Moldova, Republic of
     14 Ecuador
     14 Romania
     15 Cambodia
     15 Kazakhstan
     15 Lithuania
     15 Morocco
     17 Latvia
     21 Pakistan
     21 Venezuela
     23 Mexico
     23 Turkey
     24 Honduras
     24 Israel
     29 Czech Republic
     30 Korea, Republic of
     32 Colombia
     33 Hong Kong
     36 Italy
     38 Vietnam
     39 Bangladesh
     40 Belarus
     41 Estonia
     44 Thailand
     50 Iran, Islamic Republic of
     53 Spain
     54 GeoIP Country Edition: IP Address not found
     60 Poland
     88 India
    113 Netherlands
    113 Taiwan
    124 Indonesia
    147 Sweden
    157 Canada
    176 United Kingdom
    240 Germany
    297 China
    298 Brazil
    502 France
   1631 Russian Federation
   2280 Ukraine
   3224 United States

Very cool. Here's the full command for reference (see the explainshell explanation for a breakdown):

cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup | sed -e 's/GeoIP Country Edition: //g' | sed -E 's/^[A-Z]+, //g' | grep -iv 'resolve hostname' | sort | uniq -c | sort -n

With our list in hand, I imported it into LibreOffice Calc to parse it into a table with the fixed-width setting (Google Sheets doesn't appear to support this), and then pulled that into a Google Sheet in order to draw a heat map:

A world map showing the above output in a heat-map style. Countries with more failed comment attempts appear redder.

At first, the resulting graph showed just a few countries in red and the rest in white. To rectify this, I pushed the counts through the natural log (log()) function, which yielded a much better map - countries that have only been spamming a little are still shown in a shade of red.
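
For reference, the same transform could be bolted onto the end of the pipeline with awk (awk's log() is the natural logarithm; the +1 is my addition to stop log(1) = 0 from hiding the single-spam countries):

(...) | awk '{ n = $1; $1 = ""; printf "%.3f%s\n", log(n + 1), $0 }'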

From this graph, we can quite easily conclude that the 'spammiest' countries are:

  1. The US
  2. Russia
  3. Ukraine (I get lots of spam emails from here too)
  4. China (I get lots of SSH intrusion attempts from here)
  5. Brazil (Wat?)

Personally, I was rather surprised to see the US in the top spot. I figured that with tough laws on that sort of thing, spammers wouldn't risk buying a server and sending spam from there.

On further thought though, it occurred to me that it may be because there are simply lots of infected machines in the US that are being abused (without the knowledge of their unwitting users) to send lots of spam.

At any rate, I don't appear to have a spam problem on my blog at the moment - it's just fascinating to investigate where the spam I do block comes from.

Found this interesting? Got an observation of your own? Plotted a graph from your own data? Comment below!

Just another day: More spam, more defences

When this post is released I'll be in an exam, but I wanted to post again about the perfectly fascinating spam situation here on my blog. I've blogged about fending off spam on here before (exhibits a, b, c), but I find the problem of detecting it transparently - in a manner that you as the reader don't notice - very interesting, so I think I'll write another post on the subject. I could use a service like Google's ReCAPTCHA, but that would be boring :P

Recently I've had a trio of spam comments make it all the way through my (rather extensive) checks and onto my blog here. I removed them, of course, but it still baffled me as to why they made it through.

It didn't take long to find out. When I was first implementing comments on here, I added a logger specifically for purposes such as this, which saves everything about the current environment state to a log file for later inspection - for both comments that make it through, and those that don't. It isn't publicly available, of course (but if you'd like to take a look, just ask and I'll consider it). Upon isolating the entries for the spam comments, I discovered a few interesting things.

  • The comment keys were aged 21, 21, and 17 seconds respectively (the lower limit I have set is 10 seconds)
  • All 3 comments claimed that they were Firefox 57
  • 2 out of 3 comments used HTTP 1.0 (even though they claimed to be Firefox 57, and despite my server offering HTTP/1.1 and HTTP/2.0)
  • All 3 comments utilised HTTPS
  • The IP addresses that the comments came from were in Ukraine, Russia, and Canada (hey?) respectively
  • All 3 appear to be phishing scams, with a link leading to a likely malicious website
  • The 2 using HTTP/1.0 also asked my server to close the connection after sending a response
  • All 3 asked not to be tracked via the DNT HTTP header
  • The last comment had some really weird capitalisation. After consulting someone experienced on the subject, I learnt that the writer likely natively spoke an eastern language, such as Chinese

This was most interesting. From this, I can conclude:

  • The last comment was likely submitted by a Chinese operator - even though the source IP address is located in Ukraine
  • All three are spoofing their user agent string.
    • Firefox 57 uses HTTP/2.0 by default if you're really in a browser, and the spam comments utilised HTTP/1.0 and HTTP/1.1
    • Curiously, all of this took place over HTTPS. I'd be really curious to log which cipher was used for the connection here.
    • In light of this, if I knew more about HTTP client libraries, I could probably identify what software was really used to submit the spam comments (and possibly even what operating system it was running on). If you know, please comment below!

To combat this development, I thought of a few options. Firstly, raising the minimum comment age, whilst effective, may disrupt the user experience, which I don't want to do. Plus, the bot owners could just increase the delay even more. To that end, I decided not to do this.

Secondly, with the amount of data I've collected, I could probably write an AI that takes the environment in and spits out a 'spaminess' score, much like SpamAssassin and rspamd do for email. Perhaps a multi-weighted system would work, with a series of tests that add or take away from the final score? I might investigate upgrading my spam detection system to do this in the future, as it would not only block spam more effectively, but provide a more distilled overview of the characteristics of each comment submission than I have currently.
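
To illustrate the multi-weighted idea, here's a quick sketch of the concept in bash - the tests, weights, and threshold are all made up for illustration, and a real implementation would live in the comment-handling code itself:

#!/usr/bin/env bash
# Sketch of a multi-weighted spam scoring system. Each test adds to the
# score; past a threshold, the comment is considered spam.

# Example environment for a single comment submission (made-up values):
server_protocol="HTTP/1.0"
key_age_seconds=7
website_field="http://spam.example"
comment_text="buy now"

score=0
[[ "${server_protocol}" == "HTTP/1.0" ]] && (( score += 3 ))  # Ancient protocol version
(( key_age_seconds < 10 ))               && (( score += 4 ))  # Submitted suspiciously fast
[[ -n "${website_field}" ]]              && (( score += 5 ))  # Hidden honeypot field filled in
(( ${#comment_text} < 20 ))              && (( score += 2 ))  # Comment too short

threshold=5
if (( score >= threshold )); then
    echo "spam (score ${score})"
else
    echo "probably legitimate (score ${score})"
fi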

Lastly, I could block HTTP/1.0 requests. While not perfect (1 out of the 3 requests used HTTP/1.1), it would still catch some more bots out without disrupting the user experience - as normal browsers (including text-based ones, IIRC) use HTTP/1.1 or above. HTTP/1.1 has been around since 1997 (over 20 years!), so if you're not using it by now - upgrade! For now, this is the best option I can see.

From today, if you try to submit a comment over HTTP/1.0, you'll get an HTTP 505 HTTP Version Not Supported error and see a message saying something like this:

You sent your request via HTTP/1.0, but this is not supported for submitting comments due to high volume of spam. Please retry with HTTP/1.1 or higher.

...then you'll have to upgrade and / or reconfigure your web browser. Please let me know (my email address is on the homepage) if this causes any issues for anyone, and I'll help you out.
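
If you're curious what that looks like in practice, curl can force an HTTP/1.0 request (the URL here is just a stand-in for the real comment endpoint):

curl --http1.0 -i -X POST https://example.com/blog/comments
# ...should yield a 505 HTTP Version Not Supported response
# (the exact response line depends on the server)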

Found this interesting? Know more about this? Got a better solution? Comment below!

4287 Reasons why your comments weren't posted

I don't get a lot of real comments on here from what I can tell, as you've probably noticed. I don't particularly mind (though it's always awesome when I do get one!) - but what I do mind is the spam. Since February 2015, I've gotten 4287 spam comments. 4287! It's actually quite silly, when you think about it.

The other day I was fiddling with the code behind this blog (posts about that coming soon!), and I discovered that ages ago I implemented a log that records each and every spammer, and the mistake they made - so I thought I'd share some statistics here, along with some tips for dealing with spam yourself (I've posted about tactics before here).

Here's an extract from my logs (full logs available on request):

[ Sun, 02 Apr 2017 23:17:25 +0100] invalid comment | ip: 94.181.153.194 | name: ghkkll | articlepath: posts/120-Cpu-Registers.html | mistake: shortcomment
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey

Since the output format I chose is nice and regular, I could use a quick bit of bash magic to whip up some statistics:

cat failedcomments.log | sed -e 's/^.*mistake\: //' | grep -iv '\[' | sort | uniq -c | sort -nr

(explanation, courtesy of explainshell.com)

That gave me this output:

Count Reason
3922 invalidkey
148 website
67 nokey
67 noarticleid
56 noname
17 shortcomment
4 badarticlepath
3 longcomment
2 invalidemail
1 shortname

My first thoughts here were along the lines of "wow, that's a lot of spam", then "that comment key is working really well!", and finally "I didn't realise how helpful that fake website field is". Still, though I had a table, I thought a visualisation might help to put things into perspective.

A pair of pie charts showing a breakdown of which mistakes spammers made most often. Explained below.

There - much better :D As you might have suspected if you've been following my blog here for a while, having an invalid comment key is the most common mistake spammers make.

The comment key is a hidden field in the comment form that is actually a transformed timestamp of the time you loaded the page. Working it backwards, I can work out how long it took you to submit a comment from first loading the page.
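
The real implementation lives in this blog's server-side code, but the core idea can be sketched with a couple of shell commands - encryption being one possible way to do the 'transform' (the passphrase is obviously a placeholder, and this assumes OpenSSL 1.1.1+ for the -pbkdf2 flag):

# Generating a comment key: encrypt the current unix timestamp
key="$(date +%s | openssl enc -aes-256-cbc -pbkdf2 -pass pass:CHANGE_ME -base64 -A)"

# Validating it on submission: decrypt, then check how long ago the page was loaded
loaded="$(printf '%s' "${key}" | openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:CHANGE_ME -base64 -A)"
age=$(( $(date +%s) - loaded ))
(( age >= 10 )) && echo "ok (${age}s)" || echo "too fast (${age}s) - probably a bot"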

Using a companion log file to the one that I generated the above pie chart from, I've calculated that 3558 potential comments were submitted within 10 seconds of loading the page! No ordinary human is that fast (especially considering you probably want to read the article before commenting!) - they have to be bots. Here's a graph to illustrate the dropoff (the time is in seconds):

A graph illustrating how many bots commented on my blog unnaturally fast.

Out of the other reasons that people failed, "website" was the second most common mistake, with ~3.45% of spammers getting caught out by it. This mistake refers to another of my little spam traps - defence in depth is always good! This particular one is a regular website address field, hidden from view via some fancy CSS. Curious, I decided to investigate further - and what I found was fascinating.

About 497 spammers entered an invalid website address (i.e. one that doesn't start with http) into the website box - which I really can't understand, since it's got a (hidden) label and an appropriate name and type to match - 90 of which decided that "seo plugin" was a brilliant thing to fill it with! It's important to note here that spammers who got caught by the invalid comment key filter above are included in these statistics - here's the bash command I used here:

grep -i '"website"' rawsubmits.jsonlog  | sed -e 's/^.*"website": "//' -e 's/",//' -e 's/\\\///' | uniq | egrep -iv '^http' | wc -l

Other examples include "watch live sports free", "Samantha", "just click the following web site", long strings of HTML-encoded unicode characters (Japanese, I think, after decoding one), and more. Perfectly baffling, if you ask me (if you can shed some light on this one, please comment below!).

57 spambots forgot their own name. This could be because the box you put your name in below has a name of 'name', but an id of 'namebox' - which may have caused some confusion for some of the more stupid bots.

After all that, there were 3 long comments (probably a bunch of word salad), 2 invalid email addresses that weren't caught by any filters above, and 1 short name (under 3 characters).

That's about it for this impromptu analysis of my comments log! This took far longer than I thought it would to type up. Did you find it interesting? Thinking of putting some of these techniques into practice yourself? Comment below!
