Where in the world does spam come from?
Answer: The US, apparently. I was having a discussion with someone recently, and since I have a rather extensive log of comment failures for debugging & analysis purposes (dating back to February 2015!) they suggested that I render a map of where the spam is coming from.
It was such a good idea that I ended up doing just that - and somehow also writing this blog post :P
First, let's start off by looking at the format of said log file:
[ Sun, 22 Feb 2015 07:37:03 +0000] invalid comment | ip: a.b.c.d | name: Nancyyeq | articlepath: posts/015-Rust-First-Impressions.html | mistake: longcomment
[ Sun, 22 Feb 2015 14:55:50 +0000] invalid comment | ip: e.f.g.h | name: Simonxlsw | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey
[ Sun, 22 Feb 2015 14:59:59 +0000] invalid comment | ip: x.y.z.w | name: Simontuxc | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey
Unfortunately, I didn't think about parsing it programmatically when I designed the log file format.... Oops! It's too late to change it now, I suppose :P
Anyway, as output we want a list of countries in one column, and a count of the number of IP addresses in another. First things first - we need to extract those IP addresses, and awk is ideal for the job. I cooked this up quickly:
BEGIN {
FS="|"
}
{
gsub(" ip: ", "", $2);
print $2;
}
This basically tells awk to split lines on the pipe character (|), extract the IP address field (ip: p.q.r.s), and then strip out the ip: prefix.
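To check it does what I think it does, here's that awk program run over a single log line (using the anonymised sample from earlier):

```shell
# Extract the IP address field from one log line; note that a stray trailing
# space from the field survives
echo '[ Sun, 22 Feb 2015 07:37:03 +0000] invalid comment | ip: a.b.c.d | name: Nancyyeq | articlepath: posts/015-Rust-First-Impressions.html | mistake: longcomment' \
  | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }'
# prints: a.b.c.d (plus a trailing space)
```

The trailing space doesn't matter here, since xargs (used below) splits its input on whitespace anyway.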
With this done, we're ready to look up all these IP addresses to find out which country they're from. Unfortunately, IP addresses can change hands semi-regularly - even across country borders - so my approach here isn't going to be entirely accurate. I don't anticipate the error generated here being all that big though, so I think it's ok to just do a simple lookup.
If I was worried about it, I could probably investigate cross-referencing the IP addresses with a GeoIP database from the date & time I recorded them. The effort here would be quite considerable - and this is a 'just curious' sort of thing, so I'm not going to do that here. If you have done this, I'd love to hear about it though - post a comment below.
Actually doing a GeoIP lookup is fairly easy. While for the odd IP address here and there I usually use ipinfo.io, when there are lots of lookups to be done (10,479 to be exact! Wow.), it's probably best to utilise a local database. A quick bit of research reveals that Ubuntu Server has a package that should do the job, called geoip-bin:
sudo apt install geoip-bin
(....)
geoiplookup 1.1.1.1 # CloudFlare's 1.1.1.1 DNS service
GeoIP Country Edition: AU, Australia
Excellent! We can now look up IP addresses automagically via the command line. Let's plug that into the little command chain we've got going on here:
cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup
It doesn't look like geoiplookup supports multiple IP addresses at once, which is a shame. In that case, the above will take a while to execute for 10K IP addresses.... :P
Next up, we need to remove the annoying label there. That's easy with sed:
(...) | sed -E 's/^[A-Za-z: ]+, //g'
I had some trouble here getting sed to accept a regular expression. At some point I'll have to read the manual pages more closely and write myself a quick reference guide. Come to think of it, I could use such a thing for awk too - its existing reference guide appears to have been written by a bunch of mathematicians who like using single-letter variable names everywhere.
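For what it's worth, the expression does work on the sample output from earlier:

```shell
# Strip everything up to and including the first ", " - i.e. the
# "GeoIP Country Edition: AU, " label - leaving just the country name
echo 'GeoIP Country Edition: AU, Australia' | sed -E 's/^[A-Za-z: ]+, //g'
# prints: Australia
```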
Anyway, now that we've got our list of countries, we need to strip out any errors and then count them all up. The first part is somewhat awkward, since geoiplookup doesn't send errors to the standard error stream for some reason, but we can cheese it with grep -v:
(...) | grep -iv 'resolve hostname'
The -v here tells grep to remove any lines that match the specified string, rather than showing us only the matching lines. This appeared to work at first glance - I simply copied a part of the error message I saw and worked with that. If I have issues later, I can always look at writing a more sophisticated regular expression with the -P option.
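A quick sanity check - note that the error line below is paraphrased from memory, so the exact wording geoiplookup prints may differ:

```shell
# grep -iv drops any line containing 'resolve hostname' (case-insensitively),
# letting the real lookup results through
printf '%s\n' 'GeoIP Country Edition: AU, Australia' \
              'can not resolve hostname ( spam.example.com )' \
  | grep -iv 'resolve hostname'
# prints: GeoIP Country Edition: AU, Australia
```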
The counting bit can be achieved in bash with a combination of the sort and uniq commands. sort will, umm, sort the input lines, and uniq will de-duplicate multiple consecutive input lines, whilst optionally counting them. With this in mind, I wound up with the following:
(...) | sort | uniq -c | sort -n
The first sort call sorts the input to ensure that all identical lines are next to each other, ready for uniq. uniq -c does the de-duplication, and also inserts a count of the number of duplicates for us.
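Those two are easiest to see on a toy input:

```shell
# sort groups identical lines together; uniq -c collapses each group and
# prefixes it with a count; the final sort -n orders by that count
printf '%s\n' France Ukraine France Ukraine Ukraine | sort | uniq -c | sort -n
# prints (modulo leading padding):
#   2 France
#   3 Ukraine
```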
Lastly, the final sort call with the -n argument sorts the completed list numerically, which means (in our case) that it handles the counts as you'd expect it to - a plain lexical sort would put 10 before 2. (GNU sort also has -V for a full natural sort, but plain numeric is all we need here.) This should give us an output like this:
1 Antigua and Barbuda
1 Bahrain
1 Bouvet Island
1 Egypt
1 Europe
1 Guatemala
1 Ireland
1 Macedonia
1 Mongolia
1 Saudi Arabia
1 Tuvalu
2 Bolivia
2 Croatia
2 Luxembourg
2 Paraguay
3 Kenya
3 Macau
4 El Salvador
4 Hungary
4 Lebanon
4 Maldives
4 Nepal
4 Nigeria
4 Palestinian Territory
4 Philippines
4 Portugal
4 Puerto Rico
4 Saint Martin
4 Virgin Islands, British
4 Zambia
5 Dominican Republic
5 Georgia
5 Malaysia
5 Switzerland
6 Austria
6 Belgium
6 Peru
6 Slovenia
7 Australia
7 Japan
8 Afghanistan
8 Argentina
8 Chile
9 Finland
9 Norway
10 Bulgaria
11 Singapore
11 South Africa
12 Serbia
13 Denmark
13 Moldova, Republic of
14 Ecuador
14 Romania
15 Cambodia
15 Kazakhstan
15 Lithuania
15 Morocco
17 Latvia
21 Pakistan
21 Venezuela
23 Mexico
23 Turkey
24 Honduras
24 Israel
29 Czech Republic
30 Korea, Republic of
32 Colombia
33 Hong Kong
36 Italy
38 Vietnam
39 Bangladesh
40 Belarus
41 Estonia
44 Thailand
50 Iran, Islamic Republic of
53 Spain
54 GeoIP Country Edition: IP Address not found
60 Poland
88 India
113 Netherlands
113 Taiwan
124 Indonesia
147 Sweden
157 Canada
176 United Kingdom
240 Germany
297 China
298 Brazil
502 France
1631 Russian Federation
2280 Ukraine
3224 United States
Very cool. Here's the full command for reference:
cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup | sed -e 's/GeoIP Country Edition: //g' | sed -E 's/^[A-Z]+, //g' | grep -iv 'resolve hostname' | sort | uniq -c | sort -n
With our list in hand, I imported it into LibreOffice Calc to parse it into a table with the fixed-width setting (Google Sheets doesn't appear to support this), and then pulled that into a Google Sheet in order to draw a heat map:
At first, the resulting graph showed just a few countries in red, and the rest in white. To rectify this, I pushed the counts through the natural log (log()) function, which yielded a much better map - countries that have only been spamming a little bit are still shown in a shade of red.
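Incidentally, the same transformation is easy to reproduce on the command line, since awk's log() is the natural logarithm. The two counts below are picked from the table above:

```shell
# Replace each count with its natural log, keeping the country name
printf '1 Tuvalu\n3224 United States\n' \
  | awk '{ count = $1; $1 = ""; printf "%.2f%s\n", log(count), $0 }'
# prints:
# 0.00 Tuvalu
# 8.08 United States
```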
From this graph, we can quite easily conclude that the 'spammiest' countries are:
- The US
- Russia
- Ukraine (I get lots of spam emails from here too)
- China (I get lots of SSH intrusion attempts from here)
- Brazil (Wat?)
Personally, I was rather surprised to see the US in the top spot. I figured that with tough laws on that sort of thing, spammers wouldn't risk attempting to buy a server and send spam from there.
On further thought though, it occurred to me that it may simply be that there are lots of infected machines in the US being abused, without their owners' knowledge, to send lots of spam.
At any rate, I don't appear to have a spam problem on my blog at the moment - it's just fascinating to investigate where the spam I do block comes from.
Found this interesting? Got an observation of your own? Plotted a graph from your own data? Comment below!