I was trying to figure out a data set to use to create examples with Bokeh when I thought of visualizing failed attempts to scrape my web server. When I first opened up my network to the outside, I noticed a lot of incoming requests from random locations in my logs. From what I could find, this is weirdly normal: there are apparently bots constantly probing the internet, trying to find vulnerable endpoints. What I want to do is take these requests and plot their numbers and locations. To achieve this, the project can be broken up into the following sections:

  • creating a small script to filter our logs for the events we want
  • creating another small script with Python to format the data
  • using the Python requests package with a third-party API for information on each of the IP addresses
  • using Python's concurrent.futures to make multiple requests at the same time
  • creating visualizations with the Python package Bokeh.

All the files for these are located here.

Before we go on, this piece will not be going into security; however, I'll take this moment to recommend taking proper steps to secure your server if you plan on venturing in this direction. Avoid using any default settings. My ports and administration accounts have all been changed from the defaults; a quick scan will reveal the open ports anyway, but it's still better to have a non-standard setup. I also use fail2ban, a free and open-source tool that scans your log files for IP addresses that make too many authentication attempts. I have it set up to ban an address for 24 hours after 3 failed attempts.


Filtering Log Files

The log files I’m interested in are the ones from Apache2 and sshd.

The log files for sshd are located under /var/log as some variant of auth.log. We’ll create a bash script to cycle through these and remove known log-ins, as well as disconnects.

LOG_DIR=/var/log
ACCESS_FILES=$(ls ${LOG_DIR}/auth.log*)
for FILE in ${ACCESS_FILES[@]}; do
	if [[ ${FILE} == *.gz ]]; then
		gunzip -c ${FILE} | grep "sshd" | grep -Ev "mtopacio|disconnect"
	else
		cat ${FILE} | grep "sshd" | grep -Ev "mtopacio|disconnect"
	fi
done

We're cycling through each auth.log file. Some of the older ones are compressed and saved as a *.gz, so for each file we first determine whether it's compressed. If it is, gunzip -c writes the contents to stdout without altering the original file. That output is piped into a grep command looking for the term sshd, the daemon that handles ssh connections, and then through one more pipe that uses grep's -Ev options to invert the regex matches for mtopacio and disconnect. Lines matching those patterns are ignored, and everything else is printed to screen. The else branch does the same thing; there's just no need to use gunzip on an uncompressed file.

The Apache2 logs are located under /var/log/apache2. There will also be multiple variants of access.log. For these log files, we'll do almost the same thing, with an additional filter for the following error codes:

Error Code   Description
400          HTTP_BAD_REQUEST
401          HTTP_UNAUTHORIZED
403          HTTP_FORBIDDEN
404          HTTP_NOT_FOUND
405          HTTP_METHOD_NOT_ALLOWED
500          HTTP_INTERNAL_SERVER_ERROR

We’ll also remove any results from any of the services I have open to the outside (i.e. a web page, git server, and Nextcloud server).

LOG_DIR=${LOG_DIR}/apache2
ACCESS_FILES=$(ls ${LOG_DIR}/access.log*)
ERROR_CODES=(400 401 403 404 405 500)
for FILE in ${ACCESS_FILES[@]}; do
    for ERROR in ${ERROR_CODES[@]}; do
        if [[ ${FILE} == *.gz ]]; then
            gunzip -c ${FILE} | grep -Ev "marktopac.io|git|Nextcloud" | grep ${ERROR}
        else
            cat ${FILE} | grep -Ev "marktopac.io|git|Nextcloud" | grep ${ERROR}
        fi
    done
done

I chose to focus on only the six error codes above. Further analysis of your logs would be a deeper dive into digital forensics and is outside the scope of this tutorial.

The whole script should look something like this:

#!/usr/bin/env bash

LOG_DIR=/var/log

# ssh attempts

ACCESS_FILES=$(ls ${LOG_DIR}/auth.log*)

for FILE in ${ACCESS_FILES[@]}; do
    if [[ ${FILE} == *.gz ]]; then
        gunzip -c ${FILE} | grep "sshd" | grep -Ev "mtopacio|disconnect"
    else
        cat ${FILE} | grep "sshd" | grep -Ev "mtopacio|disconnect"
    fi
done

# webserver requests

LOG_DIR=${LOG_DIR}/apache2

ACCESS_FILES=$(ls ${LOG_DIR}/access.log*)
ERROR_CODES=(400 401 403 404 405 500)

for FILE in ${ACCESS_FILES[@]}; do 
    for ERROR in ${ERROR_CODES[@]}; do
        if [[ ${FILE} == *.gz ]]; then
            gunzip -c ${FILE} | grep -Ev "marktopac.io|git|Nextcloud" | grep ${ERROR} 
        else
            cat ${FILE} | grep -Ev "marktopac.io|git|Nextcloud" | grep ${ERROR}
        fi
    done
done

Run the script and pipe the results into a text file with $ ./filter.sh > anomalies.txt.


Formatting Data

Now that we have our data, I want to format it so it's easier to work with. I'll be using Python to parse it into an acceptable format.

Our script will be simple string manipulation. It begins with reading the file into memory. The first set of lines comes from our ssh filter; after splitting a line on spaces, we can recognize these by the fourth element, "Metis", which is the name of my server. Every line where "Metis" is the fourth word is treated as a line from the ssh log. We'll keep this simple for now: the only pieces of data I want are the IP address and the timestamp, and the rest can be lumped together as a description. The timestamp, however, is only written with the abbreviated month and day, so I'll use the datetime package to add the year 2020 and change it to a YYYY-MM-DD format. We're going to output everything in csv form. The whole thing looks a little something like this:

#!/usr/bin/env python3

from datetime import datetime, timedelta

with open('anomalies.txt','r') as input_file:
    lines = [line.strip() for line in input_file.readlines()]

for line in lines[10:]:

    line = [lin for lin in line.split(' ') if lin != '']

    if line[3] == 'Metis':
        # formatting for lines from the sshd log files
        ip_address = line[-3]
        timestamp = " ".join(line[:3])
        desc = " ".join(line[5:])
        dt = datetime.strptime(timestamp, "%b %d %H:%M:%S")
        dt = dt.replace(year=2020)

    else:
        # formatting for lines from the apache2 log files
        ip_address = line[0]
        timestamp = line[3][1:]
        desc = " ".join(line[5:])
        dt = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S") - timedelta(hours=7)

    # remove any ',' in the description. It'll interfere with anything trying to
    # read a csv
    desc = desc.replace(',','')

    print(f"{ip_address},{dt},{desc}")

Run this with $ python format_data.py > anomalies.csv to create another file called anomalies.csv with all our formatted data.


freegeoip.app

freegeoip.app provides a free IP geolocation API. Given an IP address, it can return the country code, country name, region code, region name, city, zip code, time zone, metro code, and the longitude and latitude. Some of the info besides the coordinates was hit-or-miss, so for now, I only saved the longitude/latitude and the country name. If needed, we can find a geocoding API to use later. With freegeoip.app, we also get 15,000 requests per hour without having to sign up for anything.
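
As a quick sanity check, a one-off lookup with Python's requests package looks roughly like the following sketch; 8.8.8.8 is just a placeholder address, and the fields are the ones listed above.

import requests

# single lookup against the same endpoint we'll use for the full data set
response = requests.get("https://freegeoip.app/json/8.8.8.8")
data = response.json()
print(data["country_name"], data["latitude"], data["longitude"])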

To use their API, we're going to make use of Python's requests package. We're also going to use concurrent.futures to help launch some parallel tasks asynchronously. Specifically, we'll rely on the ThreadPoolExecutor, whose map function makes it easy to apply a function to an iterable, and which uses threads as opposed to processes. For our task, the bottleneck will be I/O, since we're mostly waiting for the API's servers to send the info back; threads are good enough to make use of that waiting time. Heavier code that makes full use of your CPU would warrant multiple processes.
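
To make the thread-versus-process point concrete, here's a toy sketch (the sleep simply stands in for network latency): both executor types expose the same map interface, so swapping one for the other is a one-line change.

import concurrent.futures
import time

def fake_request(n):
    time.sleep(0.1)   # stands in for waiting on an API response
    return n

# threads: fine for I/O-bound work like ours
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fake_request, range(16)))

# for CPU-bound work, ProcessPoolExecutor accepts the same calls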

We start by reading in the data we just formatted.

with open('anomalies.csv', 'r') as input_file:
    lines = input_file.read().split("\n")[:-1]

For our input_file, we're reading in the whole thing and splitting it on each newline (i.e. "\n"). Because the file ends with a newline, the split leaves an empty string as the final item, so we only keep everything up until that point, hence the [:-1].
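
An equivalent alternative (not what we use above, just worth knowing) is splitlines, which drops that trailing empty element for you:

with open('anomalies.csv', 'r') as input_file:
    lines = input_file.read().splitlines()   # no empty string at the end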

We’re now going to set up our GET request by declaring our headers and the base url.

url = "https://freegeoip.app/json"

headers = {
	"accept":"application/json",
	"content-type":"application/json"
	}

There are built-in functions within concurrent.futures that help maintain tighter control over your workflow; however, we're going to take the easy way out: make a list, append each entry to it, and print everything once all the requests have been made and answered. Originally, when I printed from inside the threads, requests answered at the same time would end up as multiple entries on one line, messing up the general format of our data file. We could either try to control the flow of data or create yet another script to fix the formatting afterwards; collecting everything in a list first avoids both.
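
For reference, that tighter-control route would use something like as_completed, which hands you each future as its request finishes. A minimal sketch, with a placeholder lookup function and placeholder addresses:

import concurrent.futures

def lookup(ip):
    # placeholder for the real GET request
    return f"{ip},placeholder"

ips = ["1.2.3.4", "5.6.7.8"]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(lookup, ip) for ip in ips]
    for future in concurrent.futures.as_completed(futures):
        print(future.result())   # handle each result as soon as it arrives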

After this, the next step will be to read everything into a pandas data frame, so I want to start the list with a header row to set it up correctly. You could also just as easily forget the header row and declare header=None when reading in the data.

temp = ["ip_address,date,description,country,latitude,longitude"]
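
If you do skip the header row, the eventual pandas read would just need header=None and, optionally, explicit column names; roughly:

import pandas as pd

# hypothetical alternative: the csv has no header row of its own
df = pd.read_csv('geo_ips.csv', header=None,
                 names=['ip_address', 'date', 'description', 'country', 'latitude', 'longitude'])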

Onto that temp list, I'll append rows of the data we're fetching. All of this happens within our thread_function, which will be mapped over an iterable; in our case, that iterable is each of the lines of data we read in.

def thread_function(line):

	info = line.split(",")
	ip = info[0]

	response = requests.request("GET", f"{url}/{ip}", headers=headers)
	response = json.loads(response.text)

	temp.append(f"{line},{response['country_name']},{response['latitude']},{response['longitude']}")

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
	executor.map(thread_function, lines)

max_workers has been set to 8; however, since these are threads, more could be supported. 8 was an arbitrary number based on my CPU core count, and each core can support multiple threads. In most cases, some testing is needed to come up with an optimal number.
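
One rough way to do that testing is to time a fixed batch of tasks at a few different worker counts. The numbers below are arbitrary, and the sleep stands in for a real request:

import concurrent.futures
import time

def task(n):
    time.sleep(0.05)   # stand-in for one API round trip
    return n

for workers in (4, 8, 16, 32):
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(task, range(100)))
    print(f"{workers} workers: {time.perf_counter() - start:.2f}s")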

Then we print it all out.

[print(t) for t in temp]

In total, it should look like the following:

#!/usr/bin/env python3

import concurrent.futures
import json

import requests

with open('anomalies.csv', 'r') as input_file:
    lines = input_file.read().split("\n")[:-1]

url = "https://freegeoip.app/json"

headers = {
    "accept": "application/json",
    "content-type": "application/json"
}

# header row for the csv we're building
temp = ["ip_address,date,description,country,latitude,longitude"]

def thread_function(line):

    info = line.split(",")
    ip = info[0]

    response = requests.request("GET", f"{url}/{ip}", headers=headers)
    response = json.loads(response.text)

    temp.append(f"{line},{response['country_name']},{response['latitude']},{response['longitude']}")

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(thread_function, lines)

[print(t) for t in temp]

Run this script and pipe everything into a new file.

$ python ip_info.py > geo_ips.csv

This should only take a few minutes since we've parallelized the brunt of our tasks; run serially, with each request waiting on the previous one, it could easily take several times as long. Also, note that so far we've been writing each stage of our data into a separate file.


Plotting Histograms

For the visualizations, as mentioned before, I'll be using Bokeh. We'll start with a basic histogram; I'm curious to see where most of these attempts are coming from.

With Bokeh, each of its elements is broken up into its own module, so when you take a look at the import statements, it can become a little daunting.

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.plotting import figure
from bokeh.palettes import cividis
from bokeh.transform import factor_cmap
from bokeh.resources import CDN
from bokeh.embed import file_html
import pandas as pd

output_file and show deal with what Bokeh does with the plots you create, as is evident from the io module. output_file determines the name of the .html file that gets created, and show will launch it in your browser. There is also a function called save that will just write out the .html file without opening it. That does something similar to file_html in the last line, although file_html offers outlets to include a lot more customization and templates in your output.

ColumnDataSource is a nice way to share column data with plots; conveniently, it can also take in a pandas DataFrame directly. HoverTool displays data when you hover your cursor over a glyph, a bar, or some other plot object.

figure is the cornerstone in that it is the object to which we add the titles, the toolbars, the colors, and the plots.

cividis is a color palette Bokeh uses to easily map colors to multiple objects. You could just as well define colors for each of your objects individually; however, this is a much easier way, and factor_cmap is the machine that does the mapping.

CDN links the Content Delivery Network from which Bokeh pulls the minified BokehJS and CSS files to place in your .html file.

Last, but not least, we have pandas.

We have the imports. Now, we’re going to name our output file and read in our data.

# name the output file
output_file("histogram.html")

# read in the data table
df = pd.read_csv('geo_ips.csv')

Next, we’ll format that data appropriately so we can use ColumnDataSource.

# format columns for `ColumnDataSource`
hist_df = df.drop_duplicates(subset="ip_address")
country_count = pd.DataFrame({"count":hist_df["country"].value_counts()})
country_count["country"] = country_count.index
source = ColumnDataSource(country_count)

The first thing we did is drop the duplicate IP addresses. A number of them made multiple attempts, which could have differed by protocol, credentials, port, endpoint, etc. The point is they all came from one address, and all I'm trying to plot at this point is the different sources. We then create another dataframe with each country and its count, add a country column copied from the index, and finally wrap it in a ColumnDataSource for Bokeh.

# format the text when your cursor floats within the plot above your bar
hover = HoverTool()
hover.tooltips = [("Different IPs","@count (@country)")]
hover.mode = 'vline'

This implements the hover tool I mentioned earlier. The text hovering above each of the plot elements will show the number of different IPs and a reiteration of the country of origin. Setting the mode to vline makes the tooltip appear whenever the cursor is vertically in line with a bar.

color_map = factor_cmap(field_name="country", palette=cividis(len(country_count.index)),
        factors=country_count.index.tolist())

factor_cmap extends a color scheme to all of our plot elements. In this case, we separate, or factor, them according to the country names.

# creating/formatting the histogram
p = figure(x_range=country_count.index.tolist(),
           y_axis_label="Number of Unique IPs",
           x_axis_label="Country",
           plot_height=600,
           plot_width=800,
           toolbar_location=None)
p.add_tools(hover)
p.xaxis.major_label_orientation = "vertical"
p.xgrid.visible = False
p.background_fill_color = "gray"
p.background_fill_alpha = 0.1

p.toolbar.active_drag = None
p.toolbar.active_scroll = None
p.toolbar.active_tap = None

# format main title
p.title.text = "Unauthorized Access Attempts"
p.title.align = "center"
p.title.text_font_size = "18px"

# add the vertical bars for the histogram
p.vbar(x='country', top='count', source=source, width=0.6, color=color_map)

The first block has to do with creating and customizing our plot area. I've also added our hover tool and disabled a few others, namely drag, scroll, and tap. After that comes the main title, and then the bars themselves. The last thing to do is render it.

show(p)
html = file_html(p, CDN, "Histogram")
with open("histogram.html", 'w+') as output_file:
    output_file.write(html)

show(p) is optional depending on your needs; it just automatically brings up your rendered .html file in a browser. An .html file is also created with file_html, and the contents are written to histogram.html. And that is it. This is the file we can embed in our own .html pages or wherever.
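
If you don't need the extra templating that file_html offers, the save function mentioned earlier collapses those last three lines into one call; roughly:

from bokeh.io import save

# assumes `p` is the figure built above
save(p, filename="histogram.html", title="Histogram")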

Our entire script looks something like the following…

#!/usr/bin/env python3

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.plotting import figure
from bokeh.palettes import cividis
from bokeh.transform import factor_cmap
from bokeh.resources import CDN
from bokeh.embed import file_html
import pandas as pd

# name the output file
output_file("histogram.html")

# read in the data table
df = pd.read_csv('geo_ips.csv')

# format columns for `ColumnDataSource`
hist_df = df.drop_duplicates(subset="ip_address")
country_count = pd.DataFrame({"count": hist_df["country"].value_counts()})
country_count["country"] = country_count.index
source = ColumnDataSource(country_count)

# format the text when your cursor floats within the plot above your bar
hover = HoverTool()
hover.tooltips = [("Different IPs", "@count (@country)")]
hover.mode = 'vline'

color_map = factor_cmap(field_name="country", palette=cividis(len(country_count.index)),
        factors=country_count.index.tolist())

# creating/formatting the histogram
p = figure(x_range=country_count.index.tolist(),
           y_axis_label="Number of Unique IPs",
           x_axis_label="Country",
           plot_height=600,
           plot_width=800,
           toolbar_location=None)
p.add_tools(hover)
p.xaxis.major_label_orientation = "vertical"
p.xgrid.visible = False
p.background_fill_color = "gray"
p.background_fill_alpha = 0.1

p.toolbar.active_drag = None
p.toolbar.active_scroll = None
p.toolbar.active_tap = None

# format main title
p.title.text = "Unauthorized Access Attempts"
p.title.align = "center"
p.title.text_font_size = "18px"

# add the vertical bars for the histogram
p.vbar(x='country', top='count', source=source, width=0.6, color=color_map)

show(p)
html = file_html(p, CDN, "Histogram")
with open("histogram.html", 'w+') as output_file:
    output_file.write(html)

…with the following as the resulting plot:


Plotting Maps

Now, I want a map of where all these attempts are coming from. Same as above, there’s a lot going on in the import block; however, it’ll make sense when we break it down.

from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.plotting import figure
from bokeh.tile_providers import get_provider
import pandas as pd
import numpy as np

Most of these are actually the same as for our histogram plot. The only differences are get_provider and numpy. Mapping graphics use what are known as tiles, which are square bitmap graphics organized in a grid to represent a map. Bokeh has ways to import your own sources; however, we just went with the OpenStreetMap tiles it includes. numpy is a math library for Python; I used it to convert the longitudes and latitudes into Mercator coordinates.

We’ll start the same way and name our output file, read in the data, and pre-process to feed into ColumnDataSource.

# name the output file
output_file("map.html")

# read in the data
df = pd.read_csv('geo_ips.csv')

# create subset of the data
df = df.drop_duplicates(subset="ip_address")[['ip_address', 'country', 'latitude', 'longitude']]
# add mercator coordinates
k = 6378137
df['x'] = df['longitude'] * (k * np.pi/180)
df['y'] = np.log(np.tan((90 + df['latitude']) * np.pi/360)) * k
source = ColumnDataSource(df)

This is all our information and what we’ll use to plot our glyphs, which is what Bokeh calls all its little shapes. We now need to pull the tiles from a provider and create the default starting view of our map.

# grab tiles from provider - OpenStreetMap
tile_provider = get_provider('OSM')

# create the range for the default view of the map
diff = (max(df['x']) - min(df['x']))*0.1
x_range,y_range = ((min(df['x'])-diff,max(df['x'])+diff), (min(df['y']),max(df['y'])))

The rest is similar to the histogram plot. We create and format the hover tool, then create and format our figure; to that figure we add our glyphs, which in this case are filled-in circles. I just used the .html file generated from output_file this time around instead of going through file_html. It's more than enough, but remember that file_html offers more customization if you have the need for it.
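
The hover, figure, and glyph code isn't reproduced here (the full script is in the files linked at the top), but a minimal sketch of those final steps, continuing from the snippets above, might look like the following. The tooltip fields, glyph size, and alpha are placeholder choices, not the exact settings behind the plot.

# sketch of the remaining steps; tooltip fields, sizes, and colors are placeholders
hover = HoverTool()
hover.tooltips = [("Country", "@country"), ("IP", "@ip_address")]

p = figure(x_range=x_range, y_range=y_range,
           x_axis_type="mercator", y_axis_type="mercator",
           plot_height=600, plot_width=800)
p.add_tile(tile_provider)
p.add_tools(hover)

# one filled-in circle per unique IP address
p.circle(x='x', y='y', source=source, size=8, fill_alpha=0.6)

show(p)

…with the following as the resulting plot: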


Results

Mind you, this is based on unique addresses, meaning I did not take into account multiple attempts from a single address. That can skew what you take away from these plots; a DDoS, for instance, would never have been detected with these methods.

I was surprised to see such a difference in the US. I didn't expect it to be that much higher than other countries, and the attempts are mostly from San Francisco. Interpret that as you may. Other hot spots look like Europe, Colombia, China, and India. Maybe it's also partly due to bots surveying only nearby IPs? I wouldn't know unless I opened up a server overseas. I'm also surprised to see attempts from 201.238.155.56, 80.82.71.118, and 80.82.70.178, which are all from separate islands in the middle of the sea. Even though attempts can be made from these remote locations, I'm still missing points as you move closer to the poles.

The next thing I would like to look into is the actual owners of these IPs, but that's for another post.