Detecting anomalies just by looking

Boxplots

As mentioned in the earlier sections, how anomalies appear in a dataset depends directly on how the data points themselves are generated.
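
A boxplot makes such deviations immediately visible: values falling beyond the whiskers are drawn as individual points. Here is a minimal sketch, assuming (as in the earlier sections) that the data lives in a pandas DataFrame named salary_df:

import matplotlib.pyplot as plt

# Boxplot of the salaries; outliers appear as lone points
# beyond the whiskers
salary_df.boxplot(column='Salary (in USD)')
plt.show()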

Histograms

In the histogram above, too, we can see one particular bin that is just not right: it deviates hugely from the rest of the data (a phrase repeated intentionally to emphasize the deviation). Looking at the y-axis, we can also infer that only two employees have these distorted salaries.
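
For reference, a histogram like the one just described can be produced with matplotlib; a sketch under the same salary_df assumption (the bin count is illustrative):

import matplotlib.pyplot as plt

# Distribution of salaries; an isolated bin far from the rest
# hints at anomalous values
plt.hist(salary_df['Salary (in USD)'], bins=30)
plt.xlabel('Salary (in USD)')
plt.ylabel('Number of employees')
plt.show()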

So what might be an immediate way to confirm that the dataset contains anomalies? Let's take a look at the minimum and maximum values of the column Salary (in USD).

  • Minimum: 17

  • Maximum: 2498
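
These values can be read off with pandas one-liners:

# Quick range check on the salary column
print(salary_df['Salary (in USD)'].min())  # 17
print(salary_df['Salary (in USD)'].max())  # 2498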

Look at the minimum value. Suppose the accounts department of this hypothetical organization tells you that the minimum salary of an employee there is $1,000, yet the data says otherwise. It is therefore fair to conclude that this value is indeed an anomaly. Let's now look at the data from a perspective other than simply plotting it.

Our dataset contains only one feature with anomalies (namely Salary (in USD)), but in reality there can be many features containing anomalies. Even then, these little visualizations will help you a lot.

Clustering-based approach for anomaly detection

We have seen how clustering and anomaly detection are closely related even though they serve different purposes; clustering can nevertheless be used for anomaly detection. In this approach, we start by grouping similar objects together. Mathematically, this similarity is measured by distance functions such as Euclidean distance, Manhattan distance, and so on, with Euclidean distance being a very popular choice. Let's take a quick look at what Euclidean distance is all about.

An extremely short note on Euclidean distance
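
For two n-dimensional points p = (p1, ..., pn) and q = (q1, ..., qn), the Euclidean distance is the square root of the sum of the squared coordinate differences: d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2). In code, a minimal sketch (the points are made up for illustration):

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Square root of the sum of squared differences
d = np.sqrt(np.sum((p - q) ** 2))  # equivalent to np.linalg.norm(p - q)
print(d)  # 5.0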

We are going to use K-Means clustering to group the data points (salary values in our case). The K-Means implementation we will use relies on Euclidean distance internally. Let's get started.

# Convert the salary values to a numpy array
salary_raw = salary_df['Salary (in USD)'].values

# Reshape to a column vector for compatibility with the SciPy implementation
salary_raw = salary_raw.reshape(-1, 1)
salary_raw = salary_raw.astype('float64')

We will now import the kmeans function from scipy.cluster.vq. SciPy stands for Scientific Python and provides a variety of convenient utilities for scientific computing; see its documentation for more details. We will then apply K-Means to salary_raw.

from scipy.cluster.vq import vq, kmeans

# Supply the data and the number of clusters to kmeans()
codebook, distortion = kmeans(salary_raw, 4)

In the above chunk of code, we fed the salary data points to kmeans() and specified the number of clusters into which we want to group them. codebook contains the centroids generated by kmeans(), and distortion is the average Euclidean distance between the data points and those centroids.
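
As a quick sanity check (the exact numbers depend on the data and on K-Means' random initialization):

# One row per centroid: 4 centroids, each one-dimensional
print(codebook.shape)  # (4, 1)
print(distortion)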

Let's now assign each data point to a group by calling the vq() function. It takes:

  • The data points

  • The centroids generated by the clustering algorithm (kmeans() in our case)

It then returns the group assignment for each data point and the distance between each observation and its nearest centroid.
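
A minimal call looks like this (the names groups and cdist are illustrative; counting the group sizes afterwards is just one simple way to spot a suspiciously small cluster):

import numpy as np

# Map each salary to the index of its nearest centroid and get
# the distance from each observation to that centroid
groups, cdist = vq(salary_raw, codebook)

# A disproportionately small group is a candidate set of anomalies
print(np.bincount(groups))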

The above method for anomaly detection is purely unsupervised in nature. If we had the class labels of the data points, we could have easily converted this to a supervised learning problem, specifically a classification problem.
