Detecting anomalies just by looking
As mentioned in the earlier sections, how anomalies arise in a dataset depends directly on how the data points themselves are generated.
In the above histogram plot, too, we can see one particular bin that is just not right, as it deviates hugely from the rest of the data (a phrase repeated intentionally to emphasize the deviation). We can also infer that the salaries seem to be distorted for only two employees (look at the y-axis).
So what might be an immediate way to confirm that the dataset contains anomalies? Let's take a look at the minimum and maximum values of the column Salary (in USD).
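This check is a one-liner with pandas. A minimal sketch, assuming the data was loaded earlier into a DataFrame named df (a hypothetical name) with the column Salary (in USD):

```python
import pandas as pd

# df is assumed to be the DataFrame holding the employee records,
# e.g. df = pd.read_csv('salaries.csv')  # hypothetical file name
salary = df['Salary (in USD)']

print('Minimum:', salary.min())
print('Maximum:', salary.max())
```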
Minimum: 17
Maximum: 2498
Look at the minimum value. Suppose the accounts department of this hypothetical organization tells you that the minimum salary of any employee there is $1,000. The data says otherwise, so it is safe to conclude that this value is indeed an anomaly. Let's now look at the data from a perspective other than simply plotting it.
Although our dataset contains only one feature (i.e., Salary (in USD)) that contains anomalies, in reality there can be many features with anomalies in them. Even then, these little visualizations will help you a lot.
We have seen how clustering and anomaly detection are closely related, even though they serve different purposes; clustering can nevertheless be used for anomaly detection. In this approach, we start by grouping similar objects together. Mathematically, this similarity is measured by distance functions like Euclidean distance, Manhattan distance, and so on. Euclidean distance is a very popular choice among these. Let's take a look at what Euclidean distance is all about.
If there are $n$ points in a two-dimensional space (refer to the following figure) and their coordinates are denoted by $(x_i, y_i)$, then the Euclidean distance between any two points $(x_1, y_1)$ and $(x_2, y_2)$ in this space is given by:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
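To make the formula concrete, here is a tiny NumPy sketch with two made-up points:

```python
import numpy as np

# Two made-up points in a two-dimensional space
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# sqrt((x1 - x2)^2 + (y1 - y2)^2)
distance = np.sqrt(np.sum((p1 - p2) ** 2))
print(distance)  # 5.0
```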
We are going to use K-Means clustering, which will help us cluster the data points (salary values in our case). The implementation we are going to use, kmeans, uses Euclidean distance internally. Let's get started.
We will now import the kmeans function from the scipy.cluster.vq module. SciPy stands for Scientific Python and provides a variety of convenient utilities for scientific computing. Follow its documentation here. We will then apply kmeans to salary_raw.
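A minimal sketch of this step, assuming salary_raw is a 1-D NumPy array of float salary values; the choice of 4 clusters here is illustrative, not prescribed by the text:

```python
from scipy.cluster.vq import kmeans

# salary_raw is assumed to be a 1-D float64 array of salary values,
# e.g. salary_raw = df['Salary (in USD)'].values.astype('float64')

# Group the salaries into 4 clusters (an illustrative choice).
# kmeans() returns the centroids (codebook) and the average Euclidean
# distance between the observations and their nearest centroid (distortion).
codebook, distortion = kmeans(salary_raw, 4)
```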
In the above chunk of code, we fed the salary data points to kmeans(). We also specified the number of clusters into which we want to group the data points. codebook holds the centroids generated by kmeans(), and distortion is the average Euclidean distance between the data points fed in and the centroids generated by kmeans().
Let's assign the data points to groups by calling the vq() function. It takes:
- The data points
- The codebook of centroids as generated by the clustering algorithm (kmeans() in our case)

It then returns the group of each data point and the distance between each observation and its nearest centroid.
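Continuing the sketch with the same assumed variable names:

```python
from scipy.cluster.vq import vq

# Assign each salary value to its nearest centroid. vq() returns the
# group (cluster index) of each observation and the distance from each
# observation to its nearest centroid.
groups, cdist = vq(salary_raw, codebook)
```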
The above method for anomaly detection is purely unsupervised in nature. If we had the class labels of the data points, we could have easily converted this to a supervised learning problem, specifically a classification problem.