Grouping and clustering with AI
Cluster analysis
Clustering is a statistical technique, often carried out with the help of machine learning
It separates a larger data set into groups (or clusters) of similar data points based on selected characteristics
After clustering is completed, an analyst will look at the results and determine whether the clusters are meaningful
With clustering, a multi-dimensional picture can be built from a collection of data points
Each dimension represents a feature of the data
By examining groups of points in different dimensions, threat hunters can find useful correlations
This can be a starting point for further investigation
Algorithms
A variety of different clustering algorithms exist:
K-Means
The algorithm partitions points into K groups, assigning each point to the group with the nearest centroid
Fuzzy K-Means
Each point is assigned a percentage membership to each of the K groups
K-Means++
A variant of K-Means that picks its initial centroids far apart from one another, which usually gives better and more consistent results than random seeding
DBSCAN
Classifies points into groups, where the number of groups is based upon point density
Different clustering algorithms can produce very different results on the same data
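As a rough illustration of the K-Means idea above, here is a minimal pure-Python sketch of the assign-and-update loop. The sample points, the feature names in the comment, and the choice to seed from the first K points are assumptions made for this example only; a real analysis would more likely use a library implementation.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    # Naive seeding from the first k points (an assumption made to keep the
    # example deterministic; this is not the spread-out seeding of K-Means++).
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical two-feature data, e.g. (kilobytes sent, duration in seconds).
points = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (8.0, 8.1), (8.2, 7.9), (7.9, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The first three points end up in one group and the last three in the other. Whether that split means anything is still for the analyst to decide.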
Pros and cons
Clustering algorithms have their pros and cons:
Pros
No training set required
Ability to condense data to a manageable size
Cons
No interpretation provided
No guarantee of finding useful correlations
Applications
Cluster analysis is useful for outlier detection
Clustering the size of packets on the network may be useful, since abnormally large or small packets may be signs of an attack
Clustering ports may be less useful, since client ports are typically picked at random from the high-numbered (ephemeral) range
Two completely different applications can use the same port at different times
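The packet-size idea can be sketched with a very simple one-dimensional, density-style grouping: sort the sizes, start a new cluster wherever the gap to the previous value exceeds a threshold, and treat clusters with too few members as outliers. This is a simplification written for illustration, not DBSCAN itself; the function name, the sample sizes, and the eps and min_points values are all assumptions.

```python
def density_clusters_1d(values, eps, min_points):
    """Split sorted values into runs whose neighbours lie within eps of each other.

    Runs with fewer than min_points members are reported as outliers.
    """
    if not values:
        return [], []
    ordered = sorted(values)
    runs, current = [], [ordered[0]]
    for value in ordered[1:]:
        if value - current[-1] <= eps:
            current.append(value)      # close enough: same run
        else:
            runs.append(current)       # gap too large: start a new run
            current = [value]
    runs.append(current)
    clusters = [run for run in runs if len(run) >= min_points]
    outliers = [v for run in runs if len(run) < min_points for v in run]
    return clusters, outliers

# Hypothetical packet sizes in bytes.
packet_sizes = [60, 64, 66, 70, 1400, 1420, 1440, 1460, 1500, 9000, 400]
clusters, outliers = density_clusters_1d(packet_sizes, eps=50, min_points=3)
print(clusters)
print(outliers)
```

With these sample values the small and near-MTU packets form two dense clusters, while 400 and 9000 fall outside both and would be the ones to look at first.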
Grouping
Grouping involves taking a set of unique artifacts and identifying when several of them appear together, based on specific criteria
The key to any grouping technique is the criteria used, such as events occurring within a specific time window
Artifacts that are grouped together may represent a tool or an attacker TTP
This technique works best when there are multiple, related instances of unique artifacts, and you are looking to isolate artifacts based on specific criteria
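A minimal sketch of grouping by a time-window criterion might look like the following. The event tuples, host names, process names, and the 60-second window are hypothetical; the only point is to show artifacts being paired when they occur on the same host within the window.

```python
from collections import Counter
from itertools import combinations

def co_occurring_pairs(events, window_seconds):
    """Count artifact pairs seen on the same host within window_seconds.

    events: iterable of (timestamp_seconds, host, artifact) tuples.
    Each pair is counted at most once per host.
    """
    by_host = {}
    for timestamp, host, artifact in events:
        by_host.setdefault(host, []).append((timestamp, artifact))

    pair_counts = Counter()
    for host_events in by_host.values():
        seen = set()
        for (t1, a1), (t2, a2) in combinations(sorted(host_events), 2):
            if a1 != a2 and abs(t2 - t1) <= window_seconds:
                seen.add(tuple(sorted((a1, a2))))
        pair_counts.update(seen)
    return pair_counts

# Hypothetical process-execution events.
events = [
    (100, "host-a", "whoami.exe"),
    (110, "host-a", "net.exe"),
    (130, "host-a", "psexec.exe"),
    (5000, "host-a", "notepad.exe"),
    (200, "host-b", "whoami.exe"),
    (215, "host-b", "net.exe"),
    (9000, "host-b", "psexec.exe"),
]
pair_counts = co_occurring_pairs(events, window_seconds=60)
for pair, hosts in pair_counts.most_common():
    print(pair, hosts)
```

Pairs that recur across several hosts, such as net.exe with whoami.exe in this sample, are the groups most worth a closer look.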
Grouping vs. clustering
Grouping is the next logical step after clustering is performed
The major difference between grouping and clustering is that in grouping your input is an explicit set of items that are already of interest
Clustering separates similar data points that can be further narrowed down by grouping, allowing the threat hunter to isolate data that is potentially of interest
Once potentially suspicious data has been identified, grouping works to eliminate benign data and generate more useful groups of suspicious data
Stack counting
Stack counting, also known as "stacking", involves counting the number of occurrences of values of a particular type and analyzing outliers of those results
It is a way of finding the needle in a haystack of data, and one of the most common techniques hunters use to test a hypothesis
Data stacking isolates and classifies patterns by applying frequency analysis to large quantities of related data
In a large data set, the investigator identifies the characteristics that set the odd rows apart, which may indicate that they are malicious
Identifying similar or equal values across multiple artifacts may reveal useful correlations for threat hunting
Applications
Stack counting can be applied to any case where anomalies are significant
It works best when most data fits into a range, set of ranges or set of common values
Identifying that 99% of traffic uses HTTP(S), DNS or SMTP means that the remaining 1% of traffic deserves further investigation
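The protocol example can be sketched in a few lines with Python's collections.Counter. The traffic mix and the 1% threshold below are made-up values for illustration.

```python
from collections import Counter

def stack(values, rare_fraction=0.01):
    """Count occurrences and return (stacked, rare).

    stacked: (value, count) pairs sorted from least to most frequent.
    rare: values whose share of the total is at most rare_fraction.
    """
    counts = Counter(values)
    total = sum(counts.values())
    stacked = sorted(counts.items(), key=lambda item: item[1])
    rare = [value for value, count in stacked if count / total <= rare_fraction]
    return stacked, rare

# Hypothetical protocol labels, one per observed flow.
protocols = ["HTTPS"] * 700 + ["DNS"] * 250 + ["SMTP"] * 45 + ["IRC"] * 4 + ["TFTP"] * 1
stacked, rare = stack(protocols)
for value, count in stacked:
    print(f"{count:5d}  {value}")
print("investigate:", rare)
```

Sorting from the rarest value upward puts the candidates for investigation at the top of the output.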
Feature selection
When performing stack counting, the choice of features to stack is important
Features where every value is expected to be unique won't help with detecting outliers
Any anomalies that are detected need to be meaningful and useful
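One rough way to screen features before stacking is to check how many distinct values each one has relative to the number of records. The sketch below is only a heuristic; the field names, the sample records, and the 0.5 cut-off are assumptions for the example.

```python
def uniqueness_ratio(values):
    """Fraction of distinct values; 1.0 means every value is unique."""
    return len(set(values)) / len(values) if values else 0.0

# Hypothetical connection records.
records = [
    {"session_id": "s1", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "s2", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "s3", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "s4", "user_agent": "Mozilla/5.0", "dst_port": 80},
    {"session_id": "s5", "user_agent": "curl/7.68.0", "dst_port": 443},
    {"session_id": "s6", "user_agent": "Mozilla/5.0", "dst_port": 443},
]

ratios = {
    field: uniqueness_ratio([record[field] for record in records])
    for field in records[0]
}
# Arbitrary cut-off: fields that are mostly unique make poor stacking candidates.
stackable = [field for field, ratio in ratios.items() if ratio < 0.5]
print(ratios)
print(stackable)
```

Here session_id is unique per record and is filtered out, while user_agent and dst_port repeat often enough for a rare value to stand out.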
Tools
Stack counting can be automated using common tools
Excel has built-in functionality to help with data analysis
Sorting columns to expose anomalies
Cell highlighting rules
VBA scripting for custom analysis
Languages like Python and R have lots of data analysis functionality
grep and awk can be used on the command line to quickly analyze log files
Pivot tables in Microsoft Excel, the stats command of Splunk and the top command of ArcSight are all examples of stacking