Grouping and clustering with AI

Cluster analysis

Clustering is a statistical technique, often carried out with the help of machine learning

  • It consists of separating a larger data set into groups (or clusters) of similar data points, based on certain characteristics

  • After clustering is completed, an analyst will look at the results and determine whether the clusters are meaningful

  • Using clustering, a multi-dimensional picture can be built from a collection of data points

  • Each dimension represents a feature of the data

  • By examining groups of points in different dimensions, threat hunters can find useful correlations

    • This can be a starting point for further investigation

Algorithms

A variety of different clustering algorithms exist:

  • K-Means

    • The algorithm assigns each point to exactly one of K groups

  • Fuzzy K-Means

    • Each point is assigned a percentage membership to each of the K groups

  • K-Means++

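    • K-Means with a smarter initialization step: the initial centroids are chosen to be well separated, which tends to give better and more consistent results than random initialization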

  • DBSCAN

    • Classifies points into groups based on point density; the number of groups is not fixed in advance, and points in sparse regions are labeled as noise

Using different clustering algorithms on the same data can produce very different results
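As a rough illustration of the plain K-Means entry above, the sketch below implements the classic assign-then-update loop in pure Python. The feature names, the sample points and the choice to pass the initial centroids in by hand are assumptions made for the example; a real analysis would more likely use a library implementation and a smarter initialization.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical features per connection: (duration in seconds, kilobytes sent).
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The result depends on K and on the starting centroids, which is one reason different algorithms and settings can disagree on the same data.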

Pros and cons

Clustering algorithms have their pros and cons:

Pros

  • No training set required

  • Ability to condense data to a manageable size

Cons

  • No interpretation provided

  • No guarantee of finding useful correlations

Applications

Cluster analysis is useful for outlier detection

Clustering the size of packets on the network may be useful, since abnormally large or small packets may be signs of an attack

Clustering ports may be less useful since client ports are randomly selected from high-number ports

Two completely different applications can use the same port at different times
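The packet-size idea can be sketched with a very simple density-style clustering in one dimension. This is not DBSCAN itself, only a simplified stand-in: sorted values are split wherever the gap to the next value exceeds `eps`, and any group smaller than `min_pts` is treated as an outlier. The function name, both parameters and the sample sizes are assumptions chosen for illustration.

```python
def cluster_1d(values, eps, min_pts):
    """Split sorted values into clusters wherever the gap between neighbours
    exceeds eps; clusters smaller than min_pts are reported as outliers."""
    ordered = sorted(values)
    if not ordered:
        return [], []
    clusters, current = [], [ordered[0]]
    for value in ordered[1:]:
        if value - current[-1] <= eps:
            current.append(value)      # close enough: same cluster
        else:
            clusters.append(current)   # gap too large: start a new cluster
            current = [value]
    clusters.append(current)
    dense = [c for c in clusters if len(c) >= min_pts]
    outliers = [v for c in clusters if len(c) < min_pts for v in c]
    return dense, outliers

# Hypothetical packet sizes in bytes.
packet_sizes = [60, 64, 66, 70, 1400, 1420, 1450, 1460, 1500, 9000]
dense, outliers = cluster_1d(packet_sizes, eps=100, min_pts=3)
print(outliers)
```

With these sample values the small and near-MTU packets form two dense clusters, and the single 9000-byte packet is left over as a candidate for investigation.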

Grouping

Grouping involves taking a set of unique artifacts and identifying when several of them appear together, based on specific criteria

  • The key to any grouping technique is the criteria used, such as events occurring within a specific time window

  • Artifacts that are grouped together may potentially represent a tool or an attacker TTP

  • This technique works best when there are multiple, related instances of unique artifacts, and you are looking to isolate artifacts based on specific criteria
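A minimal sketch of time-window grouping, assuming events arrive as `(timestamp, host, artifact)` tuples and that the set of artifacts of interest is already known. The event layout, host names, process names and the fixed-bucket windowing are all hypothetical choices for the example. Fixed buckets are a simplification: two events that straddle a bucket boundary land in different windows.

```python
from collections import defaultdict

def group_by_window(events, artifacts_of_interest, window_seconds, min_distinct=2):
    """Return (host, window_start, artifacts) for every host and time bucket
    in which at least min_distinct artifacts of interest were seen."""
    buckets = defaultdict(set)
    for timestamp, host, artifact in events:
        if artifact in artifacts_of_interest:
            # Round the timestamp down to the start of its window.
            window_start = timestamp - (timestamp % window_seconds)
            buckets[(host, window_start)].add(artifact)
    return sorted(
        (host, start, sorted(found))
        for (host, start), found in buckets.items()
        if len(found) >= min_distinct
    )

# Hypothetical process-creation events: (epoch seconds, host, process name).
events = [
    (1000, "ws-01", "whoami.exe"),
    (1020, "ws-01", "net.exe"),
    (1045, "ws-01", "nltest.exe"),
    (1010, "ws-01", "chrome.exe"),
    (1030, "ws-02", "net.exe"),
    (5000, "ws-03", "whoami.exe"),
    (9000, "ws-03", "nltest.exe"),
]
interesting = {"whoami.exe", "net.exe", "nltest.exe"}
groups = group_by_window(events, interesting, window_seconds=300)
print(groups)
```

Only ws-01 is reported here, because it is the only host where several of the listed artifacts occur in the same five-minute window. Isolated occurrences on the other hosts are dropped.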

Grouping vs. clustering

Grouping is the next logical step after clustering is performed

The major difference between grouping and clustering is that in grouping your input is an explicit set of items that are already of interest

Clustering separates similar data points that can be further narrowed down by grouping, allowing the threat hunter to isolate data that is potentially of interest

Once potentially suspicious data has been identified, grouping works to eliminate benign data and generate more useful groups of suspicious data

Stack counting

Stack counting, also known as "stacking", involves counting the number of occurrences of values of a particular type and analyzing outliers of those results

  • It is a way of finding the needle in the haystack, and one of the most common techniques hunters use to test a hypothesis

  • Data stacking isolates and classifies patterns by applying frequency analysis to large volumes of related data

  • In the context of a large data set, the investigator identifies the characteristics that differentiate the odd data rows and may prove that they are malicious

  • Identifying similar or equal values across multiple artifacts may reveal useful correlations for threat hunting

Applications

Stack counting can be applied to any case where anomalies are significant

It works best when most data fits into a range, set of ranges or set of common values

For example, if 99% of traffic uses HTTP(S), DNS or SMTP, the remaining 1% of traffic deserves further investigation
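The protocol example can be sketched with a plain frequency count. The protocol names, the counts and the 1% cut-off per value are assumptions for illustration. In practice the threshold is a judgment call, and the input would come from flow or proxy logs.

```python
from collections import Counter

def stack(values, rare_share=0.01):
    """Count occurrences of each value and return the stack (least frequent
    first) together with the values at or below the rare_share cut-off."""
    counts = Counter(values)
    total = sum(counts.values())
    stacked = sorted(counts.items(), key=lambda item: item[1])
    rare = [value for value, count in stacked if count / total <= rare_share]
    return stacked, rare

# Hypothetical protocol field from 1000 flow records.
protocols = ["http"] * 600 + ["dns"] * 300 + ["smtp"] * 90 + ["irc"] * 7 + ["telnet"] * 3
stacked, rare = stack(protocols)
for value, count in stacked:
    print(f"{count:5d}  {value}")
```

Sorting least frequent first puts the outliers at the top of the output, which is where the hunter usually starts reading.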

Feature selection

When performing stack counting, the choice of features to stack is important

Features where every value is expected to be unique won't help with detecting outliers

Any anomalies that are detected need to be meaningful and useful
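One rough way to screen candidate features before stacking is to compare the number of distinct values to the number of records. The helper name and the sample fields below are hypothetical, and the ratio is only a heuristic: a value near 1.0 suggests that almost every record is unique, so a frequency count would have nothing to stand out against.

```python
def uniqueness_ratio(values):
    """Share of distinct values among all records (1.0 means every value is unique)."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0

# Hypothetical columns from the same five log records.
session_ids = ["a1", "b2", "c3", "d4", "e5"]
user_agents = ["Mozilla", "Mozilla", "Mozilla", "curl", "Mozilla"]

print(uniqueness_ratio(session_ids))   # every value differs: poor stacking candidate
print(uniqueness_ratio(user_agents))   # mostly repeated values: better candidate
```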

Tools

Stack counting can be automated using common tools

Excel has built-in functionality to help with data analysis

  • Sorting columns to expose anomalies

  • Cell highlighting rules

  • VBA macros for custom analysis

Languages like Python and R have lots of data analysis functionality

grep and awk can be used on the command line to quickly analyze log files

Pivot tables in Microsoft Excel, the stats command in Splunk and the top command in ArcSight are all examples of stacking
