# Grouping and clustering with AI

## Cluster analysis

**Clustering** is a statistical technique, often carried out with the help of machine learning

* It separates a larger data set into groups (or clusters) of similar data points, based on chosen characteristics
* After clustering is completed, an analyst will look at the results and determine whether the clusters are meaningful
* Using clustering, multi-dimensional pictures can be developed from a collection of data points
* Each dimension represents a feature of the data
* By examining groups of points in different dimensions, threat hunters can find useful correlations
  * This can be a starting point for further investigation

### Algorithms

**A variety of different clustering algorithms exist:**

* **K-Means**
  * The algorithm assigns each point to exactly one of K groups, where K is chosen in advance
* **Fuzzy K-Means**
  * Each point is assigned a percentage membership to each of the K groups
* **K-Means++**
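  * K-Means with a smarter initialization: the starting centroids are chosen to be spread far apart, which usually gives better and more consistent clusters than random starting points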
* **DBSCAN**
  * Classifies points into groups based on point density; the number of groups is not chosen in advance, and points in sparse regions are labeled as noise

Using different clustering algorithms on the same data can produce very different results
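
To make the K-Means idea concrete, here is a minimal sketch in plain Python, seeded in the style of K-Means++. It is an illustration under stated assumptions, not a replacement for a library implementation: the function names (`sq_dist`, `kmeans_pp_init`, `kmeans`) and the sample points are hypothetical, and the seeding step assumes the points are not all identical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids, favouring points far from those already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest chosen centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def kmeans(points, k, iterations=20, seed=0):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical two-feature data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points should share one label and the last three the other. Which group receives label `0` depends on the random seed, which is one small reason results can differ between runs and between algorithms.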

### Pros and cons

**Clustering algorithms have their pros and cons:**

#### Pros

* No training set required
* Ability to condense data to a manageable size

#### Cons

* No interpretation provided
* No guarantee of finding useful correlations

### Applications

**Cluster analysis** is useful for outlier detection

Clustering the size of packets on the network may be useful, since abnormally large or small packets may be signs of an attack

Clustering ports may be less useful, since client ports are randomly selected from the high-numbered (ephemeral) port range

Two completely different applications can use the same port at different times
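
The packet-size case can be sketched with a very simple one-dimensional clustering. Note that this is a gap-based stand-in for density-based methods such as DBSCAN, not DBSCAN itself: sorted values are split wherever the gap exceeds `eps`, and runs smaller than `min_size` are treated as outliers. The function name, thresholds and packet sizes are all hypothetical.

```python
def cluster_1d(values, eps, min_size):
    """Split sorted values into runs wherever the gap to the next value exceeds eps.

    Runs with at least min_size members are returned as clusters;
    values in smaller runs are returned as outliers.
    """
    runs, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > eps:
            runs.append(current)
            current = []
        current.append(v)
    if current:
        runs.append(current)
    clusters = [r for r in runs if len(r) >= min_size]
    outliers = [v for r in runs if len(r) < min_size for v in r]
    return clusters, outliers

# Hypothetical packet sizes in bytes: mostly small or near-MTU, plus two oddities.
packet_sizes = [60, 64, 66, 70, 1400, 1420, 1460, 1480, 1500, 9000, 4]
clusters, outliers = cluster_1d(packet_sizes, eps=50, min_size=3)
print(clusters)   # [[60, 64, 66, 70], [1400, 1420, 1460, 1480, 1500]]
print(outliers)   # [4, 9000]
```

The choice of `eps` and `min_size` drives the result, so the flagged values are only candidates for investigation, not proof of an attack.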

## Grouping

**Grouping** involves taking a set of unique artifacts and identifying when several of them appear together, based on specific criteria

* The key to grouping techniques is the criteria used, such as events occurring within a specific time window
* Artifacts that are grouped together may potentially represent a tool or an attacker TTP
* This technique works best when there are multiple, related instances of unique artifacts, and you are looking to isolate artifacts based on specific criteria
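
As a rough sketch of time-window grouping, the code below buckets artifacts by host and fixed time window, then counts which artifact pairs keep appearing together. The event format `(timestamp_seconds, host, artifact)`, the function names and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def group_by_window(events, window_seconds):
    """Bucket artifacts by (host, time window); returns {(host, bucket): set of artifacts}."""
    groups = defaultdict(set)
    for ts, host, artifact in events:
        groups[(host, ts // window_seconds)].add(artifact)
    return groups

def co_occurring_pairs(groups, min_count=2):
    """Count artifact pairs seen in the same group; keep pairs seen at least min_count times."""
    counts = defaultdict(int)
    for artifacts in groups.values():
        for pair in combinations(sorted(artifacts), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical events: (timestamp in seconds, host, artifact of interest).
events = [
    (100, "host-a", "psexec.exe"),
    (130, "host-a", "mimikatz.exe"),
    (5000, "host-a", "chrome.exe"),
    (7300, "host-b", "psexec.exe"),
    (7320, "host-b", "mimikatz.exe"),
    (9000, "host-b", "notepad.exe"),
]
pairs = co_occurring_pairs(group_by_window(events, window_seconds=300))
print(pairs)   # {('mimikatz.exe', 'psexec.exe'): 2}
```

Fixed buckets are the simplest criterion but can split two events that straddle a bucket boundary; a sliding window would avoid that at the cost of more code.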

## Grouping vs. clustering

**Grouping** is the next logical step **after** clustering is performed

The major difference between grouping and clustering is **that in grouping your input is an explicit set of items that are already of interest**

**Clustering separates similar data points that can be further narrowed down by grouping**, allowing the threat hunter to isolate data that is potentially of interest

Once potentially suspicious data has been identified, grouping works to eliminate benign data and generate more useful groups of suspicious data

## Stack counting

**Stack counting**, also known as "**stacking**", involves counting the number of occurrences of values of a particular type and analyzing outliers of those results

* It is an analysis method for finding the needle in the haystack, and one of the most common practices hunters use to examine a hypothesis
* Data stacking is used to isolate and classify patterns by applying frequency analysis to large quantities of related data
* In the context of a large data set, the investigator identifies the characteristics that differentiate the odd data rows and may prove that they are malicious
* Identifying similar or equal values across multiple artifacts may reveal useful correlations for threat hunting

### Applications

Stack counting can be applied to any case where anomalies are significant

It works best when most data fits into a range, set of ranges or set of common values

For example, identifying that 99% of traffic uses HTTP(S), DNS or SMTP means that the remaining 1% of traffic deserves further investigation
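
A minimal sketch of stack counting with Python's `collections.Counter` follows. The `stack_count` function, the 1% threshold and the protocol counts are hypothetical and only mirror the example above.

```python
from collections import Counter

def stack_count(values, rare_threshold=0.01):
    """Count occurrences of each value and list those at or below rare_threshold of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    stacked = counts.most_common()  # most frequent first
    rare = [value for value, n in stacked if n / total <= rare_threshold]
    return stacked, rare

# Hypothetical protocol field from 1,000 flow records: 99% HTTPS, DNS or SMTP.
protocols = ["HTTPS"] * 700 + ["DNS"] * 250 + ["SMTP"] * 40 + ["IRC"] * 8 + ["TFTP"] * 2
stacked, rare = stack_count(protocols)
print(stacked[0])   # ('HTTPS', 700)
print(rare)         # ['IRC', 'TFTP']
```

The rare values at the bottom of the stack are where a hunter would typically start looking.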

### Feature selection

When performing stack counting, the **choice of features to stack** is important

Features where every value is expected to be unique won't help with detecting outliers, because every value ends up with a count of one

Any anomalies that are detected need to be **meaningful and useful**
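
One rough way to screen features before stacking is to measure how unique their values are. The sketch below is an assumption-laden heuristic, not an established rule: the field names, the sample rows and the `0.9` cutoff are all hypothetical.

```python
def uniqueness_ratio(rows, feature):
    """Fraction of distinct values for a feature: 1.0 means every value is unique."""
    values = [row[feature] for row in rows]
    return len(set(values)) / len(values)

def stackable_features(rows, max_ratio=0.9):
    """Return features whose values repeat enough for frequency counts to be informative."""
    return [f for f in rows[0] if uniqueness_ratio(rows, f) < max_ratio]

# Hypothetical log rows: session_id is unique per row, the other fields repeat.
rows = [
    {"session_id": "a1", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "b2", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "c3", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "d4", "user_agent": "Mozilla/5.0", "dst_port": 443},
    {"session_id": "e5", "user_agent": "curl/7.88", "dst_port": 4444},
]
print(uniqueness_ratio(rows, "session_id"))   # 1.0
print(stackable_features(rows))               # ['user_agent', 'dst_port']
```

Passing this check only means a feature can be stacked; whether its outliers are meaningful still requires an analyst's judgement.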

### Tools

Stack counting can be automated using common tools

Excel has built-in functionality to help with data analysis

* Sorting columns to expose anomalies
* Cell highlighting rules
* VBA macros for custom analysis

Languages like Python and R have lots of data analysis functionality

`grep` and `awk` can be used on the command line to quickly analyze log files

The **pivot tables** of Microsoft Excel, the **`stats`** command of Splunk and the **`top`** command of ArcSight are all examples of stacking


