Requirements

Deep Learning

Deep learning is a machine learning approach that extracts highly abstract features from low-level data. It has proven successful in computer vision, speech recognition, natural language processing, and other tasks.

A deep learning model can learn complex feature hierarchies and combine several steps of the malware detection pipeline into a single model that is trained end-to-end, so that all of its components are learned simultaneously.

Large representative datasets are required

It is important to emphasize the data-driven nature of this approach. A trained model depends heavily on the data it saw during the training phase to determine which features are statistically relevant for predicting the correct label.

We must train our models on a dataset that correctly represents the conditions in which the model will work in the real world. This makes collecting a representative dataset crucial for machine learning to be successful.

The trained model has to be interpretable (XAI)

Most of the model families in current use, such as deep neural networks, are black box models. A black box model takes an input X and produces an output Y through a complex sequence of operations that a human can hardly interpret.

This becomes a problem when, for example, a false alarm occurs and we want to understand why it happened: was the cause in the training set or in the model itself? The interpretability of a model determines how easy it is for us to manage it, assess its quality, and correct its operation.

False positive rates must be extremely low

A false positive happens when an algorithm assigns a malicious label to a benign file. Our aim is to make the false positive rate as low as possible. This is complicated by the fact that there are lots of clean files in the world, and new ones keep appearing.

To address this problem, it is important to impose high requirements for both machine learning models and metrics that will be optimized during training, with a clear focus on low false positive rate (FPR) models.
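As a minimal sketch of what a focus on low FPR can look like in practice, the snippet below computes the false positive rate from labels and predictions and picks a score threshold that keeps the FPR on a benign validation set at or below a target. The function names, the label convention (1 = malicious, 0 = benign), and the rule "score above threshold means malicious" are assumptions made for illustration, not part of the original text.

```python
def false_positive_rate(labels, predictions):
    """FPR = FP / (FP + TN), assuming label 1 = malicious and 0 = benign."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


def threshold_for_target_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Assumed (hypothetical) detection rule: a file is flagged as malicious
    when its score is strictly greater than the returned threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign files we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every benign file may be flagged
    # Only the `allowed` highest scores (or fewer, with ties) lie above this.
    return ranked[allowed]


# Illustrative, made-up validation scores for benign files.
benign_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold = threshold_for_target_fpr(benign_scores, target_fpr=0.25)
flags = [1 if s > threshold else 0 for s in benign_scores]
measured_fpr = false_positive_rate([0] * len(benign_scores), flags)
```

In a real setting the target would be far smaller than 0.25, which in turn requires a very large benign validation set for the estimate to be meaningful.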

Model adaptability

Outside the malware detection domain, machine learning algorithms usually work under the assumption of a fixed data distribution, which means that it doesn’t change with time. Under that assumption, a training set that is large enough lets us train a model that handles any new sample in a test set effectively, and the model continues working as expected as time goes on.

When we apply machine learning to malware detection, we have to face the fact that our data distribution isn’t fixed:

  • Active adversaries (malware writers) constantly work to evade detection and release new versions of malware files that differ significantly from those seen during the training phase.

  • Thousands of software companies produce new types of benign executables that are significantly different from previously known types. The data on these types was lacking in the training set, but the model, nevertheless, needs to recognize them as benign.
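One hedged way to notice that the distribution has moved is to compare the model's output scores on training-time data with its scores on recent data. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) for that comparison. This technique is chosen here for illustration and is not described in the original text; the function name and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic between two score samples.

    Returns a value in [0, 1]: 0 means the empirical distributions
    coincide, 1 means the two samples do not overlap at all.
    """
    ref = sorted(reference)
    rec = sorted(recent)
    max_gap = 0.0
    # The empirical CDFs only change at sample points, so checking
    # every observed value is enough to find the largest gap.
    for x in ref + rec:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_rec = bisect_right(rec, x) / len(rec)
        max_gap = max(max_gap, abs(cdf_ref - cdf_rec))
    return max_gap


# Made-up model scores: at training time vs. on recently seen files.
training_scores = [1, 2, 3, 4]
recent_scores = [3, 4, 5, 6]
drift = ks_statistic(training_scores, recent_scores)
```

A value that keeps growing over successive time windows suggests that the model is seeing data unlike its training set and may need retraining; what counts as "too large" depends on the sample sizes and has to be decided per deployment.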
