Machine Learning Approaches
In unsupervised learning, where no right answers are given with the data, the goal is to discover the structure of the data or the law of data generation.
Cybersecurity vendors have access to large unlabeled datasets, and having experts label them manually is expensive. This makes unsupervised learning valuable for threat detection.
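As a rough illustration of discovering structure in unlabeled data, the sketch below groups samples with a plain k-means loop. The feature values, the choice of two clusters, and the starting centroids are all assumptions made up for this example; they are not taken from any real dataset or product.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    labels = [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels


# Hypothetical unlabeled samples: (byte entropy, fraction of rarely used APIs).
samples = [(7.8, 0.10), (7.9, 0.12), (7.7, 0.08), (4.1, 0.90), (4.3, 0.85), (3.9, 0.95)]

# Assumption: two groups, seeded with one sample from each visible cluster.
centroids, labels = k_means(samples, centroids=[samples[0], samples[3]])
print(labels)
```

No labels are used anywhere in this sketch. The algorithm only reports which samples resemble each other, and an analyst would still have to decide what each group means.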
Supervised learning applies when both the data and the right answer for each object are available. The goal is to fit a model that will produce the right answers for new objects.
Supervised learning consists of two stages:
1. Training: fitting a model to the available training data.
2. Application: applying the trained model to new samples and obtaining predictions.
In the training data, each object is described by a feature set X and paired with its correct label Y. This information is used during the training phase, when we search for the model that best produces the correct label Y for previously unseen objects given their feature set X.
In the case of malware detection, X could be features of file content or behavior, for instance file statistics and a list of used API functions. Labels Y could be "malware" or "benign", or a more precise classification, such as virus, Trojan-Downloader, or adware.
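To make the X and Y notation concrete, here is a minimal sketch of turning one file into a feature vector. The API vocabulary, the choice of statistics (file size and byte entropy), and the sample bytes are assumptions chosen for illustration; a real feature set would be designed by the vendor.

```python
import math
from collections import Counter

# Hypothetical vocabulary of API names that the feature vector tracks.
API_VOCABULARY = ["CreateRemoteThread", "WriteProcessMemory", "RegSetValueExA", "ReadFile"]


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(content: bytes, used_apis: list[str]) -> list[float]:
    """Build X: two file statistics followed by one 0/1 flag per tracked API."""
    api_flags = [1.0 if api in used_apis else 0.0 for api in API_VOCABULARY]
    return [float(len(content)), byte_entropy(content)] + api_flags


# Made-up sample: 16 bytes of content and two observed API calls.
x = extract_features(b"\x00\x01\x02\x03" * 4, ["ReadFile", "WriteProcessMemory"])
y = "malware"  # label Y, assumed to come from an analyst in this example

print(x, y)
```

Each position in `x` has a fixed meaning, so vectors from different files can be compared directly, while `y` carries the answer the model is expected to learn.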
After we have trained a model and verified its quality, we are ready for the next phase – applying the model to new objects. In this phase, the type of the model and its parameters do not change. The model only produces predictions.
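The two stages can be sketched with a nearest-centroid classifier, picked here only because it fits in a few lines; the text does not prescribe a particular model. The training data, feature meanings, and function names below are hypothetical.

```python
from math import dist


def train(features, labels):
    """Training stage: compute one centroid per label from the labeled data."""
    grouped = {}
    for x, y in zip(features, labels):
        grouped.setdefault(y, []).append(x)
    return {
        y: tuple(sum(col) / len(rows) for col in zip(*rows))
        for y, rows in grouped.items()
    }


def predict(model, x):
    """Application stage: only read the fitted parameters, never change them."""
    return min(model, key=lambda y: dist(x, model[y]))


# Made-up training set: X = (byte entropy, suspicious API count), Y = label.
X_train = [(7.6, 5), (7.9, 6), (7.2, 4), (4.0, 0), (4.5, 1), (3.8, 0)]
Y_train = ["malware", "malware", "malware", "benign", "benign", "benign"]

model = train(X_train, Y_train)
snapshot = dict(model)  # copy of the parameters taken right after training

# New, previously unseen objects.
X_new = [(7.5, 5), (4.2, 1)]
predictions = [predict(model, x) for x in X_new]

print(predictions)
print(model == snapshot)  # the parameters are unchanged by prediction
```

Keeping `train` and `predict` as separate functions mirrors the split described above: everything learned from the data lives in `model`, and prediction treats it as read-only.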