Naive Bayes (Discrete)

The idea of this project is to write a simple Naive Bayes model that predicts whether an SMS message is spam or not.

Let us derive the necessary probabilities. Naive Bayes is a model that relies on Bayes' theorem:

$$P(Y|X) = \frac{P(X|Y)\times P(Y)}{P(X)}$$

For this dataset, applying the chain rule and the naive assumption that words are conditionally independent given the label, we can write the equation as:

$$P(y|W_1, \dots, W_n) = \frac{P(W_1, \dots, W_n|y)\times P(y)}{P(W_1, \dots, W_n)}$$
$$P(y|W_1, \dots, W_n) = \frac{P(W_1 | W_2, \dots, W_n, y)\times \dots \times P(y)}{P(W_1, \dots, W_n)}$$
$$P(y|W_1, \dots, W_n) = \frac{P(y) \times \prod_{i=1}^{n} P(W_i|y)}{P(W_1, \dots, W_n)}$$
$$P(y|W_1, \dots, W_n) = \frac{P(y) \times \prod_{i=1}^{n} P(W_i|y)}{P(y) \times \prod_{i=1}^{n} P(W_i|y) + P(\neg y) \times \prod_{i=1}^{n} P(W_i|\neg y)}$$

A Bayesian spam filter operates based on the principles of Bayesian probability theory to classify incoming messages as either spam or not spam (ham). The main principle behind its operation is to calculate the probability that a given message is spam or ham by analyzing the words and features within the message.

Implementation Steps

1. Training Phase

The filter starts by analyzing a set of pre-labeled training emails, where each email is marked as either spam or ham. It calculates the probability of each word occurring in spam and ham messages based on the training set. It also considers the probability of combinations of words, creating a model that reflects the likelihood of certain word patterns in spam or ham.
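The training phase above can be sketched as a word-counting routine. This is a minimal illustration, not the project's actual code: the toy `msgs`/`labs` dataset is invented for demonstration, and Laplace (add-one) smoothing is assumed so that words unseen in one class never get a zero probability.

```python
from collections import Counter

def train(messages, labels):
    """Estimate priors P(y) and word likelihoods P(w|y) from labeled messages."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())

    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    priors = {y: class_counts[y] / len(labels) for y in ("spam", "ham")}
    # Laplace (add-one) smoothing: unseen words get a small nonzero probability
    likelihoods = {
        y: {
            w: (word_counts[y][w] + 1) / (sum(word_counts[y].values()) + len(vocab))
            for w in vocab
        }
        for y in ("spam", "ham")
    }
    return priors, likelihoods, vocab

# Toy usage (invented data)
msgs = ["win a free prize now", "free cash win", "see you at lunch", "lunch tomorrow then"]
labs = ["spam", "spam", "ham", "ham"]
priors, likelihoods, vocab = train(msgs, labs)
```

With this toy data, "free" appears twice among 8 spam words and never among 7 ham words, so its smoothed likelihoods are (2+1)/(8+12) = 0.15 for spam and 1/19 for ham.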

2. Calculating Message Probability

When a new email arrives, the filter breaks down the email into individual words and calculates the probability of it being spam or ham based on the pre-calculated probabilities from the training phase.

The filter considers both the presence and absence of specific words in the email.

3. Classifying the Message

The filter combines the individual probabilities for all the words in the message to calculate the overall probability of the message being spam or ham. If the probability exceeds a predefined threshold, the message is classified accordingly.
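Steps 2 and 3 together amount to evaluating the normalized product from the derivation above. The sketch below assumes the `priors`/`likelihoods`/`vocab` tables produced in the training phase (the inline toy values are invented for demonstration); it sums log-probabilities instead of multiplying, which avoids numeric underflow when a message has many words.

```python
import math

def classify(message, priors, likelihoods, vocab, threshold=0.5):
    """Return (label, P(spam|words)) for a message, ignoring out-of-vocabulary words."""
    words = [w for w in message.lower().split() if w in vocab]
    log_scores = {}
    for y in ("spam", "ham"):
        # log P(y) + sum_i log P(W_i|y): the numerator of Bayes' theorem, in log space
        log_scores[y] = math.log(priors[y]) + sum(math.log(likelihoods[y][w]) for w in words)
    # Normalize across both classes: P(spam|W) = spam_score / (spam_score + ham_score)
    p_spam = 1.0 / (1.0 + math.exp(log_scores["ham"] - log_scores["spam"]))
    return ("spam" if p_spam > threshold else "ham"), p_spam

# Toy probability tables (assumed already estimated during training)
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {"spam": {"free": 0.15, "win": 0.1}, "ham": {"free": 0.05, "win": 0.02}}
vocab = {"free", "win"}
label, p = classify("win a free prize", priors, likelihoods, vocab)
```

With these toy tables, the spam score is 0.5 × 0.15 × 0.1 = 0.0075 and the ham score is 0.5 × 0.05 × 0.02 = 0.0005, so P(spam|W) = 0.0075 / 0.008 = 0.9375 and the message is classified as spam.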

4. Adaptation and Learning

The filter continuously learns and adapts to new spam patterns by updating its probabilities based on new incoming messages.

Example

In the training phase, the filter determines that the word "free" occurs in 90% of spam messages and 5% of ham messages.

If an incoming message contains the word "free," the Bayesian filter calculates the probability of the message being spam based on this information.

$$P(\text{Spam}|\text{"free"}) = \frac{P(\text{"free"}|\text{Spam})\times P(\text{Spam})}{P(\text{"free"})}$$

Where:

  • $P(\text{"free"}|\text{Spam})$ is the probability of finding the word "free" in a spam message.

  • $P(\text{Spam})$ is the prior probability of a message being spam.

  • $P(\text{"free"})$ is the overall probability of finding the word "free" in any message (spam or ham).

The filter repeats this process for all relevant words and calculates the overall probability. If this probability exceeds a certain threshold, the message is classified as spam.
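Plugging in the numbers from the example makes the calculation concrete. The text gives P("free"|Spam) = 0.90 and P("free"|Ham) = 0.05; the prior P(Spam) = 0.20 below is an assumed value, not stated in the text, since a prior is needed to complete the computation.

```python
p_free_given_spam = 0.90   # from the training phase above
p_free_given_ham = 0.05    # from the training phase above
p_spam = 0.20              # assumed prior probability of spam (not given in the text)

# Total probability of "free": P("free") = P("free"|Spam)P(Spam) + P("free"|Ham)P(Ham)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(Spam|"free") = P("free"|Spam) * P(Spam) / P("free")
posterior = p_free_given_spam * p_spam / p_free
```

Here P("free") = 0.90 × 0.20 + 0.05 × 0.80 = 0.22, so P(Spam|"free") = 0.18 / 0.22 ≈ 0.818: seeing "free" alone already pushes the message well past a 0.5 threshold under this assumed prior.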
