SPAM or HAM

Implementing a proper Classifier

Numerical Stability

Numerical instability is a concept that refers to the propensity of an algorithm or computational procedure to produce inaccurate results due to round-off errors, truncation errors, or other computational issues.

These errors may be small initially but can accumulate and escalate during iterations, leading to results significantly far off from the expected or precise value.

The previous implementation had to use a reduction to multiply many likelihood probabilities together.

Because each likelihood is small, rounding errors (and eventual underflow) in that long product can make the program produce an incorrect result.
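
As a small illustration of the problem (hypothetical numbers, not the classifier itself), multiplying a long chain of small likelihoods in double precision underflows to zero:

```python
import numpy as np

# 1,000 word likelihoods of 0.0001 each; the true product is 1e-4000,
# far below the smallest representable float64, so it underflows to 0.0.
likelihoods = np.full(1000, 1e-4)
print(np.prod(likelihoods))  # -> 0.0, and every class score becomes indistinguishable
```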

This is where smoothing can help (by providing a small probability for unseen words).

Furthermore, we can explore other operations that avoid the long chain of multiplications altogether by working with logarithms:

$$\log ab = \log a + \log b$$
$$\exp \log x = x$$
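
Taking logarithms turns the product of likelihoods into a sum, which stays in a comfortable numeric range. Below is a minimal sketch; the function name `log_score`, the priors, and the likelihood values are hypothetical and assume the per-word probabilities have already been estimated (with smoothing, so none of them is zero):

```python
import math

def log_score(prior, word_likelihoods):
    """Score a class as log P(class) + sum(log P(word | class)).

    Summing logs replaces the unstable product; since log is monotonic,
    the class with the highest log-score is the same one that would win
    on raw products.
    """
    return math.log(prior) + sum(math.log(p) for p in word_likelihoods)

spam_score = log_score(0.4, [1e-4] * 1000)
ham_score = log_score(0.6, [5e-4] * 1000)
print("spam" if spam_score > ham_score else "ham")
```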

Load Libraries

Import NLTK and download the additional data for the tokenizer and lemmatizer.
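
A typical setup looks like the following; it assumes the standard NLTK resources for word tokenization (punkt) and WordNet-based lemmatization:

```python
import nltk

nltk.download('punkt')    # tokenizer models
nltk.download('wordnet')  # WordNet data used by the lemmatizer
nltk.download('omw-1.4')  # multilingual WordNet data some NLTK versions require
```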

Tokenization

Tokenization is the process of splitting a string or text into a list of tokens. One can think of tokens as parts: a word is a token in a sentence, and a sentence is a token in a paragraph.
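
For example, with NLTK's word tokenizer (using the punkt data downloaded above; the sample sentence is arbitrary):

```python
from nltk.tokenize import word_tokenize

sentence = "Congratulations! You have won a free prize."
print(word_tokenize(sentence))
# ['Congratulations', '!', 'You', 'have', 'won', 'a', 'free', 'prize', '.']
```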

Lemmatization

The process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings context to the words, so it links words with similar meanings to one word.

Examples of lemmatization:

  • rocks → rock

  • corpora → corpus

  • better → good
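
These examples can be reproduced with NLTK's WordNetLemmatizer; note that "better" only maps to "good" when the part of speech is given as an adjective:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("rocks"))            # rock
print(lemmatizer.lemmatize("corpora"))          # corpus
print(lemmatizer.lemmatize("better", pos="a"))  # good (pos="a" marks an adjective)
```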

Implementing proper Text Mining

Term frequency

Term frequency measures how frequently a term occurs within a document.

The easiest calculation is simply counting the number of times a word appears. However, there are ways to modify that value based on the document length or the frequency of the most frequently used word in the document.
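
A minimal sketch of a raw count, over a hypothetical list of already-tokenized words:

```python
from collections import Counter

tokens = ["free", "prize", "free", "click", "now", "free"]
tf = Counter(tokens)                  # raw count per term
print(tf.most_common(3))              # [('free', 3), ('prize', 1), ('click', 1)]

# One common adjustment: normalize by document length
tf_normalized = {term: count / len(tokens) for term, count in tf.items()}
```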

Below we will explore the usage of TF-IDF as a ranking method.

Data visualization

There are two traditional visualizations for textual data:

  1. Bar plots (for frequency and ranking; a short sketch follows this list)

  2. WordClouds
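
A short sketch of a frequency bar plot with matplotlib, using hypothetical token counts:

```python
import matplotlib.pyplot as plt
from collections import Counter

tokens = ["free", "prize", "free", "click", "now", "free", "win", "prize"]
words, counts = zip(*Counter(tokens).most_common(5))

plt.bar(words, counts)
plt.xlabel("word")
plt.ylabel("frequency")
plt.title("Most frequent words")
plt.show()
```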

Word Clouds

Word clouds or tag clouds are graphical representations of word frequency that give greater prominence to words that appear more frequently in a source text.

The larger the word in the visual, the more common the word was in the document(s).
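
A minimal sketch using the third-party wordcloud package (an assumption; any string built from the documents can serve as input):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "free prize free click now free win prize claim free"
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```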

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

TF-IDF for a word in a document is calculated by multiplying two different metrics:

  • The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by the length of a document, or by the raw frequency of the most frequent word in a document.

  • The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm.

So, if the word is very common and appears in many documents, its inverse document frequency will approach 0; the rarer the word, the larger its inverse document frequency.

$$tf\text{-}idf(t) = tf(t) \times idf(t)$$
$$tf(t) = \log(1 + freq(t))$$
$$idf(t) = \log\left(\frac{N}{count(t \in D) + 1}\right)$$
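
A minimal sketch implementing these exact formulas over a small, hypothetical document set (libraries such as scikit-learn's TfidfVectorizer use slightly different smoothing conventions):

```python
import math
from collections import Counter

documents = [
    ["free", "prize", "free", "click"],
    ["meeting", "tomorrow", "at", "noon"],
    ["free", "entry", "to", "win"],
]

def tf(term, document):
    # tf(t) = log(1 + freq(t)) within a single document
    return math.log(1 + Counter(document)[term])

def idf(term, docs):
    # idf(t) = log(N / (count(t in D) + 1)) across the document set
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / (n_containing + 1))

def tf_idf(term, document, docs):
    return tf(term, document) * idf(term, docs)

print(tf_idf("free", documents[0], documents))     # common word -> score near 0
print(tf_idf("meeting", documents[1], documents))  # rarer word -> higher score
```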
