# Logistic Regression

Logistic regression estimates the probability of an event occurring, such as voting or not voting, based on a given dataset of independent variables.

## Inner Workings

Here we are going to look at the binary classification case, but it is straightforward to generalize the algorithm to multi-class classification.

Assume that we have *k* predictors $$\lbrace X\_i \rbrace\_{i=1}^k$$, each in $$\mathbb{R}$$, and a binary response variable $$Y \in \lbrace 0,1 \rbrace$$.

In the logistic regression algorithm, the relationship between the predictors and the *logit* of the probability of a positive outcome $$Y=1$$ is assumed to be linear: $$logit(P(Y=1|w))=c+\sum\_{i=1}^kw\_iX\_i$$

where $$(w\_1, ..., w\_k) \in \mathbb{R}^k$$ are the linear weights and $$c \in \mathbb{R}$$ is the intercept.

Now what is the *logit* function? It is the log of the odds: $$logit(p)=\ln\left(\frac{p}{1-p}\right)$$

We see that the *logit* function maps a probability value from the open interval (0,1) to $$\mathbb{R}$$; it is undefined at exactly 0 and 1.

The inverse of the *logit* is the *logistic* curve (also called the sigmoid function), which we denote by $$\sigma$$: $$\sigma(r)=\frac{1}{1+e^{-r}}$$
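To make the inverse relationship concrete, here is a minimal JAX sketch. The function names `logit` and `sigmoid` and the sample probabilities are chosen here for illustration only.

```python
import jax.numpy as jnp

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return jnp.log(p / (1.0 - p))

def sigmoid(r):
    """Logistic curve: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + jnp.exp(-r))

p = jnp.array([0.1, 0.5, 0.9])   # illustrative probabilities
r = logit(p)                     # real-valued log-odds; logit(0.5) is 0
p_back = sigmoid(r)              # recovers p up to floating-point error
```

Applying `sigmoid` after `logit` returns the original probabilities, which is what "inverse" means here.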

If we denote by $$w=\[c;w\_1;...;w\_k]^T$$ the weight vector, $$x=\[1;x\_1;...;x\_k]^T$$ the observed values of the predictors, and *y* the associated class value, we have: $$logit(P(y=1|w))=w^Tx$$

And thus: $$P(y=1|w)=\sigma(w^Tx) \equiv \sigma\_w(x)$$

For a given set of weights *w*, the probability of a positive outcome is $$\sigma\_w(x)$$.

This probability can be turned into a predicted class label $$\hat{y}$$ using a threshold value: $$\hat{y} = 1 \text{ if } \sigma\_{\textbf{w}} (\textbf{x}) \geq 0.5, \text{ and } 0 \text{ otherwise}$$
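The threshold rule above can be sketched in JAX as follows. The helper names `predict_proba` and `predict`, and the weight and feature values, are made up for this example; each row of `X` is assumed to already carry a leading 1 for the intercept.

```python
import jax.numpy as jnp

def sigmoid(r):
    return 1.0 / (1.0 + jnp.exp(-r))

def predict_proba(w, X):
    """P(y=1 | x; w) for each row x of X (leading 1 included in x)."""
    return sigmoid(jnp.dot(X, w))

def predict(w, X, threshold=0.5):
    """Class label 1 where the probability reaches the threshold, else 0."""
    return (predict_proba(w, X) >= threshold).astype(jnp.int32)

w = jnp.array([-1.0, 2.0, 0.5])      # [c, w_1, w_2], illustrative values
X = jnp.array([[1.0, 1.0, 0.0],      # w^T x =  1.0
               [1.0, 0.0, 0.0]])     # w^T x = -1.0
labels = predict(w, X)
```

Since $$\sigma(0)=0.5$$, thresholding the probability at 0.5 is the same as checking whether $$w^Tx \geq 0$$.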

### Cost function

Now we assume that we have $$n$$ observations whose responses $$y^{(i)}$$ are independent and Bernoulli distributed given the predictors: $$\lbrace \left( \textbf{x}^{(1)}, y^{(1)} \right), \left( \textbf{x}^{(2)}, y^{(2)} \right), ..., \left( \textbf{x}^{(n)}, y^{(n)} \right) \rbrace$$

The likelihood that we would like to maximize given the samples is the following one:

$$
L(\textbf{w}) = \prod\_{i=1}^n P( y^{(i)} | \textbf{x}^{(i)}; \textbf{w}) = \prod\_{i=1}^n \sigma\_{\textbf{w}} \left(\textbf{x}^{(i)} \right)^{y^{(i)}} \left( 1- \sigma\_{\textbf{w}} \left(\textbf{x}^{(i)} \right)\right)^{1-y^{(i)}}
$$

For numerical stability, we prefer to work with a scaled log-likelihood, since a product of many probabilities quickly underflows. We also take the negative, to get a minimization problem:

$$
J(\textbf{w}) = - \frac{1}{n} \sum\_{i=1}^n \left\[ y^{(i)} \log \left( \sigma\_{\textbf{w}} \left(\textbf{x}^{(i)} \right) \right) + \left( 1-y^{(i)} \right) \log \left( 1- \sigma\_{\textbf{w}} \left(\textbf{x}^{(i)} \right)\right) \right]
$$
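As a rough illustration, the cost $$J(\textbf{w})$$ could be written in JAX like this. The function name `cost`, the toy data, and the clipping constant `eps` (added only to keep the logarithms finite) are assumptions of this sketch, not part of the formula above.

```python
import jax.numpy as jnp

def sigmoid(r):
    return 1.0 / (1.0 + jnp.exp(-r))

def cost(w, X, y, eps=1e-7):
    """Mean negative log-likelihood J(w).

    X has one observation per row, with a leading 1 for the intercept.
    eps is a hypothetical safeguard that keeps log() away from 0.
    """
    p = jnp.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -jnp.mean(y * jnp.log(p) + (1.0 - y) * jnp.log(1.0 - p))

# Toy data, made up for the example
X = jnp.array([[1.0, 0.5, 1.0],
               [1.0, -1.5, 0.3],
               [1.0, 2.0, -0.7]])
y = jnp.array([1.0, 0.0, 1.0])

w0 = jnp.zeros(3)
j0 = cost(w0, X, y)   # with w = 0 every probability is 0.5, so J = ln 2
```

With all weights at zero the model predicts 0.5 everywhere, so the cost equals $$\ln 2 \approx 0.693$$ whatever the labels are, which is a handy sanity check.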

A great feature of this cost function is that it is differentiable and convex, so a gradient-based algorithm with a suitable step size should converge to the global minimum. Now let's also introduce some $$\ell\_2$$-regularization to improve the model: $$J\_r(\textbf{w}) = J(\textbf{w}) + \frac{\lambda}{2} \textbf{w}^T \textbf{w}$$ with $$\lambda \geq 0$$.

Regularization is a very useful method to handle collinearity (high correlation among features), filter out noise from the data, and ultimately help prevent overfitting.
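A minimal sketch of the regularized cost $$J\_r$$ follows, taking the formula literally (so the intercept, being part of $$\textbf{w}$$, is penalized too). The names `cost_reg` and `lmbd`, the default value 0.1, and the toy data are assumptions made for this example.

```python
import jax.numpy as jnp

def sigmoid(r):
    return 1.0 / (1.0 + jnp.exp(-r))

def cost(w, X, y, eps=1e-7):
    """Unregularized cost J(w); eps is a hypothetical clipping safeguard."""
    p = jnp.clip(sigmoid(X @ w), eps, 1.0 - eps)
    return -jnp.mean(y * jnp.log(p) + (1.0 - y) * jnp.log(1.0 - p))

def cost_reg(w, X, y, lmbd=0.1):
    """J_r(w) = J(w) + (lambda / 2) * w^T w."""
    return cost(w, X, y) + 0.5 * lmbd * jnp.dot(w, w)

X = jnp.array([[1.0, 0.5],
               [1.0, -1.5]])
y = jnp.array([1.0, 0.0])
w = jnp.array([0.0, 2.0])

# The penalty alone: 0.5 * 0.1 * (0^2 + 2^2) = 0.2
penalty = cost_reg(w, X, y, lmbd=0.1) - cost(w, X, y)
```

Setting `lmbd=0` recovers the unregularized cost, and larger values penalize large weights more strongly.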

## Learning the weights

So we need to minimize $$J\_r(\textbf{w})$$. For that, we are going to apply gradient descent. This method requires the gradient of the cost function: $$\nabla\_{\textbf{w}} J\_r(\textbf{w})$$
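For reference, a generic gradient descent step can be written as follows, where the learning rate $$\eta > 0$$ and the iteration index $$t$$ are notation introduced here for illustration:

$$
\textbf{w}^{(t+1)} = \textbf{w}^{(t)} - \eta \nabla\_{\textbf{w}} J\_r\left(\textbf{w}^{(t)}\right)
$$

Because $$\nabla\_{\textbf{w}} \left( \frac{\lambda}{2} \textbf{w}^T \textbf{w} \right) = \lambda \textbf{w}$$, the regularization term pulls the weights toward zero at every step.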

### Compute the gradient

We could compute the gradient of this logistic regression cost function analytically. However, we won't, because we are lazy and want JAX to do it for us! Admittedly, JAX is more relevant when applied to a very complex function for which an analytical derivative is very hard or impossible to compute, such as the cost function of a deep neural network.

So let's differentiate this cost function with respect to its first and second positional arguments using JAX's `grad` function.
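The page does not show the cost function's signature, so the following is only a sketch under the assumption that the intercept `c` and the weights `w` are the first two positional arguments. The names `cost`, `lmbd`, `eps`, `eta`, the toy data, and the number of iterations are all hypothetical.

```python
import jax
import jax.numpy as jnp

def cost(c, w, X, y, lmbd=0.1, eps=1e-7):
    """Regularized cost J_r, with intercept c and weights w passed separately."""
    p = jnp.clip(jax.nn.sigmoid(X @ w + c), eps, 1.0 - eps)
    nll = -jnp.mean(y * jnp.log(p) + (1.0 - y) * jnp.log(1.0 - p))
    # Penalize c as well, since the formula above includes it in w
    return nll + 0.5 * lmbd * (jnp.dot(w, w) + c ** 2)

# argnums=(0, 1): differentiate with respect to c and w
grad_cost = jax.grad(cost, argnums=(0, 1))

# Toy data, made up for the example (no column of ones: c is separate here)
X = jnp.array([[0.5, 1.0], [-1.5, 0.3], [2.0, -0.7], [-0.2, -1.1]])
y = jnp.array([1.0, 0.0, 1.0, 0.0])

c, w = 0.0, jnp.zeros(2)
eta = 0.5                      # hypothetical learning rate
start = cost(c, w, X, y)
for _ in range(100):           # plain gradient descent
    dc, dw = grad_cost(c, w, X, y)
    c, w = c - eta * dc, w - eta * dw
end = cost(c, w, X, y)         # lower than start on this toy problem
```

With `argnums=(0, 1)`, `jax.grad` returns a tuple of gradients, one per selected argument, each with the same shape as that argument.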

