Minimizing the least squares error with quadratic regularization

In this Notebook,

  1. we show that minimizing the least squares error in a regression context is related to a Gaussian assumption on the error distribution
  2. we show that $L2$ regularization appears under an assumption of a Gaussian distribution of the parameters and a Bayesian treatment
  3. we show that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function with a quadratic regularizer

For a discussion of the Cross Entropy loss function, see:

http://romain.raveaux.free.fr/document/CrossEntropy.html

Recall: the Gaussian distribution

We can consider the distribution of the variable $x$ itself, which is given by

$$Pr(x|\mu,\sigma^2) = \frac{1}{{\sqrt {2\pi \sigma^2 } }} \exp \left\{ \frac{- (x - \mu )^2 }{2\sigma ^2} \right\}=\mathcal{N}(x|\mu,\sigma^2) $$ which is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, given by $\sigma$, is called the standard deviation.

$$\mathbb{E}[x,\mathcal{N}(x|\mu,\sigma^2)]=\mu$$ $$var[x,\mathcal{N}(x|\mu,\sigma^2)]=\sigma^2$$ $$\mathbb{E}[x^2,\mathcal{N}(x|\mu,\sigma^2)]=\mu^2 +\sigma^2 $$
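As a quick numerical check of these moments, here is a minimal sketch using NumPy (the parameter values and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8

# Draw a large sample from N(x | mu, sigma^2)
x = rng.normal(mu, sigma, size=1_000_000)

print(x.mean())         # close to mu
print(x.var())          # close to sigma^2
print((x ** 2).mean())  # close to mu^2 + sigma^2
```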

Mean square error loss

We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood. The goal is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$, on the basis of a set of training data comprising $M$ input values $X = (x_1, \ldots, x_M)^T$ and their corresponding target values $T = (t_1, \ldots, t_M)^T$. We can express our uncertainty over the value of the target variable using a probability distribution.

The least squares error can be defined as follows: $$MSE=\frac{1}{M}\sum_{n=1}^M { \{ \hat{t}_n-t_n }\}^2 =\frac{1}{M} ||\hat{T}-T ||_2^2$$
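In code, the MSE is a one-liner (a minimal sketch with NumPy; the arrays are made-up placeholders):

```python
import numpy as np

def mse(t_hat, t):
    """Mean squared error: (1/M) * ||t_hat - t||_2^2."""
    return np.mean((t_hat - t) ** 2)

t = np.array([1.0, 2.0, 3.0])      # targets
t_hat = np.array([1.1, 1.9, 3.2])  # predictions
print(mse(t_hat, t))               # 0.02
```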

We now use the training data to determine the values of the unknown parameters $w$ of our predictor $f(x,w)$ by maximum likelihood. For this purpose, we shall assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the predicted value $\hat{t}=f(x,w)$: $$Pr(t|x,w,\sigma)=Pr(t|\hat{t},\sigma)=\mathcal{N}(t|\hat{t},\sigma^2)$$ where $\sigma^2$ is the variance of the distribution. If the data are assumed to be drawn independently from the distribution, then the likelihood function is given by $$Pr(t|\hat{t},\sigma)=\prod_{n=1}^M \mathcal{N}(t_n|\hat{t}_n,\sigma^2)$$ It is convenient to maximize the logarithm of the likelihood function.

$$\log Pr(t|\hat{t},\sigma)=-\frac{1}{2\sigma^2}\sum_{n=1}^M { \{ \hat{t}_n-t_n }\}^2- \frac{M}{2} \log \sigma^2 - \frac{M}{2} \log(2 \pi) $$

$$\log Pr(t|x,w,\sigma)=-\frac{1}{2\sigma^2}\sum_{n=1}^M { \{ f(x_n,w)-t_n }\}^2- \frac{M}{2} \log \sigma^2 - \frac{M}{2} \log(2 \pi) $$
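To make the formula concrete, this sketch evaluates the closed-form log likelihood above and checks it against a sum of per-point Gaussian log densities (the predictions and targets are arbitrary made-up numbers):

```python
import numpy as np
from scipy.stats import norm

t = np.array([0.9, 2.1, 2.8, 4.2])      # targets t_n
t_hat = np.array([1.0, 2.0, 3.0, 4.0])  # predictions f(x_n, w)
sigma2 = 0.25
M = len(t)

# Closed-form log likelihood from the text
ll = (-np.sum((t_hat - t) ** 2) / (2 * sigma2)
      - M / 2 * np.log(sigma2)
      - M / 2 * np.log(2 * np.pi))

# Sum of per-point Gaussian log densities
ll_check = norm.logpdf(t, loc=t_hat, scale=np.sqrt(sigma2)).sum()
print(ll, ll_check)  # identical up to floating-point error
```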

$$w^*=\arg \max_w \log Pr(t|x,w,\sigma)$$

Consider first the determination of the maximum likelihood solution, which will be denoted by $w^*$. It is determined by maximizing the above equation with respect to $w$. For this purpose, we can omit the last two terms on the right-hand side because they do not depend on $w$.

Also, we note that scaling the log likelihood by a positive constant coefficient does not alter the location of the maximum with respect to $w$, and so we can replace the coefficient $\frac{1}{2\sigma^2}$ with $\frac{1}{M}$. Finally, instead of maximizing the log likelihood, we can equivalently minimize the negative log likelihood. $$w^*=arg \min_w - \log Pr(t|x,w,\sigma)$$

To do so, we simply differentiate $-\log Pr(t|x,w,\sigma)$ and find where the derivative is equal to zero. $$\frac{\partial (- \log Pr(t|x,w,\sigma))}{\partial w} =\frac{\partial }{\partial w} \left( \frac{1}{2\sigma^2}\sum_{n=1}^M { \{ f(x_n,w)-t_n }\}^2 \right)=0$$

We therefore see that, as far as determining $w$ is concerned, maximizing the likelihood is equivalent to minimizing the sum-of-squares error function.

Breaking news:

Thus the sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

Breaking news: just said differently

We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model.

The target variable $t$ is given by a deterministic function $f(x,w)$ with additive Gaussian noise so that $t = f(x,w) + \epsilon$

where $\epsilon$ is a zero-mean Gaussian random variable with variance $\sigma^2$. Thus we can write $$ Pr(t|x,w, \sigma) = \mathcal{N}(t|f(x,w), \sigma^2)$$
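This generative story is easy to simulate. A minimal sketch, assuming a simple linear model $f(x,w)=w_0+w_1 x$ with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50
w_true = np.array([0.5, 2.0])  # hypothetical parameters [w0, w1]
sigma = 0.3

x = rng.uniform(0, 1, size=M)
f = w_true[0] + w_true[1] * x         # deterministic part f(x, w)
t = f + rng.normal(0, sigma, size=M)  # additive zero-mean Gaussian noise
```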

If we assume a squared loss function, then the optimal prediction for a new value of $x$ is given by the mean of the distribution $Pr(t|x,w^*, \sigma)$, namely $f(x,w^*)$.

Let us find $\sigma^2$

We can first determine the parameter vector $w^*$ governing the mean and subsequently use this to find the variance $\sigma^{2*}$, as was the case for the simple Gaussian; see: http://romain.raveaux.free.fr/document/Overfittingbiaisedandunbiaisedvariance.html

We now maximize the log likelihood with respect to $\sigma^2$, starting again from $$\log Pr(t|x,w,\sigma)=-\frac{1}{2\sigma^2}\sum_{n=1}^M { \{ f(x_n,w)-t_n }\}^2- \frac{M}{2} \log \sigma^2 - \frac{M}{2} \log(2 \pi) $$

$$\sigma^{2*}=\arg \max_{\sigma^2} \log Pr(t|x,w,\sigma)$$

To do so, we simply differentiate $-\log Pr(t|x,w,\sigma)$ with respect to $\sigma^2$ and find where the derivative is equal to zero. $$\frac{\partial (- \log Pr(t|x,w,\sigma))}{\partial \sigma^2} =0$$ $$\sigma^{2*}=\frac{1}{M} \sum_{n=1}^M (f(x_n,w^*)-t_n)^2 $$
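In code, once $w^*$ is known, the variance estimate is just the mean of the squared residuals. A sketch, reusing the hypothetical linear model from above:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50
x = rng.uniform(0, 1, size=M)
t = 0.5 + 2.0 * x + rng.normal(0, 0.3, size=M)

# Least squares fit of a line: w* minimizes sum_n (f(x_n, w) - t_n)^2
X = np.column_stack([np.ones(M), x])
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)

# sigma^2* = (1/M) * sum_n (f(x_n, w*) - t_n)^2
sigma2_star = np.mean((X @ w_star - t) ** 2)
print(w_star, sigma2_star)  # sigma2_star should be close to 0.3^2
```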

Predictive distribution

Having determined the parameters $w^*$ and $\sigma^{2*}$, we can now make predictions for new values of $x$. Because we now have a probabilistic model, these are expressed in terms of the predictive distribution, which gives a probability distribution over $t$ rather than simply a point estimate, and which is obtained by substituting the maximum likelihood parameters to give

$$Pr(t|x,w^*,\sigma^{*})=\mathcal{N}(t|f(x,w^*),\sigma^{2*})$$
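Given $w^*$ and $\sigma^{2*}$, the predictive distribution at a new input is a Gaussian centred on the model output. A sketch, with hypothetical fitted values continuing the linear model assumed above:

```python
import numpy as np
from scipy.stats import norm

w_star = np.array([0.5, 2.0])  # hypothetical fitted parameters
sigma2_star = 0.09             # hypothetical fitted variance

x_new = 0.7
mean = w_star[0] + w_star[1] * x_new  # f(x_new, w*)

# Predictive distribution Pr(t | x_new, w*, sigma*) = N(t | f(x_new, w*), sigma^2*)
pred = norm(loc=mean, scale=np.sqrt(sigma2_star))
print(pred.mean(), pred.std())  # point prediction and its uncertainty
print(pred.pdf(2.0))            # density assigned to the value t = 2.0
```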

Regularization of loss functions

Regularization is a common feature of machine learning techniques for dealing with the generalization problem. It is often used to control the over-fitting phenomenon and involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form: $$Loss=\frac{1}{M}\sum_{n=1}^M { \{ f(x_n,w)-t_n }\}^2 + \frac{\lambda}{2} ||w||^2_2$$ where $||w||_2^2 = w^Tw$ and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error term. The $L2$ norm regularization can be explained by a Bayesian approach.
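As code, the modified error function reads as follows (a minimal sketch; `f` stands for any predictor and the numbers are made up):

```python
import numpy as np

def regularized_loss(w, x, t, f, lam):
    """MSE plus the quadratic (L2) penalty (lambda/2) * ||w||_2^2."""
    data_term = np.mean((f(x, w) - t) ** 2)
    penalty = 0.5 * lam * np.dot(w, w)
    return data_term + penalty

# Example with a linear predictor f(x, w) = w0 + w1 * x
f = lambda x, w: w[0] + w[1] * x
x = np.array([0.0, 0.5, 1.0])
t = np.array([0.4, 1.6, 2.4])
w = np.array([0.5, 2.0])
print(regularized_loss(w, x, t, f, lam=0.1))
```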

Therefore we introduce a prior distribution over the coefficients $w$. Let us consider a Gaussian distribution of the form $$Pr(w|\alpha) = \mathcal{N}(w|0,\alpha \textbf{I}) \propto \exp{\left(-\frac{1}{2\alpha} w^Tw\right)}$$ where $\alpha$ is the variance of the distribution. Variables such as $\alpha$, which control the distribution of model parameters, are called hyperparameters. Using Bayes’ theorem, the posterior distribution for $w$ is proportional to the product of the prior distribution and the likelihood function $$ \text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

$$Pr(w|x,t,\alpha,\sigma) \propto Pr(t|\hat{t},\sigma)Pr(w|\alpha)$$ We can now determine $w$ by finding the most probable value of $w$ given the data, in other words by maximizing the posterior distribution. This technique is called $\textit{Maximum A Posteriori}$, or simply MAP. Taking the negative logarithm, we find that the maximum of the posterior is given by the minimum of $$\frac{1}{2\sigma^2} \sum_{n=1}^M { \{ f(x_n,w)-t_n }\}^2 + \frac{1}{2\alpha} w^Tw$$

Breaking news: maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function

We see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function with a regularization parameter given by $\lambda=\sigma^2/\alpha$. This statement can be found in Bishop (2006), Pattern Recognition and Machine Learning. Many other regularization terms can be found in the literature, such as the $L1$ norm or the pseudo-norm $L0$. Classically, the cost function in a learning problem can be written as: $$\text{Cost Function = Loss (i.e. MSE or cross entropy) + Regularization term} $$
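A quick numerical check of this equivalence (a sketch with arbitrary numbers): the MAP objective equals the regularized sum-of-squares scaled by the constant factor $1/(2\sigma^2)$, so the two share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, alpha = 0.09, 1.0
lam = sigma2 / alpha

x = rng.uniform(0, 1, size=20)
t = 0.5 + 2.0 * x + rng.normal(0, np.sqrt(sigma2), size=20)

def map_objective(w):
    r = (w[0] + w[1] * x) - t
    return np.sum(r ** 2) / (2 * sigma2) + np.dot(w, w) / (2 * alpha)

def ridge_objective(w):
    r = (w[0] + w[1] * x) - t
    return np.sum(r ** 2) + lam * np.dot(w, w)

w = rng.normal(size=2)
# The two objectives agree up to the constant 1/(2*sigma2)
print(map_objective(w), ridge_objective(w) / (2 * sigma2))
```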

This particular choice of regularizer is known in the machine learning literature as weight decay because, in sequential learning algorithms, it encourages weight values to decay towards zero unless supported by the data. In statistics, it provides an example of a parameter shrinkage method because it shrinks parameter values towards zero. It has the advantage that the error function remains a quadratic function of $w$.

The particular case of a quadratic regularizer is called ridge regression. In the context of neural networks, this approach is known as weight decay.
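Because the regularized error function remains quadratic in $w$, its minimizer can be written in closed form. A minimal sketch for a linear model with design matrix $X$, using the common convention of minimizing $\sum_n (w^Tx_n - t_n)^2 + \lambda' w^Tw$ (which has the same minimizer as the loss above once $\lambda'$ absorbs the $M$ and $1/2$ factors):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 50
x = rng.uniform(0, 1, size=M)
t = 0.5 + 2.0 * x + rng.normal(0, 0.3, size=M)

X = np.column_stack([np.ones(M), x])  # design matrix for f(x, w) = w0 + w1*x
lam = 0.1

# Closed-form ridge solution: w* = (X^T X + lam * I)^(-1) X^T t
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
print(w_ridge)  # shrunk towards zero relative to the ordinary least squares fit
```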

Solving the problem

Let us denote the cost function by $L$, the loss function by $l$ and the regularizer by $r$. We are seeking the minimum of the function $L$, so we need to find where its derivative is equal to zero: $$\dfrac{\partial L}{\partial W} = 0 $$

Gradient descent method

The problem can be solved by the gradient descent method.

The goal is to choose the parameter update to comprise a small step in the direction of the negative gradient, so that $$W^{t+1}=W^{t}-\eta \dfrac{\partial L}{\partial W}$$ where the parameter $\eta > 0$ is known as the learning rate (written $\eta$ here to avoid confusion with the prior variance $\alpha$).

If we explicitly show the regularizer, with $L = l + \frac{\lambda}{2} ||W||_2^2$ so that $\dfrac{\partial L}{\partial W} = \dfrac{\partial l}{\partial W} + \lambda W$, the update becomes $$W^{t+1}=W^{t}-\eta \left( \dfrac{\partial l}{\partial W} + \lambda W^{t} \right)$$ $$W^{t+1}=W^{t}(1-\eta \lambda)- \eta \dfrac{\partial l}{\partial W}$$

We explicitly see that, at every step, $W^{t}$ is multiplied by the factor $(1-\eta \lambda) < 1$ before the data-dependent gradient is applied. This is another explanation for the name weight decay. Parameters overfitted to the data are likely to develop large positive and negative values, and the penalty term discourages the parameters from reaching such large values.

To conclude, let us show an example of gradient descent with a regularized loss function.
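A minimal sketch of such an example (assuming the hypothetical linear model $f(x,w)=w_0+w_1x$ on synthetic data; the values of $\eta$ and $\lambda$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 100
x = rng.uniform(0, 1, size=M)
t = 0.5 + 2.0 * x + rng.normal(0, 0.3, size=M)
X = np.column_stack([np.ones(M), x])

lam, eta = 0.1, 0.1
w = np.zeros(2)

for step in range(2000):
    # Gradient of (1/M) * sum_n (X w - t)^2 + (lam/2) * ||w||^2
    grad = (2.0 / M) * X.T @ (X @ w - t) + lam * w
    # Weight decay: w is multiplied by (1 - eta*lam) before the data update
    w = w - eta * grad

print(w)  # close to, but shrunk relative to, the unregularized fit [0.5, 2.0]
```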