Gradient Descent : Momentum, RMSProp and ADAM

Author : Romain Raveaux

Gradient descent is an algorithm to find the parameters of a model.

Given a data set comprising $M$ input values $X = (x_1, . . . , x_M)^T$ and their corresponding target values $T = (t_1, . . . , t_M)^T$, we want to find the parameters $W$ of the model $f(x,W)$ such that a given criterion is minimized.

The criterion

In machine learning, the criterion is often an error function such as: $$L(X,T,W)=\frac{1}{M}\sum_{n=1}^M \{ f(x_n,W)-t_n \}^2 $$ or $$L(X,T,W)=\frac{-1}{M}\sum_{n=1}^M \{ t_n \log (f(x_n,W)) + (1-t_n) \log( 1-f(x_n,W) ) \} \quad \text{with } t_n \in \{0,1\}$$
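As a minimal illustration of the two criteria above, here is a sketch in Python (the data values and function names are illustrative, not from the article):

```python
import numpy as np

def mse_loss(pred, t):
    # Mean squared error: (1/M) * sum_n { f(x_n, W) - t_n }^2
    return np.mean((pred - t) ** 2)

def cross_entropy_loss(pred, t, eps=1e-12):
    # Binary cross-entropy with t_n in {0, 1}; eps guards against log(0)
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(t * np.log(pred) + (1 - t) * np.log(1 - pred))

pred = np.array([0.9, 0.2, 0.7])   # model outputs f(x_n, W)
t = np.array([1.0, 0.0, 1.0])      # target values
print(mse_loss(pred, t))
print(cross_entropy_loss(pred, t))
```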

The optimization problem

The optimization problem is then defined as : $$ W^*=\arg\min_W L(X,T,W)$$

We seek the minimum of the function $L$, so we need to find where its derivative is equal to zero: $$\dfrac{\partial L(X,T,W)}{\partial W} = 0 $$ This can be done by the gradient descent method.

Gradient descent method

If the function $L$ is convex, then we can start from any random parameters $W$ and reach the global minimum of $L$ by gradient descent. If the problem is not convex, then the gradient descent method only finds a local, possibly suboptimal, solution.

The goal is to choose the parameter update to comprise a small step in the direction of the negative gradient, so that $$W^{(t+1)}=W^{(t)}-\alpha \, \dfrac{\partial L}{\partial W^{(t)}}$$ where the parameter $\alpha > 0$ is known as the learning rate.
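A minimal sketch of this update rule, on a toy convex loss $L(w) = (w-3)^2$ whose gradient is $2(w-3)$ (the toy loss and names are illustrative, not from the article):

```python
# Toy convex loss L(w) = (w - 3)^2, with gradient dL/dw = 2 * (w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0          # arbitrary starting point
alpha = 0.1      # learning rate
for _ in range(100):
    w = w - alpha * grad(w)   # W(t+1) = W(t) - alpha * dL/dW(t)
print(w)  # converges to the minimizer w* = 3
```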

Issue of the Gradient descent

On its own, the gradient can be somewhat noisy: it may change a lot from one iteration to the next. To get a clearer trend in the gradient values, it is possible to smooth them.

Momentum and Gradient descent

The principle

The iterations of the algorithm are modified as follows: $$U_{dw}^{(t)}=\beta\, U_{dw}^{(t-1)} +(1- \beta ) \dfrac{\partial L}{\partial W^{(t)}}$$ The update of the parameters is as follows : $$W^{(t+1)}=W^{(t)}-\alpha \, U_{dw}^{(t)}$$

Hyper parameters

The first value is often set to zero: $$U_{dw}^{(t=0)}=0$$ $\beta$ is a hyper parameter that is often set to $0.9$.

What it is doing

Momentum averages past gradient values. The average is an exponential average: each gradient value is weighted by an exponential function $\beta^t$ (see figure).

$$U_{dw}^{(100)}=0.9\, U_{dw}^{(99)} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ $$U_{dw}^{(99)}=0.9\, U_{dw}^{(98)} +0.1 \dfrac{\partial L}{\partial W^{(99)}}$$ $$U_{dw}^{(98)}=0.9\, U_{dw}^{(97)} +0.1 \dfrac{\partial L}{\partial W^{(98)}}$$ $$...$$ Re-arranging, we can express $U_{dw}^{(100)}$ in terms of the past values : $$U_{dw}^{(100)}=0.9 \big\{ 0.9\, U_{dw}^{(98)} +0.1 \dfrac{\partial L}{\partial W^{(99)}} \big\}+0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ $$U_{dw}^{(100)}= 0.9^2\, U_{dw}^{(98)} +0.9 \cdot 0.1\, \dfrac{\partial L}{\partial W^{(99)}} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$

$$U_{dw}^{(100)}= 0.9^3\, U_{dw}^{(97)} +0.9^2 \cdot 0.1\, \dfrac{\partial L}{\partial W^{(98)}} +0.9 \cdot 0.1\, \dfrac{\partial L}{\partial W^{(99)}} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ The exponential weighting $0.9^k$ of the gradient from $k$ iterations ago now appears clearly.

By setting $\beta=0.9$, Momentum effectively averages the gradient values of roughly the last 10 iterations: the gradient from 10 iterations ago is weighted by $0.9^{10} \approx 0.35$, i.e. divided by about 3, making it much less significant. More generally, Momentum averages approximately the last $\frac{1}{1-\beta}$ iterations. Indeed, $$(1- \epsilon)^{\frac{1}{\epsilon}} \approx \frac{1}{e} \approx 0.35$$ so with $\epsilon=1-\beta$, the weight of a gradient that is $\frac{1}{1-\beta}$ iterations old has already decayed to about $1/e$.
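A quick numeric check of this claim (a sketch, not from the article):

```python
# With beta = 0.9, the gradient from 10 iterations ago is weighted by
# beta**10, which is roughly 1/e, and the effective averaging window
# is 1 / (1 - beta) = 10 iterations.
beta = 0.9
weight_10_back = beta ** 10
window = 1.0 / (1.0 - beta)
print(weight_10_back)   # ~0.349, about one third
print(window)           # ~10
```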

With these settings, we see that the method needs a warm-up phase: the first value $U_{dw}^{(t=0)}=0$ has a strong impact at the beginning. In practice, however, the bias introduced by this first value disappears after about 10 iterations. It is also possible to correct this behaviour with a normalization called bias correction : $$U_{dw}^{(t,corrected)}=\frac{U_{dw}^{(t)}}{1 -\beta^t} $$

Note that in statistics, this procedure is called an exponentially weighted average.

The advantage

It only takes one line of code, and only one extra value per parameter is kept in memory.

TODO : Code the Momentum principle, play with $\beta$, implement the bias correction
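One possible sketch for this exercise, reusing a toy convex loss $L(w) = (w-3)^2$ with gradient $2(w-3)$ (the toy loss and names are illustrative, not from the article):

```python
# Momentum with bias correction on the toy loss L(w) = (w - 3)^2
def grad(w):
    return 2.0 * (w - 3.0)

w, u = 0.0, 0.0              # U_dw(t=0) = 0
alpha, beta = 0.1, 0.9
for t in range(1, 501):
    g = grad(w)
    u = beta * u + (1 - beta) * g       # exponentially weighted average
    u_hat = u / (1 - beta ** t)         # bias correction
    w = w - alpha * u_hat
print(w)  # converges to the minimizer w* = 3
```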

RMS Prop and Gradient descent

Root Mean Square Propagation

The principle

The iterations of the algorithm are modified as follows: $$S_{dw}^{(t)}=\beta_2 S_{dw}^{(t-1)} +(1- \beta_2 ) \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2$$ The update of the parameters is as follows : $$W^{(t+1)}=W^{(t)}-\alpha \, \dfrac{\partial L}{\partial W^{(t)}} \frac{1}{\sqrt{S_{dw}^{(t)}}}$$

To prevent a division by zero when $S_{dw}^{(t)}=0$, it is possible to add $\epsilon=10^{-8}$: $$\frac{1}{\sqrt{S_{dw}^{(t)}+\epsilon}}$$
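A sketch of these two steps, again on a toy convex loss $L(w) = (w-3)^2$ with gradient $2(w-3)$ (the toy loss and names are illustrative, not from the article):

```python
import math

# RMSProp on the toy loss L(w) = (w - 3)^2
def grad(w):
    return 2.0 * (w - 3.0)

w, s = 0.0, 0.0
alpha, beta2, eps = 0.01, 0.999, 1e-8
for _ in range(3000):
    g = grad(w)
    s = beta2 * s + (1 - beta2) * g ** 2     # running average of squared gradients
    w = w - alpha * g / math.sqrt(s + eps)   # rescaled update
print(w)  # oscillates close to the minimizer w* = 3
```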

What it is doing

Squaring the gradient emphasizes large gradient values: $ \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2$ gets very large, and consequently $S_{dw}^{(t)}$ becomes very large too. Accordingly, in the update rule, a large gradient value $\dfrac{\partial L}{\partial W^{(t)}}$ is divided by the large value $\sqrt{S_{dw}^{(t)}}$. This reduces high variations of the gradient.

RMSProp reduces the gradient oscillations and allows the use of a larger learning rate.

Legend : it is said that this approach was first proposed not in an academic research paper but during a Coursera course given by Geoffrey Hinton.

ADAM optimizer

ADAM = Adaptive Moment Estimation

The principle at iteration $t$

The iterations of the algorithm are modified as follows: $$U_{dw}^{(t)}=\beta_1 U_{dw}^{(t-1)} +(1- \beta_1 ) \dfrac{\partial L}{\partial W^{(t)}} \to \text{ this is Momentum}$$ $$S_{dw}^{(t)}=\beta_2 S_{dw}^{(t-1)} +(1- \beta_2 ) \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2 \to \text{ this is RMSProp}$$

The update of the parameters is as follows : $$W^{(t+1)}=W^{(t)}-\alpha \, \frac{U_{dw}^{(t)}}{\sqrt{S_{dw}^{(t)}}+\epsilon}$$

It is also possible to apply bias correction on $U_{dw}^{(t)}$ and $S_{dw}^{(t)}$.

The principle

ADAM combines Momentum and RMSProp.

Hyper parameters

$$\beta_1 = 0.9$$ $$\beta_2 = 0.999$$ $$\epsilon=10^{-8}$$
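Putting the two running averages, the bias corrections, and these hyper parameters together, here is a sketch on a toy convex loss $L(w) = (w-3)^2$ with gradient $2(w-3)$ (the toy loss and names are illustrative, not from the article):

```python
import math

# ADAM on the toy loss L(w) = (w - 3)^2
def grad(w):
    return 2.0 * (w - 3.0)

w, u, s = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 5001):
    g = grad(w)
    u = beta1 * u + (1 - beta1) * g            # Momentum term
    s = beta2 * s + (1 - beta2) * g ** 2       # RMSProp term
    u_hat = u / (1 - beta1 ** t)               # bias corrections
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * u_hat / (math.sqrt(s_hat) + eps)
print(w)  # ends up close to the minimizer w* = 3
```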

Learning rate decay

If $\alpha$ is too high, the method will not converge and will keep oscillating. Conversely, if $\alpha$ is too low, convergence will be slow.

A compromise is to use a large learning rate at the beginning and then, after some iterations, to reduce it so as to obtain smaller oscillations in a tighter region of the solution space: $$\alpha=\frac{1}{1+\text{decay} \cdot t}\alpha_0$$ With, for example : $$\alpha_0=0.1$$ $$\text{decay}=1$$
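A sketch of this decay schedule with the values above (here $t$ is assumed to count epochs, starting at 0):

```python
# Learning-rate decay: alpha = alpha_0 / (1 + decay * t)
alpha0, decay = 0.1, 1.0
for t in range(5):
    alpha = alpha0 / (1.0 + decay * t)
    print(t, alpha)   # 0.1, 0.05, 0.0333..., 0.025, 0.02
```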