Gradient descent is an iterative algorithm for finding the parameters of a model that minimize a given criterion.
Given a data set comprising $M$ input values $X = (x_1, \dots, x_M)^T$ and their corresponding target values $T = (t_1, \dots, t_M)^T$, we want to find the parameters $W$ of the model $f(x,W)$ such that a given criterion is minimized.
In machine learning, the criterion is often an error function such as the mean squared error $$L(X,T,W)=\frac{1}{M}\sum_{n=1}^M \{ f(x_n,W)-t_n \}^2 $$ or the cross-entropy $$L(X,T,W)=-\frac{1}{M}\sum_{n=1}^M \{ t_n \log (f(x_n,W)) + (1-t_n) \log( 1-f(x_n,W) ) \} \quad \text{with }t_n \in \{0,1\}$$
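As a quick sketch, both error functions can be written directly in NumPy; the predictions `f` and targets `t` below are made up for illustration.

```python
import numpy as np

def mse(f, t):
    # L = (1/M) * sum_n (f_n - t_n)^2
    return np.mean((f - t) ** 2)

def cross_entropy(f, t):
    # L = -(1/M) * sum_n { t_n*log(f_n) + (1 - t_n)*log(1 - f_n) }, t_n in {0, 1}
    return -np.mean(t * np.log(f) + (1 - t) * np.log(1 - f))

f = np.array([0.9, 0.2, 0.8])  # model outputs in (0, 1)
t = np.array([1.0, 0.0, 1.0])  # binary targets
print(round(mse(f, t), 3), round(cross_entropy(f, t), 3))
```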
The optimization problem is then defined as: $$ W^*=\arg \min_W L(X,T,W)$$
We seek the minimum of the function $L$, so we need to find where its derivative is equal to zero: $$\dfrac{\partial L(X,T,W)}{\partial W} = 0 $$ This can be done with the gradient descent method.
If the function $L$ is convex, then we can start with any random parameters $W$ and reach the global minimum of $L$ by gradient descent. If the problem is not convex, then the gradient descent method only finds a suboptimal (local) solution.
The goal is to choose the parameter update to comprise a small step in the direction of the negative gradient: $$W^{(t+1)}=W^{(t)}-\alpha \, \dfrac{\partial L}{\partial W^{(t)}}$$ where the parameter $\alpha > 0$ is known as the learning rate.
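A minimal sketch of this update rule, on a made-up one-dimensional least-squares problem with a linear model $f(x,W)=Wx$ (the data and the true slope 3.0 are assumptions for illustration):

```python
import numpy as np

# Synthetic data: targets generated from a true slope of 3.0 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=20)
T = 3.0 * X + 0.1 * rng.normal(size=20)

W = 0.0       # arbitrary initial parameter
alpha = 0.1   # learning rate

for t in range(200):
    grad = (2.0 / len(X)) * np.sum((W * X - T) * X)  # dL/dW for the squared error
    W = W - alpha * grad                             # step against the gradient

print(W)  # close to the true slope 3.0
```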
The gradient on its own can be noisy: it may change a lot from one iteration to the next. In order to extract a clear trend from the gradient values, it is possible to smooth them.
The iterations of the algorithm are modified as follows: $$U_{dw}^{(t)}=\beta U_{dw}^{(t-1)} +(1- \beta ) \dfrac{\partial L}{\partial W^{(t)}}$$ The update of the parameters is as follows: $$W^{(t+1)}=W^{(t)}-\alpha \, U_{dw}^{(t)}$$
The first value is often set to 0: $$U_{dw}^{(t=0)}=0$$ $\beta$ is a hyperparameter that is often set to $0.9$.
Momentum averages past gradient values. The average is exponential: each gradient value is weighted by an exponential factor $\beta^t$ (see figure).
$$U_{dw}^{(100)}=0.9\, U_{dw}^{(99)} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ $$U_{dw}^{(99)}=0.9\, U_{dw}^{(98)} +0.1 \dfrac{\partial L}{\partial W^{(99)}}$$ $$U_{dw}^{(98)}=0.9\, U_{dw}^{(97)} +0.1 \dfrac{\partial L}{\partial W^{(98)}}$$ $$\dots$$ Re-arranging, we can express $U_{dw}^{(100)}$ with the past values: $$U_{dw}^{(100)}=0.9 \big\{ 0.9\, U_{dw}^{(98)} +0.1 \dfrac{\partial L}{\partial W^{(99)}} \big\}+0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ $$U_{dw}^{(100)}= 0.9^2\, U_{dw}^{(98)} +0.9 \cdot 0.1 \dfrac{\partial L}{\partial W^{(99)}} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$
$$U_{dw}^{(100)}= 0.9^3\, U_{dw}^{(97)} +0.9^2 \cdot 0.1 \dfrac{\partial L}{\partial W^{(98)}} +0.9 \cdot 0.1 \dfrac{\partial L}{\partial W^{(99)}} +0.1 \dfrac{\partial L}{\partial W^{(100)}}$$ The exponential weighting appears: the gradient from $k$ iterations in the past is weighted by $0.9^k \cdot 0.1$.
By setting $\beta=0.9$, momentum effectively averages the gradient values from the 10 past iterations: the gradient from 10 iterations back is weighted by $0.9^{10} \approx 0.35 \approx 1/3$, making it much less significant. More generally, momentum averages over approximately $\frac{1}{1-\beta}$ iterations, since $$(1- \epsilon)^{\frac{1}{\epsilon}} \approx \frac{1}{e} \approx 0.37$$ so, with $\epsilon=1-\beta$, the weight of a gradient has decayed to about $1/e$ after $\frac{1}{1-\beta}$ iterations.
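This effective window is easy to check numerically for $\beta = 0.9$:

```python
# Effective averaging window of momentum for beta = 0.9:
# a gradient 1/(1-beta) steps in the past is weighted by roughly 1/e.
beta = 0.9
window = 1 / (1 - beta)   # ~10 iterations
weight = beta ** window   # ~0.349, close to 1/e ~ 0.368
print(window, round(weight, 3))
```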
By setting these values, we see that the method needs a warm-up phase. The first value $U_{dw}^{(t=0)}=0$ has a strong impact at the beginning. In practice, however, after about 10 iterations the bias introduced by this first value disappears. It is also possible to correct this behaviour with a normalization called bias correction: $$U_{dw}^{(t,corrected)}=\frac{U_{dw}^{(t)}}{1 -\beta^t} $$
Note that in statistics, this procedure is called an exponentially weighted average.
It only takes one line of code, and only one real number per parameter is kept in memory.
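A sketch of the momentum update with bias correction, using $\beta = 0.9$ as above; the gradient sequence below is made up to stand in for $\partial L / \partial W^{(t)}$:

```python
import numpy as np

# Made-up gradient signal standing in for dL/dW at each iteration.
grads = np.sin(np.linspace(0, 3, 50)) + 0.5
beta, alpha = 0.9, 0.01

U = 0.0   # U_dw^{(0)} = 0
W = 1.0
for t, g in enumerate(grads, start=1):
    U = beta * U + (1 - beta) * g     # exponentially weighted average (one line)
    U_corrected = U / (1 - beta**t)   # bias correction for the warm-up phase
    W = W - alpha * U_corrected       # parameter update
```

Note that only the single scalar `U` is carried between iterations, which is the memory-footprint point made above.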
Root Mean Square Propagation (RMSProp)
The iterations of the algorithm are modified as follows: $$S_{dw}^{(t)}=\beta_2 S_{dw}^{(t-1)} +(1- \beta_2 ) \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2$$ The update of the parameters is as follows: $$W^{(t+1)}=W^{(t)}-\alpha \, \dfrac{\partial L}{\partial W^{(t)}} \frac{1}{\sqrt{S_{dw}^{(t)}}}$$
To prevent a division by zero when $S_{dw}^{(t)}=0$, it is possible to add $\epsilon=10^{-8}$: $$\frac{1}{\sqrt{S_{dw}^{(t)}+\epsilon}}$$
The square applied to the gradient emphasizes large gradient values: $ \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2$ gets very large, so $S_{dw}^{(t)}$ becomes very large too. Consequently, in the update rule, a large gradient value $\dfrac{\partial L}{\partial W^{(t)}}$ is divided by a large value of $\sqrt{S_{dw}^{(t)}}$. This damps large variations of the gradient.
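A sketch of the RMSProp update; $\beta_2 = 0.9$, $\alpha$, and the gradient sequence (with a couple of deliberately large values) are assumptions for illustration:

```python
import numpy as np

# Made-up gradient sequence with occasional large values.
grads = np.array([0.1, 5.0, 0.2, -4.0, 0.1])
beta2, alpha, eps = 0.9, 0.01, 1e-8

S = 0.0
W = 1.0
for g in grads:
    S = beta2 * S + (1 - beta2) * g**2       # running average of squared gradients
    W = W - alpha * g / np.sqrt(S + eps)     # a large g is divided by a large sqrt(S)
```

Because $S$ grows with $g^2$, the effective step $g/\sqrt{S}$ stays bounded even when the raw gradient spikes.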
ADAM = Adaptive Moment Estimation
The iterations of the algorithm are modified as follows: $$U_{dw}^{(t)}=\beta_1 U_{dw}^{(t-1)} +(1- \beta_1 ) \dfrac{\partial L}{\partial W^{(t)}} \to \text{ this is Momentum}$$ $$S_{dw}^{(t)}=\beta_2 S_{dw}^{(t-1)} +(1- \beta_2 ) \big\{ \dfrac{\partial L}{\partial W^{(t)}} \big\}^2 \to \text{ this is RMSProp}$$
The update of the parameters is as follows: $$W^{(t+1)}=W^{(t)}-\alpha \, \frac{U_{dw}^{(t)}}{\sqrt{S_{dw}^{(t)}+\epsilon}}$$
It is also possible to apply bias correction on $U_{dw}^{(t)}$ and $S_{dw}^{(t)}$.
ADAM combines Momentum and RMSProp. Typical default values are:
$$\beta_1 = 0.9$$ $$\beta_2 = 0.999$$ $$\epsilon=10^{-8}$$
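A sketch of an ADAM-style loop with the default values above, including bias correction on both averages; the loss is a made-up quadratic with gradient $\partial L/\partial W = 2(W-3)$, whose minimum is at $W=3$:

```python
import numpy as np

beta1, beta2, eps, alpha = 0.9, 0.999, 1e-8, 0.1

W, U, S = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (W - 3.0)                       # dL/dW of the toy quadratic loss
    U = beta1 * U + (1 - beta1) * g           # Momentum term
    S = beta2 * S + (1 - beta2) * g**2        # RMSProp term
    U_hat = U / (1 - beta1**t)                # bias corrections
    S_hat = S / (1 - beta2**t)
    W = W - alpha * U_hat / np.sqrt(S_hat + eps)

print(W)  # approaches the minimizer W = 3.0
```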
If $\alpha$ is too high, the method will not converge and will keep oscillating. Conversely, if $\alpha$ is too low, convergence will be slow.
A compromise is to use a large learning rate at the beginning and then, after some iterations, reduce it to obtain smaller oscillations in a tighter region of the solution space: $$\alpha=\frac{1}{1+\text{decay} \cdot t}\,\alpha_0$$ with, for example: $$\alpha_0=0.1$$ $$\text{decay}=1$$
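The schedule above, with the example values $\alpha_0 = 0.1$ and $\text{decay} = 1$:

```python
alpha_0, decay = 0.1, 1.0

def alpha_at(t):
    # alpha = alpha_0 / (1 + decay * t)
    return alpha_0 / (1 + decay * t)

print(alpha_at(0), alpha_at(1), alpha_at(9))  # 0.1 0.05 0.01
```

The learning rate is halved after one iteration and divided by ten after nine, shrinking the step size as the parameters settle near a solution.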