# Optimizers

## Adadelta

Be the first to contribute!

## Adagrad

Be the first to contribute!

## Adam

Adaptive Moment Estimation (Adam) combines ideas from both RMSProp and Momentum. It computes adaptive learning rates for each parameter and works as follows.

• First, it computes the exponentially weighted average of past gradients ($$v_{dW}$$).
• Second, it computes the exponentially weighted average of the squares of past gradients ($$s_{dW}$$).
• Third, because both averages start at zero they are biased towards zero during the first steps, so a bias correction is applied ($$v_{dW}^{corrected}$$, $$s_{dW}^{corrected}$$).
• Lastly, the parameters are updated using the information from the calculated averages.

$\begin{split}v_{dW} = \beta_1 v_{dW} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W } \\ s_{dW} = \beta_2 s_{dW} + (1 - \beta_2) \left(\frac{\partial \mathcal{J} }{\partial W }\right)^2 \\ v^{corrected}_{dW} = \frac{v_{dW}}{1 - (\beta_1)^t} \\ s^{corrected}_{dW} = \frac{s_{dW}}{1 - (\beta_2)^t} \\ W = W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}\end{split}$

Note

• $$v_{dW}$$ - the exponentially weighted average of past gradients
• $$s_{dW}$$ - the exponentially weighted average of past squares of gradients
• $$\beta_1$$ - decay rate of the gradient average (hyperparameter to be tuned)
• $$\beta_2$$ - decay rate of the squared-gradient average (hyperparameter to be tuned)
• $$t$$ - the current update step, starting at 1
• $$\frac{\partial \mathcal{J} }{ \partial W }$$ - cost gradient with respect to the current layer's weights
• $$W$$ - the weight matrix (parameter to be updated)
• $$\alpha$$ - the learning rate
• $$\varepsilon$$ - very small value to avoid dividing by zero
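
The update rule can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `adam_step`, the default hyperparameter values, and the toy cost $$\mathcal{J}(W) = \sum W^2$$ are assumptions made for the example. The second bias correction uses $$\beta_2$$, matching the average it corrects.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count. Returns the new (W, v, s)."""
    v = beta1 * v + (1 - beta1) * dW         # weighted average of past gradients
    s = beta2 * s + (1 - beta2) * dW ** 2    # weighted average of past squared gradients
    v_corrected = v / (1 - beta1 ** t)       # bias correction for v
    s_corrected = s / (1 - beta2 ** t)       # bias correction for s (uses beta2)
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return W, v, s

# Toy problem (assumed for illustration): J(W) = sum(W**2), so dJ/dW = 2 * W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
for t in range(1, 2001):
    W, v, s = adam_step(W, 2 * W, v, s, t, alpha=0.01)
print(W)  # should end up close to the minimum at zero
```

On the very first step the corrected averages equal the gradient and its square, so each weight moves by roughly $$\alpha$$ regardless of the gradient's magnitude.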

## Conjugate Gradients

Be the first to contribute!

## BFGS

Be the first to contribute!

## Momentum

Used in conjunction with Stochastic Gradient Descent (SGD) or Mini-Batch Gradient Descent, Momentum takes past gradients into account to smooth out the update. This is done through the variable $$v_{dW}$$, an exponentially weighted average of the gradients from previous steps. Averaging damps oscillations between successive updates and leads to faster convergence.

$\begin{split}v_{dW} = \beta v_{dW} + (1 - \beta) \frac{\partial \mathcal{J} }{ \partial W } \\ W = W - \alpha v_{dW}\end{split}$

Note

• $$v_{dW}$$ - the exponentially weighted average of past gradients
• $$\frac{\partial \mathcal{J} }{ \partial W }$$ - cost gradient with respect to current layer weight tensor
• $$W$$ - weight tensor
• $$\beta$$ - hyperparameter to be tuned
• $$\alpha$$ - the learning rate
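
As a minimal NumPy sketch of these two equations (the function name `momentum_step`, the default values of $$\alpha$$ and $$\beta$$, and the toy cost $$\mathcal{J}(W) = \sum W^2$$ are assumptions made for illustration):

```python
import numpy as np

def momentum_step(W, dW, v, alpha=0.1, beta=0.9):
    """One momentum update. Returns the new (W, v)."""
    v = beta * v + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v                # step along the averaged gradient
    return W, v

# Toy problem (assumed for illustration): J(W) = sum(W**2), so dJ/dW = 2 * W.
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(200):
    W, v = momentum_step(W, 2 * W, v)
print(W)  # should end up close to the minimum at zero
```

Because $$v_{dW}$$ starts at zero, the first few updates are smaller than a plain gradient step with the same $$\alpha$$; the average builds up as more gradients are seen.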

## Nesterov Momentum

Be the first to contribute!

## Newton’s Method

Be the first to contribute!

## RMSProp

Another adaptive learning rate optimization algorithm, Root Mean Square Prop (RMSProp) keeps an exponentially weighted average of the squares of past gradients. It then divides each gradient step by the square root of this average, which shrinks updates for parameters whose gradients are consistently large and speeds up convergence.

$\begin{split}s_{dW} = \beta s_{dW} + (1 - \beta) \left(\frac{\partial \mathcal{J} }{\partial W }\right)^2 \\ W = W - \alpha \frac{\frac{\partial \mathcal{J} }{\partial W }}{\sqrt{s_{dW}} + \varepsilon}\end{split}$

Note

• $$s_{dW}$$ - the exponentially weighted average of past squares of gradients
• $$\frac{\partial \mathcal{J} }{\partial W }$$ - cost gradient with respect to current layer weight tensor
• $$W$$ - weight tensor
• $$\beta$$ - hyperparameter to be tuned
• $$\alpha$$ - the learning rate
• $$\varepsilon$$ - very small value to avoid dividing by zero
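
A minimal NumPy sketch of the RMSProp update, using the uncorrected average $$s_{dW}$$ (no bias correction is defined in this section). The function name `rmsprop_step`, the default hyperparameter values, and the toy cost $$\mathcal{J}(W) = \sum W^2$$ are assumptions made for illustration:

```python
import numpy as np

def rmsprop_step(W, dW, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update. Returns the new (W, s)."""
    s = beta * s + (1 - beta) * dW ** 2        # weighted average of squared gradients
    W = W - alpha * dW / (np.sqrt(s) + eps)    # scale the step by the root of that average
    return W, s

# Toy problem (assumed for illustration): J(W) = sum(W**2), so dJ/dW = 2 * W.
W = np.array([1.0, -2.0])
s = np.zeros_like(W)
for _ in range(1000):
    W, s = rmsprop_step(W, 2 * W, s)
print(W)  # should end up close to the minimum at zero
```

Since the gradient is divided by a running estimate of its own magnitude, both weights move at a similar pace even though their gradients differ by a factor of two.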

## SGD

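Stochastic Gradient Descent (SGD) updates the parameters using the gradient of the cost on a small random batch of training examples instead of the full dataset.

    import numpy as np

    def SGD(data, batch_size, lr):
        """Run one epoch of mini-batch SGD; `data` is assumed to be a list of (x, y) pairs."""
        N = len(data)
        np.random.shuffle(data)  # shuffle in place so the batches differ between epochs
        mini_batches = [data[i:i + batch_size]
                        for i in range(0, N, batch_size)]
        for batch in mini_batches:
            X, y = zip(*batch)  # split the pairs into inputs and targets
            backprop(X, y, lr)  # defined elsewhere: computes gradients and updates the weights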
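
The function above relies on a `backprop` helper that is not shown here. As a self-contained sketch of the same loop, the example below applies mini-batch SGD to linear regression with a mean squared error cost, where the gradient can be written out directly. The function name `sgd_linear_regression`, its default arguments, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=16, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w + b with mini-batch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of the current mini-batch
            Xb, yb = X[idx], y[idx]
            err = Xb @ w + b - yb                  # prediction error on the mini-batch
            grad_w = 2 * Xb.T @ err / len(idx)     # gradient of the MSE with respect to w
            grad_b = 2 * err.mean()                # gradient of the MSE with respect to b
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data (assumed for illustration): y = 3*x0 - 2*x1 + 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0
w, b = sgd_linear_regression(X, y)
print(w, b)  # should be close to the generating coefficients
```

Shuffling indices rather than the data itself keeps inputs and targets aligned and handles a final batch that is smaller than `batch_size`.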


References