
Understanding the Adam Optimizer: An Efficient Optimization Algorithm

Adam Optimizer

Recently I have been learning about something new to me called the Adam Optimizer. It feels like a new topic for me as an electrical engineering student, even though it is closely related to my field.

I am studying it for my final assignment, at the suggestion of my lecturer Dr. Eng. Lukmanul Hakim, S.T., M.Sc. It may be a little late for me to get my bachelor's degree, but I will work hard for it. Back to Adam: in this post I will write about the Adam Optimizer as I have learned it.

What is the Adam Optimizer?

Adam, short for Adaptive Moment Estimation, is an optimization algorithm widely used for training machine learning models. It was introduced in 2014 by Diederik P. Kingma (University of Amsterdam) and Jimmy Lei Ba (University of Toronto) in their paper titled "Adam: A Method for Stochastic Optimization".

The Adam Optimizer is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is designed to combine the advantages of two popular methods: AdaGrad and RMSProp.

Adam combines the advantages of AdaGrad, which provides adaptive learning rates for each parameter, with RMSProp's stable, non-decaying learning rates. Because Adam only requires first-order gradients, it is computationally efficient and uses relatively little memory.
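
To make that comparison concrete, here is a minimal sketch in Python with NumPy of the per-parameter scaling each of those two methods uses (the gradient values and hyperparameters are made up for illustration, not taken from any particular model): AdaGrad divides by the square root of an ever-growing sum of squared gradients, so its effective learning rate keeps shrinking, while RMSProp replaces that sum with an exponential moving average so the rate does not decay toward zero.

import numpy as np

g = np.array([0.4, -1.2, 0.7])   # hypothetical gradient for three parameters
alpha, eps = 0.01, 1e-8          # illustrative step size and small constant

# AdaGrad: accumulate the sum of squared gradients; the divisor only grows,
# so the effective learning rate keeps decaying over time.
G = np.zeros(3)
G += g**2
step_adagrad = alpha * g / (np.sqrt(G) + eps)

# RMSProp: use an exponential moving average of squared gradients instead,
# so the per-parameter learning rate adapts without decaying toward zero.
rho = 0.9
E = np.zeros(3)
E = rho * E + (1 - rho) * g**2
step_rmsprop = alpha * g / (np.sqrt(E) + eps)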

How the Adam Optimizer Works

As we know from the explanation above, the Adam Optimizer is a type of stochastic gradient descent (SGD), but it is a lot smarter and more efficient. While traditional SGD updates the model parameters using just the current gradient, Adam goes further by keeping track of two things:

  1. The average of past gradients (this is like momentum, helping the optimizer move in the right direction).
  2. The average of past squared gradients (which helps adjust the step size for each parameter).

These two pieces of information are updated every step and are then corrected (bias correction) to make sure they're accurate in the early stages of training. Finally, Adam updates the parameters by combining these corrected values in a way that results in smooth, adaptive, and reliable learning.
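
The short sketch below, in Python with NumPy and made-up parameter and gradient values, contrasts a plain SGD step with the two running averages Adam keeps for each parameter; the full update, including the bias correction, is spelled out with the formulas that follow.

import numpy as np

g = np.array([0.4, -1.2, 0.7])   # hypothetical current gradient

# Plain SGD: the update uses only the current gradient.
theta = np.zeros(3)
alpha = 0.01
theta_sgd = theta - alpha * g

# Adam additionally keeps two exponential moving averages per parameter:
beta1, beta2 = 0.9, 0.999   # decay rates (the paper's suggested defaults)
m = np.zeros(3)             # 1. average of past gradients (momentum-like)
v = np.zeros(3)             # 2. average of past squared gradients (step-size scaling)

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2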

With the formulas below, it will be easier to understand how the Adam Optimizer works.

Adam Optimizer Algorithm Formula

1. Initialization

m_0 = 0 (initialize the first moment vector)
v_0 = 0 (initialize the second moment vector)
t = 0 (initialize the timestep)

2. Compute Gradients

t = t + 1 (increment the timestep)

g_t = ∇_θ f_t(θ_{t-1})

3. Update biased first moment estimate

m_t = β_1 · m_{t-1} + (1 - β_1) · g_t

4. Update biased second moment estimate

v_t = β_2 · v_{t-1} + (1 - β_2) · g_t²

5. Correct the bias in the moments

m̂_t = m_t/(1 - β_1^t)

v̂_t = v_t/(1 - β_2^t)

6. Update parameters

θ_{t+1} = θ_t - α · m̂_t/(√v̂_t + ϵ)
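
Putting steps 1-6 together, the sketch below is one possible implementation of the full update loop in Python with NumPy, used here to minimize a simple quadratic function f(θ) = θ². The objective function, the number of iterations, and the starting point are illustrative choices; the hyperparameter values are the defaults suggested in the Adam paper.

import numpy as np

# Illustrative objective: f(theta) = theta^2, whose gradient is 2*theta.
def grad(theta):
    return 2.0 * theta

# Hyperparameters (defaults suggested in the Adam paper).
alpha = 0.001   # step size
beta1 = 0.9     # decay rate for the first moment
beta2 = 0.999   # decay rate for the second moment
eps = 1e-8      # small constant to avoid division by zero

theta = np.array([1.0])   # arbitrary starting point

# 1. Initialization
m = np.zeros_like(theta)  # first moment vector
v = np.zeros_like(theta)  # second moment vector
t = 0                     # timestep

for _ in range(5000):
    # 2. Increment the timestep and compute the gradient at the previous parameters
    t += 1
    g = grad(theta)
    # 3. Update biased first moment estimate
    m = beta1 * m + (1 - beta1) * g
    # 4. Update biased second moment estimate
    v = beta2 * v + (1 - beta2) * g**2
    # 5. Correct the bias in the moments
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # 6. Update parameters
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # should end up close to the minimizer at 0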

Advantages of Using the Adam Optimizer

Some reasons why the Adam Optimizer is a strong choice of optimization algorithm are its relatively fast convergence, its adaptive learning rates, and its memory and computational efficiency.
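
In practice you rarely implement the update loop yourself. As a small illustration, assuming PyTorch is available, the sketch below shows the common way Adam is used through torch.optim.Adam on a tiny made-up regression problem; the betas and eps values are the paper's defaults, while the learning rate and data are chosen just for this example.

import torch

# Made-up data: learn y = 3x with a single linear layer.
x = torch.randn(64, 1)
y = 3.0 * x

model = torch.nn.Linear(1, 1)
# betas and eps are the paper's defaults; lr is raised above the 0.001 default
# so this toy problem converges within a couple of thousand steps.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-8)

for _ in range(2000):
    optimizer.zero_grad()                              # clear old gradients
    loss = torch.nn.functional.mse_loss(model(x), y)   # current objective value
    loss.backward()                                    # first-order gradients only
    optimizer.step()                                   # Adam update (steps 1-6 above)

print(model.weight.item())  # should end up close to 3.0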

Sources:

  1. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A Method for Stochastic Optimization." arXiv:1412.6980.
  2. Sebastian Ruder. "An Overview of Gradient Descent Optimization Algorithms." arXiv:1609.04747.
  3. Jason Brownlee. "Gentle Introduction to the Adam Optimization Algorithm for Deep Learning." machinelearningmastery.com.
  4. Cristian Leo. "The Math Behind the Adam Optimizer." towardsdatascience.com.
