In the previous chapter we trained a network with plain gradient descent: compute the gradient, take a small step in the opposite direction, repeat. It works, and for a single neuron it works fine. But once you scale up to a real network with millions of weights and noisy mini-batches, plain gradient descent becomes slow and fragile. Optimizers are the algorithms that fix this. They take the raw gradients from backpropagation and turn them into smarter weight updates.
Plain gradient descent, usually called vanilla SGD once it runs on mini-batches, has two practical weaknesses. The first is that it uses one fixed learning rate for every weight in the network and keeps that rate the same from the first step to the last. Some weights need large adjustments early on and tiny ones later; a single global rate cannot do both well.
The second problem is the shape of the loss surface. Real loss landscapes are full of long narrow valleys, flat plateaus, and saddle points. In a narrow valley, plain SGD tends to bounce from one wall to the other instead of moving along the floor toward the minimum. On a flat plateau it barely moves at all because the gradient is tiny. Add the noise that comes from updating on small mini-batches and the path to the minimum can be slow and jittery.
Every optimizer below is an answer to one or both of these problems.
The first improvement is to stop treating each step as independent. Instead of moving purely on the current gradient, momentum keeps a running average of recent gradients and moves along that. The usual analogy is a ball rolling downhill: it builds up speed in a consistent direction and isn't thrown off course by every small bump.
In practice you keep a velocity term that blends the old velocity with the new gradient:
velocity = (momentum * velocity) - (learning_rate * gradient)
weight = weight + velocity
The momentum coefficient is typically around 0.9. The effect is that consistent gradient directions accumulate and speed the descent, while the back-and-forth oscillation in a narrow valley cancels itself out. SGD with momentum is still widely used today, especially in computer vision, where it often generalizes a little better than the fancier methods.
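To make the velocity update concrete, here is a minimal sketch in plain Python. The valley-shaped loss `0.5 * (x**2 + 50 * y**2)`, the starting point, the learning rate of 0.035, and the helper names `grad`, `loss`, and `run` are all assumptions chosen for illustration; they are not taken from any library.

```python
def grad(x, y):
    # Gradient of the toy loss 0.5 * (x**2 + 50 * y**2):
    # shallow along x (the valley floor), steep along y (the walls).
    return x, 50.0 * y


def loss(x, y):
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def run(momentum, learning_rate=0.035, steps=100):
    # Start away from the minimum at (0, 0) with zero velocity.
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Same rule as in the text: blend old velocity with the new gradient.
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x, y = x + vx, y + vy
    return loss(x, y)


plain_loss = run(momentum=0.0)     # momentum = 0 reduces to plain gradient descent
momentum_loss = run(momentum=0.9)
print("plain:", plain_loss, "with momentum:", momentum_loss)
```

With the same learning rate and step budget, the momentum run ends at a lower loss on this toy problem, mostly because the velocity keeps building along the shallow direction where plain gradient descent crawls. Real loss surfaces are far messier, so treat this as an intuition aid only.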
Momentum solves the oscillation problem but still uses one learning rate for everything. RMSprop attacks that directly. It keeps a running average of the squared gradients for each weight and divides the update by the square root of that average. The result is that a weight which has been seeing large gradients gets a smaller effective step, and a weight with consistently small gradients gets a larger one.
This per-parameter scaling is what makes RMSprop good at handling the uneven loss surfaces that trip up plain SGD, and it was historically a popular choice for recurrent networks like the ones we cover in the RNN chapter, where gradient magnitudes vary a lot across time steps.
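The per-parameter scaling can be sketched in a few lines of plain Python. The function name `rmsprop_step`, the decay factor `rho=0.9`, and the small constant `eps` are assumptions for this illustration; library implementations differ in details such as where exactly `eps` is added.

```python
import math


def rmsprop_step(weights, grads, avg_sq, learning_rate=1e-3, rho=0.9, eps=1e-7):
    """One RMSprop-style update over parallel lists of weights and gradients."""
    new_weights, new_avg_sq = [], []
    for w, g, s in zip(weights, grads, avg_sq):
        # Running average of the squared gradient for this weight.
        s = rho * s + (1.0 - rho) * g * g
        # Divide the step by the root of that average.
        w = w - learning_rate * g / (math.sqrt(s) + eps)
        new_weights.append(w)
        new_avg_sq.append(s)
    return new_weights, new_avg_sq


# Two weights whose gradients differ by a factor of 10,000.
weights = [1.0, 1.0]
avg_sq = [0.0, 0.0]
grads = [100.0, 0.01]
for _ in range(10):
    weights, avg_sq = rmsprop_step(weights, grads, avg_sq)

moved = [1.0 - w for w in weights]
print(moved)
```

Even though one gradient is vastly larger than the other, both weights travel almost the same distance, because each step is normalized by that weight's own gradient history.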
Adam is what you get when you combine the two ideas above. It keeps a momentum-style running average of the gradients (so it has direction and speed) and an RMSprop-style running average of the squared gradients (so each weight gets its own adapted learning rate). Two coefficients control how much history each average keeps, conventionally written as beta1 (around 0.9) and beta2 (around 0.999), and you rarely need to touch them.
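Putting the two running averages together gives roughly the following update, shown here as a hedged sketch for a single scalar weight. The function name `adam_step` and the toy loss `(w - 3)**2` are assumptions for illustration. The sketch also includes Adam's bias-correction step, a detail not covered in the text above: because both averages start at zero, they are rescaled during the early steps.

```python
import math


def adam_step(w, g, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # RMSprop-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem: minimize (w - 3)**2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t, learning_rate=0.1)

print(w)
```

The learning rate of 0.1 is deliberately large so that this one-dimensional toy problem converges in a few hundred steps; it is not a recommendation for real networks.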
The reason Adam became the default optimizer for most deep learning work is simple: it usually trains quickly and reliably with very little tuning. If you are starting a new model and don't have a strong reason to do otherwise, begin with Adam at a learning rate of 0.001 and only change course if training misbehaves.
from tensorflow.keras import optimizers
# The sensible default for most new models
# (assumes `model` is a Keras model you have already built)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
One variant worth knowing is AdamW, which handles weight decay (a regularization technique that gently shrinks weights to prevent overfitting) more cleanly than the original Adam did. Instead of folding the decay term into the gradient, where Adam's per-weight scaling distorts it, AdamW applies the shrinkage directly to the weights as a separate step. AdamW is now the standard optimizer for training Transformer models, including the large language models behind tools you may already be building with, such as the embedding and generation models discussed in the RAG field manual. If you train or fine-tune anything Transformer-based, AdamW is the one to use.
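The difference between the two ways of applying weight decay is easiest to see with the gradient set to zero, so that only the decay acts. The sketch below is an illustration under stated assumptions: `adam_like_step` and its `decoupled` flag are hypothetical names, and real implementations may order the operations slightly differently.

```python
import math


def adam_like_step(w, g, m, v, t, decoupled, learning_rate=1e-3,
                   weight_decay=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar update with weight decay applied in one of two ways."""
    decay = 0.0
    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive update.
        decay = learning_rate * weight_decay * w
    else:
        # Original style: add the decay to the gradient, so it gets
        # rescaled by the adaptive step along with everything else.
        g = g + weight_decay * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps) - decay
    return w, m, v


# Zero loss gradient: any movement comes from weight decay alone.
coupled_w, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, 1, decoupled=False)
decoupled_w, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, 1, decoupled=True)
print("coupled:", coupled_w, "decoupled:", decoupled_w)
```

In the coupled case the first step is roughly a full `learning_rate` in size no matter how small `weight_decay` is, because the adaptive scaling normalizes it away. In the decoupled case the weight shrinks by `learning_rate * weight_decay * w`, which is what the setting is meant to control.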
Optimizers adapt the step size automatically to some degree, but it usually still helps to lower the overall learning rate as training progresses. Early on you want larger steps to make fast progress; later you want smaller steps to settle precisely into a minimum. This is done with a learning rate schedule, which might decay the rate on a fixed timetable, drop it whenever progress stalls, or follow a cosine curve down to near zero. Transformer training often adds a short warmup at the very start, where the rate ramps up from a tiny value before the schedule begins, which helps keep those volatile first steps under control.
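As one possible shape for such a schedule, here is a hedged sketch of linear warmup followed by cosine decay. The function name `lr_at_step` and all default values (`peak_lr`, `warmup_steps`, `min_lr`) are assumptions for illustration, not settings from any particular framework.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from a small value up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half a cosine wave from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few points in a 1,000-step run.
for s in (0, 50, 99, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```

In a training loop you would compute this value each step and hand it to the optimizer; most frameworks also ship built-in schedule objects that do the same job.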
For almost any new project, start with Adam at 0.001 and get a baseline working before you optimize anything else. If you are training a vision model and care about squeezing out the best possible generalization, it's worth comparing against SGD with momentum, which sometimes edges ahead on those tasks. If you are working with Transformers or fine-tuning a pretrained language model, use AdamW. RMSprop is still a reasonable pick for some recurrent models, though Adam covers most of those cases too.
The honest summary is that the optimizer is rarely the thing standing between you and a working model. Get a reasonable one running, then spend your effort on data quality, architecture, and the learning rate, which matters far more than the choice between these algorithms.
Trying a different optimizer is a one-line change, which makes it easy to compare them on your own problem:
from tensorflow.keras import optimizers
# Pick ONE of the options below; each assignment replaces the previous one.
# SGD with momentum
opt = optimizers.SGD(learning_rate=0.01, momentum=0.9)
# RMSprop
opt = optimizers.RMSprop(learning_rate=1e-3)
# Adam (default choice)
opt = optimizers.Adam(learning_rate=1e-3)
# AdamW (for Transformers / fine-tuning)
opt = optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-2)
# Assumes `model` is a Keras model you have already built
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
That completes the machinery of learning. You can now build a network, give it the right activations, and train it properly. From here the series turns to the architectures that made deep learning famous, starting with the convolutional networks that power almost everything in computer vision.