Across this series we have built networks, chosen activations, trained them with backpropagation, picked optimizers, explored architectures, and reused pretrained models. Through all of it, a handful of decisions kept coming up that the model does not make for you: how fast it should learn, how big it should be, how hard you should push back against overfitting. These are the hyperparameters, and getting them roughly right is usually what separates a model that works from one that doesn't.
It helps to be clear on the distinction. Parameters are the weights and biases inside the network, and the whole point of training is that the model learns these on its own. Hyperparameters are the settings you choose before and around training, the ones the learning process never touches. The learning rate, the number of layers, the number of neurons per layer, the batch size, how much dropout to apply, and how many epochs to train for are all hyperparameters. The model cannot discover them by gradient descent, so the job falls to you.
If you only tune one thing, tune the learning rate. As we saw in the chapters on backpropagation and optimizers, a rate that is too high makes training diverge and a rate that is too low makes it crawl or stall. A good starting point with Adam is 0.001, which is also the Keras default, and a useful habit is to try a few values spaced a factor of ten apart, like 0.01, 0.001, and 0.0001, and watch how the loss behaves in the first few epochs before committing.
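To make the three regimes concrete, here is a minimal sketch on a toy problem. It assumes plain gradient descent on a one-parameter quadratic loss rather than Adam on a real network, so the specific rates (1.1, 0.1, 0.0001) are chosen for this toy only and are not recommendations. The function name `loss_curve` is made up for illustration.

```python
def loss_curve(learning_rate, steps=20, start=0.0, target=3.0):
    """Run plain gradient descent on (w - target)**2 and record the loss.

    Toy stand-in for a training run: one weight, one quadratic loss.
    """
    w = start
    losses = []
    for _ in range(steps):
        losses.append((w - target) ** 2)
        gradient = 2.0 * (w - target)   # derivative of the loss w.r.t. w
        w -= learning_rate * gradient   # the gradient descent update
    losses.append((w - target) ** 2)    # loss after the final update
    return losses


# Too high, reasonable, and too low for this particular toy loss.
for lr in (1.1, 0.1, 0.0001):
    curve = loss_curve(lr)
    print(f"lr={lr}: first loss {curve[0]:.3f}, last loss {curve[-1]:.3f}")
```

The same pattern applies to a real model: run each candidate rate for a few epochs, compare the early loss curves, and keep the one that falls quickly without blowing up.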
Most hyperparameter decisions come down to balancing two failure modes. Underfitting is when the model is too simple or undertrained to capture the patterns in the data, so it performs poorly even on the examples it was trained on. Overfitting is the opposite: the model memorises the training data, including its noise, and then fails on anything new.
You cannot see either problem if you only look at training accuracy, which is why you always hold out a separate validation set that the model never trains on. When training accuracy keeps climbing while validation accuracy stalls or drops, you are overfitting. When both are low, you are underfitting. Watching the gap between the two is the single most informative thing you can do while tuning.
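The rule of thumb above can be written down as a small helper. This is only a sketch: the function name `diagnose` and the thresholds `low` and `max_gap` are assumptions for illustration, and sensible values depend entirely on your task. With Keras, the two lists would typically come from the history object returned by `model.fit`, for example `history.history['accuracy']` and `history.history['val_accuracy']` when the model is compiled with the accuracy metric.

```python
def diagnose(train_acc, val_acc, low=0.7, max_gap=0.1):
    """Label a run from its final training and validation accuracy.

    train_acc and val_acc are per-epoch accuracy lists. The thresholds
    are illustrative assumptions, not universal constants.
    """
    final_train = train_acc[-1]
    final_val = val_acc[-1]
    gap = final_train - final_val

    if final_train < low and final_val < low:
        return "underfitting"   # poor even on the data it trained on
    if gap > max_gap:
        return "overfitting"    # training keeps climbing, validation does not
    return "ok"


print(diagnose([0.60, 0.80, 0.99], [0.60, 0.75, 0.72]))
print(diagnose([0.50, 0.55, 0.58], [0.50, 0.54, 0.56]))
print(diagnose([0.70, 0.85, 0.90], [0.70, 0.83, 0.87]))
```

A helper like this only looks at the last epoch. Plotting both curves over time is still the better habit, because the point where they start to separate tells you when overfitting began.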
You do not need an elaborate method to start. Tuning a few values by hand, guided by the train-validation gap, gets you a long way. When you want to be more systematic, two automatic approaches are common. Grid search tries every combination from lists you specify, which is thorough but expensive, because the number of runs multiplies with every hyperparameter you add. Random search samples combinations at random and, perhaps surprisingly, usually finds good settings faster than grid search for the same budget. The reason is that only a few hyperparameters tend to matter much, and every random trial tests a fresh value of each of them, whereas a grid keeps revisiting the same handful of values. Beyond those, tools that use smarter strategies, such as Bayesian optimisation, can automate the whole process, but they are worth reaching for only after the simpler approaches stop being enough.
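Both searches fit in a few lines of standard-library Python. The sketch below is illustrative only: the search space, the helper names, and especially `validation_score` are assumptions. In practice `validation_score` would train a model with the given settings and return its validation accuracy; here it is a made-up formula so the example runs instantly.

```python
import itertools
import math
import random

# Hypothetical search space; names and values are for illustration.
space = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "dropout": [0.0, 0.2, 0.5],
    "units": [32, 64, 128],
}


def grid_search(space):
    """Yield every combination of the listed values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_search(space, n_trials, seed=0):
    """Yield n_trials randomly sampled configurations."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        # Sample the learning rate on a log scale between 1e-4 and 1e-2,
        # so each trial tests a value the grid would never visit.
        config["learning_rate"] = 10 ** rng.uniform(-4, -2)
        yield config


def validation_score(config):
    """Stand-in for 'train a model and return validation accuracy'."""
    lr_penalty = 0.1 * abs(math.log10(config["learning_rate"]) + 3)
    dropout_penalty = 0.2 * abs(config["dropout"] - 0.2)
    units_penalty = abs(config["units"] - 64) / 1000
    return 1.0 - lr_penalty - dropout_penalty - units_penalty


best_grid = max(grid_search(space), key=validation_score)
best_random = max(random_search(space, n_trials=10), key=validation_score)
print("grid search, 27 runs:", best_grid)
print("random search, 10 runs:", best_random)
```

The grid needs 27 runs to cover three values of three settings, and adding a fourth setting with three values would triple that. The random search runs exactly as many trials as you can afford, whatever the size of the space.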
A reliable order of operations looks like this. Get a baseline model training at all, even a mediocre one, so you have something to improve. Tune the learning rate first, since it has the largest effect. If the model is underfitting, increase its capacity or train longer. If it is overfitting, add dropout or weight decay and lean on early stopping. Change one thing at a time so you can tell what actually helped, and let the validation set be your judge throughout.
Two callbacks handle a lot of this automatically:
```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # stop when validation loss stops improving, keep the best weights
    EarlyStopping(monitor='val_loss', patience=5,
                  restore_best_weights=True),
    # drop the learning rate when progress plateaus
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
]

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100,
          callbacks=callbacks)
```
Between early stopping and an automatic learning rate reduction, you can set a generous epoch count and let training find its own sensible stopping point.
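If the `patience` setting feels abstract, the logic behind it is simple enough to sketch in plain Python. This is a simplified illustration of the idea, not Keras's actual implementation, and the function name `early_stopping_epoch` is made up. It takes a list of per-epoch validation losses and reports the epoch at which training would stop and the epoch whose weights would be kept.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a list of validation losses.

    Training stops once the loss has failed to improve for `patience`
    epochs in a row. Epochs are counted from zero.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.9, 0.95]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

In this made-up run the loss bottoms out at epoch 2 and training stops five epochs later. That is why restoring the best weights matters: without it, you would keep the weights from the final, worse epoch.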
That completes the journey from a single neuron to training, tuning, and adapting modern deep learning models. From here the path splits by interest. If images are your focus, the ideas here lead straight into computer vision, which builds on the CNN material. If language is yours, they lead into natural language processing, picking up where the Transformer chapter left off. And when you are ready to build real applications on top of large language models, the RAG field manual takes the embeddings and Transformer ideas from this series straight into production systems.