Overfitting and Underfitting in Models

When training machine learning models, the goal is a model that fits the training data well and still generalizes to data it has not seen. The two most common ways to miss that goal are overfitting and underfitting. Understanding both helps you build models that perform well on training and unseen data alike.


What You'll Learn

  • What overfitting and underfitting mean
  • How to detect them
  • Examples using Python
  • Strategies to prevent and correct these problems

What is Overfitting?

Overfitting happens when a model fits the training data too closely, learning its noise and random fluctuations along with the underlying pattern. It performs well on training data but poorly on new, unseen data.

Symptoms:

  • High accuracy on training data
  • Low accuracy on test/validation data

Visual Example:

A highly complex curve that passes through every training point but fails to predict test data properly.
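The symptoms above can be reproduced with a minimal sketch. The data here is an assumption made for illustration (a sine curve with Gaussian noise), and the model is a decision tree with no depth limit, which is free to memorize every training point.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data (an assumption): sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No max_depth: the tree keeps splitting until it reproduces the training targets.
deep_tree = DecisionTreeRegressor(random_state=0)
deep_tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, deep_tree.predict(X_train))
test_mse = mean_squared_error(y_test, deep_tree.predict(X_test))

print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```

A training error near zero next to a clearly larger test error is the gap described in the symptoms list.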


What is Underfitting?

Underfitting occurs when a model is too simple to capture the underlying pattern of the data. It performs poorly on both training and test data.

Symptoms:

  • Low training accuracy
  • Low test accuracy

Visual Example:

A straight line trying to fit a non-linear relationship, missing the trend entirely.
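The straight-line case can be sketched as follows. The quadratic dataset is an assumption chosen for illustration, and the score reported is R² (what `score` returns for scikit-learn regressors), where values near zero mean the model explains almost none of the variation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (an assumption): y depends on x squared.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight line cannot bend to follow the U-shaped relationship.
line = LinearRegression().fit(X_train, y_train)

train_r2 = line.score(X_train, y_train)
test_r2 = line.score(X_test, y_test)

print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2:  {test_r2:.3f}")
```

Both scores come out low, which matches the underfitting symptoms: the model is poor on the data it was trained on as well as on held-out data.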


Example in Python

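from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Generate a noisy linear dataset (noise is the standard deviation of the Gaussian noise)
X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple model: a straight line, which matches how this dataset was generated
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)

# Overfitting model: a deep tree that can memorize individual training points
tree_model = DecisionTreeRegressor(max_depth=20)
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)

# Evaluation on held-out test data
print("Linear Regression test MSE:", mean_squared_error(y_test, y_pred_linear))
print("Decision Tree test MSE (overfitting):", mean_squared_error(y_test, y_pred_tree))

Output:

Linear Regression test MSE: 234.45500969670806
Decision Tree test MSE (overfitting): 500.20712573958053

Note that make_regression produces a linear target, so the linear model is well matched to this dataset rather than underfitting it: its test MSE is close to the noise variance (15² = 225), which is about the best any model can do here. The deep tree's error is roughly double because it also fits the noise in the training points. A linear model underfits only when the true relationship is non-linear.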

How to Detect Overfitting and Underfitting

Training Error   Test Error   Diagnosis
High             High         Underfitting
Low              High         Overfitting
Low              Low          Good fit
How to Prevent Overfitting

  • Use simpler models (reduce depth, number of features)
  • Apply regularization (L1, L2 penalties)
  • Use cross-validation to choose model complexity and hyperparameters
  • Prune decision trees
  • Increase training data
  • Apply early stopping in iterative models
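Two of the strategies above, L2 regularization and cross-validation, can be sketched together. The dataset, the polynomial degree, and the `alpha` values are all assumptions picked for illustration, and `make_model` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative synthetic data (an assumption): a small, noisy sine sample.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)


def make_model(alpha):
    """Hypothetical helper: flexible polynomial model with an L2 penalty."""
    return make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )


cv_mse = {}      # alpha -> mean 5-fold cross-validated MSE
coef_norms = {}  # alpha -> size of the fitted coefficient vector

for alpha in (1e-3, 1.0, 100.0):
    model = make_model(alpha)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[alpha] = -scores.mean()

    model.fit(X, y)
    coef_norms[alpha] = float(np.linalg.norm(model.named_steps["ridge"].coef_))
    print(f"alpha={alpha:>7}  CV MSE={cv_mse[alpha]:.4f}  |coef|={coef_norms[alpha]:.3f}")
```

A larger `alpha` shrinks the coefficients, which limits how sharply the curve can bend. The cross-validated error is what indicates which penalty strength generalizes best, and too large a penalty can push the model toward underfitting.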

How to Prevent Underfitting

  • Use more complex models
  • Increase model capacity (more layers, features)
  • Reduce regularization
  • Improve feature engineering
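As a sketch of the feature-engineering route, adding a squared term lets a linear model follow a curved relationship. The quadratic dataset is an assumption made for illustration, and R² on the test split is used as the score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative synthetic data (an assumption): y depends on x squared.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Too simple: a straight line on the raw feature.
linear = LinearRegression().fit(X_train, y_train)

# More capacity: the same linear model, fed x and x^2 as features.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
quadratic_r2 = quadratic.score(X_test, y_test)

print(f"Linear test R^2:    {linear_r2:.3f}")
print(f"Quadratic test R^2: {quadratic_r2:.3f}")
```

The estimator is unchanged. Only the features grew, which is often the cheapest way to add capacity.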

Conclusion

Overfitting and underfitting are the two main ways a model fails to generalize. Striking the right balance between them produces models that perform well on unseen data. Monitoring training and validation metrics side by side throughout training is a reliable way to catch both problems early.
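The monitoring mentioned above can be sketched with a gradient boosting model, whose `staged_predict` method yields predictions after each boosting stage. The dataset and hyperparameters are assumptions for illustration. This version picks the best stage after training has finished rather than stopping during training.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (an assumption): sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# Error after each boosting stage, on training and validation data.
train_curve = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
val_curve = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]

# Stage with the lowest validation error (stages are 1-indexed).
best_n = int(np.argmin(val_curve)) + 1

print(f"Best number of stages: {best_n}")
print(f"Validation MSE at best stage:  {val_curve[best_n - 1]:.4f}")
print(f"Validation MSE at final stage: {val_curve[-1]:.4f}")
print(f"Training MSE at final stage:   {train_curve[-1]:.4f}")
```

If the training curve keeps falling while the validation curve flattens or rises, the extra stages are fitting noise. `GradientBoostingRegressor` also accepts `n_iter_no_change` and `validation_fraction` to stop during training instead.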