When training machine learning models, achieving a balance between accuracy and generalization is key. Two common issues that arise during this process are overfitting and underfitting. Understanding these concepts helps build models that perform well on both training and unseen data.
Overfitting happens when a model learns the training data too well, including noise and minor fluctuations. It performs well on training data but poorly on new, unseen data.
A highly complex curve that passes through every training point but fails to predict test data properly.
Underfitting occurs when a model is too simple to capture the underlying pattern of the data. It performs poorly on both training and test data.
A straight line trying to fit a non-linear relationship, missing the trend entirely.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
# Generate dataset
X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Underfitting model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X_test)
# Overfitting model
tree_model = DecisionTreeRegressor(max_depth=20)
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)
# Evaluation
print("Linear Regression MSE (Underfitting):", mean_squared_error(y_test, y_pred_linear))
print("Decision Tree MSE (Overfitting):", mean_squared_error(y_test, y_pred_tree))Output-
Linear Regression MSE (Underfitting): 234.45500969670806
Decision Tree MSE (Overfitting): 500.20712573958053| Observation | Training Error | Test Error | Diagnosis |
|---|---|---|---|
| High | High | Underfitting | |
| Low | High | Overfitting | |
| Low | Low | Good Fit |
Overfitting and underfitting are crucial problems in model development. Striking the right balance helps build models that generalize well to unseen data. Monitoring training and validation metrics throughout the training process is a good way to keep these issues in check.
Sign in to join the discussion and post comments.
Sign in