Decision Trees and Random Forests are powerful supervised learning algorithms used for both classification and regression tasks. They are easy to understand, interpret, and visualize, making them popular choices for real-world problems.
A Decision Tree is a flowchart-like tree structure where:
- each internal node tests a feature,
- each branch represents an outcome of that test, and
- each leaf node holds a class label (or a value, for regression).
The tree splits the data recursively based on the feature that provides the best separation using metrics like Gini Index or Information Gain (based on entropy).
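As a quick sketch of the Gini Index, here is how the impurity of the sample labels below can be computed by hand (the helper name `gini` is ours, purely for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

# The 'Buy' labels from the sample dataset below
labels = ['No', 'No', 'Yes', 'Yes', 'No']
print(gini(labels))  # ~0.48 = 1 - (3/5)^2 - (2/5)^2

# Splitting on Age <= 30 separates the classes perfectly, so the
# weighted impurity of the two children is 0 -- the best possible split.
left, right = ['No', 'No', 'No'], ['Yes', 'Yes']
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
print(weighted)  # 0.0
```

The tree greedily picks, at every node, the split with the lowest weighted child impurity (or, equivalently for Information Gain, the largest drop in entropy).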
We’ll classify whether a person will buy a product based on age and income.
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
# Sample dataset
data = {
'Age': [25, 30, 45, 35, 22],
'Income': [50000, 60000, 80000, 120000, 30000],
'Buy': ['No', 'No', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
# Feature and target
X = df[['Age', 'Income']]
y = df['Buy']
# Create and train model
model = DecisionTreeClassifier()
model.fit(X, y)
# Predict with column names to match training data
new_data = pd.DataFrame([[28, 55000]], columns=['Age', 'Income'])
print(model.predict(new_data))
Output:
['No']
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plot_tree(model, feature_names=['Age', 'Income'], class_names=['No', 'Yes'], filled=True)
plt.show()
A Random Forest is an ensemble of decision trees. It builds multiple decision trees using random subsets of the data and features, then averages their predictions (regression) or uses majority voting (classification).
This reduces overfitting and improves accuracy and generalization.
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# Create and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)
# Prediction with column names to avoid warning
new_data = pd.DataFrame([[28, 55000]], columns=['Age', 'Income'])
print(rf_model.predict(new_data))
Output:
['No']
| Decision Trees | Random Forests |
|---|---|
| Easy to interpret and visualize | More accurate than single trees |
| Can overfit easily | Resistant to overfitting |
| Fast training on small data | Slower and more resource-intensive |
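The overfitting trade-off in the table is easiest to see on a larger dataset. The sketch below uses synthetic data from `make_classification` (illustrative only, not the 5-row sample above); the single tree typically fits the training set perfectly while the forest generalizes better:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification problem with some noisy features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# An unpruned tree memorizes the training data (train accuracy 1.0)
# but usually scores lower on held-out data than the forest.
print("Tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("Forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```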
Evaluate them with the same metrics as any other classifier. Note that we score on the training data here, so perfect results are expected on such a tiny sample:
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))
print(classification_report(y, y_pred))
Output:
Accuracy: 1.0
              precision    recall  f1-score   support

          No       1.00      1.00      1.00         3
         Yes       1.00      1.00      1.00         2

    accuracy                           1.00         5
   macro avg       1.00      1.00      1.00         5
weighted avg       1.00      1.00      1.00         5

- Tune max_depth, min_samples_split, and n_estimators to improve performance.
- Use RandomForestClassifier with feature importance (model.feature_importances_) to rank inputs.
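As a minimal sketch of those two tips (the hyperparameter values here are illustrative, not tuned):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    'Age': [25, 30, 45, 35, 22],
    'Income': [50000, 60000, 80000, 120000, 30000],
    'Buy': ['No', 'No', 'Yes', 'Yes', 'No'],
})

# Constrain tree growth via max_depth / min_samples_split
rf = RandomForestClassifier(n_estimators=100, max_depth=3,
                            min_samples_split=2, random_state=42)
rf.fit(df[['Age', 'Income']], df['Buy'])

# feature_importances_ sums to 1; larger values mean the feature
# contributed more to the forest's splits.
for name, score in zip(['Age', 'Income'], rf.feature_importances_):
    print(f"{name}: {score:.3f}")
```

For a systematic search over these hyperparameters, scikit-learn's `GridSearchCV` can wrap the same estimator.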