Feature selection is the process of identifying and selecting a subset of the most relevant features (input variables) from a larger set of available features in a dataset. The goal is to retain the most informative variables while removing those that are redundant, irrelevant, or noisy.
This process plays a critical role in building efficient and accurate machine learning models. Feature selection helps reduce overfitting, enhance model generalization, speed up training, and improve model interpretability.
Feature selection methods are broadly classified into three categories:
Filter methods use statistical techniques to assess the relationship between each input feature and the target variable. Features are ranked based on a score, and a selection is made independently of any machine learning algorithm.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Chi-square test requires non-negative features, so we scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Apply SelectKBest with chi2 score function to select top 10 features
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X_scaled, y)
# Get selected feature names
selected_indices = selector.get_support(indices=True)
selected_features = feature_names[selected_indices]
# Show selected feature names
print("Top 10 selected features using Chi-Square Test:")
print(selected_features)
# Optionally, display as DataFrame
X_selected_df = pd.DataFrame(X_new, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Top 10 selected features using Chi-Square Test:
['mean radius' 'mean perimeter' 'mean area' 'mean concavity'
'mean concave points' 'worst radius' 'worst perimeter' 'worst area'
'worst concavity' 'worst concave points']
Transformed Feature Set Shape: (569, 10)
mean radius mean perimeter mean area mean concavity ... worst perimeter worst area worst concavity worst concave points
0 0.521037 0.545989 0.363733 0.703140 ... 0.668310 0.450698 0.568610 0.912027
1 0.643144 0.615783 0.501591 0.203608 ... 0.539818 0.435214 0.192971 0.639175
2 0.601496 0.595743 0.449417 0.462512 ... 0.508442 0.374508 0.359744 0.835052
3 0.210090 0.233501 0.102906 0.565604 ... 0.241347 0.094008 0.548642 0.884880
4 0.629893 0.630986 0.489290 0.463918 ... 0.506948 0.341575 0.319489 0.558419
[5 rows x 10 columns]
The breast cancer dataset is loaded from sklearn.datasets, which is a binary classification problem. Because the chi-square test only works with non-negative values, the features are first scaled with MinMaxScaler. SelectKBest(score_func=chi2, k=10) then selects the top 10 features that have the strongest relationship with the target variable.
Wrapper methods evaluate subsets of features by training and testing a model on different combinations. They use model performance (e.g., accuracy or F1-score) as the selection criterion.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import pandas as pd
# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Initialize the model (using liblinear solver for small datasets)
model = LogisticRegression(solver='liblinear')
# Create RFE object to select top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
# Get the selected feature indices and names
selected_indices = rfe.get_support(indices=True)
selected_features = feature_names[selected_indices]
# Display selected feature names
print("Top 5 selected features using RFE with Logistic Regression:")
print(selected_features)
# Optionally, display the transformed dataset
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Top 5 selected features using RFE with Logistic Regression:
['mean radius' 'mean concavity' 'worst radius' 'worst concavity'
'worst concave points']
Transformed Feature Set Shape: (569, 5)
mean radius mean concavity worst radius worst concavity worst concave points
0 17.99 0.3001 25.38 0.7119 0.2654
1 20.57 0.0869 24.99 0.2416 0.1860
2 19.69 0.1974 23.57 0.4504 0.2430
3 11.42 0.2414 14.91 0.6869 0.2575
4 20.29 0.1980 22.54 0.4000 0.1625
RFE (Recursive Feature Elimination) repeatedly fits the model and removes the weakest features until the requested number of features (n_features_to_select=5) remains. get_support(indices=True) returns the indices of the selected features.
Embedded methods perform feature selection during the model training process. These methods combine the advantages of both filter and wrapper methods and are usually more efficient.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Standardize features (important for regularization methods like Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Initialize and fit Lasso regression model
lasso = Lasso(alpha=0.01) # Lower alpha means less regularization
lasso.fit(X_scaled, y)
# Identify selected features (non-zero coefficients)
selected_mask = lasso.coef_ != 0
selected_features = feature_names[selected_mask]
# Output results
print("Selected Features using Lasso Regression:")
print(selected_features)
# Optionally display selected feature data
X_selected = X_scaled[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Selected Features using Lasso Regression:
['mean texture' 'mean concave points' 'mean fractal dimension'
'radius error' 'smoothness error' 'concavity error' 'worst radius'
'worst texture' 'worst smoothness' 'worst concavity'
'worst concave points' 'worst symmetry']
Transformed Feature Set Shape: (569, 12)
mean texture mean concave points mean fractal dimension ... worst concavity worst concave points worst symmetry
0 -2.073335 2.532475 2.255747 ... 2.109526 2.296076 2.750622
1 -0.353632 0.548144 -0.868652 ... -0.146749 1.087084 -0.243890
2 0.456187 2.037231 -0.398008 ... 0.854974 1.955000 1.152255
3 0.253732 1.451707 4.910919 ... 1.989588 2.175786 6.046041
4 -1.151816 1.428493 -0.562450 ... 0.613179 0.729259 -0.868353
[5 rows x 12 columns]
Lasso applies L1 regularization, which shrinks the coefficients of uninformative features to exactly zero; the features with non-zero coefficients are the ones selected. Lowering alpha reduces regularization, possibly retaining more features.
Many algorithms provide a direct way to assess feature importance, which can guide selection:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Initialize and fit Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
# Extract feature importances
importances = model.feature_importances_
importance_threshold = 0.01 # You can adjust this threshold
# Identify features above threshold
selected_mask = importances > importance_threshold
selected_features = feature_names[selected_mask]
# Output results
print("Selected Features using Random Forest Feature Importances:")
for feature, importance in zip(selected_features, importances[selected_mask]):
    print(f"{feature}: {importance:.4f}")
# Optionally display selected feature data
X_selected = X[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Selected Features using Random Forest Feature Importances:
mean radius: 0.0348
mean texture: 0.0152
mean perimeter: 0.0680
mean area: 0.0605
mean compactness: 0.0116
mean concavity: 0.0669
mean concave points: 0.1070
radius error: 0.0143
perimeter error: 0.0101
area error: 0.0296
worst radius: 0.0828
worst texture: 0.0175
worst perimeter: 0.0808
worst area: 0.1394
worst smoothness: 0.0122
worst compactness: 0.0199
worst concavity: 0.0373
worst concave points: 0.1322
Transformed Feature Set Shape: (569, 18)
mean radius mean texture mean perimeter mean area ... worst smoothness worst compactness worst concavity worst concave points
0 17.99 10.38 122.80 1001.0 ... 0.1622 0.6656 0.7119 0.2654
1 20.57 17.77 132.90 1326.0 ... 0.1238 0.1866 0.2416 0.1860
2 19.69 21.25 130.00 1203.0 ... 0.1444 0.4245 0.4504 0.2430
3 11.42 20.38 77.58 386.1 ... 0.2098 0.8663 0.6869 0.2575
4 20.29 14.34 135.10 1297.0 ... 0.1374 0.2050 0.4000 0.1625
[5 rows x 18 columns]
The feature_importances_ attribute of the fitted Random Forest scores each feature's contribution to the model's predictions; features scoring above the chosen threshold (0.01 here) are kept.
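The manual thresholding above can also be expressed with scikit-learn's SelectFromModel, which wraps any estimator exposing coef_ or feature_importances_. A minimal sketch, reusing the same Random Forest and the same 0.01 cutoff:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Load the same dataset used throughout this section
data = load_breast_cancer()
X, y = data.data, data.target
# SelectFromModel fits the estimator and keeps only the features
# whose importance exceeds the threshold (here the same 0.01 cutoff)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.01
)
X_reduced = selector.fit_transform(X, y)
print("Reduced shape:", X_reduced.shape)
print("Kept features:", data.feature_names[selector.get_support()])
The threshold parameter also accepts strings such as "mean" or "median", which sidesteps hand-picking a cutoff.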
The table below summarizes the three categories of feature selection methods:
| Method Type | Technique Examples | Model Dependency | Interaction Consideration | Scalability |
|---|---|---|---|---|
| Filter | Correlation, Chi-Square, ANOVA | No | No | High |
| Wrapper | RFE, Forward/Backward Selection | Yes | Yes | Low |
| Embedded | Lasso, Decision Trees | Yes | Yes | Medium |
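The table lists forward/backward selection as wrapper techniques, but no example was shown above; scikit-learn provides both through SequentialFeatureSelector. A minimal sketch of forward selection, reusing the logistic regression model from the RFE example:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
# Load the same dataset used throughout this section
data = load_breast_cancer()
X, y = data.data, data.target
# Forward selection: start from an empty set and greedily add the
# feature that most improves cross-validated performance at each step
model = LogisticRegression(solver='liblinear')
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction='forward', cv=5)
sfs.fit(X, y)
print("Selected features:", data.feature_names[sfs.get_support()])
Passing direction='backward' instead starts from the full feature set and removes features one at a time. Because each step refits the model for every candidate feature across all cross-validation folds, wrapper methods are the slowest of the three categories, matching the low scalability noted in the table.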