Feature Engineering & Data Preprocessing
Overview
- Handling Missing Data in ML
- Feature Scaling (Normalization vs. Standardization)
- Encoding Categorical Variables
- Feature Selection Techniques
- Dimensionality Reduction Techniques
- Feature Extraction from Text and Images
- Handling Imbalanced Data (SMOTE, Class Weights)
Feature Selection Techniques
Feature selection is the process of identifying and selecting a subset of the most relevant features (input variables) from a larger set of available features in a dataset. The goal is to retain the most informative variables while removing those that are redundant, irrelevant, or noisy.
This process plays a critical role in building efficient and accurate machine learning models. Feature selection helps reduce overfitting, enhance model generalization, speed up training, and improve model interpretability.
Why Feature Selection Is Important
- Reduces Overfitting: By eliminating irrelevant or noisy features, the model becomes less likely to learn from noise in the data, which improves generalization.
- Improves Model Accuracy: Focusing on the most informative features often results in better predictive performance.
- Decreases Training Time: Fewer features mean fewer computations and faster training.
- Enhances Model Interpretability: With fewer features, it's easier to understand how the model makes predictions.
Categories of Feature Selection Techniques
Feature selection methods are broadly classified into three categories:
- Filter Methods
- Wrapper Methods
- Embedded Methods
1. Filter Methods
Filter methods use statistical techniques to assess the relationship between each input feature and the target variable. Features are ranked based on a score, and a selection is made independently of any machine learning algorithm.
Common Techniques:
- Pearson Correlation Coefficient: Measures linear correlation between numerical features and the target.
- Chi-Square Test: Tests the independence between categorical features and the target.
- Mutual Information: Captures both linear and non-linear relationships between variables.
- ANOVA F-test: Compares the variance between classes to the variance within classes to score features in classification problems.
Example: Using Chi-Square Test in Python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Chi-square test requires non-negative features, so we scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Apply SelectKBest with chi2 score function to select top 10 features
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X_scaled, y)
# Get selected feature names
selected_indices = selector.get_support(indices=True)
selected_features = feature_names[selected_indices]
# Show selected feature names
print("Top 10 selected features using Chi-Square Test:")
print(selected_features)
# Optionally, display as DataFrame
X_selected_df = pd.DataFrame(X_new, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Top 10 selected features using Chi-Square Test:
['mean radius' 'mean perimeter' 'mean area' 'mean concavity'
'mean concave points' 'worst radius' 'worst perimeter' 'worst area'
'worst concavity' 'worst concave points']
Transformed Feature Set Shape: (569, 10)
mean radius mean perimeter mean area mean concavity ... worst perimeter worst area worst concavity worst concave points
0 0.521037 0.545989 0.363733 0.703140 ... 0.668310 0.450698 0.568610 0.912027
1 0.643144 0.615783 0.501591 0.203608 ... 0.539818 0.435214 0.192971 0.639175
2 0.601496 0.595743 0.449417 0.462512 ... 0.508442 0.374508 0.359744 0.835052
3 0.210090 0.233501 0.102906 0.565604 ... 0.241347 0.094008 0.548642 0.884880
4 0.629893 0.630986 0.489290 0.463918 ... 0.506948 0.341575 0.319489 0.558419
[5 rows x 10 columns]
Example Explanation:
- We use the Breast Cancer dataset from sklearn.datasets, which is a binary classification problem.
- Since Chi-Square requires non-negative input values, we apply MinMaxScaler to scale all features between 0 and 1.
- SelectKBest(score_func=chi2, k=10) selects the top 10 features that have the strongest relationship with the target variable.
- We then print out the names of the selected features and a preview of the new feature matrix.
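The same SelectKBest interface works with the other filter scores listed above. As a minimal sketch (not part of the original example), mutual information captures non-linear relationships and does not require non-negative inputs, so no scaling step is needed:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Load the same dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Rank features by their estimated mutual information with the target
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_mi = mi_selector.fit_transform(X, y)
print("Top 10 features by mutual information:")
print(data.feature_names[mi_selector.get_support(indices=True)])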
Pros:
- Fast and scalable for high-dimensional data.
- Model-independent.
Cons:
- Does not account for feature interactions.
- Treats each feature independently.
2. Wrapper Methods
Wrapper methods evaluate subsets of features by training and testing a model on different combinations. They use model performance (e.g., accuracy or F1-score) as the selection criterion.
Common Techniques:
- Forward Selection: Start with no features and add, one at a time, the feature that most improves model performance (a sketch using scikit-learn's SequentialFeatureSelector follows this list).
- Backward Elimination: Start with all features and remove the least significant one iteratively.
- Recursive Feature Elimination (RFE): Recursively removes the least important feature based on model coefficients.
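Forward and backward selection are available in scikit-learn as SequentialFeatureSelector (added in version 0.24). The following is a minimal sketch, assuming that class is available in your scikit-learn version, and is separate from the RFE example that follows:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
# Load the sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Greedily add one feature at a time (direction='forward') until 5 remain,
# scoring each candidate subset with cross-validation
sfs = SequentialFeatureSelector(LogisticRegression(solver='liblinear'), n_features_to_select=5, direction='forward', cv=5)
sfs.fit(X, y)
print("Features chosen by forward selection:")
print(data.feature_names[sfs.get_support(indices=True)])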
Example: Using RFE in Python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import pandas as pd
# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Initialize the model (using liblinear solver for small datasets)
model = LogisticRegression(solver='liblinear')
# Create RFE object to select top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
# Get the selected feature indices and names
selected_indices = rfe.get_support(indices=True)
selected_features = feature_names[selected_indices]
# Display selected feature names
print("Top 5 selected features using RFE with Logistic Regression:")
print(selected_features)
# Optionally, display the transformed dataset
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Top 5 selected features using RFE with Logistic Regression:
['mean radius' 'mean concavity' 'worst radius' 'worst concavity'
'worst concave points']
Transformed Feature Set Shape: (569, 5)
mean radius mean concavity worst radius worst concavity worst concave points
0 17.99 0.3001 25.38 0.7119 0.2654
1 20.57 0.0869 24.99 0.2416 0.1860
2 19.69 0.1974 23.57 0.4504 0.2430
3 11.42 0.2414 14.91 0.6869 0.2575
4 20.29 0.1980 22.54 0.4000 0.1625
Example Explanation:
- We use the Breast Cancer dataset for demonstration.
- A Logistic Regression model is used as the base estimator.
- RFE works by recursively removing the least important feature based on the model’s coefficients until only the specified number (n_features_to_select=5) remains.
- get_support(indices=True) returns the indices of the selected features.
- The final output includes the names of the top 5 selected features and a preview of the transformed dataset.
Pros:
- Takes into account feature dependencies.
- Tailored to the specific algorithm used.
Cons:
- Computationally expensive, especially with large datasets.
- Prone to overfitting if not carefully cross-validated.
3. Embedded Methods
Embedded methods perform feature selection during the model training process. These methods combine the advantages of both filter and wrapper methods and are usually more efficient.
Common Techniques:
- L1 Regularization (Lasso Regression): Adds a penalty that shrinks less important feature coefficients to zero.
- Tree-Based Feature Importance: Uses ensemble models like Random Forest or Gradient Boosting to rank features based on importance.
Example 1: Using Lasso for Feature Selection
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Standardize features (important for regularization methods like Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Initialize and fit Lasso regression model
lasso = Lasso(alpha=0.01) # Lower alpha means less regularization
lasso.fit(X_scaled, y)
# Identify selected features (non-zero coefficients)
selected_mask = lasso.coef_ != 0
selected_features = feature_names[selected_mask]
# Output results
print("Selected Features using Lasso Regression:")
print(selected_features)
# Optionally display selected feature data
X_selected = X_scaled[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Selected Features using Lasso Regression:
['mean texture' 'mean concave points' 'mean fractal dimension'
'radius error' 'smoothness error' 'concavity error' 'worst radius'
'worst texture' 'worst smoothness' 'worst concavity'
'worst concave points' 'worst symmetry']
Transformed Feature Set Shape: (569, 12)
mean texture mean concave points mean fractal dimension ... worst concavity worst concave points worst symmetry
0 -2.073335 2.532475 2.255747 ... 2.109526 2.296076 2.750622
1 -0.353632 0.548144 -0.868652 ... -0.146749 1.087084 -0.243890
2 0.456187 2.037231 -0.398008 ... 0.854974 1.955000 1.152255
3 0.253732 1.451707 4.910919 ... 1.989588 2.175786 6.046041
4 -1.151816 1.428493 -0.562450 ... 0.613179 0.729259 -0.868353
[5 rows x 12 columns]
Explanation:
- Lasso (L1) adds a penalty to the regression loss function that forces some feature coefficients to become exactly zero.
- Features with non-zero coefficients are selected.
- Standardization is necessary because Lasso is sensitive to feature scale.
- A smaller alpha reduces regularization, possibly retaining more features.
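Since the target in this dataset is binary, an L1-penalized classifier is a natural alternative to Lasso regression. The following minimal sketch (not part of the original example) uses LogisticRegression with penalty='l1' together with SelectFromModel, which keeps the features whose coefficients are non-zero:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
# Load and standardize the data (L1 penalties are scale-sensitive)
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
# The L1 penalty drives some coefficients to exactly zero; C controls its strength
sparse_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(sparse_clf).fit(X_scaled, data.target)
print("Features kept by L1-penalized logistic regression:")
print(data.feature_names[selector.get_support()])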
Example 2: Using Random Forest
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
# Initialize and fit Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
# Extract feature importances
importances = model.feature_importances_
importance_threshold = 0.01 # You can adjust this threshold
# Identify features above threshold
selected_mask = importances > importance_threshold
selected_features = feature_names[selected_mask]
# Output results
print("Selected Features using Random Forest Feature Importances:")
for feature, importance in zip(feature_names[selected_mask], importances[selected_mask]):
    print(f"{feature}: {importance:.4f}")
# Optionally display selected feature data
X_selected = X[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())
Output -
Selected Features using Random Forest Feature Importances:
mean radius: 0.0348
mean texture: 0.0152
mean perimeter: 0.0680
mean area: 0.0605
mean compactness: 0.0116
mean concavity: 0.0669
mean concave points: 0.1070
radius error: 0.0143
perimeter error: 0.0101
area error: 0.0296
worst radius: 0.0828
worst texture: 0.0175
worst perimeter: 0.0808
worst area: 0.1394
worst smoothness: 0.0122
worst compactness: 0.0199
worst concavity: 0.0373
worst concave points: 0.1322
Transformed Feature Set Shape: (569, 18)
mean radius mean texture mean perimeter mean area ... worst smoothness worst compactness worst concavity worst concave points
0 17.99 10.38 122.80 1001.0 ... 0.1622 0.6656 0.7119 0.2654
1 20.57 17.77 132.90 1326.0 ... 0.1238 0.1866 0.2416 0.1860
2 19.69 21.25 130.00 1203.0 ... 0.1444 0.4245 0.4504 0.2430
3 11.42 20.38 77.58 386.1 ... 0.2098 0.8663 0.6869 0.2575
4 20.29 14.34 135.10 1297.0 ... 0.1374 0.2050 0.4000 0.1625
[5 rows x 18 columns]
Code Explanation:
- Random Forests compute the importance of each feature based on how much it reduces impurity across the splits in which it is used (e.g., Gini importance).
- A threshold (e.g., 0.01) is used to retain the most relevant features.
- Unlike Lasso, no scaling is required because tree-based models are invariant to feature scaling.
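As an alternative to masking the importances by hand, scikit-learn's SelectFromModel can wrap the forest directly. A minimal sketch follows; the 'median' threshold here is an assumption for illustration, not part of the original example:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Keep the features whose importance is above the median importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold='median')
X_selected = selector.fit_transform(X, y)
print("Features kept by SelectFromModel:")
print(data.feature_names[selector.get_support()])
print("Reduced shape:", X_selected.shape)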
Pros:
- Efficient and effective.
- Integrates with the model training process.
Cons:
- Model-dependent.
- May not be ideal for very small datasets.
Evaluating Feature Importance
Many algorithms provide a direct way to assess feature importance, which can guide selection:
- Coefficients in linear models.
- Feature importance in tree-based models.
- Permutation importance methods.
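Permutation importance, listed above, is model-agnostic: it measures how much a model's score drops when a single feature's values are shuffled. A minimal sketch (not from the original text) using sklearn.inspection.permutation_importance:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
# Hold out a test set so importances reflect generalization, not memorization
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
# Shuffle each feature several times and record the average drop in test accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# Show the five features with the largest mean importance
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[idx]}: {result.importances_mean[idx]:.4f}")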
Comparison of Feature Selection Methods
| Method Type | Technique Examples | Model Dependency | Interaction Consideration | Scalability |
|---|---|---|---|---|
| Filter | Correlation, Chi-Square, ANOVA | No | No | High |
| Wrapper | RFE, Forward/Backward Selection | Yes | Yes | Low |
| Embedded | Lasso, Decision Trees | Yes | Yes | Medium |
Best Practices for Feature Selection
- Understand your data: Use EDA (exploratory data analysis) to detect correlations, data types, and feature distributions.
- Use domain knowledge: Human insight can often reveal feature relevance that statistics alone cannot.
- Apply multiple methods: Use a combination of filter and embedded methods to cross-validate your selection.
- Use cross-validation: Always validate your model’s performance after feature selection to avoid overfitting.
- Be cautious with automated methods: Some algorithms may discard useful features if hyperparameters are not tuned properly.
When to Perform Feature Selection
- Before training your model.
- As part of a pipeline (especially with scikit-learn pipelines; see the sketch after this list).
- When dealing with datasets that have a large number of features relative to the number of samples.
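Putting the selector inside a scikit-learn Pipeline ensures it is fit only on the training folds during cross-validation, so no information leaks from the validation data into the selection step. A minimal sketch (not from the original text) that combines scaling, SelectKBest, and a classifier, then evaluates the whole pipeline with cross_val_score:
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Load dataset
data = load_breast_cancer()
# Scaling, selection, and the classifier are refit on each training fold
pipe = Pipeline([('scale', MinMaxScaler()), ('select', SelectKBest(score_func=chi2, k=10)), ('clf', LogisticRegression(solver='liblinear'))])
scores = cross_val_score(pipe, data.data, data.target, cv=5)
print("Cross-validated accuracy with feature selection inside the pipeline:", round(scores.mean(), 4))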