Feature Selection Techniques

Feature selection is the process of identifying and selecting a subset of the most relevant features (input variables) from a larger set of available features in a dataset. The goal is to retain the most informative variables while removing those that are redundant, irrelevant, or noisy.

This process plays a critical role in building efficient and accurate machine learning models. Feature selection helps reduce overfitting, enhance model generalization, speed up training, and improve model interpretability.


Why Feature Selection Is Important

  1. Reduces Overfitting: By eliminating irrelevant or noisy features, the model becomes less likely to learn from noise in the data, which improves generalization.
  2. Improves Model Accuracy: Focusing on the most informative features often results in better predictive performance.
  3. Decreases Training Time: Fewer features mean fewer computations and faster training.
  4. Enhances Model Interpretability: With fewer features, it's easier to understand how the model makes predictions.

Categories of Feature Selection Techniques

Feature selection methods are broadly classified into three categories:

  1. Filter Methods
  2. Wrapper Methods
  3. Embedded Methods

1. Filter Methods

Filter methods use statistical techniques to assess the relationship between each input feature and the target variable. Features are ranked by their scores, and the top-ranked ones are selected independently of any machine learning algorithm.

Common Techniques:

  • Pearson Correlation Coefficient: Measures linear correlation between numerical features and the target.
  • Chi-Square Test: Tests the independence between categorical features and the target.
  • Mutual Information: Captures both linear and non-linear relationships between variables (a minimal sketch follows this list).
  • ANOVA F-test: Compares a feature's means across the target classes using the ratio of between-group to within-group variance, for classification problems.
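
Each of these scores plugs into scikit-learn's SelectKBest in the same way. As a minimal sketch (the Breast Cancer dataset here is an illustrative choice), mutual information can be applied like this; the chi-square example below walks through the workflow in more detail:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load a sample dataset (illustrative choice)
data = load_breast_cancer()
X, y = data.data, data.target

# Rank features by mutual information with the target and keep the top 10
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_new = selector.fit_transform(X, y)

print("Top 10 features by mutual information:")
print(data.feature_names[selector.get_support(indices=True)])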

Example: Using Chi-Square Test in Python

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Chi-square test requires non-negative features, so we scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply SelectKBest with chi2 score function to select top 10 features
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X_scaled, y)

# Get selected feature names
selected_indices = selector.get_support(indices=True)
selected_features = feature_names[selected_indices]

# Show selected feature names
print("Top 10 selected features using Chi-Square Test:")
print(selected_features)

# Optionally, display as DataFrame
X_selected_df = pd.DataFrame(X_new, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())

Output - 

Top 10 selected features using Chi-Square Test:
['mean radius' 'mean perimeter' 'mean area' 'mean concavity'
 'mean concave points' 'worst radius' 'worst perimeter' 'worst area'
 'worst concavity' 'worst concave points']

Transformed Feature Set Shape: (569, 10)
   mean radius  mean perimeter  mean area  mean concavity  ...  worst perimeter  worst area  worst concavity  worst concave points
0     0.521037        0.545989   0.363733        0.703140  ...         0.668310    0.450698         0.568610              0.912027
1     0.643144        0.615783   0.501591        0.203608  ...         0.539818    0.435214         0.192971              0.639175
2     0.601496        0.595743   0.449417        0.462512  ...         0.508442    0.374508         0.359744              0.835052
3     0.210090        0.233501   0.102906        0.565604  ...         0.241347    0.094008         0.548642              0.884880
4     0.629893        0.630986   0.489290        0.463918  ...         0.506948    0.341575         0.319489              0.558419

[5 rows x 10 columns]

Example Explanation:

  • We use the Breast Cancer dataset from sklearn.datasets, which is a binary classification problem.
  • Since Chi-Square requires non-negative input values, we apply MinMaxScaler to scale all features between 0 and 1.
  • SelectKBest(score_func=chi2, k=10) selects the top 10 features that have the strongest relationship with the target variable.
  • We then print out the names of the selected features and a preview of the new feature matrix.

Pros:

  • Fast and scalable for high-dimensional data.
  • Model-independent.

Cons:

  • Does not account for feature interactions.
  • Treats each feature independently.

2. Wrapper Methods

Wrapper methods evaluate subsets of features by training and testing a model on different combinations. They use model performance (e.g., accuracy or F1-score) as the selection criterion.

Common Techniques:

  • Forward Selection: Start with no features and, at each step, add the feature that most improves the model (see the sketch after this list).
  • Backward Elimination: Start with all features and iteratively remove the least significant one.
  • Recursive Feature Elimination (RFE): Recursively removes the least important features based on model coefficients or feature importances.
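
Scikit-learn exposes forward and backward selection through SequentialFeatureSelector (available from version 0.24). A minimal sketch of forward selection, again using the built-in Breast Cancer dataset as an illustrative choice:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

# Greedily add the feature that improves the cross-validated score the most
# until 5 features remain; direction='backward' gives backward elimination
sfs = SequentialFeatureSelector(
    LogisticRegression(solver='liblinear'),
    n_features_to_select=5,
    direction='forward',
    cv=5,
)
sfs.fit(X, y)

print("Features chosen by forward selection:")
print(data.feature_names[sfs.get_support(indices=True)])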

Example: Using RFE in Python

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import pandas as pd

# Load a sample dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Initialize the model (using liblinear solver for small datasets)
model = LogisticRegression(solver='liblinear')

# Create RFE object to select top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)

# Get the selected feature indices and names
selected_indices = rfe.get_support(indices=True)
selected_features = feature_names[selected_indices]

# Display selected feature names
print("Top 5 selected features using RFE with Logistic Regression:")
print(selected_features)

# Optionally, display the transformed dataset
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())

Output - 

Top 5 selected features using RFE with Logistic Regression:
['mean radius' 'mean concavity' 'worst radius' 'worst concavity'
 'worst concave points']

Transformed Feature Set Shape: (569, 5)
   mean radius  mean concavity  worst radius  worst concavity  worst concave points
0        17.99          0.3001         25.38           0.7119                0.2654
1        20.57          0.0869         24.99           0.2416                0.1860
2        19.69          0.1974         23.57           0.4504                0.2430
3        11.42          0.2414         14.91           0.6869                0.2575
4        20.29          0.1980         22.54           0.4000                0.1625

Example Explanation:

  • We use the Breast Cancer dataset for demonstration.
  • A Logistic Regression model is used as the base estimator.
  • RFE works by recursively removing the least important feature based on the model’s coefficients until only the specified number (n_features_to_select=5) remains.
  • get_support(indices=True) returns the indices of the selected features.
  • The final output includes the names of the top 5 selected features and a preview of the transformed dataset.

Pros:

  • Takes into account feature dependencies.
  • Tailored to the specific algorithm used.

Cons:

  • Computationally expensive, especially with large datasets.
  • Prone to overfitting if not carefully cross-validated.

3. Embedded Methods

Embedded methods perform feature selection during the model training process. These methods combine the advantages of both filter and wrapper methods and are usually more efficient.

Common Techniques:

  • L1 Regularization (Lasso Regression): Adds a penalty that shrinks less important feature coefficients to zero.
  • Tree-Based Feature Importance: Uses ensemble models like Random Forest or Gradient Boosting to rank features based on importance.

Example 1: Using Lasso for Feature Selection

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Standardize features (important for regularization methods like Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize and fit Lasso regression model
lasso = Lasso(alpha=0.01)  # Lower alpha means less regularization
lasso.fit(X_scaled, y)

# Identify selected features (non-zero coefficients)
selected_mask = lasso.coef_ != 0
selected_features = feature_names[selected_mask]

# Output results
print("Selected Features using Lasso Regression:")
print(selected_features)

# Optionally display selected feature data
X_selected = X_scaled[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())

Output - 

Selected Features using Lasso Regression:
['mean texture' 'mean concave points' 'mean fractal dimension'
 'radius error' 'smoothness error' 'concavity error' 'worst radius'
 'worst texture' 'worst smoothness' 'worst concavity'
 'worst concave points' 'worst symmetry']

Transformed Feature Set Shape: (569, 12)
   mean texture  mean concave points  mean fractal dimension  ...  worst concavity  worst concave points  worst symmetry
0     -2.073335             2.532475                2.255747  ...         2.109526              2.296076        2.750622
1     -0.353632             0.548144               -0.868652  ...        -0.146749              1.087084       -0.243890
2      0.456187             2.037231               -0.398008  ...         0.854974              1.955000        1.152255
3      0.253732             1.451707                4.910919  ...         1.989588              2.175786        6.046041
4     -1.151816             1.428493               -0.562450  ...         0.613179              0.729259       -0.868353

[5 rows x 12 columns]

Explanation:

  • Lasso (L1) adds a penalty to the regression loss function that forces some feature coefficients to become exactly zero.
  • Features with non-zero coefficients are selected.
  • Standardization is necessary because Lasso is sensitive to feature scale.
  • A smaller alpha reduces regularization, possibly retaining more features.

Example 2: Using Random Forest

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Initialize and fit Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Extract feature importances
importances = model.feature_importances_
importance_threshold = 0.01  # You can adjust this threshold

# Identify features above threshold
selected_mask = importances > importance_threshold
selected_features = feature_names[selected_mask]

# Output results
print("Selected Features using Random Forest Feature Importances:")
for feature, importance in zip(feature_names[selected_mask], importances[selected_mask]):
    print(f"{feature}: {importance:.4f}")

# Optionally display selected feature data
X_selected = X[:, selected_mask]
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)
print("\nTransformed Feature Set Shape:", X_selected_df.shape)
print(X_selected_df.head())

Output - 

Selected Features using Random Forest Feature Importances:
mean radius: 0.0348
mean texture: 0.0152
mean perimeter: 0.0680
mean area: 0.0605
mean compactness: 0.0116
mean concavity: 0.0669
mean concave points: 0.1070
radius error: 0.0143
perimeter error: 0.0101
area error: 0.0296
worst radius: 0.0828
worst texture: 0.0175
worst perimeter: 0.0808
worst area: 0.1394
worst smoothness: 0.0122
worst compactness: 0.0199
worst concavity: 0.0373
worst concave points: 0.1322

Transformed Feature Set Shape: (569, 18)
   mean radius  mean texture  mean perimeter  mean area  ...  worst smoothness  worst compactness  worst concavity  worst concave points
0        17.99         10.38          122.80     1001.0  ...            0.1622             0.6656           0.7119                0.2654
1        20.57         17.77          132.90     1326.0  ...            0.1238             0.1866           0.2416                0.1860
2        19.69         21.25          130.00     1203.0  ...            0.1444             0.4245           0.4504                0.2430
3        11.42         20.38           77.58      386.1  ...            0.2098             0.8663           0.6869                0.2575
4        20.29         14.34          135.10     1297.0  ...            0.1374             0.2050           0.4000                0.1625

[5 rows x 18 columns]

Code Explanation:

  • Random Forests compute each feature's importance from how much it reduces impurity (e.g., Gini impurity) across the splits where it is used.
  • A threshold (e.g., 0.01) is used to retain the most relevant features (see the sketch after this list).
  • Unlike Lasso, no scaling is required because tree-based models are invariant to feature scaling.
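
The same threshold-based filtering can be written more compactly with scikit-learn's SelectFromModel; a minimal sketch, kept standalone for clarity, with the same illustrative threshold of 0.01:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
X, y = data.data, data.target

# Fit the forest and keep only features whose importance exceeds the threshold
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.01,
)
X_selected = selector.fit_transform(X, y)

print("Selected features:", data.feature_names[selector.get_support(indices=True)])
print("Transformed shape:", X_selected.shape)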

Pros:

  • More efficient than wrapper methods while still taking the model into account.
  • Integrates with the model training process.

Cons:

  • Model-dependent.
  • May not be ideal for very small datasets.

Evaluating Feature Importance

Many algorithms provide a direct way to assess feature importance, which can guide selection:

  • Coefficients in linear models.
  • Feature importance in tree-based models.
  • Permutation importance methods (a minimal sketch follows this list).
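
As a minimal sketch of the permutation-based approach mentioned above, scikit-learn's permutation_importance (in sklearn.inspection) shuffles one feature at a time on held-out data and records the resulting drop in score; the model and data split here are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# Show the ten features whose shuffling hurts the score the most
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(f"{data.feature_names[idx]}: {result.importances_mean[idx]:.4f}")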

Comparison of Feature Selection Methods

Method Type | Technique Examples               | Model Dependency | Interaction Consideration | Scalability
Filter      | Correlation, Chi-Square, ANOVA   | No               | No                        | High
Wrapper     | RFE, Forward/Backward Selection  | Yes              | Yes                       | Low
Embedded    | Lasso, Decision Trees            | Yes              | Yes                       | Medium

Best Practices for Feature Selection

  1. Understand your data: Use EDA (exploratory data analysis) to detect correlations, data types, and feature distributions.
  2. Use domain knowledge: Human insight can often reveal feature relevance that statistics alone cannot.
  3. Apply multiple methods: Use a combination of filter and embedded methods to cross-validate your selection.
  4. Use cross-validation: Always validate your model’s performance after feature selection to avoid overfitting.
  5. Be cautious with automated methods: Some algorithms may discard useful features if hyperparameters are not tuned properly.

When to Perform Feature Selection

  • Before training your model.
  • As part of a pipeline (especially with scikit-learn pipelines; see the sketch after this list).
  • When dealing with datasets that have a large number of features relative to the number of samples.
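
A minimal sketch of feature selection inside a scikit-learn Pipeline, so that the selector is refit on every cross-validation training fold and no information leaks from the test folds (the score function, k, and classifier are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# The scaler and selector are refit on each training fold, so the test fold
# never influences which features are chosen
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Cross-validated accuracy with feature selection in the pipeline:", scores.mean())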