Feature Engineering & Data Preprocessing
Overview
- Handling Missing Data in ML
- Feature Scaling (Normalization vs. Standardization)
- Encoding Categorical Variables
- Feature Selection Techniques
- Dimensionality Reduction Techniques
- Feature Extraction from Text and Images
- Handling Imbalanced Data (SMOTE, Class Weights)
Dimensionality Reduction Techniques
Dimensionality reduction is the process of reducing the number of input variables or features in a dataset. It is a crucial preprocessing step in machine learning, especially when dealing with high-dimensional data. Reducing the number of features can help improve model performance, reduce computational cost, and combat the curse of dimensionality.
Why Dimensionality Reduction Is Important
- Improves Model Performance: Redundant or irrelevant features can degrade model performance. Removing them helps the model focus on meaningful patterns.
- Reduces Overfitting: With fewer features, the model is less likely to learn noise from the training data.
- Increases Training Efficiency: Fewer features mean faster training and less memory usage.
- Enables Visualization: Reducing the data to two or three dimensions makes it possible to plot it and gain insight into its structure.
Types of Dimensionality Reduction Techniques
Dimensionality reduction techniques fall into two major categories:
- Feature Selection: Selecting a subset of the original features without transforming them.
- Feature Extraction: Transforming data from a high-dimensional space to a lower-dimensional space.
In this tutorial, we focus on Feature Extraction techniques; the short sketch below contrasts the two approaches before we dive in.
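Before moving on, a minimal sketch can make the distinction concrete: feature selection keeps a subset of the original columns unchanged, while feature extraction builds new columns from combinations of all of them. The dataset and the choice of 5 features/components below are arbitrary, purely for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
# Load a small tabular dataset (30 numeric features)
data = load_breast_cancer()
X, y = data.data, data.target
# Feature selection: keep 5 of the original 30 columns, values unchanged
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print("Selected original features:", data.feature_names[selector.get_support()])
# Feature extraction: build 5 new columns, each a combination of all 30 originals
pca = PCA(n_components=5)
X_extracted = pca.fit_transform(X)
print("Extracted components shape:", X_extracted.shape)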
1. Principal Component Analysis (PCA)
Description:
Principal Component Analysis is a linear technique that projects data onto a lower-dimensional subspace such that the variance of the projected data is maximized. It finds the axes (principal components) along which the data varies the most.
Key Concepts:
- PCA transforms the original features into new uncorrelated features (principal components).
- The first component captures the most variance, the second captures the second-most, and so on.
- PCA is unsupervised (does not use target labels).
Use Case:
Best used when you want to reduce dimensionality for continuous variables and when you suspect multicollinearity among features.
Code Example:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
# Step 1: Standardize the data
# PCA is affected by the scale of features, so standardization is important
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply PCA to reduce dimensions to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Step 3: Explained variance ratio
print("Explained Variance Ratio for each component:")
print(pca.explained_variance_ratio_)
# Step 4: Visualize the reduced feature space
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['Target'] = y
plt.figure(figsize=(8, 6))
for label, target_name in enumerate(target_names):
    plt.scatter(
        pca_df[pca_df['Target'] == label]['PC1'],
        pca_df[pca_df['Target'] == label]['PC2'],
        label=target_name,
        alpha=0.7
    )
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Breast Cancer Dataset')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig("pca_plot.png") # Save the figure
Output -
Explained Variance Ratio for each component:
[0.44272026 0.18971182]
Code Explanation:
- Dataset: The Breast Cancer dataset is loaded for binary classification.
- Standardization: PCA is a variance-based technique and is sensitive to feature scales, so standardization with StandardScaler is applied first.
- PCA Transformation: The data is projected onto 2 principal components, which capture a large share of the variance in the dataset.
- Variance Ratio: explained_variance_ratio_ shows how much variance is retained by each principal component; the sketch below shows how to use it to choose the number of components.
- Visualization: A 2D scatter plot shows how the two classes (malignant and benign) separate in the reduced feature space.
2. Linear Discriminant Analysis (LDA)
Description:
LDA is a supervised dimensionality reduction technique that seeks to find a feature space that maximizes class separability. It projects data in a way that maximizes the distance between classes and minimizes the variation within each class.
Key Concepts:
- Unlike PCA, LDA uses class labels.
- It is suitable for classification tasks.
Use Case:
Best used when the goal is to improve classification performance by reducing dimensions while maintaining class separation.
Code Example:
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply LDA to reduce to 1 component (since it’s a binary classification)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)
# Step 3: Add target labels for visualization
lda_df = pd.DataFrame(X_lda, columns=['LDA1'])
lda_df['Target'] = y
# Step 4: Visualize the distribution along the LDA component
plt.figure(figsize=(8, 4))
for label, name in enumerate(target_names):
    plt.hist(
        lda_df[lda_df['Target'] == label]['LDA1'],
        bins=30,
        alpha=0.6,
        label=name
    )
plt.title('LDA Projection - Breast Cancer Dataset')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.tight_layout()
# Save the plot instead of showing if in headless environment
plt.savefig("lda_projection.png")
# plt.show() # Uncomment this line if running in an interactive environment
Output -
The script saves a histogram of the two classes along the LDA axis as lda_projection.png.
Code Explanation:
- LDA is supervised, meaning it uses the class labels (y) to find a projection that maximizes separation between classes and minimizes intra-class variance.
- The dataset has 30 features, which are projected onto 1 component (LDA1) because there are 2 classes; LDA can extract at most n_classes - 1 components (the sketch below shows this with a three-class dataset).
- The final plot shows how the classes are separated along the LDA axis.
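To make the n_classes - 1 limit and the classification use case concrete, here is a small sketch on the three-class Iris dataset. The choice of logistic regression and 5-fold cross-validation is illustrative, not a recommendation.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
# Iris has 3 classes, so LDA can produce at most 3 - 1 = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("Reduced shape:", X_lda.shape)  # (150, 2)
# LDA as a dimensionality reduction step inside a classification pipeline
pipeline = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=2),
    LogisticRegression(max_iter=1000)
)
scores = cross_val_score(pipeline, X, y, cv=5)
print("Cross-validated accuracy with LDA features:", scores.mean())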
3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Description:
t-SNE is a non-linear technique particularly well-suited for visualizing high-dimensional data in 2D or 3D space. It preserves local structure by modeling similar data points close together in the lower-dimensional space.
Key Concepts:
- Not suitable for predictive modeling.
- Primarily used for visualization and understanding clusters in data.
Limitations:
- Computationally expensive.
- Not deterministic unless the random state is fixed.
Code Example:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pandas as pd
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names
# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Apply t-SNE to reduce dimensions to 2
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
# Step 4: Convert to DataFrame for visualization
tsne_df = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'])
tsne_df['Target'] = y
# Step 5: Plot the 2D t-SNE output
plt.figure(figsize=(8, 6))
for label, name in enumerate(target_names):
    subset = tsne_df[tsne_df['Target'] == label]
    plt.scatter(subset['TSNE1'], subset['TSNE2'], label=name, alpha=0.6)
plt.title("t-SNE Projection of Breast Cancer Dataset")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.grid(True)
plt.tight_layout()
# Save plot for non-interactive environments
plt.savefig("tsne_projection.png")
# plt.show() # Uncomment if running interactively
Output -
The script saves the 2D t-SNE scatter plot as tsne_projection.png.
Code Explanation:
- t-SNE is a non-linear, unsupervised dimensionality reduction technique best suited for visualizing high-dimensional data in 2 or 3 dimensions.
- It preserves local structure — nearby points in high-dimensional space stay close in the lower-dimensional projection.
- It's particularly useful when you want to visualize how well your data clusters without needing labels (though we use them just for plotting here).
Important Notes:
- Perplexity is a key parameter. Values between 5 and 50 usually work well, depending on the dataset size; the sketch below compares a few values.
- t-SNE is computationally intensive, and results may vary between runs unless random_state is set.
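The following sketch compares a few perplexity values side by side on the same data; the values 5, 30, and 50 are arbitrary picks from the commonly cited range.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
y = data.target
# One subplot per perplexity value
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 50]):
    embedding = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X_scaled)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='viridis', s=10, alpha=0.6)
    ax.set_title(f"perplexity = {perplexity}")
plt.tight_layout()
plt.savefig("tsne_perplexity_comparison.png")  # Use plt.show() if running interactively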
4. Autoencoders (Neural Network-based)
Description:
Autoencoders are a type of neural network trained to compress input data into a lower-dimensional representation and then reconstruct it back to the original data. The compressed representation (bottleneck layer) is the reduced feature space.
Key Concepts:
- Non-linear dimensionality reduction.
- Can learn complex feature representations.
- Requires more data and computation compared to PCA/LDA.
Architecture:
- Input layer
- Encoding layers (compress data)
- Bottleneck layer (lowest dimension)
- Decoding layers (reconstruct data)
Use Case:
Useful for reducing dimensions in complex, non-linear datasets like images or sensor data.
Code Example (using Keras):
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from keras.models import Model
from keras.layers import Input, Dense
import matplotlib.pyplot as plt
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Step 2: Normalize input data (autoencoders work best with scaled data)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Define Autoencoder architecture
input_layer = Input(shape=(X_scaled.shape[1],))
# Encoder
encoded = Dense(64, activation='relu')(input_layer)
encoded = Dense(32, activation='relu')(encoded)
bottleneck = Dense(2, activation='linear')(encoded)
# Decoder
decoded = Dense(32, activation='relu')(bottleneck)
decoded = Dense(64, activation='relu')(decoded)
output_layer = Dense(X_scaled.shape[1], activation='sigmoid')(decoded)
# Step 4: Build and compile autoencoder
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='mse')
# Step 5: Train the autoencoder
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, shuffle=True, verbose=0)
# Step 6: Extract encoder model to get reduced features
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_reduced = encoder.predict(X_scaled)
# Step 7: Visualize the 2D reduced representation
plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.title("2D Representation from Autoencoder Bottleneck")
plt.xlabel("Encoded Feature 1")
plt.ylabel("Encoded Feature 2")
plt.colorbar(label='Target')
plt.grid(True)
plt.tight_layout()
plt.savefig("autoencoder_projection.png") # Use plt.show() if running interactively
Output -
The script saves a scatter plot of the 2D bottleneck representation as autoencoder_projection.png.
Code Explanation:
- An Autoencoder is a type of neural network that learns to compress (encode) the data into a lower dimension and then reconstruct it (decode) back to the original.
- The bottleneck layer is the reduced-dimensional representation. Here, we set it to 2 dimensions.
- Unlike PCA or t-SNE, Autoencoders are non-linear, learned, and can scale well for large and complex data.
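One way to judge how much information the 2-dimensional bottleneck keeps is the reconstruction error, as suggested in the best practices at the end of this page. Assuming the autoencoder and X_scaled from the example above are still in scope, a minimal check could look like this:
import numpy as np
# Reconstruct the inputs and measure how far they are from the originals
X_reconstructed = autoencoder.predict(X_scaled)
reconstruction_mse = np.mean(np.square(X_scaled - X_reconstructed))
print("Mean reconstruction MSE:", reconstruction_mse)
# Per-sample error can flag rows that the bottleneck represents poorly
per_sample_error = np.mean(np.square(X_scaled - X_reconstructed), axis=1)
print("Worst-reconstructed sample index:", np.argmax(per_sample_error))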
5. Independent Component Analysis (ICA)
Description:
ICA separates a multivariate signal into additive, independent non-Gaussian components. It is often used in signal processing and is useful when the source signals are statistically independent.
Use Case:
Commonly used in audio signal separation, image processing, or other cases where independent source signals are mixed.
Code Example:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Apply FastICA
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X_scaled)
# Step 4: Visualize the reduced data
plt.figure(figsize=(8, 6))
plt.scatter(X_ica[:, 0], X_ica[:, 1], c=y, cmap='plasma', alpha=0.7)
plt.title("2D Representation using FastICA")
plt.xlabel("Independent Component 1")
plt.ylabel("Independent Component 2")
plt.grid(True)
plt.colorbar(label='Target')
plt.tight_layout()
plt.savefig("fastica_projection.png") # Use plt.show() if running interactively
Output -
The script saves the 2D FastICA scatter plot as fastica_projection.png.
Code Explanation:
- Independent Component Analysis (ICA) is a statistical technique that separates a multivariate signal into additive independent components.
- Unlike PCA (which looks for uncorrelated axes), ICA looks for statistically independent components, which is useful in signal processing and data unmixing (see the sketch after this list).
- Here, we reduce to 2 components and visualize the result.
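The breast cancer features are not really mixed signals, so the example above mainly demonstrates the API. A sketch closer to ICA's typical use case is blind source separation of synthetic signals; the sine and square-wave sources and the mixing matrix below are made up purely for illustration.
import numpy as np
from sklearn.decomposition import FastICA
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                    # source 1: sine wave
s2 = np.sign(np.sin(3 * t))           # source 2: square wave
S = np.c_[s1, s2] + 0.1 * rng.standard_normal((2000, 2))  # add a little noise
A = np.array([[1.0, 0.5], [0.5, 2.0]])  # mixing matrix (made up)
X_mixed = S @ A.T                       # the observed, mixed signals
# FastICA tries to recover the original independent sources from the mixture
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X_mixed)
print("Recovered sources shape:", S_estimated.shape)  # (2000, 2)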
Choosing the Right Dimensionality Reduction Technique
Technique | Supervised | Linear/Non-Linear | Use Case |
---|---|---|---|
PCA | No | Linear | General dimensionality reduction |
LDA | Yes | Linear | Classification tasks |
t-SNE | No | Non-Linear | Data visualization |
Autoencoders | No | Non-Linear | Complex data, unsupervised |
ICA | No | Linear | Signal separation |
Best Practices
- Standardize your data before applying PCA or LDA to ensure that all features contribute equally.
- Use PCA for data exploration and noise reduction.
- Use t-SNE for visualizing clusters or embedding.
- Evaluate reconstruction error when using autoencoders.
- Avoid using dimensionality reduction blindly in production without evaluating the impact on model performance; a quick comparison sketch follows below.
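For the last point, a minimal sketch of such an evaluation is to compare cross-validated scores with and without the reduction step in a pipeline. The classifier and the choice of 10 PCA components here are illustrative assumptions, not recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
# Same model with and without a PCA step, compared on the same folds
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))
print("Baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("With PCA accuracy:", cross_val_score(with_pca, X, y, cv=5).mean())
Keep whichever configuration performs better on held-out data before shipping the reduction step to production.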