Dimensionality Reduction Techniques


Dimensionality reduction is the process of reducing the number of input variables or features in a dataset. It is a crucial preprocessing step in machine learning, especially when dealing with high-dimensional data. Reducing the number of features can help improve model performance, reduce computational cost, and combat the curse of dimensionality.


Why Dimensionality Reduction Is Important

  1. Improves Model Performance: Redundant or irrelevant features can degrade model performance. Removing them helps the model focus on meaningful patterns.
  2. Reduces Overfitting: With fewer features, the model is less likely to learn noise from the training data.
  3. Increases Training Efficiency: Fewer features mean faster training and less memory usage.
  4. Enables Visualization: Reducing features to two or three dimensions enables data visualization and insight.

Types of Dimensionality Reduction Techniques

Dimensionality reduction techniques fall into two major categories:

  • Feature Selection: Selecting a subset of the original features without transforming them.
  • Feature Extraction: Transforming data from a high-dimensional space to a lower-dimensional space.

In this tutorial, we focus on Feature Extraction techniques.
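Although the rest of this tutorial covers feature extraction, a quick sketch of feature selection helps to see the contrast. The example below keeps a subset of the original, untransformed features using scikit-learn's SelectKBest on the same Breast Cancer dataset used throughout this tutorial; the choice of k=5 and the ANOVA F-test scoring are arbitrary, for illustration only.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Keep the 5 original features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_selected.shape)
print("Selected features:", data.feature_names[selector.get_support()])

Unlike the extraction techniques below, the retained columns are original features, so they keep their interpretation.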


1. Principal Component Analysis (PCA)

Description:

Principal Component Analysis is a linear technique that projects data onto a lower-dimensional subspace such that the variance of the projected data is maximized. It finds the axes (principal components) along which the data varies the most.

Key Concepts:

  • PCA transforms the original features into new uncorrelated features (principal components).
  • The first component captures the most variance, the second captures the second-most, and so on.
  • PCA is unsupervised (does not use target labels).

Use Case:

Best used when you want to reduce dimensionality for continuous variables and when you suspect multicollinearity among features.

Code Example:


from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names

# Step 1: Standardize the data
# PCA is affected by the scale of features, so standardization is important
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA to reduce dimensions to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 3: Explained variance ratio
print("Explained Variance Ratio for each component:")
print(pca.explained_variance_ratio_)

# Step 4: Visualize the reduced feature space
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['Target'] = y

plt.figure(figsize=(8, 6))
for label, target_name in enumerate(target_names):
    plt.scatter(
        pca_df[pca_df['Target'] == label]['PC1'],
        pca_df[pca_df['Target'] == label]['PC2'],
        label=target_name,
        alpha=0.7
    )

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Breast Cancer Dataset')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig("pca_plot.png")  # Save the figure

Output - 

Explained Variance Ratio for each component:
[0.44272026 0.18971182]

Code Explanation:

  • Dataset: The Breast Cancer dataset is loaded for binary classification.
  • Standardization: PCA is a variance-based technique and is sensitive to feature scales. Hence, standardization using StandardScaler is applied.
  • PCA Transformation: The data is projected to 2 principal components, which explain most of the variance in the dataset.
  • Variance Ratio: explained_variance_ratio_ shows how much variance is retained by each principal component.
  • Visualization: A 2D scatter plot shows how the two classes (malignant and benign) are separated in the new reduced feature space.
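In practice, rather than fixing n_components=2, the number of components is often chosen from the cumulative explained variance. A minimal sketch, reusing X_scaled from the example above; the 95% threshold is an arbitrary choice:

import numpy as np
from sklearn.decomposition import PCA

# A float in (0, 1) tells PCA to keep as many components as needed
# to explain at least that fraction of the total variance
pca_full = PCA(n_components=0.95)
X_pca_95 = pca_full.fit_transform(X_scaled)

print("Components needed for 95% variance:", pca_full.n_components_)
print("Cumulative explained variance:", np.cumsum(pca_full.explained_variance_ratio_))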

2. Linear Discriminant Analysis (LDA)

Description:

LDA is a supervised dimensionality reduction technique that seeks to find a feature space that maximizes class separability. It projects data in a way that maximizes the distance between classes and minimizes the variation within each class.

Key Concepts:

  • Unlike PCA, LDA uses class labels.
  • It is suitable for classification tasks.

Use Case:

Best used when the goal is to improve classification performance by reducing dimensions while maintaining class separation.

Code Example:


from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply LDA to reduce to 1 component (since it’s a binary classification)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)

# Step 3: Add target labels for visualization
lda_df = pd.DataFrame(X_lda, columns=['LDA1'])
lda_df['Target'] = y

# Step 4: Visualize the distribution along the LDA component
plt.figure(figsize=(8, 4))
for label, name in enumerate(target_names):
    plt.hist(
        lda_df[lda_df['Target'] == label]['LDA1'],
        bins=30,
        alpha=0.6,
        label=name
    )

plt.title('LDA Projection - Breast Cancer Dataset')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.tight_layout()

# Save the plot instead of showing if in headless environment
plt.savefig("lda_projection.png")
# plt.show()  # Uncomment this line if running in an interactive environment

Output -

The script saves the class-wise histograms along the first linear discriminant as lda_projection.png.

Code Explanation:

  • LDA is supervised, meaning it uses the class labels (y) to find a projection that maximizes separation between classes and minimizes intra-class variance.
  • The dataset has 30 features, which are projected into 1 component (LDA1) because there are 2 classes (you can extract at most n_classes - 1 components).
  • The final plot shows how the classes are separated along the LDA axis.
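Because LDA is usually applied as a preprocessing step for classification, a common pattern is to place it inside a pipeline with a classifier and measure the effect with cross-validation. A minimal sketch; the choice of LogisticRegression and 5-fold cross-validation is arbitrary, for illustration only.

from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale -> project onto 1 linear discriminant -> classify
pipeline = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(max_iter=1000)
)

scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean CV accuracy with LDA features:", scores.mean())

Fitting the pipeline inside cross-validation also avoids leaking label information from the test folds into the LDA projection.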

3. t-Distributed Stochastic Neighbor Embedding (t-SNE)

Description:

t-SNE is a non-linear technique particularly well-suited for visualizing high-dimensional data in 2D or 3D space. It preserves local structure by modeling similar data points close together in the lower-dimensional space.

Key Concepts:

  • Not suitable as a preprocessing step for predictive modeling, because it does not learn a mapping that can be applied to new data.
  • Primarily used for visualization and understanding clusters in data.

Limitations:

  • Computationally expensive.
  • Not deterministic unless the random state is fixed.

Code Example:


from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names

# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply t-SNE to reduce dimensions to 2
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)

# Step 4: Convert to DataFrame for visualization
tsne_df = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'])
tsne_df['Target'] = y

# Step 5: Plot the 2D t-SNE output
plt.figure(figsize=(8, 6))
for label, name in enumerate(target_names):
    subset = tsne_df[tsne_df['Target'] == label]
    plt.scatter(subset['TSNE1'], subset['TSNE2'], label=name, alpha=0.6)

plt.title("t-SNE Projection of Breast Cancer Dataset")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.grid(True)
plt.tight_layout()

# Save plot for non-interactive environments
plt.savefig("tsne_projection.png")
# plt.show()  # Uncomment if running interactively

Output -

The script saves the 2D t-SNE scatter plot as tsne_projection.png.

Code Explanation:

  • t-SNE is a non-linear, unsupervised dimensionality reduction technique best suited for visualizing high-dimensional data in 2 or 3 dimensions.
  • It preserves local structure — nearby points in high-dimensional space stay close in the lower-dimensional projection.
  • It's particularly useful when you want to visualize how well your data clusters without needing labels (though we use them just for plotting here).

Important Notes:

  • Perplexity is a key parameter; values between 5 and 50 usually work well, depending on the dataset size.
  • t-SNE is computationally intensive, and results may vary between runs unless random_state is set.
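One way to build intuition for perplexity is to fit t-SNE at a few different values and compare the projections. A minimal sketch, reusing X_scaled and y from the example above; the values 5, 30, and 50 are arbitrary:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(15, 4))

for ax, perp in zip(axes, perplexities):
    # Re-fit t-SNE for each perplexity; fixing random_state keeps runs comparable
    embedding = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X_scaled)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='viridis', alpha=0.6, s=10)
    ax.set_title(f"perplexity = {perp}")

plt.tight_layout()
plt.savefig("tsne_perplexity_comparison.png")  # Use plt.show() if running interactively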

4. Autoencoders (Neural Network-based)

Description:

Autoencoders are a type of neural network trained to compress input data into a lower-dimensional representation and then reconstruct it back to the original data. The compressed representation (bottleneck layer) is the reduced feature space.

Key Concepts:

  • Non-linear dimensionality reduction.
  • Can learn complex feature representations.
  • Requires more data and computation compared to PCA/LDA.

Architecture:

  • Input layer
  • Encoding layers (compress data)
  • Bottleneck layer (lowest dimension)
  • Decoding layers (reconstruct data)

Use Case:

Useful for reducing dimensions in complex, non-linear datasets like images or sensor data.

Code Example (using Keras):


from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from keras.models import Model
from keras.layers import Input, Dense
import matplotlib.pyplot as plt

# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Step 2: Normalize input data (autoencoders work best with scaled data)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Define Autoencoder architecture
input_layer = Input(shape=(X_scaled.shape[1],))
# Encoder
encoded = Dense(64, activation='relu')(input_layer)
encoded = Dense(32, activation='relu')(encoded)
bottleneck = Dense(2, activation='linear')(encoded)
# Decoder
decoded = Dense(32, activation='relu')(bottleneck)
decoded = Dense(64, activation='relu')(decoded)
output_layer = Dense(X_scaled.shape[1], activation='sigmoid')(decoded)

# Step 4: Build and compile autoencoder
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='mse')

# Step 5: Train the autoencoder
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, shuffle=True, verbose=0)

# Step 6: Extract encoder model to get reduced features
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_reduced = encoder.predict(X_scaled)

# Step 7: Visualize the 2D reduced representation
plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.title("2D Representation from Autoencoder Bottleneck")
plt.xlabel("Encoded Feature 1")
plt.ylabel("Encoded Feature 2")
plt.colorbar(label='Target')
plt.grid(True)
plt.tight_layout()
plt.savefig("autoencoder_projection.png")  # Use plt.show() if running interactively

Output -

The script saves the 2D bottleneck scatter plot as autoencoder_projection.png.

Code Explanation:

  • An Autoencoder is a type of neural network that learns to compress (encode) the data into a lower dimension and then reconstruct it (decode) back to the original input.
  • The bottleneck layer is the reduced-dimensional representation. Here, we set it to 2 dimensions.
  • Unlike PCA (which is linear) or t-SNE (which does not learn a reusable mapping), autoencoders learn a non-linear encoding that can be applied to new data and scale to large, complex datasets.
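A quick check on how much information the 2-dimensional bottleneck preserves is the reconstruction error. A minimal sketch, reusing the trained autoencoder and X_scaled from the example above:

import numpy as np

# Reconstruct the inputs from the trained autoencoder
X_reconstructed = autoencoder.predict(X_scaled, verbose=0)

# Mean squared reconstruction error over all samples and features
reconstruction_mse = np.mean((X_scaled - X_reconstructed) ** 2)
print("Mean reconstruction MSE:", reconstruction_mse)

# Per-sample error highlights observations the bottleneck represents poorly
per_sample_error = np.mean((X_scaled - X_reconstructed) ** 2, axis=1)
print("Worst-reconstructed sample index:", np.argmax(per_sample_error))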

5. Independent Component Analysis (ICA)

Description:

ICA separates a multivariate signal into additive, independent non-Gaussian components. It is often used in signal processing and is useful when the source signals are statistically independent.

Use Case:

Commonly used in audio signal separation, image processing, or other cases where independent source signals are mixed.

Code Example:


from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply FastICA
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X_scaled)

# Step 4: Visualize the reduced data
plt.figure(figsize=(8, 6))
plt.scatter(X_ica[:, 0], X_ica[:, 1], c=y, cmap='plasma', alpha=0.7)
plt.title("2D Representation using FastICA")
plt.xlabel("Independent Component 1")
plt.ylabel("Independent Component 2")
plt.grid(True)
plt.colorbar(label='Target')
plt.tight_layout()
plt.savefig("fastica_projection.png")  # Use plt.show() if running interactively

Output -

The script saves the 2D FastICA scatter plot as fastica_projection.png.

Code Explanation:

  • Independent Component Analysis (ICA) is a statistical technique that separates a multivariate signal into additive independent components.
  • Unlike PCA (which looks for uncorrelated axes), ICA looks for statistically independent components — useful in signal processing and data unmixing.
  • Here, we reduce to 2 components and visualize the result.
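The breast cancer features are not really mixed source signals, so the example above mainly demonstrates the API. The classic ICA use case is blind source separation; the sketch below unmixes two synthetic signals (the signal shapes and the mixing matrix are made up for illustration):

import numpy as np
from sklearn.decomposition import FastICA

# Two independent source signals: a sine wave and a square wave
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])

# Mix the sources with an arbitrary mixing matrix
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X_mixed = S @ A.T

# Recover estimates of the original sources from the observed mixtures
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X_mixed)
print("Estimated sources shape:", S_estimated.shape)

Note that ICA recovers the sources only up to ordering and scaling.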

Choosing the Right Dimensionality Reduction Technique

Technique    | Supervised | Linear/Non-Linear | Use Case
-------------|------------|-------------------|---------------------------------
PCA          | No         | Linear            | General dimensionality reduction
LDA          | Yes        | Linear            | Classification tasks
t-SNE        | No         | Non-Linear        | Data visualization
Autoencoders | No         | Non-Linear        | Complex data, unsupervised
ICA          | No         | Linear            | Signal separation

Best Practices

  • Standardize your data before applying PCA or LDA to ensure that all features contribute equally.
  • Use PCA for data exploration and noise reduction.
  • Use t-SNE for visualizing clusters and embeddings, not as a preprocessing step for predictive models.
  • Evaluate reconstruction error when using autoencoders.
  • Avoid using dimensionality reduction blindly in production without evaluating the impact on model performance.
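As a concrete way to follow the last point, the sketch below compares cross-validated accuracy with and without PCA on the same dataset; the classifier, the number of components, and the number of folds are arbitrary choices.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Baseline: scale the 30 original features and classify
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Reduced: scale, project onto 5 principal components, then classify
reduced = make_pipeline(StandardScaler(), PCA(n_components=5),
                        LogisticRegression(max_iter=1000))

print("Baseline CV accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("With PCA CV accuracy:", cross_val_score(reduced, X, y, cv=5).mean())

If the reduced pipeline performs noticeably worse, the dropped dimensions carried useful signal and the reduction should be reconsidered.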