Dimensionality reduction is the process of reducing the number of input variables or features in a dataset. It is a crucial preprocessing step in machine learning, especially when dealing with high-dimensional data. Reducing the number of features can help improve model performance, reduce computational cost, and combat the curse of dimensionality.
Dimensionality reduction techniques fall into two major categories:

- Feature Selection: keeping a subset of the original features (e.g., filter, wrapper, and embedded methods).
- Feature Extraction: transforming the data into a new, lower-dimensional set of features.

In this tutorial, we focus on Feature Extraction techniques.
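To make the distinction concrete, here is a minimal sketch using scikit-learn (the k=5 and n_components=5 choices are arbitrary, for illustration only): feature selection keeps a subset of the original columns, while feature extraction builds new columns as combinations of all of them.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: keep the 5 original features most associated with y
X_selected = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature extraction: derive 5 new features as combinations of all 30
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (569, 5)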
Principal Component Analysis (PCA) is a linear technique that projects data onto a lower-dimensional subspace such that the variance of the projected data is maximized. It finds the axes (principal components) along which the data varies the most.
Best used when you want to reduce dimensionality for continuous variables and when you suspect multicollinearity among features.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
# Step 1: Standardize the data
# PCA is affected by the scale of features, so standardization is important
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply PCA to reduce dimensions to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Step 3: Explained variance ratio
print("Explained Variance Ratio for each component:")
print(pca.explained_variance_ratio_)
# Step 4: Visualize the reduced feature space
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
pca_df['Target'] = y
plt.figure(figsize=(8, 6))
for label, target_name in enumerate(target_names):
    plt.scatter(
        pca_df[pca_df['Target'] == label]['PC1'],
        pca_df[pca_df['Target'] == label]['PC2'],
        label=target_name,
        alpha=0.7
    )
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Breast Cancer Dataset')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig("pca_plot.png") # Save the figureOutput -
Explained Variance Ratio for each component:
[0.44272026 0.18971182]StandardScaler is applied.explained_variance_ratio_ shows how much variance is retained by each principal component.LDA is a supervised dimensionality reduction technique that seeks to find a feature space that maximizes class separability. It projects data in a way that maximizes the distance between classes and minimizes the variation within each class.
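In practice, two components are often kept only for plotting. A common follow-up (a minimal sketch reusing X_scaled from above; the 95% threshold is an arbitrary illustrative choice) is to let scikit-learn pick however many components are needed to retain a target fraction of the variance:

# Passing a float in (0, 1) to n_components keeps enough components
# to explain that fraction of the total variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Components kept for 95% variance:", pca_95.n_components_)
print("Cumulative variance retained:", pca_95.explained_variance_ratio_.sum())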
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that seeks a feature space maximizing class separability: it projects the data so that the distance between classes is maximized while the variation within each class is minimized.

Best used when the goal is to improve classification performance by reducing dimensions while maintaining class separation.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 2: Apply LDA to reduce to 1 component (LDA allows at most n_classes - 1 = 1 for binary classification)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)
# Step 3: Add target labels for visualization
lda_df = pd.DataFrame(X_lda, columns=['LDA1'])
lda_df['Target'] = y
# Step 4: Visualize the distribution along the LDA component
plt.figure(figsize=(8, 4))
for label, name in enumerate(target_names):
    plt.hist(
        lda_df[lda_df['Target'] == label]['LDA1'],
        bins=30,
        alpha=0.6,
        label=name
    )
plt.title('LDA Projection - Breast Cancer Dataset')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)
plt.tight_layout()
# Save the plot instead of showing if in headless environment
plt.savefig("lda_projection.png")
# plt.show()  # Uncomment this line if running in an interactive environment

LDA uses the class labels (y) to find a projection that maximizes separation between classes and minimizes intra-class variance. Because LDA yields at most n_classes - 1 components, this binary problem is reduced to a single discriminant axis.
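Since the stated goal is better classification, one way to check the claim is to feed the LDA projection into a simple classifier. The sketch below is illustrative (the LogisticRegression choice and cv=5 are assumptions, not part of the original tutorial):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Scale, project to 1 LDA component, then classify; cross-validate the whole chain
lda_pipe = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(max_iter=1000)
)
scores = cross_val_score(lda_pipe, X, y, cv=5)
print("Mean CV accuracy on LDA features:", scores.mean())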
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear technique particularly well-suited for visualizing high-dimensional data in 2D or 3D space. It preserves local structure by modeling similar data points close together in the lower-dimensional space.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import pandas as pd
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names
# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Apply t-SNE to reduce dimensions to 2
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
# Step 4: Convert to DataFrame for visualization
tsne_df = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'])
tsne_df['Target'] = y
# Step 5: Plot the 2D t-SNE output
plt.figure(figsize=(8, 6))
for label, name in enumerate(target_names):
    subset = tsne_df[tsne_df['Target'] == label]
    plt.scatter(subset['TSNE1'], subset['TSNE2'], label=name, alpha=0.6)
plt.title("t-SNE Projection of Breast Cancer Dataset")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.grid(True)
plt.tight_layout()
# Save plot for non-interactive environments
plt.savefig("tsne_projection.png")
# plt.show()  # Uncomment if running interactively

t-SNE is stochastic, so the embedding can differ between runs unless random_state is set.
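The embedding is also sensitive to perplexity, which roughly controls how many neighbors each point considers. A quick way to see this (an illustrative sketch reusing X_scaled and y from above; the perplexity values 5, 30, and 50 are arbitrary) is to plot a few settings side by side:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    # Re-fit t-SNE at each perplexity and plot the resulting embedding
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X_scaled)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='coolwarm', alpha=0.6, s=10)
    ax.set_title(f"perplexity={perp}")
fig.tight_layout()
fig.savefig("tsne_perplexity_comparison.png")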
Autoencoders are a type of neural network trained to compress input data into a lower-dimensional representation and then reconstruct it back to the original data. The compressed representation (bottleneck layer) is the reduced feature space. A typical architecture consists of:

- Input layer
- Encoder layers that progressively compress the input
- Bottleneck layer holding the reduced representation
- Decoder layers that progressively reconstruct the input
- Output layer matching the input dimensions

Autoencoders are useful for reducing dimensions in complex, non-linear datasets like images or sensor data.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from keras.models import Model
from keras.layers import Input, Dense
import matplotlib.pyplot as plt
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Step 2: Normalize input data (autoencoders work best with scaled data)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Define Autoencoder architecture
input_layer = Input(shape=(X_scaled.shape[1],))
# Encoder
encoded = Dense(64, activation='relu')(input_layer)
encoded = Dense(32, activation='relu')(encoded)
bottleneck = Dense(2, activation='linear')(encoded)
# Decoder
decoded = Dense(32, activation='relu')(bottleneck)
decoded = Dense(64, activation='relu')(decoded)
output_layer = Dense(X_scaled.shape[1], activation='sigmoid')(decoded)
# Step 4: Build and compile autoencoder
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='mse')
# Step 5: Train the autoencoder
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, shuffle=True, verbose=0)
# Step 6: Extract encoder model to get reduced features
encoder = Model(inputs=input_layer, outputs=bottleneck)
X_reduced = encoder.predict(X_scaled)
# Step 7: Visualize the 2D reduced representation
plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.title("2D Representation from Autoencoder Bottleneck")
plt.xlabel("Encoded Feature 1")
plt.ylabel("Encoded Feature 2")
plt.colorbar(label='Target')
plt.grid(True)
plt.tight_layout()
plt.savefig("autoencoder_projection.png") # Use plt.show() if running interactivelyOutput -
Independent Component Analysis (ICA) separates a multivariate signal into additive, non-Gaussian components that are statistically independent of one another. It is most often used in signal processing, where the observed data are mixtures of independent sources.
Commonly used in audio signal separation, image processing, or other cases where independent source signals are mixed.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Step 1: Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Apply FastICA
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X_scaled)
# Step 4: Visualize the reduced data
plt.figure(figsize=(8, 6))
plt.scatter(X_ica[:, 0], X_ica[:, 1], c=y, cmap='plasma', alpha=0.7)
plt.title("2D Representation using FastICA")
plt.xlabel("Independent Component 1")
plt.ylabel("Independent Component 2")
plt.grid(True)
plt.colorbar(label='Target')
plt.tight_layout()
plt.savefig("fastica_projection.png") # Use plt.show() if running interactivelyOutput -
| Technique | Supervised | Linear/Non-Linear | Use Case |
|---|---|---|---|
| PCA | No | Linear | General dimensionality reduction |
| LDA | Yes | Linear | Classification tasks |
| t-SNE | No | Non-Linear | Data visualization |
| Autoencoders | No | Non-Linear | Complex data, unsupervised |
| ICA | No | Linear | Signal separation |