Dimensionality Reduction Techniques: Complete Guide with Python
Introduction
High-dimensional data is everywhere in machine learning - from image pixels to text features to sensor readings. Dimensionality reduction techniques help us visualize, understand, and process this data by reducing the number of features while preserving important information.
This comprehensive guide covers the most important dimensionality reduction techniques with practical Python implementations and real-world applications.
Why Dimensionality Reduction?
The Curse of Dimensionality
High-dimensional data creates several challenges (the short demo after this list makes the distance problem concrete):
- Data requirements grow exponentially with the number of dimensions
- Distance metrics lose meaning (all points appear nearly equidistant)
- Visualization becomes difficult (we cannot directly plot more than 3 dimensions)
- Computational complexity increases dramatically
- Overfitting becomes more likely
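The distance problem is easy to verify empirically. Below is a minimal sketch (plain NumPy and SciPy; the point count and dimensions are arbitrary choices for illustration) that compares the smallest and largest pairwise distances among random points. As dimensionality grows, the ratio approaches 1, meaning all points look almost equally far apart.
import numpy as np
from scipy.spatial.distance import pdist
rng = np.random.default_rng(42)
for dim in [2, 10, 100, 1000]:
    # 200 random points in the unit hypercube of the given dimension
    points = rng.random((200, dim))
    dists = pdist(points)  # condensed vector of pairwise Euclidean distances
    # In high dimensions the spread of distances collapses: the ratio approaches 1
    print(f"dim={dim:5d}  min/max distance ratio: {dists.min() / dists.max():.3f}")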
Benefits of Dimensionality Reduction
- Visualization: Plot high-dimensional data in 2D/3D
- Storage: Reduce memory and disk requirements (see the compression sketch after this list)
- Speed: Faster training and inference
- Noise reduction: Filter out irrelevant features
- Feature engineering: Create meaningful representations
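As a quick illustration of the storage and noise-reduction benefits, the sketch below compresses the 64-pixel digit images from scikit-learn's digits dataset and reconstructs them. The choice of 16 components is arbitrary and only meant to show the trade-off between size and retained information.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
X = load_digits().data  # 1797 samples x 64 pixel features
# Keep 16 of 64 dimensions (an arbitrary choice for illustration)
pca = PCA(n_components=16, random_state=42)
X_reduced = pca.fit_transform(X)               # compressed representation
X_restored = pca.inverse_transform(X_reduced)  # back to 64 pixel values
print(f"Compression: {X.shape[1]} -> {X_reduced.shape[1]} features")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
print(f"Mean squared reconstruction error: {np.mean((X - X_restored) ** 2):.3f}")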
Principal Component Analysis (PCA)
Mathematical Foundation
PCA finds the orthogonal directions (principal components) along which the data varies most. Concretely, it eigendecomposes the covariance matrix of the centered data: the eigenvector with the largest eigenvalue is the first principal component, and each eigenvalue's share of the total gives that component's explained variance ratio.
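In symbols, with $X_c$ the centered data matrix of $n$ samples (the same conventions as the from-scratch implementation below):

$$C = \frac{1}{n-1} X_c^\top X_c, \qquad C v_i = \lambda_i v_i, \qquad \text{explained variance ratio}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$

Projecting onto the top $k$ eigenvectors, $X_c [v_1, \dots, v_k]$, gives the $k$-dimensional representation.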
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_breast_cancer, make_classification
from sklearn.decomposition import PCA, KernelPCA, FactorAnalysis, FastICA
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
class PCAAnalyzer:
"""Comprehensive PCA analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def pca_from_scratch(self, X: np.ndarray, n_components: Optional[int] = None) -> Dict:
"""PCA implementation from scratch"""
# Center the data
X_centered = X - np.mean(X, axis=0)
# Compute covariance matrix
cov_matrix = np.cov(X_centered.T)
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort by eigenvalues (descending)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Select components
if n_components is None:
n_components = len(eigenvalues)
selected_components = eigenvectors[:, :n_components]
selected_eigenvalues = eigenvalues[:n_components]
# Transform data
X_transformed = np.dot(X_centered, selected_components)
# Calculate explained variance ratio
explained_variance_ratio = selected_eigenvalues / np.sum(eigenvalues)
return {
'components': selected_components,
'eigenvalues': selected_eigenvalues,
'explained_variance_ratio': explained_variance_ratio,
'transformed_data': X_transformed,
'mean': np.mean(X, axis=0)
}
def comprehensive_pca_analysis(self, X: np.ndarray, y: np.ndarray = None,
feature_names: List[str] = None) -> Dict:
"""Comprehensive PCA analysis"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Explained variance analysis
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Find components for different variance thresholds
variance_thresholds = [0.8, 0.9, 0.95, 0.99]
components_for_threshold = {}
for threshold in variance_thresholds:
n_components = np.argmax(cumulative_variance >= threshold) + 1
components_for_threshold[threshold] = n_components
# Compare with our implementation
our_pca = self.pca_from_scratch(X_scaled, n_components=min(10, X_scaled.shape[1]))
# Component loadings (if feature names provided)
loadings = None
if feature_names is not None:
n_show = min(5, len(pca_full.components_))
loadings = pd.DataFrame(
pca_full.components_[:n_show].T,
columns=[f'PC{i+1}' for i in range(n_show)],
index=feature_names
)
return {
'pca_full': pca_full,
'explained_variance_ratio': explained_variance_ratio,
'cumulative_variance': cumulative_variance,
'components_for_threshold': components_for_threshold,
'our_implementation': our_pca,
'scaler': scaler,
'loadings': loadings
}
def plot_pca_analysis(self, results: Dict, X: np.ndarray, y: np.ndarray = None):
"""Plot comprehensive PCA analysis"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Plot 1: Explained variance ratio
n_components = min(20, len(results['explained_variance_ratio']))
axes[0, 0].bar(range(1, n_components + 1), results['explained_variance_ratio'][:n_components],
alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('Principal Component', fontsize=12)
axes[0, 0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0, 0].set_title('Individual Component Variance', fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
# Plot 2: Cumulative explained variance
axes[0, 1].plot(range(1, len(results['cumulative_variance']) + 1),
results['cumulative_variance'], 'o-', linewidth=2, markersize=4)
# Add threshold lines
thresholds = [0.8, 0.9, 0.95, 0.99]
colors = ['red', 'orange', 'green', 'blue']
for threshold, color in zip(thresholds, colors):
axes[0, 1].axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
label=f'{threshold*100:.0f}% variance')
n_comp = results['components_for_threshold'][threshold]
axes[0, 1].axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
axes[0, 1].text(n_comp + 1, threshold + 0.01, f'{n_comp}',
color=color, fontweight='bold')
axes[0, 1].set_xlabel('Number of Components', fontsize=12)
axes[0, 1].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[0, 1].set_title('Cumulative Variance Explained', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Component loadings (if available)
if results['loadings'] is not None:
# Plot loadings heatmap
sns.heatmap(results['loadings'].T, annot=True, cmap='RdBu_r', center=0,
ax=axes[1, 0], cbar_kws={'shrink': 0.8})
axes[1, 0].set_title('Component Loadings', fontweight='bold')
axes[1, 0].set_xlabel('Features')
axes[1, 0].set_ylabel('Principal Components')
else:
axes[1, 0].text(0.5, 0.5, 'Feature names not provided\nfor loadings analysis',
ha='center', va='center', transform=axes[1, 0].transAxes,
fontsize=12, bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgray"))
axes[1, 0].set_title('Component Loadings', fontweight='bold')
# Plot 4: 2D projection (if labels available)
if y is not None:
# Transform to 2D
pca_2d = PCA(n_components=2, random_state=42)
X_2d = pca_2d.fit_transform(results['scaler'].transform(X))
scatter = axes[1, 1].scatter(X_2d[:, 0], X_2d[:, 1], c=y,
cmap='viridis', alpha=0.6, s=50)
axes[1, 1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.3f})', fontsize=12)
axes[1, 1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.3f})', fontsize=12)
axes[1, 1].set_title('2D PCA Projection', fontweight='bold')
plt.colorbar(scatter, ax=axes[1, 1])
else:
# Show comparison with our implementation
our_var_ratio = results['our_implementation']['explained_variance_ratio']
sklearn_var_ratio = results['explained_variance_ratio'][:len(our_var_ratio)]
x_pos = np.arange(len(our_var_ratio))
width = 0.35
bars1 = axes[1, 1].bar(x_pos - width/2, our_var_ratio, width,
label='Our Implementation', alpha=0.7)
bars2 = axes[1, 1].bar(x_pos + width/2, sklearn_var_ratio, width,
label='Scikit-learn', alpha=0.7)
axes[1, 1].set_xlabel('Component', fontsize=12)
axes[1, 1].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[1, 1].set_title('Implementation Comparison', fontweight='bold')
axes[1, 1].set_xticks(x_pos)
axes[1, 1].set_xticklabels([f'PC{i+1}' for i in range(len(our_var_ratio))])
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print summary
print("\nPCA Analysis Summary:")
print("-" * 40)
for threshold in [0.9, 0.95]:
n_comp = results['components_for_threshold'][threshold]
print(f"Components for {threshold*100:.0f}% variance: {n_comp}")
# Load datasets for analysis
print("Loading datasets...")
# Digits dataset
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
# Breast cancer dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target
# Analyze PCA on digits dataset
print("\n" + "="*60)
print("PCA Analysis on Digits Dataset")
print("="*60)
pca_analyzer = PCAAnalyzer()
digits_results = pca_analyzer.comprehensive_pca_analysis(X_digits, y_digits)
pca_analyzer.plot_pca_analysis(digits_results, X_digits, y_digits)
# Analyze PCA on cancer dataset with feature names
print("\n" + "="*60)
print("PCA Analysis on Breast Cancer Dataset")
print("="*60)
cancer_results = pca_analyzer.comprehensive_pca_analysis(X_cancer, y_cancer,
list(cancer.feature_names))
pca_analyzer.plot_pca_analysis(cancer_results, X_cancer, y_cancer)
Advanced PCA Variants
Kernel PCA and Incremental PCA
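Kernel PCA applies the same eigendecomposition in an implicit feature space defined by a kernel function $k(x, x')$, which lets it capture nonlinear structure. The gamma values used in the code below control the width of the RBF and sigmoid kernels; for reference, the RBF kernel is:

$$k(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^2\right)$$

Incremental PCA, covered afterwards, fits ordinary PCA on mini-batches so that datasets too large to hold in memory can still be processed.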
class AdvancedPCAAnalyzer:
"""Advanced PCA techniques analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def kernel_pca_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze different Kernel PCA variants"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Different kernels
kernels = {
'linear': {'kernel': 'linear'},
'polynomial': {'kernel': 'poly', 'degree': 3},
'rbf': {'kernel': 'rbf', 'gamma': 0.1},
'sigmoid': {'kernel': 'sigmoid', 'gamma': 0.1}
}
results = {}
for kernel_name, kernel_params in kernels.items():
print(f"Analyzing {kernel_name} kernel...")
try:
# Fit Kernel PCA
kpca = KernelPCA(n_components=2, random_state=self.random_state,
**kernel_params)
X_kpca = kpca.fit_transform(X_scaled)
results[kernel_name] = {
'model': kpca,
'transformed_data': X_kpca,
'eigenvalues': kpca.eigenvalues_ if hasattr(kpca, 'eigenvalues_') else None
}
except Exception as e:
print(f"Error with {kernel_name} kernel: {str(e)}")
continue
return results
def plot_kernel_pca_comparison(self, results: Dict, y: np.ndarray):
"""Plot Kernel PCA comparison"""
n_kernels = len(results)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for idx, (kernel_name, data) in enumerate(results.items()):
if idx >= 4: # Only plot first 4
break
X_transformed = data['transformed_data']
scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1],
c=y, cmap='viridis', alpha=0.6, s=50)
axes[idx].set_title(f'{kernel_name.capitalize()} Kernel PCA', fontweight='bold')
axes[idx].set_xlabel('First Component')
axes[idx].set_ylabel('Second Component')
axes[idx].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[idx])
plt.tight_layout()
plt.show()
def incremental_pca_analysis(self, X: np.ndarray, batch_sizes: List[int] = None) -> Dict:
"""Analyze Incremental PCA for large datasets"""
if batch_sizes is None:
batch_sizes = [50, 100, 200, 500]
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Regular PCA for comparison
pca_regular = PCA(n_components=10)
pca_regular.fit(X_scaled)
results = {'regular_pca': pca_regular}
for batch_size in batch_sizes:
if batch_size < X_scaled.shape[0]:
print(f"Analyzing Incremental PCA with batch size {batch_size}...")
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=10, batch_size=batch_size)
ipca.fit(X_scaled)
results[f'batch_{batch_size}'] = ipca
return results
def plot_incremental_pca_comparison(self, results: Dict):
"""Plot Incremental PCA comparison"""
# Compare explained variance ratios
methods = list(results.keys())
n_components = 10
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot explained variance ratios
for method, pca_model in results.items():
if hasattr(pca_model, 'explained_variance_ratio_'):
variance_ratio = pca_model.explained_variance_ratio_
ax1.plot(range(1, len(variance_ratio) + 1), variance_ratio,
'o-', linewidth=2, label=method.replace('_', ' ').title())
ax1.set_xlabel('Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Explained Variance Comparison', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot cumulative variance
for method, pca_model in results.items():
if hasattr(pca_model, 'explained_variance_ratio_'):
cumulative_variance = np.cumsum(pca_model.explained_variance_ratio_)
ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance,
'o-', linewidth=2, label=method.replace('_', ' ').title())
ax2.set_xlabel('Component', fontsize=12)
ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
ax2.set_title('Cumulative Variance Comparison', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Analyze advanced PCA techniques
advanced_pca = AdvancedPCAAnalyzer()
print("\n" + "="*60)
print("Kernel PCA Analysis")
print("="*60)
# Use a smaller subset for Kernel PCA (computationally expensive)
X_sample = X_digits[:500]
y_sample = y_digits[:500]
kernel_results = advanced_pca.kernel_pca_analysis(X_sample, y_sample)
advanced_pca.plot_kernel_pca_comparison(kernel_results, y_sample)
print("\n" + "="*60)
print("Incremental PCA Analysis")
print("="*60)
incremental_results = advanced_pca.incremental_pca_analysis(X_digits)
advanced_pca.plot_incremental_pca_comparison(incremental_results)
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE Implementation and Analysis
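t-SNE converts pairwise distances into probabilities: a Gaussian kernel (whose bandwidth is set by the perplexity parameter) in the original space, and a heavy-tailed Student-t kernel in the low-dimensional embedding. It then minimizes the Kullback-Leibler divergence between the two distributions, which is the kl_divergence_ value reported in the code below:

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad q_{ij} \propto \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$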
class TSNEAnalyzer:
"""t-SNE analyzer for nonlinear dimensionality reduction"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def tsne_parameter_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze t-SNE with different parameters"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Use subset for faster computation
if X_scaled.shape[0] > 1000:
indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
X_subset = X_scaled[indices]
y_subset = y[indices]
else:
X_subset = X_scaled
y_subset = y
# Different perplexity values
perplexities = [5, 15, 30, 50]
results = {}
for perplexity in perplexities:
print(f"Running t-SNE with perplexity={perplexity}...")
tsne = TSNE(n_components=2, perplexity=perplexity,
random_state=self.random_state, n_iter=1000)
X_tsne = tsne.fit_transform(X_subset)
results[f'perplexity_{perplexity}'] = {
'transformed_data': X_tsne,
'perplexity': perplexity,
'kl_divergence': tsne.kl_divergence_
}
return results, y_subset
def plot_tsne_analysis(self, results: Dict, y: np.ndarray):
"""Plot t-SNE analysis results"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for idx, (param_name, data) in enumerate(results.items()):
X_transformed = data['transformed_data']
perplexity = data['perplexity']
kl_div = data['kl_divergence']
scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1],
c=y, cmap='tab10', alpha=0.7, s=50)
axes[idx].set_title(f't-SNE (perplexity={perplexity})\nKL divergence: {kl_div:.2f}',
fontweight='bold')
axes[idx].set_xlabel('t-SNE 1')
axes[idx].set_ylabel('t-SNE 2')
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def compare_with_pca(self, X: np.ndarray, y: np.ndarray):
"""Compare t-SNE with PCA"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Use subset for faster computation
if X_scaled.shape[0] > 1000:
indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
X_subset = X_scaled[indices]
y_subset = y[indices]
else:
X_subset = X_scaled
y_subset = y
# PCA
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_subset)
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
X_tsne = tsne.fit_transform(X_subset)
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# PCA plot
scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y_subset,
cmap='tab10', alpha=0.7, s=50)
ax1.set_title(f'PCA\nExplained Variance: {pca.explained_variance_ratio_.sum():.3f}',
fontweight='bold')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
ax1.grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=ax1)
# t-SNE plot
scatter2 = ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset,
cmap='tab10', alpha=0.7, s=50)
ax2.set_title(f't-SNE\nKL Divergence: {tsne.kl_divergence_:.2f}', fontweight='bold')
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
ax2.grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=ax2)
plt.tight_layout()
plt.show()
# Analyze t-SNE
tsne_analyzer = TSNEAnalyzer()
print("\n" + "="*60)
print("t-SNE Parameter Analysis")
print("="*60)
tsne_results, y_tsne = tsne_analyzer.tsne_parameter_analysis(X_digits, y_digits)
tsne_analyzer.plot_tsne_analysis(tsne_results, y_tsne)
print("\n" + "="*60)
print("t-SNE vs PCA Comparison")
print("="*60)
tsne_analyzer.compare_with_pca(X_digits, y_digits)
UMAP (Uniform Manifold Approximation and Projection)
UMAP Analysis
class UMAPAnalyzer:
"""UMAP analyzer for fast nonlinear dimensionality reduction"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def umap_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze UMAP with different parameters"""
try:
import umap
except ImportError:
print("UMAP not installed. Install with: pip install umap-learn")
return self._simulate_umap_results(X, y)
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Different parameter combinations
parameter_combinations = [
{'n_neighbors': 15, 'min_dist': 0.1},
{'n_neighbors': 15, 'min_dist': 0.5},
{'n_neighbors': 50, 'min_dist': 0.1},
{'n_neighbors': 50, 'min_dist': 0.5}
]
results = {}
for i, params in enumerate(parameter_combinations):
print(f"Running UMAP with n_neighbors={params['n_neighbors']}, min_dist={params['min_dist']}...")
reducer = umap.UMAP(n_components=2, random_state=self.random_state, **params)
X_umap = reducer.fit_transform(X_scaled)
param_key = f"n{params['n_neighbors']}_d{params['min_dist']}"
results[param_key] = {
'transformed_data': X_umap,
'params': params,
'model': reducer
}
return results
def _simulate_umap_results(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Simulate UMAP results when package not available"""
print("Simulating UMAP results (install umap-learn for real results)")
# Use t-SNE as approximation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
if X_scaled.shape[0] > 1000:
indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
X_subset = X_scaled[indices]
else:
X_subset = X_scaled
tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
X_sim = tsne.fit_transform(X_subset)
# Add some noise to simulate different parameters
results = {}
for i, (n_neighbors, min_dist) in enumerate([(15, 0.1), (15, 0.5), (50, 0.1), (50, 0.5)]):
noise_scale = 0.1 * (i + 1)
X_noisy = X_sim + np.random.normal(0, noise_scale, X_sim.shape)
param_key = f"n{n_neighbors}_d{min_dist}"
results[param_key] = {
'transformed_data': X_noisy,
'params': {'n_neighbors': n_neighbors, 'min_dist': min_dist},
'model': None
}
return results
def plot_umap_analysis(self, results: Dict, y: np.ndarray):
"""Plot UMAP analysis results"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for idx, (param_key, data) in enumerate(results.items()):
X_transformed = data['transformed_data']
params = data['params']
# Handle potential size mismatch
y_plot = y[:len(X_transformed)] if len(y) > len(X_transformed) else y
scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1],
c=y_plot, cmap='tab10', alpha=0.7, s=50)
title = f"UMAP\nn_neighbors={params['n_neighbors']}, min_dist={params['min_dist']}"
axes[idx].set_title(title, fontweight='bold')
axes[idx].set_xlabel('UMAP 1')
axes[idx].set_ylabel('UMAP 2')
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def compare_methods(self, X: np.ndarray, y: np.ndarray):
"""Compare PCA, t-SNE, and UMAP"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Use subset for faster computation
if X_scaled.shape[0] > 1000:
indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
X_subset = X_scaled[indices]
y_subset = y[indices]
else:
X_subset = X_scaled
y_subset = y
# PCA
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_subset)
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
X_tsne = tsne.fit_transform(X_subset)
# UMAP (or simulation)
try:
import umap
reducer = umap.UMAP(n_components=2, random_state=self.random_state)
X_umap = reducer.fit_transform(X_subset)
umap_available = True
except ImportError:
# Use t-SNE with different parameters as simulation
tsne_sim = TSNE(n_components=2, perplexity=15, random_state=self.random_state + 1)
X_umap = tsne_sim.fit_transform(X_subset)
umap_available = False
# Plot comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# PCA
scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_subset,
cmap='tab10', alpha=0.7, s=50)
axes[0].set_title(f'PCA\nLinear Projection', fontweight='bold')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].grid(True, alpha=0.3)
# t-SNE
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset,
cmap='tab10', alpha=0.7, s=50)
axes[1].set_title('t-SNE\nNonlinear (Local)', fontweight='bold')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
axes[1].grid(True, alpha=0.3)
# UMAP
scatter3 = axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y_subset,
cmap='tab10', alpha=0.7, s=50)
title = 'UMAP\nNonlinear (Local + Global)' if umap_available else 'UMAP (Simulated)\nNonlinear'
axes[2].set_title(title, fontweight='bold')
axes[2].set_xlabel('UMAP 1')
axes[2].set_ylabel('UMAP 2')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Analyze UMAP
umap_analyzer = UMAPAnalyzer()
print("\n" + "="*60)
print("UMAP Parameter Analysis")
print("="*60)
umap_results = umap_analyzer.umap_analysis(X_digits, y_digits)
umap_analyzer.plot_umap_analysis(umap_results, y_digits)
print("\n" + "="*60)
print("Method Comparison: PCA vs t-SNE vs UMAP")
print("="*60)
umap_analyzer.compare_methods(X_digits, y_digits)
Linear Discriminant Analysis (LDA)
LDA for Supervised Dimensionality Reduction
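Unlike PCA, LDA uses the class labels: it looks for projection directions $w$ that maximize between-class scatter relative to within-class scatter (the Fisher criterion):

$$J(w) = \frac{w^\top S_B \, w}{w^\top S_W \, w}$$

where $S_B$ and $S_W$ are the between-class and within-class scatter matrices. At most (number of classes - 1) discriminant directions exist, which is why the 10-class digits data yields at most 9 LDA components.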
class LDAAnalyzer:
"""Linear Discriminant Analysis analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def lda_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Comprehensive LDA analysis"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit LDA
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_scaled, y)
# Calculate explained variance ratio
explained_variance_ratio = lda.explained_variance_ratio_
return {
'model': lda,
'transformed_data': X_lda,
'explained_variance_ratio': explained_variance_ratio,
'classes': lda.classes_,
'means': lda.means_,
'scaler': scaler
}
def compare_lda_pca(self, X: np.ndarray, y: np.ndarray):
"""Compare LDA with PCA"""
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# LDA (supervised)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
# PCA (unsupervised)
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_scaled)
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# PCA plot
scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
cmap='tab10', alpha=0.7, s=50)
ax1.set_title(f'PCA (Unsupervised)\nExplained Variance: {pca.explained_variance_ratio_.sum():.3f}',
fontweight='bold')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')
ax1.grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=ax1)
# LDA plot
scatter2 = ax2.scatter(X_lda[:, 0], X_lda[:, 1], c=y,
cmap='tab10', alpha=0.7, s=50)
lda_variance = lda.explained_variance_ratio_.sum() if hasattr(lda, 'explained_variance_ratio_') else 'N/A'
ax2.set_title(f'LDA (Supervised)\nClass Separation Optimized', fontweight='bold')
ax2.set_xlabel('LD1')
ax2.set_ylabel('LD2')
ax2.grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=ax2)
plt.tight_layout()
plt.show()
return {'lda': lda, 'pca': pca}
def evaluate_classification_performance(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Evaluate classification performance with different dimensionality reduction"""
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=self.random_state, stratify=y
)
# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
methods = {}
# Original features
rf_original = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
rf_original.fit(X_train_scaled, y_train)
acc_original = accuracy_score(y_test, rf_original.predict(X_test_scaled))
methods['Original'] = {'accuracy': acc_original, 'n_features': X_train_scaled.shape[1]}
# PCA
for n_comp in [10, 20, 50]:
if n_comp < X_train_scaled.shape[1]:
pca = PCA(n_components=n_comp, random_state=self.random_state)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
rf_pca = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
rf_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, rf_pca.predict(X_test_pca))
methods[f'PCA_{n_comp}'] = {'accuracy': acc_pca, 'n_features': n_comp}
# LDA
n_lda_comp = min(len(np.unique(y)) - 1, X_train_scaled.shape[1])
if n_lda_comp > 0:
lda = LinearDiscriminantAnalysis(n_components=n_lda_comp)
X_train_lda = lda.fit_transform(X_train_scaled, y_train)
X_test_lda = lda.transform(X_test_scaled)
rf_lda = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
rf_lda.fit(X_train_lda, y_train)
acc_lda = accuracy_score(y_test, rf_lda.predict(X_test_lda))
methods['LDA'] = {'accuracy': acc_lda, 'n_features': n_lda_comp}
return methods
def plot_classification_performance(self, results: Dict):
"""Plot classification performance comparison"""
methods = list(results.keys())
accuracies = [results[method]['accuracy'] for method in methods]
n_features = [results[method]['n_features'] for method in methods]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Accuracy comparison
bars1 = ax1.bar(methods, accuracies, alpha=0.7, color='steelblue')
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Classification Accuracy by Method', fontweight='bold')
ax1.set_xticklabels(methods, rotation=45, ha='right')
ax1.grid(True, alpha=0.3)
# Add value labels
for bar, acc in zip(bars1, accuracies):
ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
# Feature count comparison
bars2 = ax2.bar(methods, n_features, alpha=0.7, color='orange')
ax2.set_ylabel('Number of Features', fontsize=12)
ax2.set_title('Feature Count by Method', fontweight='bold')
ax2.set_xticklabels(methods, rotation=45, ha='right')
ax2.grid(True, alpha=0.3)
ax2.set_yscale('log')
# Add value labels
for bar, n_feat in zip(bars2, n_features):
ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() * 1.1,
f'{n_feat}', ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()
# Analyze LDA
lda_analyzer = LDAAnalyzer()
print("\n" + "="*60)
print("LDA vs PCA Comparison")
print("="*60)
lda_pca_models = lda_analyzer.compare_lda_pca(X_digits, y_digits)
print("\n" + "="*60)
print("Classification Performance Comparison")
print("="*60)
classification_results = lda_analyzer.evaluate_classification_performance(X_digits, y_digits)
lda_analyzer.plot_classification_performance(classification_results)
# Print detailed results
print("\nDetailed Classification Results:")
print("-" * 50)
for method, metrics in classification_results.items():
print(f"{method:12}: Accuracy={metrics['accuracy']:.4f}, Features={metrics['n_features']}")
Method Comparison and Guidelines
Comprehensive Comparison
class DimensionalityReductionComparison:
"""Comprehensive comparison of dimensionality reduction methods"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def create_comparison_table(self):
"""Create comparison table of methods"""
comparison_data = {
'Method': ['PCA', 'Kernel PCA', 't-SNE', 'UMAP', 'LDA', 'ICA', 'Factor Analysis'],
'Type': ['Linear', 'Nonlinear', 'Nonlinear', 'Nonlinear', 'Linear', 'Linear', 'Linear'],
'Supervised': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No'],
'Preserves': ['Global variance', 'Global structure', 'Local structure', 'Local+Global', 'Class separation', 'Independence', 'Latent factors'],
'Best For': ['Visualization, preprocessing', 'Nonlinear patterns', 'Visualization', 'Fast visualization', 'Classification', 'Signal separation', 'Factor models'],
'Scalability': ['Excellent', 'Poor', 'Poor', 'Good', 'Good', 'Good', 'Good'],
'Deterministic': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No']
}
comparison_df = pd.DataFrame(comparison_data)
# Display table
print("\nDimensionality Reduction Methods Comparison:")
print("=" * 80)
print(comparison_df.to_string(index=False))
return comparison_df
def method_selection_guide(self):
"""Provide method selection guidelines"""
guidelines = {
"Data Visualization": {
"Small datasets (<1000 samples)": "t-SNE",
"Large datasets (>10000 samples)": "UMAP or PCA",
"Need deterministic results": "PCA"
},
"Preprocessing for ML": {
"Linear relationships": "PCA",
"Nonlinear relationships": "Kernel PCA",
"Classification tasks": "LDA",
"Noise reduction": "PCA or Factor Analysis"
},
"Specific Applications": {
"Image compression": "PCA",
"Signal processing": "ICA",
"Clustering": "PCA or UMAP",
"Anomaly detection": "PCA or Kernel PCA"
}
}
print("\n\nMethod Selection Guidelines:")
print("=" * 50)
for category, subcategories in guidelines.items():
print(f"\n{category}:")
for use_case, method in subcategories.items():
print(f" " {use_case}: {method}")
def performance_vs_quality_analysis(self, X: np.ndarray, y: np.ndarray):
"""Analyze performance vs quality trade-offs"""
import time
# Use subset for consistent timing
if X.shape[0] > 1000:
indices = np.random.choice(X.shape[0], 1000, replace=False)
X_subset = X[indices]
y_subset = y[indices]
else:
X_subset = X
y_subset = y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_subset)
methods = {}
# PCA
start_time = time.time()
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
methods['PCA'] = {'time': pca_time, 'deterministic': True, 'scalable': True}
# t-SNE
start_time = time.time()
tsne = TSNE(n_components=2, random_state=self.random_state, n_iter=300)
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
methods['t-SNE'] = {'time': tsne_time, 'deterministic': False, 'scalable': False}
# LDA
if len(np.unique(y_subset)) > 1:
start_time = time.time()
lda = LinearDiscriminantAnalysis(n_components=min(2, len(np.unique(y_subset))-1))
X_lda = lda.fit_transform(X_scaled, y_subset)
lda_time = time.time() - start_time
methods['LDA'] = {'time': lda_time, 'deterministic': True, 'scalable': True}
# Plot results
fig, axes = plt.subplots(1, len(methods), figsize=(5*len(methods), 5))
if len(methods) == 1:
axes = [axes]
transforms = {'PCA': X_pca, 't-SNE': X_tsne}
if 'LDA' in methods:
transforms['LDA'] = X_lda
for idx, (method_name, transform_data) in enumerate(transforms.items()):
scatter = axes[idx].scatter(transform_data[:, 0], transform_data[:, 1],
c=y_subset, cmap='tab10', alpha=0.7, s=50)
time_taken = methods[method_name]['time']
axes[idx].set_title(f'{method_name}\nTime: {time_taken:.2f}s', fontweight='bold')
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print timing comparison
print("\nPerformance Comparison:")
print("-" * 30)
for method, metrics in methods.items():
print(f"{method:8}: {metrics['time']:.3f}s")
return methods
# Create comprehensive comparison
comparison = DimensionalityReductionComparison()
print("\n" + "="*80)
print("COMPREHENSIVE METHOD COMPARISON")
print("="*80)
comparison_table = comparison.create_comparison_table()
comparison.method_selection_guide()
print("\n" + "="*60)
print("Performance vs Quality Analysis")
print("="*60)
performance_results = comparison.performance_vs_quality_analysis(X_digits, y_digits)
Best Practices and Guidelines
Key Recommendations
- Start with PCA for initial exploration and preprocessing (a minimal pipeline sketch follows this list)
- Use t-SNE/UMAP for visualization of complex patterns
- Try LDA when you have labeled data for classification
- Consider computational constraints for large datasets
- Validate results with downstream tasks
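Here is a minimal sketch of "PCA as preprocessing, validated on the downstream task." The component count and classifier are arbitrary choices; the point is that the whole pipeline, including the reduction step, is evaluated end to end with cross-validation.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_digits(return_X_y=True)
# Scale -> reduce -> classify, evaluated end to end with cross-validation
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=30, random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"5-fold accuracy with PCA preprocessing: {scores.mean():.3f} +/- {scores.std():.3f}")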
Common Pitfalls
- Over-interpreting t-SNE/UMAP visualizations
- Not standardizing data before PCA (demonstrated in the sketch after this list)
- Ignoring explained variance when choosing components
- Using wrong distance metrics for high-dimensional data
- Not validating dimensionality reduction effectiveness
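The standardization pitfall is easy to demonstrate: when one feature has a much larger scale than the others, unscaled PCA devotes its first component almost entirely to that feature. The sketch below uses the breast cancer dataset loaded earlier, whose feature scales span several orders of magnitude.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X, _ = load_breast_cancer(return_X_y=True)
# Without scaling, the largest-scale features dominate the first component
pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print(f"PC1 variance ratio without scaling: {pca_raw.explained_variance_ratio_[0]:.3f}")
print(f"PC1 variance ratio with scaling:    {pca_scaled.explained_variance_ratio_[0]:.3f}")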
Performance Summary
| Method | Speed | Memory | Scalability | Deterministic |
|---|---|---|---|---|
| PCA | Fast | Low | Excellent | Yes |
| t-SNE | Slow | High | Poor | No |
| UMAP | Medium | Medium | Good | No |
| LDA | Fast | Low | Good | Yes |
Conclusion
Dimensionality reduction is essential for modern machine learning. Key takeaways:
- PCA is your go-to for linear dimensionality reduction
- t-SNE and UMAP excel at revealing nonlinear structure
- LDA is perfect when you have labels and want class separation
- Always validate that reduction preserves important information
- Consider computational trade-offs for your specific use case
The right choice depends on your data characteristics, computational constraints, and downstream applications.
References
- Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
- Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Connect with me on LinkedIn or X to discuss dimensionality reduction techniques!