Dimensionality Reduction Techniques in Machine Learning: PCA, t-SNE, and UMAP Guide
AI-Generated Content Notice
Some code examples and technical explanations in this article were generated with AI assistance. The content has been reviewed for accuracy, but please test any code snippets in your development environment before using them.
Introduction
High-dimensional data is everywhere, from image pixels to gene expression profiles. Dimensionality reduction helps us visualize, understand, and process this data more efficiently. This guide covers three widely used techniques: PCA for linear reduction, t-SNE for non-linear visualization, and UMAP for fast non-linear reduction that balances local and global structure.
Understanding these methods will help you tackle the curse of dimensionality and create meaningful visualizations of complex datasets.
Why Dimensionality Reduction?
Key Benefits:
- Visualization: Plot high-dimensional data in 2D/3D
- Faster computation: Fewer features = faster algorithms
- Storage efficiency: Compress data while preserving information
- Noise reduction: Focus on important patterns
- Feature extraction: Discover hidden structures
Common challenges with high-dimensional data:
- Curse of dimensionality (see the sketch after this list)
- Visualization difficulty
- Computational complexity
- Overfitting risk
- Memory constraints
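To make the first challenge concrete, here is a minimal, self-contained sketch of distance concentration: as the number of dimensions grows, the nearest and farthest neighbours of a point end up at nearly the same distance, which is one reason distance-based methods degrade in high dimensions. The sample size and dimension values below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(42)
for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))                      # 500 random points in d dimensions
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point to all others
    relative_spread = (dists.max() - dists.min()) / dists.mean()
    print(f"dims={d:5d}  relative spread of distances: {relative_spread:.3f}")

The relative spread shrinks as the dimension grows, which is exactly the effect dimensionality reduction tries to counteract.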
Principal Component Analysis (PCA)
Complete PCA Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_breast_cancer, make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
class PCAAnalyzer:
"""Comprehensive PCA analysis and visualization"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
self.pca_model: Optional[PCA] = None
self.scaler: Optional[StandardScaler] = None
def fit_pca(self, X: np.ndarray, n_components: Optional[int] = None) -> 'PCAAnalyzer':
"""Fit PCA model with optional standardization"""
# Standardize data
self.scaler = StandardScaler()
X_scaled = self.scaler.fit_transform(X)
# Fit PCA
if n_components is None:
n_components = min(X.shape[0], X.shape[1])
self.pca_model = PCA(n_components=n_components, random_state=self.random_state)
self.pca_model.fit(X_scaled)
return self
def transform(self, X: np.ndarray) -> np.ndarray:
"""Transform data using fitted PCA"""
if self.pca_model is None or self.scaler is None:
raise ValueError("Must fit PCA first")
X_scaled = self.scaler.transform(X)
return self.pca_model.transform(X_scaled)
def explained_variance_analysis(self) -> Dict:
"""Analyze explained variance ratios"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
explained_variance_ratio = self.pca_model.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Find components for different thresholds
thresholds = [0.8, 0.9, 0.95, 0.99]
components_needed = {}
for threshold in thresholds:
n_components = np.argmax(cumulative_variance >= threshold) + 1
components_needed[threshold] = n_components
return {
'explained_variance_ratio': explained_variance_ratio,
'cumulative_variance': cumulative_variance,
'components_needed': components_needed
}
def plot_explained_variance(self, max_components: int = 20):
"""Plot explained variance analysis"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
variance_data = self.explained_variance_analysis()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Individual explained variance
n_components = min(max_components, len(variance_data['explained_variance_ratio']))
components = range(1, n_components + 1)
ax1.bar(components, variance_data['explained_variance_ratio'][:n_components],
alpha=0.7, color='steelblue')
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Individual Component Variance', fontweight='bold')
ax1.grid(True, alpha=0.3)
# Cumulative explained variance
ax2.plot(components, variance_data['cumulative_variance'][:n_components],
'o-', linewidth=3, markersize=6, color='red')
# Add threshold lines
thresholds = [0.8, 0.9, 0.95]
colors = ['orange', 'green', 'purple']
for threshold, color in zip(thresholds, colors):
ax2.axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
label=f'{threshold*100:.0f}% variance')
if threshold in variance_data['components_needed']:
n_comp = variance_data['components_needed'][threshold]
if n_comp <= n_components:
ax2.axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
ax2.text(n_comp + 0.5, threshold + 0.02, f'{n_comp} comp.',
color=color, fontweight='bold')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
ax2.set_title('Cumulative Variance Explained', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print summary
print("PCA Variance Analysis:")
print("-" * 30)
for threshold, n_comp in variance_data['components_needed'].items():
print(f"{threshold*100:.0f}% variance: {n_comp} components")
def plot_2d_projection(self, X: np.ndarray, y: Optional[np.ndarray] = None,
feature_names: Optional[List[str]] = None):
"""Plot first two principal components"""
if self.pca_model is None:
raise ValueError("Must fit PCA first")
X_pca = self.transform(X)
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Scatter plot
if y is not None:
scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y,
cmap='viridis', alpha=0.7, s=50)
plt.colorbar(scatter, ax=axes[0])
else:
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, s=50)
explained_var = self.pca_model.explained_variance_ratio_
axes[0].set_xlabel(f'PC1 ({explained_var[0]:.3f})', fontsize=12)
axes[0].set_ylabel(f'PC2 ({explained_var[1]:.3f})', fontsize=12)
axes[0].set_title('First Two Principal Components', fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Component loadings
if feature_names is not None and len(feature_names) <= 20:
loadings = self.pca_model.components_[:2].T
for i, feature in enumerate(feature_names):
axes[1].arrow(0, 0, loadings[i, 0], loadings[i, 1],
head_width=0.02, head_length=0.02, fc='red', ec='red')
axes[1].text(loadings[i, 0]*1.1, loadings[i, 1]*1.1, feature,
fontsize=10, ha='center', va='center')
axes[1].set_xlabel(f'PC1 Loadings ({explained_var[0]:.3f})', fontsize=12)
axes[1].set_ylabel(f'PC2 Loadings ({explained_var[1]:.3f})', fontsize=12)
axes[1].set_title('Component Loadings', fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(-1.1, 1.1)
axes[1].set_ylim(-1.1, 1.1)
else:
axes[1].text(0.5, 0.5, 'Too many features\nfor loading plot',
ha='center', va='center', transform=axes[1].transAxes,
fontsize=14)
axes[1].set_title('Component Loadings (Skipped)', fontweight='bold')
plt.tight_layout()
plt.show()
# Load and analyze digits dataset
print("=== PCA Analysis on Digits Dataset ===")
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
print(f"Original data shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")
# Fit PCA
pca_analyzer = PCAAnalyzer()
pca_analyzer.fit_pca(X_digits, n_components=50)
# Analyze explained variance
pca_analyzer.plot_explained_variance(max_components=20)
# Visualize 2D projection
pca_analyzer.plot_2d_projection(X_digits, y_digits)
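The compression and noise-reduction benefits listed earlier can be illustrated directly with PCA's inverse_transform: project the data onto a small number of components and map it back to the original space. This is a standalone sketch rather than part of PCAAnalyzer, and the choice of 16 components is arbitrary.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np

digits = load_digits()
X = digits.data                                    # shape (1797, 64), pixel intensities 0-16

pca_16 = PCA(n_components=16, random_state=42)
X_reduced = pca_16.fit_transform(X)                # 64 features compressed to 16
X_reconstructed = pca_16.inverse_transform(X_reduced)

mse = np.mean((X - X_reconstructed) ** 2)
print(f"Compression: {X.shape[1]} -> {X_reduced.shape[1]} features")
print(f"Variance retained: {pca_16.explained_variance_ratio_.sum():.3f}")
print(f"Reconstruction MSE: {mse:.3f}")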
PCA for Classification Performance
def pca_classification_analysis(X: np.ndarray, y: np.ndarray,
max_components: int = 50) -> List[Dict]:
"""Analyze how PCA affects classification performance"""
component_range = range(2, min(max_components + 1, X.shape[1]), 2)
results = []
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
for n_components in component_range:
# Apply PCA
pca = PCA(n_components=n_components, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# Train classifier
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train_pca, y_train)
# Evaluate
train_acc = clf.score(X_train_pca, y_train)
test_acc = clf.score(X_test_pca, y_test)
# Store results
results.append({
'n_components': n_components,
'train_accuracy': train_acc,
'test_accuracy': test_acc,
'explained_variance': pca.explained_variance_ratio_.sum()
})
print(f"Components: {n_components:2d}, "
f"Test Acc: {test_acc:.4f}, "
f"Variance: {pca.explained_variance_ratio_.sum():.3f}")
return results
def plot_pca_performance_analysis(results: List[Dict]):
"""Plot PCA performance analysis"""
n_components = [r['n_components'] for r in results]
train_accs = [r['train_accuracy'] for r in results]
test_accs = [r['test_accuracy'] for r in results]
explained_vars = [r['explained_variance'] for r in results]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Accuracy vs components
ax1.plot(n_components, train_accs, 'o-', linewidth=2,
markersize=6, label='Training Accuracy', color='blue')
ax1.plot(n_components, test_accs, 's-', linewidth=2,
markersize=6, label='Test Accuracy', color='red')
ax1.set_xlabel('Number of PCA Components', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Classification Performance vs PCA Components', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Explained variance vs components
ax2.plot(n_components, explained_vars, 'o-', linewidth=2,
markersize=6, color='green')
ax2.set_xlabel('Number of PCA Components', fontsize=12)
ax2.set_ylabel('Explained Variance Ratio', fontsize=12)
ax2.set_title('Explained Variance vs Components', fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Find optimal number of components
best_test_idx = np.argmax([r['test_accuracy'] for r in results])
best_result = results[best_test_idx]
print(f"\nOptimal Configuration:")
print(f"Components: {best_result['n_components']}")
print(f"Test Accuracy: {best_result['test_accuracy']:.4f}")
print(f"Explained Variance: {best_result['explained_variance']:.3f}")
# Run classification analysis
print("\n=== PCA Classification Performance Analysis ===")
pca_results = pca_classification_analysis(X_digits, y_digits, max_components=30)
plot_pca_performance_analysis(pca_results)
t-SNE for Non-linear Visualization
from sklearn.manifold import TSNE
import time
class TSNEAnalyzer:
"""t-SNE analysis and visualization"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def parameter_analysis(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 1000) -> Dict:
"""Analyze different t-SNE parameters"""
# Sample data for faster analysis
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
# Different parameter combinations
parameter_combinations = [
{'perplexity': 30, 'learning_rate': 200, 'n_iter': 1000},
{'perplexity': 50, 'learning_rate': 200, 'n_iter': 1000},
{'perplexity': 30, 'learning_rate': 'auto', 'n_iter': 1000},
{'perplexity': 30, 'learning_rate': 200, 'n_iter': 2000}
]
results = {}
for i, params in enumerate(parameter_combinations):
print(f"Running t-SNE with parameters: {params}")
start_time = time.time()
tsne = TSNE(
n_components=2,
random_state=self.random_state,
**params
)
X_tsne = tsne.fit_transform(X_scaled)
runtime = time.time() - start_time
results[f"Config_{i+1}"] = {
'params': params,
'embedding': X_tsne,
'labels': y_sample,
'runtime': runtime,
'kl_divergence': tsne.kl_divergence_
}
print(f" Runtime: {runtime:.2f}s, KL divergence: {tsne.kl_divergence_:.2f}")
return results
def plot_tsne_comparison(self, results: Dict):
"""Plot t-SNE results comparison"""
n_configs = len(results)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for i, (config_name, data) in enumerate(results.items()):
X_tsne = data['embedding']
y_sample = data['labels']
params = data['params']
runtime = data['runtime']
kl_div = data['kl_divergence']
scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_sample, cmap='tab10',
alpha=0.7, s=30)
# Title with parameters
title = f"Perplexity: {params['perplexity']}, LR: {params['learning_rate']}\n"
title += f"Runtime: {runtime:.1f}s, KL: {kl_div:.2f}"
axes[i].set_title(title, fontsize=10, fontweight='bold')
axes[i].set_xlabel('t-SNE 1', fontsize=10)
axes[i].set_ylabel('t-SNE 2', fontsize=10)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
def perplexity_analysis(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 500) -> Dict:
"""Analyze impact of perplexity parameter"""
# Sample data
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
perplexity_values = [5, 15, 30, 50, 100]
results = {}
for perplexity in perplexity_values:
print(f"Testing perplexity: {perplexity}")
# Adjust perplexity if too large for dataset
actual_perplexity = min(perplexity, (X_sample.shape[0] - 1) // 3)
tsne = TSNE(
n_components=2,
perplexity=actual_perplexity,
random_state=self.random_state,
n_iter=1000
)
X_tsne = tsne.fit_transform(X_scaled)
results[actual_perplexity] = {
'embedding': X_tsne,
'labels': y_sample,
'kl_divergence': tsne.kl_divergence_
}
return results
def plot_perplexity_analysis(self, results: Dict):
"""Plot perplexity analysis results"""
n_perplexities = len(results)
fig, axes = plt.subplots(1, n_perplexities, figsize=(4*n_perplexities, 5))
if n_perplexities == 1:
axes = [axes]
for i, (perplexity, data) in enumerate(results.items()):
X_tsne = data['embedding']
y_sample = data['labels']
kl_div = data['kl_divergence']
scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_sample, cmap='tab10',
alpha=0.7, s=30)
axes[i].set_title(f'Perplexity: {perplexity}\nKL: {kl_div:.2f}',
fontweight='bold')
axes[i].set_xlabel('t-SNE 1', fontsize=10)
axes[i].set_ylabel('t-SNE 2', fontsize=10)
plt.tight_layout()
plt.show()
# Analyze t-SNE on digits dataset
print("\n=== t-SNE Analysis ===")
tsne_analyzer = TSNEAnalyzer()
# Parameter analysis
print("Comparing different t-SNE parameters...")
tsne_param_results = tsne_analyzer.parameter_analysis(X_digits, y_digits, sample_size=800)
tsne_analyzer.plot_tsne_comparison(tsne_param_results)
# Perplexity analysis
print("\nAnalyzing perplexity impact...")
perplexity_results = tsne_analyzer.perplexity_analysis(X_digits, y_digits, sample_size=600)
tsne_analyzer.plot_perplexity_analysis(perplexity_results)
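Because t-SNE is stochastic, the resulting layout depends on the random seed and on how the embedding is initialized. scikit-learn's TSNE exposes an init parameter, and PCA initialization is usually more stable than random initialization. A short sketch (reusing np, X_digits, StandardScaler, and TSNE from the blocks above; the 600-point sample and the seed are arbitrary) compares the two:

# Compare random vs. PCA initialization for t-SNE on a small sample.
sample_idx = np.random.choice(X_digits.shape[0], 600, replace=False)
X_small = StandardScaler().fit_transform(X_digits[sample_idx])

for init in ['random', 'pca']:
    tsne = TSNE(n_components=2, perplexity=30, init=init, random_state=42)
    embedding = tsne.fit_transform(X_small)
    print(f"init={init:6s}  KL divergence: {tsne.kl_divergence_:.2f}")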
UMAP for Modern Dimensionality Reduction
try:
import umap
class UMAPAnalyzer:
"""UMAP analysis and comparison"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def parameter_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze different UMAP parameters"""
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
parameter_combinations = [
{'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'euclidean'},
{'n_neighbors': 30, 'min_dist': 0.1, 'metric': 'euclidean'},
{'n_neighbors': 15, 'min_dist': 0.5, 'metric': 'euclidean'},
{'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'cosine'}
]
results = {}
for i, params in enumerate(parameter_combinations):
print(f"Running UMAP with parameters: {params}")
start_time = time.time()
umap_model = umap.UMAP(
n_components=2,
random_state=self.random_state,
**params
)
X_umap = umap_model.fit_transform(X_scaled)
runtime = time.time() - start_time
results[f"Config_{i+1}"] = {
'params': params,
'embedding': X_umap,
'labels': y,
'runtime': runtime
}
print(f" Runtime: {runtime:.2f}s")
return results
def plot_umap_comparison(self, results: Dict):
"""Plot UMAP results comparison"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
for i, (config_name, data) in enumerate(results.items()):
X_umap = data['embedding']
y_labels = data['labels']
params = data['params']
runtime = data['runtime']
scatter = axes[i].scatter(X_umap[:, 0], X_umap[:, 1],
c=y_labels, cmap='tab10',
alpha=0.7, s=30)
# Title with parameters
title = f"n_neighbors: {params['n_neighbors']}, min_dist: {params['min_dist']}\n"
title += f"metric: {params['metric']}, Runtime: {runtime:.1f}s"
axes[i].set_title(title, fontsize=10, fontweight='bold')
axes[i].set_xlabel('UMAP 1', fontsize=10)
axes[i].set_ylabel('UMAP 2', fontsize=10)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
def compare_with_other_methods(self, X: np.ndarray, y: np.ndarray,
sample_size: int = 1000) -> Dict:
"""Compare UMAP with PCA and t-SNE"""
# Sample data for fair comparison
if X.shape[0] > sample_size:
indices = np.random.choice(X.shape[0], sample_size, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
methods = {}
# PCA
print("Running PCA...")
start_time = time.time()
pca = PCA(n_components=2, random_state=self.random_state)
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
methods['PCA'] = {
'embedding': X_pca,
'runtime': pca_time,
'explained_variance': pca.explained_variance_ratio_.sum()
}
# t-SNE
print("Running t-SNE...")
start_time = time.time()
tsne = TSNE(n_components=2, random_state=self.random_state,
perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
methods['t-SNE'] = {
'embedding': X_tsne,
'runtime': tsne_time,
'kl_divergence': tsne.kl_divergence_
}
# UMAP
print("Running UMAP...")
start_time = time.time()
umap_model = umap.UMAP(n_components=2, random_state=self.random_state,
n_neighbors=15, min_dist=0.1)
X_umap = umap_model.fit_transform(X_scaled)
umap_time = time.time() - start_time
methods['UMAP'] = {
'embedding': X_umap,
'runtime': umap_time
}
# Add labels to all methods
for method_data in methods.values():
method_data['labels'] = y_sample
return methods
def plot_method_comparison(self, methods: Dict):
"""Plot comparison of different methods"""
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, (method_name, data) in enumerate(methods.items()):
embedding = data['embedding']
labels = data['labels']
runtime = data['runtime']
scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1],
c=labels, cmap='tab10', alpha=0.7, s=30)
title = f"{method_name}\nRuntime: {runtime:.2f}s"
if method_name == 'PCA':
title += f"\nExplained Var: {data['explained_variance']:.3f}"
elif method_name == 't-SNE':
title += f"\nKL Divergence: {data['kl_divergence']:.2f}"
axes[i].set_title(title, fontweight='bold')
axes[i].set_xlabel(f'{method_name} 1', fontsize=12)
axes[i].set_ylabel(f'{method_name} 2', fontsize=12)
if i == 0: # Add colorbar to first plot
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
# Print runtime comparison
print("\nRuntime Comparison:")
print("-" * 20)
for method, data in methods.items():
print(f"{method:6}: {data['runtime']:.2f}s")
# Run UMAP analysis
print("\n=== UMAP Analysis ===")
umap_analyzer = UMAPAnalyzer()
# Parameter analysis
print("Comparing different UMAP parameters...")
umap_param_results = umap_analyzer.parameter_analysis(X_digits[:800], y_digits[:800])
umap_analyzer.plot_umap_comparison(umap_param_results)
# Method comparison
print("\nComparing PCA, t-SNE, and UMAP...")
method_comparison = umap_analyzer.compare_with_other_methods(X_digits, y_digits, sample_size=800)
umap_analyzer.plot_method_comparison(method_comparison)
except ImportError:
print("UMAP not installed. Install with: pip install umap-learn")
print("Skipping UMAP analysis...")
Comprehensive Comparison
def comprehensive_dimensionality_reduction_analysis():
"""Complete comparison of dimensionality reduction techniques"""
# Load different datasets
datasets = {
'Digits': load_digits(),
'Breast Cancer': load_breast_cancer()
}
results_summary = []
for dataset_name, dataset in datasets.items():
print(f"\n=== Analysis on {dataset_name} Dataset ===")
X, y = dataset.data, dataset.target
print(f"Original shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
# Sample for consistent comparison
if X.shape[0] > 500:
indices = np.random.choice(X.shape[0], 500, replace=False)
X_sample = X[indices]
y_sample = y[indices]
else:
X_sample, y_sample = X, y
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sample)
# Apply methods
methods_data = {}
# PCA
pca = PCA(n_components=2, random_state=42)
start_time = time.time()
X_pca = pca.fit_transform(X_scaled)
pca_time = time.time() - start_time
methods_data['PCA'] = {
'embedding': X_pca,
'runtime': pca_time,
'variance_explained': pca.explained_variance_ratio_.sum()
}
# t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=500)
start_time = time.time()
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.time() - start_time
methods_data['t-SNE'] = {
'embedding': X_tsne,
'runtime': tsne_time,
'kl_divergence': tsne.kl_divergence_
}
# Store results
for method, data in methods_data.items():
results_summary.append({
'dataset': dataset_name,
'method': method,
'original_dims': X.shape[1],
'n_samples': X_sample.shape[0],
'runtime': data['runtime']
})
# Plot comparison for this dataset
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for i, (method, data) in enumerate(methods_data.items()):
embedding = data['embedding']
scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1],
c=y_sample, cmap='tab10', alpha=0.7, s=30)
title = f"{method} - {dataset_name}\nRuntime: {data['runtime']:.2f}s"
if method == 'PCA':
title += f"\nVar Explained: {data['variance_explained']:.3f}"
elif method == 't-SNE':
title += f"\nKL Div: {data['kl_divergence']:.2f}"
axes[i].set_title(title, fontweight='bold')
axes[i].set_xlabel(f'{method} 1')
axes[i].set_ylabel(f'{method} 2')
if i == 0:
plt.colorbar(scatter, ax=axes[i])
plt.tight_layout()
plt.show()
return results_summary
# Run comprehensive analysis
print("\n=== Comprehensive Dimensionality Reduction Analysis ===")
summary_results = comprehensive_dimensionality_reduction_analysis()
# Create summary table
summary_df = pd.DataFrame(summary_results)
print("\nSummary Results:")
print(summary_df.to_string(index=False))
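Runtime and visual inspection tell only part of the story. A complementary, quantitative check is scikit-learn's trustworthiness score (sklearn.manifold.trustworthiness), which measures how well an embedding preserves local neighborhoods (1.0 means perfect preservation). A rough sketch, reusing np, X_digits, StandardScaler, PCA, and TSNE from the blocks above; the sample size and n_neighbors value are arbitrary illustrative choices:

from sklearn.manifold import trustworthiness

# Evaluate neighborhood preservation for 2D PCA and t-SNE embeddings.
idx = np.random.choice(X_digits.shape[0], 500, replace=False)
X_eval = StandardScaler().fit_transform(X_digits[idx])

emb_pca = PCA(n_components=2, random_state=42).fit_transform(X_eval)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_eval)

for name, emb in [('PCA', emb_pca), ('t-SNE', emb_tsne)]:
    score = trustworthiness(X_eval, emb, n_neighbors=10)
    print(f"{name:6s} trustworthiness: {score:.3f}")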
When to Use Each Method
Method Selection Guide
| Method | Best For | Pros | Cons |
|---|---|---|---|
| PCA | Linear relationships, feature reduction | Fast, interpretable, preserves variance | Linear only, may miss non-linear patterns |
| t-SNE | Visualization, cluster analysis | Excellent cluster separation, non-linear | Slow, non-deterministic, captures mainly local structure |
| UMAP | General purpose, large datasets | Fast, preserves local and global structure | Newer, fewer theoretical guarantees |
Key Recommendations
- Start with PCA for initial exploration and linear relationships
- Use t-SNE for beautiful visualizations and cluster discovery
- Choose UMAP for large datasets and balanced local/global structure
- Combine methods - use PCA for preprocessing, then t-SNE/UMAP (see the sketch after this list)
- Consider computational cost - PCA is fastest, t-SNE is slowest
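As noted in the recommendations above, a common combination is to run PCA first and then apply t-SNE (or UMAP) to the reduced data; this removes noise and cuts the cost of the non-linear step. A minimal sketch, reusing X_digits, StandardScaler, PCA, and TSNE from earlier; the 50-component cutoff is a conventional rule of thumb rather than a tuned value:

# PCA as a preprocessing step before t-SNE.
X_std = StandardScaler().fit_transform(X_digits)
X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_std)
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca50)
print(f"{X_digits.shape[1]} features -> 50 PCA components -> {X_embedded.shape[1]} t-SNE dimensions")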
Performance Guidelines
- PCA: Use when you need interpretable components and fast computation
- t-SNE: Perplexity = 5-50, higher for larger datasets
- UMAP: n_neighbors = 10-50, min_dist = 0.1-0.5
- Preprocessing: Always standardize features first
- Sample size: Use sampling for t-SNE on large datasets (>10k samples)
Conclusion
Dimensionality reduction is essential for high-dimensional data analysis. Key takeaways:
- PCA for linear reduction and fast feature extraction
- t-SNE for non-linear visualization and cluster discovery
- UMAP for fast, general-purpose reduction that balances local and global structure
- Parameter tuning significantly impacts results
- Preprocessing and standardization are crucial
- Method combination often works better than single approaches
Choose your method based on data characteristics, computational constraints, and analysis goals.
References
- Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
- Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Connect with me on LinkedIn or X to discuss dimensionality reduction techniques!