Fenil Sonani

Dimensionality Reduction Techniques: Complete Guide with Python

4 min read

AI-Generated Content Notice

Some code examples and technical explanations in this article were generated with AI assistance. The content has been reviewed for accuracy, but please test any code snippets in your development environment before using them.



Introduction

High-dimensional data is everywhere in machine learning - from image pixels to text features to sensor readings. Dimensionality reduction techniques help us visualize, understand, and process this data by reducing the number of features while preserving important information.

This comprehensive guide covers the most important dimensionality reduction techniques with practical Python implementations and real-world applications.

Why Dimensionality Reduction?

The Curse of Dimensionality

High-dimensional data creates several challenges:

  • Data requirements grow exponentially with the number of features
  • Distance metrics lose meaning (all points appear nearly equidistant; see the short demo after this list)
  • Visualization becomes difficult (we cannot plot more than three dimensions directly)
  • Computational cost increases dramatically
  • Overfitting becomes more likely
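
The distance problem is easy to demonstrate: as the number of dimensions grows, the gap between the nearest and farthest pairwise distances shrinks relative to the distances themselves. The snippet below is a minimal, self-contained sketch of this effect on random uniform data (pdist comes from SciPy, which scikit-learn already depends on; the exact numbers will vary between runs).

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(500, d))    # 500 random points in d dimensions
    distances = pdist(points)              # all pairwise Euclidean distances
    contrast = (distances.max() - distances.min()) / distances.min()
    print(f"d={d:4d}: (max - min) / min pairwise distance = {contrast:.2f}")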

Benefits of Dimensionality Reduction

  1. Visualization: Plot high-dimensional data in 2D/3D
  2. Storage: Reduce memory and disk requirements
  3. Speed: Faster training and inference
  4. Noise reduction: Filter out irrelevant features
  5. Feature engineering: Create meaningful representations

Principal Component Analysis (PCA)

Mathematical Foundation

PCA finds the orthogonal directions (principal components) along which the data varies the most. After centering the data, these directions are the eigenvectors of the covariance matrix: the first component solves max_w w^T S w subject to ||w|| = 1, where S is the covariance matrix, and each eigenvalue measures the variance captured along its eigenvector. The from-scratch implementation below follows exactly this recipe.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_breast_cancer, make_classification
from sklearn.decomposition import PCA, KernelPCA, FactorAnalysis, FastICA
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')

class PCAAnalyzer:
    """Comprehensive PCA analyzer"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def pca_from_scratch(self, X: np.ndarray, n_components: Optional[int] = None) -> Dict:
        """PCA implementation from scratch"""
        
        # Center the data
        X_centered = X - np.mean(X, axis=0)
        
        # Compute covariance matrix
        cov_matrix = np.cov(X_centered.T)
        
        # Compute eigenvalues and eigenvectors
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Sort by eigenvalues (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        # Select components
        if n_components is None:
            n_components = len(eigenvalues)
        
        selected_components = eigenvectors[:, :n_components]
        selected_eigenvalues = eigenvalues[:n_components]
        
        # Transform data
        X_transformed = np.dot(X_centered, selected_components)
        
        # Calculate explained variance ratio
        explained_variance_ratio = selected_eigenvalues / np.sum(eigenvalues)
        
        return {
            'components': selected_components,
            'eigenvalues': selected_eigenvalues,
            'explained_variance_ratio': explained_variance_ratio,
            'transformed_data': X_transformed,
            'mean': np.mean(X, axis=0)
        }
    
    def comprehensive_pca_analysis(self, X: np.ndarray, y: np.ndarray = None, 
                                 feature_names: List[str] = None) -> Dict:
        """Comprehensive PCA analysis"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Fit PCA with all components
        pca_full = PCA()
        pca_full.fit(X_scaled)
        
        # Explained variance analysis
        explained_variance_ratio = pca_full.explained_variance_ratio_
        cumulative_variance = np.cumsum(explained_variance_ratio)
        
        # Find components for different variance thresholds
        variance_thresholds = [0.8, 0.9, 0.95, 0.99]
        components_for_threshold = {}
        
        for threshold in variance_thresholds:
            n_components = np.argmax(cumulative_variance >= threshold) + 1
            components_for_threshold[threshold] = n_components
        
        # Compare with our implementation
        our_pca = self.pca_from_scratch(X_scaled, n_components=min(10, X_scaled.shape[1]))
        
        # Component loadings (if feature names provided)
        loadings = None
        if feature_names is not None:
            n_show = min(5, len(pca_full.components_))
            loadings = pd.DataFrame(
                pca_full.components_[:n_show].T,
                columns=[f'PC{i+1}' for i in range(n_show)],
                index=feature_names
            )
        
        return {
            'pca_full': pca_full,
            'explained_variance_ratio': explained_variance_ratio,
            'cumulative_variance': cumulative_variance,
            'components_for_threshold': components_for_threshold,
            'our_implementation': our_pca,
            'scaler': scaler,
            'loadings': loadings
        }
    
    def plot_pca_analysis(self, results: Dict, X: np.ndarray, y: np.ndarray = None):
        """Plot comprehensive PCA analysis"""
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # Plot 1: Explained variance ratio
        n_components = min(20, len(results['explained_variance_ratio']))
        axes[0, 0].bar(range(1, n_components + 1), results['explained_variance_ratio'][:n_components], 
                      alpha=0.7, color='steelblue')
        axes[0, 0].set_xlabel('Principal Component', fontsize=12)
        axes[0, 0].set_ylabel('Explained Variance Ratio', fontsize=12)
        axes[0, 0].set_title('Individual Component Variance', fontweight='bold')
        axes[0, 0].grid(True, alpha=0.3)
        
        # Plot 2: Cumulative explained variance
        axes[0, 1].plot(range(1, len(results['cumulative_variance']) + 1), 
                       results['cumulative_variance'], 'o-', linewidth=2, markersize=4)
        
        # Add threshold lines
        thresholds = [0.8, 0.9, 0.95, 0.99]
        colors = ['red', 'orange', 'green', 'blue']
        
        for threshold, color in zip(thresholds, colors):
            axes[0, 1].axhline(y=threshold, color=color, linestyle='--', alpha=0.7, 
                              label=f'{threshold*100:.0f}% variance')
            n_comp = results['components_for_threshold'][threshold]
            axes[0, 1].axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
            axes[0, 1].text(n_comp + 1, threshold + 0.01, f'{n_comp}', 
                           color=color, fontweight='bold')
        
        axes[0, 1].set_xlabel('Number of Components', fontsize=12)
        axes[0, 1].set_ylabel('Cumulative Explained Variance', fontsize=12)
        axes[0, 1].set_title('Cumulative Variance Explained', fontweight='bold')
        axes[0, 1].legend()
        axes[0, 1].grid(True, alpha=0.3)
        
        # Plot 3: Component loadings (if available)
        if results['loadings'] is not None:
            # Plot loadings heatmap
            sns.heatmap(results['loadings'].T, annot=True, cmap='RdBu_r', center=0,
                       ax=axes[1, 0], cbar_kws={'shrink': 0.8})
            axes[1, 0].set_title('Component Loadings', fontweight='bold')
            axes[1, 0].set_xlabel('Features')
            axes[1, 0].set_ylabel('Principal Components')
        else:
            axes[1, 0].text(0.5, 0.5, 'Feature names not provided\nfor loadings analysis', 
                           ha='center', va='center', transform=axes[1, 0].transAxes,
                           fontsize=12, bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgray"))
            axes[1, 0].set_title('Component Loadings', fontweight='bold')
        
        # Plot 4: 2D projection (if labels available)
        if y is not None:
            # Transform to 2D
            pca_2d = PCA(n_components=2, random_state=42)
            X_2d = pca_2d.fit_transform(results['scaler'].transform(X))
            
            scatter = axes[1, 1].scatter(X_2d[:, 0], X_2d[:, 1], c=y, 
                                        cmap='viridis', alpha=0.6, s=50)
            axes[1, 1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.3f})', fontsize=12)
            axes[1, 1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.3f})', fontsize=12)
            axes[1, 1].set_title('2D PCA Projection', fontweight='bold')
            plt.colorbar(scatter, ax=axes[1, 1])
        else:
            # Show comparison with our implementation
            our_var_ratio = results['our_implementation']['explained_variance_ratio']
            sklearn_var_ratio = results['explained_variance_ratio'][:len(our_var_ratio)]
            
            x_pos = np.arange(len(our_var_ratio))
            width = 0.35
            
            bars1 = axes[1, 1].bar(x_pos - width/2, our_var_ratio, width, 
                                  label='Our Implementation', alpha=0.7)
            bars2 = axes[1, 1].bar(x_pos + width/2, sklearn_var_ratio, width, 
                                  label='Scikit-learn', alpha=0.7)
            
            axes[1, 1].set_xlabel('Component', fontsize=12)
            axes[1, 1].set_ylabel('Explained Variance Ratio', fontsize=12)
            axes[1, 1].set_title('Implementation Comparison', fontweight='bold')
            axes[1, 1].set_xticks(x_pos)
            axes[1, 1].set_xticklabels([f'PC{i+1}' for i in range(len(our_var_ratio))])
            axes[1, 1].legend()
            axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print summary
        print("\nPCA Analysis Summary:")
        print("-" * 40)
        for threshold in [0.9, 0.95]:
            n_comp = results['components_for_threshold'][threshold]
            print(f"Components for {threshold*100:.0f}% variance: {n_comp}")

# Load datasets for analysis
print("Loading datasets...")

# Digits dataset
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Breast cancer dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target

# Analyze PCA on digits dataset
print("\n" + "="*60)
print("PCA Analysis on Digits Dataset")
print("="*60)

pca_analyzer = PCAAnalyzer()
digits_results = pca_analyzer.comprehensive_pca_analysis(X_digits, y_digits)
pca_analyzer.plot_pca_analysis(digits_results, X_digits, y_digits)

# Analyze PCA on cancer dataset with feature names
print("\n" + "="*60)
print("PCA Analysis on Breast Cancer Dataset")
print("="*60)

cancer_results = pca_analyzer.comprehensive_pca_analysis(X_cancer, y_cancer, 
                                                       list(cancer.feature_names))
pca_analyzer.plot_pca_analysis(cancer_results, X_cancer, y_cancer)
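
Beyond looking at explained variance, a quick way to check how much information a given number of components keeps is to reconstruct the data with inverse_transform and measure the error. The loop below is a minimal sketch that reuses the X_digits array loaded above; the component counts and the resulting error values are illustrative only.

scaler_check = StandardScaler()
X_digits_scaled = scaler_check.fit_transform(X_digits)

for n_comp in [5, 10, 20, 40]:
    pca_check = PCA(n_components=n_comp, random_state=42)
    X_reduced = pca_check.fit_transform(X_digits_scaled)
    X_reconstructed = pca_check.inverse_transform(X_reduced)

    # Reconstruction error in the scaled feature space
    mse = np.mean((X_digits_scaled - X_reconstructed) ** 2)
    kept = pca_check.explained_variance_ratio_.sum()
    print(f"{n_comp:2d} components: {kept:.1%} variance kept, reconstruction MSE = {mse:.4f}")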

Advanced PCA Variants

Kernel PCA and Incremental PCA

class AdvancedPCAAnalyzer:
    """Advanced PCA techniques analyzer"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def kernel_pca_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Analyze different Kernel PCA variants"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Different kernels
        kernels = {
            'linear': {'kernel': 'linear'},
            'polynomial': {'kernel': 'poly', 'degree': 3},
            'rbf': {'kernel': 'rbf', 'gamma': 0.1},
            'sigmoid': {'kernel': 'sigmoid', 'gamma': 0.1}
        }
        
        results = {}
        
        for kernel_name, kernel_params in kernels.items():
            print(f"Analyzing {kernel_name} kernel...")
            
            try:
                # Fit Kernel PCA
                kpca = KernelPCA(n_components=2, random_state=self.random_state, 
                               **kernel_params)
                X_kpca = kpca.fit_transform(X_scaled)
                
                results[kernel_name] = {
                    'model': kpca,
                    'transformed_data': X_kpca,
                    'eigenvalues': kpca.eigenvalues_ if hasattr(kpca, 'eigenvalues_') else None
                }
                
            except Exception as e:
                print(f"Error with {kernel_name} kernel: {str(e)}")
                continue
        
        return results
    
    def plot_kernel_pca_comparison(self, results: Dict, y: np.ndarray):
        """Plot Kernel PCA comparison"""
        
        n_kernels = len(results)
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        for idx, (kernel_name, data) in enumerate(results.items()):
            if idx >= 4:  # Only plot first 4
                break
                
            X_transformed = data['transformed_data']
            
            scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1], 
                                      c=y, cmap='viridis', alpha=0.6, s=50)
            axes[idx].set_title(f'{kernel_name.capitalize()} Kernel PCA', fontweight='bold')
            axes[idx].set_xlabel('First Component')
            axes[idx].set_ylabel('Second Component')
            axes[idx].grid(True, alpha=0.3)
            plt.colorbar(scatter, ax=axes[idx])
        
        plt.tight_layout()
        plt.show()
    
    def incremental_pca_analysis(self, X: np.ndarray, batch_sizes: List[int] = None) -> Dict:
        """Analyze Incremental PCA for large datasets"""
        
        if batch_sizes is None:
            batch_sizes = [50, 100, 200, 500]
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Regular PCA for comparison
        pca_regular = PCA(n_components=10)
        pca_regular.fit(X_scaled)
        
        results = {'regular_pca': pca_regular}
        
        for batch_size in batch_sizes:
            if batch_size < X_scaled.shape[0]:
                print(f"Analyzing Incremental PCA with batch size {batch_size}...")
                
                from sklearn.decomposition import IncrementalPCA
                
                ipca = IncrementalPCA(n_components=10, batch_size=batch_size)
                ipca.fit(X_scaled)
                
                results[f'batch_{batch_size}'] = ipca
        
        return results
    
    def plot_incremental_pca_comparison(self, results: Dict):
        """Plot Incremental PCA comparison"""
        
        # Compare explained variance ratios
        methods = list(results.keys())
        n_components = 10
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Plot explained variance ratios
        for method, pca_model in results.items():
            if hasattr(pca_model, 'explained_variance_ratio_'):
                variance_ratio = pca_model.explained_variance_ratio_
                ax1.plot(range(1, len(variance_ratio) + 1), variance_ratio, 
                        'o-', linewidth=2, label=method.replace('_', ' ').title())
        
        ax1.set_xlabel('Component', fontsize=12)
        ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
        ax1.set_title('Explained Variance Comparison', fontweight='bold')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # Plot cumulative variance
        for method, pca_model in results.items():
            if hasattr(pca_model, 'explained_variance_ratio_'):
                cumulative_variance = np.cumsum(pca_model.explained_variance_ratio_)
                ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 
                        'o-', linewidth=2, label=method.replace('_', ' ').title())
        
        ax2.set_xlabel('Component', fontsize=12)
        ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
        ax2.set_title('Cumulative Variance Comparison', fontweight='bold')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Analyze advanced PCA techniques
advanced_pca = AdvancedPCAAnalyzer()

print("\n" + "="*60)
print("Kernel PCA Analysis")
print("="*60)

# Use a smaller subset for Kernel PCA (computationally expensive)
X_sample = X_digits[:500]
y_sample = y_digits[:500]

kernel_results = advanced_pca.kernel_pca_analysis(X_sample, y_sample)
advanced_pca.plot_kernel_pca_comparison(kernel_results, y_sample)

print("\n" + "="*60)
print("Incremental PCA Analysis")
print("="*60)

incremental_results = advanced_pca.incremental_pca_analysis(X_digits)
advanced_pca.plot_incremental_pca_comparison(incremental_results)

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE Implementation and Analysis

class TSNEAnalyzer:
    """t-SNE analyzer for nonlinear dimensionality reduction"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def tsne_parameter_analysis(self, X: np.ndarray, y: np.ndarray) -> Tuple[Dict, np.ndarray]:
        """Analyze t-SNE with different parameters"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Use subset for faster computation
        if X_scaled.shape[0] > 1000:
            indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
            X_subset = X_scaled[indices]
            y_subset = y[indices]
        else:
            X_subset = X_scaled
            y_subset = y
        
        # Different perplexity values
        perplexities = [5, 15, 30, 50]
        results = {}
        
        for perplexity in perplexities:
            print(f"Running t-SNE with perplexity={perplexity}...")
            
            tsne = TSNE(n_components=2, perplexity=perplexity, 
                       random_state=self.random_state, n_iter=1000)
            X_tsne = tsne.fit_transform(X_subset)
            
            results[f'perplexity_{perplexity}'] = {
                'transformed_data': X_tsne,
                'perplexity': perplexity,
                'kl_divergence': tsne.kl_divergence_
            }
        
        return results, y_subset
    
    def plot_tsne_analysis(self, results: Dict, y: np.ndarray):
        """Plot t-SNE analysis results"""
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        for idx, (param_name, data) in enumerate(results.items()):
            X_transformed = data['transformed_data']
            perplexity = data['perplexity']
            kl_div = data['kl_divergence']
            
            scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1], 
                                      c=y, cmap='tab10', alpha=0.7, s=50)
            axes[idx].set_title(f't-SNE (perplexity={perplexity})\nKL divergence: {kl_div:.2f}', 
                               fontweight='bold')
            axes[idx].set_xlabel('t-SNE 1')
            axes[idx].set_ylabel('t-SNE 2')
            axes[idx].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def compare_with_pca(self, X: np.ndarray, y: np.ndarray):
        """Compare t-SNE with PCA"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Use subset for faster computation
        if X_scaled.shape[0] > 1000:
            indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
            X_subset = X_scaled[indices]
            y_subset = y[indices]
        else:
            X_subset = X_scaled
            y_subset = y
        
        # PCA
        pca = PCA(n_components=2, random_state=self.random_state)
        X_pca = pca.fit_transform(X_subset)
        
        # t-SNE
        tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
        X_tsne = tsne.fit_transform(X_subset)
        
        # Plot comparison
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # PCA plot
        scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y_subset, 
                              cmap='tab10', alpha=0.7, s=50)
        ax1.set_title(f'PCA\nExplained Variance: {pca.explained_variance_ratio_.sum():.3f}', 
                     fontweight='bold')
        ax1.set_xlabel('PC1')
        ax1.set_ylabel('PC2')
        ax1.grid(True, alpha=0.3)
        plt.colorbar(scatter1, ax=ax1)
        
        # t-SNE plot
        scatter2 = ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, 
                              cmap='tab10', alpha=0.7, s=50)
        ax2.set_title(f't-SNE\nKL Divergence: {tsne.kl_divergence_:.2f}', fontweight='bold')
        ax2.set_xlabel('t-SNE 1')
        ax2.set_ylabel('t-SNE 2')
        ax2.grid(True, alpha=0.3)
        plt.colorbar(scatter2, ax=ax2)
        
        plt.tight_layout()
        plt.show()

# Analyze t-SNE
tsne_analyzer = TSNEAnalyzer()

print("\n" + "="*60)
print("t-SNE Parameter Analysis")
print("="*60)

tsne_results, y_tsne = tsne_analyzer.tsne_parameter_analysis(X_digits, y_digits)
tsne_analyzer.plot_tsne_analysis(tsne_results, y_tsne)

print("\n" + "="*60)
print("t-SNE vs PCA Comparison")
print("="*60)

tsne_analyzer.compare_with_pca(X_digits, y_digits)

UMAP (Uniform Manifold Approximation and Projection)

UMAP Analysis

class UMAPAnalyzer:
    """UMAP analyzer for fast nonlinear dimensionality reduction"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def umap_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Analyze UMAP with different parameters"""
        
        try:
            import umap
        except ImportError:
            print("UMAP not installed. Install with: pip install umap-learn")
            return self._simulate_umap_results(X, y)
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Different parameter combinations
        parameter_combinations = [
            {'n_neighbors': 15, 'min_dist': 0.1},
            {'n_neighbors': 15, 'min_dist': 0.5},
            {'n_neighbors': 50, 'min_dist': 0.1},
            {'n_neighbors': 50, 'min_dist': 0.5}
        ]
        
        results = {}
        
        for i, params in enumerate(parameter_combinations):
            print(f"Running UMAP with n_neighbors={params['n_neighbors']}, min_dist={params['min_dist']}...")
            
            reducer = umap.UMAP(n_components=2, random_state=self.random_state, **params)
            X_umap = reducer.fit_transform(X_scaled)
            
            param_key = f"n{params['n_neighbors']}_d{params['min_dist']}"
            results[param_key] = {
                'transformed_data': X_umap,
                'params': params,
                'model': reducer
            }
        
        return results
    
    def _simulate_umap_results(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Simulate UMAP results when package not available"""
        print("Simulating UMAP results (install umap-learn for real results)")
        
        # Use t-SNE as approximation
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        if X_scaled.shape[0] > 1000:
            indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
            X_subset = X_scaled[indices]
        else:
            X_subset = X_scaled
        
        tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
        X_sim = tsne.fit_transform(X_subset)
        
        # Add some noise to simulate different parameters
        results = {}
        for i, (n_neighbors, min_dist) in enumerate([(15, 0.1), (15, 0.5), (50, 0.1), (50, 0.5)]):
            noise_scale = 0.1 * (i + 1)
            X_noisy = X_sim + np.random.normal(0, noise_scale, X_sim.shape)
            
            param_key = f"n{n_neighbors}_d{min_dist}"
            results[param_key] = {
                'transformed_data': X_noisy,
                'params': {'n_neighbors': n_neighbors, 'min_dist': min_dist},
                'model': None
            }
        
        return results
    
    def plot_umap_analysis(self, results: Dict, y: np.ndarray):
        """Plot UMAP analysis results"""
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        for idx, (param_key, data) in enumerate(results.items()):
            X_transformed = data['transformed_data']
            params = data['params']
            
            # Handle potential size mismatch
            y_plot = y[:len(X_transformed)] if len(y) > len(X_transformed) else y
            
            scatter = axes[idx].scatter(X_transformed[:, 0], X_transformed[:, 1], 
                                      c=y_plot, cmap='tab10', alpha=0.7, s=50)
            
            title = f"UMAP\nn_neighbors={params['n_neighbors']}, min_dist={params['min_dist']}"
            axes[idx].set_title(title, fontweight='bold')
            axes[idx].set_xlabel('UMAP 1')
            axes[idx].set_ylabel('UMAP 2')
            axes[idx].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    def compare_methods(self, X: np.ndarray, y: np.ndarray):
        """Compare PCA, t-SNE, and UMAP"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Use subset for faster computation
        if X_scaled.shape[0] > 1000:
            indices = np.random.choice(X_scaled.shape[0], 1000, replace=False)
            X_subset = X_scaled[indices]
            y_subset = y[indices]
        else:
            X_subset = X_scaled
            y_subset = y
        
        # PCA
        pca = PCA(n_components=2, random_state=self.random_state)
        X_pca = pca.fit_transform(X_subset)
        
        # t-SNE
        tsne = TSNE(n_components=2, perplexity=30, random_state=self.random_state)
        X_tsne = tsne.fit_transform(X_subset)
        
        # UMAP (or simulation)
        try:
            import umap
            reducer = umap.UMAP(n_components=2, random_state=self.random_state)
            X_umap = reducer.fit_transform(X_subset)
            umap_available = True
        except ImportError:
            # Use t-SNE with different parameters as simulation
            tsne_sim = TSNE(n_components=2, perplexity=15, random_state=self.random_state + 1)
            X_umap = tsne_sim.fit_transform(X_subset)
            umap_available = False
        
        # Plot comparison
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))
        
        # PCA
        scatter1 = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_subset, 
                                  cmap='tab10', alpha=0.7, s=50)
        axes[0].set_title(f'PCA\nLinear Projection', fontweight='bold')
        axes[0].set_xlabel('PC1')
        axes[0].set_ylabel('PC2')
        axes[0].grid(True, alpha=0.3)
        
        # t-SNE
        scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, 
                                  cmap='tab10', alpha=0.7, s=50)
        axes[1].set_title('t-SNE\nNonlinear (Local)', fontweight='bold')
        axes[1].set_xlabel('t-SNE 1')
        axes[1].set_ylabel('t-SNE 2')
        axes[1].grid(True, alpha=0.3)
        
        # UMAP
        scatter3 = axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y_subset, 
                                  cmap='tab10', alpha=0.7, s=50)
        title = 'UMAP\nNonlinear (Global)' if umap_available else 'UMAP (Simulated)\nNonlinear'
        axes[2].set_title(title, fontweight='bold')
        axes[2].set_xlabel('UMAP 1')
        axes[2].set_ylabel('UMAP 2')
        axes[2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Analyze UMAP
umap_analyzer = UMAPAnalyzer()

print("\n" + "="*60)
print("UMAP Parameter Analysis")
print("="*60)

umap_results = umap_analyzer.umap_analysis(X_digits, y_digits)
umap_analyzer.plot_umap_analysis(umap_results, y_digits)

print("\n" + "="*60)
print("Method Comparison: PCA vs t-SNE vs UMAP")
print("="*60)

umap_analyzer.compare_methods(X_digits, y_digits)

Linear Discriminant Analysis (LDA)

LDA for Supervised Dimensionality Reduction

class LDAAnalyzer:
    """Linear Discriminant Analysis analyzer"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def lda_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Comprehensive LDA analysis"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # Fit LDA
        lda = LinearDiscriminantAnalysis()
        X_lda = lda.fit_transform(X_scaled, y)
        
        # Calculate explained variance ratio
        explained_variance_ratio = lda.explained_variance_ratio_
        
        return {
            'model': lda,
            'transformed_data': X_lda,
            'explained_variance_ratio': explained_variance_ratio,
            'classes': lda.classes_,
            'means': lda.means_,
            'scaler': scaler
        }
    
    def compare_lda_pca(self, X: np.ndarray, y: np.ndarray):
        """Compare LDA with PCA"""
        
        # Standardize data
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        # LDA (supervised)
        lda = LinearDiscriminantAnalysis(n_components=2)
        X_lda = lda.fit_transform(X_scaled, y)
        
        # PCA (unsupervised)
        pca = PCA(n_components=2, random_state=self.random_state)
        X_pca = pca.fit_transform(X_scaled)
        
        # Plot comparison
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # PCA plot
        scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, 
                              cmap='tab10', alpha=0.7, s=50)
        ax1.set_title(f'PCA (Unsupervised)\nExplained Variance: {pca.explained_variance_ratio_.sum():.3f}', 
                     fontweight='bold')
        ax1.set_xlabel('PC1')
        ax1.set_ylabel('PC2')
        ax1.grid(True, alpha=0.3)
        plt.colorbar(scatter1, ax=ax1)
        
        # LDA plot
        scatter2 = ax2.scatter(X_lda[:, 0], X_lda[:, 1], c=y, 
                              cmap='tab10', alpha=0.7, s=50)
        lda_variance = lda.explained_variance_ratio_.sum() if hasattr(lda, 'explained_variance_ratio_') else float('nan')
        ax2.set_title(f'LDA (Supervised)\nDiscriminant Variance Ratio: {lda_variance:.3f}', fontweight='bold')
        ax2.set_xlabel('LD1')
        ax2.set_ylabel('LD2')
        ax2.grid(True, alpha=0.3)
        plt.colorbar(scatter2, ax=ax2)
        
        plt.tight_layout()
        plt.show()
        
        return {'lda': lda, 'pca': pca}
    
    def evaluate_classification_performance(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Evaluate classification performance with different dimensionality reduction"""
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=self.random_state, stratify=y
        )
        
        # Standardize
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        methods = {}
        
        # Original features
        rf_original = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
        rf_original.fit(X_train_scaled, y_train)
        acc_original = accuracy_score(y_test, rf_original.predict(X_test_scaled))
        methods['Original'] = {'accuracy': acc_original, 'n_features': X_train_scaled.shape[1]}
        
        # PCA
        for n_comp in [10, 20, 50]:
            if n_comp < X_train_scaled.shape[1]:
                pca = PCA(n_components=n_comp, random_state=self.random_state)
                X_train_pca = pca.fit_transform(X_train_scaled)
                X_test_pca = pca.transform(X_test_scaled)
                
                rf_pca = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
                rf_pca.fit(X_train_pca, y_train)
                acc_pca = accuracy_score(y_test, rf_pca.predict(X_test_pca))
                methods[f'PCA_{n_comp}'] = {'accuracy': acc_pca, 'n_features': n_comp}
        
        # LDA
        n_lda_comp = min(len(np.unique(y)) - 1, X_train_scaled.shape[1])
        if n_lda_comp > 0:
            lda = LinearDiscriminantAnalysis(n_components=n_lda_comp)
            X_train_lda = lda.fit_transform(X_train_scaled, y_train)
            X_test_lda = lda.transform(X_test_scaled)
            
            rf_lda = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
            rf_lda.fit(X_train_lda, y_train)
            acc_lda = accuracy_score(y_test, rf_lda.predict(X_test_lda))
            methods['LDA'] = {'accuracy': acc_lda, 'n_features': n_lda_comp}
        
        return methods
    
    def plot_classification_performance(self, results: Dict):
        """Plot classification performance comparison"""
        
        methods = list(results.keys())
        accuracies = [results[method]['accuracy'] for method in methods]
        n_features = [results[method]['n_features'] for method in methods]
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Accuracy comparison
        bars1 = ax1.bar(methods, accuracies, alpha=0.7, color='steelblue')
        ax1.set_ylabel('Accuracy', fontsize=12)
        ax1.set_title('Classification Accuracy by Method', fontweight='bold')
        ax1.set_xticklabels(methods, rotation=45, ha='right')
        ax1.grid(True, alpha=0.3)
        
        # Add value labels
        for bar, acc in zip(bars1, accuracies):
            ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
                    f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
        
        # Feature count comparison
        bars2 = ax2.bar(methods, n_features, alpha=0.7, color='orange')
        ax2.set_ylabel('Number of Features', fontsize=12)
        ax2.set_title('Feature Count by Method', fontweight='bold')
        ax2.set_xticklabels(methods, rotation=45, ha='right')
        ax2.grid(True, alpha=0.3)
        ax2.set_yscale('log')
        
        # Add value labels
        for bar, n_feat in zip(bars2, n_features):
            ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() * 1.1, 
                    f'{n_feat}', ha='center', va='bottom', fontweight='bold')
        
        plt.tight_layout()
        plt.show()

# Analyze LDA
lda_analyzer = LDAAnalyzer()

print("\n" + "="*60)
print("LDA vs PCA Comparison")
print("="*60)

lda_pca_models = lda_analyzer.compare_lda_pca(X_digits, y_digits)

print("\n" + "="*60)
print("Classification Performance Comparison")
print("="*60)

classification_results = lda_analyzer.evaluate_classification_performance(X_digits, y_digits)
lda_analyzer.plot_classification_performance(classification_results)

# Print detailed results
print("\nDetailed Classification Results:")
print("-" * 50)
for method, metrics in classification_results.items():
    print(f"{method:12}: Accuracy={metrics['accuracy']:.4f}, Features={metrics['n_features']}")

Method Comparison and Guidelines

Comprehensive Comparison

class DimensionalityReductionComparison:
    """Comprehensive comparison of dimensionality reduction methods"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def create_comparison_table(self):
        """Create comparison table of methods"""
        
        comparison_data = {
            'Method': ['PCA', 'Kernel PCA', 't-SNE', 'UMAP', 'LDA', 'ICA', 'Factor Analysis'],
            'Type': ['Linear', 'Nonlinear', 'Nonlinear', 'Nonlinear', 'Linear', 'Linear', 'Linear'],
            'Supervised': ['No', 'No', 'No', 'No', 'Yes', 'No', 'No'],
            'Preserves': ['Global variance', 'Global structure', 'Local structure', 'Local+Global', 'Class separation', 'Independence', 'Latent factors'],
            'Best For': ['Visualization, preprocessing', 'Nonlinear patterns', 'Visualization', 'Fast visualization', 'Classification', 'Signal separation', 'Factor models'],
            'Scalability': ['Excellent', 'Poor', 'Poor', 'Good', 'Good', 'Good', 'Good'],
            'Deterministic': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No']
        }
        
        comparison_df = pd.DataFrame(comparison_data)
        
        # Display table
        print("\nDimensionality Reduction Methods Comparison:")
        print("=" * 80)
        print(comparison_df.to_string(index=False))
        
        return comparison_df
    
    def method_selection_guide(self):
        """Provide method selection guidelines"""
        
        guidelines = {
            "Data Visualization": {
                "Small datasets (<1000 samples)": "t-SNE",
                "Large datasets (>10000 samples)": "UMAP or PCA",
                "Need deterministic results": "PCA"
            },
            "Preprocessing for ML": {
                "Linear relationships": "PCA",
                "Nonlinear relationships": "Kernel PCA",
                "Classification tasks": "LDA",
                "Noise reduction": "PCA or Factor Analysis"
            },
            "Specific Applications": {
                "Image compression": "PCA",
                "Signal processing": "ICA",
                "Clustering": "PCA or UMAP",
                "Anomaly detection": "PCA or Kernel PCA"
            }
        }
        
        print("\n\nMethod Selection Guidelines:")
        print("=" * 50)
        
        for category, subcategories in guidelines.items():
            print(f"\n{category}:")
            for use_case, method in subcategories.items():
                print(f"  " {use_case}: {method}")
    
    def performance_vs_quality_analysis(self, X: np.ndarray, y: np.ndarray):
        """Analyze performance vs quality trade-offs"""
        
        import time
        
        # Use subset for consistent timing
        if X.shape[0] > 1000:
            indices = np.random.choice(X.shape[0], 1000, replace=False)
            X_subset = X[indices]
            y_subset = y[indices]
        else:
            X_subset = X
            y_subset = y
        
        # Standardize
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_subset)
        
        methods = {}
        
        # PCA
        start_time = time.time()
        pca = PCA(n_components=2, random_state=self.random_state)
        X_pca = pca.fit_transform(X_scaled)
        pca_time = time.time() - start_time
        methods['PCA'] = {'time': pca_time, 'deterministic': True, 'scalable': True}
        
        # t-SNE
        start_time = time.time()
        tsne = TSNE(n_components=2, random_state=self.random_state, n_iter=300)
        X_tsne = tsne.fit_transform(X_scaled)
        tsne_time = time.time() - start_time
        methods['t-SNE'] = {'time': tsne_time, 'deterministic': False, 'scalable': False}
        
        # LDA
        if len(np.unique(y_subset)) > 1:
            start_time = time.time()
            lda = LinearDiscriminantAnalysis(n_components=min(2, len(np.unique(y_subset))-1))
            X_lda = lda.fit_transform(X_scaled, y_subset)
            lda_time = time.time() - start_time
            methods['LDA'] = {'time': lda_time, 'deterministic': True, 'scalable': True}
        
        # Plot results
        fig, axes = plt.subplots(1, len(methods), figsize=(5*len(methods), 5))
        if len(methods) == 1:
            axes = [axes]
        
        transforms = {'PCA': X_pca, 't-SNE': X_tsne}
        if 'LDA' in methods:
            transforms['LDA'] = X_lda
        
        for idx, (method_name, transform_data) in enumerate(transforms.items()):
            scatter = axes[idx].scatter(transform_data[:, 0], transform_data[:, 1], 
                                      c=y_subset, cmap='tab10', alpha=0.7, s=50)
            
            time_taken = methods[method_name]['time']
            axes[idx].set_title(f'{method_name}\nTime: {time_taken:.2f}s', fontweight='bold')
            axes[idx].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print timing comparison
        print("\nPerformance Comparison:")
        print("-" * 30)
        for method, metrics in methods.items():
            print(f"{method:8}: {metrics['time']:.3f}s")
        
        return methods

# Create comprehensive comparison
comparison = DimensionalityReductionComparison()

print("\n" + "="*80)
print("COMPREHENSIVE METHOD COMPARISON")
print("="*80)

comparison_table = comparison.create_comparison_table()
comparison.method_selection_guide()

print("\n" + "="*60)
print("Performance vs Quality Analysis")
print("="*60)

performance_results = comparison.performance_vs_quality_analysis(X_digits, y_digits)

Best Practices and Guidelines

Key Recommendations

  1. Start with PCA for initial exploration and preprocessing
  2. Use t-SNE/UMAP for visualization of complex patterns
  3. Try LDA when you have labeled data for classification
  4. Consider computational constraints for large datasets
  5. Validate results with downstream tasks (see the sketch after this list)
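
Recommendations 1 and 5 combine naturally: wrapping the reducer in a scikit-learn Pipeline keeps the PCA fit confined to the training folds, and cross-validation then shows whether the reduction hurts downstream accuracy. The snippet below is a minimal sketch that reuses X_digits and y_digits from earlier; the classifier choice and the 95% variance target are illustrative, not prescriptive.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

pca_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),            # keep enough components for 95% variance
    ('clf', LogisticRegression(max_iter=1000))
])

scores = cross_val_score(pca_pipeline, X_digits, y_digits, cv=5)
print(f"5-fold accuracy with PCA preprocessing: {scores.mean():.3f} +/- {scores.std():.3f}")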

Common Pitfalls

  • Over-interpreting t-SNE/UMAP visualizations
  • Not standardizing data before PCA (demonstrated below)
  • Ignoring explained variance when choosing components
  • Using wrong distance metrics for high-dimensional data
  • Not validating dimensionality reduction effectiveness
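
The standardization pitfall is worth seeing once. When features live on very different scales, the first component simply tracks whichever features have the largest raw variance. Below is a minimal sketch using the breast cancer data loaded earlier, whose feature scales differ by several orders of magnitude; the gap it prints is typically dramatic.

# Without scaling, PC1 is dominated by the largest-scale features;
# with scaling, the variance spreads across many informative directions.
pca_raw = PCA(n_components=2).fit(X_cancer)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X_cancer))

print(f"Raw data    - PC1 explained variance ratio: {pca_raw.explained_variance_ratio_[0]:.3f}")
print(f"Scaled data - PC1 explained variance ratio: {pca_scaled.explained_variance_ratio_[0]:.3f}")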

Performance Summary

Method    Speed     Memory    Scalability    Deterministic
PCA       Fast      Low       Excellent      Yes
t-SNE     Slow      High      Poor           No
UMAP      Medium    Medium    Good           No
LDA       Fast      Low       Good           Yes

Conclusion

Dimensionality reduction is essential for modern machine learning. Key takeaways:

  • PCA is your go-to for linear dimensionality reduction
  • t-SNE and UMAP excel at revealing nonlinear structure
  • LDA is perfect when you have labels and want class separation
  • Always validate that reduction preserves important information
  • Consider computational trade-offs for your specific use case

The right choice depends on your data characteristics, computational constraints, and downstream applications.

References

  1. Jolliffe, I. T. (2002). Principal component analysis. Springer.

  2. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research.

  3. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint.


Connect with me on LinkedIn or X to discuss dimensionality reduction techniques!
