Dimensionality Reduction Techniques in Machine Learning: PCA, t-SNE, and UMAP Guide

AI-Generated Content Notice

Some code examples and technical explanations in this article were generated with AI assistance. The content has been reviewed for accuracy, but please test any code snippets in your development environment before using them.

Introduction

High-dimensional data is everywhere, from image pixels to gene expression profiles. Dimensionality reduction helps us visualize, understand, and process such data more efficiently. This guide covers three of the most widely used techniques: PCA for linear reduction, t-SNE for non-linear visualization, and UMAP for fast non-linear embedding that scales to large datasets.

Understanding these methods will help you tackle the curse of dimensionality and create meaningful visualizations of complex datasets.

Why Dimensionality Reduction?

Key Benefits:

  • Visualization: Plot high-dimensional data in 2D/3D
  • Faster computation: Fewer features = faster algorithms
  • Storage efficiency: Compress data while preserving information
  • Noise reduction: Focus on important patterns
  • Feature extraction: Discover hidden structures

Common challenges with high-dimensional data:

  • Curse of dimensionality (illustrated in the sketch after this list)
  • Visualization difficulty
  • Computational complexity
  • Overfitting risk
  • Memory constraints
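
To make the first challenge concrete, here is a minimal sketch (not from the original article) of distance concentration: as the number of dimensions grows, the gap between the nearest and farthest neighbor shrinks relative to the distances themselves, which degrades any method that relies on distance comparisons.

import numpy as np

rng = np.random.default_rng(42)

for dim in [2, 10, 100, 1000]:
    points = rng.random((500, dim))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative spread (max - min) / min: {spread:.3f}")

Running this should show the relative spread collapsing as dim grows, which is exactly why projecting onto a few informative dimensions helps.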

Principal Component Analysis (PCA)

Complete PCA Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_breast_cancer, make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')

class PCAAnalyzer:
    """Comprehensive PCA analysis and visualization"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
        self.pca_model: Optional[PCA] = None
        self.scaler: Optional[StandardScaler] = None
    
    def fit_pca(self, X: np.ndarray, n_components: Optional[int] = None) -> 'PCAAnalyzer':
        """Fit PCA model with optional standardization"""
        
        # Standardize data
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)
        
        # Fit PCA
        if n_components is None:
            n_components = min(X.shape[0], X.shape[1])
        
        self.pca_model = PCA(n_components=n_components, random_state=self.random_state)
        self.pca_model.fit(X_scaled)
        
        return self
    
    def transform(self, X: np.ndarray) -> np.ndarray:
        """Transform data using fitted PCA"""
        if self.pca_model is None or self.scaler is None:
            raise ValueError("Must fit PCA first")
        
        X_scaled = self.scaler.transform(X)
        return self.pca_model.transform(X_scaled)
    
    def explained_variance_analysis(self) -> Dict:
        """Analyze explained variance ratios"""
        if self.pca_model is None:
            raise ValueError("Must fit PCA first")
        
        explained_variance_ratio = self.pca_model.explained_variance_ratio_
        cumulative_variance = np.cumsum(explained_variance_ratio)
        
        # Find components for different thresholds
        thresholds = [0.8, 0.9, 0.95, 0.99]
        components_needed = {}
        
        for threshold in thresholds:
            # np.argmax on an all-False mask returns 0, which would wrongly
            # report 1 component, so only record thresholds actually reached
            mask = cumulative_variance >= threshold
            if mask.any():
                components_needed[threshold] = int(np.argmax(mask)) + 1
        
        return {
            'explained_variance_ratio': explained_variance_ratio,
            'cumulative_variance': cumulative_variance,
            'components_needed': components_needed
        }
    
    def plot_explained_variance(self, max_components: int = 20):
        """Plot explained variance analysis"""
        
        if self.pca_model is None:
            raise ValueError("Must fit PCA first")
        
        variance_data = self.explained_variance_analysis()
        
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Individual explained variance
        n_components = min(max_components, len(variance_data['explained_variance_ratio']))
        components = range(1, n_components + 1)
        
        ax1.bar(components, variance_data['explained_variance_ratio'][:n_components], 
                alpha=0.7, color='steelblue')
        ax1.set_xlabel('Principal Component', fontsize=12)
        ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
        ax1.set_title('Individual Component Variance', fontweight='bold')
        ax1.grid(True, alpha=0.3)
        
        # Cumulative explained variance
        ax2.plot(components, variance_data['cumulative_variance'][:n_components], 
                'o-', linewidth=3, markersize=6, color='red')
        
        # Add threshold lines
        thresholds = [0.8, 0.9, 0.95]
        colors = ['orange', 'green', 'purple']
        
        for threshold, color in zip(thresholds, colors):
            ax2.axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
                       label=f'{threshold*100:.0f}% variance')
            
            if threshold in variance_data['components_needed']:
                n_comp = variance_data['components_needed'][threshold]
                if n_comp <= n_components:
                    ax2.axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
                    ax2.text(n_comp + 0.5, threshold + 0.02, f'{n_comp} comp.', 
                           color=color, fontweight='bold')
        
        ax2.set_xlabel('Number of Components', fontsize=12)
        ax2.set_ylabel('Cumulative Explained Variance', fontsize=12)
        ax2.set_title('Cumulative Variance Explained', fontweight='bold')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
        
        # Print summary
        print("PCA Variance Analysis:")
        print("-" * 30)
        for threshold, n_comp in variance_data['components_needed'].items():
            print(f"{threshold*100:.0f}% variance: {n_comp} components")
    
    def plot_2d_projection(self, X: np.ndarray, y: Optional[np.ndarray] = None, 
                          feature_names: Optional[List[str]] = None):
        """Plot first two principal components"""
        
        if self.pca_model is None:
            raise ValueError("Must fit PCA first")
        
        X_pca = self.transform(X)
        
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        # Scatter plot
        if y is not None:
            scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, 
                                    cmap='viridis', alpha=0.7, s=50)
            plt.colorbar(scatter, ax=axes[0])
        else:
            axes[0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, s=50)
        
        explained_var = self.pca_model.explained_variance_ratio_
        axes[0].set_xlabel(f'PC1 ({explained_var[0]:.3f})', fontsize=12)
        axes[0].set_ylabel(f'PC2 ({explained_var[1]:.3f})', fontsize=12)
        axes[0].set_title('First Two Principal Components', fontweight='bold')
        axes[0].grid(True, alpha=0.3)
        
        # Component loadings
        if feature_names is not None and len(feature_names) <= 20:
            loadings = self.pca_model.components_[:2].T
            
            for i, feature in enumerate(feature_names):
                axes[1].arrow(0, 0, loadings[i, 0], loadings[i, 1], 
                            head_width=0.02, head_length=0.02, fc='red', ec='red')
                axes[1].text(loadings[i, 0]*1.1, loadings[i, 1]*1.1, feature, 
                           fontsize=10, ha='center', va='center')
            
            axes[1].set_xlabel(f'PC1 Loadings ({explained_var[0]:.3f})', fontsize=12)
            axes[1].set_ylabel(f'PC2 Loadings ({explained_var[1]:.3f})', fontsize=12)
            axes[1].set_title('Component Loadings', fontweight='bold')
            axes[1].grid(True, alpha=0.3)
            axes[1].set_xlim(-1.1, 1.1)
            axes[1].set_ylim(-1.1, 1.1)
        else:
            axes[1].text(0.5, 0.5, 'Too many features\nfor loading plot', 
                        ha='center', va='center', transform=axes[1].transAxes,
                        fontsize=14)
            axes[1].set_title('Component Loadings (Skipped)', fontweight='bold')
        
        plt.tight_layout()
        plt.show()

# Load and analyze digits dataset
print("=== PCA Analysis on Digits Dataset ===")
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

print(f"Original data shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Fit PCA
pca_analyzer = PCAAnalyzer()
pca_analyzer.fit_pca(X_digits, n_components=50)

# Analyze explained variance
pca_analyzer.plot_explained_variance(max_components=20)

# Visualize 2D projection
pca_analyzer.plot_2d_projection(X_digits, y_digits)
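
A side note not covered by the analyzer above: scikit-learn's PCA also accepts a float n_components, which keeps just enough components to reach that variance threshold, and inverse_transform maps reduced data back to the original feature space for a quick reconstruction-error check. A hedged sketch, assuming X_digits and the imports from the earlier blocks:

# Variance-threshold selection plus reconstruction error (sketch only)
X_scaled = StandardScaler().fit_transform(X_digits)

# Keep enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)

# Map back to the original 64 dimensions and measure the error
X_back = pca_95.inverse_transform(X_reduced)
mse = np.mean((X_scaled - X_back) ** 2)

print(f"Components kept for 95% variance: {pca_95.n_components_}")
print(f"Reconstruction MSE (scaled space): {mse:.4f}")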

PCA for Classification Performance

def pca_classification_analysis(X: np.ndarray, y: np.ndarray, 
                              max_components: int = 50) -> List[Dict]:
    """Analyze how PCA affects classification performance"""
    
    component_range = range(2, min(max_components + 1, X.shape[1]), 2)
    results = []
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for n_components in component_range:
        # Apply PCA
        pca = PCA(n_components=n_components, random_state=42)
        X_train_pca = pca.fit_transform(X_train_scaled)
        X_test_pca = pca.transform(X_test_scaled)
        
        # Train classifier
        clf = LogisticRegression(random_state=42, max_iter=1000)
        clf.fit(X_train_pca, y_train)
        
        # Evaluate
        train_acc = clf.score(X_train_pca, y_train)
        test_acc = clf.score(X_test_pca, y_test)
        
        # Store results
        results.append({
            'n_components': n_components,
            'train_accuracy': train_acc,
            'test_accuracy': test_acc,
            'explained_variance': pca.explained_variance_ratio_.sum()
        })
        
        print(f"Components: {n_components:2d}, "
              f"Test Acc: {test_acc:.4f}, "
              f"Variance: {pca.explained_variance_ratio_.sum():.3f}")
    
    return results

def plot_pca_performance_analysis(results: List[Dict]):
    """Plot PCA performance analysis"""
    
    n_components = [r['n_components'] for r in results]
    train_accs = [r['train_accuracy'] for r in results]
    test_accs = [r['test_accuracy'] for r in results]
    explained_vars = [r['explained_variance'] for r in results]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Accuracy vs components
    ax1.plot(n_components, train_accs, 'o-', linewidth=2, 
             markersize=6, label='Training Accuracy', color='blue')
    ax1.plot(n_components, test_accs, 's-', linewidth=2, 
             markersize=6, label='Test Accuracy', color='red')
    
    ax1.set_xlabel('Number of PCA Components', fontsize=12)
    ax1.set_ylabel('Accuracy', fontsize=12)
    ax1.set_title('Classification Performance vs PCA Components', fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Explained variance vs components
    ax2.plot(n_components, explained_vars, 'o-', linewidth=2, 
             markersize=6, color='green')
    ax2.set_xlabel('Number of PCA Components', fontsize=12)
    ax2.set_ylabel('Explained Variance Ratio', fontsize=12)
    ax2.set_title('Explained Variance vs Components', fontweight='bold')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Find optimal number of components
    best_test_idx = np.argmax([r['test_accuracy'] for r in results])
    best_result = results[best_test_idx]
    
    print(f"\nOptimal Configuration:")
    print(f"Components: {best_result['n_components']}")
    print(f"Test Accuracy: {best_result['test_accuracy']:.4f}")
    print(f"Explained Variance: {best_result['explained_variance']:.3f}")

# Run classification analysis
print("\n=== PCA Classification Performance Analysis ===")
pca_results = pca_classification_analysis(X_digits, y_digits, max_components=30)
plot_pca_performance_analysis(pca_results)
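
The sweep above evaluates a single train/test split. A cross-validated alternative (a sketch under the same assumptions, not part of the original analysis) is to tune n_components inside a scaler-PCA-classifier pipeline:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

# Illustrative grid; widen or refine it for your own data
grid = GridSearchCV(pipe, param_grid={'pca__n_components': [10, 20, 30, 40]},
                    cv=5, n_jobs=-1)
grid.fit(X_digits, y_digits)

print(f"Best n_components: {grid.best_params_['pca__n_components']}")
print(f"Cross-validated accuracy: {grid.best_score_:.4f}")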

t-SNE for Non-linear Visualization

from sklearn.manifold import TSNE
import time

class TSNEAnalyzer:
    """t-SNE analysis and visualization"""
    
    def __init__(self, random_state: int = 42):
        self.random_state = random_state
    
    def parameter_analysis(self, X: np.ndarray, y: np.ndarray, 
                          sample_size: int = 1000) -> Dict:
        """Analyze different t-SNE parameters"""
        
        # Sample data for faster analysis (seeded for reproducibility)
        if X.shape[0] > sample_size:
            rng = np.random.RandomState(self.random_state)
            indices = rng.choice(X.shape[0], sample_size, replace=False)
            X_sample = X[indices]
            y_sample = y[indices]
        else:
            X_sample, y_sample = X, y
        
        # Standardize
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_sample)
        
        # Different parameter combinations
        # (note: in scikit-learn >= 1.5, TSNE's n_iter is named max_iter)
        parameter_combinations = [
            {'perplexity': 30, 'learning_rate': 200, 'n_iter': 1000},
            {'perplexity': 50, 'learning_rate': 200, 'n_iter': 1000},
            {'perplexity': 30, 'learning_rate': 'auto', 'n_iter': 1000},
            {'perplexity': 30, 'learning_rate': 200, 'n_iter': 2000}
        ]
        
        results = {}
        
        for i, params in enumerate(parameter_combinations):
            print(f"Running t-SNE with parameters: {params}")
            
            start_time = time.time()
            
            tsne = TSNE(
                n_components=2,
                random_state=self.random_state,
                **params
            )
            
            X_tsne = tsne.fit_transform(X_scaled)
            
            runtime = time.time() - start_time
            
            results[f"Config_{i+1}"] = {
                'params': params,
                'embedding': X_tsne,
                'labels': y_sample,
                'runtime': runtime,
                'kl_divergence': tsne.kl_divergence_
            }
            
            print(f"  Runtime: {runtime:.2f}s, KL divergence: {tsne.kl_divergence_:.2f}")
        
        return results
    
    def plot_tsne_comparison(self, results: Dict):
        """Plot t-SNE results comparison"""
        
        n_configs = len(results)
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        axes = axes.flatten()
        
        for i, (config_name, data) in enumerate(results.items()):
            X_tsne = data['embedding']
            y_sample = data['labels']
            params = data['params']
            runtime = data['runtime']
            kl_div = data['kl_divergence']
            
            scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1], 
                                    c=y_sample, cmap='tab10', 
                                    alpha=0.7, s=30)
            
            # Title with parameters
            title = f"Perplexity: {params['perplexity']}, LR: {params['learning_rate']}\n"
            title += f"Runtime: {runtime:.1f}s, KL: {kl_div:.2f}"
            axes[i].set_title(title, fontsize=10, fontweight='bold')
            axes[i].set_xlabel('t-SNE 1', fontsize=10)
            axes[i].set_ylabel('t-SNE 2', fontsize=10)
            
            if i == 0:  # Add colorbar to first plot
                plt.colorbar(scatter, ax=axes[i])
        
        plt.tight_layout()
        plt.show()
    
    def perplexity_analysis(self, X: np.ndarray, y: np.ndarray, 
                           sample_size: int = 500) -> Dict:
        """Analyze impact of perplexity parameter"""
        
        # Sample data (seeded for reproducibility)
        if X.shape[0] > sample_size:
            rng = np.random.RandomState(self.random_state)
            indices = rng.choice(X.shape[0], sample_size, replace=False)
            X_sample = X[indices]
            y_sample = y[indices]
        else:
            X_sample, y_sample = X, y
        
        # Standardize
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_sample)
        
        perplexity_values = [5, 15, 30, 50, 100]
        results = {}
        
        for perplexity in perplexity_values:
            print(f"Testing perplexity: {perplexity}")
            
            # Adjust perplexity if too large for dataset
            actual_perplexity = min(perplexity, (X_sample.shape[0] - 1) // 3)
            
            tsne = TSNE(
                n_components=2,
                perplexity=actual_perplexity,
                random_state=self.random_state,
                n_iter=1000
            )
            
            X_tsne = tsne.fit_transform(X_scaled)
            
            results[actual_perplexity] = {
                'embedding': X_tsne,
                'labels': y_sample,
                'kl_divergence': tsne.kl_divergence_
            }
        
        return results
    
    def plot_perplexity_analysis(self, results: Dict):
        """Plot perplexity analysis results"""
        
        n_perplexities = len(results)
        fig, axes = plt.subplots(1, n_perplexities, figsize=(4*n_perplexities, 5))
        
        if n_perplexities == 1:
            axes = [axes]
        
        for i, (perplexity, data) in enumerate(results.items()):
            X_tsne = data['embedding']
            y_sample = data['labels']
            kl_div = data['kl_divergence']
            
            scatter = axes[i].scatter(X_tsne[:, 0], X_tsne[:, 1], 
                                    c=y_sample, cmap='tab10', 
                                    alpha=0.7, s=30)
            
            axes[i].set_title(f'Perplexity: {perplexity}\nKL: {kl_div:.2f}', 
                            fontweight='bold')
            axes[i].set_xlabel('t-SNE 1', fontsize=10)
            axes[i].set_ylabel('t-SNE 2', fontsize=10)
        
        plt.tight_layout()
        plt.show()

# Analyze t-SNE on digits dataset
print("\n=== t-SNE Analysis ===")
tsne_analyzer = TSNEAnalyzer()

# Parameter analysis
print("Comparing different t-SNE parameters...")
tsne_param_results = tsne_analyzer.parameter_analysis(X_digits, y_digits, sample_size=800)
tsne_analyzer.plot_tsne_comparison(tsne_param_results)

# Perplexity analysis
print("\nAnalyzing perplexity impact...")
perplexity_results = tsne_analyzer.perplexity_analysis(X_digits, y_digits, sample_size=600)
tsne_analyzer.plot_perplexity_analysis(perplexity_results)
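
KL divergence only measures t-SNE's own objective, so it cannot be compared across methods. A hedged, standalone sketch (the sample size here is illustrative) of a method-agnostic quality check is scikit-learn's trustworthiness score, which measures how well local neighborhoods survive the embedding (1.0 = perfectly preserved):

from sklearn.manifold import trustworthiness

X_small = StandardScaler().fit_transform(X_digits[:500])
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=42).fit_transform(X_small)

# Fraction of k-nearest-neighbor relationships preserved in 2D
score = trustworthiness(X_small, embedding, n_neighbors=5)
print(f"Trustworthiness (k=5): {score:.3f}")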

UMAP for Modern Dimensionality Reduction

try:
    import umap
    
    class UMAPAnalyzer:
        """UMAP analysis and comparison"""
        
        def __init__(self, random_state: int = 42):
            self.random_state = random_state
        
        def parameter_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
            """Analyze different UMAP parameters"""
            
            # Standardize
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            parameter_combinations = [
                {'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'euclidean'},
                {'n_neighbors': 30, 'min_dist': 0.1, 'metric': 'euclidean'},
                {'n_neighbors': 15, 'min_dist': 0.5, 'metric': 'euclidean'},
                {'n_neighbors': 15, 'min_dist': 0.1, 'metric': 'cosine'}
            ]
            
            results = {}
            
            for i, params in enumerate(parameter_combinations):
                print(f"Running UMAP with parameters: {params}")
                
                start_time = time.time()
                
                umap_model = umap.UMAP(
                    n_components=2,
                    random_state=self.random_state,
                    **params
                )
                
                X_umap = umap_model.fit_transform(X_scaled)
                
                runtime = time.time() - start_time
                
                results[f"Config_{i+1}"] = {
                    'params': params,
                    'embedding': X_umap,
                    'labels': y,
                    'runtime': runtime
                }
                
                print(f"  Runtime: {runtime:.2f}s")
            
            return results
        
        def plot_umap_comparison(self, results: Dict):
            """Plot UMAP results comparison"""
            
            fig, axes = plt.subplots(2, 2, figsize=(15, 12))
            axes = axes.flatten()
            
            for i, (config_name, data) in enumerate(results.items()):
                X_umap = data['embedding']
                y_labels = data['labels']
                params = data['params']
                runtime = data['runtime']
                
                scatter = axes[i].scatter(X_umap[:, 0], X_umap[:, 1], 
                                        c=y_labels, cmap='tab10', 
                                        alpha=0.7, s=30)
                
                # Title with parameters
                title = f"n_neighbors: {params['n_neighbors']}, min_dist: {params['min_dist']}\n"
                title += f"metric: {params['metric']}, Runtime: {runtime:.1f}s"
                axes[i].set_title(title, fontsize=10, fontweight='bold')
                axes[i].set_xlabel('UMAP 1', fontsize=10)
                axes[i].set_ylabel('UMAP 2', fontsize=10)
                
                if i == 0:  # Add colorbar to first plot
                    plt.colorbar(scatter, ax=axes[i])
            
            plt.tight_layout()
            plt.show()
        
        def compare_with_other_methods(self, X: np.ndarray, y: np.ndarray, 
                                     sample_size: int = 1000) -> Dict:
            """Compare UMAP with PCA and t-SNE"""
            
            # Sample data for fair comparison (seeded for reproducibility)
            if X.shape[0] > sample_size:
                rng = np.random.RandomState(self.random_state)
                indices = rng.choice(X.shape[0], sample_size, replace=False)
                X_sample = X[indices]
                y_sample = y[indices]
            else:
                X_sample, y_sample = X, y
            
            # Standardize
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X_sample)
            
            methods = {}
            
            # PCA
            print("Running PCA...")
            start_time = time.time()
            pca = PCA(n_components=2, random_state=self.random_state)
            X_pca = pca.fit_transform(X_scaled)
            pca_time = time.time() - start_time
            
            methods['PCA'] = {
                'embedding': X_pca,
                'runtime': pca_time,
                'explained_variance': pca.explained_variance_ratio_.sum()
            }
            
            # t-SNE
            print("Running t-SNE...")
            start_time = time.time()
            tsne = TSNE(n_components=2, random_state=self.random_state, 
                       perplexity=30, n_iter=1000)
            X_tsne = tsne.fit_transform(X_scaled)
            tsne_time = time.time() - start_time
            
            methods['t-SNE'] = {
                'embedding': X_tsne,
                'runtime': tsne_time,
                'kl_divergence': tsne.kl_divergence_
            }
            
            # UMAP
            print("Running UMAP...")
            start_time = time.time()
            umap_model = umap.UMAP(n_components=2, random_state=self.random_state,
                                  n_neighbors=15, min_dist=0.1)
            X_umap = umap_model.fit_transform(X_scaled)
            umap_time = time.time() - start_time
            
            methods['UMAP'] = {
                'embedding': X_umap,
                'runtime': umap_time
            }
            
            # Add labels to all methods
            for method_data in methods.values():
                method_data['labels'] = y_sample
            
            return methods
        
        def plot_method_comparison(self, methods: Dict):
            """Plot comparison of different methods"""
            
            fig, axes = plt.subplots(1, 3, figsize=(18, 5))
            
            for i, (method_name, data) in enumerate(methods.items()):
                embedding = data['embedding']
                labels = data['labels']
                runtime = data['runtime']
                
                scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1], 
                                        c=labels, cmap='tab10', alpha=0.7, s=30)
                
                title = f"{method_name}\nRuntime: {runtime:.2f}s"
                if method_name == 'PCA':
                    title += f"\nExplained Var: {data['explained_variance']:.3f}"
                elif method_name == 't-SNE':
                    title += f"\nKL Divergence: {data['kl_divergence']:.2f}"
                
                axes[i].set_title(title, fontweight='bold')
                axes[i].set_xlabel(f'{method_name} 1', fontsize=12)
                axes[i].set_ylabel(f'{method_name} 2', fontsize=12)
                
                if i == 0:  # Add colorbar to first plot
                    plt.colorbar(scatter, ax=axes[i])
            
            plt.tight_layout()
            plt.show()
            
            # Print runtime comparison
            print("\nRuntime Comparison:")
            print("-" * 20)
            for method, data in methods.items():
                print(f"{method:6}: {data['runtime']:.2f}s")
    
    # Run UMAP analysis
    print("\n=== UMAP Analysis ===")
    umap_analyzer = UMAPAnalyzer()
    
    # Parameter analysis
    print("Comparing different UMAP parameters...")
    umap_param_results = umap_analyzer.parameter_analysis(X_digits[:800], y_digits[:800])
    umap_analyzer.plot_umap_comparison(umap_param_results)
    
    # Method comparison
    print("\nComparing PCA, t-SNE, and UMAP...")
    method_comparison = umap_analyzer.compare_with_other_methods(X_digits, y_digits, sample_size=800)
    umap_analyzer.plot_method_comparison(method_comparison)

except ImportError:
    print("UMAP not installed. Install with: pip install umap-learn")
    print("Skipping UMAP analysis...")

Comprehensive Comparison

def comprehensive_dimensionality_reduction_analysis():
    """Complete comparison of dimensionality reduction techniques"""
    
    # Load different datasets
    datasets = {
        'Digits': load_digits(),
        'Breast Cancer': load_breast_cancer()
    }
    
    results_summary = []
    
    for dataset_name, dataset in datasets.items():
        print(f"\n=== Analysis on {dataset_name} Dataset ===")
        X, y = dataset.data, dataset.target
        
        print(f"Original shape: {X.shape}")
        print(f"Number of classes: {len(np.unique(y))}")
        
        # Sample for consistent comparison (seeded for reproducibility)
        if X.shape[0] > 500:
            indices = np.random.RandomState(42).choice(X.shape[0], 500, replace=False)
            X_sample = X[indices]
            y_sample = y[indices]
        else:
            X_sample, y_sample = X, y
        
        # Standardize
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_sample)
        
        # Apply methods
        methods_data = {}
        
        # PCA
        pca = PCA(n_components=2, random_state=42)
        start_time = time.time()
        X_pca = pca.fit_transform(X_scaled)
        pca_time = time.time() - start_time
        
        methods_data['PCA'] = {
            'embedding': X_pca,
            'runtime': pca_time,
            'variance_explained': pca.explained_variance_ratio_.sum()
        }
        
        # t-SNE
        tsne = TSNE(n_components=2, random_state=42, perplexity=30, n_iter=500)
        start_time = time.time()
        X_tsne = tsne.fit_transform(X_scaled)
        tsne_time = time.time() - start_time
        
        methods_data['t-SNE'] = {
            'embedding': X_tsne,
            'runtime': tsne_time,
            'kl_divergence': tsne.kl_divergence_
        }
        
        # Store results
        for method, data in methods_data.items():
            results_summary.append({
                'dataset': dataset_name,
                'method': method,
                'original_dims': X.shape[1],
                'n_samples': X_sample.shape[0],
                'runtime': data['runtime']
            })
        
        # Plot comparison for this dataset
        fig, axes = plt.subplots(1, 2, figsize=(12, 5))
        
        for i, (method, data) in enumerate(methods_data.items()):
            embedding = data['embedding']
            
            scatter = axes[i].scatter(embedding[:, 0], embedding[:, 1], 
                                    c=y_sample, cmap='tab10', alpha=0.7, s=30)
            
            title = f"{method} - {dataset_name}\nRuntime: {data['runtime']:.2f}s"
            if method == 'PCA':
                title += f"\nVar Explained: {data['variance_explained']:.3f}"
            elif method == 't-SNE':
                title += f"\nKL Div: {data['kl_divergence']:.2f}"
                
            axes[i].set_title(title, fontweight='bold')
            axes[i].set_xlabel(f'{method} 1')
            axes[i].set_ylabel(f'{method} 2')
            
            if i == 0:
                plt.colorbar(scatter, ax=axes[i])
        
        plt.tight_layout()
        plt.show()
    
    return results_summary

# Run comprehensive analysis
print("\n=== Comprehensive Dimensionality Reduction Analysis ===")
summary_results = comprehensive_dimensionality_reduction_analysis()

# Create summary table
summary_df = pd.DataFrame(summary_results)
print("\nSummary Results:")
print(summary_df.to_string(index=False))

When to Use Each Method

Method Selection Guide

Method | Best For                                | Pros                                      | Cons
-------|-----------------------------------------|-------------------------------------------|------------------------------------------------
PCA    | Linear relationships, feature reduction | Fast, interpretable, preserves variance   | Linear only, may miss non-linear patterns
t-SNE  | Visualization, clustering analysis      | Excellent clusters, non-linear            | Slow, non-deterministic, mainly local structure
UMAP   | General purpose, large datasets         | Fast, preserves local & global structure  | Newer, fewer theoretical guarantees

Key Recommendations

  1. Start with PCA for initial exploration and linear relationships
  2. Use t-SNE for beautiful visualizations and cluster discovery
  3. Choose UMAP for large datasets and balanced local/global structure
  4. Combine methods - use PCA for preprocessing, then t-SNE/UMAP (see the sketch after this list)
  5. Consider computational cost - PCA is fastest, t-SNE is slowest
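
A minimal sketch of recommendation 4, assuming the digits data and imports from the earlier blocks; the 50-component cutoff is a common heuristic, not a tuned value:

# Compress 64 features to 50 with PCA, then embed to 2D with t-SNE.
# PCA preprocessing denoises the input and speeds up t-SNE considerably.
X_scaled = StandardScaler().fit_transform(X_digits)

X_pca50 = PCA(n_components=50, random_state=42).fit_transform(X_scaled)
X_2d = TSNE(n_components=2, perplexity=30,
            random_state=42).fit_transform(X_pca50)

print(f"Pipeline: {X_digits.shape[1]} dims -> 50 (PCA) -> 2 (t-SNE)")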

Performance Guidelines

  • PCA: Use when you need interpretable components and fast computation
  • t-SNE: Perplexity = 5-50, higher for larger datasets
  • UMAP: n_neighbors = 10-50, min_dist = 0.1-0.5
  • Preprocessing: Always standardize features first
  • Sample size: Use sampling for t-SNE on large datasets (>10k samples)

Conclusion

Dimensionality reduction is essential for high-dimensional data analysis. Key takeaways:

  • PCA for linear reduction and fast feature extraction
  • t-SNE for non-linear visualization and cluster discovery
  • UMAP for modern, balanced dimensionality reduction
  • Parameter tuning significantly impacts results
  • Preprocessing and standardization are crucial
  • Method combination often works better than single approaches

Choose your method based on data characteristics, computational constraints, and analysis goals.

References

  1. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer.

  2. Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.

  3. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.


Connect with me on LinkedIn or X to discuss dimensionality reduction techniques!
