Feature Selection and Engineering in Machine Learning: Complete Guide with Python
AI-Generated Content Notice
Some code examples and technical explanations in this article were generated with AI assistance. The content has been reviewed for accuracy, but please test any code snippets in your development environment before using them.
Introduction
Feature selection and engineering are critical steps that can make or break your machine learning model. Good features lead to simpler models, better performance, and faster training. This guide covers practical techniques for selecting the most informative features and creating new ones that capture hidden patterns in your data.
Why Feature Selection Matters
Benefits:
- Better performance: Remove noise and irrelevant features
- Faster training: Fewer features = less computation
- Reduced overfitting: A lower-dimensional feature space reduces the risk of fitting noise
- Better interpretability: Focus on most important features
Common problems with too many features:
- Curse of dimensionality
- Increased computational cost
- Overfitting on noisy features (see the sketch after this list)
- Poor model interpretability
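To make the overfitting point concrete, here is a minimal standalone sketch (not part of the main analysis below) that appends pure-noise features to the breast cancer dataset and watches cross-validated accuracy degrade:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)
rng = np.random.default_rng(42)

for n_noise in [0, 100, 500, 2000]:
    # Stack n_noise columns of Gaussian noise onto the real features
    noise = rng.normal(size=(X_demo.shape[0], n_noise))
    X_aug = np.hstack([X_demo, noise])
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X_aug, y_demo, cv=5).mean()
    print(f"{n_noise:5d} noise features -> CV accuracy {score:.3f}")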
Feature Selection Techniques
1. Univariate Selection
Select features based on statistical tests between each feature and the target.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.feature_selection import (
SelectKBest, f_classif, chi2, mutual_info_classif,
RFE, SelectFromModel, VarianceThreshold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')  # keep demo output tidy; avoid blanket ignores in production
plt.style.use('seaborn-v0_8')
class FeatureSelectionAnalyzer:
"""Comprehensive feature selection analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def univariate_selection_analysis(self, X: np.ndarray, y: np.ndarray,
feature_names: List[str] = None) -> Dict:
"""Analyze different univariate selection methods"""
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
# Different scoring functions
scoring_functions = {
'f_classif': f_classif,
'chi2': chi2,
'mutual_info': mutual_info_classif
}
results = {}
for name, score_func in scoring_functions.items():
try:
                # chi2 requires non-negative features; np.abs is a quick
                # workaround (rescaling with MinMaxScaler is the more principled fix)
                X_input = X if name != 'chi2' else np.abs(X)
                selector = SelectKBest(score_func=score_func, k='all')
                selector.fit(X_input, y)
scores = selector.scores_
# Create feature ranking
feature_ranking = sorted(zip(feature_names, scores),
key=lambda x: x[1], reverse=True)
results[name] = {
'scores': scores,
'ranking': feature_ranking,
'selector': selector
}
except Exception as e:
print(f"Error with {name}: {str(e)}")
continue
return results
def plot_univariate_analysis(self, results: Dict, top_k: int = 15):
"""Plot univariate selection results"""
n_methods = len(results)
fig, axes = plt.subplots(1, n_methods, figsize=(5*n_methods, 6))
if n_methods == 1:
axes = [axes]
for idx, (method_name, data) in enumerate(results.items()):
# Get top k features
top_features = data['ranking'][:top_k]
feature_names = [item[0] for item in top_features]
scores = [item[1] for item in top_features]
# Create horizontal bar plot
y_pos = np.arange(len(feature_names))
bars = axes[idx].barh(y_pos, scores, alpha=0.7)
axes[idx].set_yticks(y_pos)
axes[idx].set_yticklabels(feature_names, fontsize=10)
axes[idx].set_xlabel('Score', fontsize=12)
axes[idx].set_title(f'{method_name.upper()} Scores', fontweight='bold')
axes[idx].grid(True, alpha=0.3)
# Color bars based on score
for bar, score in zip(bars, scores):
bar.set_color(plt.cm.viridis(score / max(scores)))
plt.tight_layout()
plt.show()
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
print("Dataset shape:", X.shape)
print("Features:", len(feature_names))
# Analyze univariate selection
analyzer = FeatureSelectionAnalyzer()
univariate_results = analyzer.univariate_selection_analysis(X, y, feature_names)
analyzer.plot_univariate_analysis(univariate_results, top_k=10)
# Print top features for each method
for method, data in univariate_results.items():
print(f"\nTop 5 features - {method.upper()}:")
for i, (feature, score) in enumerate(data['ranking'][:5]):
print(f" {i+1}. {feature}: {score:.3f}")
2. Recursive Feature Elimination (RFE)
Recursively eliminates features by training models on smaller feature sets.
class RFEAnalyzer:
"""Recursive Feature Elimination analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def rfe_analysis(self, X: np.ndarray, y: np.ndarray,
feature_names: List[str],
estimators: Dict = None) -> Dict:
"""Analyze RFE with different estimators"""
if estimators is None:
estimators = {
'LogisticRegression': LogisticRegression(random_state=self.random_state, max_iter=1000),
'RandomForest': RandomForestClassifier(n_estimators=50, random_state=self.random_state)
}
results = {}
for name, estimator in estimators.items():
print(f"Running RFE with {name}...")
# Test different numbers of features
n_features_range = range(5, min(21, len(feature_names)), 2)
rfe_scores = []
selected_features_list = []
for n_features in n_features_range:
# RFE
rfe = RFE(estimator=estimator, n_features_to_select=n_features)
rfe.fit(X, y)
# Cross-validation score
score = cross_val_score(rfe, X, y, cv=5, scoring='accuracy').mean()
rfe_scores.append(score)
# Selected features
selected_features = [feature_names[i] for i in range(len(feature_names))
if rfe.support_[i]]
selected_features_list.append(selected_features)
# Find optimal number of features
optimal_idx = np.argmax(rfe_scores)
optimal_n_features = list(n_features_range)[optimal_idx]
optimal_score = rfe_scores[optimal_idx]
results[name] = {
'n_features_range': list(n_features_range),
'scores': rfe_scores,
'optimal_n_features': optimal_n_features,
'optimal_score': optimal_score,
'selected_features_list': selected_features_list,
'optimal_features': selected_features_list[optimal_idx]
}
print(f" Optimal features: {optimal_n_features}, Score: {optimal_score:.4f}")
return results
def plot_rfe_analysis(self, results: Dict):
"""Plot RFE analysis results"""
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
# Plot 1: Performance vs number of features
for name, data in results.items():
axes[0].plot(data['n_features_range'], data['scores'],
'o-', linewidth=2, markersize=8, label=name)
# Mark optimal point
optimal_idx = np.argmax(data['scores'])
axes[0].scatter(data['n_features_range'][optimal_idx],
data['scores'][optimal_idx],
s=100, color='red', marker='*', zorder=5)
axes[0].set_xlabel('Number of Features', fontsize=12)
axes[0].set_ylabel('Cross-Validation Accuracy', fontsize=12)
axes[0].set_title('RFE: Performance vs Feature Count', fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot 2: Feature selection frequency
all_features = set()
for data in results.values():
for feature_list in data['selected_features_list']:
all_features.update(feature_list)
        # Frequency = share of all selection rounds in which a feature was kept
        feature_frequency = {}
        total_selections = sum(len(d['selected_features_list'])
                               for d in results.values())
        for feature in all_features:
            count = sum(1 for d in results.values()
                        for feature_list in d['selected_features_list']
                        if feature in feature_list)
            feature_frequency[feature] = count / total_selections
# Plot top features by selection frequency
sorted_features = sorted(feature_frequency.items(), key=lambda x: x[1], reverse=True)[:15]
features, frequencies = zip(*sorted_features)
bars = axes[1].barh(range(len(features)), frequencies, alpha=0.7)
axes[1].set_yticks(range(len(features)))
axes[1].set_yticklabels(features, fontsize=10)
axes[1].set_xlabel('Selection Frequency', fontsize=12)
axes[1].set_title('Feature Selection Frequency', fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Standardize features for RFE
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Run RFE analysis
rfe_analyzer = RFEAnalyzer()
rfe_results = rfe_analyzer.rfe_analysis(X_scaled, y, list(feature_names))
rfe_analyzer.plot_rfe_analysis(rfe_results)
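One caveat: plain RFE needs n_features_to_select up front. scikit-learn's RFECV chooses it by cross-validation instead; a quick sketch reusing X_scaled, y, and feature_names from above:
from sklearn.feature_selection import RFECV

# RFECV eliminates features one at a time and keeps the count that
# maximizes the cross-validated score
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000, random_state=42),
              step=1, cv=5, scoring='accuracy')
rfecv.fit(X_scaled, y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected:", [f for f, keep in zip(feature_names, rfecv.support_) if keep])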
3. Model-Based Selection
Use feature importance from tree-based models or coefficients from linear models.
class ModelBasedSelector:
"""Model-based feature selection"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def model_based_selection(self, X: np.ndarray, y: np.ndarray,
feature_names: List[str]) -> Dict:
"""Feature selection using model-based importance"""
models = {
'RandomForest': RandomForestClassifier(n_estimators=100, random_state=self.random_state),
'LogisticRegression': LogisticRegression(random_state=self.random_state, max_iter=1000)
}
results = {}
for name, model in models.items():
print(f"Analyzing {name} feature importance...")
# Fit model
model.fit(X, y)
# Get feature importance
if hasattr(model, 'feature_importances_'):
importance = model.feature_importances_
elif hasattr(model, 'coef_'):
importance = np.abs(model.coef_[0])
else:
continue
# Create feature ranking
feature_ranking = sorted(zip(feature_names, importance),
key=lambda x: x[1], reverse=True)
# Test different thresholds
thresholds = np.percentile(importance, [50, 70, 80, 90, 95])
threshold_results = []
for threshold in thresholds:
selector = SelectFromModel(model, threshold=threshold)
X_selected = selector.fit_transform(X, y)
                # Cross-validation score
                # Note: selecting on the full dataset before CV leaks a little
                # information; for a stricter estimate, wrap SelectFromModel in
                # a Pipeline so selection happens inside each fold
                score = cross_val_score(model, X_selected, y, cv=5, scoring='accuracy').mean()
n_features = X_selected.shape[1]
threshold_results.append({
'threshold': threshold,
'n_features': n_features,
'score': score
})
results[name] = {
'importance': importance,
'ranking': feature_ranking,
'threshold_results': threshold_results,
'model': model
}
return results
def plot_model_based_selection(self, results: Dict):
"""Plot model-based selection results"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
model_names = list(results.keys())
for idx, (name, data) in enumerate(results.items()):
# Plot 1: Feature importance
top_features = data['ranking'][:15]
features, importance = zip(*top_features)
y_pos = np.arange(len(features))
bars = axes[0, idx].barh(y_pos, importance, alpha=0.7)
axes[0, idx].set_yticks(y_pos)
axes[0, idx].set_yticklabels(features, fontsize=10)
axes[0, idx].set_xlabel('Feature Importance', fontsize=12)
axes[0, idx].set_title(f'{name} - Feature Importance', fontweight='bold')
axes[0, idx].grid(True, alpha=0.3)
# Plot 2: Threshold analysis
threshold_data = data['threshold_results']
thresholds = [item['threshold'] for item in threshold_data]
n_features = [item['n_features'] for item in threshold_data]
scores = [item['score'] for item in threshold_data]
ax_twin = axes[1, idx].twinx()
line1 = axes[1, idx].plot(thresholds, scores, 'b-o', linewidth=2,
markersize=8, label='Accuracy')
line2 = ax_twin.plot(thresholds, n_features, 'r-s', linewidth=2,
markersize=8, label='# Features')
axes[1, idx].set_xlabel('Importance Threshold', fontsize=12)
axes[1, idx].set_ylabel('Accuracy', fontsize=12, color='blue')
ax_twin.set_ylabel('Number of Features', fontsize=12, color='red')
axes[1, idx].set_title(f'{name} - Threshold Analysis', fontweight='bold')
# Combine legends
lines = line1 + line2
labels = [l.get_label() for l in lines]
axes[1, idx].legend(lines, labels, loc='best')
axes[1, idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Run model-based selection
model_selector = ModelBasedSelector()
model_results = model_selector.model_based_selection(X_scaled, y, list(feature_names))
model_selector.plot_model_based_selection(model_results)
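Sparsity-inducing penalties are another common model-based selector not covered above: an L1-penalized logistic regression drives the coefficients of weak features to exactly zero, and SelectFromModel keeps the rest. A sketch on the same scaled data (C=0.1 is an arbitrary regularization strength; tune it for your data):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# The L1 penalty zeroes out coefficients of uninformative features
l1_model = LogisticRegression(penalty='l1', solver='liblinear',
                              C=0.1, random_state=42)
l1_selector = SelectFromModel(l1_model)  # default threshold keeps nonzero coefficients
X_l1 = l1_selector.fit_transform(X_scaled, y)
print(f"L1 selection kept {X_l1.shape[1]} of {X_scaled.shape[1]} features")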
Feature Engineering Techniques
1. Polynomial and Interaction Features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
class FeatureEngineeringAnalyzer:
"""Feature engineering techniques analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def polynomial_features_analysis(self, X: np.ndarray, y: np.ndarray) -> Dict:
"""Analyze polynomial feature generation"""
degrees = [1, 2, 3]
results = {}
for degree in degrees:
print(f"Testing polynomial degree {degree}...")
# Create polynomial features
poly = PolynomialFeatures(degree=degree, include_bias=False)
# Pipeline with scaling
pipeline = Pipeline([
('poly', poly),
('scaler', StandardScaler()),
('classifier', LogisticRegression(random_state=self.random_state, max_iter=1000))
])
# Cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
# Fit to get number of features
X_poly = poly.fit_transform(X)
results[degree] = {
'n_features': X_poly.shape[1],
'cv_score': scores.mean(),
'cv_std': scores.std(),
'feature_names': poly.get_feature_names_out([f'f{i}' for i in range(X.shape[1])])
}
print(f" Features: {X_poly.shape[1]}, Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
return results
def plot_polynomial_analysis(self, results: Dict):
"""Plot polynomial features analysis"""
degrees = list(results.keys())
n_features = [results[d]['n_features'] for d in degrees]
cv_scores = [results[d]['cv_score'] for d in degrees]
cv_stds = [results[d]['cv_std'] for d in degrees]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot 1: Number of features vs degree
ax1.plot(degrees, n_features, 'o-', linewidth=3, markersize=8, color='blue')
ax1.set_xlabel('Polynomial Degree', fontsize=12)
ax1.set_ylabel('Number of Features', fontsize=12)
ax1.set_title('Feature Count vs Polynomial Degree', fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')
# Plot 2: Performance vs degree
ax2.errorbar(degrees, cv_scores, yerr=cv_stds, marker='o',
linewidth=3, markersize=8, capsize=5, color='red')
ax2.set_xlabel('Polynomial Degree', fontsize=12)
ax2.set_ylabel('Cross-Validation Accuracy', fontsize=12)
ax2.set_title('Performance vs Polynomial Degree', fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Use subset of features for polynomial analysis (to avoid explosion)
X_subset = X[:, :5] # Use first 5 features
# Analyze polynomial features
feat_eng_analyzer = FeatureEngineeringAnalyzer()
poly_results = feat_eng_analyzer.polynomial_features_analysis(X_subset, y)
feat_eng_analyzer.plot_polynomial_analysis(poly_results)
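Polynomial expansion is a brute-force approach; domain knowledge usually produces stronger features with far fewer columns. A small hypothetical sketch (the table and column names are invented purely for illustration):
import pandas as pd

# Hypothetical customer table; columns are invented for illustration
df = pd.DataFrame({
    'total_spend': [120.0, 340.0, 15.0],
    'n_orders': [3, 10, 1],
    'days_active': [30, 365, 7],
})

# Ratios and rates often capture behavior better than raw counts do
df['avg_order_value'] = df['total_spend'] / df['n_orders']
df['orders_per_month'] = df['n_orders'] / (df['days_active'] / 30.0)
df['spend_per_day'] = df['total_spend'] / df['days_active']
print(df)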
2. Dimensionality Reduction with PCA
from sklearn.decomposition import PCA
class DimensionalityReductionAnalyzer:
"""PCA and dimensionality reduction analyzer"""
def __init__(self, random_state: int = 42):
self.random_state = random_state
def pca_analysis(self, X: np.ndarray, y: np.ndarray,
max_components: int = None) -> Dict:
"""Comprehensive PCA analysis"""
if max_components is None:
max_components = min(20, X.shape[1])
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Explained variance analysis
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Find components for different variance thresholds
variance_thresholds = [0.8, 0.9, 0.95, 0.99]
components_for_threshold = {}
for threshold in variance_thresholds:
n_components = np.argmax(cumulative_variance >= threshold) + 1
components_for_threshold[threshold] = n_components
# Test different numbers of components
component_range = range(2, min(max_components + 1, len(explained_variance_ratio) + 1), 2)
pca_scores = []
for n_components in component_range:
# PCA pipeline
pca_pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=n_components, random_state=self.random_state)),
('classifier', LogisticRegression(random_state=self.random_state, max_iter=1000))
])
# Cross-validation
scores = cross_val_score(pca_pipeline, X, y, cv=5, scoring='accuracy')
pca_scores.append(scores.mean())
results = {
'explained_variance_ratio': explained_variance_ratio,
'cumulative_variance': cumulative_variance,
'components_for_threshold': components_for_threshold,
'component_range': list(component_range),
'pca_scores': pca_scores,
'pca_full': pca_full
}
return results
def plot_pca_analysis(self, results: Dict):
"""Plot PCA analysis results"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        # Plot 1: Explained variance ratio (cap at 20 bars for readability)
        n_plot = min(20, len(results['explained_variance_ratio']))
        axes[0, 0].bar(range(1, n_plot + 1),
                       results['explained_variance_ratio'][:n_plot], alpha=0.7)
axes[0, 0].set_xlabel('Principal Component', fontsize=12)
axes[0, 0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0, 0].set_title('Individual Component Variance', fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
# Plot 2: Cumulative explained variance
axes[0, 1].plot(range(1, len(results['cumulative_variance']) + 1),
results['cumulative_variance'], 'o-', linewidth=2, markersize=6)
# Add threshold lines
thresholds = [0.8, 0.9, 0.95, 0.99]
colors = ['red', 'orange', 'green', 'blue']
for threshold, color in zip(thresholds, colors):
axes[0, 1].axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
label=f'{threshold*100:.0f}% variance')
n_comp = results['components_for_threshold'][threshold]
axes[0, 1].axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
axes[0, 1].text(n_comp + 1, threshold + 0.01, f'{n_comp} comp.',
color=color, fontweight='bold')
axes[0, 1].set_xlabel('Number of Components', fontsize=12)
axes[0, 1].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[0, 1].set_title('Cumulative Variance Explained', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Plot 3: Performance vs components
axes[1, 0].plot(results['component_range'], results['pca_scores'],
'o-', linewidth=2, markersize=8, color='purple')
axes[1, 0].set_xlabel('Number of PCA Components', fontsize=12)
axes[1, 0].set_ylabel('Cross-Validation Accuracy', fontsize=12)
axes[1, 0].set_title('Performance vs PCA Components', fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
        # Plot 4: First two principal components (note: this reuses the
        # module-level X; pass the data in explicitly if reusing this class)
        pca_2d = PCA(n_components=2, random_state=42)
        X_pca = pca_2d.fit_transform(StandardScaler().fit_transform(X))
scatter = axes[1, 1].scatter(X_pca[:, 0], X_pca[:, 1], c=y,
cmap='viridis', alpha=0.6)
axes[1, 1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.3f})', fontsize=12)
axes[1, 1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.3f})', fontsize=12)
axes[1, 1].set_title('First Two Principal Components', fontweight='bold')
plt.colorbar(scatter, ax=axes[1, 1])
plt.tight_layout()
plt.show()
# Print summary
print("\nPCA Analysis Summary:")
for threshold in [0.9, 0.95]:
n_comp = results['components_for_threshold'][threshold]
print(f"Components for {threshold*100:.0f}% variance: {n_comp}")
# Run PCA analysis
pca_analyzer = DimensionalityReductionAnalyzer()
pca_results = pca_analyzer.pca_analysis(X, y, max_components=15)
pca_analyzer.plot_pca_analysis(pca_results)
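PCA's interpretability cost can be partly recovered by inspecting component loadings, since each component is just a weighted sum of the original features. A sketch using the fitted PCA stored in pca_results above:
# Each row of components_ holds the loadings of one principal component
pca_full = pca_results['pca_full']
for i in range(2):
    loadings = pca_full.components_[i]
    top = np.argsort(np.abs(loadings))[::-1][:5]
    print(f"PC{i+1} is driven most strongly by:")
    for j in top:
        print(f"  {feature_names[j]}: {loadings[j]:+.3f}")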
Comprehensive Feature Selection Comparison
def comprehensive_feature_selection_comparison():
"""Compare all feature selection methods"""
methods = {
'Original': None,
'Top 10 Univariate': SelectKBest(f_classif, k=10),
'RFE (10 features)': RFE(RandomForestClassifier(n_estimators=50, random_state=42),
n_features_to_select=10),
'Model-based (RF)': SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42)),
'PCA (10 components)': PCA(n_components=10, random_state=42),
        # Note: X_scaled has unit variance in every column, so this threshold
        # keeps all features here; apply variance filtering to raw data instead
        'Low Variance Filter': VarianceThreshold(threshold=0.01)
}
results = {}
for name, selector in methods.items():
print(f"Evaluating {name}...")
if name == 'Original':
X_selected = X_scaled
else:
if 'PCA' in name:
X_selected = selector.fit_transform(X_scaled)
else:
X_selected = selector.fit_transform(X_scaled, y)
# Evaluate with cross-validation
model = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(model, X_selected, y, cv=5, scoring='accuracy')
results[name] = {
'n_features': X_selected.shape[1],
'cv_score': scores.mean(),
'cv_std': scores.std(),
'scores': scores
}
print(f" Features: {X_selected.shape[1]}, Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
methods_list = list(results.keys())
n_features = [results[m]['n_features'] for m in methods_list]
cv_scores = [results[m]['cv_score'] for m in methods_list]
cv_stds = [results[m]['cv_std'] for m in methods_list]
# Plot 1: Feature count
bars1 = ax1.bar(range(len(methods_list)), n_features, alpha=0.7)
ax1.set_xlabel('Feature Selection Method', fontsize=12)
ax1.set_ylabel('Number of Features', fontsize=12)
ax1.set_title('Feature Count Comparison', fontweight='bold')
ax1.set_xticks(range(len(methods_list)))
ax1.set_xticklabels(methods_list, rotation=45, ha='right')
ax1.grid(True, alpha=0.3)
# Add value labels
for bar, value in zip(bars1, n_features):
ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{value}', ha='center', va='bottom', fontweight='bold')
# Plot 2: Performance comparison
bars2 = ax2.bar(range(len(methods_list)), cv_scores, yerr=cv_stds,
capsize=5, alpha=0.7, color='green')
ax2.set_xlabel('Feature Selection Method', fontsize=12)
ax2.set_ylabel('Cross-Validation Accuracy', fontsize=12)
ax2.set_title('Performance Comparison', fontweight='bold')
ax2.set_xticks(range(len(methods_list)))
ax2.set_xticklabels(methods_list, rotation=45, ha='right')
ax2.grid(True, alpha=0.3)
# Add value labels
for bar, score in zip(bars2, cv_scores):
ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
f'{score:.3f}', ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()
return results
print("\nComprehensive Feature Selection Comparison:")
comparison_results = comprehensive_feature_selection_comparison()
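Since the methods above rarely agree perfectly, one simple way to combine them is rank aggregation: rank every feature under each criterion and keep the features with the best average rank. A sketch combining three of the scorers used earlier (the choice of criteria and k=10 are arbitrary):
from scipy.stats import rankdata

# Average each feature's rank across three criteria
# (rankdata on the negated scores gives rank 1 to the best feature)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_scaled, y)
criteria = {
    'f_classif': f_classif(X_scaled, y)[0],
    'mutual_info': mutual_info_classif(X_scaled, y, random_state=42),
    'rf_importance': rf.feature_importances_,
}
avg_rank = np.mean([rankdata(-scores) for scores in criteria.values()], axis=0)
consensus = np.argsort(avg_rank)[:10]
print("Consensus top 10:", [feature_names[i] for i in consensus])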
Best Practices and Guidelines
Feature Selection Strategy
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Univariate | Quick filtering | Fast, simple | Ignores feature interactions |
| RFE | Model-specific selection | Considers interactions | Computationally expensive |
| Model-based | Models with built-in importance | Uses the model's own signal | Model-dependent |
| PCA | Highly correlated features | Reduces multicollinearity | Less interpretable |
Key Recommendations
- Start with variance filtering to remove constant features
- Use domain knowledge for feature engineering
- Try multiple methods and ensemble results
- Validate with cross-validation to avoid overfitting (a leakage-safe pipeline sketch follows this list)
- Consider computational cost vs. performance gain
- Keep interpretability in mind for business applications
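On the cross-validation point above: several snippets in this article select features on the full dataset before cross-validating, which leaks a little information and inflates scores. Wrapping the selector in a Pipeline makes CV refit it inside each fold; a sketch reusing X and y:
# Selection is refit inside every CV fold, so test folds stay unseen
leak_safe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])
scores = cross_val_score(leak_safe, X, y, cv=5, scoring='accuracy')
print(f"Leakage-safe accuracy: {scores.mean():.4f} ± {scores.std():.4f}")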
Performance Summary
On the right problems, feature selection can provide:
- Better performance on high-dimensional data (gains of 10-30% are sometimes quoted, but this depends heavily on the dataset)
- Faster training, roughly in proportion to the reduction in dimensionality
- Better model interpretability with fewer features
- Reduced overfitting, especially with small datasets
The exact numbers vary widely by dataset and model, so it is worth measuring on your own data.
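For example, a minimal timing sketch reusing X_scaled and y from above (the numbers will differ across machines and models):
import time

def time_fit(X_in, repeats=20):
    # Average wall-clock time of a single model fit
    model = LogisticRegression(max_iter=1000, random_state=42)
    start = time.perf_counter()
    for _ in range(repeats):
        model.fit(X_in, y)
    return (time.perf_counter() - start) / repeats

X_small = SelectKBest(f_classif, k=10).fit_transform(X_scaled, y)
print(f"All {X_scaled.shape[1]} features: {time_fit(X_scaled)*1000:.1f} ms/fit")
print(f"Top 10 features: {time_fit(X_small)*1000:.1f} ms/fit")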
Conclusion
Effective feature selection and engineering are crucial for building robust machine learning models. Key takeaways:
- Combine multiple methods for robust feature selection
- Use cross-validation to validate feature importance
- Balance performance vs. interpretability based on use case
- Domain knowledge often beats automated methods
- Start simple with univariate selection, then add complexity
Proper feature selection leads to simpler, faster, and more interpretable models while maintaining or improving performance.
References
- Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.
Connect with me on LinkedIn or X to discuss feature engineering strategies!