Python for Data Science and Machine Learning: A Comprehensive Guide
Data science and machine learning have become essential skills in modern software development. Python's rich ecosystem of libraries and frameworks makes it one of the most widely used languages for data analysis, visualization, and building machine learning models.
In this comprehensive guide, we'll explore how to use Python's most popular data science libraries and implement common machine learning algorithms.
Key Topics
- Data Analysis: NumPy and Pandas
- Data Visualization: Matplotlib and Seaborn
- Machine Learning: Scikit-learn
- Deep Learning: TensorFlow and Keras
- Model Deployment: Flask and FastAPI
1. Data Analysis with NumPy and Pandas
Master the fundamental libraries for data manipulation.
NumPy Basics
import numpy as np
# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 3))
arr3 = np.ones((2, 4))
arr4 = np.random.rand(3, 3)
# Array operations
print(arr1 * 2) # Element-wise multiplication
print(arr1.mean()) # Mean
print(arr1.std()) # Standard deviation
print(arr1.reshape(5,1)) # Reshape array
# Matrix operations
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
print(matrix1.dot(matrix2)) # Matrix multiplication
print(np.linalg.inv(matrix1)) # Matrix inverse
print(np.linalg.det(matrix1)) # Determinant
Pandas Data Analysis
import pandas as pd
# Reading data
df = pd.read_csv('data.csv')
# Basic operations
print(df.head()) # First 5 rows
print(df.describe()) # Statistical summary
print(df.info()) # DataFrame info
# Data cleaning (pick one missing-value strategy)
df = df.dropna()  # Option 1: drop rows with missing values
df = df.fillna(df.mean(numeric_only=True))  # Option 2: fill missing numeric values with the column mean
df = pd.get_dummies(df, columns=['category']) # One-hot encoding
# Data manipulation
# Group by and aggregate
grouped = df.groupby('category').agg({
    'price': ['mean', 'min', 'max'],
    'quantity': 'sum'
})
# Merging dataframes
df_merged = pd.merge(df1, df2, on='id', how='left')
# Time series analysis
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
monthly = df.resample('M').mean(numeric_only=True)  # Monthly averages of numeric columns
# Complex operations
def custom_function(x):
    return x.mean() if x.dtype == 'float64' else x.mode()[0]
result = df.groupby('category').agg(custom_function)
2. Data Visualization
Create insightful visualizations of your data.
Matplotlib Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Basic plotting
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', label='Data')
plt.title('Simple Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()
# Multiple subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(x1, y1, 'r-')
ax1.set_title('Plot 1')
ax2.scatter(x2, y2)
ax2.set_title('Plot 2')
plt.tight_layout()
plt.show()
Seaborn Visualization
# Set style
sns.set_style("whitegrid")
# Distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='value', hue='category', multiple="stack")
plt.title('Distribution by Category')
plt.show()
# Complex visualizations
# Pair plot
sns.pairplot(df, hue='category', diag_kind='kde')
plt.show()
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='category', y='value', data=df)
plt.title('Value Distribution by Category')
plt.xticks(rotation=45)
plt.show()
3. Machine Learning with Scikit-learn
Implement common machine learning algorithms.
Classification Example
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
# Prepare data
X = df.drop('target', axis=1)
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
# Evaluate model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
Regression Example
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Prepare data
X = df[['feature1', 'feature2']]
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Train model
model = LinearRegression()
model.fit(X_train_poly, y_train)
# Make predictions
y_pred = model.predict(X_test_poly)
# Evaluate model
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
4. Deep Learning with TensorFlow
Build and train neural networks.
Neural Network Implementation
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping
# Build model
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(num_features,)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Define callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping]
)
# Evaluate model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()
Convolutional Neural Network (CNN)
# Build CNN model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])
# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# Train model
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.2
)
5. Model Deployment
Deploy your models using web frameworks.
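The Flask and FastAPI examples below load a serialized model and scaler from disk, but the guide has not yet shown how those files are produced. A minimal, hedged sketch (assuming the trained model and scaler variables from the scikit-learn classification example above; the file names are illustrative and simply match what the APIs load):
import joblib
# Persist the trained estimator and the fitted scaler after training
# (adding a version suffix such as model_v1.pkl is a common practice
#  so deployed artifacts stay traceable)
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
Keeping the scaler and model in sync matters: predictions are only valid when incoming requests are transformed with the same scaler that was fitted on the training data.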
Flask API
from flask import Flask, request, jsonify
import pandas as pd
import joblib
app = Flask(__name__)
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')
@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        features = pd.DataFrame([data])
        # Preprocess
        features_scaled = scaler.transform(features)
        # Make prediction
        prediction = model.predict(features_scaled)
        return jsonify({
            'prediction': prediction.tolist(),
            'status': 'success'
        })
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

if __name__ == '__main__':
    app.run(debug=True)
FastAPI Implementation
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import numpy as np
import joblib
app = FastAPI()
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')
class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    probability: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)
        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()
        return PredictionResponse(
            prediction=float(prediction),
            probability=float(probability)
        )
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
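To show how the /predict endpoint defined above might be called, here is a hedged client sketch. It assumes the app is served locally with uvicorn on the default port (8000) and that the model expects four features; both details are assumptions for illustration, not part of the API code above.
# Start the API first, e.g.:  uvicorn main:app --reload
import requests

payload = {"features": [5.1, 3.5, 1.4, 0.2]}  # example feature values (assumed)
response = requests.post("http://127.0.0.1:8000/predict", json=payload)

if response.ok:
    result = response.json()
    print(f"Prediction: {result['prediction']}, probability: {result['probability']:.3f}")
else:
    print("Request failed:", response.status_code, response.text)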
Best Practices
- Data Preprocessing
  - Handle missing values appropriately
  - Scale features when needed
  - Split data properly
  - Validate assumptions
- Model Development
  - Start with simple models
  - Use cross-validation (see the sketch after this list)
  - Monitor for overfitting
  - Document your process
- Model Evaluation
  - Use appropriate metrics
  - Consider business impact
  - Validate on the test set
  - Monitor performance
- Deployment
  - Version your models
  - Monitor in production
  - Handle errors gracefully
  - Scale appropriately
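As a concrete illustration of the preprocessing and cross-validation practices above, here is a minimal sketch using scikit-learn's Pipeline. It assumes a feature matrix X and labels y like those in the classification example; the imputation strategy and baseline model are illustrative choices, not recommendations for every dataset.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Chain imputation, scaling, and a simple baseline model so that each
# cross-validation fold fits the preprocessing only on its training split
pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))
])

# 5-fold cross-validation gives a more reliable estimate than a single split
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
Putting the preprocessing inside the pipeline is the key design choice: it prevents information from the validation folds leaking into the scaler or imputer, a common source of overly optimistic scores.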
Conclusion
Python's data science and machine learning ecosystem provides powerful tools for analyzing data and building AI models. By mastering these libraries and following best practices, you can:
- Analyze complex datasets effectively
- Build accurate predictive models
- Deploy models to production
- Make data-driven decisions
Remember to start with the basics and gradually move to more complex techniques. Focus on understanding your data and choosing the right tools for your specific use case.