Skip to main content

Machine Learning Basics Tutorial

Introduction

Machine Learning (ML) is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed. This tutorial covers the fundamental concepts, algorithms, and practical implementation of machine learning techniques.

What is Machine Learning?

Machine learning is the process of training algorithms to recognize patterns in data and make predictions or decisions based on that data. Instead of programming explicit rules, we feed examples to algorithms that learn to generalize from the data.

Types of Machine Learning

  1. Supervised Learning: Learning with labeled examples
  2. Unsupervised Learning: Finding patterns in data without labels
  3. Reinforcement Learning: Learning through interaction and feedback
  4. Semi-supervised Learning: Combining labeled and unlabeled data

Machine Learning Workflow

1. Problem Definition

  • Define the business problem
  • Determine if it's a classification, regression, or clustering problem
  • Set success metrics

2. Data Collection and Exploration

  • Gather relevant datasets
  • Perform exploratory data analysis
  • Understand data quality and distributions

3. Data Preprocessing

  • Clean and prepare data
  • Handle missing values
  • Feature engineering and selection

4. Model Selection and Training

  • Choose appropriate algorithms
  • Split data into training and testing sets
  • Train models and tune hyperparameters

5. Model Evaluation

  • Assess model performance
  • Use appropriate metrics
  • Validate on unseen data

6. Model Deployment

  • Deploy model to production
  • Monitor performance
  • Retrain as needed

Python Machine Learning Stack

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn components
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Supervised Learning Algorithms
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier

# Unsupervised Learning
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, classification_report

Supervised Learning

Classification Problems

# Binary Classification Example: Predicting customer churn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load and prepare data
def prepare_classification_data():
# Sample data creation (replace with real data loading)
np.random.seed(42)
n_samples = 1000

data = {
'age': np.random.normal(40, 12, n_samples),
'income': np.random.normal(50000, 15000, n_samples),
'tenure': np.random.randint(1, 60, n_samples),
'monthly_charges': np.random.normal(70, 20, n_samples),
'total_charges': np.random.normal(2000, 1000, n_samples),
'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
'churn': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
}

return pd.DataFrame(data)

# Load data
df = prepare_classification_data()

# Feature engineering
def preprocess_features(df):
# Handle categorical variables
le = LabelEncoder()
df['contract_type_encoded'] = le.fit_transform(df['contract_type'])

# Create derived features
df['charges_per_tenure'] = df['total_charges'] / (df['tenure'] + 1)
df['income_category'] = pd.cut(df['income'], bins=3, labels=['Low', 'Medium', 'High'])
df['income_category_encoded'] = le.fit_transform(df['income_category'])

return df

df = preprocess_features(df)

# Prepare features and target
feature_columns = ['age', 'income', 'tenure', 'monthly_charges',
'charges_per_tenure', 'contract_type_encoded', 'income_category_encoded']
X = df[feature_columns]
y = df['churn']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42)
}

results = {}

for name, model in models.items():
# Train model
if name == 'Logistic Regression':
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
else:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

results[name] = {
'accuracy': accuracy,
'precision': precision,
'recall': recall,
'f1': f1
}

print(f"\n{name} Results:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-score: {f1:.3f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualize results
results_df = pd.DataFrame(results).T
results_df.plot(kind='bar', figsize=(12, 6))
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Regression Problems

# Regression Example: Predicting house prices
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def prepare_regression_data():
# Sample house data
np.random.seed(42)
n_samples = 1000

data = {
'square_feet': np.random.normal(2000, 500, n_samples),
'bedrooms': np.random.randint(1, 6, n_samples),
'bathrooms': np.random.randint(1, 4, n_samples),
'age': np.random.randint(0, 50, n_samples),
'garage': np.random.choice([0, 1, 2], n_samples),
'location_score': np.random.uniform(1, 10, n_samples),
}

# Create target variable with some realistic relationships
price = (
data['square_feet'] * 100 +
data['bedrooms'] * 5000 +
data['bathrooms'] * 8000 +
(50 - data['age']) * 1000 +
data['garage'] * 10000 +
data['location_score'] * 15000 +
np.random.normal(0, 20000, n_samples) # Add noise
)

data['price'] = np.clip(price, 50000, 800000) # Reasonable price range

return pd.DataFrame(data)

# Load and prepare regression data
house_df = prepare_regression_data()

# Feature engineering for regression
def engineer_regression_features(df):
df['price_per_sqft'] = df['price'] / df['square_feet']
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['is_new'] = (df['age'] < 5).astype(int)
return df

house_df = engineer_regression_features(house_df)

# Prepare features and target
regression_features = ['square_feet', 'bedrooms', 'bathrooms', 'age',
'garage', 'location_score', 'total_rooms', 'is_new']
X_reg = house_df[regression_features]
y_reg = house_df['price']

# Split data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features
reg_scaler = StandardScaler()
X_reg_train_scaled = reg_scaler.fit_transform(X_reg_train)
X_reg_test_scaled = reg_scaler.transform(X_reg_test)

# Train regression models
regression_models = {
'Linear Regression': LinearRegression(),
'Gradient Boosting': GradientBoostingRegressor(random_state=42),
'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

regression_results = {}

for name, model in regression_models.items():
# Train model
if name == 'Linear Regression':
model.fit(X_reg_train_scaled, y_reg_train)
y_pred = model.predict(X_reg_test_scaled)
else:
model.fit(X_reg_train, y_reg_train)
y_pred = model.predict(X_reg_test)

# Evaluate model
mse = mean_squared_error(y_reg_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_reg_test, y_pred)

regression_results[name] = {
'mse': mse,
'rmse': rmse,
'r2': r2
}

print(f"\n{name} Results:")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")

# Feature importance (for tree-based models)
if 'Random Forest' in regression_models:
rf_model = regression_models['Random Forest']
feature_importance = pd.DataFrame({
'feature': regression_features,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

Unsupervised Learning

Clustering

# K-Means Clustering Example
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def perform_clustering_analysis():
# Create sample customer data
np.random.seed(42)
n_customers = 500

# Create customer segments
segment1 = np.random.multivariate_normal([25, 30000], [[10, 0], [0, 5000000]], 150)
segment2 = np.random.multivariate_normal([45, 60000], [[15, 0], [0, 8000000]], 200)
segment3 = np.random.multivariate_normal([35, 45000], [[8, 0], [0, 6000000]], 150)

customer_data = np.vstack([segment1, segment2, segment3])

df_customers = pd.DataFrame(customer_data, columns=['age', 'annual_income'])
df_customers['spending_score'] = (
np.random.normal(50, 20, len(df_customers)) +
(df_customers['annual_income'] / 1000) * 0.3
)

return df_customers

# Load customer data
customers_df = perform_clustering_analysis()

# Prepare features for clustering
clustering_features = ['age', 'annual_income', 'spending_score']
X_cluster = customers_df[clustering_features]

# Scale features
cluster_scaler = StandardScaler()
X_cluster_scaled = cluster_scaler.fit_transform(X_cluster)

# Determine optimal number of clusters using elbow method
def find_optimal_clusters(X, max_k=10):
inertias = []
k_range = range(1, max_k + 1)

for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertias, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')

# Silhouette analysis
silhouette_scores = []
for k in range(2, max_k + 1):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
silhouette_scores.append(silhouette_score(X, cluster_labels))

plt.subplot(1, 2, 2)
plt.plot(range(2, max_k + 1), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')

plt.tight_layout()
plt.show()

optimal_k = range(2, max_k + 1)[np.argmax(silhouette_scores)]
return optimal_k

# Find optimal number of clusters
optimal_k = find_optimal_clusters(X_cluster_scaled)
print(f"Optimal number of clusters: {optimal_k}")

# Apply K-Means clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
customers_df['cluster'] = kmeans.fit_predict(X_cluster_scaled)

# Visualize clusters
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Age vs Income
axes[0].scatter(customers_df['age'], customers_df['annual_income'],
c=customers_df['cluster'], cmap='viridis', alpha=0.7)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Annual Income')
axes[0].set_title('Customer Segments: Age vs Income')

# Income vs Spending Score
axes[1].scatter(customers_df['annual_income'], customers_df['spending_score'],
c=customers_df['cluster'], cmap='viridis', alpha=0.7)
axes[1].set_xlabel('Annual Income')
axes[1].set_ylabel('Spending Score')
axes[1].set_title('Customer Segments: Income vs Spending')

# Age vs Spending Score
axes[2].scatter(customers_df['age'], customers_df['spending_score'],
c=customers_df['cluster'], cmap='viridis', alpha=0.7)
axes[2].set_xlabel('Age')
axes[2].set_ylabel('Spending Score')
axes[2].set_title('Customer Segments: Age vs Spending')

plt.tight_layout()
plt.show()

# Analyze clusters
cluster_analysis = customers_df.groupby('cluster').agg({
'age': ['mean', 'std'],
'annual_income': ['mean', 'std'],
'spending_score': ['mean', 'std']
}).round(2)

print("Cluster Analysis:")
print(cluster_analysis)

Model Evaluation and Validation

Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.model_selection import learning_curve, validation_curve

def perform_cross_validation(X, y, models, cv_folds=5):
"""Perform cross-validation for multiple models"""

cv_results = {}

# Stratified K-Fold for classification
skf = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)

for name, model in models.items():
# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

cv_results[name] = {
'mean_score': cv_scores.mean(),
'std_score': cv_scores.std(),
'scores': cv_scores
}

print(f"\n{name} Cross-Validation Results:")
print(f"Mean Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
print(f"Individual Scores: {cv_scores}")

return cv_results

# Example usage
cv_models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(random_state=42)
}

# Perform cross-validation
cv_results = perform_cross_validation(X_train_scaled, y_train, cv_models)

# Visualize cross-validation results
scores_df = pd.DataFrame({
name: result['scores'] for name, result in cv_results.items()
})

plt.figure(figsize=(12, 6))
scores_df.boxplot()
plt.title('Cross-Validation Scores Comparison')
plt.ylabel('Accuracy Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def hyperparameter_tuning_example():
"""Example of hyperparameter tuning with Grid Search"""

# Define parameter grids
param_grids = {
'Random Forest': {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
},
'SVM': {
'C': [0.1, 1, 10, 100],
'kernel': ['linear', 'rbf', 'poly'],
'gamma': ['scale', 'auto', 0.001, 0.01]
}
}

models_for_tuning = {
'Random Forest': RandomForestClassifier(random_state=42),
'SVM': SVC(random_state=42)
}

best_models = {}

for name, model in models_for_tuning.items():
print(f"\nTuning {name}...")

# Grid search with cross-validation
grid_search = GridSearchCV(
model,
param_grids[name],
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)

# Fit on training data
if name == 'SVM':
grid_search.fit(X_train_scaled, y_train)
else:
grid_search.fit(X_train, y_train)

best_models[name] = grid_search.best_estimator_

print(f"Best parameters for {name}:")
print(grid_search.best_params_)
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

return best_models

# Perform hyperparameter tuning
best_models = hyperparameter_tuning_example()

Feature Engineering and Selection

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.preprocessing import PolynomialFeatures

def advanced_feature_engineering(df):
"""Advanced feature engineering techniques"""

# Create polynomial features
poly_features = ['age', 'income']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_array = poly.fit_transform(df[poly_features])
poly_feature_names = poly.get_feature_names_out(poly_features)

# Add polynomial features to dataframe
poly_df = pd.DataFrame(poly_array, columns=poly_feature_names, index=df.index)
df = pd.concat([df, poly_df], axis=1)

# Create interaction features
df['age_income_interaction'] = df['age'] * df['income']
df['tenure_charges_ratio'] = df['tenure'] / (df['monthly_charges'] + 1)

# Create binned features
df['age_group'] = pd.cut(df['age'], bins=4, labels=['Young', 'Adult', 'Middle', 'Senior'])
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Encode categorical features
categorical_features = ['age_group', 'income_quartile']
for feature in categorical_features:
dummies = pd.get_dummies(df[feature], prefix=feature)
df = pd.concat([df, dummies], axis=1)

return df

def feature_selection_techniques(X, y):
"""Demonstrate various feature selection techniques"""

# Statistical feature selection
selector_statistical = SelectKBest(score_func=f_classif, k=5)
X_selected_statistical = selector_statistical.fit_transform(X, y)
selected_features_statistical = X.columns[selector_statistical.get_support()]

print("Top 5 features by statistical test:")
print(selected_features_statistical.tolist())

# Recursive Feature Elimination
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector_rfe = RFE(estimator, n_features_to_select=5)
X_selected_rfe = selector_rfe.fit_transform(X, y)
selected_features_rfe = X.columns[selector_rfe.get_support()]

print("\nTop 5 features by RFE:")
print(selected_features_rfe.tolist())

# Feature importance from tree-based model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 features by Random Forest importance:")
print(feature_importance.head(10))

return selected_features_statistical, selected_features_rfe, feature_importance

# Apply feature engineering and selection
# engineered_df = advanced_feature_engineering(df)
# selected_stat, selected_rfe, importance = feature_selection_techniques(X, y)

Model Interpretation and Explainability

# SHAP (SHapley Additive exPlanations) for model interpretability
try:
import shap

def explain_model_predictions(model, X_train, X_test, feature_names):
"""Use SHAP to explain model predictions"""

# Create explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:100]) # Explain first 100 predictions

# Summary plot
shap.summary_plot(shap_values, X_test[:100], feature_names=feature_names)

# Feature importance plot
shap.summary_plot(shap_values, X_test[:100], feature_names=feature_names, plot_type="bar")

# Waterfall plot for single prediction
shap.waterfall_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], feature_names=feature_names)

return shap_values

except ImportError:
print("SHAP not installed. Install with: pip install shap")

def explain_model_basic(model, feature_names):
"""Basic feature importance explanation"""
if hasattr(model, 'feature_importances_'):
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'][:10], importance_df['importance'][:10])
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

return importance_df
else:
print("Model doesn't have feature_importances_ attribute")
return None

Common Machine Learning Algorithms

Decision Trees

from sklearn.tree import DecisionTreeClassifier, plot_tree

def decision_tree_example():
"""Decision tree with visualization"""

# Create and train decision tree
dt = DecisionTreeClassifier(
max_depth=4,
min_samples_split=20,
min_samples_leaf=10,
random_state=42
)

dt.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy:.3f}")

# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(dt, feature_names=X.columns, class_names=['No Churn', 'Churn'], filled=True)
plt.title('Decision Tree Visualization')
plt.show()

return dt

# dt_model = decision_tree_example()

Support Vector Machines

def svm_example():
"""Support Vector Machine example with different kernels"""

# Different SVM kernels
svm_models = {
'Linear SVM': SVC(kernel='linear', random_state=42),
'RBF SVM': SVC(kernel='rbf', random_state=42),
'Polynomial SVM': SVC(kernel='poly', degree=3, random_state=42)
}

svm_results = {}

for name, model in svm_models.items():
# Train model (SVM works better with scaled data)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

svm_results[name] = {
'accuracy': accuracy,
'precision': precision,
'recall': recall
}

print(f"\n{name} Results:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")

return svm_results

# svm_results = svm_example()

Best Practices and Common Pitfalls

Data Preparation Best Practices

  1. Always split data before any preprocessing
  2. Handle missing values appropriately
  3. Scale features when necessary
  4. Avoid data leakage
  5. Use proper validation techniques

Model Selection Guidelines

def model_selection_guide():
"""Guidelines for choosing appropriate algorithms"""

guidelines = {
'Small dataset (< 1000 samples)': ['Naive Bayes', 'Linear models', 'k-NN'],
'Large dataset (> 100k samples)': ['Neural Networks', 'Random Forest', 'Gradient Boosting'],
'High-dimensional data': ['Linear models', 'Neural Networks'],
'Need interpretability': ['Decision Trees', 'Linear models'],
'Need high accuracy': ['Ensemble methods', 'Neural Networks'],
'Binary classification': ['Logistic Regression', 'SVM', 'Random Forest'],
'Multi-class classification': ['Random Forest', 'Gradient Boosting', 'Neural Networks'],
'Regression': ['Linear Regression', 'Random Forest Regressor', 'Gradient Boosting'],
'Clustering': ['K-Means', 'DBSCAN', 'Hierarchical'],
'Dimensionality reduction': ['PCA', 't-SNE', 'UMAP']
}

for scenario, algorithms in guidelines.items():
print(f"{scenario}: {', '.join(algorithms)}")

# model_selection_guide()

Avoiding Common Mistakes

def common_mistakes_and_solutions():
"""Common ML mistakes and how to avoid them"""

mistakes = {
'Data Leakage': {
'problem': 'Including future information in training',
'solution': 'Ensure temporal consistency, proper train/test split'
},
'Overfitting': {
'problem': 'Model performs well on training but poor on test data',
'solution': 'Use regularization, cross-validation, more data'
},
'Underfitting': {
'problem': 'Model is too simple to capture patterns',
'solution': 'Increase model complexity, add features'
},
'Improper validation': {
'problem': 'Using test data for model selection',
'solution': 'Use separate validation set or cross-validation'
},
'Ignoring class imbalance': {
'problem': 'Poor performance on minority class',
'solution': 'Use stratified sampling, SMOTE, class weights'
},
'Not scaling features': {
'problem': 'Features with different scales affect distance-based algorithms',
'solution': 'Use StandardScaler or MinMaxScaler'
}
}

for mistake, details in mistakes.items():
print(f"\n{mistake}:")
print(f"Problem: {details['problem']}")
print(f"Solution: {details['solution']}")

# common_mistakes_and_solutions()

Model Deployment Considerations

Model Serialization

import joblib
import pickle

def save_and_load_models():
"""Save and load trained models"""

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Save model using joblib (recommended for sklearn)
joblib.dump(model, 'trained_model.joblib')

# Save using pickle
with open('trained_model.pkl', 'wb') as f:
pickle.dump(model, f)

# Load model
loaded_model = joblib.load('trained_model.joblib')

# Verify model works
predictions = loaded_model.predict(X_test)
print(f"Loaded model accuracy: {accuracy_score(y_test, predictions):.3f}")

return loaded_model

# saved_model = save_and_load_models()

Conclusion

Machine learning is a powerful tool for extracting insights and making predictions from data. Success in ML requires:

  1. Understanding your data and problem domain
  2. Proper data preprocessing and feature engineering
  3. Appropriate algorithm selection based on problem type
  4. Rigorous validation to ensure generalization
  5. Continuous monitoring and retraining in production

Remember that machine learning is an iterative process - start simple, validate thoroughly, and gradually increase complexity as needed.

Content Review

The content in this repository has been reviewed by chevp. Chevp is dedicated to ensuring that the information provided is accurate, relevant, and up-to-date, helping users to learn and implement programming skills effectively.

About the Reviewer

For more insights and contributions, visit chevp's GitHub profile: chevp's GitHub Profile.