Section 2 – Statistical Machine Learning

4) Linear & Logistic Regression

Linear Regression

“A regression model in which the conditional mean of the response variable is an affine function of the explanatory variables.”
(The Elements of Statistical Learning – Hastie, Tibshirani, Friedman)

Logistic Regression

“A generalized linear model for a binary response variable in which the log odds of the probability of the outcome are modeled as a linear function of the explanatory variables.”
(Applied Logistic Regression – Hosmer, Lemeshow)

  • Linear Regression: Fits a straight line (or hyperplane) to predict a numeric value.
  • Logistic Regression: Fits an S-shaped (sigmoid) curve to predict a probability between 0 and 1.
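
The contrast between the two outputs can be shown numerically. This is a minimal sketch with made-up coefficients; the variable names are illustrative, not from any library:

```python
import numpy as np

# Linear regression predicts a raw number: y_hat = b0 + b1*x (any real value).
b0, b1 = 2.0, 0.5          # illustrative coefficients
x = np.array([-4.0, 0.0, 4.0])
linear_pred = b0 + b1 * x

# Logistic regression passes the same linear score through a sigmoid,
# squashing it into a probability strictly between 0 and 1.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prob_pred = sigmoid(b0 + b1 * x)

print(linear_pred)  # [0. 2. 4.]
print(prob_pred)    # each value inside (0, 1)
```

The same linear score drives both models; only the final transformation differs.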

Practical Examples

  • Linear: Predict house price from area, rooms, location score.
  • Logistic: Classify emails as spam or not spam.

Workflow Checklist

  1. Define target (numeric for linear; binary for logistic).
  2. Collect features; preprocess data.
  3. Train/test split.
  4. Scale features (important when using regularization).
  5. Fit model with regularization.
  6. Evaluate (MAE, RMSE for linear; ROC-AUC, F1 for logistic).
  7. Inspect coefficients for interpretability.
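
The checklist above can be sketched end to end on synthetic data. This assumes scikit-learn is available; the feature and target construction is invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                            # 3 made-up features
y_num = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)
y_bin = (y_num > 0).astype(int)                          # binary target for logistic

# Steps 3-4: split first, then fit the scaler on the training set only.
X_tr, X_te, yn_tr, yn_te, yb_tr, yb_te = train_test_split(
    X, y_num, y_bin, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Step 5: regularized fits (Ridge = L2 linear; LogisticRegression is L2 by default).
lin = Ridge(alpha=1.0).fit(X_tr_s, yn_tr)
log = LogisticRegression().fit(X_tr_s, yb_tr)

# Step 6: evaluate with the matching metrics.
rmse = mean_squared_error(yn_te, lin.predict(X_te_s)) ** 0.5
auc = roc_auc_score(yb_te, log.predict_proba(X_te_s)[:, 1])

# Step 7: coefficients stay interpretable (per standardized feature).
print(f"RMSE={rmse:.2f}  AUC={auc:.2f}  coefs={lin.coef_.round(2)}")
```

Fitting the scaler on the training split only is what prevents the leakage listed under Common Pitfalls.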

Common Pitfalls

  • Leakage of future data.
  • Multicollinearity inflating coefficient variance (unstable estimates).
  • Outliers skewing predictions.

5) Decision Trees (Classification & Regression)

“A decision tree is a flowchart-like structure in which each internal node represents a ‘test’ on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (classification) or a numerical value (regression).”
(Classification and Regression Trees – Breiman, Friedman, Olshen, Stone, 1984)

A tree of “if-else” questions that splits data into purer groups until a final prediction is made.

Practical Examples

  • Loan approval decision-making.
  • Predicting machine failure based on sensor readings.

Workflow Checklist

  1. Preprocess data (handle missing values, encode categories).
  2. Limit tree depth to avoid overfitting.
  3. Train model; visualize tree.
  4. Evaluate (Confusion Matrix/F1 for classification; MAE/RMSE for regression).
  5. Extract human-readable rules.
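
A sketch of this checklist on toy loan-style data, assuming scikit-learn; the features and approval rule are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
income = rng.uniform(20, 120, size=400)          # illustrative features
debt = rng.uniform(0, 50, size=400)
X = np.column_stack([income, debt])
y = ((income > 60) & (debt < 25)).astype(int)    # toy approval rule

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Steps 2-3: cap the depth to fight overfitting, then fit.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# Step 4: evaluate.
f1 = f1_score(y_te, tree.predict(X_te))

# Step 5: print the human-readable if-else rules.
print(export_text(tree, feature_names=["income", "debt"]))
print(f"F1 = {f1:.2f}")
```

The printed rules read as plain threshold tests, which is exactly the transparency advantage noted in the Quick Reference.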

Common Pitfalls

  • Overfitting with deep trees.
  • Instability with small data changes.
  • Biased splits for high-cardinality features.

6) K-Means Clustering (Unsupervised)

“The k-means algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.”
(MacQueen, J., 1967 – Some Methods for Classification and Analysis of Multivariate Observations)

Groups similar data points into clusters, each represented by its “average” (centroid).

Practical Examples

  • Customer segmentation in marketing.
  • Grouping products by purchasing patterns.

Workflow Checklist

  1. Standardize features.
  2. Select k using Elbow or Silhouette method.
  3. Train model; name clusters for business use.
  4. Validate clusters with real-world metrics.
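
Steps 1-2 of the checklist can be sketched as follows, assuming scikit-learn; the three planted "customer segments" are synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs standing in for customer segments.
centers = np.array([[0, 0], [8, 8], [0, 8]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Step 1: standardize so no single feature dominates the distance.
X_s = StandardScaler().fit_transform(X)

# Step 2: pick k by silhouette score over a small candidate range.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_s)
    scores[k] = silhouette_score(X_s, labels)
best_k = max(scores, key=scores.get)
print("silhouettes:", {k: round(v, 2) for k, v in scores.items()})
print("chosen k:", best_k)   # recovers the 3 planted blobs
```

Naming and validating the resulting clusters (steps 3-4) is a business exercise rather than a coding one.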

Common Pitfalls

  • Poor performance on irregular-shaped clusters.
  • Sensitivity to outliers.
  • Random initialization leading to different results.
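
The random-initialization pitfall has a standard mitigation: run several restarts and keep the best. A hedged sketch on synthetic blobs, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])

# One purely random start per seed: inertia (within-cluster SSE) can vary
# from run to run when a start lands in a poor local optimum.
single = [KMeans(n_clusters=4, init="random", n_init=1,
                 random_state=s).fit(X).inertia_ for s in range(5)]

# Default k-means++ seeding with 10 restarts keeps the best of 10 runs.
stable = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).inertia_

print("single-start inertias:", [round(i, 1) for i in single])
print("multi-start inertia:  ", round(stable, 1))
```

The multi-start result is never worse than the single starts it competes with, which is why `n_init > 1` is the usual default.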

Quick Reference – When to Use What

Situation                            | Best Method         | Why
Predict a number with linear trends  | Linear Regression   | Simple, interpretable
Predict yes/no with probabilities    | Logistic Regression | Calibrated outputs
Need human-readable rules            | Decision Tree       | Transparent logic
No labels, need groups               | K-Means             | Quick baseline clustering

Trivia Box

  • Despite its name, logistic regression is a classifier: it regresses the log odds, then thresholds the resulting probability.
  • Decision trees can naturally handle non-linear relationships without feature engineering.
  • The “K” in K-Means is chosen by you — not learned by the algorithm.

Mini Exercises for Learners

  1. Linear Regression: Predict house prices from area, rooms, and location score; compare RMSE with and without regularization.
  2. Decision Tree: Train a tree for loan approval; limit depth to 3 and explain 3 rules.
  3. K-Means: Cluster customers by spend, visits, and recency; name the clusters for a marketing plan.