Section 3 – Deep Learning Basics

Deep learning uses neural networks with many layers to learn rich, non-linear patterns from data (images, audio, text, time series). Think of it as automatic feature learning—the model discovers useful representations for you.

7) What is a Neural Network?

Official-style definition
A (feedforward) neural network is a parameterized function built by composing affine transforms (weights + biases) and non-linear activations, trained to minimize a loss over data by gradient-based optimization (backpropagation). (cf. Goodfellow, Bengio & Courville, “Deep Learning”)

Stacked layers turn inputs into predictions. Each layer learns weights so that, together, they map your data to the right answers.

Core building blocks

  • Neurons & Layers: fully connected (dense) layers
  • Activations: ReLU (fast, sparse), GELU (modern choice), sigmoid/tanh (older, saturate)
  • Loss: MSE (regression), cross-entropy (classification)
  • Optimizer: SGD(+momentum), Adam (good default)
  • Regularization: L2 weight decay, dropout, data augmentation, early stopping
  • Normalization: BatchNorm/LayerNorm → stabilizes/accelerates training
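The building blocks above compose into a single forward pass: affine transform → activation → affine transform → loss. Below is a minimal pure-Python sketch (no framework; the weights are made-up toy values, purely illustrative) of a one-hidden-layer network scored with softmax cross-entropy:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, x):
    # y = W x + b, with W stored as a list of rows
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)                                # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, W1, b1, W2, b2):
    h = relu(affine(W1, b1, x))               # hidden layer: affine + non-linearity
    return softmax(affine(W2, b2, h))         # output layer: class probabilities

def cross_entropy(probs, target_idx):
    return -math.log(probs[target_idx] + 1e-12)

# Toy 2-input → 3-hidden → 2-class network with hand-picked weights
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]; b1 = [0.0, 0.1, 0.0]
W2 = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]];   b2 = [0.0, 0.0]

probs = forward([1.0, 2.0], W1, b1, W2, b2)
loss = cross_entropy(probs, target_idx=0)
```

Training then adjusts `W1, b1, W2, b2` by backpropagating the gradient of this loss; frameworks automate exactly that step.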

8) When to use which architecture (the 80/20 guide)

| Data shape | Best first pick | Why |
|---|---|---|
| Tabular (columns/rows) | MLP (Dense) | Simple, fast, strong baseline |
| Images / video frames | CNN | Spatial filters capture edges → shapes |
| Text / sequences / time series | RNN (LSTM/GRU) or 1D-CNN | Order matters; temporal patterns |
| Long text / multimodal | (Preview) Transformer | Global context via attention (next section) |

Pro tip: Start with the simplest model that matches your data shape. If it underfits complex patterns (images, long text), move up the ladder.

9) Convolutional Neural Networks (CNNs)

Official-style definition
A CNN applies learned convolutional kernels across grid-structured inputs (e.g., images), exploiting local connectivity, weight sharing, and translation equivariance to learn hierarchical features (edges → textures → parts → objects). (cf. LeCun et al.)

CNNs slide small filters over an image to detect edges and shapes, building up to full objects as you go deeper.

Key ideas

  • Convolution: small filter scans the image; same weights reused = efficient
  • Feature hierarchy: early layers = edges; deeper = object parts
  • Pooling/stride: reduce spatial size; keep the essentials
  • Modern blocks: Conv → Norm → Activation → Residual skip (ResNet-style)
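The "small filter scans the image, same weights reused" idea fits in a few lines. Here is a pure-Python sketch of a valid-padding convolution (strictly, cross-correlation, which is what deep learning frameworks implement under the name "convolution"); the tiny image and edge kernel are toy values:

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; the same weights are reused at
    every position, which is why convolutions are parameter-efficient."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge kernel on a tiny image: left half dark, right half bright
img = [[0, 0, 1, 1]] * 4
edge = [[-1, 1]]              # responds where intensity jumps left → right
fmap = conv2d(img, edge)      # feature map peaks exactly at the edge
```

The output feature map is zero everywhere except at the dark-to-bright boundary, which is the "early layers detect edges" behavior in miniature.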

Where CNNs shine

  • Image classification (quality control, defect detection)
  • Object detection (safety PPE detection, vehicle counting)
  • Segmentation (medical, satellite, plant layout maps)

Pitfalls & fixes

  • Overfitting: heavy augmentation (flip/crop/color jitter), dropout, weight decay
  • Data leakage: same scene in train & test via near-duplicate frames—deduplicate!
  • Class imbalance: focal loss, class weights, balanced sampling
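For the class-imbalance fix, one common recipe is inverse-frequency class weights (the same heuristic scikit-learn calls "balanced"). A minimal sketch, with made-up label data:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights so the
    loss isn't dominated by the majority class.
    weight_c = n / (k * count_c); perfectly balanced classes all get 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# 90 "ok" frames vs 10 "defect" frames → defects weigh 9x more in the loss
w = class_weights(["ok"] * 90 + ["defect"] * 10)
```

These weights are typically passed to the loss function (e.g. as per-class multipliers on cross-entropy) or used to bias the sampler.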

10) Recurrent Neural Networks (RNNs)

Official-style definition
An RNN maintains a state that is updated recurrently: h_t = f(x_t, h_{t-1}), enabling modeling of sequential dependencies. LSTM and GRU introduce gates that control information flow to mitigate vanishing/exploding gradients. (cf. Elman; Hochreiter & Schmidhuber; Cho et al.)

RNNs read inputs one step at a time (word by word, tick by tick), remembering what matters through a hidden state; LSTM/GRU are smarter RNNs that remember/forget more reliably.
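The update h_t = f(x_t, h_{t-1}) is concrete enough to write out. Below is one vanilla-RNN step with a tanh activation, run over a short sequence; the weights are toy values chosen only to show the mechanics:

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla-RNN update: h_t = tanh(W_xh·x_t + W_hh·h_prev + b_h).
    The same weights are reused at every time step."""
    h_t = []
    for i in range(len(h_prev)):
        acc = b_h[i]
        acc += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        acc += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(acc))   # squashing keeps the state bounded
    return h_t

# Read a 3-step sequence of scalars through a 2-unit RNN with toy weights
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]

h = [0.0, 0.0]                        # initial hidden state
for x in ([1.0], [0.5], [-1.0]):      # inputs arrive one tick at a time
    h = rnn_step(x, h, W_xh, W_hh, b_h)
```

LSTM and GRU keep this same read-one-step-at-a-time loop but replace the single tanh update with gated updates, which is what lets them remember/forget more reliably.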

Variants

  • Vanilla RNN: simple but struggles with long dependencies
  • LSTM: input/forget/output gates manage memory
  • GRU: streamlined LSTM—often similar performance, fewer parameters
  • Bi-RNN: reads forward & backward (great for text classification)

Where RNNs shine

  • Text classification, sentiment, NER (short/medium sequences)
  • Time-series forecasting (demand, sensor health)
  • Speech/IoT streams where order matters

Pitfalls & fixes

  • Slow training (long sequences): truncate backprop, batch sequences by length, consider 1D-CNN or Transformers for long context
  • Overfitting on small text: subword tokenization, dropout on embeddings, label smoothing
  • Vanishing gradients: prefer LSTM/GRU; clip gradients; use residual connections
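Gradient clipping, mentioned above as the exploding-gradient fix, is simple to state precisely: if the global L2 norm of the gradients exceeds a threshold, rescale them all by the same factor. A minimal sketch (frameworks provide this built in, e.g. as a clip-by-norm utility):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """If the global L2 norm of grads exceeds max_norm, scale every
    gradient down by the same factor so the direction is preserved."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm was 5.0
```

Scaling all gradients together (rather than clipping each element) keeps the update direction intact while bounding its size.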


11) Training Deep Models: a practical checklist

Data & splits

  • Train/validate/test splits; for time series use a temporal split (no leakage)
  • Normalize inputs (images: per-channel mean/std; tabular: StandardScaler)

Hyperparameters to tune (roughly in order of impact)

  1. Learning rate (use a schedule or one-cycle)
  2. Batch size (fit GPU memory; watch generalization)
  3. Model capacity (layers/width)
  4. Regularization (dropout, weight decay)
  5. Augmentation (domain-aware > random)
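Learning-rate schedules top that list, so here is one common shape written out: linear warmup followed by cosine decay (in the spirit of one-cycle training; the step counts and peak rate below are made-up example values):

```python
import math

def lr_at(step, total_steps, max_lr, warmup_steps):
    """Linear warmup to max_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 100 training steps, 10-step warmup, peak learning rate 0.1
schedule = [lr_at(s, total_steps=100, max_lr=0.1, warmup_steps=10)
            for s in range(100)]
```

The warmup avoids large destabilizing updates while weights are random; the decay lets training settle into a minimum instead of bouncing around it.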

Evaluation

  • Classification: Accuracy, F1, ROC-AUC, calibration (reliability plots)
  • Regression: MAE/RMSE, MAPE; include prediction intervals where safety-critical
  • Computer vision: mAP (detection), IoU/Dice (segmentation)
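Two of the metrics above are easy to get subtly wrong, so here they are written out from their definitions (toy labels and masks, purely illustrative):

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks (flattened to lists)."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
overlap = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For imbalanced classification, report F1 (or per-class F1) alongside accuracy; accuracy alone can look excellent while the minority class is entirely missed.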

Debugging fast

  • Overfit a tiny subset (e.g., 100 samples). If it can’t, there’s a bug.
  • Plot loss curves (exploding? LR too high).
  • Sanity-check labels and augmentations (visualize batches!).
  • Track experiments (lr, batch, seed) with a simple log or MLflow/W&B.
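The "overfit a tiny subset" check can be as small as this: fit a toy model on a handful of points and confirm the loss actually collapses. A minimal sketch with a linear model and plain gradient descent (the data is made up to follow y = 2x + 1):

```python
# Sanity check: a model that can't fit a few clean points has a bug somewhere.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # y = 2x + 1
w, b, lr = 0.0, 0.0, 0.1

def mse(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

loss_before = mse(w, b)
for _ in range(500):                           # plain gradient descent on MSE
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w, b = w - lr * gw, b - lr * gb
loss_after = mse(w, b)
```

With a real network the same logic applies: take ~100 samples, train until the loss is near zero, and only then scale up; if the loss plateaus high, suspect the data pipeline, labels, or loss wiring before the architecture.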

Deployment notes

  • Convert to ONNX/TensorRT for speed; quantize to 8-bit where possible
  • Monitor data drift and performance decay; alert on anomalies
  • Keep a rollback model and shadow new models before full cutover
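Drift monitoring can start much simpler than a full platform. Here is a crude but useful sketch (the function name, threshold, and feature values are all made up for illustration): flag drift when a live batch's mean feature value sits many standard errors away from the training mean.

```python
import math
import statistics

def drift_alarm(train_values, live_values, z_threshold=3.0):
    """Flag drift when the live-batch mean is more than z_threshold
    standard errors from the training mean (a simple mean-shift check)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    se = sigma / math.sqrt(len(live_values))   # standard error of the live mean
    z = abs(statistics.mean(live_values) - mu) / se
    return z > z_threshold

# A stable training feature vs. a live stream that has shifted upward
train = [10.0 + 0.5 * (i % 5) for i in range(100)]
ok = drift_alarm(train, [10.0 + 0.5 * (i % 5) for i in range(25)])
shifted = drift_alarm(train, [14.0 + 0.5 * (i % 5) for i in range(25)])
```

Per-feature checks like this catch input drift; pair them with periodic re-scoring on labeled data to catch performance decay the inputs alone won't reveal.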

12) Quick “Which model should I try first?” (ready-to-use)

| Goal | Start with | If underfits or slow… |
|---|---|---|
| Tabular classification | Logistic/Gradient Boosting | Small MLP with dropout |
| Small image classification | Shallow CNN | Pretrained ResNet (fine-tune) |
| Object detection | Pretrained YOLO family | Larger model, more data/aug |
| Short text classification | Bi-GRU / 1D-CNN | (Next section) Transformer fine-tune |
| Time-series regression | 1D-CNN or GRU | Add exogenous features; Transformers later |

13) Ethics, reliability, and safety (don’t skip!)

  • Bias & fairness: check subgroup metrics; rebalance data; document limitations
  • Explainability: saliency maps (images), SHAP (tabular), attention viz (text)
  • Safety-critical use: add guardrails, fallbacks, human-in-the-loop, uncertainty estimates