Section 3 – Deep Learning Basics
Deep learning uses neural networks with many layers to learn rich, non-linear patterns from data (images, audio, text, time series). Think of it as automatic feature learning—the model discovers useful representations for you.
7) What is a Neural Network?
Official-style definition
A (feedforward) neural network is a parameterized function built by composing affine transforms (weights + biases) and non-linear activations, trained to minimize a loss over data by gradient-based optimization (backpropagation). (cf. Goodfellow, Bengio & Courville, “Deep Learning”)
Stacked layers turn inputs into predictions. Each layer learns weights so that, together, they map your data to the right answers.
Core building blocks
- Neurons & Layers: fully connected (dense) layers
- Activations: ReLU (fast, sparse), GELU (modern choice), sigmoid/tanh (older, saturate)
- Loss: MSE (regression), cross-entropy (classification)
- Optimizer: SGD(+momentum), Adam (good default)
- Regularization: L2 weight decay, dropout, data augmentation, early stopping
- Normalization: BatchNorm/LayerNorm → stabilizes/accelerates training
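The blocks above compose into a forward pass. A minimal sketch in plain NumPy (shapes and values are illustrative; no framework assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    # Affine transform: weights + biases
    return x @ W + b

def relu(z):
    # Non-linearity: zero out negatives (fast, produces sparse activations)
    return np.maximum(0.0, z)

x = rng.normal(size=(4, 3))   # batch of 4 samples, 3 features
W = rng.normal(size=(3, 5))   # 3 inputs -> 5 hidden units
b = np.zeros(5)

h = relu(dense(x, W, b))
print(h.shape)  # (4, 5)
```

Stacking several of these dense+activation pairs (with a loss and an optimizer on top) is all a basic MLP is.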
8) When to use which architecture (the 80/20 guide)
| Data shape | Best first pick | Why |
|---|---|---|
| Tabular (columns/rows) | MLP (Dense) | Simple, fast, strong baseline |
| Images / video frames | CNN | Spatial filters capture edges → shapes |
| Text / sequences / time series | RNN (LSTM/GRU) or 1D-CNN | Order matters; temporal patterns |
| Long text / multimodal | (Preview) Transformer | Global context via attention (next section) |
Pro tip: Start with the simplest model that matches your data shape. If it underfits complex patterns (images, long text), move up the ladder.
9) Convolutional Neural Networks (CNNs)
Official-style definition
A CNN applies learned convolutional kernels across grid-structured inputs (e.g., images), exploiting local connectivity, weight sharing, and translation equivariance to learn hierarchical features (edges → textures → parts → objects). (cf. LeCun et al.)
CNNs slide small filters over an image to detect edges and shapes, building up to full objects as you go deeper.
Key ideas
- Convolution: small filter scans the image; same weights reused = efficient
- Feature hierarchy: early layers = edges; deeper = object parts
- Pooling/stride: reduce spatial size; keep the essentials
- Modern blocks: Conv → Norm → Activation → Residual skip (ResNet-style)
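The convolution idea is small enough to write out. A minimal valid-mode 2D convolution in NumPy, with a hand-made vertical-edge filter on a toy image (filter and image are illustrative, not learned):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Slide the filter over the image (no padding, stride 1).
    # The same weights are reused at every location = weight sharing.
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half
img = np.zeros((5, 5))
img[:, 3:] = 1.0

# Vertical-edge filter: responds to left-to-right brightness jumps
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

resp = conv2d_valid(img, edge_kernel)
print(resp)  # strongest responses sit where the edge is
```

In a real CNN the kernel values are learned by backprop; this loop is what fast library implementations compute under the hood.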
Where CNNs shine
- Image classification (quality control, defect detection)
- Object detection (safety PPE detection, vehicle counting)
- Segmentation (medical, satellite, plant layout maps)
Pitfalls & fixes
- Overfitting: heavy augmentation (flip/crop/color jitter), dropout, weight decay
- Data leakage: same scene in train & test via near-duplicate frames—deduplicate!
- Class imbalance: focal loss, class weights, balanced sampling
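Focal loss, mentioned above for class imbalance, fits in a few lines (binary case; the alpha/gamma values are the common defaults from the focal-loss paper, not tuned for any dataset):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss: (1 - p_t)^gamma down-weights easy examples,
    # so hard and rare-class examples dominate the gradient.
    p_t = np.where(y == 1, p, 1 - p)
    a_t = np.where(y == 1, alpha, 1 - alpha)
    return float(-np.mean(a_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)))

# An easy, confident correct prediction contributes far less than a hard one
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.40]), np.array([1]))
print(easy < hard)  # True
```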
10) Recurrent Neural Networks (RNNs)
Official-style definition
An RNN maintains a state that is updated recurrently, h_t = f(x_t, h_{t-1}), enabling modeling of sequential dependencies. LSTM and GRU introduce gates that control information flow to mitigate vanishing/exploding gradients. (cf. Elman; Hochreiter & Schmidhuber; Cho et al.)
RNNs read inputs one step at a time (word by word, tick by tick), remembering what matters through a hidden state; LSTM/GRU are smarter RNNs that remember/forget more reliably.
Variants
- Vanilla RNN: simple but struggles with long dependencies
- LSTM: input/forget/output gates manage memory
- GRU: streamlined LSTM—often similar performance, fewer parameters
- Bi-RNN: reads forward & backward (great for text classification)
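The recurrence h_t = f(x_t, h_{t-1}) is compact in code. A minimal vanilla-RNN unroll in NumPy (sizes and weights are illustrative; real layers add training and careful initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6            # input size, hidden size, sequence length

Wx = rng.normal(scale=0.5, size=(d_in, d_h))
Wh = rng.normal(scale=0.5, size=(d_h, d_h))
b = np.zeros(d_h)

xs = rng.normal(size=(T, d_in))   # one toy sequence
h = np.zeros(d_h)                 # hidden state starts empty

for x_t in xs:                    # read one step at a time
    h = np.tanh(x_t @ Wx + h @ Wh + b)   # h_t = f(x_t, h_{t-1})

print(h.shape)  # (4,)
```

LSTM and GRU keep this same loop but replace the single tanh update with gated updates that decide what to keep and what to forget.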
Where RNNs shine
- Text classification, sentiment, NER (short/medium sequences)
- Time-series forecasting (demand, sensor health)
- Speech/IoT streams where order matters
Pitfalls & fixes
- Slow training (long sequences): truncate backprop, batch sequences by length, consider 1D-CNN or Transformers for long context
- Overfitting on small text: subword tokenization, dropout on embeddings, label smoothing
- Vanishing gradients: prefer LSTM/GRU; clip gradients; use residual connections
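Gradient clipping by global norm, one of the fixes above, is a few lines (the threshold is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together when their combined L2 norm is too large
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # 13.0
```

Scaling all gradients by one shared factor preserves their direction, which is why global-norm clipping is usually preferred over clipping each element independently.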
11) Training Deep Models: a practical checklist
Data & splits
- Train/validate/test; for time series use temporal split (no leakage)
- Normalize inputs (images: mean/std; tabular: StandardScaler)
Key knobs to tune
- Learning rate (use a schedule or one-cycle)
- Batch size (fit GPU, watch generalization)
- Model capacity (layers/width)
- Regularization (dropout, weight decay)
- Augmentation (domain-aware > random)
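A one-cycle-style learning-rate schedule can be sketched as linear warmup plus cosine decay (the warmup fraction and divisors here are illustrative assumptions, not any library's defaults):

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3, start_div=25.0, end_div=1e4):
    # Warm up linearly to max_lr, then cosine-anneal down toward max_lr/end_div
    warmup = int(0.3 * total_steps)          # warmup fraction: an assumption
    if step < warmup:
        lo = max_lr / start_div
        return lo + (max_lr - lo) * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    lo = max_lr / end_div
    return lo + (max_lr - lo) * 0.5 * (1 + math.cos(math.pi * t))

lrs = [one_cycle_lr(s, 100) for s in range(100)]
print(round(max(lrs), 6))  # peak equals max_lr
```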
Evaluation
- Classification: Accuracy, F1, ROC-AUC, calibration (reliability plots)
- Regression: MAE/RMSE, MAPE; include prediction intervals where safety-critical
- Computer vision: mAP (detection), IoU/Dice (segmentation)
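IoU and Dice for binary masks are quick to compute by hand. A NumPy sketch on a toy 2x2 mask:

```python
import numpy as np

def iou(pred, target):
    # Intersection over Union for binary masks
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def dice(pred, target):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0

pred = np.array([[1, 1], [0, 0]], dtype=bool)
target = np.array([[1, 0], [1, 0]], dtype=bool)
print(round(float(iou(pred, target)), 3), float(dice(pred, target)))  # 0.333 0.5
```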
Debugging fast
- Overfit a tiny subset (e.g., 100 samples). If it can’t, there’s a bug.
- Plot loss curves (exploding? LR too high).
- Sanity-check labels and augmentations (visualize batches!).
- Track experiments (lr, batch, seed) with a simple log or MLflow/W&B.
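The "overfit a tiny subset" check works even for a linear model trained with plain gradient descent. A self-contained sketch on toy data (if the loss refuses to drop like this on your real code, suspect a bug):

```python
import numpy as np

# Toy, perfectly learnable data: targets are an exact linear function
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = X @ np.array([2.0, -1.0])

w = np.zeros(2)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # MSE gradient
    w -= 0.1 * grad                          # plain gradient descent

loss = float(np.mean((X @ w - y) ** 2))
print(loss < 1e-8)  # True: the model memorized the tiny subset
```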
Deployment notes
- Convert to ONNX/TensorRT for speed; quantize to 8-bit where possible
- Monitor data drift and performance decay; alert on anomalies
- Keep a rollback model and shadow new models before full cutover
12) Quick “Which model should I try first?” (ready-to-use)
| Goal | Start with | If underfits or slow… |
|---|---|---|
| Tabular classification | Logistic/Gradient Boosting | Small MLP with dropout |
| Small image classification | Shallow CNN | Pretrained ResNet (fine-tune) |
| Object detection | Pretrained YOLO family | Larger model, more data/aug |
| Short text classification | Bi-GRU / 1D-CNN | (Next section) Transformer fine-tune |
| Time-series regression | 1D-CNN or GRU | Add exogenous features; Transformers later |
13) Ethics, reliability, and safety (don’t skip!)
- Bias & fairness: check subgroup metrics; rebalance data; document limitations
- Explainability: saliency maps (images), SHAP (tabular), attention viz (text)
- Safety-critical use: add guardrails, fallbacks, human-in-the-loop, uncertainty estimates
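Subgroup metrics are cheap to check. A minimal sketch computing per-group accuracy (labels and groups are toy data; "group" stands for any protected or operational attribute you care about):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

subgroup_acc = {}
for g in np.unique(group):
    m = group == g
    subgroup_acc[str(g)] = float(np.mean(y_true[m] == y_pred[m]))

print(subgroup_acc)  # a large gap between groups is a red flag
```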
