Section 3 – Deep Learning Basics
Deep learning uses neural networks with many layers to learn rich, non-linear patterns from data (images, audio, text, time series). Think of it as automatic feature learning—the model discovers useful representations for you.
7) What is a Neural Network?
Official-style definition
A (feedforward) neural network is a parameterized function built by composing affine transforms (weights + biases) and non-linear activations, trained to minimize a loss over data by gradient-based optimization (backpropagation). (cf. Goodfellow, Bengio & Courville, “Deep Learning”)
Stacked layers turn inputs into predictions. Each layer learns weights so that, together, they map your data to the right answers.
Core building blocks
- Neurons & Layers: fully connected (dense) layers
- Activations: ReLU (fast, sparse), GELU (modern choice), sigmoid/tanh (older, saturate)
- Loss: MSE (regression), cross-entropy (classification)
- Optimizer: SGD(+momentum), Adam (good default)
- Regularization: L2 weight decay, dropout, data augmentation, early stopping
- Normalization: BatchNorm/LayerNorm → stabilizes/accelerates training
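The blocks above compose into a forward pass. A minimal sketch in plain NumPy (shapes and values are illustrative; no framework assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    # Affine transform: weights + biases
    return x @ W + b

def relu(z):
    # Non-linearity: zero out negatives (fast, produces sparse activations)
    return np.maximum(0.0, z)

x = rng.normal(size=(4, 3))   # batch of 4 samples, 3 features
W = rng.normal(size=(3, 5))   # 3 inputs -> 5 hidden units
b = np.zeros(5)

h = relu(dense(x, W, b))
print(h.shape)  # (4, 5)
```

Stacking several of these dense+activation pairs (with a loss and an optimizer on top) is all a basic MLP is.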
8) When to use which architecture (the 80/20 guide)
| Data shape | Best first pick | Why |
|---|---|---|
| Tabular (columns/rows) | MLP (Dense) | Simple, fast, strong baseline |
| Images / video frames | CNN | Spatial filters capture edges → shapes |
| Text / sequences / time series | RNN (LSTM/GRU) or 1D-CNN | Order matters; temporal patterns |
| Long text / multimodal | (Preview) Transformer | Global context via attention (next section) |
Pro tip: Start with the simplest model that matches your data shape. If it underfits complex patterns (images, long text), move up the ladder.
9) Convolutional Neural Networks (CNNs)
Official-style definition
A CNN applies learned convolutional kernels across grid-structured inputs (e.g., images), exploiting local connectivity, weight sharing, and translation equivariance to learn hierarchical features (edges → textures → parts → objects). (cf. LeCun et al.)
CNNs slide small filters over an image to detect edges and shapes, building up to full objects as you go deeper.
Key ideas
- Convolution: small filter scans the image; same weights reused = efficient
- Feature hierarchy: early layers = edges; deeper = object parts
- Pooling/stride: reduce spatial size; keep the essentials
- Modern blocks: Conv → Norm → Activation → Residual skip (ResNet-style)
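The convolution idea is small enough to write out. A minimal valid-mode 2D convolution in NumPy, with a hand-made vertical-edge filter on a toy image (filter and image are illustrative, not learned):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Slide the filter over the image (no padding, stride 1).
    # The same weights are reused at every location = weight sharing.
    kh, kw = kernel.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half
img = np.zeros((5, 5))
img[:, 3:] = 1.0

# Vertical-edge filter: responds to left-to-right brightness jumps
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

resp = conv2d_valid(img, edge_kernel)
print(resp)  # strongest responses sit where the edge is
```

In a real CNN the kernel values are learned by backprop; this loop is what fast library implementations compute under the hood.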
Where CNNs shine
- Image classification (quality control, defect detection)
- Object detection (safety PPE detection, vehicle counting)
- Segmentation (medical, satellite, plant layout maps)
Pitfalls & fixes
- Overfitting: heavy augmentation (flip/crop/color jitter), dropout, weight decay
- Data leakage: same scene in train & test via near-duplicate frames—deduplicate!
- Class imbalance: focal loss, class weights, balanced sampling
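Focal loss, mentioned above for class imbalance, fits in a few lines (binary case; the alpha/gamma values are the common defaults from the focal-loss paper, not tuned for any dataset):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss: (1 - p_t)^gamma down-weights easy examples,
    # so hard and rare-class examples dominate the gradient.
    p_t = np.where(y == 1, p, 1 - p)
    a_t = np.where(y == 1, alpha, 1 - alpha)
    return float(-np.mean(a_t * (1 - p_t) ** gamma * np.log(p_t + 1e-12)))

# An easy, confident correct prediction contributes far less than a hard one
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.40]), np.array([1]))
print(easy < hard)  # True
```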
10) Recurrent Neural Networks (RNNs)
Official-style definition
An RNN maintains a state that is updated recurrently, h_t = f(x_t, h_{t-1}), enabling modeling of sequential dependencies. LSTM and GRU introduce gates that control information flow to mitigate vanishing/exploding gradients. (cf. Elman; Hochreiter & Schmidhuber; Cho et al.)
RNNs read inputs one step at a time (word by word, tick by tick), remembering what matters through a hidden state; LSTM/GRU are smarter RNNs that remember/forget more reliably.
Variants
- Vanilla RNN: simple but struggles with long dependencies
- LSTM: input/forget/output gates manage memory
- GRU: streamlined LSTM—often similar performance, fewer parameters
- Bi-RNN: reads forward & backward (great for text classification)
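The recurrence h_t = f(x_t, h_{t-1}) is compact in code. A minimal vanilla-RNN unroll in NumPy (sizes and weights are illustrative; real layers add training and careful initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 6            # input size, hidden size, sequence length

Wx = rng.normal(scale=0.5, size=(d_in, d_h))
Wh = rng.normal(scale=0.5, size=(d_h, d_h))
b = np.zeros(d_h)

xs = rng.normal(size=(T, d_in))   # one toy sequence
h = np.zeros(d_h)                 # hidden state starts empty

for x_t in xs:                    # read one step at a time
    h = np.tanh(x_t @ Wx + h @ Wh + b)   # h_t = f(x_t, h_{t-1})

print(h.shape)  # (4,)
```

LSTM and GRU keep this same loop but replace the single tanh update with gated updates that decide what to keep and what to forget.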
Where RNNs shine
- Text classification, sentiment, NER (short/medium sequences)
- Time-series forecasting (demand, sensor health)
- Speech/IoT streams where order matters
Pitfalls & fixes
- Slow training (long sequences): truncate backprop, batch sequences by length, consider 1D-CNN or Transformers for long context
- Overfitting on small text: subword tokenization, dropout on embeddings, label smoothing
- Vanishing gradients: prefer LSTM/GRU; clip gradients; use residual connections
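Gradient clipping by global norm, one of the fixes above, is a few lines (the threshold is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together when their combined L2 norm is too large
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # 13.0
```

Scaling all gradients by one shared factor preserves their direction, which is why global-norm clipping is usually preferred over clipping each element independently.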
11) Training Deep Models: a practical checklist
Data & splits
- Train/validate/test; for time series use temporal split (no leakage)
- Normalize inputs (images: mean/std; tabular: StandardScaler)
Key knobs to tune
- Learning rate (use a schedule or one-cycle)
- Batch size (fit GPU, watch generalization)
- Model capacity (layers/width)
- Regularization (dropout, weight decay)
- Augmentation (domain-aware > random)
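A one-cycle-style learning-rate schedule can be sketched as linear warmup plus cosine decay (the warmup fraction and divisors here are illustrative assumptions, not any library's defaults):

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3, start_div=25.0, end_div=1e4):
    # Warm up linearly to max_lr, then cosine-anneal down toward max_lr/end_div
    warmup = int(0.3 * total_steps)          # warmup fraction: an assumption
    if step < warmup:
        lo = max_lr / start_div
        return lo + (max_lr - lo) * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    lo = max_lr / end_div
    return lo + (max_lr - lo) * 0.5 * (1 + math.cos(math.pi * t))

lrs = [one_cycle_lr(s, 100) for s in range(100)]
print(round(max(lrs), 6))  # peak equals max_lr
```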
Evaluation
- Classification: Accuracy, F1, ROC-AUC, calibration (reliability plots)
- Regression: MAE/RMSE, MAPE; include prediction intervals where safety-critical
- Computer vision: mAP (detection), IoU/Dice (segmentation)
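IoU and Dice for binary masks are quick to compute by hand. A NumPy sketch on a toy 2x2 mask:

```python
import numpy as np

def iou(pred, target):
    # Intersection over Union for binary masks
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0

def dice(pred, target):
    # Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total else 1.0

pred = np.array([[1, 1], [0, 0]], dtype=bool)
target = np.array([[1, 0], [1, 0]], dtype=bool)
print(round(float(iou(pred, target)), 3), float(dice(pred, target)))  # 0.333 0.5
```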
Debugging fast
- Overfit a tiny subset (e.g., 100 samples). If it can’t, there’s a bug.
- Plot loss curves (exploding? LR too high).
- Sanity-check labels and augmentations (visualize batches!).
- Track experiments (lr, batch, seed) with a simple log or MLflow/W&B.
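The "overfit a tiny subset" check works even for a linear model trained with plain gradient descent. A self-contained sketch on toy data (if the loss refuses to drop like this on your real code, suspect a bug):

```python
import numpy as np

# Toy, perfectly learnable data: targets are an exact linear function
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = X @ np.array([2.0, -1.0])

w = np.zeros(2)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # MSE gradient
    w -= 0.1 * grad                          # plain gradient descent

loss = float(np.mean((X @ w - y) ** 2))
print(loss < 1e-8)  # True: the model memorized the tiny subset
```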
Deployment notes
- Convert to ONNX/TensorRT for speed; quantize to 8-bit where possible
- Monitor data drift and performance decay; alert on anomalies
- Keep a rollback model and shadow new models before full cutover
12) Quick “Which model should I try first?” (ready-to-use)
| Goal | Start with | If underfits or slow… |
|---|---|---|
| Tabular classification | Logistic/Gradient Boosting | Small MLP with dropout |
| Small image classification | Shallow CNN | Pretrained ResNet (fine-tune) |
| Object detection | Pretrained YOLO family | Larger model, more data/aug |
| Short text classification | Bi-GRU / 1D-CNN | (Next section) Transformer fine-tune |
| Time-series regression | 1D-CNN or GRU | Add exogenous features; Transformers later |
13) Ethics, reliability, and safety (don’t skip!)
- Bias & fairness: check subgroup metrics; rebalance data; document limitations
- Explainability: saliency maps (images), SHAP (tabular), attention viz (text)
- Safety-critical use: add guardrails, fallbacks, human-in-the-loop, uncertainty estimates
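Subgroup metrics are cheap to check. A minimal sketch computing per-group accuracy (labels and groups are toy data; "group" stands for any protected or operational attribute you care about):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

subgroup_acc = {}
for g in np.unique(group):
    m = group == g
    subgroup_acc[str(g)] = float(np.mean(y_true[m] == y_pred[m]))

print(subgroup_acc)  # a large gap between groups is a red flag
```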
