Section 4 – Transformers

Transformers are neural network architectures that replace recurrence/convolutions with attention, enabling parallel training, long-range context, and scaling to very large models.

14) Why Transformers Changed AI (Key Features)

  • No recurrence
    Traditional RNNs process tokens step-by-step. Transformers process all tokens in parallel, using attention to let each position look at all others.
  • Attention mechanism
    Each token computes which other tokens to focus on when producing its next representation.
  • Parallelisation
    Because there’s no sequential dependency in the encoder (and masked parallelism in the decoder), we can fully utilise modern accelerators (GPUs/TPUs).
  • Scalability
    Model quality improves predictably as we scale data, parameters, and compute—which is why “large language models” (LLMs) work so well.

15) Core Definitions (official-style)

  • Transformer (Vaswani et al., 2017)
    A sequence model built from stacked attention and position-wise feed-forward layers (plus residuals & normalization), operating without recurrence or convolution.
  • Self-Attention
    Attention where QQQ, KKK, and VVV all come from the same sequence, letting each token attend to every other token.
  • Multi-Head Attention
    Multiple attention “heads” (independent projections of Q,K,VQ,K,VQ,K,V) run in parallel, capturing different relation types (syntax, long-range links, etc.), then concatenated and mixed.
  • Positional Encoding / Position Embeddings
    Since attention alone is permutation-invariant, we inject position information (e.g., sinusoidal or learned embeddings) so the model knows token order.
  • Encoder / Decoder
    • Encoder: stacks of self-attention + FFN → contextualise inputs.
    • Decoder: stacks of masked self-attention + cross-attention + FFN → generate outputs autoregressively.

Scaled Dot-Product Attention
This computes weighted combinations of values where weights reflect query-key similarity.

16) Architecture at a Glance

Encoder Block (×N)

  1. LayerNorm → Multi-Head Self-Attention → Residual
  2. LayerNorm → Feed-Forward (MLP) → Residual

Decoder Block (×N)

  1. LayerNorm → Masked Multi-Head Self-Attention → Residual
  2. LayerNorm → Cross-Attention (attends over encoder outputs) → Residual
  3. LayerNorm → Feed-Forward → Residual

Training Objectives (common)

  • Masked Language Modeling (MLM): predict masked tokens (e.g., BERT).
  • Causal LM: predict next token given previous tokens (e.g., GPT family).
  • Seq2Seq: encode input, decode output (e.g., T5, translation).

17) Practical Benefits & Costs

Benefits

  • Long-range context handling
  • High throughput training (parallel)
  • Strong transfer learning (pretrain → fine-tune or prompt)

Costs / Pitfalls

  • Quadratic attention: memory/time scale as O(n2)O(n^2)O(n2) with sequence length
  • Hallucinations in generative use—needs guardrails/RAG
  • Data hunger & compute cost—quality depends on large, clean corpora

Mitigations

  • Efficient attention variants (sparse, linearised, sliding-window)
  • Retrieval-Augmented Generation (RAG) for up-to-date, source-grounded answers
  • PEFT/LoRA to fine-tune cheaply; quantisation (8-bit/4-bit) for deployment

18) What Transformers Do Well (Applications)

  • Language: translation, summarisation, Q&A, code generation, chat assistants
  • Vision: classification, detection, segmentation via ViT and hybrids
  • Speech: ASR (automatic speech recognition), speaker diarisation
  • Multimodal: text↔image (captioning, generation), text↔audio, video understanding
  • Enterprise: search & RAG over documents, analytics copilots, form understanding
  • Time-series: long-context forecasting and anomaly detection (with care)

19) Popular Transformer Families (what & when to use)

  • BERT (Encoder-only, MLM)
    Great for understanding tasks: classification, NER, semantic search (via embeddings).
    Tip: Fine-tune or use sentence-embeddings for retrieval.
  • T5 (Encoder-Decoder, Text-to-Text)
    Unifies tasks as text in → text out (summarise, translate, QA).
    Tip: Simple prompting (“summarise: …”) often works well.
  • GPT family (Decoder-only, Causal LM)
    Strong at generation: drafting, coding, chain-of-thought, tool-use via prompting.
  • Whisper (Encoder-Decoder for Speech)
    Robust speech-to-text across languages and accents.
  • ViT (Vision Transformer)
    Image understanding by splitting images into patches → transformer encoder.
    Tip: Fine-tune on your dataset; use augmentations.
  • CLIP (Vision-Language, Contrastive)
    Aligns images and text in a shared space → powerful zero-shot recognition & retrieval.
  • DALL·E (Text→Image generation)
    Produces images from prompts; modern systems combine transformer planning + diffusion decoders for fidelity and alignment.

20) Worked Examples (mental models)

  • Summarisation (T5/GPT): “Summarise this report in 5 bullet points focusing on safety KPIs.”
  • Retrieval-Augmented QA (GPT/BERT-embeddings): Embed docs → retrieve top-k → feed to an LLM for grounded answers.
  • Image Captioning (ViT+Transformer decoder): Encode image patches → decode caption tokens.
  • Speech-to-Text (Whisper): Feed audio features to encoder → decode text.

21) Hands-On: Quickstarts (low-code ideas)

  • Zero-shot classify support tickets with a GPT prompt (no training).
  • Fine-tune a small T5 on your FAQs for better on-brand summaries.
  • Build RAG: generate embeddings (BERT/SBERT), store in a vector DB, retrieve + prompt an LLM.
  • Transcribe meetings with Whisper; summarise with a T5/GPT prompt.

22) Evaluation & Safety

  • Text: BLEU/ROUGE for MT/summarisation; accuracy/F1 for classification; human eval for quality.
  • Vision: top-k accuracy; mAP for detection; IoU/Dice for segmentation.
  • Speech: Word Error Rate (WER).
  • Safety: test adversarial prompts, measure hallucination rate, add citations via RAG, log failures.

23) Visual Asset Pack (for your designer)

  1. “Attention Heatmap”: show a token attending strongly to earlier keywords.
  2. “Transformer Block”: LayerNorm → Multi-Head Attention → Add/Residual → FFN → Add/Residual.
  3. “Encoder–Decoder Flow”: input tokens → encoder → cross-attention in decoder → generated tokens.
  4. “Model Zoo” tiles: BERT, T5, GPT, ViT, CLIP, Whisper, DALL·E with 1-line use cases.
  5. “RNN vs Transformer” throughput comparison sketch (parallel vs sequential).

24) Mini-Exercises

  1. Prompting: Take a paragraph from a policy doc. Ask GPT to produce: (a) 5-bullet summary; (b) glossary; (c) action items. Compare outputs when you add role/context in the prompt.
  2. RAG sanity: Index 20–30 PDFs; ask factual questions. Track answer accuracy with and without retrieval.
  3. Vision: Try a ViT fine-tune on a small labeled set (e.g., 5–10 classes). Does freezing early layers help when data is tiny?
  4. ASR: Transcribe a 2-minute audio with Whisper; compute WER by manually checking errors.