Section 4 – Transformers -

Transformers are neural network architectures that replace recurrence/convolutions with attention, enabling parallel training, long-range context, and scaling to very large models.

14) Why Transformers Changed AI (Key Features)

No recurrence
Traditional RNNs process tokens step-by-step. Transformers process all tokens in parallel, using attention to let each position look at all others.
Attention mechanism
Each token computes which other tokens to focus on when producing its next representation.
Parallelisation
Because there’s no sequential dependency in the encoder (and masked parallelism in the decoder), we can fully utilise modern accelerators (GPUs/TPUs).
Scalability
Model quality improves predictably as we scale data, parameters, and compute—which is why “large language models” (LLMs) work so well.

15) Core Definitions (official-style)

Transformer (Vaswani et al., 2017)
A sequence model built from stacked attention and position-wise feed-forward layers (plus residuals & normalization), operating without recurrence or convolution.
Self-Attention
Attention where QQQ, KKK, and VVV all come from the same sequence, letting each token attend to every other token.
Multi-Head Attention
Multiple attention “heads” (independent projections of Q,K,VQ,K,VQ,K,V) run in parallel, capturing different relation types (syntax, long-range links, etc.), then concatenated and mixed.
Positional Encoding / Position Embeddings
Since attention alone is permutation-invariant, we inject position information (e.g., sinusoidal or learned embeddings) so the model knows token order.
Encoder / Decoder
- Encoder: stacks of self-attention + FFN → contextualise inputs.
- Decoder: stacks of masked self-attention + cross-attention + FFN → generate outputs autoregressively.

Scaled Dot-Product Attention
This computes weighted combinations of values where weights reflect query-key similarity.

16) Architecture at a Glance

Encoder Block (×N)

LayerNorm → Multi-Head Self-Attention → Residual
LayerNorm → Feed-Forward (MLP) → Residual

Decoder Block (×N)

LayerNorm → Masked Multi-Head Self-Attention → Residual
LayerNorm → Cross-Attention (attends over encoder outputs) → Residual
LayerNorm → Feed-Forward → Residual

Training Objectives (common)

Masked Language Modeling (MLM): predict masked tokens (e.g., BERT).
Causal LM: predict next token given previous tokens (e.g., GPT family).
Seq2Seq: encode input, decode output (e.g., T5, translation).

17) Practical Benefits & Costs

Benefits

Long-range context handling
High throughput training (parallel)
Strong transfer learning (pretrain → fine-tune or prompt)

Costs / Pitfalls

Quadratic attention: memory/time scale as O(n2)O(n^2)O(n2) with sequence length
Hallucinations in generative use—needs guardrails/RAG
Data hunger & compute cost—quality depends on large, clean corpora

Mitigations

Efficient attention variants (sparse, linearised, sliding-window)
Retrieval-Augmented Generation (RAG) for up-to-date, source-grounded answers
PEFT/LoRA to fine-tune cheaply; quantisation (8-bit/4-bit) for deployment

18) What Transformers Do Well (Applications)

Language: translation, summarisation, Q&A, code generation, chat assistants
Vision: classification, detection, segmentation via ViT and hybrids
Speech: ASR (automatic speech recognition), speaker diarisation
Multimodal: text↔image (captioning, generation), text↔audio, video understanding
Enterprise: search & RAG over documents, analytics copilots, form understanding
Time-series: long-context forecasting and anomaly detection (with care)

19) Popular Transformer Families (what & when to use)

BERT (Encoder-only, MLM)
Great for understanding tasks: classification, NER, semantic search (via embeddings).
Tip: Fine-tune or use sentence-embeddings for retrieval.
T5 (Encoder-Decoder, Text-to-Text)
Unifies tasks as text in → text out (summarise, translate, QA).
Tip: Simple prompting (“summarise: …”) often works well.
GPT family (Decoder-only, Causal LM)
Strong at generation: drafting, coding, chain-of-thought, tool-use via prompting.
Whisper (Encoder-Decoder for Speech)
Robust speech-to-text across languages and accents.
ViT (Vision Transformer)
Image understanding by splitting images into patches → transformer encoder.
Tip: Fine-tune on your dataset; use augmentations.
CLIP (Vision-Language, Contrastive)
Aligns images and text in a shared space → powerful zero-shot recognition & retrieval.
DALL·E (Text→Image generation)
Produces images from prompts; modern systems combine transformer planning + diffusion decoders for fidelity and alignment.

20) Worked Examples (mental models)

Summarisation (T5/GPT): “Summarise this report in 5 bullet points focusing on safety KPIs.”
Retrieval-Augmented QA (GPT/BERT-embeddings): Embed docs → retrieve top-k → feed to an LLM for grounded answers.
Image Captioning (ViT+Transformer decoder): Encode image patches → decode caption tokens.
Speech-to-Text (Whisper): Feed audio features to encoder → decode text.

21) Hands-On: Quickstarts (low-code ideas)

Zero-shot classify support tickets with a GPT prompt (no training).
Fine-tune a small T5 on your FAQs for better on-brand summaries.
Build RAG: generate embeddings (BERT/SBERT), store in a vector DB, retrieve + prompt an LLM.
Transcribe meetings with Whisper; summarise with a T5/GPT prompt.

22) Evaluation & Safety

Text: BLEU/ROUGE for MT/summarisation; accuracy/F1 for classification; human eval for quality.
Vision: top-k accuracy; mAP for detection; IoU/Dice for segmentation.
Speech: Word Error Rate (WER).
Safety: test adversarial prompts, measure hallucination rate, add citations via RAG, log failures.

23) Visual Asset Pack (for your designer)

“Attention Heatmap”: show a token attending strongly to earlier keywords.
“Transformer Block”: LayerNorm → Multi-Head Attention → Add/Residual → FFN → Add/Residual.
“Encoder–Decoder Flow”: input tokens → encoder → cross-attention in decoder → generated tokens.
“Model Zoo” tiles: BERT, T5, GPT, ViT, CLIP, Whisper, DALL·E with 1-line use cases.
“RNN vs Transformer” throughput comparison sketch (parallel vs sequential).

24) Mini-Exercises

Prompting: Take a paragraph from a policy doc. Ask GPT to produce: (a) 5-bullet summary; (b) glossary; (c) action items. Compare outputs when you add role/context in the prompt.
RAG sanity: Index 20–30 PDFs; ask factual questions. Track answer accuracy with and without retrieval.
Vision: Try a ViT fine-tune on a small labeled set (e.g., 5–10 classes). Does freezing early layers help when data is tiny?
ASR: Transcribe a 2-minute audio with Whisper; compute WER by manually checking errors.