A pure NumPy implementation of a modern Transformer architecture (Multi-Head Attention, SwiGLU FFN, Pre-Layer Normalization) with complete manual backpropagation, including the derivation of the softmax Jacobian, variance routing, and gradient flow through causal masks.
Modern deep learning frameworks like PyTorch and TensorFlow abstract away the calculus and gradient flow of neural networks. While `import torch.nn as nn` is standard for production, relying exclusively on autograd engines can leave engineers with a shallow understanding of the underlying architecture.
The goal of this "toy model" is to prove fundamental comprehension of the linear algebra and multivariable calculus that powers modern Large Language Models (LLMs). This repository documents the evolution of a Neural Network from a single-head attention mechanism into a fully functional GPT Transformer block, written entirely from scratch without any ML libraries.
The initial implementation (01_single_head.py) focused on proving the core calculus of Self-Attention:
- The Softmax Jacobian: Collapsing the Jacobian of the Softmax function into a fully vectorized NumPy operation, `dZ_i = S_i (E_i - sum_j E_j S_j)`, where `S` is the Softmax output and `E` is the upstream gradient arriving at it.
- Matrix Transposition for Gradient Flow: Explicitly demonstrating why transposed matrices are required during the backward pass (e.g., `dW_q = sentence_embedding.T @ dQ`) to map blame back to original input features.
- The Total Derivative Rule: Summing the upstream gradients from Queries, Keys, and Values into a single unified gradient for the `sentence_embedding`.
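The Softmax backward formula above can be checked against the explicit Jacobian. The following is a minimal sketch, not the code from `01_single_head.py`: the function names (`softmax_backward`, `softmax_backward_explicit`) and the 4x4 test shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability; the output is unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(S, E):
    # Vectorized form: dZ_i = S_i * (E_i - sum_j E_j S_j), applied per row.
    return S * (E - (E * S).sum(axis=-1, keepdims=True))

def softmax_backward_explicit(S, E):
    # Reference version: build the full Jacobian diag(s) - s s^T for each row.
    dZ = np.empty_like(S)
    for r in range(S.shape[0]):
        s = S[r]
        J = np.diag(s) - np.outer(s, s)  # symmetric, so J @ g equals J.T @ g
        dZ[r] = J @ E[r]
    return dZ

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 4))   # hypothetical attention scores
E = rng.normal(size=(4, 4))   # hypothetical upstream gradient w.r.t. S
S = softmax(Z)

dZ_fast = softmax_backward(S, E)
dZ_ref = softmax_backward_explicit(S, E)
max_diff = np.abs(dZ_fast - dZ_ref).max()
print("max difference between vectorized and explicit:", max_diff)
```

The vectorized version avoids materializing a Jacobian per row, which is why it scales to full attention matrices. Each row of `dZ_fast` also sums to zero, reflecting that Softmax outputs are constrained to sum to one.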
The second implementation (02_multi_heads.py) upgrades the foundation into a true Generative Transformer Block by introducing parallel processing and time-awareness:
- Multi-Head Attention: Splitting the embedding dimension into parallel "brains" to learn specialized syntactic and semantic relationships.
- Learned Positional Encodings: Mimicking OpenAI's GPT design philosophy by relying on pure gradient descent to learn absolute position (`E[token] + P`).
- Causal Masking (The Blindfold): Adding an upper-triangular matrix of large negative values (`-1e9`, standing in for negative infinity) to the attention scores so that Softmax crushes the probabilities of future tokens to `0.000`.
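The causal mask can be illustrated in a few lines. This is a sketch under assumed shapes (a single head with `T = 4` positions and random scores), not the repository's code. It also shows why no gradient flows into masked positions: the Softmax backward pass multiplies by the attention weight, which is zero there.

```python
import numpy as np

T = 4  # hypothetical sequence length
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # stand-in for Q @ K.T / sqrt(d_head)

# Strict upper triangle (k=1) marks future positions; the diagonal stays visible.
mask = np.triu(np.ones((T, T)), k=1) * -1e9

# Row-wise Softmax over the masked scores.
masked = scores + mask
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)

# Backward pass through Softmax with a hypothetical upstream gradient G.
G = rng.normal(size=(T, T))
dZ = weights * (G - (G * weights).sum(axis=-1, keepdims=True))

print(np.round(weights, 3))
```

Because `exp(-1e9)` underflows to exactly `0.0` in float64, both `weights` and `dZ` are zero above the diagonal, so future tokens neither influence the output nor receive blame.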
The final implementation (03_ffn.py) extends the multi-head block into a complete Transformer block that mirrors modern architectures:
- SwiGLU Feed-Forward Network: Replaced standard ReLU layers with a Swish-Gated Linear Unit (SwiGLU), which multiplies the linear projection (`W_up`) element-wise by a Swish-activated pathway (`W_gate`).
- Pre-Layer Normalization (Pre-LN): Implemented manual forward and backward LayerNorm to prevent variance explosions in deep networks. The manual backpropagation derives the gradients for the learned shift (`beta`) and scale (`gamma`), and routes blame through the mean and variance tensors.
- Xavier / He Initialization: Scaled every `np.random.randn()` weight matrix by the inverse square root of its input dimension (`1 / np.sqrt(N)`) so that activation variance stays close to `1.0` through successive dot products.
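The LayerNorm backward pass described above can be sketched and verified with a finite-difference check. This is an illustrative version, not the code from `03_ffn.py`: the function names, the `eps` value, and the test shapes are assumptions. The mean and variance paths are folded into one closed-form expression for `dx`.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)      # biased variance (divides by N)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def layernorm_backward(dout, cache):
    x_hat, inv_std, gamma = cache
    dgamma = (dout * x_hat).sum(axis=0)      # scale gradient, summed over rows
    dbeta = dout.sum(axis=0)                 # shift gradient, summed over rows
    dx_hat = dout * gamma
    # Direct path, minus the mean path, minus the variance path.
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(axis=-1, keepdims=True)
        - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))                  # hypothetical (tokens, features)
gamma = rng.normal(size=8)
beta = rng.normal(size=8)
R = rng.normal(size=(3, 8))                  # random weights defining a scalar loss

def loss_fn(x_in):
    out, _ = layernorm_forward(x_in, gamma, beta)
    return (out * R).sum()

_, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(R, cache)  # dLoss/dout is R

# Central finite differences over every element of x.
h = 1e-6
num_dx = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    num_dx[idx] = (loss_fn(xp) - loss_fn(xm)) / (2 * h)

max_err = np.abs(dx - num_dx).max()
print("max abs error vs. finite differences:", max_err)
```

A numerical check like this is a cheap way to catch sign or axis mistakes in a hand-derived backward pass before wiring it into a training loop.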
The final Neural Network consists of:
- Trainable Vocabulary & Positional Embedding Matrices (`E`, `P`)
- A Causal Mask (`-1e9` upper triangle)
- Pre-Layer Normalization (`gamma`, `beta`)
- Multi-Head Self-Attention Layer (`W_q`, `W_k`, `W_v`)
- SwiGLU Feed-Forward Hidden Layer (`W_gate`, `W_up`, `W_down`)
- A Language Modeling Output Head (`W_lm`) trained via Categorical Cross-Entropy.
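To show how the components listed above fit together, here is a forward-pass sketch. It is not the code from `03_ffn.py`. The sizes, the helper names, the residual connections, and the use of a separate `gamma`/`beta` pair before each sub-layer are all assumptions made for illustration; only the parameter names `E`, `P`, `W_q`, `W_k`, `W_v`, `W_gate`, `W_up`, `W_down`, and `W_lm` come from the list.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, T, D, H, F = 16, 6, 8, 2, 32   # hypothetical sizes
d_head = D // H

def init(n_in, n_out):
    # Scale by 1 / sqrt(fan_in) to keep activation variance roughly constant.
    return rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)

E = rng.normal(size=(vocab, D))       # token embeddings
P = rng.normal(size=(T, D))           # learned positional embeddings
W_q, W_k, W_v = init(D, D), init(D, D), init(D, D)
W_gate, W_up, W_down = init(D, F), init(D, F), init(F, D)
W_lm = init(D, vocab)
gamma1, beta1 = np.ones(D), np.zeros(D)   # assumed: one LayerNorm per sub-layer
gamma2, beta2 = np.ones(D), np.zeros(D)

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def swish(x):
    return x / (1.0 + np.exp(-x))

def forward(tokens):
    n = len(tokens)
    x = E[tokens] + P[:n]                              # (n, D)
    h = layernorm(x, gamma1, beta1)                    # Pre-LN before attention
    split = lambda m: m.reshape(n, H, d_head).transpose(1, 0, 2)  # (H, n, d_head)
    Q, K, Vh = split(h @ W_q), split(h @ W_k), split(h @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)           # (H, n, n)
    scores = scores + np.triu(np.ones((n, n)), k=1) * -1e9        # causal mask
    attn = (softmax(scores) @ Vh).transpose(1, 0, 2).reshape(n, D)
    x = x + attn                                       # residual (assumption)
    h = layernorm(x, gamma2, beta2)                    # Pre-LN before the FFN
    x = x + (swish(h @ W_gate) * (h @ W_up)) @ W_down  # SwiGLU + residual
    return softmax(x @ W_lm)                           # (n, vocab) probabilities

tokens = rng.integers(0, vocab, size=T)
probs = forward(tokens)

# Next-token categorical cross-entropy: position t predicts token t + 1.
loss = -np.log(probs[np.arange(T - 1), tokens[1:]]).mean()

# Causality check: changing the last token must not affect earlier positions.
tokens_alt = tokens.copy()
tokens_alt[-1] = (tokens[-1] + 1) % vocab
probs_alt = forward(tokens_alt)
print("loss:", loss)
```

The final comparison between `probs` and `probs_alt` is a quick way to confirm the mask works end to end: every position except the last should produce identical predictions.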
The architecture is split into heavily commented, pedagogical steps to maximize readability.
To run the Phase 1 training loop:

    python core/01_single_head.py

To run the Phase 2 training loop:

    python core/02_multi_heads.py

To run the Phase 3 training loop (Complete Architecture):

    python core/03_ffn.py