Transformer Block from Scratch (Pure NumPy)

A pure NumPy implementation of a modern Transformer architecture (Multi-Head Attention, SwiGLU FFN, Pre-Layer Normalization) with complete manual backpropagation, including the derivation of the softmax Jacobian, gradient routing through the LayerNorm mean and variance, and gradient flow through causal masks.

Overview

Modern deep learning frameworks like PyTorch and TensorFlow abstract away the calculus and gradient flow of neural networks. While import torch.nn as nn is standard for production, relying exclusively on autograd engines can leave engineers with a shallow understanding of the underlying architecture.

The goal of this "toy model" is to demonstrate a working understanding of the linear algebra and multivariable calculus that power modern Large Language Models (LLMs). This repository documents the evolution of a neural network from a single-head attention mechanism into a fully functional GPT-style Transformer block, written entirely from scratch without any ML libraries.

Phase 1: The Single-Head Foundation

The initial implementation (01_single_head.py) focused on proving the core calculus of Self-Attention:

The Softmax Jacobian: Collapsing the Jacobian of the Softmax function into a fully vectorized NumPy operation: dZ_i = S_i (E_i - sum_j E_j S_j), where S is the Softmax output for one row and E is the upstream gradient with respect to S.
Matrix Transposition for Gradient Flow: Explicitly demonstrating why transposed matrices are required during the backward pass (e.g., dW_q = sentence_embedding.T @ dQ) to map blame back to original input features.
The Total Derivative Rule: Tracing upstream gradients from Queries, Keys, and Values back into a single unified gradient for the sentence_embedding.
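
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from 01_single_head.py: the helper names (softmax_backward, softmax_backward_explicit, projection_backward) and the tensor sizes are assumptions chosen for the example. The explicit Jacobian version exists only to check the vectorized formula against.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(S, E):
    # S: softmax output, E: upstream gradient dL/dS (same shape as S).
    # Row-wise: dZ_i = S_i * (E_i - sum_j E_j * S_j)
    return S * (E - (E * S).sum(axis=-1, keepdims=True))

def softmax_backward_explicit(S, E):
    # Reference version: build the full Jacobian diag(s) - s s^T for each row.
    dZ = np.zeros_like(S)
    for r in range(S.shape[0]):
        s = S[r]
        J = np.diag(s) - np.outer(s, s)
        dZ[r] = J @ E[r]  # J is symmetric, so J.T @ E[r] gives the same result
    return dZ

def projection_backward(sentence_embedding, W_q, W_k, W_v, dQ, dK, dV):
    # Weight gradients: transpose the input so shapes map back to (d_in, d_out).
    dW_q = sentence_embedding.T @ dQ
    dW_k = sentence_embedding.T @ dK
    dW_v = sentence_embedding.T @ dV
    # Total derivative: the input feeds Q, K and V, so the three paths are summed.
    dX = dQ @ W_q.T + dK @ W_k.T + dV @ W_v.T
    return dW_q, dW_k, dW_v, dX

rng = np.random.default_rng(0)

# Check the vectorized softmax gradient against the explicit Jacobian.
Z = rng.normal(size=(4, 4))
E = rng.normal(size=(4, 4))
S = softmax(Z)
dZ_fast = softmax_backward(S, E)
dZ_ref = softmax_backward_explicit(S, E)
print("softmax gradients match:", np.allclose(dZ_fast, dZ_ref))

# Illustrative sizes (assumed): 4 tokens, embedding width 8.
sentence_embedding = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3))
dQ, dK, dV = (rng.normal(size=(4, 8)) for _ in range(3))
dW_q, dW_k, dW_v, dX = projection_backward(
    sentence_embedding, W_q, W_k, W_v, dQ, dK, dV
)
print(dW_q.shape, dX.shape)
```

Each row of dZ sums to zero, which is a quick sanity check: Softmax is invariant to adding a constant to every logit in a row, so the gradient cannot have a component in that direction.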

Phase 2: Scaling to a GPT Architecture

The second implementation (02_multi_heads.py) upgrades the foundation into a true Generative Transformer Block by introducing parallel processing and time-awareness:

Multi-Head Attention: Splitting the embedding dimension into parallel "brains" to learn specialized syntactic and semantic relationships.
Learned Positional Encodings: Mimicking OpenAI's GPT design philosophy by relying on pure gradient descent to learn absolute position (E[token] + P).
Causal Masking (The Blindfold): Adding a strictly upper-triangular matrix filled with a large negative constant (-1e9, standing in for negative infinity) to the attention scores, so that Softmax assigns future positions a probability that is numerically 0.
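
A minimal sketch of how these three pieces could fit together in the forward pass. The helper names (causal_mask, split_heads, causal_attention_weights), the tensor sizes, and the 1 / sqrt(d_head) score scaling are assumptions for illustration; 02_multi_heads.py may organize this differently.

```python
import numpy as np

def causal_mask(T, neg=-1e9):
    # Zeros on and below the diagonal, a large negative value strictly above it.
    return np.triu(np.full((T, T), neg), k=1)

def split_heads(M, n_heads):
    # (T, d_model) -> (n_heads, T, d_head)
    T, d_model = M.shape
    return M.reshape(T, n_heads, d_model // n_heads).transpose(1, 0, 2)

def causal_attention_weights(Q, K, n_heads):
    Qh, Kh = split_heads(Q, n_heads), split_heads(K, n_heads)
    d_head = Qh.shape[-1]
    # Scaled dot-product scores per head (the scaling is an assumption here).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores + causal_mask(Q.shape[0])  # broadcasts over the head axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, T, d_model, n_heads = 6, 4, 8, 2  # illustrative sizes

E = rng.normal(size=(vocab, d_model))  # token embedding table
P = rng.normal(size=(T, d_model))      # learned positional table
tokens = np.array([2, 0, 3, 1])
X = E[tokens] + P                      # E[token] + P

W_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

A = causal_attention_weights(X @ W_q, X @ W_k, n_heads)
print(A.shape)  # (2, 4, 4): one T x T weight matrix per head
print(np.round(A[0], 3))
```

In the printed matrix, every entry above the diagonal is 0 and the first row places all of its weight on the first token, since that position has nothing earlier to attend to.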

Phase 3: Commercial Stabilization & Non-Linearity

The final implementation (03_ffn.py) extends the rudimentary multi-head mechanism into a more complete and numerically stable Transformer block that mirrors modern production setups:

SwiGLU Feed-Forward Network: Replaced standard ReLU layers with a Swish-Gated Linear Unit (SwiGLU). The hidden activation is the element-wise product of a linear projection (W_up) and a Swish-activated gate (W_gate), which W_down then projects back to the embedding dimension.
Pre-Layer Normalization (Pre-LN): Implemented manual forward and backward LayerNorm to prevent variance explosions in deep networks. The manual backpropagation derives the gradients for the learned shift (beta) and scale (gamma), and routes blame accurately through the variance and mean tensors.
Xavier / He Initialization: Scaled all normally distributed np.random.randn() weight matrices by the inverse square root of their input dimension (1 / np.sqrt(N)) so that activation variance stays close to 1.0 through successive matrix products. This fan-in scaling matches Xavier initialization for square matrices; He initialization would use np.sqrt(2 / N) instead.
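
The LayerNorm and SwiGLU backward passes are easy to get subtly wrong, so a finite-difference check is a useful companion. The following is a hedged sketch, not the code from 03_ffn.py: the function names, the cache layout, the eps value, and the tensor sizes are all assumptions. Weights use the 1 / np.sqrt(N) scaling described above.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def layernorm_backward(dy, cache):
    x_hat, inv_std, gamma = cache
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dx_hat = dy * gamma
    # The two subtracted terms are the paths through the mean and the variance.
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(axis=-1, keepdims=True)
        - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta

def swiglu_forward(x, W_gate, W_up, W_down):
    a, b = x @ W_gate, x @ W_up
    sig = 1.0 / (1.0 + np.exp(-a))
    h = (a * sig) * b  # Swish(a) gates the linear path b
    return h @ W_down, (x, a, b, sig, h, W_gate, W_up, W_down)

def swiglu_backward(dy, cache):
    x, a, b, sig, h, W_gate, W_up, W_down = cache
    dW_down = h.T @ dy
    dh = dy @ W_down.T
    db = dh * (a * sig)
    # d/da [a * sigmoid(a)] = sig + a * sig * (1 - sig)
    da = dh * b * (sig + a * sig * (1.0 - sig))
    dW_gate, dW_up = x.T @ da, x.T @ db
    dx = da @ W_gate.T + db @ W_up.T
    return dx, dW_gate, dW_up, dW_down

def numerical_grad(f, x, h=1e-5):
    # Central differences; perturbs x in place and restores each entry.
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        fp = f()
        x[idx] = old - h
        fm = f()
        x[idx] = old
        g[idx] = (fp - fm) / (2 * h)
    return g

rng = np.random.default_rng(0)
T, d, d_ff = 3, 4, 8  # illustrative sizes
x = rng.normal(size=(T, d))
gamma, beta = rng.normal(size=d), rng.normal(size=d)
W_gate = rng.normal(size=(d, d_ff)) / np.sqrt(d)
W_up = rng.normal(size=(d, d_ff)) / np.sqrt(d)
W_down = rng.normal(size=(d_ff, d)) / np.sqrt(d_ff)
G = rng.normal(size=(T, d))  # fixed upstream gradient; loss = sum(y * G)

y_ln, ln_cache = layernorm_forward(x, gamma, beta)
dx_ln, dgamma, dbeta = layernorm_backward(G, ln_cache)
dx_ln_num = numerical_grad(
    lambda: (layernorm_forward(x, gamma, beta)[0] * G).sum(), x
)

y_ff, ff_cache = swiglu_forward(x, W_gate, W_up, W_down)
dx_ff = swiglu_backward(G, ff_cache)[0]
dx_ff_num = numerical_grad(
    lambda: (swiglu_forward(x, W_gate, W_up, W_down)[0] * G).sum(), x
)

print("LayerNorm max abs error:", np.abs(dx_ln - dx_ln_num).max())
print("SwiGLU max abs error:", np.abs(dx_ff - dx_ff_num).max())
```

Both errors should be many orders of magnitude smaller than the gradients themselves. The same numerical_grad helper can be pointed at gamma, beta, or any weight matrix to check the remaining gradients.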

Architecture Summary

The final Neural Network consists of:

Trainable Vocabulary & Positional Embedding Matrices (E, P)
A Causal Mask (-1e9 upper triangle)
Pre-Layer Normalization (gamma, beta)
Multi-Head Self-Attention Layer (W_q, W_k, W_v)
SwiGLU Feed-Forward Hidden Layer (W_gate, W_up, W_down)
A Language Modeling Output Head (W_lm) trained via Categorical Cross-Entropy.
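
For the output head, the gradient of Categorical Cross-Entropy with respect to the logits takes the well-known form softmax(logits) minus the one-hot target. A minimal sketch follows; the function name, the tensor sizes, and the choice to average the loss over tokens are assumptions, not details taken from the repository.

```python
import numpy as np

def cross_entropy_with_grad(logits, targets):
    # logits: (T, vocab); targets: (T,) integer token ids.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    T = logits.shape[0]
    loss = -log_probs[np.arange(T), targets].mean()
    # Gradient w.r.t. logits: probabilities minus one-hot, averaged over tokens.
    dlogits = np.exp(log_probs)
    dlogits[np.arange(T), targets] -= 1.0
    return loss, dlogits / T

rng = np.random.default_rng(0)
T, d_model, vocab = 4, 8, 5  # illustrative sizes
H = rng.normal(size=(T, d_model))  # stand-in for the block's final hidden states
W_lm = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)
targets = np.array([1, 0, 4, 2])

loss, dlogits = cross_entropy_with_grad(H @ W_lm, targets)
dW_lm = H.T @ dlogits   # gradient for the output head
dH = dlogits @ W_lm.T   # gradient passed back into the Transformer block
print(loss, dW_lm.shape, dH.shape)
```

With all-zero logits the loss equals log(vocab), which is a handy baseline: an untrained model should start near that value.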

Usage

The architecture is split into heavily commented, pedagogical steps to maximize readability.

To run the Phase 1 training loop:

python core/01_single_head.py

To run the Phase 2 training loop:

python core/02_multi_heads.py

To run the Phase 3 training loop (Complete Architecture):

python core/03_ffn.py
