Zayer1/transformer-backprop-numpy

Transformer Block from Scratch (Pure NumPy)

A pure NumPy implementation of a modern Transformer architecture (Multi-Head Attention, SwiGLU FFN, Pre-Layer Normalization) with complete manual backpropagation, including the derivation of the softmax Jacobian, variance routing, and gradient flow through causal masks.

Overview

Modern deep learning frameworks like PyTorch and TensorFlow abstract away the calculus and gradient flow of neural networks. While import torch.nn as nn is standard for production, relying exclusively on autograd engines can leave engineers with a shallow understanding of the underlying architecture.

The goal of this "toy model" is to demonstrate fundamental comprehension of the linear algebra and multivariable calculus that power modern Large Language Models (LLMs). This repository documents the evolution of a neural network from a single-head attention mechanism into a fully functional GPT-style Transformer block, written entirely from scratch in NumPy without any ML libraries.

Phase 1: The Single-Head Foundation

The initial implementation (01_single_head.py) focused on proving the core calculus of Self-Attention:

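  1. The Softmax Jacobian: Collapsing the Jacobian of the Softmax function into a fully vectorized NumPy operation: dZ_i = S_i (E_i - sum_j E_j S_j), where S is the Softmax output and E is read here as the upstream gradient arriving at that output (not the embedding matrix E that appears in later phases).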
  2. Matrix Transposition for Gradient Flow: Explicitly demonstrating why transposed matrices are required during the backward pass (e.g., dW_q = sentence_embedding.T @ dQ) to map blame back to original input features.
  3. The Total Derivative Rule: Tracing upstream gradients from Queries, Keys, and Values back into a single unified gradient for the sentence_embedding.
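The three points above can be sketched together in a few lines. This is a minimal illustration rather than the code in 01_single_head.py: the names X (standing in for sentence_embedding), G, forward, backward, and numerical_grad are assumptions made for the example, as are the 1/sqrt(d) score scaling and the toy loss sum(out * G). The backward pass is checked against central finite differences.

```python
import numpy as np

def softmax(z):
    # Subtract the row max before exponentiating for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    S = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return S @ V, (Q, K, V, S)

def backward(d_out, X, W_q, W_k, W_v, cache):
    Q, K, V, S = cache
    dV = S.T @ d_out
    E = d_out @ V.T                      # upstream gradient at the softmax output
    # Softmax backward, row by row: dZ_i = S_i * (E_i - sum_j E_j S_j).
    dZ = S * (E - (E * S).sum(axis=-1, keepdims=True))
    dZ = dZ / np.sqrt(Q.shape[-1])       # undo the 1/sqrt(d) score scaling
    dQ, dK = dZ @ K, dZ.T @ Q
    # Transposed input maps each gradient back onto the weight that produced it.
    dW_q, dW_k, dW_v = X.T @ dQ, X.T @ dK, X.T @ dV
    # Total derivative: X feeds Q, K and V, so the three paths are summed.
    dX = dQ @ W_q.T + dK @ W_k.T + dV @ W_v.T
    return dX, dW_q, dW_k, dW_v

def numerical_grad(f, A, eps=1e-6):
    # Central finite differences of the scalar f() with respect to each entry of A.
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
T, d = 4, 6                              # toy sequence length and embedding size
X = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
G = rng.standard_normal((T, d))          # fixed upstream gradient for the toy loss

def loss():
    return float((forward(X, W_q, W_k, W_v)[0] * G).sum())

out, cache = forward(X, W_q, W_k, W_v)
dX, dW_q, dW_k, dW_v = backward(G, X, W_q, W_k, W_v, cache)
print("max |dX - numeric|  :", np.abs(dX - numerical_grad(loss, X)).max())
print("max |dW_q - numeric|:", np.abs(dW_q - numerical_grad(loss, W_q)).max())
```

Both printed differences should be tiny if the analytic gradients are right; a finite-difference check like this is a common way to validate a hand-written backward pass.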

Phase 2: Scaling to a GPT Architecture

The secondary implementation (02_multi_heads.py) upgrades the foundation into a true Generative Transformer Block by introducing parallel processing and time-awareness:

  1. Multi-Head Attention: Splitting the embedding dimension into parallel "brains" to learn specialized syntactic and semantic relationships.
  2. Learned Positional Encodings: Mimicking OpenAI's GPT design philosophy by relying on pure gradient descent to learn absolute position (E[token] + P).
  3. Causal Masking (The Blindfold): Generating a strictly upper-triangular mask filled with a large negative constant (-1e9, standing in for negative infinity) so that Softmax drives the probability of every future position to effectively zero.
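A forward-only sketch of the three ingredients above, under assumed toy sizes. The names split_heads, tokens, T, d_model, n_heads, and vocab are hypothetical and chosen for illustration; they are not taken from 02_multi_heads.py.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, n_heads, vocab = 5, 8, 2, 11   # assumed toy sizes
d_head = d_model // n_heads

E = rng.standard_normal((vocab, d_model)) * 0.1   # token embedding table
P = rng.standard_normal((T, d_model)) * 0.1       # learned positional embeddings
tokens = np.array([3, 1, 4, 1, 5])
X = E[tokens] + P                                  # E[token] + P

W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def split_heads(A):
    # (T, d_model) -> (n_heads, T, d_head): each head sees its own slice.
    return A.reshape(T, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

# Strictly upper-triangular mask: -1e9 above the diagonal, 0 elsewhere.
mask = np.triu(np.full((T, T), -1e9), k=1)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head) + mask   # (n_heads, T, T)
S = softmax(scores)
heads = S @ V                                                # (n_heads, T, d_head)
out = heads.transpose(1, 0, 2).reshape(T, d_model)           # concatenate heads

print(np.round(S[0], 3))   # attention weights of head 0; future positions are 0
```

Because the mask is an additive constant, it contributes nothing to the backward pass beyond what Softmax already does: masked positions have zero probability, so the Softmax gradient at those positions is zero as well.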

Phase 3: Commercial Stabilization & Non-Linearity

The final implementation (03_ffn.py) replaces the rudimentary multi-head mechanism with a mathematically complete, commercially stable Transformer architecture mirroring modern setups:

  1. SwiGLU Feed-Forward Network: Replaced standard ReLU layers with a Swish-Gated Linear Unit (SwiGLU). This adds non-linearity by gating the linear projection (W_up) element-wise with a Swish-activated pathway (W_gate).
  2. Pre-Layer Normalization (Pre-LN): Implemented manual forward and backward LayerNorm to prevent variance explosions in deep networks. The manual backpropagation derives the gradients for the shift (beta) and scale (gamma) and routes blame accurately through the variance and mean tensors.
  3. Xavier / He Initialization: Scaled all normally distributed np.random.randn() weight matrices by the inverse square root of their input dimension (1 / np.sqrt(N)) so that activation variance stays close to 1.0 through successive dot products.
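The LayerNorm backward pass described above can be sketched as follows. This is an illustrative version, not the code in 03_ffn.py: the function names, the eps value, and the toy loss sum(y * G) are assumptions. The three terms in dx correspond to the direct path, the path through the mean, and the path through the variance.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)          # population variance per row
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def layernorm_backward(dy, cache):
    x_hat, inv_std, gamma = cache
    dgamma = (dy * x_hat).sum(axis=0)            # scale gradient
    dbeta = dy.sum(axis=0)                       # shift gradient
    dx_hat = dy * gamma
    # Direct path, minus the mean path, minus the variance path.
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(axis=-1, keepdims=True)
        - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta

def numerical_grad(f, A, eps=1e-6):
    # Central finite differences of the scalar f() with respect to each entry of A.
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
T, d = 4, 8                                      # assumed toy sizes
x = rng.standard_normal((T, d))
gamma = rng.standard_normal(d)
beta = rng.standard_normal(d)
G = rng.standard_normal((T, d))                  # fixed upstream gradient

def loss():
    return float((layernorm_forward(x, gamma, beta)[0] * G).sum())

y, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(G, cache)
print("max |dx - numeric|    :", np.abs(dx - numerical_grad(loss, x)).max())
print("max |dgamma - numeric|:", np.abs(dgamma - numerical_grad(loss, gamma)).max())
```

One property worth noticing: each row of dx sums to zero, because shifting every feature of a row by the same constant leaves the normalized output unchanged.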

Architecture Summary

The final Neural Network consists of:

  • Trainable Vocabulary & Positional Embedding Matrices (E, P)
  • A Causal Mask (-1e9 upper triangle)
  • Pre-Layer Normalization (gamma, beta)
  • Multi-Head Self-Attention Layer (W_q, W_k, W_v)
  • SwiGLU Feed-Forward Hidden Layer (W_gate, W_up, W_down)
  • A Language Modeling Output Head (W_lm) trained via Categorical Cross-Entropy.
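As one concrete piece of the list above, the SwiGLU hidden layer (W_gate, W_up, W_down) might look like the sketch below. The function names, the hidden width, and the toy loss are assumptions for illustration; the weights use the 1 / np.sqrt(N) scaling mentioned in Phase 3, and the backward pass is checked against finite differences.

```python
import numpy as np

def swiglu_forward(x, W_gate, W_up, W_down):
    g = x @ W_gate                     # gate pre-activation
    u = x @ W_up                       # linear "up" projection
    sig = 1.0 / (1.0 + np.exp(-g))
    a = g * sig                        # Swish(g) = g * sigmoid(g)
    h = a * u                          # element-wise gating
    return h @ W_down, (x, g, u, sig, a, h)

def swiglu_backward(dy, W_gate, W_up, W_down, cache):
    x, g, u, sig, a, h = cache
    dW_down = h.T @ dy
    dh = dy @ W_down.T
    da, du = dh * u, dh * a
    # d/dg [g * sigmoid(g)] = sigmoid(g) + g * sigmoid(g) * (1 - sigmoid(g))
    dg = da * (sig + g * sig * (1.0 - sig))
    dW_gate, dW_up = x.T @ dg, x.T @ du
    dx = dg @ W_gate.T + du @ W_up.T   # x feeds both branches, so sum them
    return dx, dW_gate, dW_up, dW_down

def numerical_grad(f, A, eps=1e-6):
    # Central finite differences of the scalar f() with respect to each entry of A.
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
T, d_model, d_hidden = 4, 6, 10        # assumed toy sizes
x = rng.standard_normal((T, d_model))
W_gate = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
W_down = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
G = rng.standard_normal((T, d_model))  # fixed upstream gradient

def loss():
    return float((swiglu_forward(x, W_gate, W_up, W_down)[0] * G).sum())

y, cache = swiglu_forward(x, W_gate, W_up, W_down)
dx, dW_gate, dW_up, dW_down = swiglu_backward(G, W_gate, W_up, W_down, cache)
print("max |dx - numeric|     :", np.abs(dx - numerical_grad(loss, x)).max())
print("max |dW_gate - numeric|:", np.abs(dW_gate - numerical_grad(loss, W_gate)).max())
```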

Usage

The architecture is split into heavily commented, pedagogical steps to maximize readability.

To run the Phase 1 training loop:

python core/01_single_head.py

To run the Phase 2 training loop:

python core/02_multi_heads.py

To run the Phase 3 training loop (Complete Architecture):

python core/03_ffn.py

About

A pedagogical, from-scratch implementation of the Transformer Self-Attention mechanism and its full backpropagation calculus in pure NumPy.
