Building a GPT-style Large Language Model from scratch in PyTorch, one component at a time.
This repository is a hands-on, notebook-driven implementation that follows Sebastian Raschka's Build a Large Language Model (From Scratch). Each notebook implements one stage of the pipeline, from raw text tokenization through to a fine-tuned, instruction-following model. The core building blocks (tokenizer, dataloaders, attention, the transformer block, the GPT model, the training loop, and fine-tuning) are written by hand rather than imported from a high-level library.
The notebooks are meant to be read and run roughly in this order:
| # | Notebook | What it covers |
|---|---|---|
| 1 | data-processing.ipynb | Building a Dataset/DataLoader over text using Byte-Pair Encoding (BPE) tokenization, and creating token + positional embedding layers. |
| 2 | coding-attention-mechanisms.ipynb | Self-attention, causal (masked) self-attention with dropout, and multi-head attention, including an efficient single-matrix implementation. |
| 3 | gpt-model.ipynb | Assembling the full GPT-2 architecture: layer normalization, GELU feed-forward networks, transformer blocks, and shortcut connections. |
| 4 | pretrain.ipynb | Text generation, cross-entropy loss, the training loop, evaluation, and loading pretrained GPT-2 weights. |
| 5 | fine-tuning.ipynb | Fine-tuning the model for classification (SMS spam detection) by adding a classification head. |
| 6 | instruction-tuning.ipynb | Instruction tuning on an instruction–response dataset (Alpaca-style prompt formatting). |
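
As a small illustration of the Alpaca-style prompt formatting listed for notebook 6, the sketch below builds one training string from an instruction record. The helper name `format_prompt` and the example record are hypothetical and are not taken from the notebook; the template text follows the commonly used Alpaca wording, and the notebook's exact version may differ.

```python
def format_prompt(entry: dict) -> str:
    """Format one record in the Alpaca prompt style (hypothetical helper name)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The optional input section is included only when the record has one.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt


# Made-up example record, for illustration only.
example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}

# During training, the target response is appended after the prompt.
training_text = format_prompt(example) + f"\n\n### Response:\n{example['output']}"
print(training_text)
```

At inference time only the prompt plus the `### Response:` marker would be fed to the model, which is then expected to generate the response text itself.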
- the-verdict.txt: the short story used as the corpus for tokenization and pretraining experiments.
- sms_spam_collection/: the SMS Spam Collection dataset used for the classification fine-tuning notebook.
Generated artifacts such as loss-plot.pdf and accuracy-plot.pdf
contain training/evaluation curves produced by the notebooks.
This project targets Python 3.10 (see .python-version) and uses
PyTorch for the model, tiktoken for BPE tokenization, TensorFlow/Keras for loading
the original GPT-2 checkpoint weights, and matplotlib for plotting.
# 1. Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Launch Jupyter and open a notebook
jupyter notebook

Then work through the notebooks in the order listed above.
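
Before opening the first notebook, it can help to confirm that the environment actually contains the main dependencies. The following is a minimal sketch, not part of the repository: the import names in `REQUIRED` are assumed from the dependency list above (the authoritative list is requirements.txt), and `missing_packages` is a hypothetical helper.

```python
import importlib.util
import sys

# Import names assumed from the dependency list in this README;
# adjust them to match requirements.txt if they differ.
REQUIRED = ["torch", "tiktoken", "tensorflow", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0])
missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only looks the packages up without importing them, so it runs quickly even when heavy libraries such as TensorFlow are installed.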
build-llm/
├── data-processing.ipynb # Tokenization, dataloaders, embeddings
├── coding-attention-mechanisms.ipynb # Attention mechanisms
├── gpt-model.ipynb # Full GPT-2 architecture
├── pretrain.ipynb # Training loop & pretrained weights
├── fine-tuning.ipynb # Classification fine-tuning
├── instruction-tuning.ipynb # Instruction tuning
├── the-verdict.txt # Text corpus
├── sms_spam_collection/ # Spam classification dataset
├── requirements.txt
└── .python-version
The structure and concepts follow Sebastian Raschka's Build a Large Language Model (From Scratch). This repository is a personal learning implementation of the ideas presented there.