build-llm

Building a GPT-style Large Language Model from scratch in PyTorch, one component at a time.

This repository is a hands-on, notebook-driven implementation that follows Sebastian Raschka's Build a Large Language Model (From Scratch). Each notebook implements one stage of the pipeline, from raw text tokenization to a fine-tuned, instruction-following model. The core building blocks (tokenizer, dataloaders, attention, the transformer block, the GPT model, the training loop, and fine-tuning) are written by hand rather than imported from a high-level library.

Notebooks

The notebooks are meant to be read and run roughly in this order:

#	Notebook	What it covers
1	`data-processing.ipynb`	Building a `Dataset`/`DataLoader` over text using Byte-Pair Encoding (BPE) tokenization, and creating token + positional embedding layers.
2	`coding-attention-mechanisms.ipynb`	Self-attention, causal (masked) self-attention with dropout, and multi-head attention — including an efficient single-matrix implementation.
3	`gpt-model.ipynb`	Assembling the full GPT-2 architecture: layer normalization, GELU feed-forward networks, transformer blocks, and shortcut connections.
4	`pretrain.ipynb`	Text generation, cross-entropy loss, the training loop, evaluation, and loading pretrained GPT-2 weights.
5	`fine-tuning.ipynb`	Fine-tuning the model for classification (SMS spam detection) by adding a classification head.
6	`instruction-tuning.ipynb`	Instruction tuning on an instruction–response dataset (Alpaca-style prompt formatting).
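As an illustration of the Alpaca-style prompt formatting mentioned in the last row, here is a minimal sketch. The helper names `format_prompt` and `format_training_text` and the sample record are hypothetical, and the template text is the commonly used Alpaca wording; the notebook may phrase or structure it differently.

```python
def format_prompt(entry: dict) -> str:
    """Build an Alpaca-style prompt from one record (hypothetical helper)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The input section is optional: skip it when the record has no input text.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt


def format_training_text(entry: dict) -> str:
    """Append the expected response, giving the full text used for training."""
    return format_prompt(entry) + f"\n\n### Response:\n{entry['output']}"


# Made-up sample record, for illustration only.
example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}
print(format_training_text(example))
```

At inference time only `format_prompt(entry)` followed by the `### Response:` marker would be fed to the model, which is then expected to generate the response itself.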

Data

the-verdict.txt — the short story used as the corpus for tokenization and pretraining experiments.
sms_spam_collection/ — the SMS Spam Collection dataset used for the classification fine-tuning notebook.

Generated artifacts such as loss-plot.pdf and accuracy-plot.pdf contain training/evaluation curves produced by the notebooks.
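To give a rough idea of how a tokenized corpus such as the-verdict.txt can be turned into next-token training examples, the sketch below builds input/target pairs with a sliding window. It is an assumption-laden illustration: `sliding_window_pairs` is a hypothetical helper, the token IDs are stand-in integers rather than real BPE output, and the notebook's `Dataset` may differ in its details.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) chunks where each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the target chunk still has max_length tokens.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Stand-in token IDs; in practice these would come from a BPE tokenizer.
token_ids = list(range(10))
pairs = sliding_window_pairs(token_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

With `stride` equal to `max_length` the windows do not overlap; a smaller stride produces more, partially overlapping examples from the same text.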

Getting started

This project targets Python 3.10 (see .python-version) and uses PyTorch for the model, tiktoken for BPE tokenization, TensorFlow/Keras for loading the original GPT-2 checkpoint weights, and matplotlib for plotting.

# 1. Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Launch Jupyter and open a notebook
jupyter notebook

Then work through the notebooks in the order listed above.
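Before opening the first notebook, a quick check along these lines can confirm that the environment is usable. This is only a sketch: the package list is inferred from the description above rather than read from requirements.txt, and `missing_packages` is a hypothetical helper.

```python
import importlib.util
import sys

# Import names assumed from the prose above; requirements.txt is authoritative.
REQUIRED = ["torch", "tiktoken", "tensorflow", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, re-running the install step inside the activated virtual environment is the first thing to try.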

Project layout

build-llm/
├── data-processing.ipynb             # Tokenization, dataloaders, embeddings
├── coding-attention-mechanisms.ipynb # Attention mechanisms
├── gpt-model.ipynb                   # Full GPT-2 architecture
├── pretrain.ipynb                    # Training loop & pretrained weights
├── fine-tuning.ipynb                 # Classification fine-tuning
├── instruction-tuning.ipynb          # Instruction tuning
├── the-verdict.txt                   # Text corpus
├── sms_spam_collection/              # Spam classification dataset
├── requirements.txt
└── .python-version

Acknowledgements

The structure and concepts follow Sebastian Raschka's Build a Large Language Model (From Scratch). This repository is a personal learning implementation of the ideas presented there.
