GPT from Scratch in Python

A GPT-style language model built from scratch in PyTorch, trained on ML/AI literature or Wikipedia.
The goal was to understand and implement the core Transformer architecture end-to-end (tokenization, attention, training loop, and inference) without relying on pre-built model libraries.

Our Java implementation of the same concept can be found here: gpt-from-scratch-java. The context is similar, but we switched to Python for more comfort and better performance.

Architecture

The model is a decoder-only Transformer (GPT-style) with:

Component	Details
Embedding dim	256
Attention heads	5
Transformer blocks	4
Context length	128 tokens
Tokenizer	SentencePiece (BPE, vocab size 1000)
Activation	GELU
Regularization	Dropout (0.2), LayerNorm (pre-norm)
Optimizer	AdamW (lr 3e-4)
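
To get a feel for the model size implied by the table above, the following back-of-the-envelope estimate may help. It is only a sketch under assumptions the table does not confirm: learned positional embeddings, an output head that is not tied to the token embedding, square attention projections, and no bias or LayerNorm parameters. Note that 256 is not a multiple of 5, so the per-head sizes used in the repository may make the real attention matrices differ slightly from the square ones assumed here.

```python
# Rough parameter estimate from the architecture table.
# Assumptions (not confirmed by the repository): learned positional
# embeddings, untied output head, square d x d attention projections,
# and no biases or LayerNorm weights.
vocab_size = 1000
d_model = 256
n_blocks = 4
context_len = 128
ffn_mult = 4

token_emb = vocab_size * d_model
pos_emb = context_len * d_model
attn_per_block = 4 * d_model * d_model            # Q, K, V and output projection
ffn_per_block = 2 * ffn_mult * d_model * d_model  # expand, then project back
blocks = n_blocks * (attn_per_block + ffn_per_block)
lm_head = d_model * vocab_size

total = token_emb + pos_emb + blocks + lm_head
print(f"approx. parameters: {total:,}")  # approx. parameters: 3,690,496
```

Under these assumptions the four Transformer blocks account for most of the weights (about 3.1 million of roughly 3.7 million).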

Each Transformer block consists of:

Multi-head causal self-attention (with masking)
Position-wise feed-forward network (4x expansion)
Residual connections + LayerNorm
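
The causal masking in the first component can be illustrated without PyTorch. The sketch below is a didactic single-head version over plain Python lists, not the code from training.py: the function name `causal_attention` is hypothetical, and the learned query, key, and value projections, the multiple heads, and dropout are all left out.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal self-attention over lists of vectors.

    q, k, v: lists of T vectors (lists of floats), one per position.
    Position t may only attend to positions 0..t.
    """
    T = len(q)
    d = len(q[0])
    out, weights = [], []
    for t in range(T):
        # Scaled dot-product scores against visible positions only;
        # positions after t are masked by never being scored.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        w = [e / z for e in exps] + [0.0] * (T - t - 1)  # zero weight on the future
        weights.append(w)
        out.append([sum(w[s] * v[s][j] for s in range(T))
                    for j in range(len(v[0]))])
    return out, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(x, x, x)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only see itself
```

The attention weight matrix is lower triangular, which is the same effect a PyTorch implementation typically gets by filling the upper triangle of the score matrix with negative infinity before the softmax.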

Project Structure

.
├── parse_wiki.py              # Extract plain text from a Wikipedia XML dump
├── TrainingScripts/
│   ├── cleanandmerge.py       # Text cleaning and merging of source files
│   ├── training.py            # Model definition + training loop
│   └── loadmodel.py           # Load checkpoint and run inference
├── LLM/
│   ├── AILLM_V2_Optimized.pth # Final trained model weights
│   └── checkpoint_4999.pth    # Checkpoint at step 4999
├── Tokenizer/
│   └── V2Optimized/           # Trained SentencePiece tokenizer
└── Data/
    ├── wikipedia/             # Place raw Wikipedia XML dump here
    └── V2/
        └── cleanedtxt/
            └── merged.txt     # Cleaned training corpus

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

Option A — Train on Wikipedia data

Download a Wikipedia dump from dumps.wikimedia.org (dewiki-latest-pages-articles.xml.bz2), extract it, and place the .xml file under Data/wikipedia/. Then run:

python parse_wiki.py [max_mb]

This extracts clean plain text from all main-namespace articles (no redirects, no templates, no markup) and writes it to Data/V2/wiki_plain.txt. The optional max_mb argument caps the output size (default: 500 MB).
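
The filtering and size cap described above could be sketched roughly as follows. This is not the code from parse_wiki.py: the function names are hypothetical, it relies only on the standard `<page>`, `<ns>`, `<redirect>`, and `<text>` elements of a MediaWiki export, and it leaves the wiki markup in place instead of stripping templates and formatting.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_articles(source):
    """Yield (title, wikitext) for main-namespace, non-redirect pages."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, ns, text, is_redirect = "", None, "", False
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "ns":
                ns = (child.text or "").strip()
            elif name == "redirect":
                is_redirect = True
            elif name == "text":
                text = child.text or ""
        if ns == "0" and not is_redirect:
            yield title, text
        elem.clear()  # release the page's children; dumps can be many gigabytes

def extract(source, out, max_mb=500):
    """Write article text to `out` until roughly max_mb megabytes are written."""
    limit = max_mb * 1024 * 1024
    written = 0
    for _, text in iter_articles(source):
        data = text.strip() + "\n\n"
        size = len(data.encode("utf-8"))
        if written + size > limit:
            break
        out.write(data)
        written += size
    return written
```

Streaming with `iterparse` keeps memory use low, since the dump is never loaded as a whole. `source` can be a file path or an open binary file object, and `out` any text file opened for writing.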

Option B — Train on custom text files

Place any .txt source files in Data/V2/, then run:

python TrainingScripts/cleanandmerge.py

This cleans the files (removing page numbers, URLs, and boilerplate) and merges them into Data/V2/cleanedtxt/merged.txt.
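
As an illustration of this kind of cleaning, here is a minimal sketch. The regular expressions and the function names `clean_text` and `merge` are assumptions made for this example; the actual rules in cleanandmerge.py are not shown in this README and are probably more thorough, for instance in how they detect boilerplate.

```python
import re
from pathlib import Path

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PAGE_NUM_RE = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)

def clean_text(text):
    """Drop page-number lines and URLs, and collapse runs of blank lines."""
    kept = []
    for line in text.splitlines():
        if PAGE_NUM_RE.match(line):
            continue  # a line holding only a number, e.g. "12" or "Page 12"
        line = URL_RE.sub("", line).rstrip()
        line = re.sub(r"[ \t]{2,}", " ", line)  # close the gap a removed URL leaves
        kept.append(line)
    cleaned = "\n".join(kept)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()

def merge(src_dir, out_file):
    """Clean every .txt file directly in src_dir and write one merged corpus."""
    src_dir, out_file = Path(src_dir), Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    parts = [clean_text(p.read_text(encoding="utf-8", errors="ignore"))
             for p in sorted(src_dir.glob("*.txt"))]
    out_file.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(parts)

print(clean_text("See https://example.com/x for more\n12\n\n\n\nNext paragraph"))
```

Because `glob("*.txt")` is not recursive, a call such as `merge("Data/V2", "Data/V2/cleanedtxt/merged.txt")` does not read its own output back in on a second run.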

Train from scratch

python TrainingScripts/training.py

Trains a new model on Data/V2/cleanedtxt/merged.txt and saves checkpoints to LLM/. The SentencePiece tokenizer is trained automatically on the first run.
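
At the heart of a training loop like this is sampling windows of context-length tokens together with targets shifted by one position. The sketch below shows that idea on plain Python lists; the helper `get_batch` and its parameters are hypothetical, and training.py presumably works on PyTorch tensors instead.

```python
import random

def get_batch(token_ids, context_len=128, batch_size=4, rng=random):
    """Sample (input, target) windows for next-token prediction.

    Each target sequence is its input shifted one position to the left,
    so the model learns to predict token t+1 from tokens 0..t.
    """
    max_start = len(token_ids) - context_len - 1
    if max_start < 0:
        raise ValueError("corpus is shorter than context_len + 1 tokens")
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randint(0, max_start)  # inclusive on both ends
        xs.append(token_ids[i : i + context_len])
        ys.append(token_ids[i + 1 : i + context_len + 1])
    return xs, ys

# Stand-in corpus: in practice these would be SentencePiece token ids.
ids = list(range(1000))
xs, ys = get_batch(ids, context_len=128, batch_size=4, rng=random.Random(0))
print(len(xs), len(xs[0]), xs[0][1] == ys[0][0])  # 4 128 True
```

Each training step would then feed `xs` through the model and compute a cross-entropy loss against `ys` at every position.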

Run inference with trained model

python TrainingScripts/loadmodel.py

Loads LLM/AILLM_V2_Optimized.pth and generates 500 tokens, printing to stdout and saving to generated/generated.txt.
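
Generation of this kind is usually an autoregressive loop: feed the most recent tokens to the model, sample one new token, append it, and repeat. The sketch below shows that loop with a toy stand-in for the model. `generate`, `sample_next`, and `toy_model` are hypothetical names, and loading the checkpoint, decoding ids back to text with the tokenizer, and writing the output file are all omitted.

```python
import math
import random

def sample_next(logits, rng):
    """Sample a token id from unnormalized logits via softmax."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits, prompt_ids, max_new_tokens=500, context_len=128, seed=0):
    """Autoregressively extend prompt_ids by max_new_tokens tokens.

    next_logits: callable mapping a list of token ids to a list of logits
    for the next token (a stand-in for the trained model's forward pass).
    """
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_len:]  # the model only sees the last context_len tokens
        ids.append(sample_next(next_logits(window), rng))
    return ids

# Hypothetical stand-in model: strongly prefers (last token + 1) mod 10.
def toy_model(window, vocab_size=10):
    logits = [0.0] * vocab_size
    logits[(window[-1] + 1) % vocab_size] = 50.0
    return logits

out = generate(toy_model, [0], max_new_tokens=500)
print(len(out))  # 501: one prompt token plus 500 generated tokens
```

Cropping to the last 128 tokens matters because the model has no positional information beyond its context length, so longer inputs cannot be processed directly.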

Authors

Built together by Peer Grunow and Jonathan Ebel.
