This repository contains a from-scratch implementation of a GPT-style decoder-only Transformer, built to study architectural design choices and training stability in autoregressive language models.
The emphasis of this project is on controlled experimentation and interpretability, rather than benchmark optimization.
- Implement a decoder-only Transformer from first principles using PyTorch
- Make training dynamics and architectural effects explicit
- Design controlled experiments to isolate key design choices
- Analyze results using gradients, attention behavior, and qualitative generation
The model follows a standard GPT-style decoder architecture:
- Token embedding + positional encoding
- Stacked decoder blocks
  - Masked multi-head self-attention
  - Feed-forward network
  - Residual connections
  - Layer normalization (not yet configurable)
- Linear projection to vocabulary logits
- Decoder-only (autoregressive)
- Causal masking for next-token prediction
- Custom character-level tokenization
- Trained on the Tiny Shakespeare dataset
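The causal masking listed above can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch so it runs without dependencies, and it is not the repository's implementation; the function name `causal_attention_weights` and the toy scores are assumptions made for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out.

    Hypothetical helper: position i may only attend to positions 0..i.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # drop future positions
        m = max(visible)                            # subtract max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Toy example: equal scores everywhere, sequence length 3.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
w = causal_attention_weights(scores)
```

With equal scores, each row spreads its weight uniformly over the visible prefix, so the first token attends only to itself and the last token attends to all three positions.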
All experiments are controlled: only one factor is modified at a time while keeping all other variables fixed.
Question:
Where should LayerNorm be placed for stable Transformer training?
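For reference, the two placements usually compared under this question (Pre-LN and Post-LN) differ only in where normalization sits relative to the residual connection. The sketch below is a simplified plain-Python illustration, not the repository's code: `layer_norm` omits the learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays unnormalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln(x, toy_sublayer)
pre_out = pre_ln(x, toy_sublayer)
```

In the Post-LN output the residual stream itself is renormalized at every block, while in the Pre-LN output the input `x` passes through the residual path unchanged, which is the structural difference this experiment probes.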
Metrics Logged:
- Training loss
- Per-layer gradient norms
- Activation statistics
- Attention weights
- Qualitative text generation
Observations:
To be filled after completing experiments.
Question:
How does positional information affect attention and long-range dependency modeling?
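One of the schemes named in the results table, sinusoidal encoding, can be sketched as below. This follows the standard formulation from the original Transformer paper (sine on even dimensions, cosine on odd dimensions); the function name and the small sizes are assumptions for illustration, and RoPE is not shown here.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the dimension index.
            angle = pos / (10000 ** (i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_encoding(seq_len=4, d_model=6)
```

Each row is added to the token embedding at that position, so the model receives absolute position information without any learned parameters.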
Metrics Logged:
- Training loss
- Attention heatmaps and entropy
- Long-range token dependency behavior
- Generated text samples
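As a rough illustration of the attention entropy metric, the sketch below computes the Shannon entropy of a single attention row. It assumes entropy is measured in nats over one query's attention distribution; the exact definition used in the logged runs may differ.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Zero weights are skipped, following the convention 0 * log 0 = 0.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)

focused = attention_entropy([1.0, 0.0, 0.0, 0.0])      # all mass on one token
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])  # evenly spread attention
```

Low entropy indicates a head that attends sharply to a few tokens; entropy close to the log of the number of visible tokens indicates near-uniform attention.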
Observations:
To be filled after completing experiments.
Question:
What role does gradient clipping play in training stability?
Metrics Logged:
- Gradient norm distributions
- Frequency of clipping
- Loss stability
- Divergence events (if any)
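The sketch below shows one common form of gradient clipping, clipping by global L2 norm, along with the quantities needed to log the gradient norm and whether a step was clipped. It is a plain-Python illustration under assumed names (`clip_by_global_norm`, nested lists as per-layer gradients), similar in spirit to PyTorch's `torch.nn.utils.clip_grad_norm_` but not a reproduction of it or of this repository's training loop.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their joint L2 norm is at most max_norm.

    grads is a list of per-layer gradient lists (hypothetical layout).
    Returns (clipped_grads, norm_before_clipping, was_clipped).
    """
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    if total_norm <= max_norm:
        return grads, total_norm, False
    scale = max_norm / total_norm
    return [[g * scale for g in layer] for layer in grads], total_norm, True

grads = [[3.0, 0.0], [0.0, 4.0]]   # joint L2 norm is 5.0
clipped, norm, was_clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Logging `norm` at every step gives the gradient norm distribution, and averaging `was_clipped` over steps gives the clipping frequency.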
Observations:
To be filled after completing experiments.
All experiments are tracked using Weights & Biases (W&B) to enable:
- Consistent run comparison
- Metric visualization
- Attention and gradient analysis
- Reproducibility
A custom character-level tokenizer is used throughout this project to:
- Keep preprocessing transparent
- Reduce confounding variables
- Suit the Tiny Shakespeare dataset
- Simplify attention interpretation
Tokenizer comparisons are intentionally excluded to prioritize architectural analysis under limited compute.
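A character-level tokenizer of this kind can be sketched in a few lines. The class below is a hypothetical minimal version, not the repository's tokenizer: the name `CharTokenizer` and its methods are assumptions, and it builds its vocabulary from whatever text it is given.

```python
class CharTokenizer:
    """Map each distinct character to an integer id and back."""

    def __init__(self, text):
        self.chars = sorted(set(text))                      # deterministic vocabulary order
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.chars)

    def encode(self, s):
        # Raises KeyError for characters not seen when the vocabulary was built.
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("to be or not to be")
ids = tok.encode("to be")
```

Because every token is a single character, each attention position corresponds to exactly one character of the input, which keeps attention maps easy to read.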
To be updated once the experiments are complete.
To be added after all experiments are completed.
| Experiment | Key Result | Evidence |
|---|---|---|
| Pre-LN vs Post-LN | — | — |
| Sinusoidal vs RoPE | — | — |
| Gradient Clipping | — | — |
Table to be populated after experiments.
Planned extensions and follow-up experiments will be documented here.
- Experiments are conducted on a small corpus (Tiny Shakespeare)
- Results focus on training dynamics and interpretability
- No large-scale or production-level training is attempted
These constraints are intentional: they keep the experiments small enough to interpret clearly.
This project is designed as an engineering study, emphasizing controlled experimentation, interpretability, and reproducibility over raw performance.
MIT