tiny-poet

A tiny GPT built from scratch, trained on the Complete Tang Poems + Complete Song Ci, capable of generating classical Chinese poetry.

Single-file implementation (aside from data and tokenizer), under 500 lines of code. Goals:

  • Understand every line of the transformer
  • Train on a free Colab T4
  • Actually generate decent classical poetry

Model Sizes

Config   Params    Colab T4 training time
tiny     ~1.6M     20-30 min
small    ~6.3M     1-2 hrs
base     ~16.6M    3-5 hrs

The default is small. At this dataset size, small and base perform similarly; base overfits slightly more but produces text with a stronger classical flavor.

Quick Start

Want to skip training and jump straight to generation? Download a pre-trained checkpoint:

pip install torch numpy
wget https://github.com/yingwang/tiny-poet/releases/download/v0.1/small_inference.pt
python sample.py --ckpt small_inference.pt --prompt "" --num_samples 3

v0.1 uses the small config (7.72M parameters), trained on a 2019 iMac for 90 minutes to a final loss of 4.84.
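
If you'd rather poke at the checkpoint directly, it loads with plain torch.load. A minimal sketch, assuming small_inference.pt is an ordinary PyTorch checkpoint dict (the key names are whatever train.py saved, so inspect rather than assume):

import torch

# Load the released checkpoint on CPU; no GPU needed for inspection.
# (On torch >= 2.6 you may need weights_only=False for non-tensor entries.)
ckpt = torch.load("small_inference.pt", map_location="cpu")

# See what the checkpoint actually contains before using it.
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))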

Training from Scratch

Local (iMac / MacBook)

pip install torch numpy tqdm

# 1. Download data (Complete Tang Poems + Song Ci)
python data.py

# 2. Train with the tiny config (2-4 hrs on an iMac CPU for usable results)
python train.py --config tiny --device cpu --iters 5000

# 3. Generate
python sample.py --prompt "春眠不觉晓" --max_tokens 50

Colab

!git clone https://github.com/yingwang/tiny-poet.git
%cd tiny-poet
!pip install torch numpy tqdm

!python data.py
!python train.py --config small --iters 10000
!python sample.py --prompt "春眠不觉晓" --max_tokens 50

train.py auto-detects the available device (cuda / mps / cpu).
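
The detection is presumably something along these lines; a sketch of the usual pattern, not the repo's actual code:

import torch

def pick_device() -> str:
    # Prefer NVIDIA CUDA, then Apple Metal (mps), then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())  # "cuda" on a Colab T4, "mps" on Apple Silicon, else "cpu"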

Files

  • data.py — Downloads and cleans the chinese-poetry dataset (character-level)
  • model.py — GPT architecture: embedding → N × transformer block → output
  • train.py — Training loop: AdamW + cosine schedule, checkpoint support
  • sample.py — Inference: top-k sampling (sketched below)
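
Top-k sampling truncates the logits to the k most likely characters before the softmax, which keeps generations coherent without making them deterministic. A minimal sketch (the function name and default k are illustrative, not sample.py's actual API):

import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 40) -> torch.Tensor:
    # Keep the k largest logits; everything else gets probability zero.
    v, _ = torch.topk(logits, k)
    logits = logits.masked_fill(logits < v[..., -1:], float("-inf"))
    probs = F.softmax(logits, dim=-1)
    # Draw one next-character id from the truncated distribution.
    return torch.multinomial(probs, num_samples=1)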

Architecture

Input (char ids)
  ↓ Token Embedding + Positional Embedding
  ↓
  [Transformer Block] × N
    ├── LayerNorm
    ├── Multi-Head Self-Attention (causal)
    ├── LayerNorm
    └── MLP (4x hidden)
  ↓
LayerNorm
  ↓ Linear → vocab_size
Softmax → next char probabilities

Standard GPT-style decoder-only transformer. No bells and whistles.
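
In PyTorch, that block is presumably something like the following pre-LayerNorm pattern; a sketch under assumed names, dimensions, and activation, not model.py's exact code:

import torch
import torch.nn as nn

class Block(nn.Module):
    # Pre-LN transformer block: x + attn(ln(x)), then x + mlp(ln(x)).
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(            # 4x hidden, as in the diagram
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True entries are positions a token may NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))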

Data

Source: chinese-poetry/chinese-poetry

  • Complete Tang Poems: ~55k poems
  • Complete Song Ci: ~21k ci
  • Total tokens: ~5M characters

Character-level tokenizer, vocab size 11,601 (simplified and traditional Chinese, punctuation, and a small number of variant characters).
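
Character-level here means one vocab entry per distinct character, so encode/decode reduce to a pair of lookup tables. A minimal sketch of the idea (data.py's actual implementation may differ; poems.txt is a hypothetical corpus file):

# Build the char-level vocab from the cleaned corpus.
text = open("poems.txt", encoding="utf-8").read()  # hypothetical path
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("春眠不觉晓")) == "春眠不觉晓"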

Sample Output (v0.1 small)

Prompt (empty):

春意,柳阴如雨。春似故人来醉。 送友客 別離辭別,春風欲多。白髮相逢客,寒枝半似春。

Prompt (empty):

月·沈丘崈 一点春容不见。无人有酒。不似花梢柳。花如玉。梅花风,也似西西子。

Prompt 江南:

江南·念奴娇·王安安岳 春云已暮,不怕风流水。云树碧流沙外,江外一声寒水。

Author names are hallucinated by the model; most phrases are novel generations, not memorized training data.


