gzipt generates text using gzip as its only model. No neural network, no
training, no parameters. You prime it with a corpus, and it continues a prompt by
searching for the byte sequences that compress best, because what compresses
well is what the model predicts. Below is an example output:
gzipt --corpus data/tinyshakespeare.txt --prompt $'MENENIUS:\n' --length 200

MENENIUS:
'Though all at once canq
MARCIUS:
Pray now, nocamest thou to a morsel .
LARTIUS:
Hence, and
I' the end admire, where G
again; and after it ag .
LARTIUS:
Hence, and
I' the end ad
LARTIUS:
fame and
This is a somewhat cherry-picked example (the output is normally slightly worse than this), but isn't it cool that we can do this at all?
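The idea that "what compresses well is what the model predicts" can be sketched in a few lines of Python. This is a minimal illustration, not gzipt's actual code: the helper names `compressed_size` and `continuation_cost` and the toy corpus are made up for the example, and it uses `zlib` (the same DEFLATE algorithm gzip uses) from the standard library.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes `data` occupies after DEFLATE compression (level 9)."""
    return len(zlib.compress(data, 9))


def continuation_cost(context: bytes, candidate: bytes) -> int:
    """Extra compressed bytes needed to append `candidate` to `context`.

    A low cost means the compressor could reuse patterns it has already
    seen, which is the sense in which it "predicts" the candidate.
    """
    return compressed_size(context + candidate) - compressed_size(context)


# Toy corpus (hypothetical, only for this example).
corpus = b"MENENIUS:\nThe gods be good unto us.\n" * 20

seen = b"The gods be good"    # appears verbatim in the corpus
unseen = b"qzx jvk wpfy mbt"  # same length, no overlap with the corpus

print("seen:  ", continuation_cost(corpus, seen))
print("unseen:", continuation_cost(corpus, unseen))
```

The phrase that already occurs in the corpus should come out much cheaper than the unfamiliar one. Generation then amounts to proposing many candidate continuations and preferring the cheap ones.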
To download the Shakespeare dataset I used:
# Download the dataset (Shakespeare text)
wget https://github.com/nathan-barry/tiny-diffusion/releases/download/v2.0.0/data.txt

Below are the default values for the CLI arguments.
gzipt \
--corpus FILE \ # primes gzip's window with context
--prompt "text" \ # prompt to continue
--length 200 \ # bytes to generate
--horizon 24 \ # beam depth: bytes looked ahead and committed per span
--beam-width 32 \ # partial continuations kept each step
--temperature 0.5 \
--tail 80 \ # generated bytes kept in scoring context (anti-copy)
--window 30000 \ # corpus bytes shown to gzip (<= 32768)
--workers 8 # threads for scoring (zlib releases the GIL)
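As a rough sketch of how `--window`, `--tail`, and `--workers` might fit together, the Python below builds a scoring context and scores candidates in parallel. It is a guess at the mechanics based only on the flag descriptions above, not gzipt's real implementation: the function names `build_context` and `score_all` are hypothetical, and taking the last `window` bytes of the corpus is an assumption. The 32768 limit matches DEFLATE's maximum back-reference distance, so keeping the window below it leaves room for the generated tail and the candidate.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

DEFLATE_WINDOW = 32768  # largest back-reference distance DEFLATE allows


def build_context(corpus: bytes, generated: bytes,
                  window: int = 30000, tail: int = 80) -> bytes:
    """Corpus slice plus the most recent generated bytes.

    Assumption: the last `window` corpus bytes are used. Only the last
    `tail` generated bytes are kept, so older output cannot be copied.
    """
    recent = generated[-tail:] if tail > 0 else b""
    return corpus[-window:] + recent


def score_all(context: bytes, candidates: list[bytes],
              workers: int = 8) -> list[int]:
    """Extra compressed bytes each candidate costs, computed in threads."""
    baseline = len(zlib.compress(context, 9))

    def cost(candidate: bytes) -> int:
        return len(zlib.compress(context + candidate, 9)) - baseline

    # Threads can help here because zlib releases the GIL while compressing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cost, candidates))


# Toy data (hypothetical, only for this example).
corpus = b"LARTIUS:\nHence, and shut your gates upon's.\n" * 50
context = build_context(corpus, generated=b"MENENIUS:\n")
candidates = [b"Hence, and shut", b"zqxv kjw pyfm b"]
print(score_all(context, candidates))
```

Lower scores are better. A beam search would keep the `--beam-width` cheapest partial continuations at each step and extend them up to `--horizon` bytes before committing.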