nathanrs/gzipt

gzipt — gzip as a language model

gzipt generates text using gzip as its only model. No neural network, no training, no parameters. You prime it with a corpus, and it continues a prompt by searching for the byte sequences that compress best, because what compresses well is what the model predicts. Below is an example output:

gzipt --corpus data/tinyshakespeare.txt --prompt $'MENENIUS:\n' --length 200
MENENIUS:
'Though all at once canq

MARCIUS:
Pray now, nocamest thou to a morsel .

LARTIUS:
Hence, and
I' the end admire, where G
again; and after it ag .

LARTIUS:
Hence, and
I' the end ad

LARTIUS:
fame and

This is a somewhat cherry-picked example (typical output is slightly worse), but isn't it cool that this works at all?
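The core idea, that text which compresses well after the corpus is text the "model" finds likely, can be sketched in a few lines. This is a minimal illustration and not gzipt's actual code: the helper names `compressed_size` and `score` are hypothetical, and it uses Python's standard `zlib` module as a stand-in for gzip.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Number of bytes `data` occupies after zlib compression at level 9."""
    return len(zlib.compress(data, 9))

def score(context: bytes, candidate: bytes) -> int:
    """Extra compressed bytes needed to append `candidate` to `context`.

    Lower is better: a candidate that repeats patterns already present
    in the context costs almost nothing to encode.
    (Hypothetical helper, not part of gzipt.)
    """
    return compressed_size(context + candidate) - compressed_size(context)

# A toy "corpus" and two candidate continuations of equal length.
context = b"the quick brown fox jumps over the lazy dog. " * 20
seen = b"the quick brown fox"    # already appears in the context
unseen = b"zqxj vkwp mfgh bytc"  # does not appear in the context

print(score(context, seen), score(context, unseen))
```

The continuation that already occurs in the context gets the lower score, which is the signal a search over candidate byte sequences can follow.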

Usage

To download the Shakespeare dataset I used:

# Download the dataset (Shakespeare text)
wget https://github.com/nathan-barry/tiny-diffusion/releases/download/v2.0.0/data.txt

Below are the default values for the CLI arguments.

gzipt \
  --corpus FILE \         # primes gzip's window with context
  --prompt "text" \       # prompt to continue
  --length 200 \          # bytes to generate
  --horizon 24 \          # beam depth: bytes looked ahead and committed per span
  --beam-width 32 \       # partial continuations kept each step
  --temperature 0.5 \ 
  --tail 80 \             # generated bytes kept in scoring context (anti-copy)
  --window 30000 \        # corpus bytes shown to gzip (<= 32768)
  --workers 8             # threads for scoring (zlib releases the GIL)
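To make the `--horizon`, `--beam-width`, `--window`, and `--workers` options more concrete, here is a rough sketch of a compression-scored beam search over one span. It is an assumption about how such a search could look, not gzipt's implementation: `extend_span` and `cost` are hypothetical names, the candidate alphabet is restricted to bytes seen in the corpus to keep it fast, ties are broken by byte order, and temperature sampling and the `--tail` logic are left out.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def cost(data: bytes) -> int:
    """Compressed size of `data` in bytes (zlib level 9)."""
    return len(zlib.compress(data, 9))

def extend_span(corpus: bytes, text: bytes, horizon: int = 4,
                beam_width: int = 4, window: int = 30000,
                workers: int = 4) -> bytes:
    """Return `horizon` bytes that compress best after the context.

    Hypothetical helper: keeps the `beam_width` cheapest partial
    continuations at each step and returns the best one at the end.
    """
    # Only the last `window` bytes are kept; DEFLATE cannot reference
    # data more than 32768 bytes back anyway.
    context = (corpus + text)[-window:]
    # Assumption: only try bytes that occur in the corpus.
    alphabet = [bytes([b]) for b in sorted(set(corpus))]
    beams = [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(horizon):
            # Extend every surviving beam by every candidate byte.
            candidates = [beam + ch for beam in beams for ch in alphabet]
            # Score candidates in parallel threads.
            sizes = list(pool.map(lambda c: cost(context + c), candidates))
            # Keep the cheapest continuations (ties broken by byte order).
            ranked = sorted(zip(sizes, candidates))
            beams = [c for _, c in ranked[:beam_width]]
    return beams[0]

corpus = b"to be or not to be, that is the question. " * 10
out = extend_span(corpus, b"to be or ", horizon=4, beam_width=4)
print(out)
```

Calling a function like this repeatedly, appending each returned span to the text, would produce `--length` bytes in total. A wider beam or longer horizon explores more candidates at the price of many more compression calls per step.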

About

A compression-based language model
