GitHub - bootphon/discophon: The Phoneme Discovery benchmark

Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

DiscoPhon is a multilingual benchmark evaluating unsupervised phoneme discovery from discrete speech units. Given only 10 hours of speech in an unseen language, models must produce discrete units that map to a predefined phoneme inventory.

Getting started

DiscoPhon requires Python ≥ 3.12 and has no system dependencies.

Install this package:

pip install discophon              # core: data preparation and phoneme discovery
pip install "discophon[abx]"       # adds ABX discriminability (fastabx)
pip install "discophon[baselines]" # adds the baseline models
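
After installing, a quick sanity check can confirm that the interpreter meets the Python ≥ 3.12 requirement and that the package is visible to it. The sketch below is only an illustration using the standard library; the helper name `check_environment` is hypothetical and not part of DiscoPhon, and the distribution name `discophon` is taken from the install commands above.

```python
import sys
from importlib import metadata

# Minimum interpreter version, as stated in the requirement above.
MIN_PYTHON = (3, 12)


def check_environment(package: str = "discophon") -> dict:
    """Report the interpreter version and the installed package version.

    Hypothetical helper for illustration; it is not provided by DiscoPhon.
    """
    try:
        # Look up the installed distribution's version without importing it.
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "package_version": installed,  # None if the package is not installed
    }


if __name__ == "__main__":
    print(check_environment())
```

If `package_version` comes back as `None`, the package was probably installed into a different environment than the one running the script.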

Follow the tutorials to download data, evaluate models, and prepare your submission.
See also the current leaderboard.

References

@misc{poli2026discophon,
  title={{DiscoPhon}: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units},
  author={Maxime Poli and Manel Khentout and Angelo Ortiz Tandazo and Ewan Dunbar and Emmanuel Chemla and Emmanuel Dupoux},
  year={2026},
  eprint={2603.18612},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.18612},
}

