Static Tokenizer by sfluegel05 · Pull Request #169 · ChEB-AI/python-chebai

sfluegel05 · 2026-05-06T18:21:59Z

This implements a new static tokenizer for SMILES (see #166).

The new tokenizer needs only 572 tokens, which it achieves by splitting each atom into 5 tokens (element, charge, isotope, stereochemistry, and hydrogen count). As demonstrated in #166, all SMILES strings in ChEBI and PubChem can be parsed with this tokenizer.
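
To illustrate the five-way split, here is a minimal sketch, not the code from this PR. The regular expression loosely follows the OpenSMILES bracket-atom grammar and only handles `@`/`@@` chirality. The function names and the `element:`/`charge:`/... token spellings are hypothetical and chosen for readability; the actual vocabulary may look different.

```python
import re

# Simplified bracket-atom pattern (assumption: loosely follows OpenSMILES):
# [isotope? symbol chirality? hcount? charge? class?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?|[a-z]{1,2}|\*)"
    r"(?P<stereo>@{1,2})?"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>[+-]\d+|\++|-+)?"
    r"(?::\d+)?\]"
)


def parse_charge(text):
    """Convert a SMILES charge suffix (None, '+', '--', '+2') to a signed int."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    digits = text.lstrip("+-")
    # '+2' carries an explicit magnitude; '--' encodes it by repetition.
    return sign * (int(digits) if digits else len(text))


def split_atom(atom):
    """Split one bracket atom into five tokens (hypothetical token names)."""
    match = BRACKET_ATOM.fullmatch(atom)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {atom!r}")
    hcount = match["hcount"]
    # A bare 'H' means one hydrogen; a missing H field means zero.
    n_h = 0 if hcount is None else int(hcount[1:] or 1)
    return [
        f"element:{match['element']}",
        f"charge:{parse_charge(match['charge']):+d}",
        f"isotope:{match['isotope'] or 'none'}",
        f"stereo:{match['stereo'] or 'none'}",
        f"hcount:{n_h}",
    ]


print(split_atom("[13C@@H]"))
print(split_atom("[NH4+]"))
```

Because each property gets its own token, the vocabulary grows with the sum of the distinct values per property rather than with the number of distinct bracket atoms seen in the data, which is presumably what keeps the token count small.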

The implementation also includes a decoder that reconstructs SMILES strings as far as possible. Some SMILES cannot be reconstructed exactly because the encoding is not injective: for example, [1*] and [2*] both decode to *.
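
A quick way to see what "not injective" means in practice is to group inputs by their encoding and look for groups with more than one member. The sketch below uses a toy string-to-string encoder (`toy_encode`, hypothetical) that simply drops isotope labels on wildcard atoms. It is an assumption for illustration only: the real tokenizer maps to token IDs, and the PR does not state why these two inputs collapse.

```python
import re
from collections import defaultdict


def toy_encode(smiles):
    """Toy stand-in for an encoder: drops the isotope label on wildcard atoms."""
    return re.sub(r"\[\d+\*\]", "*", smiles)


def find_collisions(smiles_list, encode):
    """Group inputs by encoding; return only groups with more than one member."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        groups[encode(smiles)].append(smiles)
    return {code: members for code, members in groups.items() if len(members) > 1}


collisions = find_collisions(["[1*]C", "[2*]C", "CCO"], toy_encode)
print(collisions)
```

Any input that appears in a collision group cannot be recovered exactly by a decoder, since the decoder sees only the shared encoding.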

Todo

check the implications of longer inputs for the ELECTRA model
make this reader the default for PubChem and ChEBI classes

sfluegel05 added 3 commits May 6, 2026 15:24

add static smiles tokenizer

e33ac08

update for PubChem tokens

17021bf

correctly reassamble SMILES (as far as possible)

a219d45

sfluegel05 linked an issue May 6, 2026 that may be closed by this pull request

New tokenisation #166

Open

sfluegel05 wants to merge 3 commits into dev from feature/static-tokenisation