
Static Tokenizer #169

Draft
sfluegel05 wants to merge 3 commits into dev from feature/static-tokenisation



@sfluegel05
Collaborator

This implements a new static tokenizer for SMILES (see #166).

The new tokenizer needs only 572 tokens, which it achieves by splitting each atom into 5 tokens (element, charge, isotope, stereochemistry, and hydrogen count). As demonstrated in #166, all SMILES strings in ChEBI and PubChem can be parsed with this tokenizer.
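To illustrate the per-atom split, here is a minimal sketch that breaks a single bracket atom into five component tokens. It is not the code from this PR: the regex follows the general bracket-atom layout of the OpenSMILES grammar, and the function names and the `field:value` token format are hypothetical, chosen only for readability.

```python
import re

# Bracket-atom layout (OpenSMILES): [ isotope? symbol chiral? hcount? charge? class? ]
# Simplified: only @ / @@ chirality is handled, and the atom class is ignored.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<element>[A-Z][a-z]?|[a-z][a-z]?|\*)"
    r"(?P<stereo>@@?)?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-](?:\d+|[+-]+)?)?"
    r"(?::\d+)?\]"
)


def charge_value(text):
    """Convert '+', '--', '+2', ... (or None) to a signed integer."""
    if not text:
        return 0
    sign = 1 if text[0] == "+" else -1
    rest = text[1:]
    if rest.isdigit():
        return sign * int(rest)
    return sign * len(text)  # '+' -> 1, '++' -> 2, '--' -> -2


def hydrogen_count(text):
    """Convert 'H', 'H4', ... (or None) to an integer."""
    if not text:
        return 0
    return int(text[1:]) if len(text) > 1 else 1


def split_bracket_atom(atom):
    """Return five tokens (hypothetical format) for one bracket atom."""
    m = BRACKET_ATOM.fullmatch(atom)
    if m is None:
        raise ValueError(f"not a supported bracket atom: {atom!r}")
    return [
        f"element:{m['element']}",
        f"charge:{charge_value(m['charge'])}",
        f"isotope:{m['isotope'] or 'none'}",
        f"stereo:{m['stereo'] or 'none'}",
        f"hcount:{hydrogen_count(m['hcount'])}",
    ]


print(split_bracket_atom("[13C@@H]"))
print(split_bracket_atom("[NH4+]"))
```

Because every field draws from its own small set of values, the vocabulary grows with the sum of the field sizes rather than their product, which is how a split like this can keep the token count small at the cost of longer sequences.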

The implementation also includes a decoder that reconstructs SMILES strings as far as possible. Some SMILES cannot be reconstructed exactly because the encoding is not injective: for example, [1*] and [2*] are both resolved to *.
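A round-trip check is one way to find the inputs affected by this. The sketch below uses a toy encoder and decoder, not the ones in this PR. It assumes, purely for illustration, that the isotope of a wildcard atom is dropped during encoding; the actual cause of the collision is not stated here. All names are hypothetical.

```python
import re

WILDCARD_WITH_ISOTOPE = re.compile(r"\[\d+\*\]")


def toy_encode(smiles):
    # ASSUMPTION: stand-in for the real encoder. It drops the isotope of
    # wildcard atoms, then emits one token per character (toy behaviour only).
    collapsed = WILDCARD_WITH_ISOTOPE.sub("*", smiles)
    return list(collapsed)


def toy_decode(tokens):
    # Stand-in for the real decoder: concatenate the tokens.
    return "".join(tokens)


def round_trip_failures(smiles_list, encode, decode):
    """Return the inputs that decode(encode(s)) does not reproduce exactly."""
    return [s for s in smiles_list if decode(encode(s)) != s]


# Two distinct inputs share one encoding, so the encoder is not injective.
print(toy_encode("[1*]") == toy_encode("[2*]"))
print(round_trip_failures(["CCO", "[1*]C", "[2*]"], toy_encode, toy_decode))
```

Passing the real encoder and decoder to `round_trip_failures` over a dataset would list the strings that cannot be reconstructed exactly.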

Todo

  • check the implications of longer inputs for the ELECTRA model
  • make this reader the default for PubChem and ChEBI classes

@sfluegel05 sfluegel05 linked an issue May 6, 2026 that may be closed by this pull request


Development

Successfully merging this pull request may close these issues.

New tokenisation
