Substitution-Based Analysis of Structural Novelty for Generative Models of Materials

Code and analysis for the paper "Substitution-Based Analysis of Structural Novelty for Generative Models of Materials."

xtaledit evaluates whether crystals from generative models go beyond elemental substitution in known MP20 training structures.

Novelty classification

Each generated crystal is assigned to one of three classes:

| Class | Meaning |
| --- | --- |
| Duplicate | Matches an MP20 training structure with `pymatgen`'s `StructureMatcher`. |
| Substituted | Is reproduced by substituting elements in a selected training structure and relaxing it. |
| Unmatched | Is not reproduced by the substitution-based workflow. |
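
The precedence between the three classes can be sketched as follows. This is a minimal illustration rather than the repository's code: `NoveltyClass` and `classify` are hypothetical names, and the two boolean inputs stand in for the results of the direct match and the substitution check.

```python
from enum import Enum


class NoveltyClass(Enum):
    DUPLICATE = "Duplicate"
    SUBSTITUTED = "Substituted"
    UNMATCHED = "Unmatched"


def classify(matches_training: bool, reproduced_by_substitution: bool) -> NoveltyClass:
    """Assign one class; the direct training-set match is checked first."""
    if matches_training:
        return NoveltyClass.DUPLICATE
    if reproduced_by_substitution:
        return NoveltyClass.SUBSTITUTED
    return NoveltyClass.UNMATCHED
```

The substitution result is only consulted for crystals that are not Duplicates, so every crystal receives exactly one class.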

Workflow

For each generated crystal, the pipeline:

1. Relaxes the generated structure with MACE-MPA-0, applies primitive-cell and Niggli reduction, checks composition validity with SMACT, and computes energy above the Materials Project convex hull.
2. Searches the MP20 training set for direct StructureMatcher matches.
3. For non-Duplicates, selects structurally similar training crystals using two complementary criteria:
   - anonymous lattice-and-site matching with StructureMatcher;
   - matching space groups and Wyckoff-label multisets.
4. Ranks candidates by mean modified Pettifor-scale distance between mapped elements and retains the top k=3 from each structural criterion.
5. Substitutes the candidate elements, relaxes each candidate once with MACE-MPA-0, and compares the relaxed structures with the generated crystal.
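
The ranking step can be illustrated with a small standard-library sketch. Several things here are assumptions rather than details taken from the repository: `TOY_SCALE` holds invented positions (not the modified Pettifor values), the mean is taken unweighted over mapped element pairs, a smaller distance is treated as a better rank, and the function names and candidate ids are hypothetical.

```python
from statistics import mean

# Toy positions on a one-dimensional chemical scale. These numbers are invented
# for illustration; they are NOT the modified Pettifor values.
TOY_SCALE = {"Li": 1, "Na": 2, "K": 3, "Mg": 7, "Ca": 8, "O": 20, "S": 21}


def mean_scale_distance(mapping: dict[str, str], scale: dict[str, int]) -> float:
    """Mean absolute scale distance over (training element -> generated element) pairs."""
    return mean(abs(scale[src] - scale[dst]) for src, dst in mapping.items())


def top_k(candidates: dict[str, dict[str, str]], scale: dict[str, int], k: int = 3) -> list[str]:
    """Return the ids of the k candidates with the smallest mean distance."""
    ranked = sorted(candidates, key=lambda cid: mean_scale_distance(candidates[cid], scale))
    return ranked[:k]


# Hypothetical candidates: each maps training elements onto the generated crystal's elements.
candidates = {
    "train-a": {"Na": "Li", "O": "O"},
    "train-b": {"K": "Li", "S": "O"},
    "train-c": {"Ca": "Li", "O": "O"},
    "train-d": {"Mg": "Li", "O": "O"},
}
selected = top_k(candidates, TOY_SCALE)
```

In the workflow this selection is applied once per structural criterion, so up to three candidates come from anonymous matching and up to three from Wyckoff matching.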

The seven pipeline scripts under scripts/ (excluding upload_hf_dataset.py) implement preprocessing, direct matching, the two candidate-selection methods, substitution, relaxation, and the final match check. The ordered YAML files in configs/crystalite/ provide a complete Crystalite example.

Repository layout

configs/                  Pipeline configurations
scripts/                  Executable preprocessing and matching stages
src/                      Matching, substitution, energy, and utility code
notebooks/journal.ipynb   All manuscript figures and analysis tables

input/
├── gen/
│   ├── raw/              Generated structures
│   └── preprocessed/     Relaxed, filtered, and reduced structures
├── train/
│   ├── raw/              MP20 training data
│   └── preprocessed/     Reduced training structures
└── icsd/                 Licensed ICSD-derived inputs; not publicly distributed

results/
├── raw/                  Pipeline outputs and intermediate artifacts
└── analysis/journal/     Manuscript figures, tables, and example structures

input/, results/, and notebook caches are intentionally excluded from the Git repository.

Setup

Python 3.12 is required.

uv venv
source .venv/bin/activate
uv sync --group dev

The lock file selects CUDA 12.8 PyTorch wheels. The configured pipeline is intended for CUDA-capable hardware because both preprocessing and the relaxation of substituted structures use MACE-MPA-0. The first MACE run may download model weights.

Preprocessing also needs a Materials Project API key to construct the phase diagram cache:

export MP_API_KEY="<your Materials Project API key>"
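
A script that depends on the key may want to fail early with a clear message instead of partway through preprocessing. The helper below is a hypothetical sketch, not a function from this repository; only the variable name `MP_API_KEY` comes from the text above.

```python
import os


def require_mp_api_key(env=None) -> str:
    """Return the Materials Project API key or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("MP_API_KEY", "").strip()
    if not key:
        raise RuntimeError("MP_API_KEY is not set; export it before preprocessing.")
    return key
```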

Project paths default to input/ and results/. They can be redirected in a gitignored .env file:

INPUT_DIR=/path/to/input
RESULTS_DIR=/path/to/results
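
The default-and-override behavior can be pictured as below. This sketch assumes the variables are already present in the process environment; how the repository actually loads the .env file is not shown here, and `resolve_project_paths` is a hypothetical name.

```python
import os
from pathlib import Path


def resolve_project_paths(env=None) -> tuple[Path, Path]:
    """Return (input_dir, results_dir), falling back to input/ and results/."""
    env = os.environ if env is None else env
    input_dir = Path(env.get("INPUT_DIR", "input"))
    results_dir = Path(env.get("RESULTS_DIR", "results"))
    return input_dir, results_dir
```

With neither variable set, the function returns the relative paths `input` and `results`, which matches the documented defaults.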

Reproduce the manuscript figures

The public paper artifacts are hosted in the Hugging Face dataset repository masahiro-negishi/xtaledit; the dataset card is maintained in DATASET_CARD.md. Download the artifacts into the repository root while preserving their paths:

uvx hf download masahiro-negishi/xtaledit \
  --repo-type dataset \
  --include "input/**" \
  --include "results/**" \
  --local-dir .

Then open the journal notebook:

jupyter notebook notebooks/journal.ipynb

All figures used in the manuscript are produced by notebooks/journal.ipynb. Outputs are written to results/analysis/journal/. The notebook also recreates the main classification table and supporting analysis tables.

The public artifacts support all notebook sections except the final ICSD-based Wyckoff-multiset coverage analysis. Run all preceding cells normally. Running the final section requires a locally prepared, licensed results/raw/icsd/wyckoff_repr_s=0.01.pkl.gz artifact. The released results/analysis/journal/ directory includes the final derived figure for reference.
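
Before running the final notebook section, it can help to check whether the licensed artifact is present. The snippet below is a hypothetical guard, not code from the notebook; only the artifact path is taken from the paragraph above.

```python
from pathlib import Path

# Path of the locally prepared, licensed artifact, relative to the project root.
ICSD_ARTIFACT = Path("results/raw/icsd/wyckoff_repr_s=0.01.pkl.gz")


def can_run_icsd_section(project_root: Path = Path(".")) -> bool:
    """True only if the ICSD-derived Wyckoff artifact exists as a file."""
    return (project_root / ICSD_ARTIFACT).is_file()
```

If the function returns False, the preceding sections can still be run and the released figure in results/analysis/journal/ serves as the reference.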

Apart from the ICSD exception above, the downloaded artifacts support figure reproduction for MatterGen, DiffCSP++, WyckoffTransformer, Crystalite, Chemeleon2, and the MP20 test set. The checked-in end-to-end pipeline configurations currently cover Crystalite; configurations for rerunning every model are not included.

Run the Crystalite pipeline

Place the raw Crystalite structures at input/gen/raw/crystalite.pkl.gz and the MP20 training CSV at input/train/raw/train.csv. Run the stages in order:

python scripts/preprocess.py configs/crystalite/01_preprocess.yaml
python scripts/exact_match.py configs/crystalite/02_exact_match.yaml
python scripts/anonymous_match.py configs/crystalite/03_anonymous_match.yaml
python scripts/wyckoff_match.py configs/crystalite/04_wyckoff_match.yaml
python scripts/substitute_structures.py configs/crystalite/05_substitute_structures.yaml
python scripts/relax_substituted.py configs/crystalite/06_relax_substituted.yaml
python scripts/check_relaxed_substituted_matches.py configs/crystalite/07_check_relaxed_matches.yaml
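
The same stage order can be scripted. The driver below is a convenience sketch and not part of the repository: `STAGES`, `build_commands`, and `run_all` are hypothetical names, while the script and configuration file names are copied from the commands above.

```python
import subprocess
import sys

# (script, config) pairs in the order the stages must run.
STAGES = [
    ("preprocess.py", "01_preprocess.yaml"),
    ("exact_match.py", "02_exact_match.yaml"),
    ("anonymous_match.py", "03_anonymous_match.yaml"),
    ("wyckoff_match.py", "04_wyckoff_match.yaml"),
    ("substitute_structures.py", "05_substitute_structures.yaml"),
    ("relax_substituted.py", "06_relax_substituted.yaml"),
    ("check_relaxed_substituted_matches.py", "07_check_relaxed_matches.yaml"),
]


def build_commands(config_dir: str = "configs/crystalite", python: str = sys.executable) -> list[list[str]]:
    """Build one argument list per stage, preserving the stage order."""
    return [[python, f"scripts/{script}", f"{config_dir}/{config}"] for script, config in STAGES]


def run_all(commands: list[list[str]]) -> None:
    """Run the stages sequentially and stop at the first failing stage."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` from the repository root would execute the seven stages in sequence; `check=True` raises on a non-zero exit code so that later stages do not run on incomplete inputs.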

Outputs are written to input/gen/preprocessed/crystalite/ and results/raw/crystalite/. Existing artifacts are skipped unless the relevant configuration supports and enables force: true.
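
The skip rule amounts to a small predicate. This is a sketch of the described behavior only: `should_run` is a hypothetical name, and the configuration is modeled as a plain dictionary in which a missing `force` key means the stage does not force a rerun.

```python
from pathlib import Path


def should_run(output: Path, config: dict) -> bool:
    """Run when the output is missing, or when the config enables force."""
    return bool(config.get("force", False)) or not output.exists()
```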

Data sharing and licensing

The Hugging Face release contains:

- redistributable generated structures and MP20 train/test inputs;
- preprocessed structures and filtering metadata;
- raw matching, substitution, relaxation, and sensitivity-analysis artifacts;
- figures, tables, and example structures from results/analysis/ai4am/ and results/analysis/journal/.

ICSD records must not be redistributed under the standard ICSD license. Therefore, input/icsd/ and record-level results/raw/icsd/ artifacts are excluded from the public dataset. The final ICSD-dependent prototype-coverage figure can be shared as a derived manuscript artifact, but regenerating it requires licensed ICSD access and locally prepared Wyckoff representations.

The Materials Project-derived phase-diagram cache should also be excluded unless its redistribution terms have been confirmed. Without that cache, rerunning preprocessing requires MP_API_KEY.

Validate the public manifest without uploading:

python scripts/upload_hf_dataset.py

Publish the artifacts and dataset card after authenticating with Hugging Face:

uvx hf auth login
python scripts/upload_hf_dataset.py --upload

The uploader includes input/gen/, input/train/, results/raw/, results/analysis/ai4am/, and results/analysis/journal/, while explicitly excluding other analysis directories, ICSD data, the Materials Project-derived phase-diagram cache, and .gitkeep files. Uploads are resumable. To update only the dataset card, pass --card-only --upload.
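
The include and exclude rules can be pictured as a path filter. This is an illustrative sketch, not the logic of scripts/upload_hf_dataset.py: `is_public` and its `cache_names` parameter are hypothetical, and the phase-diagram cache is identified by caller-supplied file names because its actual name is not stated here.

```python
from pathlib import PurePosixPath

INCLUDED_PREFIXES = (
    "input/gen/",
    "input/train/",
    "results/raw/",
    "results/analysis/ai4am/",
    "results/analysis/journal/",
)
# ICSD-derived inputs and record-level artifacts are never public.
EXCLUDED_PREFIXES = ("input/icsd/", "results/raw/icsd/")


def is_public(path: str, cache_names: frozenset[str] = frozenset()) -> bool:
    """Decide whether a repository-relative POSIX path belongs in the public manifest."""
    name = PurePosixPath(path).name
    if name == ".gitkeep" or name in cache_names:
        return False
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    return path.startswith(INCLUDED_PREFIXES)
```

Exclusions are checked before inclusions so that results/raw/icsd/ stays private even though it sits under the included results/raw/ prefix.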

The MIT license badge applies to the code repository. The Hugging Face dataset uses license: other because it combines generated structures, MP20-derived inputs, and derived artifacts. Upstream data retain their original terms and citation requirements; see DATASET_CARD.md.

Generate a version-specific DOI for a stable release and add it to this README, the dataset card, and the paper's Data Availability statement.

Citation

Citation information will be added when the paper preprint or publication is available.
