Experimental, deterministic, first-principles protein folding.
The goal is a pipeline that goes mRNA sequence → amino-acid chain → 3D folded structure, simulated as the protein co-translationally emerges from a ribosome under thermal motion at body temperature. Folding is driven only by the physics of the chain — charge, hydrophobicity, hydrogen bonding, sterics, dihedral preferences, codon-rarity translation timing, Brownian motion — with no learned priors from structural databases.
It may not work. That is the experiment.
Take the 20-residue Trp-cage mini-protein, NLYIQWLKDGGPSSGRPPPS.
1. Build an all-atom 3D structure from the sequence (origami build).
Every backbone atom and every side-chain atom is placed via NeRF from
idealised internal coordinates, in an extended β-strand conformation. This is
what origami starts with — no fold yet.
2. Compare against the experimentally-determined native fold. PDB 1L2Y (NMR, Neidigh et al. 2002) is the reference. The compact tertiary structure buries the central tryptophan (centre of the image) inside a cage of polyproline helix and α-helix.
3. Score and relax (origami energy, origami minimize). The hand-built
CHARMM36-derived force field assigns native 1L2Y an energy 49,170 kJ/mol below
the extended chain, confirming the physics has the right direction. L-BFGS
minimisation on the extended chain drops the energy from +47,506 kJ/mol to
−1,789 kJ/mol in 115 steps — it relaxes local strain but cannot cross the
barriers between conformations, so the chain doesn't fold:
4. Heat it up (origami dynamics). BAOAB Langevin integration at 310 K
gives the chain thermal energy to wiggle, explore conformations, and (in
principle) cross barriers between minima. The frame below is a snapshot
from 3 ps of dynamics started from the native fold — the cage stays put but
visibly fluctuates, which is what implicit-solvent MD at body temperature
should look like:
5. Grow it co-translationally (origami cotranslate). A real chain
doesn't appear all at once — the ribosome emits residues N-to-C, and the
N-terminal portion has been folding for a while by the time the C-terminus
arrives. The cotranslate command alternates between appending one residue
and running Langevin dynamics for the time slice up to the next emission.
An optional cylindrical exit-tunnel constraint keeps the nascent chain
inside a confined region, mimicking the ribosomal tunnel.
Combined with --with-sasa, the hydrophobic-collapse term drives the
nascent chain as soon as enough side chains are present to cluster.
Chignolin (GYDPETGTWG, 10 residues, PDB 1UAO), one residue per 0.5 ps,
100 ps tail of Langevin at 310 K, γ = 2 ps⁻¹, hydrophobic γ-scale 0.25:
Full quality MP4 · 1 ps of simulated time per ~83 ms of video at 30 fps.
The Cα RMSD vs the 1UAO native fold over the 100 ps tail:
| time after emergence | Cα RMSD vs 1UAO native |
|---|---|
| ~4 ps (chain just complete) | 7.32 Å |
| 20 ps | 6.73 Å |
| 40 ps | 4.05 Å |
44 ps (docs/images/chignolin_cotsasa_43ps.png) |
2.94 Å (minimum) |
| 100 ps | 3.06 Å |
Full trace: docs/data/chignolin_cotsasa_rmsd.tsv. The chain compacts from extended (7.3 Å) into a sub-3 Å native-like basin during the tail. Compared to the pre-folded baseline below (1.82 Å from a pre-minimised extended chain), the cotranslational version reaches a slightly higher RMSD floor — the chain spends its first few ps growing rather than folding, so it has less wall-clock time available to relax. But the qualitative behaviour is the same: emerge, collapse, hover in a compact basin.
6. Actually fold something. Start from a minimised extended chignolin (GYDPETGTWG, 10 residues) and run Langevin dynamics at 310 K. With just LJ + Coulomb + GB (no hydrophobic forces) over 500 ps:
| frame (× 10 ps) | Cα RMSD vs 1UAO native |
|---|---|
| 0 (start) | 8.76 Å |
| 6 | 3.66 Å |
| 10 | 2.92 Å |
| 14 | 2.09 Å |
| 16 | 1.82 Å — within NMR experimental uncertainty |
| 17–18 | 1.88, 2.01 Å |
Adding the analytical hydrophobic-SASA forces (PSA.2) accelerates collapse. A γ-scaling sweep on the same starting structure and seed shows the trade-off:
| γ scale | Min Cα RMSD vs 1UAO | Time to min | Sim length |
|---|---|---|---|
| 0.0 (no SASA) | 1.82 Å | 160 ps | 500 ps |
| 0.25 | 2.04 Å | 100 ps | 200 ps |
| 0.5 | 2.37 Å | 48 ps | 200 ps |
| 1.0 (full literature γ) | 2.82 Å | 64 ps | 200 ps |
Full RMSD traces: no-SASA · γ=0.25 · γ=0.5 · γ=1.0.
Lower γ → tighter native fit (less molten-globule lock-in). The γ=0.25 fold at 100 ps:
Replica-exchange MD on chignolin: tightest fold so far at 1.43 Å. With REMD doing 8 replicas in parallel (T ladder 280-395 K, geometric spacing, swap every 1 ps, dt = 2 fs + SHAKE), the production trajectory (lowest-T replica at 280 K) reaches sub-2 Å within 57 ps and sub-1.6 Å within 212 ps; the global minimum is 1.43 Å at frame 1879 (1.88 ns) — tighter than the 1.82 Å straight-Langevin baseline and substantially tighter than the SASA-driven runs above. 99 % of frames cluster into a single basin at 1.5 Å RMSD-linkage cutoff. Swap acceptance 55-65 % across all 7 adjacent pairs:
REMD + SASA combined on chignolin: faster collapse but lock-in. A natural follow-up — does the temperature-ladder sampling stack with the hydrophobic-collapse driving force? Same chignolin recipe, SASA γ=0.25, 100 ps × 8 replicas, dt=2 fs + SHAKE:
• <3.0 Å Cα RMSD reached at 42 ps (REMD-alone took ~57 ps) • <2.5 Å reached at 69 ps (REMD-alone took ~212 ps) • global minimum 2.18 Å at 69 ps
Faster initial collapse than REMD-alone — SASA pulls the chain into a compact basin quickly. But the minimum is worse than REMD-alone (1.43 Å) because the SASA potential traps the chain in the first compact state it finds: clustering at 1.5 Å resolves 4 nearly-equal-population basins (31 / 24 / 20 / 11 %), versus the 99 % single-basin structure of REMD-alone.
Swap acceptance 50-70 % (same ladder works at the higher effective-energy spread that SASA introduces).
MP4 · RMSD trace.
Interpretation: SASA's hydrophobic well is steep enough that temperature-ladder sampling at 280-395 K can't lift the chain out of the wrong compact state once it's in. The trade-off matches the γ-scaling sweep observation from earlier (γ=1.0 over-stabilises); combining REMD with γ=0.25 SASA reaches the molten-globule regime ~3× faster but doesn't reach the native basin the way REMD-only eventually does. The two methods don't stack additively — each is targeting a different fold-pathway feature.
The sweet spot looks like γ ∈ [0.25, 0.5]: enough hydrophobic drive
to compact the chain ~2× faster than LJ+GB-only, without
over-stabilising the first compact state it finds. The literature γ
of 5 cal/mol/Ų (our γ=1.0 baseline) appears to be too aggressive
for our combined CHARMM36 + OBC-GB force field. Tunable via the
ORIGAMI_SASA_GAMMA_SCALE environment variable.
Either way: the central hypothesis — that hand-built physics produces reasonable folds without ML priors — is at least true for the smallest known fold, both with and without hydrophobic forces.
Trp-cage fold trial. Same setup on the 20-residue Trp-cage (NLYIQWLKDGGPSSGRPPPS, starting Cα RMSD 16.66 Å vs the 1L2Y NMR structure), 300 ps Langevin at γ=0.25. The chain compacts steadily from 16.66 Å → 4.20 Å (frame 35 / 210 ps) and stabilises in a ~4.2 Å plateau through to 300 ps:
Trace: docs/data/trpcage_sasa_g025_rmsd.tsv.
Not the native fold — Trp-cage folds in ~5 μs experimentally, and our 300 ps run is ~16 000× short of that — but a clear hydrophobic collapse to a compact molten globule. The chain didn't diverge, didn't get stuck extended, and didn't blow through the native basin. To reach the actual 1L2Y fold would need a much longer trajectory (milliseconds of simulated time, or replica exchange).
REMD on Trp-cage: 3.31 Å min, native-like basin populated 64 % of the time. 8 replicas at 280–388 K, geometric ladder, 2 ns per replica, dt = 2 fs + SHAKE, no SASA. 1.43 Å on chignolin set the methodology; running it on Trp-cage from the same minimised extended chain gives:
• <5.0 Å Cα RMSD vs 1L2Y native at 70 ps • <4.0 Å at 86 ps • <3.5 Å at 97 ps • global minimum 3.31 Å at 508 ps
Clustering the production trajectory at a 2 Å Cα-RMSD single- linkage cutoff finds the chain spends:
• 64 % of frames in cluster 0 (medoid 4.03 Å vs native — the native-like basin) • 23 % in cluster 1 (medoid 6.71 Å — a competing more-extended basin) • 10 % in cluster 2 (medoid 5.97 Å — a third basin)
Swap acceptance ratios across the 7 adjacent pairs: 0.41 – 0.48 (geometric ladder factor ≈ 1.045 vs the 1.05 we used on chignolin — Trp-cage's larger heat capacity narrows the optimal spacing).
Cluster-0 medoid (the dominant native-like fold the chain spent ~64 % of the run in):
Trajectory MP4 · RMSD trace · Cluster-0 medoid PDB. Still 60-65 % short of the native 1 Å basin — Trp-cage's microsecond fold time is past what this ladder + 2 ns reaches — but the chain is sampling the native basin (the 3.31 Å minimum implies brief excursions into it), and the structure of the time distribution (64/23/10 % across three basins) is exactly what we'd expect from a sampling that's exploring conformational space rather than getting trapped.
REMD on villin HP-35: 6.14 Å min after 1 ns; partial helix
formation. Villin (PDB 2F4K, sequence LSDEDFKAVFGMTRSAFANLPLWQQHLKEKGLF,
33 residues, 535 atoms) is the largest of our three benchmark folds
and a three-α-helix bundle. 8 replicas at 280-346 K (geometric
spacing factor 1.03 — tighter than chignolin / Trp-cage because
villin's larger heat capacity narrows the optimal ladder), 1 ns
per replica, dt = 2 fs + SHAKE. Production-replica trajectory:
• <10 Å Cα RMSD vs 2F4K native at 395 ps • <8 Å at 563 ps • global minimum 6.14 Å at 685 ps
• Native helix content: 51.5 % → trajectory peaks at 12.1 %
The 6.14 Å minimum lands the chain in a compact native-like topology but the three-helix bundle hasn't fully formed — partial helix formation across the 35-residue chain in 1 ns is consistent with literature folding times of microseconds for villin in explicit-solvent MD. Cluster-0 dominates 59 % of frames at 3 Å cutoff (the dominant compact misfold); the rest distribute across smaller transient basins.
Swap acceptance varies 21-48 % across the 7 adjacent pairs — the 0↔1 pair at 21 % is on the loose end of "well-mixed" and suggests the lowest-T spacing wants to be tighter still for villin.
Cluster-0 medoid (the dominant compact misfold the chain visits ~59 % of the time):
Trajectory MP4 · RMSD trace · Cluster-0 medoid PDB.
The villin run is the honest negative: with the integrator and sampling stack we've built, chignolin folds (1.43 Å, near-native) and Trp-cage compacts into the native basin region (3.31 Å, 64 % of time in cluster 0); villin partially compacts but doesn't yet form its three helices within 1 ns × 8 replicas of sampling.
Extending villin REMD to 5 ns × 8 replicas: 4.25 Å min, much broader basin coverage. Same ladder, same dt + SHAKE, 5× the simulation time per replica. Acceptance ratios improved from 21-48 % to 37-50 % across all pairs (more samples per pair):
• <6 Å Cα RMSD reached at 284 ps • <5 Å reached at 284 ps • <4.5 Å reached at 288 ps • global minimum 4.25 Å at 289 ps
• Helix content peaks at 24 % (1-ns run peaked at 12 %)
The chain reaches 4.25 Å once at 289 ps, briefly visiting the near-native basin, then bounces around in the 5-8 Å range exploring multiple compact configurations. Clustering at 3 Å cutoff resolves 20 clusters; the top four hold 89 % of frames with nearly-equal population (28 / 27 / 22 / 12 %). Cluster medoid distances to native:
cluster 0 (28 %): 7.72 Å — most-extended of the four cluster 1 (27 %): 6.92 Å cluster 2 (22 %): 6.01 Å cluster 3 (12 %): 5.35 Å — closest to native
The smallest of the four main basins is the most native-like — the chain spent only 12 % of the simulation in the native-adjacent basin, with the majority distributed across less-native compact states. Reaching the 1-Å native basin would still need ~milliseconds of sampling (villin's experimental fold time); this run does demonstrate that scaling REMD time on a 35-residue protein improves the result monotonically.
Cluster-3 medoid (the most native-like of the four populated basins, 12 % of frames):
Trajectory MP4 · RMSD trace · Cluster-3 medoid PDB.
Reaching the native HP-35 bundle would need either a longer run (10-50 ns / replica) or a denser temperature ladder (16 replicas).
Disulfide bonds. Crambin (PDB 1CRN, 46 residues, plant peptide)
ships under crates/io/tests/fixtures/1CRN_crambin.pdb to exercise
the S-S detection path. geom::build_topology_graph finds all three
disulfide bridges geometrically — Cys3-Cys40, Cys4-Cys32, Cys16-Cys26 —
from each SG-SG pair's distance (< 2.5 Å threshold), matching the PDB's
declared SSBOND records exactly. A 1 ps native MD test asserts the
bonded topology with disulfides keeps the chain within 4 Å Cα RMSD;
without the S-S bonds the chain would unfold. The six yellow sulfurs
(three pairs) are visible in the centre of the fold:
Multi-chain proteins. Insulin (PDB 2HIU, 51 residues across chains
A and B) is the smallest standard test for multi-chain support. The
two-chain Structure representation, chain-aware PDB I/O, and the
two-chain-aware peptide-bond auto-detection all ship in
crates/io/tests/fixtures/2HIU_insulin.pdb. Disulfide detection is
purely geometric, so the two inter-chain bridges (A7-B7 and A20-B19)
fall out of the same logic that catches crambin's intra-chain
disulfides. The yellow sulfurs split into three pairs:
A 1 ps native MD test asserts Cα RMSD stays under 1 Å.
Done so far: translation (M1), all-atom chain building (M2), energy evaluation with CHARMM36-borrowed constants and GB OBC II implicit solvent (M3), energy minimisation with L-BFGS (M4), BAOAB Langevin dynamics with trajectory rendering (M5), exact analytical SASA via spherical Gauss-Bonnet (PSA.1, ~1% match to Shrake-Rupley), analytical SASA forces in the gradient (PSA.2), co-translational chain growth with optional exit-tunnel constraint (M6), validation against three small folds (M7): chignolin (1UAO), Trp-cage (1L2Y), and villin headpiece HP-35 (2F4K). For each, the native fold scores at least 30 000 kJ/mol below the same sequence built as an extended chain, and 2 ps of Langevin dynamics from the Trp-cage native conformation keeps Cα RMSD under 1 Å.
Bonded-topology features:
- Disulfide bonds auto-detected via geometric SG-SG distance check (intra- and inter-chain; matches PDB SSBOND records on crambin and insulin)
- Multi-chain
Structurewith chain-aware peptide-bond auto-detection (no phantom C(A21)-N(B1) bond across insulin's chain break) - Monomer enum holding either an
AminoAcid(protein) or aNucleotide(RNA) —chem::Nucleotideis wired through but the dynamics path stays protein-only for now (the PDB reader silently skips ribonucleotide residues; future work promotes them to full inclusion as part of the long-horizon ribosome goal)
Integrator and physics:
- Reaction-field Coulomb at the 10 Å cutoff (Tironi ε_RF → ∞); both V and F vanish smoothly at the cutoff
- SHAKE on X-H bonds → integrator stable at dt = 2 fs → 2× wall-clock throughput per simulated picosecond
- SoA + rayon-parallel nonbonded pair loop (~3.3× kernel-level speedup on Trp-cage; threshold-gated parallelism so smaller proteins use the no-overhead serial path)
- SoA Born-radii for GB with parallel descreening sum
- REMD (parallel tempering) — N replicas at increasing temperatures with Metropolis swaps; production trajectory is the lowest-T replica
Analysis:
origami analyze— per-frame Cα RMSD vs reference + radius of gyration + end-to-end distance + Kabsch-Sander DSSP secondary structure (with Ramachandran fallback for heavy-atom-only PDBs) + optional residue-residue contact frequency map- Trajectory animation —
origami render --output-dir frames/emits per-frame PNGs with a stable camera locked across frames (so molecules visually grow / fold in place); optional--frame-dt-fsstampst = N.NN psin the top-left corner
Performance benchmarks (release build, Apple Silicon, M3 Pro):
| protein | atoms | pairs (10 Å) | total_force / step | sim/wall (dt=2 + SHAKE) |
|---|---|---|---|---|
| chignolin (1UAO) | 134 | 7 470 | 0.27 ms | ~500 ns / day |
| Trp-cage (1L2Y) | 300 | 24 193 | 1.12 ms | ~150 ns / day |
| villin (2F4K) | 520 | 48 434 | 2.76 ms | ~60 ns / day |
With analytical SASA forces on: ~30× slower (still bounded by the SASA cost, which is ~30 ms/step on Trp-cage — the SoA / parallel / SHAKE work didn't move that dial). Numerical-vs-analytical SASA agreement: max 4.5 × 10⁻⁹ on the gradient.
Up next: NeRF placement for ribonucleotides and a Monomer-aware topology graph so RNA chains can actually be built and dynamics'd.
cargo build --workspace
cargo test --workspace# mRNA FASTA → amino-acid sequence
origami translate examples/insulin.fasta
# Sequence → all-atom PDB (extended chain)
origami build --seq NLYIQWLKDGGPSSGRPPPS --output trp_cage.pdb
# Energy of a structure with per-term breakdown
origami energy trp_cage.pdb
# L-BFGS or steepest-descent minimisation
origami minimize trp_cage.pdb --output trp_cage_min.pdb --algorithm lbfgs
# BAOAB Langevin dynamics at 310 K — writes a multi-MODEL trajectory PDB
origami dynamics trp_cage_min.pdb --output-trajectory traj.pdb \
--steps 3000 --save-every 100 --temperature 310 --friction 5.0
# Same, with SHAKE-constrained X-H bonds and dt = 2 fs (≈ 2× faster)
origami dynamics trp_cage_min.pdb --output-trajectory traj.pdb \
--steps 1500 --save-every 50 --dt 2.0 --shake-h
# Replica-exchange MD — 8 replicas on a geometric T ladder, swap every 0.5 ps
origami remd trp_cage_min.pdb --output-trajectory remd.pdb \
--temperatures 300,310,321,333,346,360,375,391 --time-ps 50 \
--swap-interval-ps 0.5 --dt 2.0 --shake-h
# Co-translational chain growth — append one residue, then run dynamics
# until the ribosome emits the next residue. Optional cylindrical exit
# tunnel mimics the ribosomal tunnel.
origami cotranslate --seq NLYIQWLKDGGPSSGRPPPS --output-trajectory cotrans.pdb \
--interval 500 --tail 5000 --save-every 50 --with-tunnel
# Render single-frame or trajectory (multi-MODEL → frame_NNNN.png per model).
# `--frame-dt-fs` adds a `t = N.NN ps` overlay in the top-left of each frame.
origami render trp_cage.pdb --output trp_cage.png --width 800 --height 600
origami render traj.pdb --output-dir frames/ --width 800 --height 600 \
--frame-dt-fs 100
# Trajectory analysis: per-frame Cα RMSD, Rg, end-to-end, DSSP secondary
# structure (H/E/C string + helix/strand %), optional residue-residue
# contact-frequency map, optional RMSD-clustering of frames into fold
# basins.
origami analyze cotrans.pdb \
--reference crates/io/tests/fixtures/1UAO_chignolin.pdb \
--output metrics.tsv \
--contact-map contacts.tsv --contact-cutoff 8.0 \
--cluster-cutoff 1.5 \
--cluster-out-prefix /tmp/basin_ # also writes basin_0.pdb, basin_1.pdb, …crates/
chem/ — atom/AA/codon data, CHARMM36 parameter loader, atom typing
translate/ — mRNA → amino-acid chain
geom/ — 3D math, NeRF, all-atom chain builder, topology graph, cell list
io/ — PDB writer + reader, PNG renderer
energy/ — bonded + LJ + Coulomb + GB OBC II + SASA, plus analytical forces
dynamics/ — backtracking line search, steepest descent, L-BFGS minimisation,
BAOAB Langevin integrator + xoshiro256++ PRNG
cli/ — `origami` binary
data/charmm36 — vendored CHARMM36m parameter and topology files
MIT. CHARMM36 parameter files vendored under data/charmm36/ are
redistributed for academic use; see the headers inside those files for
attribution.


















