Master's Thesis — Politecnico di Milano, Aerospace Engineering
Author: Marcello Pareschi
Supervisor: Prof. Francesco Topputo
Year: 2026
This repository contains the full implementation of a guidance system for low-thrust interplanetary spacecraft transfers, developed as part of a master's thesis in Aerospace Engineering. The system learns a fuel-efficient guidance policy for an Earth–Mars transfer through a three-stage pipeline: (1) generation of a reference dataset from the indirect optimal control solution, (2) behavioural cloning (supervised pre-training), and (3) PPO-based reinforcement learning fine-tuning in a custom Gymnasium environment. The resulting policy is then exported to ONNX format and deployed on a Jetson Orin Nano for real-time onboard inference, validated through Processor-in-the-Loop (PIL) experiments.
Low-thrust electric propulsion offers significant mass-efficiency advantages for deep-space missions, but the resulting optimal control problems are notoriously difficult to solve in real time onboard a spacecraft. Classical indirect methods (based on Pontryagin's Minimum Principle) produce fuel-optimal trajectories but require solving a sensitive Two-Point Boundary Value Problem (TPBVP), which is computationally prohibitive for embedded hardware.
This work addresses the gap by training a lightweight neural network policy that can replicate optimal guidance decisions in microseconds. The key challenges addressed are:
- Fuel optimality: the policy must closely track the indirect-method solution.
- Robustness: the policy must remain effective under observation noise, actuation noise, and Missed Thrust Events (MTEs).
- Embedded deployment: the policy must run within the computational and power constraints of a Jetson Orin Nano in 7 W mode.
- Optimal control dataset generation using a Taylor-adaptive integrator (Heyoka) and an indirect shooting method based on Pontryagin's Minimum Principle, with a log-barrier regularisation for control smoothing.
- Behavioural cloning pipeline with a custom angular loss function that treats throttle and thrust direction separately, allowing more physically meaningful supervision.
- PPO fine-tuning in a custom Gymnasium environment with a progressively tightened terminal constraint schedule (curriculum learning) and configurable stochastic perturbations.
- Robustness analysis under observation noise, actuation noise, and Missed Thrust Events (MTEs), validated via Monte Carlo campaigns.
- Edge deployment on NVIDIA Jetson Orin Nano: ONNX Runtime (CPU, 7 W mode) and TensorRT (GPU) backends, with a Flask HTTP inference server and full PIL validation.
```
.
├── README.md
├── DATA.md # Data strategy: what is excluded, how to regenerate
├── LICENSE
├── requirements.txt
├── .gitignore
├── .gitattributes # Git LFS configuration
│
├── MyFunctions_OCP.py # Optimal control: dynamics, TPBVP solver, dataset generation
├── Myfunctions_NN.py # Neural network architecture, BC loss, training loop
├── MyFunctions_RL.py # Gymnasium environment, RL utilities, reward functions
├── MyFunctions_PIL.py # PIL/SIL simulation runners, inference wrappers
│
├── notebooks/
│ ├── data_generation/
│ │ ├── Mars_nominal_trajectory.ipynb # Nominal reference trajectory computation
│ │ ├── Compute_Database_Mars.ipynb # Generate optimal trajectory dataset
│ │ └── Compute_Database_Mars_New.ipynb # Updated dataset generation
│ ├── bc/
│ │ ├── Pretrain_Actor.ipynb # Behavioural cloning training
│ │ └── Evaluate_Pretrained_Model.ipynb # BC model evaluation
│ ├── rl/
│ │ ├── Train_Model_RL.ipynb # PPO fine-tuning
│ │ ├── Tuning_Reward.ipynb # Reward weight hyperparameter study
│ │ ├── clip_range_tuning.ipynb # PPO clip range study
│ │ ├── learning_rate_study.ipynb # Learning rate study
│ │ └── sigma0_tuning.ipynb # Initial exploration noise study
│ ├── evaluation/
│ │ ├── SIL_simulation.ipynb # Software-in-the-Loop evaluation
│ │ ├── Montecarlo_Policy_Evaluation.ipynb # Monte Carlo robustness assessment
│ │ ├── Evaluate_Montecarlo_Results.ipynb # Post-processing of MC results
│ │ └── Comparison_Stochasticity.ipynb # Effect of training stochasticity
│ ├── deployment/
│ │ ├── Onnx_export.ipynb # Export policy to ONNX
│ │ ├── PIL_simulation.ipynb # Processor-in-the-Loop evaluation
│ │ └── Compare_ORT_TRT.ipynb # ORT vs TensorRT latency comparison
│ ├── analysis/
│ │ ├── Plot_Learning_Curves.ipynb # Training curve visualisation
│ │ └── Plot_tensorboard.ipynb # TensorBoard log parsing
│ └── dev/ # Development and scratch notebooks
│
├── pil_simulation/ # Jetson deployment scripts and server implementations
│ ├── server_jetson_inference_onnxruntime_7W_final.py # Production ORT server (7 W)
│ ├── server_jetson_inference_trt.py # TensorRT inference server
│ ├── trt_helper.py # TensorRT engine wrapper
│ ├── build_tensorrt_model.py # ONNX → TensorRT compilation
│ ├── parity_test.py # ORT vs TRT numerical parity check
│ ├── test_onnx.py # ONNX Runtime unit test
│ └── test_server.py # Flask endpoint test
│
├── onnx_models/
│ ├── policy_OAM.onnx # Production policy (MLP 64×64×64, input 8D, output 3D)
│ └── MODEL_CARD.md # Architecture, training provenance, I/O specification
│
├── kernels/ # SPICE ephemeris kernels (NAIF/JPL)
│ ├── de432s.bsp # Solar system ephemeris (DE432)
│ ├── naif0012.tls # Leap-second kernel
│ └── pck00010.tpc # Planetary constants kernel
│
├── nominal_trajectory/ # Reference optimal trajectory (NPZ)
│ ├── mars_nominal_trajectory.npz
│ └── mars_nominal_trajectory_fixed_time.npz
│
└── plots/ # Generated figures
```
Note on large files: The `dataset/`, `training_data/`, `trained_models/`, `results_pil/`, `results_sil/`, `montecarlo_results/`, and `pil_simulation/env_*` directories are excluded from the repository (see `.gitignore`) because they collectively exceed 1 GB. See `DATA.md` for a complete description and regeneration instructions.
The reference trajectory is computed by solving the Earth–Mars low-thrust transfer as a fuel-optimal control problem via Pontryagin's Minimum Principle (indirect method). The state-costate system is integrated using Heyoka's Taylor-adaptive integrator at tolerance 1e-16. A log-barrier regularisation term is included in the Hamiltonian to smooth the bang-bang optimal control law into a continuous throttle profile.
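For context, this smoothing typically augments the fuel-optimal running cost with a logarithmic barrier in the throttle, in the spirit of Bertrand and Epenoy (shown here as a sketch; the exact form and weights used in the thesis may differ):

$$
J_\varepsilon = \int_{t_0}^{t_f} \frac{T_{\max}}{I_{sp}\, g_0}\left[\, u - \varepsilon\left(\ln u + \ln(1-u)\right)\right]\mathrm{d}t, \qquad u \in (0, 1)
$$

As $\varepsilon \to 0$, the barrier term vanishes and the continuous throttle profile approaches the bang-bang fuel-optimal solution.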
A training dataset is generated by sampling perturbed initial conditions around the nominal trajectory and propagating the corresponding optimal control solutions. Each sample contains a time sequence of (observation, action) pairs representing the optimal feedback policy evaluated along a perturbed arc.
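The propagation machinery can be illustrated with a self-contained miniature: a Taylor-adaptive heyoka.py integrator at the stated tolerance, here on plain two-body dynamics in canonical units (the full state-costate system with thrust lives in `MyFunctions_OCP.py`):

```python
import numpy as np
import heyoka as hy

# Two-body dynamics in canonical units (mu = 1); a stand-in for the
# full state-costate ODE system used for dataset generation.
x, y, z, vx, vy, vz = hy.make_vars("x", "y", "z", "vx", "vy", "vz")
r3 = (x**2 + y**2 + z**2) ** 1.5
ode_sys = [(x, vx), (y, vy), (z, vz),
           (vx, -x / r3), (vy, -y / r3), (vz, -z / r3)]

# Taylor-adaptive integrator at the tolerance used in the thesis (1e-16).
ta = hy.taylor_adaptive(ode_sys, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], tol=1e-16)
ta.propagate_until(2.0 * np.pi)  # one revolution of the circular orbit
print(ta.state)                  # final Cartesian state
```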
Key parameters:
- State: position `(x, y, z)` [AU], velocity `(vx, vy, vz)` [AU/day], mass `m` [kg] → 7D
- Costate: associated adjoint variables → 7D (used only when solving the OCP)
- Observation: `(x, y, z, vx, vy, vz, m_normalised, t)` → 8D
- Action: throttle `α ∈ [0, 1]`, thrust direction `(ix, iy, iz)` (unit vector) → 3D, encoded as azimuth + elevation in `[-1, 1]` (see the decoding sketch below)
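For illustration, the 3D action can be mapped back to a throttle and a unit thrust vector as follows. The component layout and scaling conventions here are assumptions; the repo code defines the actual mapping:

```python
import numpy as np

def decode_action(action):
    """Map a policy output in [-1, 1]^3 to (throttle, unit thrust direction).

    Assumed layout: [throttle, azimuth, elevation], each normalised to [-1, 1].
    """
    throttle = 0.5 * (action[0] + 1.0)    # [-1, 1] -> [0, 1]
    azimuth = np.pi * action[1]           # [-1, 1] -> [-pi, pi]
    elevation = 0.5 * np.pi * action[2]   # [-1, 1] -> [-pi/2, pi/2]
    direction = np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])                                    # unit vector by construction
    return throttle, direction
```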
The actor network (a configurable MLP with hidden layers of size 64×64×64 by default, tanh activations) is pre-trained via supervised learning on the OCP dataset. A custom loss function (CustomBCLoss) combines:
- a squared error on the throttle magnitude
- a cosine-dissimilarity term on the thrust direction (invariant to vector magnitude)
Training uses early stopping and saves the best checkpoint by validation loss.
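A minimal sketch of such a loss (the weighting and reduction are assumptions; the actual `CustomBCLoss` lives in `Myfunctions_NN.py`):

```python
import torch
import torch.nn as nn

class CustomBCLoss(nn.Module):
    """Illustrative BC loss supervising throttle and direction separately."""

    def __init__(self, w_throttle=1.0, w_direction=1.0):
        super().__init__()
        self.w_throttle = w_throttle
        self.w_direction = w_direction

    def forward(self, pred_throttle, pred_dir, true_throttle, true_dir):
        # Squared error on the throttle magnitude.
        loss_t = torch.mean((pred_throttle - true_throttle) ** 2)
        # Cosine dissimilarity on the thrust direction
        # (invariant to vector magnitude).
        cos_sim = nn.functional.cosine_similarity(pred_dir, true_dir, dim=-1)
        loss_d = torch.mean(1.0 - cos_sim)
        return self.w_throttle * loss_t + self.w_direction * loss_d
```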
The pre-trained actor is used to initialise the policy network of a PPO agent (Stable-Baselines3). Training takes place in LowThrustEnv, a custom Gymnasium environment that:
- propagates spacecraft dynamics using the Taylor integrator at each step
- evaluates a reward combining fuel consumption, trajectory tracking, and terminal accuracy
- applies a progressive terminal constraint (curriculum): the tolerance on the final state error is tightened according to a predefined schedule during training (see the schedule sketch after this list)
- optionally adds observation noise, actuation noise, and Missed Thrust Events (random zero-thrust windows) for robustness training
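A minimal sketch of a tightening schedule of this kind (the log-linear form and the endpoint values are illustrative assumptions, not the thesis settings):

```python
import numpy as np

def terminal_tolerance(progress, tol_start=1e-2, tol_end=1e-4):
    """Tolerance on the final state error as training progresses.

    progress: fraction of training completed, in [0, 1].
    Interpolates log-linearly from tol_start down to tol_end.
    """
    log_tol = (1.0 - progress) * np.log(tol_start) + progress * np.log(tol_end)
    return float(np.exp(log_tol))

# Halfway through training the tolerance sits at the geometric mean.
print(terminal_tolerance(0.5))  # 1e-3
```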
Monte Carlo campaigns evaluate the trained policy under:
- Observation noise: Gaussian noise on position and velocity measurements
- Actuation noise: random perturbations on the applied thrust
- Missed Thrust Events (MTEs): random intervals during which the thruster is forced off, simulating hardware failures
Results are aggregated across thousands of independent trajectories and compared against the nominal performance.
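As an illustration of how an MTE can be injected into a simulated episode (the event probability, outage length, and function name are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def inject_mte(throttle, dt_days, p_event=0.1, max_len_days=10.0):
    """Force the thruster off over a random window of one trajectory.

    throttle: per-step commanded throttle array for one episode.
    """
    throttle = throttle.copy()
    if rng.random() < p_event:
        n_steps = len(throttle)
        length = int(rng.uniform(1.0, max_len_days) / dt_days)
        start = rng.integers(0, max(1, n_steps - length))
        throttle[start:start + length] = 0.0  # zero-thrust window
    return throttle
```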
The trained policy is exported to ONNX format via `Onnx_export.ipynb` and deployed on an NVIDIA Jetson Orin Nano in two configurations:
- ONNX Runtime (CPU, 7 W power mode): single-threaded inference, ~1 ms latency
- TensorRT (GPU): compiled engine from ONNX, sub-millisecond latency
A Flask HTTP server (pil_simulation/server_jetson_inference_onnxruntime_7W_final.py) exposes a /get_action endpoint. The simulation loop on the host machine sends observations over HTTP and integrates the received actions locally, forming a full Processor-in-the-Loop (PIL) closed loop. Hardware telemetry (CPU utilisation, board power) is collected via jtop.
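A host-side call to the inference endpoint might look like this (the address, port, and JSON schema are assumptions; the server script defines the actual interface):

```python
import numpy as np
import requests

# Hypothetical Jetson address; replace with the board's IP and server port.
JETSON_URL = "http://<jetson-ip>:5000/get_action"

obs = np.zeros(8, dtype=np.float32)  # (x, y, z, vx, vy, vz, m_normalised, t)
resp = requests.post(JETSON_URL, json={"observation": obs.tolist()}, timeout=1.0)
action = np.asarray(resp.json()["action"])  # 3D action from the policy
```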
- Python 3.10+ (3.8+ for Jetson)
- Heyoka and heyoka.py (install via conda-forge — see note below)
- SPICE kernels (included in `kernels/`)
```bash
# 1. Clone the repository
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>

# 2. Create and activate a conda environment
#    (Heyoka requires conda-forge, so conda is recommended over a plain venv;
#    the environment name is arbitrary)
conda create -n <env-name> python=3.10
conda activate <env-name>

# 3. Install Heyoka via conda-forge (required before pip install)
conda install -c conda-forge heyoka "heyoka.py>=5.0"

# 4. Install remaining Python dependencies
pip install -r requirements.txt
```

Heyoka is a C++ library with Python bindings. It cannot be installed via pip alone and requires conda-forge or building from source. See the Heyoka documentation for details.
All workflow steps are controlled through Jupyter notebooks. Launch Jupyter from the repository root:
```bash
jupyter lab
```

Run `notebooks/data_generation/Compute_Database_Mars_New.ipynb`.
This solves the TPBVP for a large number of perturbed initial conditions and saves the resulting trajectory dataset to dataset/ as .npz files.
Run notebooks/bc/Pretrain_Actor.ipynb.
Loads a dataset from dataset/, trains the MLP actor via supervised learning, and saves the best checkpoint to training_data/<model_name>/ckpt/actor_best.pth.
Run notebooks/rl/Train_Model_RL.ipynb.
Initialises the PPO policy from a pre-trained BC checkpoint and trains in LowThrustEnv. Saves the final model and monitor logs to trained_models/.
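For orientation, a PPO setup consistent with the stated architecture might look as follows; the constructor arguments and hyperparameters are illustrative assumptions, not the tuned values selected in the clip range, learning rate, and sigma0 studies:

```python
import torch
from stable_baselines3 import PPO
from MyFunctions_RL import LowThrustEnv  # repo module; constructor args omitted

env = LowThrustEnv()  # actual arguments are set in Train_Model_RL.ipynb
policy_kwargs = dict(
    activation_fn=torch.nn.Tanh,
    net_arch=dict(pi=[64, 64, 64], vf=[64, 64, 64]),
)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs,
            learning_rate=3e-4, clip_range=0.2, verbose=1)
# The BC checkpoint is loaded into model.policy before learning (see notebook).
model.learn(total_timesteps=25_000_000)
model.save("trained_models/ppo_lowthrust")
```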
Run notebooks/evaluation/Montecarlo_Policy_Evaluation.ipynb followed by notebooks/evaluation/Evaluate_Montecarlo_Results.ipynb.
Evaluates the trained policy across thousands of perturbed scenarios and produces performance statistics.
Run notebooks/deployment/Onnx_export.ipynb.
Exports the trained PyTorch policy to ONNX format, saved to onnx_models/policy_OAM.onnx. See onnx_models/MODEL_CARD.md for full I/O specification.
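The export itself reduces to a single `torch.onnx.export` call. The sketch below uses a stand-in module with the stated 64×64×64 tanh architecture; loading the trained weights and unwrapping the SB3 policy are handled in the notebook:

```python
import torch

# Stand-in actor with the stated architecture (untrained, for illustration).
actor = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 3), torch.nn.Tanh(),
)

dummy_obs = torch.zeros(1, 8)  # 8D observation
torch.onnx.export(
    actor, dummy_obs, "onnx_models/policy_OAM.onnx",
    input_names=["observation"], output_names=["action"],
    opset_version=17,
)
```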
On the Jetson (after transferring the ONNX model and setting up the environment):
```bash
cd pil_simulation
python server_jetson_inference_onnxruntime_7W_final.py
```

On the host machine, run `notebooks/deployment/PIL_simulation.ipynb` to start the closed-loop simulation.
```bash
cd pil_simulation
python build_tensorrt_model.py
```

Then use `server_jetson_inference_trt.py` instead of the ONNX Runtime server.
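Before wiring either server into the closed loop, the exported model can be sanity-checked locally (a sketch in the spirit of `test_onnx.py`; see `onnx_models/MODEL_CARD.md` for the exact I/O names):

```python
import numpy as np
import onnxruntime as ort

# Load the exported policy on CPU and run a dummy observation through it.
sess = ort.InferenceSession("onnx_models/policy_OAM.onnx",
                            providers=["CPUExecutionProvider"])
obs = np.zeros((1, 8), dtype=np.float32)
action = sess.run(None, {sess.get_inputs()[0].name: obs})[0]
assert action.shape == (1, 3)  # 8D observation in, 3D action out
```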
The following steps reproduce the full experimental pipeline from scratch:
1. Nominal trajectory: `Mars_nominal_trajectory.ipynb`
2. Dataset generation: `Compute_Database_Mars_New.ipynb` (~100k trajectories, ~420 MB)
3. Behavioural cloning: `Pretrain_Actor.ipynb`
4. PPO training: `Train_Model_RL.ipynb` (~25M steps)
5. Monte Carlo evaluation: `Montecarlo_Policy_Evaluation.ipynb`
6. ONNX export: `Onnx_export.ipynb`
7. PIL validation: run server on Jetson + `PIL_simulation.ipynb` on host
| Metric | Value |
|---|---|
| Transfer mission | Earth → Mars (low-thrust, fixed time of flight) |
| Policy architecture | MLP 64×64×64, tanh |
| RL algorithm | PPO (Stable-Baselines3) |
| Training steps | ~25M environment interactions |
| ORT inference latency (Jetson, 7 W) | ~1 ms |
| TRT inference latency (Jetson, GPU) | < 1 ms |
| Power consumption (7 W mode) | ~7 W |
| Robustness (MTE, observation noise) | evaluated via Monte Carlo |
For detailed numerical results, refer to the thesis document:
2026_03_Pareschi_Thesis.pdf.
| Component | Technology |
|---|---|
| ODE integration | Heyoka (Taylor-adaptive, tol=1e-16) |
| Optimal control | Pontryagin's Minimum Principle, indirect shooting |
| Ephemeris | SPICE / spiceypy (DE432) |
| Neural networks | PyTorch |
| RL training | Stable-Baselines3 (PPO) |
| RL environment | Gymnasium |
| Model export | ONNX |
| Edge inference | ONNX Runtime / TensorRT |
| Deployment server | Flask |
| Edge hardware | NVIDIA Jetson Orin Nano |
| Hardware monitoring | jtop (jetson-stats) |
| Numerics | NumPy, SciPy, scikit-learn |
| Visualisation | Matplotlib |
- Guidance generalisation: extend to multi-target or time-free transfer problems.
- Multi-revolution transfers: handle trajectories with multiple revolutions around the Sun.
- State estimation integration: couple the guidance policy with an onboard filter for realistic navigation.
- Fuel-time trade-off: incorporate time-of-flight as an additional optimisation variable.
- Uncertainty-aware policies: explore distributional RL or conformal prediction for risk-bounded guarantees.
- Further hardware compression: quantisation-aware training or pruning for lower-power deployment.
- Formal verification: bound worst-case performance under MTE sequences using reachability analysis.
If you use this code in your research, please cite:
```bibtex
@mastersthesis{pareschi2026rl,
author = {Pareschi, Marcello},
title = {Reinforcement Learning for Autonomous Low-Thrust Interplanetary Guidance},
school = {Politecnico di Milano},
year = {2026},
type = {Master's Thesis}
}
```

The thesis document is included in this repository as `2026_03_Pareschi_Thesis.pdf`.
Marcello Pareschi
MSc Aerospace Engineering — Politecnico di Milano
Thesis supervisor: Prof. Francesco Topputo
This project is released under the MIT License.