Master's Thesis — Politecnico di Milano, Aerospace Engineering
Author: Marcello Pareschi
Supervisor: Prof. Francesco Topputo
Year: 2026
This repository contains the full implementation of a guidance system for low-thrust interplanetary spacecraft transfers, developed as part of a master's thesis in Aerospace Engineering. The system learns a fuel-efficient guidance policy for an Earth–Mars transfer through a three-stage pipeline: (1) generation of a reference dataset from the indirect optimal control solution, (2) behavioural cloning (supervised pre-training), and (3) PPO-based reinforcement learning fine-tuning in a custom Gymnasium environment. The resulting policy is then exported to ONNX format and deployed on a Jetson Orin Nano for real-time onboard inference, validated through Processor-in-the-Loop (PIL) experiments.
Low-thrust electric propulsion offers significant mass-efficiency advantages for deep-space missions, but the resulting optimal control problems are notoriously difficult to solve in real time onboard a spacecraft. Classical indirect methods (based on Pontryagin's Minimum Principle) produce fuel-optimal trajectories but require solving a sensitive Two-Point Boundary Value Problem (TPBVP), which is computationally prohibitive for embedded hardware.
This work addresses the gap by training a lightweight neural network policy that can replicate optimal guidance decisions in microseconds. The key challenges addressed are:
- Fuel optimality: the policy must closely track the indirect-method solution.
- Robustness: the policy must remain effective under observation noise, actuation noise, and Missed Thrust Events (MTEs).
- Embedded deployment: the policy must run within the computational and power constraints of a Jetson Orin Nano in 7 W mode.
- Optimal control dataset generation using a Taylor-adaptive integrator (Heyoka) and an indirect shooting method based on Pontryagin's Minimum Principle, with a log-barrier regularisation for control smoothing.
- Behavioural cloning pipeline with a custom angular loss function that treats throttle and thrust direction separately, allowing more physically meaningful supervision.
- PPO fine-tuning in a custom Gymnasium environment with a progressively tightened terminal constraint schedule (curriculum learning) and configurable stochastic perturbations.
- Robustness analysis under observation noise, actuation noise, and Missed Thrust Events (MTEs), validated via Monte Carlo campaigns.
- Edge deployment on NVIDIA Jetson Orin Nano: ONNX Runtime (CPU, 7 W mode) and TensorRT (GPU) backends, with a Flask HTTP inference server and full PIL validation.
```
.
├── README.md
├── DATA.md # Data strategy: what is excluded, how to regenerate
├── LICENSE
├── requirements.txt
├── .gitignore
├── .gitattributes # Git LFS configuration
│
├── MyFunctions_OCP.py # Optimal control: dynamics, TPBVP solver, dataset generation
├── Myfunctions_NN.py # Neural network architecture, BC loss, training loop
├── MyFunctions_RL.py # Gymnasium environment, RL utilities, reward functions
├── MyFunctions_PIL.py # PIL/SIL simulation runners, inference wrappers
│
├── notebooks/
│ ├── data_generation/
│ │ ├── Mars_nominal_trajectory.ipynb # Nominal reference trajectory computation
│ │ ├── Compute_Database_Mars.ipynb # Generate optimal trajectory dataset
│ │ └── Compute_Database_Mars_New.ipynb # Updated dataset generation
│ ├── bc/
│ │ ├── Pretrain_Actor.ipynb # Behavioural cloning training
│ │ └── Evaluate_Pretrained_Model.ipynb # BC model evaluation
│ ├── rl/
│ │ ├── Train_Model_RL.ipynb # PPO fine-tuning
│ │ ├── Tuning_Reward.ipynb # Reward weight hyperparameter study
│ │ ├── clip_range_tuning.ipynb # PPO clip range study
│ │ ├── learning_rate_study.ipynb # Learning rate study
│ │ └── sigma0_tuning.ipynb # Initial exploration noise study
│ ├── evaluation/
│ │ ├── SIL_simulation.ipynb # Software-in-the-Loop evaluation
│ │ ├── Montecarlo_Policy_Evaluation.ipynb # Monte Carlo robustness assessment
│ │ ├── Evaluate_Montecarlo_Results.ipynb # Post-processing of MC results
│ │ └── Comparison_Stochasticity.ipynb # Effect of training stochasticity
│ ├── deployment/
│ │ ├── Onnx_export.ipynb # Export policy to ONNX
│ │ ├── PIL_simulation.ipynb # Processor-in-the-Loop evaluation
│ │ └── Compare_ORT_TRT.ipynb # ORT vs TensorRT latency comparison
│ ├── analysis/
│ │ ├── Plot_Learning_Curves.ipynb # Training curve visualisation
│ │ └── Plot_tensorboard.ipynb # TensorBoard log parsing
│ └── dev/ # Development and scratch notebooks
│
├── pil_simulation/ # Jetson deployment scripts and server implementations
│ ├── server_jetson_inference_onnxruntime_7W_final.py # Production ORT server (7 W)
│ ├── server_jetson_inference_trt.py # TensorRT inference server
│ ├── trt_helper.py # TensorRT engine wrapper
│ ├── build_tensorrt_model.py # ONNX → TensorRT compilation
│ ├── parity_test.py # ORT vs TRT numerical parity check
│ ├── test_onnx.py # ONNX Runtime unit test
│ └── test_server.py # Flask endpoint test
│
├── onnx_models/
│ ├── policy_OAM.onnx # Production policy (MLP 64×64×64, input 8D, output 3D)
│ └── MODEL_CARD.md # Architecture, training provenance, I/O specification
│
├── kernels/ # SPICE ephemeris kernels (NAIF/JPL)
│ ├── de432s.bsp # Solar system ephemeris (DE432)
│ ├── naif0012.tls # Leap-second kernel
│ └── pck00010.tpc # Planetary constants kernel
│
├── nominal_trajectory/ # Reference optimal trajectory (NPZ)
│ ├── mars_nominal_trajectory.npz
│ └── mars_nominal_trajectory_fixed_time.npz
│
└── plots/ # Generated figures
```
Note on large files: The `dataset/`, `training_data/`, `trained_models/`, `results_pil/`, `results_sil/`, `montecarlo_results/`, and `pil_simulation/env_*` directories are excluded from the repository (see `.gitignore`) because they collectively exceed 1 GB. See `DATA.md` for a complete description and regeneration instructions.
The reference trajectory is computed by solving the Earth–Mars low-thrust transfer as a fuel-optimal control problem via Pontryagin's Minimum Principle (indirect method). The state-costate system is integrated using Heyoka's Taylor-adaptive integrator at tolerance 1e-16. A log-barrier regularisation term is included in the Hamiltonian to smooth the bang-bang optimal control law into a continuous throttle profile.
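For context, this smoothing typically augments the fuel-optimal running cost with a logarithmic barrier in the throttle, in the spirit of Bertrand and Epenoy (shown here as a sketch; the exact form and weights used in the thesis may differ):

$$
J_\varepsilon = \int_{t_0}^{t_f} \frac{T_{\max}}{I_{sp}\, g_0}\left[\, u - \varepsilon\left(\ln u + \ln(1-u)\right)\right]\mathrm{d}t, \qquad u \in (0, 1)
$$

As $\varepsilon \to 0$, the barrier term vanishes and the continuous throttle profile approaches the bang-bang fuel-optimal solution.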
A training dataset is generated by sampling perturbed initial conditions around the nominal trajectory and propagating the corresponding optimal control solutions. Each sample contains a time sequence of (observation, action) pairs representing the optimal feedback policy evaluated along a perturbed arc.
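The propagation machinery can be illustrated with a self-contained miniature: a Taylor-adaptive heyoka.py integrator at the stated tolerance, here on plain two-body dynamics in canonical units (the full state-costate system with thrust lives in `MyFunctions_OCP.py`):

```python
import numpy as np
import heyoka as hy

# Two-body dynamics in canonical units (mu = 1); a stand-in for the
# full state-costate ODE system used for dataset generation.
x, y, z, vx, vy, vz = hy.make_vars("x", "y", "z", "vx", "vy", "vz")
r3 = (x**2 + y**2 + z**2) ** 1.5
ode_sys = [(x, vx), (y, vy), (z, vz),
           (vx, -x / r3), (vy, -y / r3), (vz, -z / r3)]

# Taylor-adaptive integrator at the tolerance used in the thesis (1e-16).
ta = hy.taylor_adaptive(ode_sys, [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], tol=1e-16)
ta.propagate_until(2.0 * np.pi)  # one revolution of the circular orbit
print(ta.state)                  # final Cartesian state
```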
Key parameters:
- State: position `(x, y, z)` [AU], velocity `(vx, vy, vz)` [AU/day], mass `m` [kg] → 7D
- Costate: associated adjoint variables → 7D (used only when solving the OCP)
- Observation: `(x, y, z, vx, vy, vz, m_normalised, t)` → 8D
- Action: throttle `α ∈ [0, 1]`, thrust direction `(ix, iy, iz)` (unit vector) → 3D, encoded as azimuth + elevation in `[-1, 1]` (see the decoding sketch below)
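For illustration, the 3D action can be mapped back to a throttle and a unit thrust vector as follows. The component layout and scaling conventions here are assumptions; the repo code defines the actual mapping:

```python
import numpy as np

def decode_action(action):
    """Map a policy output in [-1, 1]^3 to (throttle, unit thrust direction).

    Assumed layout: [throttle, azimuth, elevation], each normalised to [-1, 1].
    """
    throttle = 0.5 * (action[0] + 1.0)    # [-1, 1] -> [0, 1]
    azimuth = np.pi * action[1]           # [-1, 1] -> [-pi, pi]
    elevation = 0.5 * np.pi * action[2]   # [-1, 1] -> [-pi/2, pi/2]
    direction = np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])                                    # unit vector by construction
    return throttle, direction
```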
The actor network (a configurable MLP with hidden layers of size 64×64×64 by default, tanh activations) is pre-trained via supervised learning on the OCP dataset. A custom loss function (CustomBCLoss) combines:
- a squared error on the throttle magnitude
- a cosine-dissimilarity term on the thrust direction (invariant to vector magnitude)
Training uses early stopping and saves the best checkpoint by validation loss.
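A minimal sketch of such a loss (the weighting and reduction are assumptions; the actual `CustomBCLoss` lives in `Myfunctions_NN.py`):

```python
import torch
import torch.nn as nn

class CustomBCLoss(nn.Module):
    """Illustrative BC loss supervising throttle and direction separately."""

    def __init__(self, w_throttle=1.0, w_direction=1.0):
        super().__init__()
        self.w_throttle = w_throttle
        self.w_direction = w_direction

    def forward(self, pred_throttle, pred_dir, true_throttle, true_dir):
        # Squared error on the throttle magnitude.
        loss_t = torch.mean((pred_throttle - true_throttle) ** 2)
        # Cosine dissimilarity on the thrust direction
        # (invariant to vector magnitude).
        cos_sim = nn.functional.cosine_similarity(pred_dir, true_dir, dim=-1)
        loss_d = torch.mean(1.0 - cos_sim)
        return self.w_throttle * loss_t + self.w_direction * loss_d
```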
The pre-trained actor is used to initialise the policy network of a PPO agent (Stable-Baselines3). Training takes place in LowThrustEnv, a custom Gymnasium environment that:
- propagates spacecraft dynamics using the Taylor integrator at each step
- evaluates a reward combining fuel consumption, trajectory tracking, and terminal accuracy
- applies a progressive terminal constraint (curriculum): the tolerance on the final state error is tightened according to a predefined schedule during training (see the schedule sketch after this list)
- optionally adds observation noise, actuation noise, and Missed Thrust Events (random zero-thrust windows) for robustness training
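A minimal sketch of a tightening schedule of this kind (the log-linear form and the endpoint values are illustrative assumptions, not the thesis settings):

```python
import numpy as np

def terminal_tolerance(progress, tol_start=1e-2, tol_end=1e-4):
    """Tolerance on the final state error as training progresses.

    progress: fraction of training completed, in [0, 1].
    Interpolates log-linearly from tol_start down to tol_end.
    """
    log_tol = (1.0 - progress) * np.log(tol_start) + progress * np.log(tol_end)
    return float(np.exp(log_tol))

# Halfway through training the tolerance sits at the geometric mean.
print(terminal_tolerance(0.5))  # 1e-3
```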
Monte Carlo campaigns evaluate the trained policy under:
- Observation noise: Gaussian noise on position and velocity measurements
- Actuation noise: random perturbations on the applied thrust
- Missed Thrust Events (MTEs): random intervals during which the thruster is forced off, simulating hardware failures
Results are aggregated across thousands of independent trajectories and compared against the nominal performance.
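As an illustration of how an MTE can be injected into a simulated episode (the event probability, outage length, and function name are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def inject_mte(throttle, dt_days, p_event=0.1, max_len_days=10.0):
    """Force the thruster off over a random window of one trajectory.

    throttle: per-step commanded throttle array for one episode.
    """
    throttle = throttle.copy()
    if rng.random() < p_event:
        n_steps = len(throttle)
        length = int(rng.uniform(1.0, max_len_days) / dt_days)
        start = rng.integers(0, max(1, n_steps - length))
        throttle[start:start + length] = 0.0  # zero-thrust window
    return throttle
```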
The trained policy is exported to ONNX format via `Onnx_export.ipynb` and deployed on an NVIDIA Jetson Orin Nano in two configurations:
- ONNX Runtime (CPU, 7 W power mode): single-threaded inference, ~1 ms latency
- TensorRT (GPU): compiled engine from ONNX, sub-millisecond latency
A Flask HTTP server (pil_simulation/server_jetson_inference_onnxruntime_7W_final.py) exposes a /get_action endpoint. The simulation loop on the host machine sends observations over HTTP and integrates the received actions locally, forming a full Processor-in-the-Loop (PIL) closed loop. Hardware telemetry (CPU utilisation, board power) is collected via jtop.
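A host-side call to the inference endpoint might look like this (the address, port, and JSON schema are assumptions; the server script defines the actual interface):

```python
import numpy as np
import requests

# Hypothetical Jetson address; replace with the board's IP and server port.
JETSON_URL = "http://<jetson-ip>:5000/get_action"

obs = np.zeros(8, dtype=np.float32)  # (x, y, z, vx, vy, vz, m_normalised, t)
resp = requests.post(JETSON_URL, json={"observation": obs.tolist()}, timeout=1.0)
action = np.asarray(resp.json()["action"])  # 3D action from the policy
```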
- Python 3.10+ (3.8+ for Jetson)
- Heyoka and heyoka.py (install via conda-forge — see note below)
- SPICE kernels (included in `kernels/`)
```bash
# 1. Clone the repository
git clone https://github.com/<your-username>/<repo-name>.git
cd <repo-name>

# 2. Create and activate a conda environment
#    (Heyoka requires conda-forge, so conda is recommended over a plain venv;
#    the environment name is arbitrary)
conda create -n <env-name> python=3.10
conda activate <env-name>

# 3. Install Heyoka via conda-forge (required before pip install)
conda install -c conda-forge heyoka "heyoka.py>=5.0"

# 4. Install remaining Python dependencies
pip install -r requirements.txt
```

Heyoka is a C++ library with Python bindings. It cannot be installed via pip alone and requires conda-forge or building from source. See the Heyoka documentation for details.
All workflow steps are controlled through Jupyter notebooks. Launch Jupyter from the repository root:
```bash
jupyter lab
```

Run `notebooks/data_generation/Compute_Database_Mars_New.ipynb`.
This solves the TPBVP for a large number of perturbed initial conditions and saves the resulting trajectory dataset to dataset/ as .npz files.
Run notebooks/bc/Pretrain_Actor.ipynb.
Loads a dataset from dataset/, trains the MLP actor via supervised learning, and saves the best checkpoint to training_data/<model_name>/ckpt/actor_best.pth.
Run notebooks/rl/Train_Model_RL.ipynb.
Initialises the PPO policy from a pre-trained BC checkpoint and trains in LowThrustEnv. Saves the final model and monitor logs to trained_models/.
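For orientation, a PPO setup consistent with the stated architecture might look as follows; the constructor arguments and hyperparameters are illustrative assumptions, not the tuned values selected in the clip range, learning rate, and sigma0 studies:

```python
import torch
from stable_baselines3 import PPO
from MyFunctions_RL import LowThrustEnv  # repo module; constructor args omitted

env = LowThrustEnv()  # actual arguments are set in Train_Model_RL.ipynb
policy_kwargs = dict(
    activation_fn=torch.nn.Tanh,
    net_arch=dict(pi=[64, 64, 64], vf=[64, 64, 64]),
)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs,
            learning_rate=3e-4, clip_range=0.2, verbose=1)
# The BC checkpoint is loaded into model.policy before learning (see notebook).
model.learn(total_timesteps=25_000_000)
model.save("trained_models/ppo_lowthrust")
```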
Run notebooks/evaluation/Montecarlo_Policy_Evaluation.ipynb followed by notebooks/evaluation/Evaluate_Montecarlo_Results.ipynb.
Evaluates the trained policy across thousands of perturbed scenarios and produces performance statistics.
Run notebooks/deployment/Onnx_export.ipynb.
Exports the trained PyTorch policy to ONNX format, saved to onnx_models/policy_OAM.onnx. See onnx_models/MODEL_CARD.md for full I/O specification.
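The export itself reduces to a single `torch.onnx.export` call. The sketch below uses a stand-in module with the stated 64×64×64 tanh architecture; loading the trained weights and unwrapping the SB3 policy are handled in the notebook:

```python
import torch

# Stand-in actor with the stated architecture (untrained, for illustration).
actor = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 3), torch.nn.Tanh(),
)

dummy_obs = torch.zeros(1, 8)  # 8D observation
torch.onnx.export(
    actor, dummy_obs, "onnx_models/policy_OAM.onnx",
    input_names=["observation"], output_names=["action"],
    opset_version=17,
)
```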
On the Jetson (after transferring the ONNX model and setting up the environment):
```bash
cd pil_simulation
python server_jetson_inference_onnxruntime_7W_final.py
```

On the host machine, run `notebooks/deployment/PIL_simulation.ipynb` to start the closed-loop simulation.
```bash
cd pil_simulation
python build_tensorrt_model.py
```

Then use `server_jetson_inference_trt.py` instead of the ONNX Runtime server.
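Before wiring either server into the closed loop, the exported model can be sanity-checked locally (a sketch in the spirit of `test_onnx.py`; see `onnx_models/MODEL_CARD.md` for the exact I/O names):

```python
import numpy as np
import onnxruntime as ort

# Load the exported policy on CPU and run a dummy observation through it.
sess = ort.InferenceSession("onnx_models/policy_OAM.onnx",
                            providers=["CPUExecutionProvider"])
obs = np.zeros((1, 8), dtype=np.float32)
action = sess.run(None, {sess.get_inputs()[0].name: obs})[0]
assert action.shape == (1, 3)  # 8D observation in, 3D action out
```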
The following steps reproduce the full experimental pipeline from scratch:
1. Nominal trajectory: `Mars_nominal_trajectory.ipynb`
2. Dataset generation: `Compute_Database_Mars_New.ipynb` (~100k trajectories, ~420 MB)
3. Behavioural cloning: `Pretrain_Actor.ipynb`
4. PPO training: `Train_Model_RL.ipynb` (~25M steps)
5. Monte Carlo evaluation: `Montecarlo_Policy_Evaluation.ipynb`
6. ONNX export: `Onnx_export.ipynb`
7. PIL validation: run server on Jetson + `PIL_simulation.ipynb` on host
| Metric | Value |
|---|---|
| Transfer mission | Earth → Mars (low-thrust, fixed time of flight) |
| Policy architecture | MLP 64×64×64, tanh |
| RL algorithm | PPO (Stable-Baselines3) |
| Training steps | ~25M environment interactions |
| ORT inference latency (Jetson, 7 W) | ~1 ms |
| TRT inference latency (Jetson, GPU) | < 1 ms |
| Power consumption (7 W mode) | ~7 W |
| Robustness (MTE, observation noise) | evaluated via Monte Carlo |
For detailed numerical results, refer to the thesis document:
2026_03_Pareschi_Thesis.pdf.
| Component | Technology |
|---|---|
| ODE integration | Heyoka (Taylor-adaptive, tol=1e-16) |
| Optimal control | Pontryagin's Minimum Principle, indirect shooting |
| Ephemeris | SPICE / spiceypy (DE432) |
| Neural networks | PyTorch |
| RL training | Stable-Baselines3 (PPO) |
| RL environment | Gymnasium |
| Model export | ONNX |
| Edge inference | ONNX Runtime / TensorRT |
| Deployment server | Flask |
| Edge hardware | NVIDIA Jetson Orin Nano |
| Hardware monitoring | jtop (jetson-stats) |
| Numerics | NumPy, SciPy, scikit-learn |
| Visualisation | Matplotlib |
- Guidance generalisation: extend to multi-target or time-free transfer problems.
- Multi-revolution transfers: handle trajectories with multiple revolutions around the Sun.
- State estimation integration: couple the guidance policy with an onboard filter for realistic navigation.
- Fuel-time trade-off: incorporate time-of-flight as an additional optimisation variable.
- Uncertainty-aware policies: explore distributional RL or conformal prediction for risk-bounded guarantees.
- Further hardware compression: quantisation-aware training or pruning for lower-power deployment.
- Formal verification: bound worst-case performance under MTE sequences using reachability analysis.
If you use this code in your research, please cite:
```bibtex
@mastersthesis{pareschi2026rl,
author = {Pareschi, Marcello},
title = {Reinforcement Learning for Autonomous Low-Thrust Interplanetary Guidance},
school = {Politecnico di Milano},
year = {2026},
type = {Master's Thesis}
}
```

The thesis document is included in this repository as `2026_03_Pareschi_Thesis.pdf`.
Marcello Pareschi
MSc Aerospace Engineering — Politecnico di Milano
Thesis supervisor: Prof. Francesco Topputo
This project is released under the MIT License.