Consistency-IL is an imitation learning method built on top of ThriftyDAgger, with an additional consistency filter for noisy or conflicting human feedback.
| LunarLanderContinuous-v3 | PandaPickAndPlace-v3 |
|---|---|
| ![]() | ![]() |
Regenerate them with:
python scripts/make_process_gifs.py --env all --fps 10

ThriftyDAgger learns a policy by switching control between a robot and an expert. Two adaptive thresholds decide when to hand off to the expert:
| Threshold | Trigger condition | Meaning |
|---|---|---|
| tau_n (novelty) | ensemble action std dev > tau_n | the state is out-of-distribution |
| tau_q (risk) | Q(s, pi(s)) < tau_q | the predicted success probability is too low |
Both thresholds are re-calibrated over time so that interventions stay near the target expert budget.
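The two triggers in the table can be sketched as a single check. This is a minimal illustration, not the repository's code: the function names are hypothetical, and the quantile-based re-calibration is an assumption about how the thresholds could be kept near a target intervention rate.

```python
import numpy as np


def should_hand_off(novelty, q_value, tau_n, tau_q):
    """Return True when control should pass from the robot to the expert."""
    is_novel = novelty > tau_n   # ensemble action std dev exceeds tau_n
    is_risky = q_value < tau_q   # predicted success probability is too low
    return bool(is_novel or is_risky)


def recalibrate(novelty_history, q_history, target_rate):
    """Assumed rule: pick thresholds so each trigger fires on roughly
    target_rate of the recently visited states."""
    tau_n = float(np.quantile(novelty_history, 1.0 - target_rate))
    tau_q = float(np.quantile(q_history, target_rate))
    return tau_n, tau_q
```

With this rule, a lower `target_rate` raises tau_n and lowers tau_q, so the expert is asked less often.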
Once the expert takes over, the robot regains control only after:
- the minimum takeover duration is satisfied
- the robot action is close enough to the expert action
- the Q-value is above tau_q
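The hand-back conditions listed above can be sketched as one predicate. The sketch assumes that all three conditions must hold at once and that "close enough" is measured by an L2 distance. The parameter names `min_takeover_steps` and `action_tol` are hypothetical.

```python
import numpy as np


def robot_may_resume(steps_in_takeover, min_takeover_steps,
                     robot_action, expert_action, action_tol,
                     q_value, tau_q):
    """Return True when the robot may take control back from the expert."""
    # Condition 1: the minimum takeover duration is satisfied.
    long_enough = steps_in_takeover >= min_takeover_steps
    # Condition 2: the robot action is close to the expert action
    # (L2 distance is an assumption of this sketch).
    diff = np.asarray(robot_action, dtype=float) - np.asarray(expert_action, dtype=float)
    close_enough = np.linalg.norm(diff) <= action_tol
    # Condition 3: the Q-value is above tau_q.
    safe_enough = q_value > tau_q
    return bool(long_enough and close_enough and safe_enough)
```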
The policy is an ensemble of independent PolicyNet members. Each member is
trained with behaviour cloning on D_human. Novelty is measured from ensemble
disagreement: high action variance means the state looks unfamiliar.
A single Q-network estimates a success-like value for (state, action) and is
used as the risk signal during switching.
After each episode, every contiguous expert-control block is checked before it
is added to D_human.
Step 1 - segment construction.
Segments are contiguous takeover blocks, not action-type chunks. In the current
code, segment_len acts as a minimum block-length threshold.
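A minimal sketch of this step, assuming each episode records a per-step flag for whether the expert was in control and that blocks shorter than segment_len are skipped. The function name `takeover_blocks` is hypothetical.

```python
def takeover_blocks(expert_in_control, segment_len):
    """Return (start, end) index pairs, end exclusive, for every contiguous
    run of expert-controlled steps with at least segment_len steps."""
    blocks = []
    start = None
    for t, flag in enumerate(expert_in_control):
        if flag and start is None:
            start = t                      # a takeover block begins
        elif not flag and start is not None:
            if t - start >= segment_len:   # keep only long-enough blocks
                blocks.append((start, t))
            start = None
    # Close a block that is still open when the episode ends.
    if start is not None and len(expert_in_control) - start >= segment_len:
        blocks.append((start, len(expert_in_control)))
    return blocks
```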
Step 2 - retrieval.
Each block is encoded with ContrastiveTrajectoryEncoder and compared against
stored reference segments in SegmentMemory.
Step 3 - conflict metrics.
For similar matches:
action_gap = L2(mean_action_candidate - mean_action_reference)
reward_gap = |mean_reward_candidate - mean_reward_reference|
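The two metrics translate directly into NumPy. The sketch assumes each segment is stored as an array of per-step actions and an array of per-step rewards. The function name is hypothetical.

```python
import numpy as np


def conflict_metrics(cand_actions, cand_rewards, ref_actions, ref_rewards):
    """Compare a candidate block against one similar reference segment."""
    # L2 norm of the difference between the mean action vectors.
    mean_action_diff = np.mean(cand_actions, axis=0) - np.mean(ref_actions, axis=0)
    action_gap = float(np.linalg.norm(mean_action_diff))
    # Absolute difference between the mean rewards.
    reward_gap = float(abs(np.mean(cand_rewards) - np.mean(ref_rewards)))
    return action_gap, reward_gap
```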
Step 4 - decision.
| Condition | Decision |
|---|---|
| no conflict | KEEP |
| severe conflict | DISCARD whole block |
| per-step noisy action | CORRECT only flagged steps |
| mild segment conflict | REQUERY or auto requery |
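The decision table could be expressed as a small function. Everything numeric here is an assumption: `mild_gap` and `severe_gap` are hypothetical thresholds on action_gap, and the order in which the conditions are checked is not specified by the table.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "KEEP"
    DISCARD = "DISCARD"
    CORRECT = "CORRECT"
    REQUERY = "REQUERY"


def decide(action_gap, noisy_steps, mild_gap, severe_gap):
    """Map conflict evidence for one takeover block to a decision.

    Returns the decision and the step indices it applies to.
    """
    if action_gap >= severe_gap:
        return Decision.DISCARD, []                  # drop the whole block
    if noisy_steps:
        return Decision.CORRECT, list(noisy_steps)   # fix only flagged steps
    if action_gap >= mild_gap:
        return Decision.REQUERY, []                  # mild segment conflict
    return Decision.KEEP, []                         # no conflict
```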
Current interaction semantics:
- KEEP: Accept the block as-is.
- DISCARD: Drop the block.
- CORRECT: Per-step noise was detected. In human mode, the user directly enters a new action for the flagged step.
- REQUERY: The block conflicts with similar memory segments. In human mode, the user chooses between the current action and similar memory actions. With `requery_mode=auto`, the best similar memory action is used automatically. With `requery_mode=off`, the segment-level requery stage is skipped entirely.
The training pipeline has three separate interaction stages, and all three are
configured from each environment's params.json.
| Stage | Mode key | Values |
|---|---|---|
| Bootstrap demos | demo_mode | auto, human |
| Online takeover | feedback_mode | auto, human, auto_noisy |
| Segment requery | requery_mode | auto, human, off |
human means real human input, auto means demos or feedback
generated automatically by expert.py, auto_noisy means the same automatic
source with added action noise, and off disables segment-level requery.
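A quick way to catch a mistyped mode before training is to check the three keys against the values in the table. This is an illustrative sketch, not the validation in utils/config.py, and `validate_modes` is a hypothetical helper.

```python
ALLOWED_MODES = {
    "demo_mode": {"auto", "human"},
    "feedback_mode": {"auto", "human", "auto_noisy"},
    "requery_mode": {"auto", "human", "off"},
}


def validate_modes(params):
    """Return a list of error strings; an empty list means all modes are valid."""
    errors = []
    for key, allowed in ALLOWED_MODES.items():
        value = params.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return errors
```

For example, `validate_modes({"demo_mode": "auto", "feedback_mode": "noisy", "requery_mode": "off"})` reports one error for feedback_mode.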
core/
    agent.py                        main training / rollout / evaluation loop
    buffer.py                       replay buffers for D_human and D_robot
    consistency.py                  segment-memory and conflict logic
    contrastive.py                  trajectory encoder and contrastive pretraining
    ensemble.py                     ensemble behavior-cloning policy
    feedback_filter.py              episode-level filtering and feedback commit
    networks.py                     policy and Q-network definitions
    segments.py                     segment extraction and segment-memory bootstrap helpers
envs/
    __init__.py                     public env API (`make_env`, control helpers)
    _control_common.py              shared keyboard-control utilities
    <env_name>/
        __init__.py
        params.json                 env-specific training configuration
        expert.py                   automatic human-stand-in policy for experiments
        wrapper.py                  optional env-specific observation wrapper
        control_scheme.py           optional keyboard-control mapping
experiments/
    ablation.py                     ablation suite
    noise_robustness.py             factored demo/feedback noise suite
    demo_feedback_noise_grid.py     demo noise x feedback noise grid suite
    feedback_frequency_sweep.py     feedback noise frequency sweep suite
    bootstrap_demo_count_sweep.py   bootstrap demo count sweep suite
    analyze.py                      result loading and plot generation
scripts/
    train.py                        main training entry point
    evaluate.py                     checkpoint evaluation
    collect_human_demos.py          live human demo collection
    collect_expert_demos.py         automatic stand-in demo collection
    run_experiments.py              batch experiment runner
outputs/
    demos/                          optional manually collected or pre-collected demo files
    runs/                           per-run outputs (checkpoint, summary, plot)
    experiments/                    per-experiment index, generated demos, and analysis plots
utils/
    checkpoints.py                  checkpoint save / load helpers
    config.py                       shared paths and train-key validation
    expert_demos.py                 automatic stand-in demo collection helpers
    human_controls.py               takeover and requery input handling
    human_demos.py                  human demo save / load helpers
    interaction.py                  requery callback wiring
    plotting.py                     training history plots
    success.py                      shared success-signal helpers
The checked-in Conda file is environment.yaml.
conda env create -f environment.yaml
conda activate consistency-il

python scripts/train.py --env <env_name>
python scripts/train.py --env <env_name> --demo_path <saved_human_demo.pt>
python scripts/train.py --load <checkpoint.pt>

Additional public flags:
- `--no_render`
- `--no_plot`
- `--bc_only`: skip encoder and online loop; run BC pretraining only
- `--set KEY=VALUE`: override any param from `params.json` at runtime (e.g. `--set seed=42 --set auto_feedback_noise=0.5`)
Rules:
- `--env` starts a fresh run from `envs/<env_name>/params.json`
- `--demo_path` replaces live bootstrap demo collection with a saved human demo file
- `--load` resumes from a checkpoint
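For readers wondering how `--set KEY=VALUE` overrides might be turned into typed parameters, here is one plausible approach. How train.py actually parses them is not shown here, so treat `parse_overrides` as a hypothetical helper: it tries each value as JSON (so `42` becomes an int and `0.5` a float) and falls back to the raw string.

```python
import json


def parse_overrides(pairs):
    """Turn ["seed=42", "requery_mode=off"] into {"seed": 42, "requery_mode": "off"}."""
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        try:
            overrides[key] = json.loads(raw)   # numbers, booleans, null, lists
        except json.JSONDecodeError:
            overrides[key] = raw               # plain strings such as "off"
    return overrides
```

The resulting dict could then be merged over the values loaded from params.json, for example with `params.update(overrides)`.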
Each training run creates a timestamped run directory:
outputs/runs/<env>/<YYYYMMDD_HHMMSS>/
├── checkpoint.pt final checkpoint
├── best_checkpoint.pt best checkpoint
├── params_used.json
├── summary.json
└── training.png if `--no_plot` is not used
summary.json contains final evaluation metrics, per-iteration training history,
and consistency filter statistics (total_expert_steps, conflicts_detected,
segments_accepted, segments_corrected, segments_discarded).
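To pull the filter statistics out of a finished run, something like the following may be enough. The exact nesting of summary.json is not documented here, so the sketch searches the whole JSON tree for the key names listed above. `load_filter_stats` is a hypothetical helper.

```python
import json
from pathlib import Path

FILTER_KEYS = (
    "total_expert_steps",
    "conflicts_detected",
    "segments_accepted",
    "segments_corrected",
    "segments_discarded",
)


def find_filter_stats(node, found=None):
    """Collect the first occurrence of each filter key anywhere in the tree."""
    found = {} if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            if key in FILTER_KEYS and key not in found:
                found[key] = value
            else:
                find_filter_stats(value, found)
    elif isinstance(node, list):
        for item in node:
            find_filter_stats(item, found)
    return found


def load_filter_stats(run_dir):
    """Read <run_dir>/summary.json and return the filter statistics it contains."""
    summary = json.loads((Path(run_dir) / "summary.json").read_text())
    return find_filter_stats(summary)
```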
python scripts/evaluate.py outputs/runs/<env>/<timestamp>/checkpoint.pt

Additional public flags:
- `--n_episodes`: number of evaluation episodes (default: 5)
- `--switching`: enable robot/expert switching during evaluation
- `--trace_switching`: log switching events per episode
- `--trace_interval`: interval between switching trace logs
- `--success_threshold`: override the environment's success threshold
- `--render`: render the environment
- `--seed`
Collect human-controlled demos:
python scripts/collect_human_demos.py \
--env lunar_lander \
--n_episodes 5

The keyboard scheme is resolved automatically from the environment package.
Additional public flags:
- `--out`: path to save the recorded demo file
- `--seed`
For automated experiments, the repository also provides heuristic policies that
stand in for a human. Those live in each environment's expert.py.
Collect automatic stand-in demos:
python scripts/collect_expert_demos.py \
--env lunar_lander \
--n_episodes 5

Additional public flags:
- `--success_threshold`: override the environment's success threshold
- `--action_noise`: std dev of Gaussian noise added to expert actions
- `--noise_prob`: probability each step gets noise applied (default: 1.0)
- `--out`: path to save the recorded demo file
- `--render`: render expert collection live
- `--render_fps`
- `--save_gif`
- `--seed`
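The interaction between `--action_noise` and `--noise_prob` can be illustrated with a short sketch: at each step, with probability noise_prob, zero-mean Gaussian noise with standard deviation action_noise is added to the expert action. Whether the repository also clips the result to the action bounds is not stated, so no clipping is done here. `perturb_action` is a hypothetical name.

```python
import numpy as np


def perturb_action(action, action_noise, noise_prob, rng):
    """Return the expert action, possibly perturbed by Gaussian noise."""
    action = np.asarray(action, dtype=float)
    # Draw once per step to decide whether this step gets noise at all.
    if rng.random() < noise_prob:
        return action + rng.normal(0.0, action_noise, size=action.shape)
    return action
```

With `noise_prob=1.0` every step is perturbed, and with `noise_prob=0.5` roughly half of the steps are left clean.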
Manual or pre-collected demo files can be stored under outputs/demos/, but the
batch experiment suites below do not require those files. Suites that need fixed
bootstrap demos generate them at the start of each experiment under that
experiment's result directory:
outputs/experiments/<experiment>/<timestamp>/generated_demos/
<env>/
*.pt
Five experiment suites evaluate Consistency-IL against BC and ThriftyDAgger
baselines. scripts/run_experiments.py is the common runner for all suites.
When an experiment defines prepare_runs(), the runner first creates fresh
bootstrap demo files under outputs/experiments/<experiment>/<timestamp>/generated_demos/
and injects those paths into each training run.
Compares five method variants under fixed noisy online feedback to isolate which filter components contribute to performance. Bootstrap demos are generated clean once per environment and reused across the method variants.
| Method label | consistency_mode | requery_mode |
|---|---|---|
| BC | - | - |
| ThriftyDAgger | off | off |
| Consistency-IL_similarity_only | similarity_only | off |
| Consistency-IL_no_requery | full | off |
| Consistency-IL_full | full | auto |
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Feedback noise: LunarLander auto_feedback_noise=2.0; Panda auto_feedback_noise=0.5; noise_prob=1.0
Total runs: 30
Separates bootstrap demo noise from online feedback noise. Demo files are generated once for each environment and demo level, then reused by all runs with the same demo setting.
| Demo level | demo_action_noise | demo noise_prob |
|---|---|---|
| low | 0.1 | 1.0 |
| medium | 0.3 | 1.0 |
| high | 0.5 | 1.0 |
Environments: lunar_lander, panda_pick_and_place
Methods: BC, ThriftyDAgger, Consistency-IL_full
Online feedback noise: auto_feedback_noise low/medium/high = 0.1/0.3/0.5
Online feedback frequency: noise_prob low/medium/high = 0.5/0.7/1.0
Generated demos: 6 files (2 envs x 3 demo levels)
Total runs: 114
Runs a full grid where demo noise and feedback noise use the same four
clean/low/medium/high levels. Here the feedback frequency is fixed at
noise_prob=1.0, so low/medium/high differ by noise magnitude only.
| Level | noise | noise_prob |
|---|---|---|
| clean | 0.0 | 1.0 |
| low | 0.1 | 1.0 |
| medium | 0.3 | 1.0 |
| high | 0.5 | 1.0 |
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 24 files (2 envs x 4 demo levels x 3 seeds)
Total runs: 216
Holds feedback noise magnitude fixed at auto_feedback_noise=0.3 and varies how
often that noise is applied. Demo quality is fixed to either clean or medium.
Demo levels: clean (demo_action_noise=0.0) and medium (demo_action_noise=0.3)
Feedback noise: auto_feedback_noise=0.3
Feedback frequency: noise_prob=0.25, 0.50, 0.75, 1.00
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 12 files (2 envs x 2 demo levels x 3 seeds)
Total runs: 108
Measures how many initial clean expert demo episodes are needed before online learning starts.
Demo counts: 1, 3, 5, 10 episodes
Demo noise: clean (demo_action_noise=0.0, noise_prob=1.0)
Online feedback noise: auto_feedback_noise=0.3, noise_prob=1.0 for online methods
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 24 files (2 envs x 4 demo counts x 3 seeds)
Total runs: 72
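The run totals quoted for the five suites are consistent with a simple counting rule, sketched below. The rule itself is inferred rather than stated: BC is assumed to run once per environment, seed, and demo setting (it uses no online feedback), while each online method additionally runs once per feedback setting. The single seed for noise_robustness is likewise inferred from its 6 generated demo files.

```python
def total_runs(n_envs, n_seeds, n_demo_settings, n_feedback_settings,
               n_online_methods=2):
    """Count runs under the assumed rule described above."""
    bc_runs = n_envs * n_seeds * n_demo_settings
    online_runs = (n_online_methods * n_envs * n_seeds
                   * n_demo_settings * n_feedback_settings)
    return bc_runs + online_runs


# ablation: BC plus four online variants, one fixed feedback setting
print(total_runs(2, 3, 1, 1, n_online_methods=4))  # 30
# noise_robustness: 3 demo levels, 3 noise levels x 3 frequencies, 1 seed
print(total_runs(2, 1, 3, 9))                      # 114
# demo_feedback_noise_grid: 4 demo levels x 4 feedback levels
print(total_runs(2, 3, 4, 4))                      # 216
# feedback_frequency_sweep: 2 demo levels, 4 frequencies
print(total_runs(2, 3, 2, 4))                      # 108
# bootstrap_demo_count_sweep: 4 demo counts, one feedback setting
print(total_runs(2, 3, 4, 1))                      # 72
```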
# Run a single experiment suite
python scripts/run_experiments.py --experiment ablation --no_render --no_plot
python scripts/run_experiments.py --experiment noise_robustness --no_render --no_plot
python scripts/run_experiments.py --experiment demo_feedback_noise_grid --no_render --no_plot
python scripts/run_experiments.py --experiment feedback_frequency_sweep --no_render --no_plot
python scripts/run_experiments.py --experiment bootstrap_demo_count_sweep --no_render --no_plot
# Run all suites back-to-back
python scripts/run_experiments.py --experiment all --no_render --no_plot
# Dry-run (print argv without executing)
python scripts/run_experiments.py --experiment noise_robustness --dry_run
# Resume an interrupted run
python scripts/run_experiments.py --resume outputs/experiments/ablation/<timestamp>/index.json

Results are written to outputs/experiments/<experiment>/<YYYYMMDD_HHMMSS>/index.json.
Each completed run records its run_dir pointer so the analyzer can locate summary.json.
# Latest run (if only one timestamp exists)
python experiments/analyze.py --experiment ablation
python experiments/analyze.py --experiment noise_robustness
python experiments/analyze.py --experiment demo_feedback_noise_grid
python experiments/analyze.py --experiment feedback_frequency_sweep
python experiments/analyze.py --experiment bootstrap_demo_count_sweep
# Specific timestamp
python experiments/analyze.py --experiment ablation --timestamp 20260422_020015

Plots are saved alongside index.json in the experiment result directory. File
names are prefixed by the experiment name:
outputs/experiments/<experiment>/<timestamp>/
├── index.json
├── <experiment>_autonomous_success_rate*.png
├── <experiment>_autonomous_return_mean*.png
├── <experiment>_total_expert_steps*.png
├── <experiment>_segment_stats.png
├── training_curves_LunarLanderContinuous_v3.png
└── training_curves_PandaPickAndPlace_v3.png
| Env key | Gym env |
|---|---|
| bipedal_walker | BipedalWalker-v3 |
| car_racing | CarRacing-v3 |
| inverted_double_pendulum | InvertedDoublePendulum-v5 |
| inverted_pendulum | InvertedPendulum-v5 |
| lunar_lander | LunarLanderContinuous-v3 |
| mountain_car | MountainCarContinuous-v0 |
| panda_pick_and_place | PandaPickAndPlace-v3 |
| pendulum | Pendulum-v1 |
| reacher | Reacher-v5 |

