ReQuery-IL/Consistency-IL

Consistency-IL

Consistency-IL is an imitation learning method built on top of ThriftyDAgger, with an additional consistency filter for noisy or conflicting human feedback.

[GIFs: LunarLanderContinuous-v3 (LunarLander feedback and requery); PandaPickAndPlace-v3 (PandaPickAndPlace feedback and requery)]

Regenerate the GIFs with:

python scripts/make_process_gifs.py --env all --fps 10

How It Works

Baseline: ThriftyDAgger

ThriftyDAgger learns a policy by switching control between a robot and an expert. Two adaptive thresholds decide when to hand off to the expert:

Threshold        Trigger condition                Meaning
tau_n (novelty)  ensemble action std dev > tau_n  the state is out-of-distribution
tau_q (risk)     Q(s, pi(s)) < tau_q              the predicted success probability is too low

Both thresholds are re-calibrated over time so that interventions stay near the target expert budget.
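One common way to keep interventions near a budget is to set each threshold to a quantile of recently observed values. The sketch below assumes that rule; the names `quantile`, `calibrate_thresholds`, and `target_rate` are hypothetical and are not taken from core/agent.py.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def calibrate_thresholds(novelty_history, q_history, target_rate):
    """Assumed rule: each trigger should fire on roughly target_rate of states."""
    # Novelty triggers when it is ABOVE tau_n, so use the upper quantile.
    tau_n = quantile(novelty_history, 1.0 - target_rate)
    # Risk triggers when Q is BELOW tau_q, so use the lower quantile.
    tau_q = quantile(q_history, target_rate)
    return tau_n, tau_q
```

With a target rate of 0.2, about one fifth of the recorded novelty values lie above tau_n and about one fifth of the recorded Q-values lie below tau_q.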

Once the expert takes over, the robot regains control only after:

  • the minimum takeover duration is satisfied
  • the robot action is close enough to the expert action
  • the Q-value is above tau_q
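The switching rules above can be sketched as two predicates. This is a minimal illustration, not the code in core/agent.py: it assumes that either threshold alone triggers a hand-off, and `SwitchConfig`, `min_takeover_steps`, and `action_tol` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SwitchConfig:
    tau_n: float             # novelty threshold
    tau_q: float             # risk threshold
    min_takeover_steps: int  # hypothetical name for the minimum takeover duration
    action_tol: float        # hypothetical tolerance on robot/expert action distance


def should_hand_off(novelty, q_value, cfg):
    """Robot -> expert. Assumption: either trigger alone is enough."""
    return novelty > cfg.tau_n or q_value < cfg.tau_q


def robot_may_resume(steps_in_takeover, action_distance, q_value, cfg):
    """Expert -> robot. All three listed conditions must hold."""
    return (
        steps_in_takeover >= cfg.min_takeover_steps
        and action_distance <= cfg.action_tol
        and q_value > cfg.tau_q
    )
```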

Ensemble Policy and Novelty

The policy is an ensemble of independent PolicyNet members, each trained with behavior cloning on D_human. Novelty is measured from ensemble disagreement: high variance across the members' actions means the state looks unfamiliar.
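As a rough illustration of the disagreement measure, the sketch below reduces the members' actions to one novelty score. The use of the population standard deviation and the averaging over action dimensions are assumptions; `ensemble_novelty` is a hypothetical helper, not a function from core/ensemble.py.

```python
from statistics import pstdev


def ensemble_novelty(member_actions):
    """member_actions: one action vector (list of floats) per ensemble member."""
    # Standard deviation across members, computed separately per action dimension.
    per_dim_std = [pstdev(dim_values) for dim_values in zip(*member_actions)]
    # Assumption: collapse to a scalar by averaging over dimensions.
    return sum(per_dim_std) / len(per_dim_std)
```

Members that agree exactly give a score of 0; the score grows as their predicted actions spread apart.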

Risk Q-Network

A single Q-network estimates a success-like value for (state, action) and is used as the risk signal during switching.

Consistency Noise Filter (Ours)

After each episode, every contiguous expert-control block is checked before it is added to D_human.

Step 1 - segment construction.
Segments are contiguous takeover blocks, not action-type chunks. In the current code, segment_len acts as a minimum block-length threshold.
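A minimal sketch of block extraction from a per-step "expert in control" mask follows. `takeover_blocks` and `min_len` are hypothetical names, with `min_len` standing in for segment_len. How the repository treats blocks shorter than the threshold is not stated here, so this sketch simply leaves them out of the result.

```python
def takeover_blocks(expert_in_control, min_len):
    """Return (start, end) index pairs, end exclusive, for runs of True."""
    blocks = []
    start = None
    for t, flag in enumerate(expert_in_control):
        if flag and start is None:
            start = t                      # a new takeover block begins
        elif not flag and start is not None:
            if t - start >= min_len:       # keep only blocks that are long enough
                blocks.append((start, t))
            start = None
    # A block may still be open when the episode ends.
    if start is not None and len(expert_in_control) - start >= min_len:
        blocks.append((start, len(expert_in_control)))
    return blocks
```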

Step 2 - retrieval.
Each block is encoded with ContrastiveTrajectoryEncoder and compared against stored reference segments in SegmentMemory.
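Retrieval of this kind can be sketched as a nearest-neighbor lookup over embeddings. The similarity measure used by SegmentMemory is not stated here; cosine similarity, the `min_similarity` cutoff, and `top_k` are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_similar(query, memory, min_similarity, top_k):
    """memory: list of (segment_id, embedding). Returns [(id, similarity)], best first."""
    scored = [(seg_id, cosine(query, emb)) for seg_id, emb in memory]
    matches = [(seg_id, sim) for seg_id, sim in scored if sim >= min_similarity]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_k]
```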

Step 3 - conflict metrics.
For similar matches:

action_gap = L2(mean_action_candidate - mean_action_reference)
reward_gap = |mean_reward_candidate - mean_reward_reference|
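The two metrics can be computed directly from per-step actions and rewards. The sketch below is a plain-Python reading of the formulas above; `mean_vector` and `conflict_metrics` are hypothetical names, not the functions in core/consistency.py.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def conflict_metrics(cand_actions, cand_rewards, ref_actions, ref_rewards):
    # action_gap: L2 norm of the difference between the two mean actions.
    diff = [c - r for c, r in zip(mean_vector(cand_actions), mean_vector(ref_actions))]
    action_gap = math.sqrt(sum(d * d for d in diff))
    # reward_gap: absolute difference between the two mean rewards.
    reward_gap = abs(sum(cand_rewards) / len(cand_rewards)
                     - sum(ref_rewards) / len(ref_rewards))
    return action_gap, reward_gap
```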

Step 4 - decision.

Condition              Decision
no conflict            KEEP
severe conflict        DISCARD whole block
per-step noisy action  CORRECT only flagged steps
mild segment conflict  REQUERY or auto requery

Current interaction semantics:

  • KEEP: Accept the block as-is.
  • DISCARD: Drop the block.
  • CORRECT: Per-step noise was detected. In human mode, the user directly enters a new action for the flagged step.
  • REQUERY: The block conflicts with similar memory segments. In human mode, the user chooses between the current action and similar memory actions. With requery_mode=auto, the best similar memory action is used automatically. With requery_mode=off, the segment-level requery stage is skipped entirely.
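The four outcomes can be sketched as a small decision function. Everything beyond the outcome names is an assumption: the thresholds `mild_gap` and `severe_gap`, the use of action_gap alone, the order in which conditions are checked, and the choice to fall back to KEEP when requery_mode is off are illustrative, not read from core/consistency.py.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    DISCARD = "discard"
    CORRECT = "correct"
    REQUERY = "requery"


def decide(action_gap, flagged_steps, mild_gap, severe_gap, requery_mode="auto"):
    """flagged_steps: indices of steps flagged as per-step noise (may be empty)."""
    if action_gap >= severe_gap:
        return Decision.DISCARD            # severe conflict: drop the whole block
    if flagged_steps:
        return Decision.CORRECT            # fix only the flagged steps
    if action_gap >= mild_gap and requery_mode != "off":
        return Decision.REQUERY            # mild segment-level conflict
    return Decision.KEEP                   # no conflict, or requery stage skipped
```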

Feedback sources and modes

The training pipeline has three separate interaction stages, and all three are configured from each environment's params.json.

Stage            Mode key       Values
Bootstrap demos  demo_mode      auto, human
Online takeover  feedback_mode  auto, human, auto_noisy
Segment requery  requery_mode   auto, human, off

The mode values mean:

  • human: real human input
  • auto: demos or feedback generated automatically by expert.py
  • auto_noisy: the same automatic source with added action noise
  • off: segment-level requery is disabled
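A params dictionary can be checked against the three mode keys as follows. This is a hypothetical helper for illustration; the repository's own validation lives in utils/config.py and may differ.

```python
ALLOWED_MODES = {
    "demo_mode": {"auto", "human"},
    "feedback_mode": {"auto", "human", "auto_noisy"},
    "requery_mode": {"auto", "human", "off"},
}


def check_modes(params):
    """Return a list of error strings; an empty list means all three modes are valid."""
    errors = []
    for key, allowed in ALLOWED_MODES.items():
        value = params.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} not in {sorted(allowed)}")
    return errors
```

For example, `check_modes({"demo_mode": "auto", "feedback_mode": "auto_noisy", "requery_mode": "off"})` returns an empty list.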

Repository layout

core/
  agent.py            main training / rollout / evaluation loop
  buffer.py           replay buffers for D_human and D_robot
  consistency.py      segment-memory and conflict logic
  contrastive.py      trajectory encoder and contrastive pretraining
  ensemble.py         ensemble behavior-cloning policy
  feedback_filter.py  episode-level filtering and feedback commit
  networks.py         policy and Q-network definitions
  segments.py         segment extraction and segment-memory bootstrap helpers

envs/
  __init__.py         public env API (`make_env`, control helpers)
  _control_common.py  shared keyboard-control utilities
  <env_name>/
    __init__.py
    params.json        env-specific training configuration
    expert.py          automatic human-stand-in policy for experiments
    wrapper.py         optional env-specific observation wrapper
    control_scheme.py  optional keyboard-control mapping

experiments/
  ablation.py                    ablation suite
  noise_robustness.py            factored demo/feedback noise suite
  demo_feedback_noise_grid.py    demo noise x feedback noise grid suite
  feedback_frequency_sweep.py    feedback noise frequency sweep suite
  bootstrap_demo_count_sweep.py  bootstrap demo count sweep suite
  analyze.py                     result loading and plot generation

scripts/
  train.py                main training entry point
  evaluate.py             checkpoint evaluation
  collect_human_demos.py  live human demo collection
  collect_expert_demos.py automatic stand-in demo collection
  run_experiments.py      batch experiment runner

outputs/
  demos/              optional manually collected or pre-collected demo files
  runs/               per-run outputs (checkpoint, summary, plot)
  experiments/        per-experiment index, generated demos, and analysis plots

utils/
  checkpoints.py     checkpoint save / load helpers
  config.py          shared paths and train-key validation
  expert_demos.py    automatic stand-in demo collection helpers
  human_controls.py  takeover and requery input handling
  human_demos.py     human demo save / load helpers
  interaction.py     requery callback wiring
  plotting.py        training history plots
  success.py         shared success-signal helpers

Installation

The checked-in Conda file is environment.yaml.

conda env create -f environment.yaml
conda activate consistency-il

Training

python scripts/train.py --env <env_name>
python scripts/train.py --env <env_name> --demo_path <saved_human_demo.pt>
python scripts/train.py --load <checkpoint.pt>

Additional public flags:

  • --no_render
  • --no_plot
  • --bc_only — skip encoder and online loop; run BC pretraining only
  • --set KEY=VALUE — override any param from params.json at runtime (e.g. --set seed=42 --set auto_feedback_noise=0.5)
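One plausible way to interpret KEY=VALUE overrides is to parse each value as JSON and fall back to a plain string. The sketch below is an assumption about the behavior, not the parser in scripts/train.py; `parse_overrides` is a hypothetical name.

```python
import json


def parse_overrides(pairs):
    """Turn ["seed=42", "requery_mode=off"] into {"seed": 42, "requery_mode": "off"}."""
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        try:
            overrides[key] = json.loads(raw)   # numbers, true/false, null, lists
        except json.JSONDecodeError:
            overrides[key] = raw               # anything else stays a string
    return overrides
```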

Rules:

  • --env starts a fresh run from envs/<env_name>/params.json
  • --demo_path replaces live bootstrap demo collection with a saved human demo file
  • --load resumes from a checkpoint

Run outputs

Each training run creates a timestamped run directory:

outputs/runs/<env>/<YYYYMMDD_HHMMSS>/
├── checkpoint.pt         final checkpoint
├── best_checkpoint.pt    best checkpoint
├── params_used.json
├── summary.json
└── training.png          if `--no_plot` is not used

summary.json contains final evaluation metrics, per-iteration training history, and consistency filter statistics (total_expert_steps, conflicts_detected, segments_accepted, segments_corrected, segments_discarded).

Evaluation

python scripts/evaluate.py outputs/runs/<env>/<timestamp>/checkpoint.pt

Additional public flags:

  • --n_episodes — number of evaluation episodes (default: 5)
  • --switching — enable robot/expert switching during evaluation
  • --trace_switching — log switching events per episode
  • --trace_interval — interval between switching trace logs
  • --success_threshold — override the environment's success threshold
  • --render — render the environment
  • --seed

Human demos

Collect human-controlled demos:

python scripts/collect_human_demos.py \
  --env lunar_lander \
  --n_episodes 5

The keyboard scheme is resolved automatically from the environment package.

Additional public flags:

  • --out — path to save the recorded demo file
  • --seed

Automatic human-stand-in demos

For automated experiments, the repository also provides heuristic policies that stand in for a human. These live in each environment's expert.py.

Collect automatic stand-in demos:

python scripts/collect_expert_demos.py \
  --env lunar_lander \
  --n_episodes 5

Additional public flags:

  • --success_threshold — override the environment's success threshold
  • --action_noise — std dev of Gaussian noise added to expert actions
  • --noise_prob — probability each step gets noise applied (default: 1.0)
  • --out — path to save the recorded demo file
  • --render — render expert collection live
  • --render_fps
  • --save_gif
  • --seed
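The interaction of --action_noise and --noise_prob can be sketched as follows. The sketch assumes one noise decision per step covering the whole action vector, and no clipping to the action bounds; `add_action_noise` is a hypothetical helper, not the code in utils/expert_demos.py.

```python
import random


def add_action_noise(action, action_noise, noise_prob, rng):
    """Return a noisy copy of `action` with probability noise_prob, else a plain copy."""
    # rng.random() is in [0, 1), so noise_prob=1.0 always applies noise
    # and noise_prob=0.0 never does.
    if action_noise <= 0.0 or rng.random() >= noise_prob:
        return list(action)
    # Independent zero-mean Gaussian noise per action dimension.
    return [a + rng.gauss(0.0, action_noise) for a in action]
```

Passing a seeded `random.Random` instance as `rng` keeps the noise reproducible across runs.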

Manual or pre-collected demo files can be stored under outputs/demos/, but the batch experiment suites below do not require those files. Suites that need fixed bootstrap demos generate them at the start of each experiment under that experiment's result directory:

outputs/experiments/<experiment>/<timestamp>/generated_demos/
  <env>/
    *.pt

Experiments

Five experiment suites evaluate Consistency-IL against BC and ThriftyDAgger baselines. scripts/run_experiments.py is the common runner for all suites. When an experiment defines prepare_runs(), the runner first creates fresh bootstrap demo files under outputs/experiments/<experiment>/<timestamp>/generated_demos/ and injects those paths into each training run.

Ablation

Compares five method variants under fixed noisy online feedback to isolate which filter components contribute to performance. Bootstrap demos are generated clean once per environment and reused across the method variants.

Method label                    consistency_mode  requery_mode
BC                              -                 -
ThriftyDAgger                   off               off
Consistency-IL_similarity_only  similarity_only   off
Consistency-IL_no_requery       full              off
Consistency-IL_full             full              auto

Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Feedback noise: LunarLander auto_feedback_noise=2.0; Panda auto_feedback_noise=0.5; noise_prob=1.0
Total runs: 30

Noise Robustness

Separates bootstrap demo noise from online feedback noise. Demo files are generated once for each environment and demo level, then reused by all runs with the same demo setting.

Demo level  demo_action_noise  demo noise_prob
low         0.1                1.0
medium      0.3                1.0
high        0.5                1.0

Environments: lunar_lander, panda_pick_and_place
Methods: BC, ThriftyDAgger, Consistency-IL_full
Online feedback noise: auto_feedback_noise low/medium/high = 0.1/0.3/0.5
Online feedback frequency: noise_prob low/medium/high = 0.5/0.7/1.0
Generated demos: 6 files (2 envs x 3 demo levels)
Total runs: 114

Demo x Feedback Noise Grid

Runs a full grid where demo noise and feedback noise use the same four clean/low/medium/high levels. Here the feedback frequency is fixed at noise_prob=1.0, so low/medium/high differ by noise magnitude only.

Level   noise  noise_prob
clean   0.0    1.0
low     0.1    1.0
medium  0.3    1.0
high    0.5    1.0

Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 24 files (2 envs x 4 demo levels x 3 seeds)
Total runs: 216

Feedback Frequency Sweep

Holds feedback noise magnitude fixed at auto_feedback_noise=0.3 and varies how often that noise is applied. Demo quality is fixed to either clean or medium.

Demo levels: clean (demo_action_noise=0.0) and medium (demo_action_noise=0.3)
Feedback noise: auto_feedback_noise=0.3
Feedback frequency: noise_prob=0.25, 0.50, 0.75, 1.00
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 12 files (2 envs x 2 demo levels x 3 seeds)
Total runs: 108

Bootstrap Demo Count Sweep

Varies the number of clean bootstrap expert demo episodes collected before online learning starts, to measure how many are needed.

Demo counts: 1, 3, 5, 10 episodes
Demo noise: clean (demo_action_noise=0.0, noise_prob=1.0)
Online feedback noise: auto_feedback_noise=0.3, noise_prob=1.0 for online methods
Environments: lunar_lander, panda_pick_and_place
Seeds: 42, 123, 999
Methods: BC, ThriftyDAgger, Consistency-IL_full
Generated demos: 24 files (2 envs x 4 demo counts x 3 seeds)
Total runs: 72
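The run totals listed for the five suites can be reproduced with a short calculation. It rests on two inferences that are not stated outright: BC does not use online feedback, so it is counted once per environment, seed, and demo setting; and Noise Robustness, which lists no seeds, uses a single seed. Both are chosen because they match the stated totals. `total_runs` is a hypothetical helper.

```python
def total_runs(envs, seeds, demo_settings, feedback_settings, online_methods=2):
    """BC once per demo setting; online methods once per demo x feedback setting."""
    bc_runs = envs * seeds * demo_settings
    online_runs = online_methods * envs * seeds * demo_settings * feedback_settings
    return bc_runs + online_runs


counts = {
    "ablation": 5 * 2 * 3,                            # 5 variants x 2 envs x 3 seeds
    "noise_robustness": total_runs(2, 1, 3, 3 * 3),   # 3 magnitudes x 3 frequencies
    "demo_feedback_noise_grid": total_runs(2, 3, 4, 4),
    "feedback_frequency_sweep": total_runs(2, 3, 2, 4),
    "bootstrap_demo_count_sweep": total_runs(2, 3, 4, 1),
}
```

Under these assumptions the dictionary holds 30, 114, 216, 108, and 72, matching the totals above.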

Running experiments

# Run a single experiment suite
python scripts/run_experiments.py --experiment ablation --no_render --no_plot
python scripts/run_experiments.py --experiment noise_robustness --no_render --no_plot
python scripts/run_experiments.py --experiment demo_feedback_noise_grid --no_render --no_plot
python scripts/run_experiments.py --experiment feedback_frequency_sweep --no_render --no_plot
python scripts/run_experiments.py --experiment bootstrap_demo_count_sweep --no_render --no_plot

# Run all suites back-to-back
python scripts/run_experiments.py --experiment all --no_render --no_plot

# Dry-run (print argv without executing)
python scripts/run_experiments.py --experiment noise_robustness --dry_run

# Resume an interrupted run
python scripts/run_experiments.py --resume outputs/experiments/ablation/<timestamp>/index.json

Results are written to outputs/experiments/<experiment>/<YYYYMMDD_HHMMSS>/index.json. Each completed run records its run_dir pointer so the analyzer can locate summary.json.

Analyzing results

# Latest run (if only one timestamp exists)
python experiments/analyze.py --experiment ablation
python experiments/analyze.py --experiment noise_robustness
python experiments/analyze.py --experiment demo_feedback_noise_grid
python experiments/analyze.py --experiment feedback_frequency_sweep
python experiments/analyze.py --experiment bootstrap_demo_count_sweep

# Specific timestamp
python experiments/analyze.py --experiment ablation --timestamp 20260422_020015

Plots are saved alongside index.json in the experiment result directory. File names are prefixed by the experiment name:

outputs/experiments/<experiment>/<timestamp>/
├── index.json
├── <experiment>_autonomous_success_rate*.png
├── <experiment>_autonomous_return_mean*.png
├── <experiment>_total_expert_steps*.png
├── <experiment>_segment_stats.png
├── training_curves_LunarLanderContinuous_v3.png
└── training_curves_PandaPickAndPlace_v3.png

Supported environments

Env key                   Gym env
bipedal_walker            BipedalWalker-v3
car_racing                CarRacing-v3
inverted_double_pendulum  InvertedDoublePendulum-v5
inverted_pendulum         InvertedPendulum-v5
lunar_lander              LunarLanderContinuous-v3
mountain_car              MountainCarContinuous-v0
panda_pick_and_place      PandaPickAndPlace-v3
pendulum                  Pendulum-v1
reacher                   Reacher-v5
