Train LLM agents collaboratively across decentralized clients, without sharing local data.
- [Jun 2026] Initial release of the FedAgent library, federated PPO/GRPO trainer, two-level heterogeneity suite, and full WebShop + ALFWorld reproduction.
- [Jun 2026] Paper online: Is Decentralized LLM Agent RL Robust to Heterogeneity? An Asymmetric Tale, by Canyu Chen*, Kangyu Zhu*, Zhaorun Chen, Zhanhui Zhou, Shizhe Diao, Yiping Lu, Tian Li, Manling Li+, and Dawn Song+ (*Equal contribution, +Equal Advising, Homepage: https://fed-agent.github.io, PDF). This work received the Best Paper Award at the AAAI 2026 Workshop on Trust and Control in Agentic AI and the Outstanding Paper Award at the AAAI 2026 Workshop on Personalization in the Era of Large Foundation Models.
FedAgent is a library for federated RL training of LLM agents. It implements a federated training server with FedAvg aggregation (plus optional client-side FedProx), a two-level heterogeneity suite (task vs environment partitioning), and federated PPO/GRPO trainers built on verl-agent. You can reproduce the paper's experiments or extend the framework with your own datasets, environments, and algorithms.
FedAgent is also the reference implementation for the paper, which formalizes
agent heterogeneity at two structurally distinct levels (task vs environment)
and derives an asymmetric robustness result: federated training is robust to
task-level heterogeneity but worst-case non-robust to environment-level
heterogeneity. See docs/heterogeneity.md for
the full construction.
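As a toy illustration of what task-level heterogeneity can look like, the sketch below gives each client tasks from only a few categories. It is not one of the paper's Preference / Coverage / Hardness constructions (those live in docs/heterogeneity.md); the function name, the `(task_id, category)` layout, and the category labels are all hypothetical.

```python
import random
from collections import defaultdict


def partition_by_category(tasks, num_clients, categories_per_client, seed=0):
    """Give each client tasks from only a few categories (a toy non-IID split).

    tasks: list of (task_id, category) pairs. All names here are hypothetical.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for task_id, category in tasks:
        by_category[category].append(task_id)
    categories = sorted(by_category)

    client_tasks = {}
    for client in range(num_clients):
        # Each client samples its own small set of visible categories.
        visible = rng.sample(categories, categories_per_client)
        client_tasks[client] = sorted(
            task_id for category in visible for task_id in by_category[category]
        )
    return client_tasks


# 20 toy tasks spread evenly over 4 categories.
tasks = [(i, f"cat{i % 4}") for i in range(20)]
split = partition_by_category(tasks, num_clients=3, categories_per_client=2)
for client, ids in split.items():
    print(client, ids)
```

Every client here shares the same environment dynamics and differs only in which tasks it sees, which is the task-level side of the distinction drawn above.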
- Federated PPO and GRPO trainers: drop-in federated counterparts of the verl-agent trainers; swap one config to go from single-client to federated
- Two-level heterogeneity suite: task-level (Preference / Coverage / Hardness) and environment-level (5 WebShop transition variants), the first systematic decomposition for agent FL
- FedAvg aggregation with FSDP-sharded model support, pluggable for custom rules, plus optional client-side FedProx (a proximal term added to local training, not a server rule)
- Fully configurable federation protocol (clients N, clients/round M, local epochs E, rounds T, tasks/client |Xᵢ|) with ready-made sweeps
- Any HuggingFace backbone (paper uses Qwen2.5-1.5B/3B/7B-Instruct, Llama-3.2-3B-Instruct); WebShop and ALFWorld benchmarks out of the box
- FSDP sharding, single-GPU to multi-node, SLURM / torchrun launch paths
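To make the protocol knobs in the list above concrete, here is a minimal sketch of textbook FedAvg with per-round client sampling, in plain Python. It is not FedAgent's implementation (which aggregates FSDP-sharded models); the function names, the list-of-floats parameter layout, and the choice of weights are assumptions made for illustration.

```python
import random


def fedavg(client_states, client_weights):
    """Weighted average of client parameters.

    client_states: list of dicts mapping parameter name -> list of floats.
    client_weights: e.g. the number of local samples per client (an assumption).
    """
    total = float(sum(client_weights))
    averaged = {}
    for name in client_states[0]:
        size = len(client_states[0][name])
        averaged[name] = [
            sum(w * state[name][i] for state, w in zip(client_states, client_weights)) / total
            for i in range(size)
        ]
    return averaged


def sample_clients(num_clients, clients_per_round, rng):
    """Pick M of the N clients for one round, without replacement."""
    return sorted(rng.sample(range(num_clients), clients_per_round))


rng = random.Random(0)
global_state = {"w": [0.0, 0.0]}
N, M, T = 4, 2, 3  # clients, clients per round, rounds
for round_idx in range(T):
    selected = sample_clients(N, M, rng)
    # Stand-in for E local epochs: each client shifts the weights by (its id + 1).
    local_states = [{"w": [v + (cid + 1) for v in global_state["w"]]} for cid in selected]
    global_state = fedavg(local_states, client_weights=[1] * len(selected))
    print(round_idx, selected, global_state)
```

A custom aggregation rule would replace `fedavg` while keeping the same inputs; see docs/extending.md for the library's actual extension point.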
Clients can run serially or in parallel across GPUs; the library is
W&B-free (metrics go to JSON / console) and exposes extension points for new
datasets, environments, heterogeneity strategies, and aggregation rules
(see docs/extending.md).
Full details in docs/features.md.
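The optional client-side FedProx mentioned above adds a proximal term to each client's local objective, penalizing drift from the round's global weights. The sketch below shows the standard form of that term, (mu / 2) * ||w - w_global||^2, and its gradient on flat parameter lists. How FedAgent wires this into the PPO/GRPO loss is not shown here, and the function names are hypothetical.

```python
def fedprox_penalty(local_params, global_params, mu):
    """Proximal term (mu / 2) * ||w - w_global||^2 over flat parameter lists."""
    squared_distance = sum((w - g) ** 2 for w, g in zip(local_params, global_params))
    return 0.5 * mu * squared_distance


def fedprox_gradient(task_grad, local_params, global_params, mu):
    """Gradient of (task loss + proximal term): task_grad + mu * (w - w_global)."""
    return [g + mu * (w - wg) for g, w, wg in zip(task_grad, local_params, global_params)]


local_w = [1.0, 2.0]
global_w = [0.0, 0.0]
print(fedprox_penalty(local_w, global_w, mu=0.1))
print(fedprox_gradient([0.5, -0.5], local_w, global_w, mu=0.1))
```

Because the penalty only changes the client's local update, the server can keep using plain FedAvg, which is why FedProx is described as a client-side option rather than a server rule.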
fedagent/
├── README.md              # this file
├── LICENSE                # Apache-2.0
├── NOTICE                 # third-party attributions
├── CITATION.cff           # how to cite (TODO: finalize once published)
├── reproduce.sh           # one-command reproduction entry point
├── evaluate.sh            # evaluate a trained checkpoint + collect trajectories
├── download_data.sh       # fetch WebShop / ALFWorld data (not shipped)
├── .env.example           # optional environment variables (W&B removed)
├── .gitignore
├── core/                  # federated server + aggregation + trainers (contribution)
├── utils/                 # model aggregation (FedAvg, incl. FSDP)
├── tools/                 # run_federated.py, resolve_paths.py, checkpoint monitor
│   └── aggregation/       # aggregation verification / diagnostic toolbox
├── scripts/               # setup_env.sh, runners, verl-agent base launch scripts
├── config/                # curated experiment configs (W&B stripped)
│   ├── paths.yaml.example # path template consumed by tools/resolve_paths.py
│   └── example.yaml       # fully annotated example config
├── docs/                  # user-facing documentation (see below)
├── eval/                  # checkpoint evaluation + trajectory collection
└── third_party/
    └── verl-agent/        # vendored upstream (Apache-2.0), no bundled 5.6 GB data
FedAgent is a framework extension, so first-party code spans two layers: a
top-level control plane and in-framework hooks that live inside the vendored
tree (verl-agent imports/runs them). Everything else under
third_party/verl-agent/ is unmodified upstream (Apache-2.0). Per-file detail:
docs/ARCHITECTURE.md; exhaustive edit list:
CHANGES.md.
fedagent/                      ── first-party (this work) ──
├── core/                      control plane: federated server, round orchestration, aggregation
├── utils/                     model aggregation (FedAvg, incl. FSDP)
├── tools/                     run_federated.py, resolve_paths.py, aggregation/, env_heterogeneity/, heterogeneity_test/, monitor/
├── eval/                      checkpoint evaluation + trajectory collection
├── scripts/                   setup_env.sh, federated runners, verl-agent launchers, plotting/
├── config/, docs/             experiment configs (W&B stripped) + documentation
│
└── third_party/verl-agent/    ── vendored upstream (Apache-2.0); our hooks woven in ──
    ├── agent_system/environments/partition_strategy.py       core heterogeneity constructions
    ├── agent_system/environments/fed_env_manager.py          federated env managers
    ├── verl/trainer/main_ppo_fed.py                          federated PPO/GRPO entry point
    ├── verl/trainer/ppo/ray_trainer_fed.py                   Ray federated trainer
    ├── verl/utils/checkpoint/fsdp_checkpoint_manager_fed.py  federated checkpoint manager
    └── verl/utils/tracking_fed.py                            per-round / per-client tracking
FedAgent runs on Python 3.10. WebShop and ALFWorld have conflicting
dependencies (WebShop needs a Java/Lucene search stack via pyserini/pyjnius;
ALFWorld needs the TextWorld + Fast-Downward planning stack). Following
verl-agent's own guidance,
each benchmark gets its own conda env:
# WebShop -> conda env `fedagent-webshop` (Python 3.10), incl. vendored verl-agent
bash scripts/setup_env.sh create webshop
conda activate fedagent-webshop
# ALFWorld -> conda env `fedagent-alfworld`
bash scripts/setup_env.sh create alfworld
conda activate fedagent-alfworld
alfworld-download -f # one-time: PDDL + game files -> ~/.cache/alfworld/
# Path template (both envs)
cp config/paths.yaml.example config/paths.yaml && $EDITOR config/paths.yaml

WebShop additionally needs a JDK on PATH (for pyserini). Each reproduce.sh /
evaluate.sh run must happen inside the matching env. Full step-by-step setup (both envs, data, and the upstream env packages) is in
docs/installation.md.
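Since each run must happen inside the matching env, a small preflight check can catch a wrong env or a missing Java before a long job starts. The sketch below is not part of FedAgent: the `preflight` helper is hypothetical, it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, and finding `java` on PATH only suggests a JDK is present (it cannot tell a JDK from a JRE).

```python
import os
import shutil

# Env names taken from the setup commands above.
EXPECTED_ENVS = {"webshop": "fedagent-webshop", "alfworld": "fedagent-alfworld"}


def preflight(benchmark, environ=os.environ, which=shutil.which):
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    expected = EXPECTED_ENVS[benchmark]
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected:
        problems.append(f"active conda env is {active!r}, expected {expected!r}")
    # Only WebShop needs the Java/Lucene search stack.
    if benchmark == "webshop" and which("java") is None:
        problems.append("no `java` on PATH (pyserini needs a JDK)")
    return problems


if __name__ == "__main__":
    for issue in preflight("webshop"):
        print("WARNING:", issue)
```

Passing `environ` and `which` as arguments keeps the check easy to test without touching the real shell environment.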
W&B logging is removed from this release, so no tracking account or key is needed.
The default configs run out of the box: the three small WebShop catalog files
(items_shuffle_1000.json, items_ins_v2_1000.json, items_human_ins.json,
backing webshop.use_small: true) are already shipped in the repo, where the
WebShop env loads them from
third_party/verl-agent/agent_system/environments/env_package/webshop/webshop/data/
(not the top-level data/, which ships only a README).
Two things are fetched separately:
bash download_data.sh # ALFWorld game files (auto) + WebShop full-catalog instructions

- ALFWorld game files: auto-downloaded by the script (alfworld-download) to ~/.cache/alfworld, where the env reads them.
- WebShop full catalog (items_shuffle.json ~5.2 GB + items_ins_v2.json), needed only for full-scale webshop.use_small: false runs, and fetched manually: the script prints instructions to download them from princeton-nlp/WebShop into the same WebShop data/ directory.

The shipped small files already reproduce the paper's WebShop results. See docs/configuration.md for the use_small switch.
Backbones are HuggingFace model ids (default Qwen/Qwen2.5-1.5B-Instruct) and
auto-download on first run to ~/.cache/huggingface (set HF_HOME to relocate;
~3 GB for 1.5B up to ~15 GB for 7B). Two caveats. First, the main table's
Llama-3.2-3B-Instruct is gated: accept its HuggingFace license and run
huggingface-cli login (or set HF_TOKEN) before the first run. Second, on
offline / air-gapped clusters, pre-fetch the model on a login node and set
HF_HUB_OFFLINE=1. See docs/installation.md for details.
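For the offline case, the pre-fetch step can be scripted. The following is a minimal sketch rather than a FedAgent tool: `hf_cache_root` and `prefetch` are hypothetical helper names, and `prefetch` assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The fallback path mirrors the default named above and ignores other overrides such as `XDG_CACHE_HOME`.

```python
import os
from pathlib import Path


def hf_cache_root(environ=os.environ):
    """Where HuggingFace files land: $HF_HOME if set, else ~/.cache/huggingface."""
    return Path(environ.get("HF_HOME") or Path.home() / ".cache" / "huggingface")


def prefetch(model_id, environ=os.environ):
    """Download a model snapshot ahead of time (run on a node with internet)."""
    # Imported lazily so hf_cache_root works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # HF_TOKEN is only needed for gated models such as Llama-3.2-3B-Instruct.
    return snapshot_download(repo_id=model_id, token=environ.get("HF_TOKEN"))


print("cache root:", hf_cache_root())
# prefetch("Qwen/Qwen2.5-1.5B-Instruct")  # uncomment on a login node
```

After pre-fetching, export HF_HUB_OFFLINE=1 on the compute nodes so the run reads from the cache instead of trying to reach the Hub.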
Run a FedAgent experiment directly with the federated runner: give it a config
name (its path under config/, without the .yaml) and a round count, from the
repository root inside the matching conda env.
# WebShop main run, 70 rounds. The config sets the backbone, GPU count, and protocol:
python tools/run_federated.py --restart-resume \
  uniform/Qwen2.5-1.5B-Instruct/main/grpo/fed_webshop_grpo_total-100_cl-per-rd-2_rd-70_ep-per-cl-3_min-goals-per-cl-100_p-uniform 70

The runner resolves the config, creates the run's ./output/ directory, and
launches per-client training; it is re-runnable and resumes where it left off.
Hardware (GPU count, FSDP offload, serial vs parallel clients) is read from the
config (verl.trainer.*, federated.training.*); to change it, edit those keys or
follow the running guide, which also documents the
lower-level launcher scripts/start_federated.sh.
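Config names like the one above pack the protocol settings into the filename. As a rough aid for reading them, the sketch below splits a name into its prefix tokens and `key-value` fields. The splitting rule is inferred from this single example and the `decode_config_name` helper is hypothetical; the authoritative filename decoder, including what each field means, is in docs/configuration.md.

```python
def decode_config_name(name):
    """Split a config filename stem into prefix tokens and key/value fields."""
    prefix, fields = [], {}
    for token in name.split("_"):
        if "-" in token:
            # The value is whatever follows the last hyphen, e.g. "cl-per-rd-2".
            key, value = token.rsplit("-", 1)
            fields[key] = int(value) if value.isdigit() else value
        else:
            prefix.append(token)
    return prefix, fields


name = "fed_webshop_grpo_total-100_cl-per-rd-2_rd-70_ep-per-cl-3_min-goals-per-cl-100_p-uniform"
prefix, fields = decode_config_name(name)
print(prefix)
print(fields)
```

In this example the `rd` field is 70, matching the round count passed on the command line.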
Evaluate a trained checkpoint and collect trajectories:
bash evaluate.sh webshop /path/to/checkpoint

Trained checkpoints are saved as FSDP shards; evaluate.sh merges them to
HuggingFace format on first use (see eval/README.md).
To reproduce the paper, reproduce.sh wraps the runner with named experiments
and hardware flags: it resolves the canonical config, applies any overrides, and
launches it.
bash reproduce.sh webshop-main # WebShop main table, GRPO, 4 GPUs
bash reproduce.sh alfworld-main --single-gpu # ALFWorld main, 1-GPU debug run
bash reproduce.sh webshop-main --mode serial # clients run one at a time
bash reproduce.sh webshop-main --slurm # submit via SLURM (cluster)

The full guide is in docs/reproducing.md: every table
and figure mapped to its config directory, with run commands, seeds, and compute
estimates (~1,800 H100 GPU-hours total). It covers the main table (Local /
Centralized / FedAgent across four backbones × WebShop + ALFWorld), the task- and
environment-level heterogeneity studies, and the decentralized ablations.
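To turn the GPU-hour estimate into a rough schedule, divide by the number of GPUs available. The sketch below does that arithmetic under an idealized assumption of perfect scaling with no queueing or restarts, so real wall-clock time will be longer; per-experiment figures are in docs/reproducing.md.

```python
def wall_clock_hours(gpu_hours, num_gpus):
    """Ideal wall-clock time if the work spreads evenly over num_gpus."""
    return gpu_hours / num_gpus


total_gpu_hours = 1800  # approximate total quoted above
for gpus in (1, 4, 8):
    hours = wall_clock_hours(total_gpu_hours, gpus)
    print(f"{gpus} GPU(s): ~{hours:.0f} h (~{hours / 24:.1f} days)")
```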
| Doc | Contents |
|---|---|
| docs/features.md | Key features in depth: the config keys, flags, and files behind each headline capability. |
| docs/installation.md | Two-conda-env setup (WebShop vs ALFWorld), full step-by-step, JDK / game-file notes. |
| docs/running.md | Running FedAgent: the run-mode matrix (parallel vs serial, FSDP on/off, single-GPU, variable GPU count, multi-node, SLURM), flag-to-knob table, and worked examples. |
| docs/reproducing.md | Per-experiment reproduction recipes, compute estimates, and seeds. |
| docs/heterogeneity.md | The two-level heterogeneity taxonomy and how to construct/select each variant. |
| docs/configuration.md | Config filename decoder and field reference for the federated: and verl: blocks. |
| docs/extending.md | Extension points: new dataset/env, new heterogeneity strategy, new RL algorithm, new aggregation strategy. |
If you use FedAgent in your research, please cite:
@article{fedagent2026,
title = {Is Decentralized LLM Agent RL Robust to Heterogeneity? An Asymmetric Tale},
author = {Chen, Canyu and Zhu, Kangyu and Chen, Zhaorun and Zhou, Zhanhui and Diao, Shizhe and Lu, Yiping and Li, Tian and Li, Manling and Song, Dawn},
journal = {arXiv preprint arXiv:},
year = {2026}
}

This project is released under the Apache License 2.0: see
LICENSE.
FedAgent builds on a vendored, modified fork of verl-agent, which itself extends veRL. We gratefully acknowledge:
- veRL: © ByteDance / the veRL authors (Apache-2.0): the base RL training framework. https://github.com/volcengine/verl
- verl-agent / GiGPO: Feng et al., Group-in-Group Policy Optimization for LLM Agent Reinforcement Learning (arXiv:2505.10978): the agent-RL fork FedAgent is built on. https://github.com/langfengQ/verl-agent
- WebShop: Yao et al., Princeton NLP (MIT License): the e-commerce agent benchmark.
- ALFWorld: Shridhar et al., Microsoft Research (MIT License): the embodied household agent benchmark.
Full per-component attributions and license texts are aggregated in
NOTICE.
