feat(agent-service): persist agent state to disk and rehydrate on startup #5276

Open
bobbai00 wants to merge 1 commit into apache:main from bobbai00:feat/5267-agent-service-persistence

Conversation

@bobbai00
Contributor

What changes were proposed in this PR?

Each agent was a stateful object in a process-local Map, never persisted, so a restart / crash / redeploy lost every agent and its conversation. This adds opt-in disk persistence for agents:

  • TexeraAgent.toSnapshot() / restoreFromSnapshot() — capture and restore the durable state: the ReAct step tree, HEAD, settings, delegate metadata, and the workflow being edited. The short-lived user token and the recomputable execution-result cache are intentionally excluded. createdAt is preserved across restarts.
  • AgentSnapshotStore (src/persistence/) — one JSON file per agent, atomic writes (temp file + rename), debounced coalescing of rapid updates, and loadAll() that skips unreadable/unsupported files.
  • Server wiring — persists on create / clear / checkout / settings change and after each WebSocket turn; removes the file on delete; and rehydrates all agents on startup via rehydrateAgents. Enabled by AGENT_STATE_DIR; leaving it empty keeps the previous in-memory-only behavior.
Before:  restart -> agentStore empty            -> all agents + history lost
After:   restart -> loadAll() + rehydrate       -> agents restored from disk
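
The atomic-write and skip-on-load behavior described for AgentSnapshotStore could look roughly like the sketch below. It is not the PR's code: the `Snapshot` shape, the `SUPPORTED_VERSION` value, and the function names `saveSnapshot` and `loadAll` are assumptions for illustration, and it uses synchronous `node:fs` calls only to keep the example short.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical snapshot shape; the real fields are not listed in this description.
interface Snapshot {
  version: number;
  agentId: string;
  createdAt: string;
}

const SUPPORTED_VERSION = 1; // assumed version number

// Write to a temp file in the same directory, then rename it over the target.
// On POSIX filesystems the rename replaces the target atomically, so a reader
// sees either the old file or the new one, never a partial write.
function saveSnapshot(dir: string, snapshot: Snapshot): void {
  fs.mkdirSync(dir, { recursive: true }); // auto-create the state directory
  const target = path.join(dir, `${snapshot.agentId}.json`);
  const temp = `${target}.${process.pid}.tmp`;
  fs.writeFileSync(temp, JSON.stringify(snapshot));
  fs.renameSync(temp, target);
}

// Load every readable, supported snapshot and skip everything else,
// so one bad file cannot block startup.
function loadAll(dir: string): Snapshot[] {
  if (!fs.existsSync(dir)) return [];
  const loaded: Snapshot[] = [];
  for (const name of fs.readdirSync(dir)) {
    if (!name.endsWith(".json")) continue; // non-matching file name
    try {
      const raw = fs.readFileSync(path.join(dir, name), "utf8");
      const parsed = JSON.parse(raw) as Snapshot;
      if (parsed.version === SUPPORTED_VERSION) loaded.push(parsed);
    } catch {
      // Corrupt or unreadable file: ignore it.
    }
  }
  return loaded;
}
```

Keeping the temp file in the same directory as the target matters here, because a rename across filesystems is not atomic.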

Any related issues, documentation, discussions?

Closes #5267

How was this PR tested?

cd agent-service
bun test            # 110 pass, 0 fail
bun run typecheck   # clean
bun run format:check

New tests:

  • agent/texera-agent-snapshot.test.ts: toSnapshot of a fresh agent, createdAt round-trip, restoreFromSnapshot restoring conversation/settings/workflow/delegate, a full JSON round-trip deep-equal, and version rejection.
  • persistence/agent-snapshot-store.test.ts: save/load, directory auto-create, overwrite, loadAll (incl. skipping corrupt/unsupported/non-matching files), remove (incl. missing), and debounced scheduleSave + flush coalescing.
  • server.test.ts: a created agent is written to disk, delete removes the file, and rehydrateAgents restores a persisted agent so it is served again.
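
For readers unfamiliar with the round-trip and version-rejection properties these tests check, here is a minimal sketch of the idea. `SketchAgent`, `AgentSnapshot`, and every field name are hypothetical stand-ins, not the actual TexeraAgent API, and `restoreFromSnapshot` is written as a static factory purely for brevity.

```typescript
// All names below are illustrative assumptions, not the PR's real types.
const SNAPSHOT_VERSION = 1;

interface AgentSnapshot {
  version: number;
  agentId: string;
  createdAt: string; // ISO timestamp, preserved across restarts
  head: string | null;
  steps: { id: string; parentId: string | null; content: string }[];
  settings: Record<string, unknown>;
}

class SketchAgent {
  agentId: string;
  createdAt: Date;
  head: string | null = null;
  steps: AgentSnapshot["steps"] = [];
  settings: Record<string, unknown> = {};
  userToken: string | undefined; // short-lived, deliberately never snapshotted

  constructor(agentId: string, createdAt: Date = new Date()) {
    this.agentId = agentId;
    this.createdAt = createdAt;
  }

  // Capture only the durable state as plain JSON-safe data.
  toSnapshot(): AgentSnapshot {
    return {
      version: SNAPSHOT_VERSION,
      agentId: this.agentId,
      createdAt: this.createdAt.toISOString(),
      head: this.head,
      steps: this.steps.map((s) => ({ ...s })),
      settings: { ...this.settings },
    };
  }

  // Rebuild an agent, refusing snapshots written by an unknown format version.
  static restoreFromSnapshot(snapshot: AgentSnapshot): SketchAgent {
    if (snapshot.version !== SNAPSHOT_VERSION) {
      throw new Error(`Unsupported snapshot version: ${snapshot.version}`);
    }
    const agent = new SketchAgent(snapshot.agentId, new Date(snapshot.createdAt));
    agent.head = snapshot.head;
    agent.steps = snapshot.steps.map((s) => ({ ...s }));
    agent.settings = { ...snapshot.settings };
    return agent;
  }
}
```

A round-trip check then amounts to asserting that `restoreFromSnapshot(JSON.parse(JSON.stringify(agent.toSnapshot())))` produces an agent whose own snapshot is deep-equal to the first one.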

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

…rtup

Agents lived only in a process-local Map, so a restart/crash/redeploy lost
every agent and its conversation. This adds opt-in disk persistence:

- TexeraAgent.toSnapshot()/restoreFromSnapshot() capture and restore the
  ReAct step tree, HEAD, settings, delegate metadata, and workflow content
  (the user token and result caches are intentionally excluded).
- AgentSnapshotStore writes one JSON file per agent (atomic temp+rename),
  with debounced coalescing, and loads/skips-corrupt on startup.
- The server persists on create/clear/checkout/settings-change and after each
  WS turn, removes on delete, and rehydrates all agents on startup. Enabled by
  setting AGENT_STATE_DIR; empty keeps the previous in-memory-only behavior.

Closes apache#5267
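
The debounced coalescing mentioned above can be illustrated with a small sketch. `DebouncedSaver`, its method names, and the 200 ms default delay are assumptions for illustration and may differ from what AgentSnapshotStore actually does.

```typescript
// Hypothetical debounced saver: rapid updates for one id collapse into a single write.
class DebouncedSaver<T> {
  private pending = new Map<string, { value: T; timer: ReturnType<typeof setTimeout> }>();
  private readonly write: (id: string, value: T) => void;
  private readonly delayMs: number;

  constructor(write: (id: string, value: T) => void, delayMs = 200) {
    this.write = write;
    this.delayMs = delayMs; // assumed default, not taken from the PR
  }

  // Replace any pending save for this id so only the latest value is written.
  scheduleSave(id: string, value: T): void {
    const existing = this.pending.get(id);
    if (existing) clearTimeout(existing.timer);
    const timer = setTimeout(() => this.flushOne(id), this.delayMs);
    this.pending.set(id, { value, timer });
  }

  // Write everything that is still pending right now, for example on shutdown.
  flush(): void {
    for (const id of [...this.pending.keys()]) this.flushOne(id);
  }

  private flushOne(id: string): void {
    const entry = this.pending.get(id);
    if (!entry) return;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    this.write(id, entry.value);
  }
}
```

With this shape, three `scheduleSave("a", ...)` calls in quick succession followed by `flush()` result in one write carrying the last value.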
@codecov-commenter

Codecov Report

❌ Patch coverage is 90.26549% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.39%. Comparing base (7bd6550) to head (1fdca13).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| agent-service/src/server.ts | 78.26% | 15 Missing ⚠️ |
| ...nt-service/src/persistence/agent-snapshot-store.ts | 90.90% | 7 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5276      +/-   ##
============================================
+ Coverage     49.01%   49.39%   +0.37%     
  Complexity     2378     2378              
============================================
  Files          1050     1051       +1     
  Lines         40336    40544     +208     
  Branches       4277     4277              
============================================
+ Hits          19772    20027     +255     
+ Misses        19407    19360      -47     
  Partials       1157     1157              
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 39.53% <ø> (ø) | Carriedforward from 7bd6550 |
| agent-service | 39.03% <90.26%> (+5.26%) ⬆️ | |
| amber | 51.58% <ø> (ø) | Carriedforward from 7bd6550 |
| computing-unit-managing-service | 0.00% <ø> (ø) | Carriedforward from 7bd6550 |
| config-service | 0.00% <ø> (ø) | Carriedforward from 7bd6550 |
| file-service | 37.99% <ø> (ø) | Carriedforward from 7bd6550 |
| frontend | 40.82% <ø> (ø) | Carriedforward from 7bd6550 |
| python | 90.79% <ø> (ø) | Carriedforward from 7bd6550 |
| workflow-compiling-service | 56.81% <ø> (ø) | Carriedforward from 7bd6550 |

*This pull request uses carry forward flags. Click here to find out more.


Development

Successfully merging this pull request may close these issues.

Persist agent service state instead of keeping agents only in memory
