feat(k8s): add agent service and LiteLLM to the Helm chart#5272

Open
bobbai00 wants to merge 1 commit into
apache:mainfrom
bobbai00:feat/5269-agent-service-k8s
@bobbai00 bobbai00 commented May 28, 2026

What changes were proposed in this PR?

The agent-service image is built and runs under single-node compose, but it had no Helm deployment, and the chart had no in-cluster LLM gateway. This PR adds both, mirroring the proven preview/production chart while aligning the agent service's environment variables with what the code on main actually reads.

Agent service (gated on agentService.enabled)

  • agent-service-deployment.yaml + agent-service-service.yaml, wired to in-cluster service DNS using the env names from agent-service/src/config/env.ts: TEXERA_DASHBOARD_SERVICE_ENDPOINT (webserver), LLM_ENDPOINT (access-control-service), WORKFLOW_COMPILING_SERVICE_ENDPOINT, and a per-CU EXECUTION_ENDPOINT_TEMPLATE.
  • A dedicated /api/agents HTTPRoute (REST + the /api/agents/:id/react WebSocket) plus a BackendTrafficPolicy that consistent-hashes on X-Agent-Workflow-Id — agents are held in memory per pod, so a workflow's requests must always reach the same replica (the client already stamps that header).
  • Readiness/liveness on /api/healthcheck.
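
To illustrate why hashing on X-Agent-Workflow-Id keeps a workflow pinned to one replica, here is a minimal sketch of ring-based consistent hashing. It is not the gateway's actual implementation: the replica names, the FNV-1a hash, and the virtual-node count are all assumptions chosen for illustration.

```typescript
// Hypothetical replica names; a real gateway hashes over discovered pod endpoints.
const replicas: string[] = ["agent-service-0", "agent-service-1", "agent-service-2"];

type RingEntry = readonly [position: number, replica: string];

// 32-bit FNV-1a hash, used here only to place keys on the ring (an assumption,
// not the hash the gateway uses).
function fnv1a(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Place several virtual nodes per replica on the ring and sort by position.
function buildRing(nodes: string[], vnodes = 64): RingEntry[] {
  const ring: RingEntry[] = [];
  for (const node of nodes) {
    for (let i = 0; i < vnodes; i++) {
      ring.push([fnv1a(`${node}#${i}`), node]);
    }
  }
  return ring.sort((a, b) => a[0] - b[0]);
}

// Route a header value to the first ring entry at or after its hash,
// wrapping around to the start of the ring if none is found.
function pickReplica(ring: RingEntry[], workflowId: string): string {
  if (ring.length === 0) {
    throw new Error("no replicas available");
  }
  const pos = fnv1a(workflowId);
  const hit = ring.find(([p]) => p >= pos) ?? ring[0];
  return hit[1];
}

const ring = buildRing(replicas);
// The same header value always maps to the same replica.
console.log(pickReplica(ring, "workflow-42") === pickReplica(ring, "workflow-42")); // true
```

With a ring like this, removing one replica only remaps the workflows that were assigned to it; requests for every other workflow keep reaching the pod that holds their in-memory agent.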

LiteLLM — in-cluster LLM gateway (gated on litellm.enabled)

  • litellm-deployment.yaml + litellm-service.yaml + litellm-config.yaml (ConfigMap).
  • Postgres persistence on by default: a texera_litellm database is created by the Postgres init script, and the deployment sets DATABASE_URL + STORE_MODEL_IN_DB=true so keys, spend, and model config survive restarts.
  • access-control-service wired to LiteLLM (LITELLM_BASE_URL, LITELLM_MASTER_KEY, copilot enabled).
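
As a rough sketch of the persistence wiring, the two variables named above could be assembled as follows. The PgSettings shape, the helper names, and the host, user, and password values are hypothetical; only DATABASE_URL, STORE_MODEL_IN_DB=true, and the texera_litellm database name come from this PR. In the chart itself these values are presumably rendered from Helm values and the shared Secret rather than built in code.

```typescript
// Hypothetical connection settings, for illustration only.
interface PgSettings {
  user: string;
  password: string;
  host: string;
  port: number;
  database: string;
}

// Build a postgresql:// URL, percent-encoding the parts that may contain
// reserved characters such as "@" or "/".
function buildDatabaseUrl(s: PgSettings): string {
  const user = encodeURIComponent(s.user);
  const password = encodeURIComponent(s.password);
  const database = encodeURIComponent(s.database);
  return `postgresql://${user}:${password}@${s.host}:${s.port}/${database}`;
}

// The two environment variables the description says the deployment sets.
function litellmEnv(s: PgSettings): Record<string, string> {
  return {
    DATABASE_URL: buildDatabaseUrl(s),
    STORE_MODEL_IN_DB: "true",
  };
}

const env = litellmEnv({
  user: "texera",            // hypothetical
  password: "p@ss/word",     // hypothetical; shows why encoding matters
  host: "postgres",          // hypothetical in-cluster service name
  port: 5432,
  database: "texera_litellm",
});
console.log(env.DATABASE_URL);
// postgresql://texera:p%40ss%2Fword@postgres:5432/texera_litellm
```

Encoding the password matters when it is supplied via --set or an override, since an unescaped "@" or "/" would otherwise change how the URL is parsed.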

A shared Opaque Secret holds the agent gateway key, the LiteLLM master key, and the provider API keys (supply via --set/override; none committed).

Any related issues, documentation, discussions?

Closes #5269

Also implements the in-cluster LiteLLM Helm support tracked by #4108 (supersedes the approach in #4109).

How was this PR tested?

Ran helm lint and helm template against the chart, with the subchart dependencies: entries stripped locally so the render needs no remote charts:

helm lint .                                        # 1 chart(s) linted, 0 chart(s) failed
helm template texera . -f values-development.yaml  # RC=0, 50 objects, no errors

Verified the rendered output: agent deployment env + /api/healthcheck probes; agent-service-svc; the /api/agents -> agent-service-svc HTTPRoute and the BackendTrafficPolicy targeting it (consistent hash on X-Agent-Workflow-Id); LiteLLM DATABASE_URL=postgresql://…/texera_litellm + STORE_MODEL_IN_DB; access-control-service LITELLM_BASE_URL/MASTER_KEY; and the idempotent CREATE DATABASE texera_litellm in the Postgres init script. Not applied to a live cluster.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (1M context)

The agent-service image is built and runs under single-node compose but had
no Helm deployment, and there was no in-cluster LLM gateway. This adds both,
mirroring the proven preview/production chart:

Agent service
- deployment + service (gated on agentService.enabled), wired to in-cluster
  service DNS using the env names the service actually reads
  (TEXERA_DASHBOARD_SERVICE_ENDPOINT, LLM_ENDPOINT,
  WORKFLOW_COMPILING_SERVICE_ENDPOINT, EXECUTION_ENDPOINT_TEMPLATE).
- a dedicated /api/agents HTTPRoute plus a BackendTrafficPolicy that
  consistent-hashes on X-Agent-Workflow-Id, so a workflow's requests always
  reach the replica holding its in-memory agent.
- readiness/liveness on /api/healthcheck.

LiteLLM (in-cluster LLM gateway)
- deployment + service + config ConfigMap (gated on litellm.enabled).
- Postgres persistence enabled by default: a texera_litellm database created
  by the postgres init script, with DATABASE_URL + STORE_MODEL_IN_DB so keys,
  spend, and model config survive restarts.
- access-control-service wired to LiteLLM (LITELLM_BASE_URL/MASTER_KEY,
  copilot enabled).

A shared Opaque Secret holds the agent gateway key, the LiteLLM master key,
and the provider API keys (supply via --set / override; none committed).

Closes apache#5269
@bobbai00 bobbai00 force-pushed the feat/5269-agent-service-k8s branch from ede4eec to 7620a52 Compare May 29, 2026 00:15
@bobbai00 bobbai00 changed the title feat(k8s): add agent service deployment to Helm chart feat(k8s): add agent service and LiteLLM to the Helm chart May 29, 2026
Development

Successfully merging this pull request may close these issues.

Add agent service deployment to the Kubernetes Helm chart
