docs(blog): Routing agent traffic is really three decisions #5847
biefy wants to merge 21 commits
Adds a blog post and companion example manifests showing how to route agent LLM traffic on AKS across three layers — semantic routing (RouteLLM), an AI gateway (agentgateway), and inference-aware serving (Gateway API Inference Extension over KAITO/vLLM) — with a single OpenAI-compatible endpoint, real command output, and a live observability dashboard.
Pull request overview
Adds a new AKS Engineering Blog post and companion example manifests that demonstrate routing agent LLM traffic across three layers on AKS (semantic routing, AI gateway policy, and inference-aware replica selection) behind a single OpenAI-compatible endpoint.
Changes:
- Adds a new blog post with end-to-end walkthrough plus architecture/request-path diagrams.
- Adds example manifests (KAITO Workspace, Inference Extension objects, Gateway API resources, agentgateway config, and Managed Prometheus PodMonitors) with a README.
- Adds a new blog author entry.
Reviewed changes
Copilot reviewed 8 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| website/blog/authors.yml | Adds new author key fuyuan-bie for the blog post. |
| website/blog/2026-06-29-llm-routing-on-aks/index.md | New post describing the three-layer routing pattern and how to deploy/validate it. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-request-path.svg | New request-path diagram used by the post. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-architecture.svg | New architecture/ownership diagram used by the post. |
| examples/llm-routing-on-aks/README.md | Companion README explaining file roles, apply order, and gotchas. |
| examples/llm-routing-on-aks/podmonitors.yaml | Adds Managed Prometheus PodMonitors for vLLM and agentgateway metrics scraping. |
| examples/llm-routing-on-aks/kaito-workspace.yaml | Adds KAITO Workspace manifest to serve the “weak” model via vLLM. |
| examples/llm-routing-on-aks/inference-pool.yaml | Adds InferencePool and InferenceObjective resources for inference-aware selection. |
| examples/llm-routing-on-aks/inference-gateway.yaml | Adds Gateway API Gateway + HTTPRoute to front the InferencePool. |
| examples/llm-routing-on-aks/agentgateway-config.yaml | Adds agentgateway application config to route strong/weak backends and document model catalog placement. |
Defines two new tags (agentgateway, routellm) in tags.yml and updates the post's frontmatter to feature them in place of gpu, since they are the components the post centers on.
- Collapse double blank lines left by removed section dividers (MD012).
- Use spaced table separators to match house style (MD060).
- PodMonitors: drop the empty `port: ""` and scrape by `targetPort` only, per Copilot review: an empty port string can fail validation.
Rewrite "an OpenAI- and agent-native proxy" to "OpenAI-compatible, agent-native" in both occurrences so the compound modifier reads cleanly.
- Title → "Routing agent traffic is really three decisions" (idea-forward; AKS stays in the description), and update the README link text.
- agentgateway-config.yaml: add a working token rate limit (route policy) and the model-cost catalog under the top-level `config:` key, so the file actually contains the policies the post and README describe, per Copilot.
- Reword the post to match: rate limit is a route policy, cost catalog is under `config:`.
Replace the rendered metrics chart with an actual Azure Managed Grafana dashboard capture, and update the alt text to describe the panels and values it actually shows.
Per review, kaito-workspace.yaml targets namespace llm but the post never created it, so a fresh-cluster apply (and the later helm install -n llm) would fail. Add kubectl create namespace llm before the first namespaced apply.
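For readers who prefer to keep namespace creation declarative, a minimal sketch of the equivalent manifest follows. The commit itself adds the imperative `kubectl create namespace llm` to the post; this manifest is an alternative form, not something the PR ships.

```yaml
# Declarative equivalent of `kubectl create namespace llm`.
# Apply it before any manifest that sets `namespace: llm`.
apiVersion: v1
kind: Namespace
metadata:
  name: llm
```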
Per review, move both PodMonitors out of kube-system (reserved for
system components) into the llm workload namespace, which Azure
Managed Prometheus supports ("can be deployed in any namespace"). They
now sit alongside the pods they scrape, so the cross-namespace
namespaceSelector is dropped. Verified on the live cluster: the
ama-metrics target allocator discovers podMonitor/llm/vllm-podmonitor
and reports it up.
Per review (anson627): the param failure is litellm raising UnsupportedParamsError for the target model, not the frontier model itself rejecting the params — reword accordingly. Also make the InferenceObjective explanation explicit that it replaces the older InferenceModel (dropping modelName/criticality for an integer priority) instead of vaguely "redesigning" it.
Per review, the KAITO Workspace field layout varies across versions. Add a comment noting this manifest targets kaito.sh/v1beta1 (top-level resource:/inference:), validated against the AKS add-on in mid-2026, and pointing readers at kubectl explain workspace.resource if their add-on serves a different API version.
Add a short caveat to the threshold-tuning section: cached input tokens cost a fraction of uncached ones, both the Azure OpenAI and vLLM prefix caches are per-model/per-prefix, and per-call routing works against them — so marginal cost diverges from the rate card. Reinforces measuring billed cost over a static token rate, and notes session/task-granularity routing where caching dominates.
Per review, the "can live anywhere" comment was misleading: a PodMonitor without a namespaceSelector only discovers pods in its own namespace. Reword to explain these sit in llm precisely because they're same-namespace as the pods, and note that moving one requires adding a namespaceSelector.
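To make the same-namespace constraint concrete, here is a sketch of what the vLLM PodMonitor might look like if it were moved out of `llm`. The `monitoring` namespace and the `targetPort` value are assumptions for illustration, and the `apiVersion` is assumed to be the Azure Managed Prometheus CRD group; the monitor name and pod label are the ones mentioned in this PR.

```yaml
# Sketch only: vllm-podmonitor relocated to a hypothetical `monitoring` namespace.
apiVersion: azmonitoring.coreos.com/v1   # assumed to match podmonitors.yaml
kind: PodMonitor
metadata:
  name: vllm-podmonitor
  namespace: monitoring                  # hypothetical; the PR keeps this in `llm`
spec:
  namespaceSelector:
    matchNames:
      - llm                              # needed once the monitor leaves the pods' namespace
  selector:
    matchLabels:
      kaito.sh/workspace: workspace-phi-4-mini   # label KAITO stamps on the pods
  podMetricsEndpoints:
    - targetPort: 5000                   # assumed port; use the value from podmonitors.yaml
      path: /metrics
```

Without the `namespaceSelector` block, this monitor would only look for matching pods in `monitoring` and find none.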
    -d '{"model":"gpt-5.1","messages":[{"role":"user","content":"Say no in one word."}]}'

    # weak → KAITO via the Endpoint Picker
    curl -s http://agentgateway.llm.svc.cluster.local:4000/weak \
Medium (docs — asymmetric strong/weak curl shapes).
The strong-path and weak-path smoke-test curl commands are asymmetric: strong hits /strong/v1/chat/completions (line 199) while weak hits just /weak (line 204). RouteLLM appends /v1/chat/completions to api_base for both models, so the weak curl shape doesn't reflect what the router will actually send in production.
The weak route's urlRewrite: { path: { full: /v1/chat/completions } } policy collapses both /weak and /weak/v1/chat/completions to the same upstream path, so both shapes "work" for the smoke test — but the asymmetric demo confuses readers about the real traffic shape. Either:
- Make the demo symmetric (`/weak/v1/chat/completions`), or
- Add one line explaining why the demo intentionally uses the shorter form even though RouteLLM doesn't.
— multi-agent panel review; flagged by: copilot
      name: phi-weak
      namespace: llm
    spec:
      priority: 0 # integer priority; no modelName/criticality in v1
Medium (correctness — priority: 0 reads as highest priority).
The blog now nicely explains (line 105) that InferenceObjective replaces the older InferenceModel and takes only an integer priority — but the manifest itself still sets priority: 0 for the only objective with no comment on what "0" means. Upstream migration-guide examples use 1 / 2 (small-segment-lora=1, base-model=2), and the semantics documented there are lower number = higher priority. Setting 0 for the "weak" workload reads as "highest priority" — which contradicts the post's narrative that strong traffic carries the cost/latency pressure.
Either bump to a sane mid-range integer, or add an inline comment on this line that priority is not consulted when only one objective exists in the pool (which is the case here).
— multi-agent panel review; flagged by: claude
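One way the inline-comment option could look is sketched below. It is a fragment based on the lines quoted in this thread, not the final manifest; the comment wording is illustrative, and fields not shown (such as `apiVersion` and the pool reference) are assumed unchanged.

```yaml
# Fragment of inference-pool.yaml; fields not shown are unchanged.
kind: InferenceObjective
metadata:
  name: phi-weak
  namespace: llm
spec:
  # Only one objective targets this pool, so the priority has nothing
  # to be ranked against. 0 is a placeholder here, not a statement
  # that weak traffic outranks strong traffic.
  priority: 0   # integer priority; no modelName/criticality in v1
```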
    resource:
      instanceType: "Standard_NC24ads_A100_v4" # KAITO provisions this node pool
      count: 2
      labelSelector:
Medium (docs — labelSelector here is node placement, not pod selection).
resource.labelSelector.matchLabels: { apps: phi-4-mini } on lines 14-16 controls where KAITO schedules its pods (node placement), NOT the label the pods themselves end up carrying. The InferencePool selector in inference-pool.yaml:9 and the vLLM PodMonitor in podmonitors.yaml:25 both select pods on kaito.sh/workspace: workspace-phi-4-mini — which is the label KAITO stamps on the pods it creates.
Currently works; the two are semantically different but both happen to match. A reader who edits the apps: value expecting it to reshape pool membership or PodMonitor scrape will be confused when nothing changes downstream. One inline comment here — e.g. # node placement — not the pod label; the pods carry kaito.sh/workspace instead — closes the trap.
— multi-agent panel review; flagged by: claude
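A sketch of the suggested inline comment, applied to the fragment quoted in this thread. The field values come from the review text; the comment wording is one possible phrasing, and the rest of the manifest is assumed unchanged.

```yaml
# Fragment of kaito-workspace.yaml (kaito.sh/v1beta1 layout); other fields unchanged.
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # KAITO provisions this node pool
  count: 2
  labelSelector:
    matchLabels:
      # Node placement only: this is not the label the pods carry.
      # The InferencePool and the vLLM PodMonitor select pods on
      # kaito.sh/workspace: workspace-phi-4-mini instead.
      apps: phi-4-mini
```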
    # workloadIdentity: {}
    # backendTLS: {}
    backendAuth:
      key: "<your-azure-openai-api-key>"
Medium (docs — inconsistent placeholder convention is a footgun).
Placeholder convention is inconsistent across the file: "<your-azure-openai-api-key>" on line 56, "<your-aoai-resource>" on line 68. Both read like plausible literal values. A reader who pastes through kubectl create configmap --from-file=… will see agentgateway start up and the first /strong call return an auth failure that doesn't obviously trace back to "you didn't substitute the placeholder."
Two small improvements:
- Use a visually distinct marker that can't be mistaken for a real value: `REPLACE_ME_AOAI_API_KEY` and `REPLACE_ME_AOAI_RESOURCE`.
- Add a one-line shell snippet in the README showing the `envsubst`/`sed` substitution step readers should run before creating the ConfigMap.
— multi-agent panel review; flagged by: claude
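As a sketch of the first suggestion, the quoted fragment with the proposed marker might read as follows. Only the key shown in this thread is included; `REPLACE_ME_AOAI_RESOURCE` would go wherever the `<your-aoai-resource>` placeholder currently sits, which is not quoted here.

```yaml
# Fragment of agentgateway-config.yaml; surrounding keys unchanged.
backendAuth:
  key: "REPLACE_ME_AOAI_API_KEY"   # substitute before creating the ConfigMap
```

A marker in this style is easy to search for before applying, so an unsubstituted value can be caught ahead of the first `/strong` call.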
Summary
Adds a new AKS Engineering Blog post and companion example manifests on routing agent LLM traffic across three layers on AKS: semantic routing (RouteLLM), an AI gateway (agentgateway), and inference-aware serving (Gateway API Inference Extension over KAITO/vLLM).
agentgateway is the front door: its `inferenceRouting` backend policy speaks ext-proc directly to the Inference Extension Endpoint Picker, so there is no separate Gateway API gateway. The result is a single OpenAI-compatible endpoint where cheap calls land on a self-hosted model and hard calls escalate to Azure OpenAI, with every call metered and traced.
What's included
- `website/blog/2026-06-29-llm-routing-on-aks/`: the post, a hero image, two SVG diagrams (architecture + request path), and a live observability dashboard image.
- `examples/llm-routing-on-aks/`: KAITO `Workspace`, `InferencePool`/`InferenceObjective`, the agentgateway config (with the `inferenceRouting` policy that calls the Endpoint Picker directly), and Managed-Prometheus `PodMonitor`s, with a README. There is no `Gateway`/`HTTPRoute`; agentgateway is the front door.
- `website/blog/authors.yml`: new author entry (`fuyuan-bie`).
Notes
Test plan
- `cd website && npm run build` succeeds (verified locally)
- `markdownlint` passes against the repo blog lint config
- `kubectl apply --dry-run=server` validates the InferencePool/InferenceObjective against live v1.0.0 CRDs