docs(blog): Routing agent traffic is really three decisions by biefy · Pull Request #5847 · Azure/AKS

biefy · 2026-06-30T02:15:09Z

Summary

Adds a new AKS Engineering Blog post and companion example manifests on routing agent LLM traffic across three layers on AKS:

Semantic routing (which model) — RouteLLM
AI gateway (auth, cost, guardrails, and the inference-aware front door) — agentgateway
Inference-aware serving (which replica) — Gateway API Inference Extension over KAITO/vLLM

agentgateway is the front door: its inferenceRouting backend policy speaks ext-proc directly to the Inference Extension Endpoint Picker, so there is no separate Gateway API gateway. The result is a single OpenAI-compatible endpoint where cheap calls land on a self-hosted model and hard calls escalate to Azure OpenAI, with every call metered and traced.

What's included

website/blog/2026-06-29-llm-routing-on-aks/ — the post, a hero image, two SVG diagrams (architecture + request path), and a live observability dashboard image.
examples/llm-routing-on-aks/ — KAITO Workspace, InferencePool/InferenceObjective, the agentgateway config (with the inferenceRouting policy that calls the Endpoint Picker directly), and Managed-Prometheus PodMonitors, with a README. There is no Gateway/HTTPRoute — agentgateway is the front door.
website/blog/authors.yml — new author entry (fuyuan-bie).

Notes

Every step was validated end-to-end on a live AKS cluster; the post includes real command output and a real metrics screenshot.
Because agentgateway calls the Endpoint Picker directly, the design needs no Gateway API gateway and no App Routing add-on — it works today without the preview-gated managed App Routing + Inference Extension pairing.

Test plan

`cd website && npm run build` succeeds (verified locally)
`markdownlint` passes against the repo blog lint config
Post renders with the hero, both diagrams, the observability image, and admonitions
Author and tag keys resolve
`kubectl apply --dry-run=server` validates the `InferencePool`/`InferenceObjective` against live v1.0.0 CRDs

Adds a blog post and companion example manifests showing how to route agent LLM traffic on AKS across three layers — semantic routing (RouteLLM), an AI gateway (agentgateway), and inference-aware serving (Gateway API Inference Extension over KAITO/vLLM) — with a single OpenAI-compatible endpoint, real command output, and a live observability dashboard.

Copilot

Pull request overview

Adds a new AKS Engineering Blog post and companion example manifests that demonstrate routing agent LLM traffic across three layers on AKS (semantic routing, AI gateway policy, and inference-aware replica selection) behind a single OpenAI-compatible endpoint.

Changes:

Adds a new blog post with end-to-end walkthrough plus architecture/request-path diagrams.
Adds example manifests (KAITO Workspace, Inference Extension objects, Gateway API resources, agentgateway config, and Managed Prometheus PodMonitors) with a README.
Adds a new blog author entry.

Reviewed changes

Copilot reviewed 8 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| website/blog/authors.yml | Adds new author key `fuyuan-bie` for the blog post. |
| website/blog/2026-06-29-llm-routing-on-aks/index.md | New post describing the three-layer routing pattern and how to deploy/validate it. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-request-path.svg | New request-path diagram used by the post. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-architecture.svg | New architecture/ownership diagram used by the post. |
| examples/llm-routing-on-aks/README.md | Companion README explaining file roles, apply order, and gotchas. |
| examples/llm-routing-on-aks/podmonitors.yaml | Adds Managed Prometheus `PodMonitor`s for vLLM and agentgateway metrics scraping. |
| examples/llm-routing-on-aks/kaito-workspace.yaml | Adds KAITO `Workspace` manifest to serve the “weak” model via vLLM. |
| examples/llm-routing-on-aks/inference-pool.yaml | Adds `InferencePool` and `InferenceObjective` resources for inference-aware selection. |
| examples/llm-routing-on-aks/inference-gateway.yaml | Adds Gateway API `Gateway` + `HTTPRoute` to front the `InferencePool`. |
| examples/llm-routing-on-aks/agentgateway-config.yaml | Adds agentgateway application config to route strong/weak backends and document model catalog placement. |

Defines two new tags (agentgateway, routellm) in tags.yml and updates the post's frontmatter to feature them in place of gpu, since they are the components the post centers on.

- Collapse double blank lines left by removed section dividers (MD012).
- Use spaced table separators to match house style (MD060).
- PodMonitors: drop the empty `port: ""` and scrape by `targetPort` only, per Copilot review; an empty port string can fail validation.

Copilot

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 1 comment.

Rewrite "an OpenAI- and agent-native proxy" to "OpenAI-compatible, agent-native" in both occurrences so the compound modifier reads cleanly.

Copilot

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 2 comments.

- Title → "Routing agent traffic is really three decisions" (idea-forward; AKS stays in the description), and update the README link text.
- agentgateway-config.yaml: add a working token rate limit (route policy) and the model-cost catalog under the top-level `config:` key, so the file actually contains the policies the post and README describe, per Copilot.
- Reword the post to match: rate limit is a route policy, cost catalog is under `config:`.

Replace the rendered metrics chart with an actual Azure Managed Grafana dashboard capture, and update the alt text to describe the panels and values it actually shows.

Copilot

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Per review, `kaito-workspace.yaml` targets namespace `llm` but the post never created it, so a fresh-cluster apply (and the later `helm install -n llm`) would fail. Add `kubectl create namespace llm` before the first namespaced apply.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Per review, move both PodMonitors out of kube-system (reserved for system components) into the llm workload namespace, which Azure Managed Prometheus supports ("can be deployed in any namespace"). They now sit alongside the pods they scrape, so the cross-namespace namespaceSelector is dropped. Verified on the live cluster: the ama-metrics target allocator discovers podMonitor/llm/vllm-podmonitor and reports it up.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

Per review (anson627): the param failure is litellm raising UnsupportedParamsError for the target model, not the frontier model itself rejecting the params — reword accordingly. Also make the InferenceObjective explanation explicit that it replaces the older InferenceModel (dropping modelName/criticality for an integer priority) instead of vaguely "redesigning" it.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Per review, the KAITO Workspace field layout varies across versions. Add a comment noting this manifest targets `kaito.sh/v1beta1` (top-level `resource:`/`inference:`), validated against the AKS add-on in mid-2026, and pointing readers at `kubectl explain workspace.resource` if their add-on serves a different API version.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

Add a short caveat to the threshold-tuning section: cached input tokens cost a fraction of uncached ones, both the Azure OpenAI and vLLM prefix caches are per-model/per-prefix, and per-call routing works against them — so marginal cost diverges from the rate card. Reinforces measuring billed cost over a static token rate, and notes session/task-granularity routing where caching dominates.
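
To make the caching caveat concrete, here is a minimal sketch of the arithmetic. Every rate and the cached fraction are hypothetical placeholders, not Azure OpenAI or vLLM prices, and `input_cost` is an illustrative helper rather than anything in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: billed input cost for one call when part of the prompt
# is served from a prefix cache.
#   $1 = input tokens
#   $2 = fraction of those tokens that hit the cache (0..1)
#   $3 = uncached rate, USD per 1M tokens (made-up value in the example below)
#   $4 = cached rate, USD per 1M tokens (made-up value in the example below)
input_cost() {
  LC_ALL=C awk -v t="$1" -v f="$2" -v r="$3" -v c="$4" \
    'BEGIN { printf "%.4f\n", t * ((1 - f) * r + f * c) / 1000000 }'
}

# Same 10,000-token prompt and same rate card, different cache behaviour.
warm=$(input_cost 10000 0.8 2.00 0.20)   # session stays on one model, 80% cache hits
cold=$(input_cost 10000 0.0 2.00 0.20)   # per-call routing keeps missing the cache
echo "warm cache: \$${warm}  cold cache: \$${cold}"
```

With these made-up numbers the warm call bills 0.0056 and the cold call 0.0200, even though the rate card is identical, which is why the commit recommends measuring billed cost.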

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Per review, the "can live anywhere" comment was misleading: a PodMonitor without a namespaceSelector only discovers pods in its own namespace. Reword to explain these sit in llm precisely because they're same-namespace as the pods, and note that moving one requires adding a namespaceSelector.

Copilot

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

nilekhc · 2026-07-01T21:50:49Z

+  -d '{"model":"gpt-5.1","messages":[{"role":"user","content":"Say no in one word."}]}'
+
+# weak → KAITO via the Endpoint Picker
+curl -s http://agentgateway.llm.svc.cluster.local:4000/weak \


Medium (docs — asymmetric strong/weak curl shapes).

The strong-path and weak-path smoke-test curl commands are asymmetric: strong hits /strong/v1/chat/completions (line 199) while weak hits just /weak (line 204). RouteLLM appends /v1/chat/completions to api_base for both models, so the weak curl shape doesn't reflect what the router will actually send in production.

The weak route's urlRewrite: { path: { full: /v1/chat/completions } } policy collapses both /weak and /weak/v1/chat/completions to the same upstream path, so both shapes "work" for the smoke test — but the asymmetric demo confuses readers about the real traffic shape. Either:

Make the demo symmetric (/weak/v1/chat/completions), or

Add one line explaining why the demo intentionally uses the shorter form even though RouteLLM doesn't.

— multi-agent panel review; flagged by: copilot
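
A symmetric smoke test could be sketched as follows. The base URL comes from the weak-path curl in the diff excerpt above. Reusing it for the strong route, the `chat_url` and `smoke` helper names, the `Content-Type` header, and the `WEAK_MODEL` variable are assumptions made for illustration, not part of the PR.

```shell
#!/usr/bin/env bash
# Base URL taken from the weak-path curl in the diff; assumed to serve both routes.
GATEWAY="http://agentgateway.llm.svc.cluster.local:4000"

# Hypothetical helper: build the URL the way the review describes RouteLLM doing it,
# by appending /v1/chat/completions to the per-route api_base.
#   $1 = route name (strong | weak)
chat_url() {
  printf '%s/%s/v1/chat/completions\n' "${GATEWAY%/}" "$1"
}

# Hypothetical helper: send the same request shape to either route.
#   $1 = route (strong | weak), $2 = model name for the request body
smoke() {
  curl -s "$(chat_url "$1")" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$2\",\"messages\":[{\"role\":\"user\",\"content\":\"Say no in one word.\"}]}"
}

# Print both URLs so the symmetry is visible without needing a cluster.
chat_url strong
chat_url weak

# On a cluster with the example deployed (WEAK_MODEL is a placeholder):
#   smoke strong gpt-5.1
#   smoke weak "$WEAK_MODEL"
```

Both routes then exercise the same path shape the router would send, so the weak route's `urlRewrite` is tested against realistic input rather than the shortened form.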

nilekhc · 2026-07-01T21:51:00Z

+  name: phi-weak
+  namespace: llm
+spec:
+  priority: 0                                    # integer priority; no modelName/criticality in v1


Medium (correctness — priority: 0 reads as highest priority).

The blog now nicely explains (line 105) that InferenceObjective replaces the older InferenceModel and takes only an integer priority — but the manifest itself still sets priority: 0 for the only objective with no comment on what "0" means. Upstream migration-guide examples use 1 / 2 (small-segment-lora=1, base-model=2), and the semantics documented there are lower number = higher priority. Setting 0 for the "weak" workload reads as "highest priority" — which contradicts the post's narrative that strong traffic carries the cost/latency pressure.

Either bump to a sane mid-range integer, or add an inline comment on this line that priority is not consulted when only one objective exists in the pool (which is the case here).

— multi-agent panel review; flagged by: claude

nilekhc · 2026-07-01T21:51:08Z

+resource:
+  instanceType: "Standard_NC24ads_A100_v4"   # KAITO provisions this node pool
+  count: 2
+  labelSelector:


Medium (docs — labelSelector here is node placement, not pod selection).

resource.labelSelector.matchLabels: { apps: phi-4-mini } on lines 14-16 controls where KAITO schedules its pods (node placement), NOT the label the pods themselves end up carrying. The InferencePool selector in inference-pool.yaml:9 and the vLLM PodMonitor in podmonitors.yaml:25 both select pods on kaito.sh/workspace: workspace-phi-4-mini — which is the label KAITO stamps on the pods it creates.

Currently works; the two are semantically different but both happen to match. A reader who edits the apps: value expecting it to reshape pool membership or PodMonitor scrape will be confused when nothing changes downstream. One inline comment here — e.g. # node placement — not the pod label; the pods carry kaito.sh/workspace instead — closes the trap.

— multi-agent panel review; flagged by: claude

nilekhc · 2026-07-01T21:51:24Z

+              #         workloadIdentity: {}
+              #     backendTLS: {}
+              backendAuth:
+                key: "<your-azure-openai-api-key>"


Medium (docs — inconsistent placeholder convention is a footgun).

Placeholder convention is inconsistent across the file: "<your-azure-openai-api-key>" on line 56, "<your-aoai-resource>" on line 68. Both read like plausible literal values. A reader who pastes through kubectl create configmap --from-file=… will see agentgateway start up and the first /strong call return an auth failure that doesn't obviously trace back to "you didn't substitute the placeholder."

Two small improvements:

Use a visually distinct marker that can't be mistaken for a real value: REPLACE_ME_AOAI_API_KEY and REPLACE_ME_AOAI_RESOURCE.

Add a one-line shell snippet in the README showing the envsubst / sed substitution step readers should run before creating the ConfigMap.

— multi-agent panel review; flagged by: claude
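
The substitution step suggested above might look like the sketch below. It assumes the config already uses the proposed `REPLACE_ME_AOAI_API_KEY` and `REPLACE_ME_AOAI_RESOURCE` markers. The `render_config` helper and the `AOAI_API_KEY` and `AOAI_RESOURCE` variable names are hypothetical. It also assumes neither value contains `|`, `&`, or a backslash, which a plain `sed` replacement would mangle.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render a copy of the config with placeholders filled in,
# and refuse to continue if any REPLACE_ME_ marker survives.
#   $1 = template path, $2 = rendered output path
render_config() {
  if [ -z "${AOAI_API_KEY:-}" ] || [ -z "${AOAI_RESOURCE:-}" ]; then
    echo "set AOAI_API_KEY and AOAI_RESOURCE first" >&2
    return 1
  fi
  sed -e "s|REPLACE_ME_AOAI_API_KEY|${AOAI_API_KEY}|g" \
      -e "s|REPLACE_ME_AOAI_RESOURCE|${AOAI_RESOURCE}|g" \
      "$1" > "$2"
  # Any leftover marker means a placeholder was added without a matching rule.
  if grep -q 'REPLACE_ME_' "$2"; then
    echo "unsubstituted placeholder left in $2" >&2
    return 1
  fi
}
```

A reader would export the two variables, run `render_config` on the example config, and feed the rendered file to the `kubectl create configmap --from-file=…` step. The trailing `grep` turns a forgotten placeholder into an immediate failure instead of a later auth error on the first `/strong` call.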

biefy requested review from a team, colinmixonn and Copilot June 30, 2026 02:15

biefy requested review from a team and palma21 as code owners June 30, 2026 02:15

biefy requested a review from serbrech June 30, 2026 02:15

Copilot started reviewing on behalf of biefy June 30, 2026 02:15

Copilot AI reviewed Jun 30, 2026

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated

biefy added 2 commits June 29, 2026 19:20

Add agentgateway and routellm tags; swap gpu for them on the post

8dbcead

Copilot AI review requested due to automatic review settings June 30, 2026 02:27

Copilot started reviewing on behalf of biefy June 30, 2026 02:28

Copilot AI reviewed Jun 30, 2026

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated

biefy added 2 commits June 29, 2026 19:32

Fix split hyphenated modifier per review

817c167

Uppercase admonition titles to match house style

a5ca483

Copilot AI review requested due to automatic review settings June 30, 2026 02:40

Copilot started reviewing on behalf of biefy June 30, 2026 02:41

Copilot AI reviewed Jun 30, 2026

Comment thread examples/llm-routing-on-aks/README.md Outdated

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated

biefy changed the title ~~Add blog post: Routing agent traffic on AKS is three decisions~~ docs(blog): Routing agent traffic is really three decisions Jun 30, 2026

Use real Grafana screenshot for the observability image

531383f

Copilot AI review requested due to automatic review settings June 30, 2026 02:58

Copilot started reviewing on behalf of biefy June 30, 2026 02:59

Copilot AI reviewed Jun 30, 2026

Comment thread examples/llm-routing-on-aks/agentgateway-config.yaml

Comment thread examples/llm-routing-on-aks/agentgateway-config.yaml

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md

Potential fix for pull request finding

6462395

Copilot AI review requested due to automatic review settings June 30, 2026 03:02

Copilot started reviewing on behalf of biefy June 30, 2026 03:03

Copilot AI reviewed Jun 30, 2026

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated

biefy requested a review from Copilot June 30, 2026 03:06

Copilot AI reviewed Jun 30, 2026

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md

Comment thread examples/llm-routing-on-aks/README.md

Create the llm namespace before applying the first manifest

49d4d61

biefy requested a review from Copilot June 30, 2026 17:59

Copilot started reviewing on behalf of biefy June 30, 2026 18:00

Copilot AI reviewed Jun 30, 2026

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated

biefy requested a review from Copilot June 30, 2026 18:08

Copilot started reviewing on behalf of biefy June 30, 2026 18:09

Copilot AI reviewed Jun 30, 2026

anson627 reviewed Jun 30, 2026

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated

biefy requested a review from Copilot June 30, 2026 22:10

Copilot started reviewing on behalf of biefy June 30, 2026 22:11

Copilot AI reviewed Jun 30, 2026

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md

Comment thread examples/llm-routing-on-aks/kaito-workspace.yaml

biefy requested a review from Copilot June 30, 2026 22:17

Copilot started reviewing on behalf of biefy June 30, 2026 22:18

Copilot AI reviewed Jun 30, 2026

biefy requested a review from Copilot June 30, 2026 22:56

Copilot started reviewing on behalf of biefy June 30, 2026 22:57

Copilot AI reviewed Jun 30, 2026

Comment thread examples/llm-routing-on-aks/podmonitors.yaml

Comment thread examples/llm-routing-on-aks/podmonitors.yaml

biefy requested a review from Copilot June 30, 2026 23:03

Copilot started reviewing on behalf of biefy June 30, 2026 23:03

Copilot AI reviewed Jun 30, 2026

nilekhc reviewed Jul 1, 2026
