docs(blog): Routing agent traffic is really three decisions #5847

Open
biefy wants to merge 21 commits into Azure:master from biefy:fuyuanbie/llm-routing-on-aks

Conversation

biefy commented Jun 30, 2026

Member

Summary

Adds a new AKS Engineering Blog post and companion example manifests on routing agent LLM traffic across three layers on AKS:

  • Semantic routing (which model) — RouteLLM
  • AI gateway (auth, cost, guardrails, and the inference-aware front door) — agentgateway
  • Inference-aware serving (which replica) — Gateway API Inference Extension over KAITO/vLLM

agentgateway is the front door: its inferenceRouting backend policy speaks ext-proc directly to the Inference Extension Endpoint Picker, so there is no separate Gateway API gateway. The result is a single OpenAI-compatible endpoint where cheap calls land on a self-hosted model and hard calls escalate to Azure OpenAI, with every call metered and traced.

What's included

  • website/blog/2026-06-29-llm-routing-on-aks/ — the post, a hero image, two SVG diagrams (architecture + request path), and a live observability dashboard image.
  • examples/llm-routing-on-aks/ — KAITO Workspace, InferencePool/InferenceObjective, the agentgateway config (with the inferenceRouting policy that calls the Endpoint Picker directly), and Managed-Prometheus PodMonitors, with a README. There is no Gateway/HTTPRoute — agentgateway is the front door.
  • website/blog/authors.yml — new author entry (fuyuan-bie).

Notes

  • Every step was validated end-to-end on a live AKS cluster; the post includes real command output and a real metrics screenshot.
  • Because agentgateway calls the Endpoint Picker directly, the design needs no Gateway API gateway and no App Routing add-on — it works today without the preview-gated managed App Routing + Inference Extension pairing.

Test plan

  • cd website && npm run build succeeds (verified locally)
  • markdownlint passes against the repo blog lint config
  • Post renders with the hero, both diagrams, the observability image, and admonitions
  • Author and tag keys resolve
  • kubectl apply --dry-run=server validates the InferencePool/InferenceObjective against live v1.0.0 CRDs

Adds a blog post and companion example manifests showing how to route
agent LLM traffic on AKS across three layers — semantic routing
(RouteLLM), an AI gateway (agentgateway), and inference-aware serving
(Gateway API Inference Extension over KAITO/vLLM) — with a single
OpenAI-compatible endpoint, real command output, and a live
observability dashboard.
biefy requested review from a team, colinmixonn and Copilot June 30, 2026 02:15
biefy requested review from a team and palma21 as code owners June 30, 2026 02:15
biefy requested a review from serbrech June 30, 2026 02:15

Copilot AI left a comment

Pull request overview

Adds a new AKS Engineering Blog post and companion example manifests that demonstrate routing agent LLM traffic across three layers on AKS (semantic routing, AI gateway policy, and inference-aware replica selection) behind a single OpenAI-compatible endpoint.

Changes:

  • Adds a new blog post with end-to-end walkthrough plus architecture/request-path diagrams.
  • Adds example manifests (KAITO Workspace, Inference Extension objects, Gateway API resources, agentgateway config, and Managed Prometheus PodMonitors) with a README.
  • Adds a new blog author entry.

Reviewed changes

Copilot reviewed 8 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| website/blog/authors.yml | Adds new author key fuyuan-bie for the blog post. |
| website/blog/2026-06-29-llm-routing-on-aks/index.md | New post describing the three-layer routing pattern and how to deploy/validate it. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-request-path.svg | New request-path diagram used by the post. |
| website/blog/2026-06-29-llm-routing-on-aks/llm-routing-on-aks-architecture.svg | New architecture/ownership diagram used by the post. |
| examples/llm-routing-on-aks/README.md | Companion README explaining file roles, apply order, and gotchas. |
| examples/llm-routing-on-aks/podmonitors.yaml | Adds Managed Prometheus PodMonitors for vLLM and agentgateway metrics scraping. |
| examples/llm-routing-on-aks/kaito-workspace.yaml | Adds KAITO Workspace manifest to serve the “weak” model via vLLM. |
| examples/llm-routing-on-aks/inference-pool.yaml | Adds InferencePool and InferenceObjective resources for inference-aware selection. |
| examples/llm-routing-on-aks/inference-gateway.yaml | Adds Gateway API Gateway + HTTPRoute to front the InferencePool. |
| examples/llm-routing-on-aks/agentgateway-config.yaml | Adds agentgateway application config to route strong/weak backends and document model catalog placement. |

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated
Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated
biefy added 2 commits June 29, 2026 19:20
Defines two new tags (agentgateway, routellm) in tags.yml and updates
the post's frontmatter to feature them in place of gpu, since they are
the components the post centers on.
- Collapse double blank lines left by removed section dividers (MD012).
- Use spaced table separators to match house style (MD060).
- PodMonitors: drop the empty `port: ""` and scrape by `targetPort` only,
  per Copilot review — an empty port string can fail validation.
Copilot AI review requested due to automatic review settings June 30, 2026 02:27

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 1 comment.

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated
biefy added 2 commits June 29, 2026 19:32
Rewrite "an OpenAI- and agent-native proxy" to "OpenAI-compatible,
agent-native" in both occurrences so the compound modifier reads cleanly.
Copilot AI review requested due to automatic review settings June 30, 2026 02:40

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 2 comments.

Comment thread examples/llm-routing-on-aks/README.md Outdated
Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated
- Title → "Routing agent traffic is really three decisions" (idea-forward;
  AKS stays in the description), and update the README link text.
- agentgateway-config.yaml: add a working token rate limit (route policy)
  and the model-cost catalog under the top-level `config:` key, so the file
  actually contains the policies the post and README describe, per Copilot.
- Reword the post to match: rate limit is a route policy, cost catalog is
  under `config:`.
biefy changed the title from "Add blog post: Routing agent traffic on AKS is three decisions" to "docs(blog): Routing agent traffic is really three decisions" Jun 30, 2026
Replace the rendered metrics chart with an actual Azure Managed Grafana
dashboard capture, and update the alt text to describe the panels and
values it actually shows.
Copilot AI review requested due to automatic review settings June 30, 2026 02:58

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 3 comments.

Comment thread examples/llm-routing-on-aks/agentgateway-config.yaml
Comment thread examples/llm-routing-on-aks/agentgateway-config.yaml
Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md
Copilot AI review requested due to automatic review settings June 30, 2026 03:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 12 changed files in this pull request and generated 1 comment.

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated
biefy requested a review from Copilot June 30, 2026 03:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md
Comment thread examples/llm-routing-on-aks/README.md
Per review, kaito-workspace.yaml targets namespace llm but the post
never created it, so a fresh-cluster apply (and the later helm install
-n llm) would fail. Add kubectl create namespace llm before the first
namespaced apply.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated
Comment thread examples/llm-routing-on-aks/podmonitors.yaml Outdated
Per review, move both PodMonitors out of kube-system (reserved for
system components) into the llm workload namespace, which Azure
Managed Prometheus supports ("can be deployed in any namespace"). They
now sit alongside the pods they scrape, so the cross-namespace
namespaceSelector is dropped. Verified on the live cluster: the
ama-metrics target allocator discovers podMonitor/llm/vllm-podmonitor
and reports it up.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated
Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md Outdated
Per review (anson627): the param failure is litellm raising
UnsupportedParamsError for the target model, not the frontier model
itself rejecting the params — reword accordingly. Also make the
InferenceObjective explanation explicit that it replaces the older
InferenceModel (dropping modelName/criticality for an integer
priority) instead of vaguely "redesigning" it.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Comment thread website/blog/2026-06-29-llm-routing-on-aks/index.md
Comment thread examples/llm-routing-on-aks/kaito-workspace.yaml
Per review, the KAITO Workspace field layout varies across versions.
Add a comment noting this manifest targets kaito.sh/v1beta1 (top-level
resource:/inference:), validated against the AKS add-on in mid-2026,
and pointing readers at kubectl explain workspace.resource if their
add-on serves a different API version.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

Add a short caveat to the threshold-tuning section: cached input
tokens cost a fraction of uncached ones, both the Azure OpenAI and
vLLM prefix caches are per-model/per-prefix, and per-call routing
works against them — so marginal cost diverges from the rate card.
Reinforces measuring billed cost over a static token rate, and notes
session/task-granularity routing where caching dominates.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated 2 comments.

Comment thread examples/llm-routing-on-aks/podmonitors.yaml
Comment thread examples/llm-routing-on-aks/podmonitors.yaml
Per review, the "can live anywhere" comment was misleading: a
PodMonitor without a namespaceSelector only discovers pods in its own
namespace. Reword to explain these sit in llm precisely because
they're same-namespace as the pods, and note that moving one requires
adding a namespaceSelector.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 12 changed files in this pull request and generated no new comments.

-d '{"model":"gpt-5.1","messages":[{"role":"user","content":"Say no in one word."}]}'

# weak → KAITO via the Endpoint Picker
curl -s http://agentgateway.llm.svc.cluster.local:4000/weak \

Medium (docs — asymmetric strong/weak curl shapes).

The strong-path and weak-path smoke-test curl commands are asymmetric: strong hits /strong/v1/chat/completions (line 199) while weak hits just /weak (line 204). RouteLLM appends /v1/chat/completions to api_base for both models, so the weak curl shape doesn't reflect what the router will actually send in production.

The weak route's urlRewrite: { path: { full: /v1/chat/completions } } policy collapses both /weak and /weak/v1/chat/completions to the same upstream path, so both shapes "work" for the smoke test — but the asymmetric demo confuses readers about the real traffic shape. Either:

  • Make the demo symmetric (/weak/v1/chat/completions), or
  • Add one line explaining why the demo intentionally uses the shorter form even though RouteLLM doesn't.
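A minimal sketch of the symmetric option, for illustration only. The gateway address and the prompt are taken from the smoke test in the post; the helper names `chat_url` and `smoke_test` are hypothetical, and the weak model name is a placeholder because the served name is not shown in this thread.

```shell
#!/usr/bin/env bash
# Assumed in-cluster base URL, copied from the smoke test in the post.
GATEWAY="http://agentgateway.llm.svc.cluster.local:4000"

# Build the URL the way RouteLLM would: api_base plus the OpenAI chat path.
# $1 is the route prefix ("strong" or "weak").
chat_url() {
  printf '%s/%s/v1/chat/completions\n' "$GATEWAY" "$1"
}

# Send one chat completion through the given route.
# $1 = route prefix, $2 = model name (hypothetical helper, not part of the PR).
smoke_test() {
  local route="$1" model="$2"
  curl -s "$(chat_url "$route")" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"Say no in one word.\"}]}"
}

# Print both URLs so the identical shape is visible before sending traffic.
chat_url strong
chat_url weak
```

From a pod inside the cluster, `smoke_test strong gpt-5.1` and `smoke_test weak <weak-model-name>` would then exercise both routes with the same path shape the router sends.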

— multi-agent panel review; flagged by: copilot

  name: phi-weak
  namespace: llm
spec:
  priority: 0 # integer priority; no modelName/criticality in v1

Medium (correctness — priority: 0 reads as highest priority).

The blog now nicely explains (line 105) that InferenceObjective replaces the older InferenceModel and takes only an integer priority — but the manifest itself still sets priority: 0 for the only objective with no comment on what "0" means. Upstream migration-guide examples use 1 / 2 (small-segment-lora=1, base-model=2), and the semantics documented there are lower number = higher priority. Setting 0 for the "weak" workload reads as "highest priority" — which contradicts the post's narrative that strong traffic carries the cost/latency pressure.

Either bump to a sane mid-range integer, or add an inline comment on this line that priority is not consulted when only one objective exists in the pool (which is the case here).

— multi-agent panel review; flagged by: claude

resource:
  instanceType: "Standard_NC24ads_A100_v4" # KAITO provisions this node pool
  count: 2
  labelSelector:

Medium (docs — labelSelector here is node placement, not pod selection).

resource.labelSelector.matchLabels: { apps: phi-4-mini } on lines 14-16 controls where KAITO schedules its pods (node placement), NOT the label the pods themselves end up carrying. The InferencePool selector in inference-pool.yaml:9 and the vLLM PodMonitor in podmonitors.yaml:25 both select pods on kaito.sh/workspace: workspace-phi-4-mini — which is the label KAITO stamps on the pods it creates.

Currently works; the two are semantically different but both happen to match. A reader who edits the apps: value expecting it to reshape pool membership or PodMonitor scrape will be confused when nothing changes downstream. One inline comment here — e.g. # node placement — not the pod label; the pods carry kaito.sh/workspace instead — closes the trap.

— multi-agent panel review; flagged by: claude

# workloadIdentity: {}
# backendTLS: {}
backendAuth:
  key: "<your-azure-openai-api-key>"

Medium (docs — inconsistent placeholder convention is a footgun).

Placeholder convention is inconsistent across the file: "<your-azure-openai-api-key>" on line 56, "<your-aoai-resource>" on line 68. Both read like plausible literal values. A reader who pastes through kubectl create configmap --from-file=… will see agentgateway start up and the first /strong call return an auth failure that doesn't obviously trace back to "you didn't substitute the placeholder."

Two small improvements:

  • Use a visually distinct marker that can't be mistaken for a real value: REPLACE_ME_AOAI_API_KEY and REPLACE_ME_AOAI_RESOURCE.
  • Add a one-line shell snippet in the README showing the envsubst / sed substitution step readers should run before creating the ConfigMap.
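One possible shape for that substitution step, as a sketch. It assumes the config adopts the `REPLACE_ME_AOAI_API_KEY` and `REPLACE_ME_AOAI_RESOURCE` markers proposed above; the function name `render_config` and the environment variable names `AOAI_API_KEY` and `AOAI_RESOURCE` are hypothetical.

```shell
#!/usr/bin/env bash
# render_config SRC DST
# Copies SRC to DST with both placeholders substituted from the environment,
# and fails if any REPLACE_ME_ marker is still present afterwards.
# Assumption: neither value contains '|', '&', or a backslash, which sed
# would treat specially in the replacement text.
render_config() {
  local src="$1" dst="$2"
  sed -e "s|REPLACE_ME_AOAI_API_KEY|${AOAI_API_KEY:?set AOAI_API_KEY}|g" \
      -e "s|REPLACE_ME_AOAI_RESOURCE|${AOAI_RESOURCE:?set AOAI_RESOURCE}|g" \
      "$src" > "$dst"
  if grep -q 'REPLACE_ME_' "$dst"; then
    echo "unsubstituted placeholder left in $dst" >&2
    return 1
  fi
}
```

With both variables exported, `render_config agentgateway-config.yaml rendered.yaml` writes the substituted copy, and `rendered.yaml` is the file to hand to the ConfigMap step. The trailing `grep` check turns a forgotten placeholder into an immediate local failure rather than an auth error on the first /strong call.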

— multi-agent panel review; flagged by: claude

5 participants