
docs(blog): add post on streaming vLLM weights from Azure Blob Storage#5845

Open
surajssd wants to merge 15 commits into Azure:master from surajssd:add-run-ai-blog


surajssd (Member) commented Jun 26, 2026

Summary

Adds a new AKS Engineering Blog post, "Stream Model Weights to NVIDIA GPU running vLLM from Azure Blob Storage on AKS" (website/blog/2026-06-26-runai-streamer-vllm/index.md). The post is a runnable end-to-end walkthrough for serving microsoft/phi-4 with vLLM on AKS while streaming model weights directly from Azure Blob Storage via the RunAI Model Streamer's native az:// scheme. It uses a fully managed A100 GPU node pool and workload identity, so no storage account keys are needed, and explains why streaming beats the default download-then-load path for autoscaling inference cold starts.

Changes

  • New blog post (index.md, ~640 lines) walking through: deploying an AKS cluster with OIDC + workload identity and a managed GPU node pool (--enable-managed-gpu=true), creating a premium block-blob storage account, wiring up workload identity for keyless Blob access, an in-cluster upload Job that pushes microsoft/phi-4 weights to Blob, and a vLLM Deployment that streams them via --load-format runai_streamer. It includes "Why stream" and "Trade-offs and downsides" sections, a verification step (spotting the "Loading safetensors using Runai Model Streamer" log line), and a conclusion.
  • Four diagrams: a hero image plus 1-why-stream-vs-download.png, 2-identity.png (workload-identity trust chain), and 3-end-to-end.png (end-to-end flow), each with descriptive alt text.
  • New author entry hariharan-sethuraman added to website/blog/authors.yml; post co-authored with suraj-deshmukh.
  • Hardening from review feedback, incorporated across the branch: wired serviceAccountName: ${SERVICE_ACCOUNT_NAME} into both pod specs (and the Job's envsubst allowlist) with an idempotent kubectl create serviceaccount step, added set -euo pipefail and pinned huggingface-hub>=0.34 in the upload Job, reserved ephemeral-storage to avoid DiskPressure eviction, added an az feature show poll so the node-pool step doesn't race feature registration, added envsubst to Prerequisites, fixed the phi-4 size figure, corrected the workload-identity label explanation, and depersonalized the Configuration variables.
  • Formatting: converted blockquote callouts to Docusaurus :::note/:::caution/:::tip admonitions and replaced the Mermaid diagrams with images.
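The corrected phi-4 size figure (14.7B parameters, ~29 GB on disk in bf16, per the later commit on the branch) follows from bf16 storing 2 bytes per parameter. A quick arithmetic check, assuming safetensors header and file-format overhead are small enough to ignore:

```python
PARAMS = 14.7e9        # phi-4 parameter count
BYTES_PER_PARAM = 2    # bf16 = 16 bits per parameter

size_bytes = PARAMS * BYTES_PER_PARAM
size_gb = size_bytes / 1e9      # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes

print(f"{size_gb:.1f} GB ({size_gib:.1f} GiB)")  # -> 29.4 GB (27.4 GiB)
```

So the upload Job's scratch space presumably needs at least that much ephemeral storage, before any headroom.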

Test Plan

  • npm run build succeeds in website/ (static site compiles; post renders at /2026/06/26/runai-streamer-vllm with both authors).
  • markdownlint-cli2 passes against blog/linters/.markdown-lint.yml (0 errors).
  • codespell clean with the repo's ignore list.
  • Embedded Job and Deployment YAML manifests parse as valid Kubernetes objects.
  • Reviewer confirms the embedded az/kubectl walkthrough on a live AKS cluster with A100 quota, and that vllm/vllm-openai:v0.23.0 pulls.

- add `2026-06-26-runai-streamer-vllm` post covering serving
  `microsoft/phi-4` with vLLM on AKS, streaming weights from Azure Blob
  via the RunAI Model Streamer (`az://`) with workload identity
- add `hariharan-sethuraman` author entry to `authors.yml` (placeholder
  details pending)

Signed-off-by: Suraj Deshmukh <[email protected]>

Copilot AI left a comment


Pull request overview

Adds a new AKS blog post that walks through serving microsoft/phi-4 with vLLM while streaming weights directly from Azure Blob Storage via the RunAI Model Streamer (az://) using workload identity, plus a new author profile entry to support the post.

Changes:

  • Added a new blog post: streaming vLLM weights from Azure Blob on AKS with workload identity.
  • Added a new author key (hariharan-sethuraman) to website/blog/authors.yml.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File | Description
--- | ---
website/blog/authors.yml | Adds a new author entry used by the new blog post.
website/blog/2026-06-26-runai-streamer-vllm/index.md | New end-to-end tutorial post for streaming vLLM weights from Azure Blob Storage on AKS.

surajssd added 2 commits June 26, 2026 16:18
- reword the S3/GCS-to-Azure-Blob intro for clearer phrasing
- add a closing transition sentence to the cold-start section
- add a note on bumping the upload Job timeout for larger (70B+) models

Signed-off-by: Suraj Deshmukh <[email protected]>
Upload Job reliability:
- add `set -euo pipefail` so a failed `curl | tar` azcopy download fails
  loudly instead of silently
- pin `huggingface-hub>=0.34` to guarantee the renamed `hf` CLI is
  present
- reserve disk via an `emptyDir` scratch volume and `ephemeral-storage`
  requests/limits to avoid `DiskPressure` eviction
- document deleting the immutable `Job` before re-applying to avoid
  `AlreadyExists`
- rewrite the timeout note to cover disk sizing for larger models, not
  just time

Walkthrough correctness:
- add `--overwrite` to the `kubectl annotate`/`label serviceaccount`
  commands so the section is idempotent
- wait for `/health` before the first `curl` so it does not race
  `kubectl port-forward`
- clarify that the `azure.workload.identity/use` pod-template label (not
  the SA label) drives token injection

Editorial:
- standardize on `NVIDIA`, `Microsoft Entra ID`, and `labeled`
- add the missing trailing period on the `--load-format runai_streamer`
  line

Signed-off-by: Suraj Deshmukh <[email protected]>
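The "wait for `/health` before the first `curl`" step amounts to a polling loop. A minimal Python sketch, not the post's actual command: it assumes `kubectl port-forward` exposes vLLM's OpenAI-compatible server on `localhost:8000` (vLLM's default port), and the `wait_for_health` helper name and timeouts are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_health(url: str = "http://localhost:8000/health",
                    timeout_s: float = 300.0,
                    interval_s: float = 2.0) -> bool:
    """Poll `url` until it answers HTTP 200, or give up after `timeout_s`.

    Hypothetical helper: the port and timeouts are assumptions, not values
    taken from the blog post.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused (port-forward not ready yet) or a non-2xx
            # status (urllib raises HTTPError, an OSError subclass) while
            # vLLM is still loading weights: keep polling.
            pass
        time.sleep(interval_s)
    return False
```

Polling the health endpoint rather than sleeping a fixed time keeps the walkthrough correct whether the weights stream in seconds or minutes.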
Copilot AI review requested due to automatic review settings June 27, 2026 00:02

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

surajssd added 2 commits June 29, 2026 11:18
Blog post (`2026-06-26-runai-streamer-vllm`):
- replace "checkpoint" with "model weights" throughout for consistent
  terminology
- reword list item 3 from "so pods" to "that lets pods" to fix the
  grammar and match the parallel list structure
- add a "Tuning the streamer" note documenting
  `--model-loader-extra-config` options (`distributed`, `concurrency`,
  `memory_limit`)

Author metadata:
- replace the `TODO` placeholders in the `hariharan-sethuraman`
  `authors.yml` entry with the real LinkedIn URL and GitHub handle

Signed-off-by: Suraj Deshmukh <[email protected]>
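The `--model-loader-extra-config` flag takes a JSON object alongside `--load-format runai_streamer`. A minimal Python sketch of assembling those flags for a container `args` list; the values used below (`concurrency=16`, a 5 GiB `memory_limit` expressed in bytes) are illustrative assumptions rather than recommendations from the post, so check the vLLM and RunAI Model Streamer documentation for each key's semantics.

```python
import json
from typing import Optional


def streamer_args(concurrency: Optional[int] = None,
                  memory_limit_bytes: Optional[int] = None,
                  distributed: Optional[bool] = None) -> list[str]:
    """Build vLLM CLI flags for loading weights with the RunAI Model Streamer.

    Only the keys that are set end up in --model-loader-extra-config.
    """
    args = ["--load-format", "runai_streamer"]
    extra = {}
    if concurrency is not None:
        extra["concurrency"] = concurrency
    if memory_limit_bytes is not None:
        extra["memory_limit"] = memory_limit_bytes
    if distributed is not None:
        extra["distributed"] = distributed
    if extra:
        # vLLM expects this flag's value to be a JSON object.
        args += ["--model-loader-extra-config", json.dumps(extra)]
    return args


print(streamer_args(concurrency=16, memory_limit_bytes=5 * 1024**3))
```

Leaving the extra config out entirely keeps the streamer on its defaults, which is a reasonable starting point before tuning.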
…l creation

- poll `az feature show` until `ManagedGPUExperiencePreview` reports
  `Registered` so the step 1c node-pool command does not fail while the
  feature is still registering
- refresh the resource provider with `az provider register` once the
  feature is registered

Signed-off-by: Suraj Deshmukh <[email protected]>
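The register-then-poll sequence can be sketched in Python around the Azure CLI. This is an illustration rather than the post's shell snippet, and it assumes `az` is installed and logged in, that the preview feature lives in the `Microsoft.ContainerService` namespace, and that the helper names and poll interval (both hypothetical) suit your environment. The state lookup is injectable so the loop can be exercised without Azure.

```python
import subprocess
import time
from typing import Callable

NAMESPACE = "Microsoft.ContainerService"  # assumption: AKS preview features live here
FEATURE = "ManagedGPUExperiencePreview"


def az_feature_state(namespace: str = NAMESPACE, name: str = FEATURE) -> str:
    """Return the feature's registration state, e.g. 'Registering' or 'Registered'."""
    out = subprocess.run(
        ["az", "feature", "show", "--namespace", namespace, "--name", name,
         "--query", "properties.state", "-o", "tsv"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def wait_until_registered(get_state: Callable[[], str] = az_feature_state,
                          interval_s: float = 10.0,
                          max_polls: int = 90) -> None:
    """Block until get_state() reports 'Registered', or raise TimeoutError."""
    for _ in range(max_polls):
        if get_state() == "Registered":
            return
        time.sleep(interval_s)
    raise TimeoutError(f"{FEATURE} did not reach 'Registered'")
```

Once `wait_until_registered()` returns, the provider refresh from the commit above is `az provider register --namespace Microsoft.ContainerService`, after which the node-pool command no longer races registration.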
Copilot AI review requested due to automatic review settings June 29, 2026 18:23

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings June 29, 2026 18:55
…nsides

- swap the Mermaid "why stream" flowchart for an optimized
  `1-why-stream-vs-download.png` (1200px, 256-color, ~400 KB) with
  descriptive alt text
- add a "Trade-offs and downsides" section covering per-cold-start
  streaming cost, the one-upload-per-model step, Safetensors-only
  support, and owning a second copy of the weights

Signed-off-by: Suraj Deshmukh <[email protected]>

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings June 29, 2026 19:00

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.

surajssd added 2 commits June 29, 2026 12:13
… with images

- swap the §3 workload-identity trust-chain Mermaid flowchart for
  `2-identity.png` with descriptive alt text
- swap the "How it fits together" Mermaid flowchart for
  `3-end-to-end.png` with descriptive alt text

Signed-off-by: Suraj Deshmukh <[email protected]>
…clusion

Content:
- convert the five blockquote callouts (why-upload, role-assignment
  permissions, propagation wait, streamer tuning, scaling note) to
  Docusaurus `:::note`/`:::caution`/`:::tip` admonitions
- add a §5 step showing how to confirm the weights loaded via the RunAI
  streamer by spotting the `Loading safetensors using Runai Model
  Streamer` log line
- add a Conclusion section summarizing the approach and when to adopt it
- minor intro and workload-identity prose rewording

Assets:
- re-export `1-why-stream-vs-download.png` and `3-end-to-end.png`
  (slightly smaller files)

Signed-off-by: Suraj Deshmukh <[email protected]>
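The log-line check from the new §5 step can be scripted. A minimal sketch, assuming the vLLM pod logs have already been captured as text (for example from `kubectl logs` on the Deployment); `loaded_via_streamer` is a hypothetical helper that only does a substring match on the log line quoted above.

```python
STREAMER_MARKER = "Loading safetensors using Runai Model Streamer"


def loaded_via_streamer(log_text: str) -> bool:
    """Return True if any log line shows vLLM loading weights via the RunAI streamer."""
    return any(STREAMER_MARKER in line for line in log_text.splitlines())
```

A plain substring match tolerates whatever timestamp or logger prefix precedes the message, at the cost of breaking silently if a future vLLM release rewords the line.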
Copilot AI review requested due to automatic review settings June 29, 2026 20:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 5 changed files in this pull request and generated 6 comments.

surajssd added 2 commits June 29, 2026 13:21
…post

- replace personal names in `AZURE_RESOURCE_GROUP` and
  `STORAGE_ACCOUNT_NAME` with generic values
- turn `AZURE_REGION`, `NODE_POOL_VM_SIZE`, and `STORAGE_ACCOUNT_NAME`
  into descriptive placeholders
- note that readers should modify the variables to match their
  environment

Signed-off-by: Suraj Deshmukh <[email protected]>
- wire `serviceAccountName: ${SERVICE_ACCOUNT_NAME}` into the Job and
  Deployment pod specs, add it to the Job `envsubst` allowlist, and add
  an idempotent `kubectl create serviceaccount` step so a non-default
  SERVICE_ACCOUNT_NAME no longer breaks Blob auth
- fix the phi-4 size contradiction: `~14 GB` is now `14.7B-parameter,
  ~29 GB on disk in bf16`, matching the ephemeral-storage comment
- remove the incorrect claim that pods inherit the
  `azure.workload.identity/use` label from their ServiceAccount
- add `envsubst` (GNU gettext) to the Prerequisites
- rewrite the front-matter `description` to ~155 chars for SEO

Signed-off-by: Suraj Deshmukh <[email protected]>
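The `envsubst` allowlist matters because the Job manifest also carries shell variables that must reach the container unexpanded. GNU `envsubst` given a SHELL-FORMAT argument (for example `envsubst '${SERVICE_ACCOUNT_NAME}'`) replaces only the variables named there. The Python sketch below mimics that behavior to show the effect; it is an illustration, not a replacement for `envsubst`, and the `envsubst_allowlist` name is hypothetical.

```python
import re
from typing import Mapping

# Matches $VAR and ${VAR}, the two forms GNU envsubst recognizes.
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def envsubst_allowlist(text: str, env: Mapping[str, str], allow: set[str]) -> str:
    """Substitute only allowlisted variables; leave every other $VAR untouched."""
    def repl(m: re.Match) -> str:
        name = m.group(1) or m.group(2)
        if name in allow:
            # Like envsubst, an unset allowlisted variable becomes "".
            return env.get(name, "")
        return m.group(0)
    return _VAR.sub(repl, text)
```

Without the allowlist, every `$VAR` in the embedded Job script would be expanded from the operator's shell, usually to an empty string, before the manifest ever reaches the cluster.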
Copilot AI review requested due to automatic review settings June 29, 2026 20:51

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 5 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings June 30, 2026 00:15

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 6 changed files in this pull request and generated 3 comments.

- reword the opening to mention Kubernetes and tidy phrasing ("on
  Kubernetes", "back-to-back", "However")
- add a sentence on what faster cold starts mean for production
  inference on AKS (failure recovery, rollouts, autoscaling)
- drop the AWS S3/GCS lead-in to focus the availability note on Azure
  Blob, and use "As of ... now supported"
- link `HuggingFace Hub` and say "entire model" in the diagram alt text
- expand the closing line to cover how the win compounds with larger
  models and busier autoscaling

Signed-off-by: Suraj Deshmukh <[email protected]>
Copilot AI review requested due to automatic review settings June 30, 2026 17:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 6 changed files in this pull request and generated 3 comments.


Labels: None yet

Projects: None yet


3 participants