docs(blog): add post on streaming vLLM weights from Azure Blob Storage #5845
Open
surajssd wants to merge 15 commits into
- add `2026-06-26-runai-streamer-vllm` post covering serving `microsoft/phi-4` with vLLM on AKS, streaming weights from Azure Blob via the RunAI Model Streamer (`az://`) with workload identity
- add `hariharan-sethuraman` author entry to `authors.yml` (placeholder details pending)

Signed-off-by: Suraj Deshmukh <[email protected]>
Pull request overview
Adds a new AKS blog post that walks through serving microsoft/phi-4 with vLLM while streaming weights directly from Azure Blob Storage via the RunAI Model Streamer (az://) using workload identity, plus a new author profile entry to support the post.
Changes:
- Added a new blog post: streaming vLLM weights from Azure Blob on AKS with workload identity.
- Added a new author key (`hariharan-sethuraman`) to `website/blog/authors.yml`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| website/blog/authors.yml | Adds a new author entry used by the new blog post. |
| website/blog/2026-06-26-runai-streamer-vllm/index.md | New end-to-end tutorial post for streaming vLLM weights from Azure Blob Storage on AKS. |
- reword the S3/GCS-to-Azure-Blob intro for clearer phrasing
- add a closing transition sentence to the cold-start section
- add a note on bumping the upload Job timeout for larger (70B+) models

Signed-off-by: Suraj Deshmukh <[email protected]>
Upload Job reliability:
- add `set -euo pipefail` so a failed `curl | tar` azcopy download fails loudly instead of silently
- pin `huggingface-hub>=0.34` to guarantee the renamed `hf` CLI is present
- reserve disk via an `emptyDir` scratch volume and `ephemeral-storage` requests/limits to avoid `DiskPressure` eviction
- document deleting the immutable `Job` before re-applying to avoid `AlreadyExists`
- rewrite the timeout note to cover disk sizing for larger models, not just time

Walkthrough correctness:
- add `--overwrite` to the `kubectl annotate`/`label serviceaccount` commands so the section is idempotent
- wait for `/health` before the first `curl` so it does not race `kubectl port-forward`
- clarify that the `azure.workload.identity/use` pod-template label (not the SA label) drives token injection

Editorial:
- standardize on `NVIDIA`, `Microsoft Entra ID`, and `labeled`
- add the missing trailing period on the `--load-format runai_streamer` line

Signed-off-by: Suraj Deshmukh <[email protected]>
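The `pipefail` point above can be illustrated without a network. This is a minimal sketch, not the upload Job's actual script: `false` stands in for a failed download and `cat` for a downstream command that still exits 0. Both stand-ins are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Stand-ins for a `download | extract` pipeline (assumed for illustration):
# `false` plays the failed download, `cat` plays a consumer that exits 0.

# Without pipefail, the pipeline reports the status of its last command only.
status_without=$(bash -c 'false | cat; echo $?')

# With pipefail, any failing stage makes the whole pipeline fail.
status_with=$(bash -c 'set -o pipefail; false | cat; echo $?')

echo "without pipefail: ${status_without}"
echo "with pipefail:    ${status_with}"
```

With `set -e` also in effect, the non-zero pipeline status stops the script at that line, which is the "fails loudly" behavior the commit describes.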
Blog post (`2026-06-26-runai-streamer-vllm`):
- replace "checkpoint" with "model weights" throughout for consistent terminology
- reword list item 3 from "so pods" to "that lets pods" to fix the grammar and match the parallel list structure
- add a "Tuning the streamer" note documenting `--model-loader-extra-config` options (`distributed`, `concurrency`, `memory_limit`)

Author metadata:
- replace the `TODO` placeholders in the `hariharan-sethuraman` `authors.yml` entry with the real LinkedIn URL and GitHub handle

Signed-off-by: Suraj Deshmukh <[email protected]>
…l creation

- poll `az feature show` until `ManagedGPUExperiencePreview` reports `Registered` so the step 1c node-pool command does not fail while the feature is still registering
- refresh the resource provider with `az provider register` once the feature is registered

Signed-off-by: Suraj Deshmukh <[email protected]>
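A polling loop of the kind this commit describes might look like the sketch below. It is not copied from the post: the `Microsoft.ContainerService` namespace, the function names, and the attempt and delay values are assumptions. The state lookup lives in its own function so the loop can be exercised without Azure access.

```shell
#!/usr/bin/env bash
# Assumption: the preview feature lives in the Microsoft.ContainerService
# namespace; adjust if the post uses a different one.
get_state() {
  az feature show \
    --namespace Microsoft.ContainerService \
    --name ManagedGPUExperiencePreview \
    --query properties.state -o tsv
}

# wait_for_registered ATTEMPTS DELAY_SECONDS
# Returns 0 once get_state prints "Registered", 1 if attempts run out.
wait_for_registered() {
  local attempts=$1 delay=$2 state i
  for (( i = 1; i <= attempts; i++ )); do
    state=$(get_state) || state=""
    if [ "$state" = "Registered" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (hypothetical values): up to 60 checks, 10 seconds apart.
# wait_for_registered 60 10 && az provider register --namespace Microsoft.ContainerService
```

Bounding the number of attempts keeps a stuck registration from hanging the walkthrough indefinitely.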
…nsides

- swap the Mermaid "why stream" flowchart for an optimized `1-why-stream-vs-download.png` (1200px, 256-color, ~400 KB) with descriptive alt text
- add a "Trade-offs and downsides" section covering per-cold-start streaming cost, the one-upload-per-model step, Safetensors-only support, and owning a second copy of the weights

Signed-off-by: Suraj Deshmukh <[email protected]>
82c3a20 to
91c179e
Compare
… with images

- swap the §3 workload-identity trust-chain Mermaid flowchart for `2-identity.png` with descriptive alt text
- swap the "How it fits together" Mermaid flowchart for `3-end-to-end.png` with descriptive alt text

Signed-off-by: Suraj Deshmukh <[email protected]>
…clusion

Content:
- convert the five blockquote callouts (why-upload, role-assignment permissions, propagation wait, streamer tuning, scaling note) to Docusaurus `:::note`/`:::caution`/`:::tip` admonitions
- add a §5 step showing how to confirm the weights loaded via the RunAI streamer by spotting the `Loading safetensors using Runai Model Streamer` log line
- add a Conclusion section summarizing the approach and when to adopt it
- minor intro and workload-identity prose rewording

Assets:
- re-export `1-why-stream-vs-download.png` and `3-end-to-end.png` (slightly smaller files)

Signed-off-by: Suraj Deshmukh <[email protected]>
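The log check added in §5 could be scripted roughly as follows. This is a sketch under assumptions: the Deployment name `vllm-phi4` is hypothetical, and only the log string itself comes from the commit message.

```shell
#!/usr/bin/env bash
# Reads log text on stdin and succeeds if the streamer's load line appears.
# -F matches the string literally; -q suppresses output.
streamer_used() {
  grep -qF "Loading safetensors using Runai Model Streamer"
}

# Example (hypothetical Deployment name):
# kubectl logs deployment/vllm-phi4 | streamer_used \
#   && echo "weights were streamed" \
#   || echo "streamer log line not found"
```

Matching a fixed string keeps the check independent of whatever progress details follow on the same log line.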
…post

- replace personal names in `AZURE_RESOURCE_GROUP` and `STORAGE_ACCOUNT_NAME` with generic values
- turn `AZURE_REGION`, `NODE_POOL_VM_SIZE`, and `STORAGE_ACCOUNT_NAME` into descriptive placeholders
- note that readers should modify the variables to match their environment

Signed-off-by: Suraj Deshmukh <[email protected]>
- wire `serviceAccountName: ${SERVICE_ACCOUNT_NAME}` into the Job and
Deployment pod specs, add it to the Job `envsubst` allowlist, and add
an idempotent `kubectl create serviceaccount` step so a non-default
SERVICE_ACCOUNT_NAME no longer breaks Blob auth
- fix the phi-4 size contradiction: `~14 GB` is now `14.7B-parameter,
~29 GB on disk in bf16`, matching the ephemeral-storage comment
- remove the incorrect claim that pods inherit the
`azure.workload.identity/use` label from their ServiceAccount
- add `envsubst` (GNU gettext) to the Prerequisites
- rewrite the front-matter `description` to ~155 chars for SEO
Signed-off-by: Suraj Deshmukh <[email protected]>
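The `envsubst` allowlist mentioned in the first item works roughly as sketched below. The template and the service account value are hypothetical, not taken from the post's manifests. When `envsubst` is given a list of variables, it substitutes only those and leaves every other `${...}` reference in the text untouched.

```shell
#!/usr/bin/env bash
export SERVICE_ACCOUNT_NAME="vllm-blob-reader"   # hypothetical value

# Single-quoted so the shell itself expands nothing in the template.
template='serviceAccountName: ${SERVICE_ACCOUNT_NAME}
command: ["bash", "-c", "echo ${HOME}"]'

# Only the allowlisted variable is replaced; ${HOME} survives so the
# container's shell can expand it at run time.
rendered=$(printf '%s\n' "$template" | envsubst '${SERVICE_ACCOUNT_NAME}')
printf '%s\n' "$rendered"
```

A variable missing from the allowlist is left as a literal `${...}` in the rendered manifest, which is consistent with the breakage this commit fixes for a non-default `SERVICE_ACCOUNT_NAME`.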
3992664 to
9e03686
Compare
sdesai345
reviewed
Jun 30, 2026
- reword the opening to mention Kubernetes and tidy phrasing ("on
Kubernetes", "back-to-back", "However")
- add a sentence on what faster cold starts mean for production
inference on AKS (failure recovery, rollouts, autoscaling)
- drop the AWS S3/GCS lead-in to focus the availability note on Azure
Blob, and use "As of ... now supported"
- link `HuggingFace Hub` and say "entire model" in the diagram alt text
- expand the closing line to cover how the win compounds with larger
models and busier autoscaling
Signed-off-by: Suraj Deshmukh <[email protected]>
Summary
Adds a new AKS Engineering Blog post, "Stream Model Weights to NVIDIA GPU running vLLM from Azure Blob Storage on AKS" (`website/blog/2026-06-26-runai-streamer-vllm/index.md`), a runnable end-to-end walkthrough for serving `microsoft/phi-4` with vLLM on AKS while streaming model weights directly from Azure Blob Storage via the RunAI Model Streamer's native `az://` scheme. The post leans on a fully managed A100 GPU node pool and workload identity so no storage keys are needed, and explains why streaming beats the default download-then-load path for autoscaling inference cold starts.

Changes

- New post (`index.md`, ~640 lines) walking through: deploying an AKS cluster with OIDC + workload identity and a managed GPU node pool (`--enable-managed-gpu=true`), creating a premium block-blob storage account, wiring up workload identity for keyless Blob access, an in-cluster upload `Job` that pushes `microsoft/phi-4` weights to Blob, and a vLLM `Deployment` that streams them via `--load-format runai_streamer`. Includes "Why stream", "Trade-offs and downsides", a verification step (spotting the `Loading safetensors using Runai Model Streamer` log line), and a conclusion.
- `1-why-stream-vs-download.png`, `2-identity.png` (workload-identity trust chain), and `3-end-to-end.png` (end-to-end flow), each with descriptive alt text.
- `hariharan-sethuraman` added to `website/blog/authors.yml`; post co-authored with `suraj-deshmukh`.
- Wired `serviceAccountName: ${SERVICE_ACCOUNT_NAME}` into both pod specs (and the `Job`'s `envsubst` allowlist) with an idempotent `kubectl create serviceaccount` step, added `set -euo pipefail` and pinned `huggingface-hub>=0.34` in the upload `Job`, reserved `ephemeral-storage` to avoid `DiskPressure` eviction, added an `az feature show` poll so the node-pool step doesn't race feature registration, added `envsubst` to Prerequisites, fixed the `phi-4` size figure, corrected the workload-identity label explanation, and depersonalized the `Configuration` variables.
- Converted the blockquote callouts to `:::note`/`:::caution`/`:::tip` admonitions and replaced the Mermaid diagrams with images.

Test Plan

- `npm run build` succeeds in `website/` (static site compiles; post renders at `/2026/06/26/runai-streamer-vllm` with both authors).
- `markdownlint-cli2` passes against `blog/linters/.markdown-lint.yml` (0 errors).
- `codespell` clean with the repo's ignore list.
- `Job` and `Deployment` YAML manifests parse as valid Kubernetes objects.
- `az`/`kubectl` walkthrough on a live AKS cluster with A100 quota, and that `vllm/vllm-openai:v0.23.0` pulls.