dstackai · peterschmidt85 · May 23, 2026 · May 23, 2026 · May 23, 2026 · May 23, 2026
diff --git a/mkdocs/docs/examples/accelerators/amd.md b/mkdocs/docs/examples/accelerators/amd.md
@@ -1,17 +1,56 @@
 ---
 title: AMD
-description: Deploying and fine-tuning models on AMD MI300X GPUs using SGLang, vLLM, TRL, and Axolotl
+description: Running dev environments, tasks, and services on AMD GPUs
 ---
 
 # AMD
 
-`dstack` supports running dev environments, tasks, and services on AMD GPUs.
-You can do that by setting up an [SSH fleet](../../concepts/fleets.md#ssh-fleets)
-with on-prem AMD GPUs or configuring a backend that offers AMD GPUs such as the `runpod` backend.
+`dstack` natively supports AMD GPUs. This page covers the basics of setting up
+fleets, running inference, training, and dev environments on AMD GPUs.
 
-## Deployment
+## Fleets
 
-Here are examples of a [service](../../concepts/services.md) that deploy
+`dstack` supports native cloud provisioning, and can also work with existing
+Kubernetes clusters or vanilla bare-metal hosts.
+
+=== "Clouds"
+
+    `dstack` supports native provisioning of VMs with AMD GPUs across a number
+    of clouds, including
+    [AMD Developer Cloud](../../concepts/backends.md#amd-developer-cloud) and
+    [Hot Aisle](../../concepts/backends.md#hot-aisle). More cloud support is
+    coming soon.
+
+    To provision compute in these clouds, configure the corresponding
+    [backend](../../concepts/backends.md) and create a
+    [backend fleet](../../concepts/fleets.md).
+
+=== "Kubernetes"
+
+    To use `dstack` with existing Kubernetes cluster(s), configure the
+    [`kubernetes` backend](../../concepts/backends.md#kubernetes) and point it
+    to your kubeconfig file. Then create a
+    [backend fleet](../../concepts/fleets.md).
+
+=== "SSH fleets"
+
+    If you'd like `dstack` to use a cluster or machine that is already
+    provisioned and that you have access to, create an
+    [SSH fleet](../../concepts/fleets.md).
+
+!!! info "Cluster placement"
+    For multi-node workloads, the fleet must
+    [set](../../concepts/fleets.md#cluster-placement) `placement` to `cluster`.
+    For Kubernetes and SSH fleets, the network must be properly configured.
+
+    To test whether the cluster is properly configured, run the
+    [RCCL tests via a distributed task](../clusters/nccl-rccl-tests.md).
+
+Once a fleet is created, you can run dev environments, tasks, and services.
+
+## Inference
+
+Here are examples of a [service](../../concepts/services.md) that deploys
 `Qwen/Qwen3.6-27B` on AMD MI300X GPUs using
 [SGLang](https://github.com/sgl-project/sglang) and
 [vLLM](https://docs.vllm.ai/en/latest/).
@@ -22,7 +61,7 @@ Here are examples of a [service](../../concepts/services.md) that deploy
 
     ```yaml
     type: service
-    name: qwen36-service-sglang-amd
+    name: qwen36-sglang-amd
 
     image: lmsysorg/sglang:v0.5.10-rocm720-mi30x
 
@@ -50,18 +89,26 @@ Here are examples of a [service](../../concepts/services.md) that deploy
       memory: 896GB..
       shm_size: 16GB
       disk: 450GB..
-      gpu: MI300X:4
+      gpu: MI300X:4..
     ```
 
     </div>
 
+    !!! info "PD disaggregation"
+        To run SGLang with prefill and decode workers on an interconnected
+        cluster of AMD GPU instances, see the
+        [SGLang PD disaggregation](../inference/sglang.md#pd-disaggregation)
+        example.
+
+        For multi-node PD disaggregation, the fleet must use [cluster placement](../../concepts/fleets.md#cluster-placement).
+
 === "vLLM"
 
     <div editor-title="service.dstack.yml">
 
     ```yaml
     type: service
-    name: qwen36-service-vllm-amd
+    name: qwen36-vllm-amd
 
     image: vllm/vllm-openai-rocm:v0.19.1
 
@@ -87,164 +134,123 @@ Here are examples of a [service](../../concepts/services.md) that deploy
       memory: 896GB..
       shm_size: 16GB
       disk: 450GB..
-      gpu: MI300X:4
+      gpu: MI300X:4..
     ```
 
     </div>
 
-!!! info "Docker image"
-    AMD workloads require specifying an image with ROCm-compatible userspace and
-    framework packages. The SGLang and vLLM examples above use pinned ROCm
-    images.
+Use the [`dstack apply`](../../reference/cli/dstack/apply.md) command to apply
+any configuration, including services, tasks, dev environments, and fleets.
+
+<div class="termy">
 
-    If you already have a ROCm-compatible image, use it. Otherwise, choose an
-    image for the framework you use from
-    [ROCm Docker images](https://hub.docker.com/u/rocm), e.g. `rocm/sgl-dev`
-    for SGLang, `rocm/vllm` for vLLM, or `rocm/pytorch` for PyTorch. For
-    generic AMD dev environments or tasks, use `rocm/dev-ubuntu-24.04`.
+```shell
+$ dstack apply -f service.dstack.yml
+```
 
-To request multiple GPUs, specify the quantity after the GPU name, separated by a colon, e.g., `MI300X:4`.
+</div>
 
-## Fine-tuning
+## Training
+
+Below is a [task](../../concepts/tasks.md) that fine-tunes a small language
+model using the official
+[Transformers causal language modeling example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling)
+on AMD GPUs.
+
+<div editor-title="train.dstack.yml">
+
+```yaml
+type: task
+name: amd-qwen3-train
+
+image: rocm/pytorch:latest
+
+commands:
+  - git clone --depth 1 https://github.com/huggingface/transformers.git
+  - pip install -e ./transformers -r transformers/examples/pytorch/language-modeling/requirements.txt
+  - |
+    torchrun --standalone --nproc-per-node $DSTACK_GPUS_PER_NODE \
+      transformers/examples/pytorch/language-modeling/run_clm.py \
+      --model_name_or_path Qwen/Qwen3-0.6B-Base \
+      --dataset_name Salesforce/wikitext \
+      --dataset_config_name wikitext-2-raw-v1 \
+      --do_train \
+      --per_device_train_batch_size 1 \
+      --gradient_accumulation_steps 8 \
+      --max_steps 10 \
+      --block_size 512 \
+      --learning_rate 2e-5 \
+      --bf16 \
+      --logging_steps 1 \
+      --output_dir /tmp/qwen3-clm
+
+resources:
+  gpu: MI300X:4..
+  disk: 100GB..
+```
 
-> If you're planning multi-node AMD training, validate cluster networking first
-with the [NCCL/RCCL tests](../clusters/nccl-rccl-tests.md)
-example.
+</div>
 
-=== "TRL"
+!!! info "Distributed tasks"
+    To run training across multiple nodes, use
+    [distributed tasks](../../concepts/tasks.md#distributed-tasks). Distributed
+    tasks may run on a cluster; in that case, the fleet must use
+    [cluster placement](../../concepts/fleets.md#cluster-placement).
 
-    Below is an example of LoRA fine-tuning Llama 3.1 8B using [TRL](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/single-gpu-fine-tuning-and-inference.html)
-    and the [`mlabonne/guanaco-llama2-1k`](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k)
-    dataset.
+## Dev environments
 
-    <div editor-title="train.dstack.yml">
+Here's an example of a [dev environment](../../concepts/dev-environments.md)
+that can be accessed via your desktop IDE.
 
-    ```yaml
-    type: task
-    name: trl-amd-llama31-train
-
-    # Using Runpod's ROCm Docker image
-    image: runpod/pytorch:2.1.2-py3.10-rocm6.1-ubuntu22.04
-
-    # Required environment variables
-    env:
-      - HF_TOKEN
-    # Mount files
-    files:
-      - train.py
-    # Commands of the task
-    commands:
-      - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
-      - git clone https://github.com/ROCm/bitsandbytes
-      - cd bitsandbytes
-      - git checkout rocm_enabled
-      - pip install -r requirements-dev.txt
-      - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S  .
-      - make
-      - pip install .
-      - pip install trl
-      - pip install peft
-      - pip install transformers datasets huggingface-hub scipy
-      - cd ..
-      - python train.py
-
-    # Uncomment to leverage spot instances
-    #spot_policy: auto
+<div editor-title=".dstack.yml">
 
-    resources:
-      gpu: MI300X
-      disk: 150GB
-    ```
+```yaml
+type: dev-environment
+name: amd-vscode
 
-    </div>
+image: rocm/dev-ubuntu-24.04
 
-=== "Axolotl"
-    Below is an example of fine-tuning Llama 3.1 8B using [Axolotl](https://rocm.blogs.amd.com/artificial-intelligence/axolotl/README.html)
-    and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)
-    dataset.
+ide: vscode
 
-    <div editor-title="train.dstack.yml">
+resources:
+  gpu: MI300X:1
+```
 
-    ```yaml
-    type: task
-    # The name is optional, if not specified, generated randomly
-    name: axolotl-amd-llama31-train
-
-    # Using Runpod's ROCm Docker image
-    image: runpod/pytorch:2.1.2-py3.10-rocm6.0.2-ubuntu22.04
-    # Required environment variables
-    env:
-      - HF_TOKEN
-      - WANDB_API_KEY
-      - WANDB_PROJECT
-      - WANDB_NAME=axolotl-amd-llama31-train
-      - HUB_MODEL_ID
-    # Commands of the task
-    commands:
-      - export PATH=/opt/conda/envs/py_3.10/bin:$PATH
-      - pip uninstall torch torchvision torchaudio -y
-      - python3 -m pip install --pre torch==2.3.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0/
-      - git clone https://github.com/OpenAccess-AI-Collective/axolotl
-      - cd axolotl
-      - git checkout d4f6c65
-      - pip install -e .
-      # Latest pynvml is not compatible with axolotl commit d4f6c65, so we need to fall back to version 11.5.3
-      - pip uninstall pynvml -y
-      - pip install pynvml==11.5.3
-      - cd ..
-      - wget https://dstack-binaries.s3.amazonaws.com/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
-      - pip install flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl
-      - wget https://dstack-binaries.s3.amazonaws.com/xformers-0.0.26-cp310-cp310-linux_x86_64.whl
-      - pip install xformers-0.0.26-cp310-cp310-linux_x86_64.whl
-      - git clone --recurse https://github.com/ROCm/bitsandbytes
-      - cd bitsandbytes
-      - git checkout rocm_enabled
-      - pip install -r requirements-dev.txt
-      - cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S  .
-      - make
-      - pip install .
-      - cd ..
-      - accelerate launch -m axolotl.cli.train -- axolotl/examples/llama-3/fft-8b.yaml
-              --wandb-project "$WANDB_PROJECT"
-              --wandb-name "$WANDB_NAME"
-              --hub-model-id "$HUB_MODEL_ID"
+</div>
 
-    resources:
-      gpu: MI300X
-      disk: 150GB
-    ```
-    </div>
+## Docker image
 
-    Note, to support ROCm, we need to checkout to commit `d4f6c65`. This commit eliminates the need to manually modify the Axolotl source code to make xformers compatible with ROCm, as described in the [xformers workaround](https://docs.axolotl.ai/docs/amd_hpc.html#apply-xformers-workaround). This installation approach is also followed for building Axolotl ROCm docker image. [(See Dockerfile)](https://github.com/ROCm/rocm-blogs/blob/release/blogs/artificial-intelligence/axolotl/src/Dockerfile.rocm).
+> If you'd like a run to use AMD GPUs, make sure to specify `image`.
 
-    > To speed up installation of `flash-attention` and `xformers`, we use pre-built binaries uploaded to S3.
+The image's ROCm runtime must be compatible with the AMD GPUs the run will use.
+The image should also include the packages your workload needs.
 
-## Running a configuration
+## Metrics
 
-Once a configuration is ready, save it to a `.dstack.yml` file. If your
-configuration references environment variables such as `HF_TOKEN` or
-`WANDB_API_KEY`, export them first. Then run
-`dstack apply -f <configuration file>`, and `dstack` will automatically
-provision the cloud resources and run the configuration.
+Run and job [metrics](../../concepts/metrics.md) include CPU, memory, and GPU
+usage. They are available in the UI and via the CLI:
 
 <div class="termy">
 
 ```shell
-$ dstack apply -f <configuration file>
+$ dstack metrics &lt;run name&gt;
 ```
 
 </div>
 
+> AMD GPU metrics require `amd-smi` to be available in the run image. If it
+> isn't present, GPU metrics may be unavailable.
+
 ## What's next?
 
 1. Browse the dedicated [SGLang](../inference/sglang.md)
-   and [vLLM](../inference/vllm.md) examples, plus
-   [Axolotl](https://github.com/ROCm/rocm-blogs/tree/release/blogs/artificial-intelligence/axolotl),
-   [TRL](https://rocm.docs.amd.com/en/latest/how-to/llm-fine-tuning-optimization/fine-tuning-and-inference.html),
-   and [ROCm Bitsandbytes](https://github.com/ROCm/bitsandbytes)
-2. For multi-node training, run
-   [NCCL/RCCL tests](../clusters/nccl-rccl-tests.md)
-   to validate AMD cluster networking.
-3. Check [dev environments](../../concepts/dev-environments.md),
-   [tasks](../../concepts/tasks.md), and
-   [services](../../concepts/services.md).
+   and [vLLM](../inference/vllm.md) examples, plus the
+   [Qwen 3.6](../models/qwen36.md) model page.
+2. For multi-node inference, see
+   [SGLang PD disaggregation](../inference/sglang.md#pd-disaggregation).
+3. For cluster validation, run
+   [NCCL/RCCL tests](../clusters/nccl-rccl-tests.md).
+4. Check [dev environments](../../concepts/dev-environments.md),
+   [tasks](../../concepts/tasks.md), [services](../../concepts/services.md),
+   [fleets](../../concepts/fleets.md), and
+   [backends](../../concepts/backends.md).
diff --git a/mkdocs/docs/examples/training/axolotl.md b/mkdocs/docs/examples/training/axolotl.md
@@ -54,9 +54,6 @@ resources:
 
 The task uses Axolotl's Docker image, where Axolotl is already pre-installed.
 
-!!! info "AMD"
-    The example above uses NVIDIA accelerators. To use it with AMD, check out [AMD](../accelerators/amd.md#axolotl).
-
 ### Run the configuration
 
 Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
@@ -182,4 +179,3 @@ Provisioning...
 1. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md),
    [services](../../concepts/services.md), and [fleets](../../concepts/fleets.md)
 2. Read about [cluster placement](../../concepts/fleets.md#cluster-placement)
-3. See the [AMD](../accelerators/amd.md#axolotl) example
diff --git a/mkdocs/docs/examples/training/trl.md b/mkdocs/docs/examples/training/trl.md
@@ -75,9 +75,6 @@ resources:
 
 Change the `resources` property to specify more GPUs.
 
-!!! info "AMD"
-    The example above uses NVIDIA accelerators. To use it with AMD, check out [AMD](../accelerators/amd.md#trl).
-
 ??? info "DeepSpeed"
     For more memory-efficient use of multiple GPUs, consider using DeepSpeed and ZeRO Stage 3.
 
@@ -269,4 +266,3 @@ Provisioning...
 1. Check [dev environments](../../concepts/dev-environments.md), [tasks](../../concepts/tasks.md), 
    [services](../../concepts/services.md), and [fleets](../../concepts/fleets.md)
 2. Read about [cluster placement](../../concepts/fleets.md#cluster-placement)
-3. See the [AMD](../accelerators/amd.md#trl) example