DGX OpenClaw Stack

A one-command, production-grade local AI agent stack — OpenClaw + vLLM + bge-m3 multilingual embeddings + SearxNG private web search + hybrid (BM25 + vector) memory retrieval, wired together in a single docker compose file.

Calibrated for the NVIDIA GB10 "Grace-Blackwell" Superchip (NVIDIA DGX Spark, ASUS Ascent GB10) running Gemma 4 26B-A4B MoE NVFP4 and Gemma 4 31B IT NVFP4 dense side by side on separate ports (8004 / 8005, separate OpenClaw provider entries) — pick either model in the UI without restarting. Portable to other hardware — swap the LLM for whatever fits your GPU, or point OpenClaw at a cloud LLM API and keep everything else.

The default profile's tuning decisions — NVFP4 quantization, GPU memory split between LLM and embedding, FP8 KV cache, concurrency bands, context-window budgeting — are calibrated to the GB10 Superchip's specific hardware profile: 128 GB of unified LPDDR5X, 273 GB/s bandwidth, and native FP4 tensor-core acceleration (sm_120/sm_121). On a DGX Spark or ASUS Ascent GB10 you get those numbers out of the box. On other hardware everything except the LLM service is reusable as-is.

Who this is for

You are…	What you get	Time to working stack
A GB10 owner (DGX Spark, ASUS Ascent GB10)	The calibrated reference profile. Boot the stack and run Gemma 4 26B-A4B MoE NVFP4 (~25 tok/s decode single-stream, ~112 tok/s aggregate at 4-parallel) with multilingual embeddings, hybrid memory, private web search, and multilingual TTS, all on your own hardware with no cloud. The dense 31B runs alongside on port 8005 as an alternative for parity testing.	~30 min, mostly model download
An x86_64 + NVIDIA GPU operator (RTX 4090, A6000, etc.)	Same wiring; swap `vllm-llm` for a model your VRAM holds (Gemma 4 12B BF16, Qwen 2.5, Llama 3.3). All non-LLM services transfer unchanged.	~30 min + tuning
A cloud-LLM user (OpenAI, Anthropic, OpenRouter, Bedrock, remote vLLM)	Park the local LLM service, point three env vars at your hosted endpoint. You still get the local agent stack: bge-m3 embeddings, SearxNG private search, hybrid memory, dreaming, heartbeat, TTS.	~10 min (no GPU)
A contributor or curious reader	A worked example of a deterministic, opinionated AI agent stack. Every wiring decision has a why in the comments; the patcher is small enough to read in one sitting.	n/a — start with `docs/ARCHITECTURE.md`

If none of those rows describe you, this repo probably isn't your fit — it's optimized for self-hosting on real hardware (or a real cloud LLM), not for trying out a chatbot on a laptop.

What you get

A fully local agent platform (or local-plus-cloud-LLM hybrid — your choice), with:

Component	What it does
Gemma 4 26B-A4B MoE (NVFP4)	25.2B-total / 3.8B-active Google Gemma 4 mixture-of-experts model (128 experts, top-8 routing), quantized by NVIDIA to NVFP4 with the `nvfp4_experts_only` recipe. Native tool calling, 256K context, multimodal (text + image), ~25 tok/s decode single-stream on GB10 with Marlin SM121 backend + CUDA graphs (~4× faster than dense; ~112 tok/s aggregate at 4-parallel).
Gemma 4 31B IT (NVFP4) — concurrent dense	31.3B dense Google Gemma 4 quantized to NVFP4. Runs side-by-side with the MoE on port 8005 (provider id `vllm-dense`), single-user / 256K context / ~6.9 tok/s decode profile. Pick either model in the OpenClaw UI without restarting; the dense backend exists for parity testing, multimodal-heavy workloads where dense quality matters, or as a fallback.
bge-m3 embeddings	BAAI/bge-m3 multilingual dense embeddings via vLLM. 100+ languages, 1024-dim, 8K context, EN↔HU cosine ≈ 0.88.
SearxNG meta-search	Self-hosted, privacy-respecting web search backend wired into OpenClaw's native `webSearch` provider. Strict engine whitelist (DuckDuckGo, Brave, Mojeek, Qwant, Startpage, Wikipedia family, Reddit, GitHub, arXiv) — queries never reach Google / Bing / Yandex / Yahoo / Baidu.
OpenClaw gateway	The open-source agent runtime: Chrome extension UI, CLI, persistent memory, heartbeat, multi-agent world-building.
Multilingual TTS (Fish Audio S2 Pro)	Single self-hosted OpenAI-compatible `/v1/audio/speech` service backed by `fishaudio/s2-pro` (5B param Qwen3-omni) served via SGLang-Omni on a custom CUDA 13 aarch64 image. 80+ languages from one checkpoint (English + Hungarian both supported), voice cloning from any 10-30 s mounted reference WAV+transcript, ~11 GB weights baked at build time. Wired into OpenClaw via `messages.tts.providers.openai`. License: Fish Audio Research License — non-commercial only. Wrapper code MIT.
Whisper STT (EN + HU, turbo)	OpenAI-compatible `/v1/audio/transcriptions` via `deepdml/faster-whisper-large-v3-turbo-ct2` on a self-built CUDA 13 image (~150 LOC FastAPI wrapper around `faster-whisper` — the upstream speaches image rejects Blackwell tensor-core compute types on sm_120, so we self-build to match the `vllm-llm` / `openclaw-tts-fish` wheel pattern). ~1.6 GB VRAM at float16, ~8× faster than vanilla large-v3 (pruned 4-layer decoder), autodetects language. Wired into OpenClaw's `tools.media.audio` pipeline — voice-note uploads in the Control UI chat, Discord voice channels, the VoiceCall CLI, and Talk / Voicewake nodes all transcribe through this service. MIT wrapper + MIT Whisper weights. Swap to `Trendency/whisper-large-v3-hu` via `STT_WHISPER_MODEL` for the HU-finetune (slower, more robust on noisy mic input).
Browser automation (opt-in)	OpenClaw's built-in `browser` tool attaches to a self-hosted Playwright Chromium cluster over the Chrome DevTools Protocol, with one warm Chromium per onboarded credential. Each service needs one manual OAuth onboarding via a noVNC bridge (`./bootstrap-browser-login.sh github-user1`); afterwards the agent reaches authenticated content with no per-call re-auth until the upstream session expires (~14d GitHub, ~30d Notion, etc.). Activate via `--profile browser`. Apache 2.0. Limitation: passkey-only auth flows don't work over noVNC because passkeys are origin-bound under the W3C WebAuthn spec; use password+TOTP or API tokens for those. Details in `docs/reference/browser-automation.md`.
Idempotent config patcher	A small Node script that makes your OpenClaw config deterministic — runs on every `up`, never clobbers onboarding choices it shouldn't. Wires hybrid (BM25 + vector) retrieval with MMR re-rank on top of `memorySearch`, flips the bundled SearxNG plugin on, points the openai TTS provider at the bundled router, upserts the STT entry into `tools.media.audio.models[]`, and writes one `browser.profiles.<name>.cdpUrl` per registered Chromium profile.

Everything lives in one Docker Compose file. No separate vLLM service definitions, no reverse-proxied DNS trickery, no host.docker.internal workarounds — containers reach each other by their compose service name on the default bridge network.

Hardware targets

The reference profile (docker compose up -d with no edits) is designed and tested on:

NVIDIA DGX Spark (GB10 Superchip, 128 GB unified LPDDR5X, 273 GB/s)
ASUS Ascent GB10 (same GB10 Superchip, same memory architecture)

The reference profile works unchanged on any future workstation built around the GB10 Superchip: the stack doesn't depend on DGX- or ASUS-specific firmware, only on the GB10's Blackwell compute capabilities (sm_120/sm_121) and its 128 GB unified memory budget.

The reference profile won't boot as-is on non-GB10 hardware — vllm/vllm-openai:gemma4-cu130 and the NVFP4 model both need Blackwell FP4 kernels. Two supported alternatives, sketched briefly in docs/CUSTOMIZATION.md:

Other NVIDIA GPU: switch to a stock vLLM image and a model that fits your VRAM (smaller Gemma 4 NVFP4 if you have a Blackwell desktop; Gemma 4 12B BF16 / Qwen 2.5 / Llama 3.3 elsewhere). The memory-split and concurrency constants in .env.example will need re-tuning for your card.
Cloud LLM: park the local vLLM services behind profiles: ["never"] and set OPENAI_BASE_URL / LLM_BASE_URL / EMBED_BASE_URL in .env to your hosted endpoints (cloud OpenAI-compatible API, remote vLLM on another box, etc.). bge-m3 stays local by default but can also be remoted. Everything downstream — gateway, SearxNG, hybrid retrieval, dreaming, heartbeat — is unchanged.
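
The cloud-LLM switch above boils down to resolving three base URLs. The following Python sketch illustrates that resolution; it is not the stack's own code. The variable names come from the text above, while the local fallback URLs (compose service names and ports as drawn in the architecture diagram, plus an assumed `/v1` suffix) and the `resolve_endpoints` helper are hypothetical.

```python
import os
from urllib.parse import urlparse

# Hypothetical local defaults: service names and ports follow the architecture
# diagram in this README; the "/v1" suffix is an assumption.
LOCAL_DEFAULTS = {
    "OPENAI_BASE_URL": "http://vllm-llm:8004/v1",
    "LLM_BASE_URL": "http://vllm-llm:8004/v1",
    "EMBED_BASE_URL": "http://vllm-embedding:8005/v1",
}


def resolve_endpoints(env=None):
    """Return {variable: (url, is_remote)} for the three base-URL variables."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in LOCAL_DEFAULTS.items():
        url = (env.get(name) or default).rstrip("/")
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"{name} is not a valid http(s) URL: {url!r}")
        # A host other than the local default's host is treated as remote.
        is_remote = parsed.hostname != urlparse(default).hostname
        resolved[name] = (url, is_remote)
    return resolved
```

Calling `resolve_endpoints()` with no arguments reads the process environment; passing a dict makes the behaviour easy to check before editing `.env`.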

Performance (measured on GB10, single-shot generation)

Scenario	MoE 26B-A4B (default)	Dense 31B (alternative)
Decode throughput, 1 concurrent user	~24.9 tok/s sustained (NVFP4 + Marlin MoE backend on SM121 + CUDA graphs, measured 2026-05-08)	~6.9 tok/s sustained (NVFP4)
Aggregate throughput, 4 concurrent users	~112 tok/s (~28 tok/s per user — continuous-batching + CUDA graphs amortize kernel-launch overhead)	n/a (single-stream profile)
Stable context window, 1 concurrent user	256K reachable (prefill-bound past ~100K — 100K prompt + 200 gen ≈ 70s wall)	~220K tokens before vLLM preemption
Stable context window, 4 simultaneous users (paged)	~4.3× concurrency at 256K (KV blocks are paged at runtime; four users all at the full 256K trigger preemption)	~110K tokens each, continuous batching
Model footprint	~16.5 GB (NVFP4 weights, vision tower included)	~17 GB (NVFP4 weights, vision tower included)
Vision prefill per image	~280 vision tokens for a ≈ 512×512 region, sub-second encode (both)	—
First-boot cold start (after model download)	~3–4 min from `up` to gateway-ready (both)	—
KV cache	FP8 (halves cache footprint vs default BF16 KV cache)	FP8 (same)

Numbers come from a DGX Spark with 128 GB unified LPDDR5X. Single-prompt streaming with a warm KV cache; throughput drops with longer contexts and more concurrent users. Re-tune the LLM_GPU_MEM_UTIL / LLM_MAX_NUM_SEQS constants in .env for other hardware.
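
The derived figures in the table can be reproduced with simple arithmetic. The Python sketch below uses only the measured numbers quoted above; the helper names are hypothetical, and the layer / head dimensions passed to `kv_bytes_per_token` are placeholders, not Gemma 4's real configuration (the formula also assumes uniform full-attention layers).

```python
def per_user_throughput(aggregate_tok_s, users):
    """Split an aggregate decode rate evenly across concurrent users."""
    return aggregate_tok_s / users


def prefill_rate(prompt_tokens, gen_tokens, wall_s, decode_tok_s):
    """Estimate prefill tokens/s by subtracting decode time from wall time."""
    decode_s = gen_tokens / decode_tok_s
    return prompt_tokens / (wall_s - decode_s)


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """K and V tensors: 2 * layers * kv_heads * head_dim elements per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


per_user = per_user_throughput(112, 4)          # 28.0 tok/s, as in the table
moe_vs_dense = 24.9 / 6.9                       # MoE decode speed-up over dense
prefill = prefill_rate(100_000, 200, 70, 24.9)  # implied prefill tokens/s

# Placeholder dimensions (NOT the real model config): 1 byte/elem for FP8,
# 2 bytes/elem for BF16, so the FP8 cache is exactly half the size.
fp8 = kv_bytes_per_token(48, 8, 128, 1)
bf16 = kv_bytes_per_token(48, 8, 128, 2)
```

The FP8-versus-BF16 ratio is independent of the placeholder dimensions, which is why the table can state "halves cache footprint" without naming them.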

Quickstart

Four shell commands plus one in-browser onboarding step — that's the minimal path to a working default-profile install. First-boot is two-phase by design (the gateway waits for explicit OpenClaw onboarding before applying the wiring); skip the heads-up below at your peril.

git clone https://github.com/chestercs/dgx-openclaw-stack.git
cd dgx-openclaw-stack

./bootstrap.sh                              # interactive, non-destructive, idempotent
docker compose up -d                        # 9 default services; gateway will crash-loop
                                            #   with "Missing config" until you onboard

# Phase 2 — open the OpenClaw Chrome extension OR run `openclaw setup` in a
# shell on the host, pair with `ws://<your-host>:18789` using the gateway
# token printed by bootstrap. Onboarding writes openclaw.json.

docker compose up -d --force-recreate openclaw-config-init openclaw-gateway openclaw-cli

That's it — the patcher applies all wiring steps and the gateway goes healthy. Two-phase fresh-install onboarding (gateway crash-loop → onboarding → patcher applies wiring) is the OpenClaw security model, not a bug; details in SETUP.md. If anything goes sideways, the symptoms map directly onto entries in docs/TROUBLESHOOTING.md.
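
After the final `up`, a script may want to wait until the gateway accepts connections before continuing. The Python sketch below polls the published port 18789 named in this README; the `wait_for_port` helper and its timeouts are hypothetical, and an open port only shows that the listener is up, not that Docker reports the container healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("127.0.0.1", 18789)` would block for up to five minutes, which roughly covers the 3 to 4 minute cold start quoted in the performance table.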

This brings up the 9 default services (LLM MoE + dense + embedding + gateway + cli + config-init + searxng + tts-fish + stt-whisper). Hungarian TTS is now built into the Fish Audio S2 Pro service — no separate --profile hu opt-in. Several capabilities are opt-in profiles that don't start with the default up:

--profile browser — Playwright Chromium for login-gated sites; per-credential 1× OAuth via the noVNC helper.
--profile python — Python code-execution sandbox (MCP).
Image generation lives in a separate compose file and proxies to your existing ComfyUI install (the repo ships no model weights).
Discord integration is a separate operator-side flow (Developer Portal app → bot token → openclaw channels add). The patcher handles every Discord-related field automatically once you've created the channel; see docs/discord-bot-setup.md and docs/reference/discord-config.md.

A more honest "what reproduces from a fresh clone vs what's manual" breakdown is in § Reproducibility from a fresh clone below.

Architecture at a glance

    ┌──────────────────────────────────────────────────────────────────────┐
    │  DGX Spark / ASUS GB10                                               │
    │                                                                       │
    │  Default profile (9 services, all on the compose bridge network)     │
    │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐       │
    │  │ vllm-llm        │  │ vllm-llm-dense  │  │ vllm-embedding  │       │
    │  │ :8004 (MoE 26B) │  │ :8005 (dense 31)│  │ :8005 (bge-m3)  │       │
    │  └────────▲────────┘  └────────▲────────┘  └────────▲────────┘       │
    │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐       │
    │  │ searxng         │  │ openclaw-tts-   │  │ openclaw-stt-   │       │
    │  │ :8080 privacy   │  │ fish :8091      │  │ whisper :8093   │       │
    │  │ meta-search     │  │ Fish S2 Pro     │  │ faster-whisper  │       │
    │  │                 │  │ (multilingual)  │  │ turbo CT2       │       │
    │  └────────▲────────┘  └────────▲────────┘  └────────▲────────┘       │
    │           │ compose DNS (service names)                              │
    │  ┌────────┴──────────────────────────────────────────────────────┐   │
    │  │ openclaw-gateway          :18789 (only published port)        │◀── Chrome ext.
    │  │   ├ openclaw-config-init  (one-shot patcher, runs every up)   │◀── CLI
    │  │   └ openclaw-cli          (always-up, shares gateway netns)   │   │
    │  └───────────────────────────────────────────────────────────────┘   │
    │                                                                       │
    │  Opt-in profiles (parked unless explicitly enabled)                  │
    │  ─ --profile browser  → openclaw-browser   (Chromium + noVNC)        │
    │  ─ --profile python   → openclaw-python-sandbox (MCP exec)           │
    │                                                                       │
    │  Separate compose (proxies to operator-side ComfyUI on host)         │
    │  ─ openclaw-image-comfyui/docker-compose.yml --profile image-gen     │
    └──────────────────────────────────────────────────────────────────────┘

All inter-container traffic is on the compose default bridge network; only port 18789 is published to the host. Put a reverse proxy (Nginx Proxy Manager, Caddy, Traefik, or a Cloudflared tunnel) in front for public access over wss://.

Deep dive: docs/ARCHITECTURE.md.

Features

One compose file for everything. LLM, embedding, web search, and agent stack in one docker compose up -d.
NVFP4-native. Builds on the official vllm/vllm-openai:gemma4-cu130 image; the local vllm-llm/ build only layers the Gemma 4 chat template and the tool-call parser patch on top.
True tool calling. The shipped tool_chat_template_gemma4.jinja plus --tool-call-parser gemma4 --enable-auto-tool-choice produces OpenAI-format tool_calls. OpenClaw uses them directly.
Multimodal. Gemma 4's vision tower is included. Drop an image into the chat; the model reads it at ~280 tokens per image by default.
Multilingual RAG built in. bge-m3 gives you high-quality cross-lingual embeddings for memorySearch out of the box.
Hybrid retrieval + MMR. memorySearch runs BM25 (SQLite FTS5) alongside vector similarity and re-ranks the candidate set with MMR for diversity — exact-keyword / ID matches stop falling through the cracks of pure cosine search.
Privacy-respecting web search. Self-hosted SearxNG wired into OpenClaw's native webSearch tool. No commercial search API, no query leak to Google / Bing / Yandex. Strict engine whitelist (DuckDuckGo, Brave, Mojeek, Qwant, Startpage + Wikipedia / Reddit / GitHub / arXiv).
Multilingual self-hosted TTS. Fish Audio S2 Pro (fishaudio/s2-pro) via SGLang-Omni in one container — 80+ languages from one checkpoint (EN + HU both supported), voice cloning from any 10-30 s mounted reference clip + transcript, ~11 GB weights baked at build time. Fish Audio Research License — non-commercial only (wrapper code MIT); see Fish Audio license note for commercial path.
Bilingual self-hosted STT. deepdml/faster-whisper-large-v3-turbo-ct2 (turbo, ~8× faster than vanilla large-v3) on a self-built CUDA 13 image (~150 LOC FastAPI wrapper, needed because the upstream speaches image rejects Blackwell compute types on sm_120), autodetecting English and Hungarian. ~1.6 GB VRAM at float16, wired into OpenClaw's tools.media.audio pipeline: voice-note uploads, Discord voice, VoiceCall CLI, and Talk/Voicewake nodes all transcribe through it. Swap to Trendency/whisper-large-v3-hu via STT_WHISPER_MODEL for the accuracy-first HU finetune. Details in docs/reference/stt-stack.md.
Optional FLUX-Krea-dev image generation. The openclaw-image-comfyui MCP bridge (opt-in via --profile image-gen, separate compose file) drives the operator's existing ComfyUI install through flux-krea-2k (SFW, 1280×720 default) and flux-krea-2k-adult (same pipeline + flux-uncensored-v2 LoRA) workflow templates. The bridge ships no model weights; the recommended ~35 GB download is documented in docs/reference/image-comfyui-bridge.md. For 4K output, render at 2K native and upscale externally with ESRGAN — diffusion-based upscalers (SUPIR, UltimateSDUpscale tile pass) produce visible tile-seam artifacts on FLUX latents and were dropped.
Discord-ready out of the box. Once you create a bot in Discord's Developer Portal and run openclaw channels add --channel discord, the patcher writes 11 production-tested Discord overrides automatically (progressive streaming, slash-command authz for issue #19310, tool-surface widening for the Discord-routed agent, cron + browser cheatsheets in the workspace). Every override is env-gated and individually disable-able — see docs/reference/discord-config.md for the at-a-glance table.
Long context, honest numbers. 256K model max; realistic stable bands (per user count) are documented in the compose file.
Idempotent configuration. The patcher re-applies a known-good state on every up. Safe to run repeatedly.
Reverse-proxy ready. gateway.trustedProxies is pre-populated; add your LAN CIDR via OPENCLAW_LAN_CIDR if needed.
Non-destructive bootstrap. bootstrap.sh never overwrites an existing .env value or host directory.
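
The hybrid retrieval + MMR idea from the list above can be sketched in a few lines of Python. This is a generic illustration of the technique, not OpenClaw's implementation: the function names, the `alpha` blend weight, the `lam` trade-off and the max-normalisation of BM25 scores are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_scores(bm25, vector, alpha=0.5):
    """Blend max-normalised BM25 scores with vector similarities per document id."""
    top = max(bm25.values(), default=0.0) or 1.0
    ids = set(bm25) | set(vector)
    return {
        i: alpha * bm25.get(i, 0.0) / top + (1 - alpha) * vector.get(i, 0.0)
        for i in ids
    }


def mmr(relevance, embeddings, k, lam=0.7):
    """Greedy MMR: repeatedly pick the id maximising
    lam * relevance - (1 - lam) * (max similarity to already-picked ids)."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected
```

A document that only matches by exact keyword still receives a non-zero blended score from its BM25 term, and MMR then demotes near-duplicates of results that were already picked.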

Repository layout

dgx-openclaw-stack/
├─ docker-compose.yml           # default + opt-in profiles (browser, python)
├─ patch-config.mjs             # idempotent OpenClaw config patcher (27+ steps,
│                               #   header docblock indexes every one)
├─ bootstrap.sh                 # non-destructive interactive first-time setup
├─ bootstrap-browser-login.sh   # 1x OAuth onboarding helper (noVNC bridge)
├─ rotate-secrets.sh            # rotate gateway / service tokens in place
├─ .env.example                 # documented env template (every tunable lives here)
├─ templates/
│  ├─ tool_chat_template_gemma4.jinja        # Gemma 4 tool-call chat template
│  ├─ discord-text-agent/AGENTS.md.example   # discord-friend agent template
│  └─ userscripts/                            # web chat UI userscripts (opt-in)
├─ searxng/
│  └─ settings/settings.yml     # SearxNG override: JSON API + strict engine whitelist
├─ vllm-llm/                    # custom vLLM image (gemma4 tool-call parser patch
│                               #   for colon namespaces — see Dockerfile)
├─ openclaw-base-ext/           # local extension of the openclaw image (adds ffmpeg)
├─ openclaw-tts-fish/           # Multilingual TTS (Fish Audio S2 Pro, SGLang-Omni)
│                               #   Fish Audio Research License (non-commercial)
├─ openclaw-stt-whisper/        # Self-built CUDA 13 STT image (faster-whisper turbo)
├─ openclaw-browser/            # OPT-IN browser automation (--profile browser)
├─ openclaw-python-sandbox/     # OPT-IN Python MCP exec sandbox (--profile python)
├─ openclaw-image-comfyui/      # OPT-IN image-gen MCP bridge — SEPARATE compose file
│                               #   (proxies to operator's existing ComfyUI install)
├─ docs/
│  ├─ ARCHITECTURE.md           # service-by-service design rationale
│  ├─ CUSTOMIZATION.md          # model swaps, remote backends, hardware retuning
│  ├─ TROUBLESHOOTING.md        # common failure modes and fixes
│  ├─ discord-bot-setup.md      # zero-to-bot Discord Developer Portal walkthrough
│  └─ reference/                # deeper reference docs (15+ files — see reference/README.md)
├─ README.md                    # you are here — pitch + quickstart
├─ SETUP.md                     # end-user first-boot walkthrough
├─ CHANGELOG.md                 # versioned release notes
├─ CLAUDE.md                    # contributor / coding-agent guide
├─ CONTRIBUTING.md              # how to file issues + send PRs
└─ LICENSE                      # MIT (model weights retain upstream licenses)

Reproducibility from a fresh clone

Honest scope: a git clone + ./bootstrap.sh + docker compose up -d followed by the onboarding handshake brings up the 9 default services and the full agent baseline (Gemma 4 MoE + dense, embedding, gateway, web search, multilingual TTS, STT, hybrid memory, all 27+ patcher overrides). The advanced surfaces (Discord, image-gen, browser automation, Python sandbox) need explicit operator steps: each is documented, but none is a one-command install. The table below is the honest answer to "will my clone end up where the maintainer's deploy is?".

Layer	Reproduces from `compose up` alone?	What's needed beyond bootstrap
9 default services (LLM MoE + dense, embedding, gateway, cli, config-init, searxng, tts-fish, stt-whisper)	✅ after onboarding	Gateway is expected to crash-loop until you complete the Chrome-extension wizard or `openclaw onboard` — then re-run the patcher trio. SETUP.md §5–6b walks through this.
Gemma 4 NVFP4 weights	❌	HF account, accept the Gemma 4 license, put your `hf_…` token in `.env`. `bootstrap.sh` prompts for it.
Fish Audio S2 Pro weights	❌	Pulled automatically at build time from fishaudio/s2-pro. Building the image constitutes acceptance of the Fish Audio Research License (non-commercial) — see Fish Audio license note.
Browser automation	❌	`--profile browser`, then per-credential noVNC OAuth via `./bootstrap-browser-login.sh <profile>`. Each login is 1× manual (W3C origin-bound — passkeys don't work).
Python sandbox	❌	`--profile python`, secrets generated by `bootstrap.sh`.
Discord integration	❌	Discord Developer Portal (create app + bot token), `openclaw channels add --channel discord`, copy `templates/discord-text-agent/AGENTS.md.example` to `workspace-discord/AGENTS.md`. Walkthrough: `docs/discord-bot-setup.md`. The 11 patcher overrides (steps 20-30) auto-apply once the channel exists; the operator-tunable env knobs are catalogued in `docs/reference/discord-config.md`.
Image generation	❌	Separate compose at `openclaw-image-comfyui/docker-compose.yml` (`--profile image-gen`), plus your own ComfyUI install on `host.docker.internal:13036` (or LAN IP), plus model weights of your choice (FLUX Krea / SDXL fine-tunes). The repo ships no weights. See `docs/reference/image-comfyui-bridge.md`.
Memory contents	❌ (by design)	User's accumulated notes under `workspace/memory/*.md` are operator data, not code. Back up with `tar czf "openclaw-$(date +%F).tar.gz" -C "$OPENCLAW_CONFIG_DIR" .`
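
The backup one-liner in the last row can also be scripted. The Python sketch below mirrors the `tar` command using only the standard library; the `backup_config` helper is hypothetical, and it assumes the destination directory lies outside the config directory so the archive does not try to include itself.

```python
import tarfile
from datetime import date
from pathlib import Path


def backup_config(config_dir, dest_dir="."):
    """Write openclaw-YYYY-MM-DD.tar.gz containing config_dir's contents."""
    src = Path(config_dir)
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    archive = Path(dest_dir) / f"openclaw-{date.today().isoformat()}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to config_dir, like `tar -C dir .`
        tar.add(src, arcname=".")
    return archive
```

A typical call would pass the value of `OPENCLAW_CONFIG_DIR` as `config_dir` and a separate backup location as `dest_dir`.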

What the repo does guarantee: bit-stable wiring of every service it ships, deterministic patcher state on every up, pinned OPENCLAW_IMAGE_REF digest in .env.example, and idempotent secret generation in bootstrap.sh. The externals (HF model licenses, Discord, browser OAuth, image-gen weights) are externalised precisely because they're decisions the operator must make, not oversights.

Fish Audio license note

The TTS surface uses Fish Audio S2 Pro (fishaudio/s2-pro, 5B param Qwen3-omni architecture). The model is distributed under the Fish Audio Research License — non-commercial use only. Building the openclaw-tts-fish image pulls the ~11 GB checkpoint from HuggingFace and constitutes acceptance of the upstream license.

The wrapper code in openclaw-tts-fish/ is MIT (matches the rest of this repo). This repo ships no model weights of any kind — they download at build time.

For commercial deployments, either contact Fish Audio for a commercial license, or swap the FISH_REPO build arg in openclaw-tts-fish/server/Dockerfile to point at a checkpoint you have a commercial license to. The wrapper architecture is model-agnostic: the shim expects any SGLang-Omni-compatible TTS checkpoint with the same references[] voice-cloning schema.

Adding a custom voice (any language)

Voice cloning happens at request time from mounted reference files — no fine-tune, no re-build:

# 1. Record 10-30 s of clean mono speech (16/24 kHz preferred), write the
#    verbatim transcript to a sibling .txt:
#       myvoice.wav   (16-bit PCM mono, no music/noise, no echo)
#       myvoice.txt   (UTF-8 text, exactly what was said in the wav)
#
# 2. Drop both into the openclaw-tts-fish container's voice mount:
docker cp myvoice.wav ${CONTAINER_NAME_PREFIX:-}openclaw-tts-fish:/app/voices/
docker cp myvoice.txt ${CONTAINER_NAME_PREFIX:-}openclaw-tts-fish:/app/voices/

# 3. Request it (no restart needed):
curl -H "Authorization: Bearer $TTS_API_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"input":"Hello world.","voice":"myvoice"}' \
     http://127.0.0.1:8091/v1/audio/speech --output out.wav

The image ships a 7-voice library out of the box: default_en plus five more English timbres (US/UK, male/female; Kokoro 82M syntheses, Apache-2.0) and default_hu (LibriVox public domain). Friendly aliases (english, female, male, british, deep, soft, magyar) are wired by the config patcher. Details, provenance and endpoint reference are in openclaw-tts-fish/README.md.
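
The same request can be issued from Python with only the standard library. The sketch below mirrors the curl call above (endpoint, bearer token, `input` and `voice` fields); the helper names `build_speech_request` and `synthesize` and the 120-second timeout are assumptions, not part of the repo.

```python
import json
import urllib.request


def build_speech_request(text, voice, token, base_url="http://127.0.0.1:8091"):
    """Build the POST request for /v1/audio/speech without sending it."""
    body = json.dumps({"input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, voice, token, out_path="out.wav"):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, voice, token)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

Splitting request construction from sending keeps the payload inspectable, for example to confirm the voice name before the service is running.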

Why this stack

Running a useful local (or hybrid) agent on top of OpenClaw + vLLM is trickier than the surface picture suggests:

The OpenClaw onboarding wizard doesn't register NVFP4 models against a self-hosted vLLM provider, leaves memorySearch disabled, ships an empty gateway.trustedProxies, and writes a placeholder API key — all of which silently break things later.
Gemma 4 tool calling requires a specific chat template that isn't in the official vLLM image, plus a one-line fix to the upstream gemma4 tool-call parser so colon namespaces like discord:add_reaction aren't rejected by the regex — both ship as part of the local vllm-llm/ image build.
The bundled OpenClaw searxng plugin ships default-disabled — webSearch looks wired up but doesn't actually fire until you flip it on.
Hybrid (BM25 + vector) retrieval and MMR re-rank are native OpenClaw features but aren't on by default.
Discord slash commands silently fail in guilds because of an upstream dual-permission check; the auto-ack reaction has a known cycle bug; the coding tool profile (default for non-main agents) excludes browser, tts, and canvas so a Discord-routed agent can't reach for half the tools the main agent uses. The patcher fixes all three by default — and every override is env-gated so you can disable any of them.
On GB10 specifically, unified-memory GPU budgeting between two concurrent vLLM processes needs care (LLM_GPU_MEM_UTIL vs EMBED_GPU_MEM_UTIL).
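
The unified-memory budgeting point in the last item can be made concrete with a little arithmetic. In the Python sketch below, each fraction stands for one vLLM process's share of total memory (the role played by `LLM_GPU_MEM_UTIL` and `EMBED_GPU_MEM_UTIL`); the example fractions, the host reserve and the `memory_budget` helper are hypothetical, not the values shipped in `.env.example`.

```python
def memory_budget(total_gb, fractions, host_reserve_gb):
    """Return (per-process GB, remaining headroom GB) for utilisation fractions."""
    allocated = {name: total_gb * frac for name, frac in fractions.items()}
    headroom = total_gb - sum(allocated.values()) - host_reserve_gb
    return allocated, headroom


def check_budget(total_gb, fractions, host_reserve_gb):
    """Raise if the processes plus the host reserve exceed total memory."""
    allocated, headroom = memory_budget(total_gb, fractions, host_reserve_gb)
    if headroom < 0:
        raise ValueError(f"over budget by {-headroom:.1f} GB: {allocated}")
    return allocated, headroom


# Placeholder numbers on a 128 GB unified-memory machine: because CPU and GPU
# share the same pool, the host reserve has to come out of the same total.
example, spare = check_budget(
    128, {"LLM_GPU_MEM_UTIL": 0.5, "EMBED_GPU_MEM_UTIL": 0.25}, host_reserve_gb=16
)
```

On a discrete GPU the host reserve would be zero for this calculation; on GB10 it is the reason the fractions cannot simply sum to 1.0.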

This repo captures a known-good wiring for all of the above in a single deterministic docker compose up. The patch-config.mjs patcher re-applies its 27+ steps on every restart so the wiring survives onboarding-wizard reruns, image upgrades, and manual edits — every step is logged with a [patch-config] line and gated by user-managed protection (your hand-edits to openclaw.json are preserved).
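
What "idempotent with user-managed protection" means in practice is sketched below in Python. The real patcher is the Node script patch-config.mjs and is not reproduced here: `apply_patch`, the dotted-path convention and the example values are hypothetical, and only the key names `memorySearch` and `gateway.trustedProxies` come from this README.

```python
import copy


def apply_patch(config, desired, user_managed=frozenset(), _path=""):
    """Return a copy of config with desired values merged in.

    Dotted paths listed in user_managed are never written, so operator
    hand-edits survive. Re-applying the same patch changes nothing.
    """
    result = copy.deepcopy(config)
    for key, value in desired.items():
        path = f"{_path}.{key}" if _path else key
        if path in user_managed:
            continue  # operator owns this path: leave it untouched
        if isinstance(value, dict):
            base = result.get(key) if isinstance(result.get(key), dict) else {}
            result[key] = apply_patch(base, value, user_managed, path)
        else:
            result[key] = copy.deepcopy(value)
    return result
```

Because the function is pure and only ever sets keys to the desired values, running it on every `up` converges to the same state regardless of how many times it has run before.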

License

MIT. Model weights retain their upstream licenses:

Gemma 4: Apache 2.0
bge-m3: MIT
Whisper turbo (deepdml/faster-whisper-large-v3-turbo-ct2): MIT
Fish Audio S2 Pro (fishaudio/s2-pro): Fish Audio Research License — non-commercial use only. Building openclaw-tts-fish triggers the ~11 GB weight download from HuggingFace and constitutes acceptance of the upstream license. See Fish Audio license note for the commercial path and the FISH_REPO build-arg override.

Contributing

Pull requests welcome. See docs/CUSTOMIZATION.md for the extension points that matter (model swap, quantization swap, custom agents). For issues that aren't about this stack itself, please file them upstream at vllm-project/vllm or openclaw/openclaw.
