Container builds, deployment recipes, and routing configuration for
the LLM serving stack behind InferNode's llmsrv / lucibridge /
serve-llm.sh.
Status: WIP. Repo bootstrapped 2026-05-14 to productize findings
from INFR-68 (SGLang-on-Jetson spike). Read
docs/SGLANG-ADOPTION-NOTES.md for
the full measured-results writeup and the working installation
recipe.
- `infernode-os/infernode` is the OS / runtime (Limbo, emu, Veltro, lucifer). Docker/CI scaffolding for external serving stacks doesn't belong inside the OS source.
- `pdfinn/infernode-os-llm` is the training pipeline (corpus → harvest → train → eval → adapter). Its lifecycle moves with model training cycles.
- This repo owns the serving runtime that consumes IOL's adapters: Jetson-targeted container builds, lucibridge per-tool routing configs, deployment runbooks. Its lifecycle is tied to JetPack releases / SGLang versions / hardware generations, not to model training cycles.
- Jetson Orin AGX (`sm_87`, JetPack 6.x): current production. The measured 3× concurrent-throughput advantage of SGLang over Ollama was characterized on this hardware.
- Jetson Thor (`sm_103`, JetPack 7.x): forward-looking. NVIDIA ships official NGC SGLang containers for Thor (per the "Run SGLang in Thor" forum thread) — a cleaner upstream path than Orin's community-maintained one.
serving/
├── README.md
├── LICENSE (MIT)
├── docs/
│ ├── SGLANG-ADOPTION-NOTES.md working/measured notes (moved from IOL)
│ └── ...
├── sglang/ fork of dusty-nv/jetson-containers/packages/llm/sglang/
│ ├── orin/ pinned config for Orin (sm_87)
│ └── thor/ pinned config for Thor (sm_103) — when we have a Thor box
├── runbooks/
│ ├── hephaestus-deploy.md
│ └── ...
└── .github/workflows/
└── build-sglang.yml GitHub-hosted ubuntu-24.04-arm runner
The repo is public (changed 2026-05-14 to qualify for free
ubuntu-24.04-arm GitHub-hosted runner minutes). No secrets or
credentials in the tree; .gitignore excludes anything credential-
shaped.
Builds run on GitHub-hosted ubuntu-24.04-arm (Graviton-class
SBSA). Native aarch64 — no QEMU — and nvcc cross-compiles for
sm_87 (Orin) and sm_103 (Thor) via TORCH_CUDA_ARCH_LIST at build
time. The output is a Jetson-Tegra-targeted image; GitHub's hosted
runners have no Jetson hardware, so end-to-end smoke testing (CUDA
paths actually executing) happens manually on Hephaestus after each
successful CI build. Hephaestus must never be configured as a
GitHub self-hosted runner, and the remote CI must not hold any
secrets that link back to the device — the public CI surface stays
strictly isolated from the dev box.
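The per-target arch pinning can be sketched as a small helper the workflow might call. This is a hedged illustration only: the `TARGET` variable, image tag, and `docker build` invocation are assumptions for this sketch, not the repo's actual build-sglang.yml.

```shell
#!/usr/bin/env bash
# Sketch: map a Jetson target to the CUDA arch list that nvcc
# cross-compiles for at build time. TARGET, the image tag, and the
# docker invocation are illustrative assumptions.
set -euo pipefail

TARGET="${TARGET:-orin}"
case "$TARGET" in
  orin) TORCH_CUDA_ARCH_LIST="8.7"  ;;   # sm_87, JetPack 6.x
  thor) TORCH_CUDA_ARCH_LIST="10.3" ;;   # sm_103, JetPack 7.x
  *)    echo "unknown target: $TARGET" >&2; exit 1 ;;
esac
export TORCH_CUDA_ARCH_LIST

# Print (rather than run) the build command; CI would execute it on
# the native aarch64 runner -- no QEMU involved.
echo docker build \
  --build-arg "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}" \
  -t "sglang-jetson:${TARGET}" sglang/
```

Keeping the arch list to a single value per target keeps nvcc compile times down on the hosted runner; building fat binaries for both targets in one image is possible but doubles kernel compilation.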
The sglang/ subtree is intended to vendor the canonical dusty-nv/jetson-containers recipe with our pinned SGLang version (≥ 0.5.x, so `gpt_oss.py` is in the model registry).
- dusty-nv / jetson-containers — https://github.com/dusty-nv/jetson-containers (the canonical Jetson container build framework; NVIDIA-DevRel-maintained, MIT licensed)
- sgl-project / sglang — https://github.com/sgl-project/sglang (upstream; lmsysorg's official images are SBSA — datacenter ARM, not Jetson Tegra)
- NGC SGLang catalog — https://catalog.ngc.nvidia.com/orgs/nvidia/containers/sglang (NVIDIA-published; recent tags target Thor sm_103)
- dustynv/sglang on Docker Hub — https://hub.docker.com/r/dustynv/sglang (pre-built artifacts of jetson-containers; community-maintained; what the INFR-68 spike used — the `r36.4.0` tag with SGLang 0.4.1)
See docs/SGLANG-ADOPTION-NOTES.md for:
- The spike attempts that didn't work and why (PyTorch wheel `USE_DISTRIBUTED` vs `sm_87` constraint)
- The working recipe (extract the dustynv container via crane onto orin-ssd; host Python 3.10 with a patched `LD_LIBRARY_PATH`)
- Measured bake-off results — SGLang vs Ollama on Llama-3.1-8B (TL;DR: SGLang scales to ~78 tok/s at N=8 concurrent vs Ollama's ~23 tok/s plateau; the two are tied at single-user)
- Operations: where things live on Hephaestus, start/stop, verification
- Known gaps: SGLang 0.4.1's GGUF tokenizer doesn't recognize Llama 3 special tokens (needs an HF tokenizer dir); no `gpt_oss.py` in 0.4.1's model registry (needs the SGLang 0.5.x bump)
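SGLang's server launcher accepts a separate `--tokenizer-path`, which is the hook for the HF-tokenizer-dir workaround. A hedged sketch of the launch command — all paths below are placeholders, not Hephaestus's actual layout:

```shell
# Sketch: serve a GGUF model while taking the tokenizer from an HF
# directory, working around the GGUF special-token gap in 0.4.1.
# Model and tokenizer paths are placeholder assumptions.
python -m sglang.launch_server \
  --model-path /mnt/orin-ssd/models/llama-3.1-8b.gguf \
  --tokenizer-path /mnt/orin-ssd/models/llama-3.1-8b-hf-tokenizer \
  --host 0.0.0.0 --port 30000
```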
Hephaestus is the Jetson Orin AGX dev box. Its root partition is
deliberately constrained to emulate a production single-disk
node (OS + TAK + NERVA via Docker + Ollama binary). The 916 GB
/mnt/orin-ssd is the dev indulgence; serving-spike artifacts live
there.
Do not migrate Docker / containerd state from root to orin-ssd.
That would move TAK/NERVA images onto the dev-only disk and break
the production emulation. The Jetson-container build must either fit in the root partition's residual space or use a daemonless build path (the spike did extraction-only via crane; expect to reuse that approach).
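The daemonless extraction amounts to flattening the image's filesystem straight onto the dev SSD, bypassing Docker/containerd state entirely. A sketch using crane's `export` subcommand, which streams a flattened rootfs tarball to stdout; the destination directory is an illustrative assumption:

```shell
# Sketch: extract the prebuilt image onto orin-ssd without touching
# Docker state on the root partition. Destination dir is an assumption.
mkdir -p /mnt/orin-ssd/sglang-rootfs
crane export dustynv/sglang:r36.4.0 - \
  | tar -x -C /mnt/orin-ssd/sglang-rootfs
```

Because nothing is loaded into the daemon, the root partition's residual space never comes into play; the trade-off is that you run the stack chrooted/path-patched from the extracted tree rather than as a container.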
Work in this repo is tracked under the INFR Jira project's "Productize SGLang serving" epic. See INFR-68 (the original spike) and its child tickets.