runed — Shared embedding daemon

runed is a Go daemon that wraps llama.cpp llama-server to serve Qwen3-Embedding-0.6B embeddings via gRPC over a UNIX domain socket. It is designed as a shared singleton process per machine so that multiple client sessions do not each load their own ~400 MB embedding model.

Installation

runed is normally installed via the rune CLI:

rune install

This places the binary at ~/.runed/bin/runed with a manifest URL baked in at build time (via -ldflags). End users don't need to set any environment variable — the first launch downloads the llama-server release tarball and the embedding GGUF described by the manifest into ~/.runed/{bin/llama-cpp,models}/. Subsequent launches reuse the installed artifacts as long as the manifest SHA-256s still match.

For standalone testing of the daemon without rune, see scripts/dev_standalone.sh.

Usage

runed is a passive gRPC daemon. It is normally spawned on demand by a client (e.g. rune-mcp, rundemo, or any program that imports the client/ package) the first time an embedding request arrives:

client.Connect()
    socket dial fails
    → spawn.EnsureDaemon execs ~/.runed/bin/runed (detached)
       → paths.Resolve / EnsureDirs
       → socket probe (exit 0 if another daemon already listening)
       → self-bootstrap (manifest fetch → llama-server tarball + GGUF
         download on first boot, SHA cache check on subsequent boots)
       → backend.Start (llama-server child, port 0)
       → listen on ~/.runed/embedding.sock
    client reconnects → embed RPC → result

Direct foreground launch is supported for development and debugging:

./bin/runed

After RUNED_IDLE_TIMEOUT (default 10m) of no RPC activity, the daemon stops its llama-server child to release the ~470MB+ of model weights from memory — but runed itself stays up with the gRPC socket still listening. The next Embed/EmbedBatch RPC transparently restarts the backend (cold-start latency paid only on that single request). The daemon process exits only on SIGINT/SIGTERM/SIGHUP or a Shutdown RPC.

Configuration

Environment

Variable	Purpose	Default
`RUNED_HOME`	Data directory	`~/.runed`
`RUNED_MANIFEST`	Manifest URL for self-bootstrap	`DefaultManifestURL` (build-time ldflags)
`RUNED_LLAMA_SERVER`	Skip self-bootstrap; use this binary	(unset)
`RUNED_MODEL`	Skip self-bootstrap; use this GGUF	(unset)
`RUNED_MODEL_VARIANT`	Pick a non-default model from `manifest.models`	`manifest.default_model`
`RUNED_CTX_SIZE`	`llama-server --ctx-size` (max input length in tokens)	2048
`RUNED_IDLE_TIMEOUT`	After this much idle, stop the llama-server child to free model memory. `runed` itself stays up; the next Embed RPC resurrects the backend. `"0"` disables suspend	10m

RUNED_MANIFEST should be HTTPS in production. HTTP is permitted (for private networks) but emits a warning at startup — a MITM that rewrites the manifest can also rewrite the per-artifact SHA256s, so artifact integrity collapses to "trust the manifest channel" alone.

The DefaultManifestURL referenced in the table above is injected at build time via -ldflags; see CONTRIBUTING.md for the relevant make build flags.

Manifest format

runed reads only a subset of the manifest; extra keys (used by the companion rune installer) are ignored:

{
  "version": 1,
  "platforms": {
    "darwin-arm64": {
      "llama_server": {
        "url":     "https://.../llama-mac-arm64.tar.gz",
        "sha256":  "...",
        "size":    22000000,
        "extract": "tar.gz",
        "exec":    "llama-server"
      }
    }
  },
  "models": {
    "qwen3-embedding-0.6b.q6_K": {
      "url":    "https://huggingface.co/.../qwen3-embedding-0.6b-q6_k.gguf",
      "sha256": "...",
      "size":   472000000
    }
  },
  "default_model": "qwen3-embedding-0.6b.q6_K"
}

extract: "" (raw binary placed at LlamaDir/<exec>) or "tar.gz" (extracted into LlamaDir, exec path resolved inside).
A sidecar marker ~/.runed/bin/llama-cpp/.llama_server.sha256 tracks the last-installed tarball hash so repeat boots don't re-extract.
Models are verified by hashing the on-disk file against models[<variant>].sha256 — no marker file.

config.json (`~/.runed/config.json`, optional)

This file is read by two components with overlapping schemas; leave any field you don't need unset. Unknown fields are ignored.

Field	Reader	Purpose
`version`	both	Schema version (currently `1`)
`llama_server`	spawn	If set, skip daemon self-bootstrap and use this binary
`model`	spawn	If set, skip daemon self-bootstrap and use this GGUF
`runed_binary`	spawn	Path to the daemon binary (fallback to PATH / `DefaultRunedBinary`)
`idle_timeout`	spawn	Propagated to the spawned daemon as `RUNED_IDLE_TIMEOUT`
`model_variant`	bootstrap	Pick a manifest model variant (overrides `manifest.default_model`)

Example — minimal config telling the spawn layer to use a custom runed_binary but leaving artifact resolution to self-bootstrap:

{
  "version": 1,
  "runed_binary": "/opt/runed/bin/runed",
  "idle_timeout": "1m"
}

Model variants

The f16 GGUF is the parity reference — cosine similarity ≥ 0.9999 against sentence-transformers on all 8 fixture texts. Quantized variants trade parity for size/latency.

Variant	Size	Mean cosine	Verdict
f16	1.1 GB	≈ 0.99999	parity reference
q8_0	610 MB	≈ 0.9993	high-fidelity alternative
q6_K	472 MB	0.994	production default
q5_K_M	424 MB	0.990	borderline; natural-language only
q4_K_M	378 MB	0.971	rerank-only or dev staging; not sole retrieval backbone

Production default is q6_K (models/qwen3-embedding-0.6b.q6_K.gguf) — 23% smaller than q8_0 with mean cosine 0.994 against the f16 reference. f16 is reserved for parity verification against sentence-transformers. Set the daemon's RUNED_MODEL env var to the desired GGUF path.

License

runed Go code: MIT (LICENSE file forthcoming).
Qwen3-Embedding-0.6B: Apache 2.0.
llama.cpp: MIT.

Redistribution of the bundled llama-server binary and GGUF model files is permitted under the respective upstream licenses; CryptoLab makes no independent claims on those artifacts.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
.github/workflows		.github/workflows
ci		ci
client		client
cmd		cmd
gen/runed/v1		gen/runed/v1
internal		internal
proto/runed/v1		proto/runed/v1
scripts		scripts
.embedding-model.yaml		.embedding-model.yaml
.gitignore		.gitignore
.llama-cpp-version		.llama-cpp-version
CONTRIBUTING.md		CONTRIBUTING.md
Makefile		Makefile
README.md		README.md
buf.gen.yaml		buf.gen.yaml
buf.yaml		buf.yaml
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

runed — Shared embedding daemon

Installation

Usage

Configuration

Environment

Manifest format

config.json (`~/.runed/config.json`, optional)

Model variants

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

runed — Shared embedding daemon

Installation

Usage

Configuration

Environment

Manifest format

config.json (~/.runed/config.json, optional)

Model variants

License

About

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

config.json (`~/.runed/config.json`, optional)

Packages