Mohdel

Self-hosted LLM gateway and SDK for Node — think LiteLLM, for the JS world. One answer() call for 13 providers; swap models by changing one string; get real per-call USD cost back on every result, with OpenTelemetry built in and process isolation when you need it. Your keys, your infra, no SaaS proxy in the path.

npm install -g mohdel
mo                                     # interactive setup — pick a provider, paste your API key
mo ask gemini/gemini-3-flash-preview "why is the sky blue"

Providers: Anthropic, OpenAI, Gemini, Mistral, Groq, xAI, Cerebras, Fireworks, DeepSeek, Qwen Cloud, Xiaomi, OpenRouter, Novita. Node 22+, ES modules.

Why mohdel

  • Real numbers on every call. Token counts and per-call USD cost computed from your own pricing catalog (curated.json) — not estimates, not provider-specific shapes. Bill tenants, alert on spend, reconcile invoices. See docs/CATALOG.md for the catalog format.
  • One interface across providers. Same answer() call, same event stream, same { status, output, inputTokens, outputTokens, cost } result. Switching from anthropic/claude-sonnet-4-6 to openai/gpt-5.4-mini is one string change — adapter differences stay inside mohdel.
  • Self-hosted, no vendor in the path. API keys live in ~/.config/mohdel/. Mohdel calls provider APIs directly; nothing routes through a third party, nothing marks up your tokens, no extra hop of availability risk.
  • Observability without instrumentation. OpenTelemetry spans, trace-linked logs, and OTLP metrics over one endpoint. Set OTEL_EXPORTER_OTLP_ENDPOINT; everything else is wired.
  • Two integration paths, same API. In-process factory for CLI tools, scripts, single-process services. Optional thin-gate subprocess for fault isolation, cross-process quota, and any-language HTTP callers — no code change to switch.
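As a rough illustration of the "bill tenants, alert on spend" use case, the sketch below sums the documented result fields per tenant. The helper names (`summarizeByTenant`, `overBudget`) and the `{ authId, result }` record shape are assumptions made for this example, not part of mohdel's API.

```javascript
// Sum per-call cost and token counts by tenant.
// `authId` and the result fields come from the README; pairing them in one
// record ({ authId, result }) is an assumption made for this sketch.
function summarizeByTenant (records) {
  const totals = new Map()
  for (const { authId, result } of records) {
    const t = totals.get(authId) ?? { calls: 0, inputTokens: 0, outputTokens: 0, cost: 0 }
    t.calls += 1
    t.inputTokens += result.inputTokens
    t.outputTokens += result.outputTokens
    t.cost += result.cost
    totals.set(authId, t)
  }
  return totals
}

// Return the tenants whose accumulated spend exceeds a USD threshold.
function overBudget (totals, limitUsd) {
  return [...totals].filter(([, t]) => t.cost > limitUsd).map(([authId]) => authId)
}
```

Because `cost` is a plain number, long-running sums accumulate floating-point error; for invoicing you may prefer to round each call to integer micro-dollars before adding.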

How it compares

The one-paragraph version: LiteLLM is the closest analog but lives in Python; Vercel AI SDK is an application toolkit, not an infra layer; OpenRouter is the same one-API promise as a SaaS in your request path; raw provider SDKs are N different shapes with no cost accounting.

| | mohdel | LiteLLM | Vercel AI SDK | OpenRouter | Raw SDKs |
|---|---|---|---|---|---|
| Runs in a Node stack natively | yes | Python service | yes | n/a (SaaS) | yes |
| Per-call USD cost on the result | yes | yes | no | yes | no |
| Self-hosted, keys never leave your infra | yes | yes | yes | no | yes |
| Provider-SDK process isolation | yes (thin-gate) | proxy only | no | n/a | no |
| OTel spans + metrics out of the box | yes | via callbacks | no | no | no |
| UI streaming helpers, structured output, agents | no — by design | no | yes | no | varies |

  • vs LiteLLM — same core promise (unified calls, cost tracking, self-hosted gateway), but Node-native: if your stack is JS, there's no Python sidecar to deploy, version, and monitor. The honest gap: LiteLLM's proxy exposes an OpenAI-compatible endpoint and admin features (virtual keys, budgets); thin-gate speaks its own wire protocol — callers use the JS client or implement the protocol.
  • vs Vercel AI SDK — different layer, not a rival. The AI SDK is an application toolkit (UI streaming, structured outputs, agent loops) with no per-call cost, no gateway, no process isolation. Use it above mohdel if you like it — mohdel is the inference primitive underneath.
  • vs OpenRouter — the self-hosted version of the same idea. With a SaaS router you accept their uptime, their markup, and your prompts transiting their infra. Mohdel goes direct to providers with your keys — and ships an openrouter adapter for when you want both.
  • vs raw provider SDKs — no abstraction tax to escape later: mohdel's envelope is flat and close to the SDKs underneath, and cost/tokens come back normalized so you never parse five different usage shapes.

Documentation

  • INTEGRATION.md — JS library guide (factory, client, answer options, tools, streaming, vision, transcription, errors, OTel)
  • docs/COOKBOOK.md — copy-paste recipes (summarize a file, stream, swap providers, tools, vision, batch + cost)
  • docs/CATALOG.md — curated.json walkthrough with worked examples
  • docs/GLOSSARY.md — short definitions for envelope, thin-gate, session, creator vs provider, status, …
  • ARCHITECTURE.md — design rationale, three-plane architecture
  • PROTOCOL.md — wire format for porting clients/sessions to other languages
  • LOGGING.md — log levels, prefixes, pino integration

Quick Start

The three lines at the top of this README are the whole onboarding: install, run mo to pick a provider and paste your API key, then mo ask. Gemini, Groq, and Cerebras all have free tiers — start there if you don't already have a paid key.

Model IDs always use the <provider>/<model> format:

gemini/gemini-3-flash-preview
anthropic/claude-sonnet-4-6
openai/gpt-5.4-mini
groq/llama-4-scout-17b-16e-instruct
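
If you need to split a model ID in your own code, a minimal sketch follows. `parseModelId` is a hypothetical helper, not a mohdel export, and splitting on the first slash only (so a model name that itself contains slashes stays intact) is an assumption.

```javascript
// Split "<provider>/<model>" on the FIRST slash only.
// Hypothetical helper for illustration; not part of mohdel's API.
function parseModelId (id) {
  const slash = id.indexOf('/')
  if (slash <= 0 || slash === id.length - 1) {
    throw new TypeError(`expected <provider>/<model>, got "${id}"`)
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) }
}
```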

What mohdel is not

Scope-capping is deliberate. If you're shopping for any of the following, mohdel is the wrong layer — use it alongside your framework of choice, not instead of it.

  • Not an orchestrator. No chains, no agents, no memory, no prompt templates, no retrieval. Wrap mohdel with LangChain, LangGraph, LlamaIndex, Vercel AI SDK, or your own tool loop — mohdel exposes the inference primitive, orchestration stays in your application.
  • Not a retry / fallback engine. Errors are classified (retryable, severity, type) so the caller can decide, but mohdel never retries or swaps models silently. Silent model-swapping would conflict with existing multi-model logic upstream; the caller owns the retry budget and fallback choice.
  • Not a response cache. The cache: true flag on envelopes is for provider-side prompt caching (Anthropic, OpenAI) — not mohdel-level memoization. Caching inference results is orchestration-policy territory and depends on invariants only the caller knows.
  • Not a context-window / token manager. No pre-call token count, no projected-cost guard. The caller owns what goes in the prompt and is the source of truth for what counts.
  • Not a SaaS proxy. Self-hosted. Your API keys, your infra. No routing through a third party, no vendor lock-in.

See ARCHITECTURE.md §Design principles for the full rationale behind each.
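
Since retries are left to the caller, here is one way a caller-owned retry budget could look. It consumes the event shapes listed under Canonical types (`done`, `error`, and the `retryable` flag on the error). The function name, the `startCall` callback, and the backoff values are assumptions for this sketch.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Retry a call while its error is flagged retryable.
// `startCall(attempt)` must return an async iterable of events, for example
// () => call(envelope, { socketPath }) with mohdel/client.
async function answerWithRetry (startCall, { maxAttempts = 3, baseDelayMs = 200 } = {}) {
  for (let attempt = 1; ; attempt++) {
    let failure = null
    for await (const ev of startCall(attempt)) {
      if (ev.type === 'done') return ev.result
      if (ev.type === 'error') { failure = ev.error; break }
    }
    if (!failure) throw new Error('stream ended without a done or error event')
    if (!failure.retryable || attempt >= maxAttempts) {
      // Surface the classified error fields (severity, retryable, type) on the Error.
      throw Object.assign(new Error(failure.message), failure)
    }
    await sleep(baseDelayMs * 2 ** (attempt - 1)) // exponential backoff
  }
}
```

Falling back to a different model would slot into the same loop: have `startCall` pick the model from the attempt number.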

CLI

# One-shot inference — pipeable
mo ask anthropic/claude-sonnet-4-6 "explain monads"
cat article.txt | mo ask openai/gpt-5.4 "summarize in 3 bullets"
echo "hello" | mo ask gemini/gemini-3-flash-preview --json | jq .cost

# Streaming
mo ask anthropic/claude-sonnet-4-6 --stream "write a haiku about recursion"

# With thinking effort
mo ask anthropic/claude-opus-4-6 --effort high "prove P != NP"

# Speech → text from an audio file
mo transcribe groq/whisper-large-v3-turbo meeting.mp3
mo transcribe mistral/voxtral-mini-transcribe interview.wav --language fr

# Browse the model catalog
mo ls                                  # list all curated models
mo ls --sort price                     # sorted by input price
mo search sonnet                       # filter by name/label
mo show anthropic/claude-sonnet-4-6    # model details
mo stats                               # catalog summary
mo providers                           # providers with key status & rate limits

# Rank models by benchmarks
mo rank                                # curated models, balanced weights
mo rank --use-case tool-loop           # weighted for tool reliability
mo rank --json                         # machine-readable

# Manage the catalog
mo curate anthropic                    # add new models from a provider
mo setup anthropic                     # configure API key
mo model add fireworks/deepseek-r1     # add a model manually
mo model set <model> <key> <value>     # set any field on a model
mo model rm <model> <key>              # remove a field
mo check                               # validate schema + upstream drift

# Rate limits
mo rl show anthropic                   # provider or model limits
mo rl set anthropic/claude-sonnet-4-6 60 100000

# Benchmark with live inference
mo bench anthropic/claude-sonnet-4-6   # single model
mo bench --tag fast --effort low       # suite by tag

All list/show commands support --json [fields] — bare --json lists available fields (like gh).

Library Usage

Two integration paths, same adapters underneath: start with the in-process factory; graduate to the cross-process client when you want gateway-grade isolation.

Factory — in-process (start here)

import mohdel from 'mohdel'

const mo = await mohdel()
const result = await mo.use('anthropic/claude-sonnet-4-6').answer('Hello')
console.log(result.output, result.cost)

No subprocess, no setup beyond your API key. Right for CLI tools (mo ask), scripts, tests, and single-process services — which is most projects.

Client — cross-process (the production gateway)

import { call } from 'mohdel/client'

const envelope = {
  callId: 'c-1', authId: 'u-1', auth: { key: process.env.ANTHROPIC_API_SK },
  model: 'anthropic/claude-haiku-4-5', prompt: 'Hello'
}

for await (const ev of call(envelope, { socketPath: '/tmp/mohdel-data.sock' })) {
  if (ev.type === 'delta') process.stdout.write(ev.delta.delta)
  else if (ev.type === 'done') console.log('\n→', ev.result.cost)
}

Same API, but inference runs in a pooled subprocess behind the thin-gate supervisor (Rust): a crashing provider SDK can't take your service down, quota is enforced across processes, and non-JS callers can speak the same wire. Switching from factory to client is a configuration change, not a rewrite. See INTEGRATION.md §Client for setup.

For the full API — initialization, alias resolution, answer options, response shape, tool use, streaming, vision, error handling, OpenTelemetry, sub-path exports — see INTEGRATION.md.
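
When you want the whole answer rather than a live stream, the event loop from the client snippet can be folded into a small collector. This is a sketch against the documented Event union; `collect` and its `onText` callback are hypothetical names, and only `message` deltas are treated as output text (`function_call` deltas are ignored here).

```javascript
// Drain an async iterable of events into { text, result }.
// Works with any source that yields the documented delta/done/error events.
async function collect (events, onText = () => {}) {
  let text = ''
  for await (const ev of events) {
    if (ev.type === 'delta') {
      if (ev.delta.type === 'message') {
        text += ev.delta.delta
        onText(ev.delta.delta)
      }
    } else if (ev.type === 'done') {
      return { text, result: ev.result }
    } else if (ev.type === 'error') {
      throw Object.assign(new Error(ev.error.message), ev.error)
    }
  }
  throw new Error('stream ended without a done event')
}
```

With the client path this would be used as `await collect(call(envelope, { socketPath }), (t) => process.stdout.write(t))`.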

Observability

Every call emits:

  • OpenTelemetry span (mohdel.session.answer) under the caller's traceparent, with GenAI semantic-convention attributes (gen_ai.request.model, gen_ai.system, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) plus mohdel's own (mohdel.status, mohdel.cost, mohdel.thinking_tokens, mohdel.time_to_first_token_ms, mohdel.cooldown on fast-fail).
  • Trace-linked logs — every stderr log line carries {traceId, spanId, callId, authId, provider, model}. Dump logs + traces into the same collector (SigNoz, Honeycomb, Jaeger + Loki) and they're correlated for free. No per-call instrumentation code.
  • Gate-side OTLP metrics (when running thin-gate): mohdel.sessions.{alive,respawned,spawn_failures}, mohdel.calls{provider,status}, mohdel.call.duration_ms, mohdel.cooldown.rejections, mohdel.quota.rejections, mohdel.policy.errors.

One endpoint for everything: set OTEL_EXPORTER_OTLP_ENDPOINT and spans + metrics flow to it over gRPC. No-op when unset — zero overhead for callers who aren't wired. See INTEGRATION.md §OpenTelemetry and LOGGING.md for details.

The OTel SDK packages (@opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-grpc) are optionalDependencies — installed by default, but npm install --omit=optional skips them (along with their gRPC transitive tree). If you do that and later want trace export, install them explicitly:

npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc

@opentelemetry/api stays in dependencies — the no-op tracer needs it regardless of whether export is wired.
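
The span is parented on the caller's traceparent, which travels as an optional field on the envelope. If your service already has trace and span IDs but no OTel propagator wired, the header value can be built by hand. The sketch below follows the W3C Trace Context layout (`version-traceid-parentid-flags`); `formatTraceparent` and `withTrace` are hypothetical helpers, not mohdel exports.

```javascript
const TRACE_ID = /^(?!0{32})[0-9a-f]{32}$/ // 16 bytes, lowercase hex, not all zero
const SPAN_ID = /^(?!0{16})[0-9a-f]{16}$/ // 8 bytes, lowercase hex, not all zero

// Build a W3C traceparent value: "00-<trace-id>-<parent-id>-<flags>".
function formatTraceparent ({ traceId, spanId, sampled = true }) {
  if (!TRACE_ID.test(traceId)) throw new TypeError('traceId must be 32 lowercase hex chars, not all zero')
  if (!SPAN_ID.test(spanId)) throw new TypeError('spanId must be 16 lowercase hex chars, not all zero')
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`
}

// Return a copy of the envelope carrying the caller's trace context.
function withTrace (envelope, ctx) {
  return { ...envelope, traceparent: formatTraceparent(ctx) }
}
```

If you already run the OpenTelemetry API in the caller, injecting the active context through its propagation API is the more conventional route.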

Architecture

Mohdel splits into three planes that can be deployed independently:

            ┌──────────┐ unix   ┌─────────────┐  stdin/stdout  ┌──────────┐
            │  client  │ socket │  thin-gate  │    NDJSON      │  session │  × N
 caller ──► │   (JS)   │ ─HTTP─►│   (Rust)    │ ─────────────► │   (JS)   │
            └──────────┘        └─────────────┘                └──────────┘
                                        │
                                        ▼ admin plane (unix socket, HTTP)
                                  GET /v1/health
  • mohdel/client (JS) — thin stub that callers import. Opens a unix socket to thin-gate, sends a CallEnvelope, receives an async-iterable of Events. Zero transitive provider-SDK imports — caller-side code stays light.
  • mohdel-thin-gate (Rust binary, prebuilt and shipped via the mohdel-thin-gate-<platform> npm sub-packages) — scheduler / state owner / supervisor. Binds the data-plane socket, validates the envelope, dispatches to a pooled session subprocess, relays events back, handles graceful cancellation on client disconnect. Binds the admin plane for GET /v1/health. Pushes OTLP metrics (sessions alive/respawned, calls by provider/status, call-duration histogram, cooldown / quota / policy rejections) when OTEL_EXPORTER_OTLP_ENDPOINT is set. Internal trait hooks (RoutePolicy, QuotaPolicy, ConfigSource, CachePolicy) make the crate testable and fork-friendly for deployments that need bespoke policy — not a published-library surface.
  • mohdel/session (JS subprocess) — provider executor. Spawned by thin-gate, reads envelopes from stdin, dispatches to the matching adapter, writes events to stdout. A napi-rs addon was scoped for hot-loop optimization but current benchmarks show per-call JS CPU is not the bottleneck; the stub stays under rust/napi-addon/ for future reactivation.

Running thin-gate

cargo run --bin mohdel-thin-gate /tmp/mohdel-data.sock /tmp/mohdel-admin.sock /path/to/js/session/bin.js

# or with a pre-built release binary:
./target/release/mohdel-thin-gate /tmp/mohdel-data.sock /tmp/mohdel-admin.sock ./js/session/bin.js

Positional args are optional (data socket, admin socket, session bin). Env overrides:

  • MOHDEL_SESSION_BIN — path to session entrypoint (defaults to none; if unset, data plane returns synthetic events)
  • MOHDEL_SESSION_POOL_SIZE — pre-warmed sessions (default 2)

With no session-bin configured, thin-gate runs in demo mode: POST /v1/call returns a synthetic echo event sequence. Useful for health-checking the HTTP layer without a runtime dependency on Node.
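
A health probe against the admin socket might look like the sketch below. It uses Node's built-in `http.request` with the `socketPath` option. The `checkHealth` name and the timeout are assumptions, and because the response body of GET /v1/health is not documented here, the sketch only inspects the status code.

```javascript
// Probe GET /v1/health over a unix socket; resolves with the HTTP status.
async function checkHealth (socketPath, { path = '/v1/health', timeoutMs = 2000 } = {}) {
  const http = await import('node:http')
  return new Promise((resolve, reject) => {
    // agent: false avoids pooling the connection after the probe.
    const req = http.request({ socketPath, path, method: 'GET', agent: false }, (res) => {
      let body = ''
      res.setEncoding('utf8')
      res.on('data', (chunk) => { body += chunk })
      res.on('end', () => resolve({ healthy: res.statusCode === 200, statusCode: res.statusCode, body }))
    })
    req.setTimeout(timeoutMs, () => req.destroy(new Error('health check timed out')))
    req.on('error', reject)
    req.end()
  })
}
```

For example, `await checkHealth('/tmp/mohdel-admin.sock')` would target the admin socket from the commands above.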

Calling from JS

The client snippet under Library Usage above is the full surface: call(envelope, { socketPath, signal? }) returns an async iterable of events. Pass an AbortSignal to cancel in flight; thin-gate forwards a cancel control message to the session, and the session stays in the pool for reuse. The envelope is the flat answer(prompt, options) surface plus transport metadata (callId, authId, auth.key, optional traceparent); see js/core/envelope.js for the full field list.
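
To show the cancellation plumbing without a running gate, the sketch below pairs an AbortController with a stand-in generator. `fakeCall` only imitates the shape of `call()` and `takeUntil` is a hypothetical helper; what the real client yields after an abort (for example a result carrying the 'cancelled' warning) is not shown here, so the consumer simply stops at the first non-delta event.

```javascript
// Stand-in for mohdel/client's call(): yields one delta per word until aborted.
async function * fakeCall (envelope, { signal } = {}) {
  for (const word of envelope.prompt.split(' ')) {
    if (signal?.aborted) return
    yield { type: 'delta', delta: { type: 'message', delta: word + ' ' } }
    await new Promise((resolve) => setTimeout(resolve, 0))
  }
  yield { type: 'done', result: { status: 'completed', output: envelope.prompt } }
}

// Consume a stream, cancelling once `maxChars` of output has arrived.
async function takeUntil (call, envelope, maxChars) {
  const controller = new AbortController()
  let text = ''
  for await (const ev of call(envelope, { signal: controller.signal })) {
    if (ev.type !== 'delta') break
    text += ev.delta.delta
    if (text.length >= maxChars) controller.abort()
  }
  return text
}
```

For a plain deadline instead of a content-based cutoff, `AbortSignal.timeout(ms)` can be passed as the signal directly.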

Canonical types (frozen wire contract)

Wire format is JSON over NDJSON frames, camelCase. Types are defined in js/core/ (JSDoc) and mirrored in rust/thin-gate/src/protocol.rs (serde). Cross-language conformance tests enforce round-trip fidelity. The session-side protocol (envelopes in, events out, cancel control messages) is specified in PROTOCOL.md — read that to implement a session in another language.

  • CallEnvelope — flat answer() options plus transport metadata: callId, authId, auth.key, traceparent?, baggage?, provider, model, prompt, outputBudget?, outputType?, outputStyle?, outputEffort?, images?, videos?, cache?, tools?, toolChoice?, parallelToolCalls?, identifier?.
  • Event — three-variant union discriminated on type:
    • { type: 'delta', delta: { type: 'message' | 'function_call', delta: string } }
    • { type: 'done', result: AnswerResult }
    • { type: 'error', error: TypedError }
  • AnswerResult — status, output, inputTokens, outputTokens, thinkingTokens, cost (single number), timestamps, warning?, toolCalls?.
  • Status — 'completed' | 'tool_use' | 'incomplete'.
  • Warning — additive string union: 'insufficientOutputBudget', 'cancelled', ...
  • TypedError — { message, detail?, severity, retryable, type }. message is a stable machine key; detail is user-facing context; severity is 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal'; type is an optional canonical tag (e.g. 'AUTH_INVALID', 'PROVIDER_COOLDOWN').

A cancel control message { op: "cancel", callId } on session stdin aborts the matching in-flight call.

Extending the frozen wire types is breaking — additive changes only on trait method sets and non-frozen internals. See ARCHITECTURE.md §What isn't frozen for the refinable-vs-frozen split.
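
For anyone prototyping a session or client against the wire, the framing side can be sketched as a newline-delimited JSON decoder that tolerates frames split across chunks. This is a generic NDJSON sketch rather than code from the repo. `createNdjsonDecoder` and `encodeFrame` are hypothetical names, and PROTOCOL.md remains the authority on the actual frame contents.

```javascript
// Incremental NDJSON decoder: feed it string chunks, get back complete frames.
function createNdjsonDecoder () {
  let buffered = ''
  return function push (chunk) {
    buffered += chunk
    const lines = buffered.split('\n')
    buffered = lines.pop() // keep the trailing partial line (or '') for next time
    return lines.filter((line) => line.trim() !== '').map((line) => JSON.parse(line))
  }
}

// Serialize one message as a single NDJSON frame.
function encodeFrame (message) {
  return JSON.stringify(message) + '\n'
}
```

When reading from a stream, set its encoding to UTF-8 first (for instance `process.stdin.setEncoding('utf8')`) so multi-byte characters are not split between chunks. The cancel control message above would be written as `encodeFrame({ op: 'cancel', callId })`.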

Adding a new provider adapter

See CONTRIBUTING.md. Short version:

  1. Create js/session/adapters/<provider>.js exporting async function* <provider>(envelope, { client?, signal? }).
  2. Map provider-native events to the canonical Event union.
  3. Pass { signal } to the SDK's streaming method so cancellation aborts in-flight HTTP.
  4. On SDK throw: if signal?.aborted, return silently (run() emits call.cancelled); else yield call.error via classifyProviderError(e) from ./_errors.js.
  5. Register in js/session/adapters/index.js.
  6. Write unit tests with a dependency-injected mock client.
  7. Optionally add a gated live test in test/live/<provider>.live.test.js.
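
The steps above can be sketched as a skeleton for an imaginary provider. Everything provider-specific here is hypothetical: the `acme` name, the mock client's `stream()` method and its `{ text }` chunks, and the local `classifyProviderError` stand-in (the real one lives in ./_errors.js). The yielded shapes follow the canonical Event union, and a real `done` result would also carry token counts and cost.

```javascript
// Stand-in for classifyProviderError from ./_errors.js (hypothetical body).
function classifyProviderError (e) {
  return { message: 'provider_error', detail: String(e.message), severity: 'error', retryable: false }
}

// Adapter sketch for an imaginary provider "acme".
async function * acme (envelope, { client, signal } = {}) {
  let output = ''
  try {
    // Step 3: hand the signal to the SDK so cancellation aborts in-flight HTTP.
    const stream = client.stream({ model: envelope.model, prompt: envelope.prompt }, { signal })
    // Step 2: map provider-native chunks to canonical events.
    for await (const chunk of stream) {
      output += chunk.text
      yield { type: 'delta', delta: { type: 'message', delta: chunk.text } }
    }
    yield { type: 'done', result: { status: 'completed', output } }
  } catch (e) {
    // Step 4: stay silent on cancellation, classify everything else.
    if (signal?.aborted) return
    yield { type: 'error', error: classifyProviderError(e) }
  }
}
```

Injecting `client` is what makes step 6 cheap: a unit test passes an object with a fake `stream()` and asserts on the events that come out.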

Configuration

API keys live in ~/.config/mohdel/environment (one KEY=value per line, loaded automatically):

ANTHROPIC_API_SK=sk-ant-...
OPENAI_API_SK=sk-...
GEMINI_API_SK=AI...
GROQ_API_SK=gsk_...
XAI_API_SK=xai-...
CEREBRAS_API_SK=csk-...
MISTRAL_API_SK=...
FIREWORKS_API_SK=fw_...
DEEPSEEK_API_SK=sk-...
OPENROUTER_API_SK=sk-or-...
NOVITA_API_SK=...

Only set keys for providers you use. Run mo with no arguments for interactive setup.
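
If a deploy script needs to verify the environment file before starting a service, the format is simple enough to check by hand. The sketch below is not mohdel's loader: `parseEnvironment` and `missingKeys` are hypothetical helpers, and skipping blank lines and `#` comments is an assumption beyond the documented one-KEY=value-per-line rule.

```javascript
// Parse "KEY=value" lines into an object.
// Assumption: blank lines and lines starting with "#" are ignored.
function parseEnvironment (text) {
  const env = {}
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim()
    if (line === '' || line.startsWith('#')) continue
    const eq = line.indexOf('=')
    if (eq <= 0) continue // no key before "=", skip
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim() // value may itself contain "="
  }
  return env
}

// List the <PROVIDER>_API_SK variables that are absent for the given providers.
function missingKeys (env, providers) {
  return providers.map((p) => `${p.toUpperCase()}_API_SK`).filter((name) => !env[name])
}
```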

File locations

| Path | Purpose |
|---|---|
| ~/.config/mohdel/environment | API keys |
| ~/.config/mohdel/default.json | Default model selection |
| ~/.config/mohdel/curated.json | Model catalog with metadata, tags, pricing |
| ~/.config/mohdel/providers.json | Provider-level rate limits |
| ~/.config/mohdel/excluded.json | Excluded models |
| ~/.cache/mohdel/uploaded-files.json | Gemini file upload cache |

Paths follow the XDG convention via env-paths.

Provider Matrix

What each provider supports through mohdel's unified interface:

| Provider | Streaming | Tools | Vision | Video | Thinking | Notes |
|---|---|---|---|---|---|---|
| Anthropic | Yes | Yes | Yes | No | Yes (adaptive / budget) | identifier → metadata.user_id |
| OpenAI | Yes | Yes | Yes | No | Yes (o-series) | GPT-5 verbosity via outputStyle |
| Gemini | Yes | Yes | Yes | Yes | Yes (thinkingLevel / thinkingBudget) | Auto-uploads large videos; content-hashed cache |
| Cerebras | No | Yes | Yes | No | Yes (reasoning_effort or zai disable_reasoning) | Non-streaming chat completions |
| Groq | No | Yes | Yes | No | No | Non-streaming; shared chat-completions path |
| xAI | Yes | Yes | Yes | No | Auto | OpenAI Responses API over api.x.ai/v1 |
| DeepSeek | No | Yes | Yes | No | No | DSML tool-call fallback when model emits tags in content |
| Fireworks | Yes | Yes | Yes | No | Yes (reasoning_effort) | OpenAI SDK + baseURL; model id auto-prefixed |
| Mistral | No | Yes | Yes | No | No | tool_choice: "any" = required |
| Qwen Cloud | No | Yes | No | No | Yes (enable_thinking + thinking_budget) | Alibaba DashScope intl; hybrid models think by default — effort none sends explicit off |
| Xiaomi | No | Yes | Yes | No | Auto | MiMo; shared chat-completions path, reasoning_content captured |
| OpenRouter | Yes | Yes | Yes | No | Varies | Meta-provider; providerOptions.openrouter for routing prefs |
| Novita | No | No | No | No | No | Image generation only |

Adapter capability ≠ model capability — whether a given model accepts images, tools, or thinking effort depends on the model spec in curated.json. The adapter passes through what the envelope supplies; the provider rejects unsupported combos.

Local Development

git clone <repo> && cd mohdel
npm install
npm test                          # unit tests, no API keys

Rust tests

cargo test --workspace            # thin-gate + napi-addon
cargo build --release --bin mohdel-thin-gate

Test files under rust/thin-gate/tests/:

| File | Coverage |
|---|---|
| conformance.rs | JS↔Rust protocol round-trip |
| protocol.rs | serde (de)serialization of envelope/events/results |
| server.rs | HTTP layer, synthetic dispatch, 404/400 paths |
| session_dispatch.rs | real node js/session/bin.js spawn + dispatch + graceful cancel |
| policy.rs | RoutePolicy + QuotaPolicy + Enforcer end-to-end |
| config.rs | TOML ConfigSource parsing, defaults, malformed, env override |
| supervision.rs | readiness ping/pong + readiness timeout + garbage-response handling |
| stress.rs | 100 concurrent calls, cancel storm, session-death-under-load |

Spawning tests require node in PATH.

Provider integration tests

These hit real provider APIs. Models are drawn from your local curated.json — one per provider. Each provider block is skipped automatically when its API key is missing.

npm run test:provider             # all providers via the factory path
TAG=fast npm run test:provider    # filter by model tag
npm run test:multiturn            # multi-turn conversation tests (incl. tool round-trip)
npm run test:vision               # image input tests

Live adapter tests

Exercise the session adapters directly against real provider APIs. Gated on env keys; skipped cleanly when keys are absent. See test/live/README.md for details.

ANTHROPIC_API_SK=sk-ant-... npm run test:live
OPENAI_API_SK=sk-... npm run test:live

Scenario-driven testing (the fake provider)

For deterministic stress, benchmark, and bug-repro work, register provider: "fake" in the envelope with a JSON prompt that drives the scenario:

{ mode: 'volume',       tokens: 1000 }              // throughput stress
{ mode: 'slow',         tokens: 50, delayMs: 100 }  // streaming cadence
{ mode: 'error',        type: 'AUTH_INVALID' }      // error classification
{ mode: 'hang' }                                    // cancel / timeout plumbing
{ mode: 'tool',         name: 'f', args: { x: 1 } } // tool round-trip
{ mode: 'incomplete' }                              // status contract
{ mode: 'crash' }                                   // process isolation (exits the adapter process)
{ mode: 'cancel_after', tokens: 5 }                 // cancel mid-stream

All modes honor AbortSignal. The benchmarks in bench/ use this to pin adapter work to a fixed shape and isolate what's being measured — see bench/bench.js (throughput) and bench/isolation.js (crash containment).
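
Because the scenario travels as a JSON prompt, the object literals above need to be serialized before they go into the envelope. A minimal sketch, assuming the envelope fields listed under Canonical types: `fakeEnvelope` is a hypothetical helper, and the placeholder `model`, `authId`, and `auth.key` values are assumptions, since this README does not say what the fake provider expects there.

```javascript
// Build an envelope that drives the fake provider with one scenario.
function fakeEnvelope (scenario, callId = 'c-fake-1') {
  return {
    callId,
    authId: 'u-test', // placeholder tenant id (assumption)
    auth: { key: 'unused' }, // assumption: the fake provider needs no real key
    provider: 'fake',
    model: 'fake/scenario', // placeholder model id (assumption)
    prompt: JSON.stringify(scenario) // the scenario is the JSON prompt
  }
}

// One envelope per scenario, each with its own callId.
function fakeSuite (scenarios) {
  return scenarios.map((scenario, i) => fakeEnvelope(scenario, `c-fake-${i + 1}`))
}
```

A stress run could then feed `fakeSuite([{ mode: 'volume', tokens: 1000 }, { mode: 'hang' }])` through the same call path as production traffic.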

npm scripts

| Command | Description |
|---|---|
| npm test | Unit tests (vitest) |
| npm run test:provider | Provider integration via the factory — real API calls |
| npm run test:live | Live session-adapter tests (env-key gated) |
| npm run lint | StandardJS lint |
| npm run cli | Interactive model picker |
| cargo test --workspace | Rust tests (thin-gate + protocol + policy + stress + ...) |
| node bench/bench.js | In-process vs via-gate throughput benchmark |
| node bench/isolation.js | Crash-isolation demo (in-process dies, via-gate contains) |

Contributing

Fork the repository and submit a pull request. Code style: Node 22+, ES modules, no semicolons, 2-space indent, single quotes (StandardJS). See CONTRIBUTING.md for details.

Mohdel's wire is language-agnostic. The JS client is the first implementation, not the only one — a Python / Go / Ruby / Swift / Elixir / ... client is a great starter contribution. See CONTRIBUTING.md §Porting a client to another language and PROTOCOL.md.

License

MIT. See LICENSE.
