termlings eval is an operator-facing benchmark harness for measuring verified outcomes per token.
It is meant for:
- Termlings maintainers
- power users
- teams tuning prompts, app access, and workflow strategy
It is always hidden from normal agent sessions.
termlings eval list
termlings eval show <task-id>
termlings eval strategies
termlings eval run <task-id> [--strategy <id>]
termlings eval compare <strategy-a> <strategy-b> [--task <task-id>]
termlings eval report [--last 20]
Eval data is stored under:
.termlings/store/evals/
  tasks/
  strategies.json
  runs/
  reports/
The first time you use eval, Termlings seeds:
- default strategies
- one runnable smoke task
- a set of editable benchmark templates
Tasks live in:
.termlings/store/evals/tasks/*.json
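Because tasks are plain JSON files, they can be inspected with a short script. The following is a minimal sketch in Python that lists the task files under the directory above. It assumes only that each file is valid JSON and makes no assumption about the task schema. The function name `list_task_files` is illustrative and not part of Termlings.

```python
import json
from pathlib import Path


def list_task_files(root="."):
    """Return (file stem, parsed JSON) pairs for every task file under root.

    The stem is just the file name without ".json". Whether it matches the
    task id used by the CLI is an assumption this sketch does not verify.
    """
    tasks_dir = Path(root) / ".termlings" / "store" / "evals" / "tasks"
    results = []
    for path in sorted(tasks_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results.append((path.stem, json.load(handle)))
    return results


if __name__ == "__main__":
    for stem, task in list_task_files():
        # Print only top-level keys, since the schema is not assumed.
        keys = sorted(task) if isinstance(task, dict) else []
        print(stem, keys)
```

Run from the project root, this prints one line per task file. For the authoritative view of a task, termlings eval show <task-id> remains the documented route.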
Default strategies include:
- full-brief
- concise-app-scoped
- pm-with-delegate
The active strategy's settings are exported into each eval run environment so that commands and scripts can adapt their behavior.
Eval runs expose useful env vars like:
- TERMLINGS_EVAL_RUN_ID
- TERMLINGS_EVAL_TASK_ID
- TERMLINGS_EVAL_STRATEGY_ID
- TERMLINGS_EVAL_RUN_DIR
- TERMLINGS_EVAL_ARTIFACTS_DIR
- TERMLINGS_EVAL_METRICS_PATH
- TERMLINGS_EVAL_VERIFICATION_PATH
- TERMLINGS_EVAL_BRIEF_MODE
- TERMLINGS_EVAL_SYSTEM_CONTEXT
- TERMLINGS_EVAL_ACTIVITY_LEVEL
- TERMLINGS_EVAL_DELEGATION
- TERMLINGS_EVAL_MEMORY_MODE
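A task script can collect these variables in one place before deciding how to behave. The sketch below is one possible way to do that in Python. The variable names come from the list above, but their value formats are not documented here, so they are treated as opaque strings. The helper names `read_eval_env` and `in_eval_run` are hypothetical, and using TERMLINGS_EVAL_RUN_ID as the "inside an eval run" signal is an assumption.

```python
import os

# Suffixes taken from the list above; each variable is named
# TERMLINGS_EVAL_<SUFFIX> in the run environment.
EVAL_VARS = (
    "RUN_ID", "TASK_ID", "STRATEGY_ID", "RUN_DIR", "ARTIFACTS_DIR",
    "METRICS_PATH", "VERIFICATION_PATH", "BRIEF_MODE", "SYSTEM_CONTEXT",
    "ACTIVITY_LEVEL", "DELEGATION", "MEMORY_MODE",
)


def read_eval_env(environ=None):
    """Return the eval variables that are set, keyed by lowercase suffix."""
    environ = os.environ if environ is None else environ
    found = {}
    for suffix in EVAL_VARS:
        value = environ.get("TERMLINGS_EVAL_" + suffix)
        if value is not None:
            found[suffix.lower()] = value
    return found


def in_eval_run(environ=None):
    """Assumption: a set TERMLINGS_EVAL_RUN_ID means we are in an eval run."""
    return "run_id" in read_eval_env(environ)
```

Passing a plain dict as `environ` makes the helpers easy to exercise outside a real run.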
If a task command writes JSON metrics to TERMLINGS_EVAL_METRICS_PATH, those metrics are folded into the run record.
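As an illustration, a task command written in Python might emit its metrics as follows. This is a sketch under stated assumptions: the metric names are hypothetical (no metrics schema is defined here), and the helper `write_metrics` is not part of Termlings.

```python
import json
import os
from pathlib import Path


def write_metrics(metrics, environ=None):
    """Write metrics as JSON to the path in TERMLINGS_EVAL_METRICS_PATH.

    Returns the path written, or None when the variable is unset
    (for example, when the command runs outside an eval run).
    """
    environ = os.environ if environ is None else environ
    target = environ.get("TERMLINGS_EVAL_METRICS_PATH")
    if not target:
        return None
    path = Path(target)
    # Create the parent directory in case the harness has not done so.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical metric names, used purely for illustration.
    write_metrics({"files_changed": 3, "retries": 0})
```

Returning None when the variable is unset lets the same command run unchanged outside the eval harness.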
V1 supports four verification types:
- script
- file
- json
- manual
The verification result is the source of truth, not the agent's self-report.
termlings eval list
termlings eval run brief-json-smoke --strategy concise-app-scoped
termlings eval compare concise-app-scoped full-brief --task brief-json-smoke
termlings eval report --last 20
- eval is operator-only by design.
- It does not appear in agent app help or system context.
- Seeded templates are intentionally editable.
- V1 is command-driven and file-backed.
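Since V1 is command-driven, the example commands shown earlier can also be driven from a script. The following Python sketch builds argument lists for termlings eval run and termlings eval compare exactly as documented above, and runs one task across several strategies. The helper names are hypothetical, and the meaning of a non-zero exit code is not documented here, so return codes are passed through without interpretation.

```python
import subprocess


def run_command(task_id, strategy=None):
    """Build argv for: termlings eval run <task-id> [--strategy <id>]."""
    argv = ["termlings", "eval", "run", task_id]
    if strategy is not None:
        argv += ["--strategy", strategy]
    return argv


def compare_command(strategy_a, strategy_b, task_id=None):
    """Build argv for: termlings eval compare <a> <b> [--task <task-id>]."""
    argv = ["termlings", "eval", "compare", strategy_a, strategy_b]
    if task_id is not None:
        argv += ["--task", task_id]
    return argv


def run_strategies(task_id, strategies, runner=subprocess.run):
    """Run one task once per strategy and return each process return code.

    The runner parameter exists so the loop can be tested without the CLI.
    """
    codes = {}
    for strategy in strategies:
        completed = runner(run_command(task_id, strategy), check=False)
        codes[strategy] = completed.returncode
    return codes
```

For example, `run_strategies("brief-json-smoke", ["concise-app-scoped", "full-brief"])` would invoke the seeded smoke task once per strategy, after which termlings eval compare or termlings eval report can summarize the results.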