Add pluggable LLM API provider hook with tool calling#337
Conversation
Introduce a vendor-neutral BaseLLMProvider interface (osprey.worker.lib.llm) designed for tool calling from day one: tool definitions in, tool-call requests out, tool results back in. Add a register_llm_provider pluggy hookspec (firstresult=True) and a bootstrap_llm_provider() helper, mirroring the existing single-provider hooks (register_input_stream, register_execution_result_store). Add an example provider in example_plugins that talks to the Anthropic Messages API directly, demonstrating the request/response and tool_use translation. The anthropic SDK is imported lazily and is not declared as a workspace dependency (it conflicts with the pinned typing-extensions), so it stays optional and CI needs no SDK, API key, or network. Includes unit tests (dataclass invariants + a full tool-call cycle against a fake Anthropic client) and docs for the new hook in docs/DEVELOPMENT.md.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The RuntimeError referenced an example_plugins[llm] extra that no longer exists (anthropic isn't a declared dependency due to the typing-extensions conflict). Point users at the manual install instead.
- Preserve system-prompt cache_control: _build_system now emits the Anthropic block form (list of text blocks with cache_control) when any system part carries a breakpoint, instead of always flattening to a string that drops it. - Avoid falsy-or traps: model/max_tokens use 'is not None' so explicit '' / 0 pass through instead of silently becoming defaults (matches temperature). - Degrade gracefully on a non-dict tool_use 'input' instead of crashing. - Extract _cache_control_dict helper (de-dupes the message/system cache mapping) and note the extended-cache-ttl beta-header requirement for non-default ttl. - Clarify bootstrap_llm_provider docstring: registered != ready (SDK/creds may be validated lazily on first chat()). - Make the missing-SDK test deterministic via monkeypatch (sys.modules) so it no longer depends on anthropic being absent now that example_plugins is in CI testpaths; add regression tests for the three fixes above.
- Module docstring no longer references the removed example_plugins[llm] extra; points to manual 'uv pip install anthropic' and the pyproject NOTE. - base.py chat() docstring clarifies that to cache the system prompt you pass a role='system' message with cache_control (the plain system= arg can't).
- Make anthropic a real dependency of example_plugins by bumping the root
typing-extensions pin (==4.6.3 -> >=4.14.0; resolves to 4.15.0). Removes the
'not declared / install manually' workaround (pyproject NOTE, mypy override
comment, anthropic.* mypy entry, DEVELOPMENT.md manual-install note).
- Use the standard OSPREY_ prefix for config keys
(OSPREY_LLM_ANTHROPIC_{API_KEY,MODEL,MAX_TOKENS}).
- Default to claude-sonnet-4-6 (verified current Sonnet 4.6 API id).
- Don't register an LLM provider by default; an LLM provider is optional and
may be unregistered (bootstrap_llm_provider now returns None here).
- base.py chat(): raise NotImplementedError instead of a no-effect '...'
(CodeQL 'statement has no effect').
- Drop 'mirroring other hooks' language from the hookspec docstring; drop a
stray 'e.g.' in DEVELOPMENT.md.
- Tests updated for OSPREY_ keys; drop the now-obsolete missing-SDK test.
Vendor-neutral sugar on top of the provider interface, in osprey.worker.lib.llm: - tools.py: ToolParameter + a ToolRegistry.@tool decorator that compiles to the existing ToolDefinition (JSON Schema). Handlers are plain sync callables (no imposed context object); dispatch(ToolCall) runs the handler and captures unknown-tool / bad-args / handler errors as ToolResult(is_error=True) to feed back to the model. registry.definitions() feeds provider.chat(tools=...). - loop.py: run_tool_loop drives the standard exchange (chat -> dispatch tool_calls -> append role='assistant' tool-use + role='tool' results -> repeat) until the model returns a final answer, or raises ToolLoopLimitExceeded past max_iterations. Input messages are not mutated. - Exported from llm/__init__.py; documented in DEVELOPMENT.md; unit tests for schema building, registration, dispatch (success/error/unknown), and the loop (tool cycle, no-op, limit, no-mutation).
- Move the LLM provider hook + tool-calling docs out of DEVELOPMENT.md into a dedicated docs/llm.md (linked from the hook table and added to SUMMARY.md). - Trim the register_plugins note to just state an LLM provider can be registered here, dropping the rationale.
- loop.py: don't dispatch tools on the final, limit-tripping iteration — side effects whose results can never be sent back to the model are no longer run before raising ToolLoopLimitExceeded. - tools.py: serialize the handler result inside dispatch's try/except so a non-serialisable/circular return becomes is_error instead of crashing the loop; document that error text reaches the model (no secrets) and the default=None sentinel. - typing-extensions: pin exact ==4.15.0 (matches repo convention + uv.lock) instead of an unbounded floor. - anthropic_provider: correct stale cache-ttl comment (1h TTL is GA, no beta header needed). - Tests: loop does not dispatch on the final iteration; dispatch captures an unserialisable result as is_error.
haileyok
left a comment
There was a problem hiding this comment.
Claude:3 Code review — two-reviewer pass (standard + adversarial), iterated to clean.
Verdict: No outstanding bugs. Both reviewers independently approved the final state; all Medium findings from the first pass are fixed with regression tests.
Findings & resolutions
| Severity | Where | Finding | Resolution |
|---|---|---|---|
| Medium | lib/llm/loop.py |
run_tool_loop dispatched tools on the final, limit-tripping iteration and discarded the results — running side effects that can never be sent back to the model |
Fixed: break before dispatching on the last allowed iteration, then raise ToolLoopLimitExceeded; regression test added |
| Medium | lib/llm/tools.py |
dispatch serialized the handler result outside the try/except, so a non-serialisable/circular return crashed run_tool_loop instead of becoming is_error |
Fixed: serialize inside the try; regression test added |
| Medium | pyproject.toml |
typing-extensions was changed to an unbounded floor (>=4.14.0) — a shared, monorepo-wide dep, inconsistent with the exact-pin convention |
Pinned exact ==4.15.0 (matches uv.lock) |
| Low | anthropic_provider.py |
Stale comment claimed the 1h cache TTL needs a beta header | Corrected — 1h TTL is GA on the Claude API (set via the ttl field, no header) |
| Low | lib/llm/tools.py |
Handler exception text is echoed to the model (secret-leak vector); default=None sentinel undocumented |
Documented both |
| Low (accepted) | anthropic_provider.py |
Benign lazy-client init race under gevent (last-writer-wins) | Accepted as harmless |
| Low (accepted) | example_plugins/pyproject.toml |
anthropic>=0.40.0 floor vs exact-pin convention |
Left as a floor: example-plugin dep, reproducibility comes from uv.lock |
What's good (both reviewers)
Clean vendor-neutral interface; tests pin real contracts (exact Anthropic request shape, role mapping, full tool_use→tool_result cycle, usage remap, non-dict input degradation, system cache_control block form); correct firstresult=True/None bootstrap semantics; the tool loop doesn't mutate the caller's messages and terminates correctly; well-scoped dependency/mypy changes.
All local CI gates pass: ruff, ruff format, mypy, full pre-commit, fawltydeps, and the example-provider unit tests.
Reviewers ran on the general model (the configured gpt-5.5 reviewer route was failing auth at review time).
What
Adds a new, vendor-neutral LLM API provider plugin hook to Osprey, designed for tool calling from day one.
This is the first chunk of making LLM access pluggable in Osprey. It is deliberately upstream-only and adds a new public plugin hook, so I'd love interface feedback before this lands (opening as draft for that reason). Relates to the Jan 9 2026 working-group "pluginizing" discussion (#104).
Changes
osprey.worker.lib.llm— newBaseLLMProviderinterface plus small vendor-neutral dataclasses:LLMMessage,ToolDefinition,ToolCall,ToolResult,LLMResponse,LLMUsage,CacheControl. Tool definitions go in; tool-call requests come back out; tool results are fed back in on the nextchat()call. Streaming is intentionally out of scope (room left for a futurechat_stream).register_llm_providerhookspec (firstresult=True) +bootstrap_llm_provider(config)helper, mirroring the existing single-provider hooks (register_input_stream,register_execution_result_store).bootstrap_llm_providerreturnsNonewhen no plugin registers one, so callers null-check.example_plugins(llm/anthropic_provider.py) that talks to the Anthropic Messages API directly, includingtool_userequest/response translation and optional prompt-cache control. Registered viaregister_llm_providerinexample_plugins/src/register_plugins.py.osprey_worker/.../lib/llm/tests/test_base.py), and a full tool-call cycle against a fake Anthropic client — no SDK, API key, or network (example_plugins/src/llm/tests/test_anthropic_provider.py).example_plugins/srcis added topytesttestpathsso example tests run.docs/DEVELOPMENT.md.Design notes / open questions
CacheControlgenerality. Included because a downstream consumer wants to keep Anthropic prompt-cache breakpoints (ephemeral + TTL). It's optional and vendor-neutral; if folks prefer, it could move into provider-specific**paramsinstead.firstresult=True. Only one LLM provider can register, matchingregister_input_stream. Reasonable for now; noted in the docstring.Validation
Locally green:
ruff check,ruff format --diff,mypy .(no issues), fullpre-commit run --all-files(prettier skipped, as in CI), andfawltydeps --check-unused. The example provider tests pass directly; thelib/llmtests run under the docker integration harness like the rest ofosprey_worker.