Skip to content

Add pluggable LLM API provider hook with tool calling#337

Draft
haileyok wants to merge 8 commits into
mainfrom
llm-provider-hook
Draft

Add pluggable LLM API provider hook with tool calling#337
haileyok wants to merge 8 commits into
mainfrom
llm-provider-hook

Conversation

@haileyok

@haileyok haileyok commented Jun 4, 2026

Copy link
Copy Markdown
Member

What

Adds a new, vendor-neutral LLM API provider plugin hook to Osprey, designed for tool calling from day one.

This is the first chunk of making LLM access pluggable in Osprey. It is deliberately upstream-only and adds a new public plugin hook, so I'd love interface feedback before this lands (opening as draft for that reason). Relates to the Jan 9 2026 working-group "pluginizing" discussion (#104).

Changes

  • osprey.worker.lib.llm — new BaseLLMProvider interface plus small vendor-neutral dataclasses: LLMMessage, ToolDefinition, ToolCall, ToolResult, LLMResponse, LLMUsage, CacheControl. Tool definitions go in; tool-call requests come back out; tool results are fed back in on the next chat() call. Streaming is intentionally out of scope (room left for a future chat_stream).
  • register_llm_provider hookspec (firstresult=True) + bootstrap_llm_provider(config) helper, mirroring the existing single-provider hooks (register_input_stream, register_execution_result_store). bootstrap_llm_provider returns None when no plugin registers one, so callers null-check.
  • Example provider in example_plugins (llm/anthropic_provider.py) that talks to the Anthropic Messages API directly, including tool_use request/response translation and optional prompt-cache control. Registered via register_llm_provider in example_plugins/src/register_plugins.py.
  • Tests: dataclass invariants + interface contract (osprey_worker/.../lib/llm/tests/test_base.py), and a full tool-call cycle against a fake Anthropic client — no SDK, API key, or network (example_plugins/src/llm/tests/test_anthropic_provider.py). example_plugins/src is added to pytest testpaths so example tests run.
  • Docs: hook menu table + an "LLM provider hook" section in docs/DEVELOPMENT.md.

Design notes / open questions

  • Tool calling shape. The dataclasses are a proposal; happy to refine names/shape. The key constraint was making tool calling first-class rather than bolted on later.
  • CacheControl generality. Included because a downstream consumer wants to keep Anthropic prompt-cache breakpoints (ephemeral + TTL). It's optional and vendor-neutral; if folks prefer, it could move into provider-specific **params instead.
  • firstresult=True. Only one LLM provider can register, matching register_input_stream. Reasonable for now; noted in the docstring.

Validation

Locally green: ruff check, ruff format --diff, mypy . (no issues), full pre-commit run --all-files (prettier skipped, as in CI), and fawltydeps --check-unused. The example provider tests pass directly; the lib/llm tests run under the docker integration harness like the rest of osprey_worker.

Introduce a vendor-neutral BaseLLMProvider interface (osprey.worker.lib.llm)
designed for tool calling from day one: tool definitions in, tool-call
requests out, tool results back in. Add a register_llm_provider pluggy
hookspec (firstresult=True) and a bootstrap_llm_provider() helper, mirroring
the existing single-provider hooks (register_input_stream,
register_execution_result_store).

Add an example provider in example_plugins that talks to the Anthropic
Messages API directly, demonstrating the request/response and tool_use
translation. The anthropic SDK is imported lazily and is not declared as a
workspace dependency (it conflicts with the pinned typing-extensions), so it
stays optional and CI needs no SDK, API key, or network.

Includes unit tests (dataclass invariants + a full tool-call cycle against a
fake Anthropic client) and docs for the new hook in docs/DEVELOPMENT.md.
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4b80f64a-9deb-4a09-bee1-edc1fbddd533

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch llm-provider-hook

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread osprey_worker/src/osprey/worker/lib/llm/base.py Fixed
haileyok added 3 commits June 4, 2026 07:15
The RuntimeError referenced an example_plugins[llm] extra that no longer
exists (anthropic isn't a declared dependency due to the typing-extensions
conflict). Point users at the manual install instead.
- Preserve system-prompt cache_control: _build_system now emits the Anthropic
  block form (list of text blocks with cache_control) when any system part
  carries a breakpoint, instead of always flattening to a string that drops it.
- Avoid falsy-or traps: model/max_tokens use 'is not None' so explicit '' / 0
  pass through instead of silently becoming defaults (matches temperature).
- Degrade gracefully on a non-dict tool_use 'input' instead of crashing.
- Extract _cache_control_dict helper (de-dupes the message/system cache mapping)
  and note the extended-cache-ttl beta-header requirement for non-default ttl.
- Clarify bootstrap_llm_provider docstring: registered != ready (SDK/creds may
  be validated lazily on first chat()).
- Make the missing-SDK test deterministic via monkeypatch (sys.modules) so it
  no longer depends on anthropic being absent now that example_plugins is in
  CI testpaths; add regression tests for the three fixes above.
- Module docstring no longer references the removed example_plugins[llm]
  extra; points to manual 'uv pip install anthropic' and the pyproject NOTE.
- base.py chat() docstring clarifies that to cache the system prompt you pass
  a role='system' message with cache_control (the plain system= arg can't).
Comment thread osprey_worker/src/osprey/worker/lib/llm/base.py Fixed
Comment thread docs/DEVELOPMENT.md Outdated
Comment thread docs/DEVELOPMENT.md Outdated
Comment thread example_plugins/src/llm/anthropic_provider.py
Comment thread example_plugins/src/llm/anthropic_provider.py Outdated
Comment thread example_plugins/pyproject.toml Outdated
Comment thread example_plugins/src/register_plugins.py Outdated
Comment thread osprey_worker/src/osprey/worker/adaptor/hookspecs/osprey_hooks.py Outdated
Comment thread pyproject.toml Outdated
haileyok added 2 commits June 4, 2026 18:32
- Make anthropic a real dependency of example_plugins by bumping the root
  typing-extensions pin (==4.6.3 -> >=4.14.0; resolves to 4.15.0). Removes the
  'not declared / install manually' workaround (pyproject NOTE, mypy override
  comment, anthropic.* mypy entry, DEVELOPMENT.md manual-install note).
- Use the standard OSPREY_ prefix for config keys
  (OSPREY_LLM_ANTHROPIC_{API_KEY,MODEL,MAX_TOKENS}).
- Default to claude-sonnet-4-6 (verified current Sonnet 4.6 API id).
- Don't register an LLM provider by default; an LLM provider is optional and
  may be unregistered (bootstrap_llm_provider now returns None here).
- base.py chat(): raise NotImplementedError instead of a no-effect '...'
  (CodeQL 'statement has no effect').
- Drop 'mirroring other hooks' language from the hookspec docstring; drop a
  stray 'e.g.' in DEVELOPMENT.md.
- Tests updated for OSPREY_ keys; drop the now-obsolete missing-SDK test.
Vendor-neutral sugar on top of the provider interface, in osprey.worker.lib.llm:

- tools.py: ToolParameter + a ToolRegistry.@tool decorator that compiles to the
  existing ToolDefinition (JSON Schema). Handlers are plain sync callables (no
  imposed context object); dispatch(ToolCall) runs the handler and captures
  unknown-tool / bad-args / handler errors as ToolResult(is_error=True) to feed
  back to the model. registry.definitions() feeds provider.chat(tools=...).
- loop.py: run_tool_loop drives the standard exchange (chat -> dispatch tool_calls
  -> append role='assistant' tool-use + role='tool' results -> repeat) until the
  model returns a final answer, or raises ToolLoopLimitExceeded past max_iterations.
  Input messages are not mutated.
- Exported from llm/__init__.py; documented in DEVELOPMENT.md; unit tests for
  schema building, registration, dispatch (success/error/unknown), and the loop
  (tool cycle, no-op, limit, no-mutation).
Comment thread docs/DEVELOPMENT.md Outdated
Comment thread example_plugins/src/register_plugins.py Outdated
haileyok added 2 commits June 4, 2026 19:09
- Move the LLM provider hook + tool-calling docs out of DEVELOPMENT.md into a
  dedicated docs/llm.md (linked from the hook table and added to SUMMARY.md).
- Trim the register_plugins note to just state an LLM provider can be registered
  here, dropping the rationale.
- loop.py: don't dispatch tools on the final, limit-tripping iteration —
  side effects whose results can never be sent back to the model are no longer
  run before raising ToolLoopLimitExceeded.
- tools.py: serialize the handler result inside dispatch's try/except so a
  non-serialisable/circular return becomes is_error instead of crashing the loop;
  document that error text reaches the model (no secrets) and the default=None
  sentinel.
- typing-extensions: pin exact ==4.15.0 (matches repo convention + uv.lock)
  instead of an unbounded floor.
- anthropic_provider: correct stale cache-ttl comment (1h TTL is GA, no beta
  header needed).
- Tests: loop does not dispatch on the final iteration; dispatch captures an
  unserialisable result as is_error.

@haileyok haileyok left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude:3 Code review — two-reviewer pass (standard + adversarial), iterated to clean.

Verdict: No outstanding bugs. Both reviewers independently approved the final state; all Medium findings from the first pass are fixed with regression tests.

Findings & resolutions

Severity Where Finding Resolution
Medium lib/llm/loop.py run_tool_loop dispatched tools on the final, limit-tripping iteration and discarded the results — running side effects that can never be sent back to the model Fixed: break before dispatching on the last allowed iteration, then raise ToolLoopLimitExceeded; regression test added
Medium lib/llm/tools.py dispatch serialized the handler result outside the try/except, so a non-serialisable/circular return crashed run_tool_loop instead of becoming is_error Fixed: serialize inside the try; regression test added
Medium pyproject.toml typing-extensions was changed to an unbounded floor (>=4.14.0) — a shared, monorepo-wide dep, inconsistent with the exact-pin convention Pinned exact ==4.15.0 (matches uv.lock)
Low anthropic_provider.py Stale comment claimed the 1h cache TTL needs a beta header Corrected — 1h TTL is GA on the Claude API (set via the ttl field, no header)
Low lib/llm/tools.py Handler exception text is echoed to the model (secret-leak vector); default=None sentinel undocumented Documented both
Low (accepted) anthropic_provider.py Benign lazy-client init race under gevent (last-writer-wins) Accepted as harmless
Low (accepted) example_plugins/pyproject.toml anthropic>=0.40.0 floor vs exact-pin convention Left as a floor: example-plugin dep, reproducibility comes from uv.lock

What's good (both reviewers)

Clean vendor-neutral interface; tests pin real contracts (exact Anthropic request shape, role mapping, full tool_use→tool_result cycle, usage remap, non-dict input degradation, system cache_control block form); correct firstresult=True/None bootstrap semantics; the tool loop doesn't mutate the caller's messages and terminates correctly; well-scoped dependency/mypy changes.

All local CI gates pass: ruff, ruff format, mypy, full pre-commit, fawltydeps, and the example-provider unit tests.

Reviewers ran on the general model (the configured gpt-5.5 reviewer route was failing auth at review time).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant