diff --git a/CHANGELOG.md b/CHANGELOG.md index c7afba2..2690200 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -14,10 +14,11 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The - **Per-fetch prompt cache control: `cache_ttl_seconds`** (proposal 0072, prompt-management §5 / §6, spec v0.63.0). `PromptBackend.fetch`, `PromptManager.fetch`, and `PromptManager.get` gain an optional `cache_ttl_seconds` read-side control: `None` preserves current behavior, `0` forces a fresh read past any client-side cache, and `N > 0` bounds a served entry's staleness to N seconds; a negative value is rejected at the manager. It governs only which cached entry may be served, not whether or how results are cached. The bundled filesystem backend is cacheless and ignores it; the bundled Langfuse backend forwards it to the Langfuse SDK's `get_prompt` cache. Conformance fixtures 033/034 run through a caching harness backend (conformance-adapter §6.8: `source_read_count` plus a controllable `advance_clock`). - **Failure-isolation `catch` gate + cause-chain classification primitive** (proposal 0074, pipeline-utilities §6.3 / §6.4, spec v0.65.0). `FailureIsolationMiddleware` gains an optional `catch`: a set of error categories. An exception is caught only if the *derived category* of its cause chain (the outermost non-carrier link's category, resolved through the engine's `node_exception` carriers, the same value reported as `caught_exception.category`) is in the set. This closes a degrade-into-crash footgun: at a wrapping placement (subgraph, fan-out instance, branch) the engine wraps the originating failure in a carrier, so a `predicate` inspecting the surface exception sees only the carrier and misses it, whereas `catch` classifies through the carrier. `catch` composes with `predicate` as a conjunction; both default permissive (both unset stays catch-all), and a null derived category never matches a non-empty set. 
The carrier-skipping walk behind `catch` and `caught_exception` is promoted to a public primitive, `classify_cause_chain(exc) -> CaughtException` (the ordered `chain`, the derived `category`, and its `message` — the same record the event carries), exported from `openarmature.graph` for use in a custom `predicate`, a router, a metric, or a full-chain retry classifier. The default retry classifier stays deliberately single-level (it classifies at re-attempt granularity); this is now documented, with no behavior change. Conformance fixture 072 (catch matches through an instance-placement carrier and degrades; a non-matching catch propagates with no event). The optional native-exception-type `catch` form (spec MAY) is not shipped. - **Inline-callable parallel branches and conditional `when`** (proposal 0075, pipeline-utilities §11, spec v0.66.0). `ParallelBranchesNode` gains two additive branch forms. A branch may now give its work as `call`, an inline async function over the parent state returning a parent-shaped partial update, instead of a compiled `subgraph` with its own state schema and `inputs` / `outputs` projection; the returned partial is the branch's contribution directly, merged via the parent reducer with no projection. This makes the primitive adoptable for the "M heterogeneous lightweight parallel calls over shared state, each independently failure-isolated" shape (hybrid recall, paired reads) that previously dropped to a hand-rolled gather, while reusing the existing concurrency, fail-fast cancellation, per-branch failure isolation, and reducer fan-in. A branch gives its work as exactly one of `subgraph` / `call`, and a callable branch declares no `inputs` / `outputs`, else a new compile-time `ParallelBranchesInvalidBranchSpec`; a node may mix the two forms freely. 
A branch (either form) may also carry an optional `when` predicate over the parent state, evaluated once at dispatch: a `False` result skips the branch entirely (no dispatch, contribution, observer events, or span), and an all-skipped node is a valid no-op distinct from the compile-time `ParallelBranchesNoBranches`. A callable branch is the unit of work, so it emits one `started` / `completed` observer pair keyed by `branch_name` (rendered as a single branch span); a skipped branch emits nothing. `ParallelBranchesInvalidBranchSpec` is exported from `openarmature.graph`. Conformance fixtures 073 (two callable branches merge to disjoint fields), 074 (conditional `when` skips / dispatches), and 075 (callable branch failure-isolation degrade) run in `test_pipeline_utilities`. +- **Tool-call request observability on LLM spans** (proposal 0076, observability §5.5.1 / §5.5.10 / §5.5.5, spec v0.67.0). The tool calls a model requests in its completion now have an output-side home on the `openarmature.llm.complete` span, closing the gap where they surfaced only incidentally on the next turn's input history. *Which* tools were requested renders by default as three ungated identity projections (the class of `openarmature.llm.model`): `openarmature.llm.output.tool_calls.count`, `.names`, and `.ids`, with `.names` and `.ids` index-aligned in request order and `.count` equal to their length. The full request, arguments included, renders as the payload-gated `openarmature.llm.output.tool_calls`, a JSON `[{id, name, arguments}]` array reusing the input tool-call encoding, surfaced only with `disable_provider_payload=False`. The whole family is emitted only on a tool-calling completion; a completion that requests no tools emits none of it (absence, not `count = 0`). 
The typed `LlmCompletionEvent` gains an additive `output_tool_calls` field carrying the `ToolCall` records, the source the span attributes render from (in python the OTel span renders from the per-attempt `LlmRetryAttemptEvent`, which carries the field too). This is the request side; the tool-execution complement (a separate `openarmature.tool.call` span) is a later proposal, joined to this one by the `ToolCall.id`. A Langfuse request-side mapping is out of scope. Conformance fixtures 085 (two requested calls surface count / names / ids), 086 (no calls, family absent), and 087 (payload gating: identity survives payload-off while the full serialization is suppressed) run in `test_observability`. ### Changed -- **Pinned spec advances v0.60.0 → v0.66.1** across the v0.15.0 cycle: v0.61.0 (proposal 0061, the detached-trace invocation span above), v0.62.0 (proposal 0064, the Langfuse session/user population above), v0.63.0 (proposal 0072, the prompt cache control above), the v0.63.1 patch (pipeline-utilities coverage fixtures 070/071 for the already-implemented 0069 / 0070 behavior, no new proposal), and v0.64.0 (proposal 0073, GenAI semconv adoption reconciliation: OA retains `gen_ai.system` despite the upstream rename to `gen_ai.provider.name`; textual-only, with no emitted-attribute or fixture change, so the existing `gen_ai.*` fixtures stand as the retention regression), v0.65.0 (proposal 0074, the failure-isolation `catch` gate above), v0.66.0 (proposal 0075, the inline-callable parallel branches and conditional `when` above), and the v0.66.1 patch (an observability §8 call-level-retry Langfuse-mapping clarification reconciling §8 with the per-attempt §5.5 spans: one terminal Generation per `complete()` call, not one per attempt, which the Langfuse observer already renders by driving the Generation from the terminal `LlmCompletionEvent` / `LlmFailedEvent` and skipping the per-attempt `LlmRetryAttemptEvent`; no behavior or fixture change). 
`conformance.toml` records 0061 / 0072 / 0074 / 0075 `implemented`, 0064 `partial` (its `sessionId` half is dormant pending the sessions capability), and 0073 `textual-only`. Proposal 0050 needed no pin bump of its own (it was already within the pin from its v0.42.0 acceptance); its v0.14.0 `partial` entry flips to `implemented` with the per-attempt span surface above. +- **Pinned spec advances v0.60.0 → v0.67.0** across the v0.15.0 cycle: v0.61.0 (proposal 0061, the detached-trace invocation span above), v0.62.0 (proposal 0064, the Langfuse session/user population above), v0.63.0 (proposal 0072, the prompt cache control above), the v0.63.1 patch (pipeline-utilities coverage fixtures 070/071 for the already-implemented 0069 / 0070 behavior, no new proposal), v0.64.0 (proposal 0073, GenAI semconv adoption reconciliation: OA retains `gen_ai.system` despite the upstream rename to `gen_ai.provider.name`; textual-only, with no emitted-attribute or fixture change, so the existing `gen_ai.*` fixtures stand as the retention regression), v0.65.0 (proposal 0074, the failure-isolation `catch` gate above), v0.66.0 (proposal 0075, the inline-callable parallel branches and conditional `when` above), the v0.66.1 patch (an observability §8 call-level-retry Langfuse-mapping clarification reconciling §8 with the per-attempt §5.5 spans: one terminal Generation per `complete()` call, not one per attempt, which the Langfuse observer already renders by driving the Generation from the terminal `LlmCompletionEvent` / `LlmFailedEvent` and skipping the per-attempt `LlmRetryAttemptEvent`; no behavior or fixture change), and v0.67.0 (proposal 0076, the tool-call request observability above). `conformance.toml` records 0061 / 0072 / 0074 / 0075 / 0076 `implemented`, 0064 `partial` (its `sessionId` half is dormant pending the sessions capability), and 0073 `textual-only`. 
Proposal 0050 needed no pin bump of its own (it was already within the pin from its v0.42.0 acceptance); its v0.14.0 `partial` entry flips to `implemented` with the per-attempt span surface above. ## [0.14.0] — 2026-06-17 diff --git a/conformance.toml b/conformance.toml index 59098a9..c536ba9 100644 --- a/conformance.toml +++ b/conformance.toml @@ -32,7 +32,7 @@ [manifest] implementation = "openarmature-python" -spec_pin = "v0.66.1" +spec_pin = "v0.67.0" # Status values: # implemented — shipped behavior matches the proposal's contract @@ -731,3 +731,11 @@ note = "FailureIsolationMiddleware gains an optional `catch` set of error catego status = "implemented" since = "0.15.0" note = "ParallelBranchesNode gains two additive branch forms. (1) Inline-callable branches (§11.1.1): a BranchSpec may give its work as `call` (an async function over the parent state returning a parent-shaped partial update) instead of a compiled `subgraph` + inputs/outputs projection; the contribution is the returned partial directly, merged via the parent reducer with no projection (§11.4). Exactly one of subgraph/call per branch, and a callable branch declares no inputs/outputs, else parallel_branches_invalid_branch_spec (a new compile-time category); a node MAY mix subgraph and callable branches. Per-leg failure isolation on a callable branch is the existing §11.7 branch-middleware contract (wrap the callable in FailureIsolationMiddleware). (2) Conditional branches (§11.10): a BranchSpec may carry an optional `when` predicate (parent_state) -> bool, evaluated once at dispatch; false skips the branch entirely (no dispatch, contribution, observer events, or span). All-branches-skipped is a valid no-op, distinct from the compile-time parallel_branches_no_branches (empty declared mapping). 
graph-engine §6 / observability §5.7: a callable branch is the unit -- it emits one started/completed pair keyed by branch_name (rendered as a branch span via the existing §5.7 machinery), a skipped branch emits nothing. Fixtures 073 (two callable branches merge to disjoint fields), 074 (when false skips / true dispatches), 075 (callable branch + FailureIsolationMiddleware degrades, sibling completes, category resolves through the chain)." + +# Spec v0.67.0 (proposal 0076). Tool-call request observability on the +# LLM completion span (observability §5.5.1 / §5.5.10 / §5.5.5, +# graph-engine §6). +[proposals."0076"] +status = "implemented" +since = "0.15.0" +note = "The model's output tool calls get an output-side home on the openarmature.llm.complete span. observability §5.5.10 adds the UNGATED identity projections openarmature.llm.output.tool_calls.count / .names / .ids (the class of openarmature.llm.model / attempt_index; emitted only on a tool-calling completion, omitted entirely otherwise -- not count=0); .names and .ids are index-aligned in request order, .count equals their length. §5.5.1 adds the GATED openarmature.llm.output.tool_calls, the full [{id, name, arguments}] serialization (reusing the §5.5.5 input tool-call encoding) carrying the arguments, suppressed under disable_provider_payload and subject to the truncation contract. graph-engine §6: LlmCompletionEvent gains an output_tool_calls field (the ToolCall records, populated unconditionally). python carries the field on BOTH the terminal LlmCompletionEvent (spec-conformance + the Langfuse/consumer path) and the python-internal per-attempt LlmRetryAttemptEvent, and the OTel observer renders the span attributes from the per-attempt event (the LLM-span source since 0050) -- mirroring how output_content already works. OA-namespace, no gen_ai.* mirror (the attempt_index precedent). Langfuse request-side mapping is OUT OF SCOPE (proposal defers it as future work); no Langfuse change. 
Fixtures 085 (two calls -> count/names/ids), 086 (no calls -> family absent), 087 (payload-gating: identity survives off, gated full present only on)." diff --git a/docs/concepts/observability.md b/docs/concepts/observability.md index 0d3f63e..0df2254 100644 --- a/docs/concepts/observability.md +++ b/docs/concepts/observability.md @@ -739,13 +739,19 @@ observer = OTelObserver( ) ``` -This surfaces three attributes: +This surfaces four attributes: - `openarmature.llm.input.messages`: JSON-encoded message array (the spec §3 message shape: `{role, content, tool_calls?, …}`). - `openarmature.llm.output.content`: the assistant's response content string verbatim. Omitted for tool-call-only responses with empty content. +- `openarmature.llm.output.tool_calls`: JSON-encoded `[{id, name, + arguments}]` array of the tool calls the model requested (the same + encoding `tool_calls` uses inside `input.messages`). This is the + output-side home for the request, including the call arguments, so + it is payload-gated. Emitted only when the response requests tool + calls. - `openarmature.llm.request.extras`: JSON-encoded `RuntimeConfig` extras bag (provider-specific pass-through fields like `repetition_penalty` for vLLM, or `top_k` for HuggingFace @@ -757,6 +763,29 @@ observability. The flag name keeps symmetry with `disable_llm_spans`: the default value (`True`) reads as "the observer disables payload emission by default." +#### Output tool-call identity (ungated) + +The full `openarmature.llm.output.tool_calls` carries the arguments, so +it is payload-gated. But *which* tools the model asked for (their +names and ids) is identity, not payload, the same class as +`openarmature.llm.model`. 
So three identity projections render +**regardless** of `disable_provider_payload`, surfacing the request +under the default payload-off posture and queryable without parsing +JSON: + +- `openarmature.llm.output.tool_calls.count`: the number of tool calls + requested (an int, equal to the length of `.names`). +- `openarmature.llm.output.tool_calls.names`: the requested tool names, + in request order. +- `openarmature.llm.output.tool_calls.ids`: the requested `ToolCall` + ids, index-aligned with `.names` (`names[i]` / `ids[i]` describe the + same call), the linkage to a downstream tool execution. + +The whole family (these three plus the gated full serialization) is +emitted **only** on a tool-calling completion. A completion that +requests no tools emits none of them; absence means "no tools +requested", distinct from `count = 0`. + #### Truncation Each payload attribute is capped at `payload_max_bytes` UTF-8 bytes diff --git a/docs/model-providers/authoring.md b/docs/model-providers/authoring.md index 361b38f..56ccdd8 100644 --- a/docs/model-providers/authoring.md +++ b/docs/model-providers/authoring.md @@ -305,6 +305,7 @@ of: finish_reason=response.finish_reason, input_messages=serialized_messages, output_content=response.message.content or None, + output_tool_calls=list(response.message.tool_calls or []), request_params=request_params, request_extras=request_extras, active_prompt=None, diff --git a/openarmature-spec b/openarmature-spec index 451a579..f68c64a 160000 --- a/openarmature-spec +++ b/openarmature-spec @@ -1 +1 @@ -Subproject commit 451a5799ad81b57f3f5479bc694917a66fa6eaa7 +Subproject commit f68c64a19b44461708b9310a00012771e70e279b diff --git a/pyproject.toml b/pyproject.toml index df6178b..ec0a334 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -63,7 +63,7 @@ Specification = "https://github.com/LunarCommand/openarmature-spec" openarmature = "openarmature.cli:main" [tool.openarmature] -spec_version = "0.66.1" +spec_version = "0.67.0" 
[dependency-groups] dev = [ diff --git a/src/openarmature/AGENTS.md b/src/openarmature/AGENTS.md index c2aa44d..4621a0e 100644 --- a/src/openarmature/AGENTS.md +++ b/src/openarmature/AGENTS.md @@ -1,6 +1,6 @@ # OpenArmature — Agent documentation -*This is the agent guide bundled with the openarmature Python package, version 0.14.0 (spec v0.66.1). For the full docs site see [openarmature.ai](https://openarmature.ai). For the canonical spec text see [openarmature.org/capabilities](https://openarmature.org/capabilities/). For project-specific conventions for the code you're editing, see the host project's `AGENTS.md` or `CLAUDE.md`.* +*This is the agent guide bundled with the openarmature Python package, version 0.14.0 (spec v0.67.0). For the full docs site see [openarmature.ai](https://openarmature.ai). For the canonical spec text see [openarmature.org/capabilities](https://openarmature.org/capabilities/). For project-specific conventions for the code you're editing, see the host project's `AGENTS.md` or `CLAUDE.md`.* ## TL;DR @@ -10,7 +10,7 @@ OpenArmature is a workflow framework for LLM pipelines and tool-calling agents: ## Capability contracts -_Sourced from openarmature-spec v0.66.1. Each entry below reproduces §1 (Purpose) and §2 (Concepts) of the capability's `spec.md` verbatim — including additions from accepted proposals that this Python implementation may not yet ship. For per-proposal implementation status (implemented / partial / textual-only / not-yet), see the `conformance.toml` manifest at the repo root. For the full spec text (execution model, error semantics, determinism, observer hooks, etc.) see the linked docs site._ +_Sourced from openarmature-spec v0.67.0. Each entry below reproduces §1 (Purpose) and §2 (Concepts) of the capability's `spec.md` verbatim — including additions from accepted proposals that this Python implementation may not yet ship. 
For per-proposal implementation status (implemented / partial / textual-only / not-yet), see the `conformance.toml` manifest at the repo root. For the full spec text (execution model, error semantics, determinism, observer hooks, etc.) see the linked docs site._ ### Capability: `graph-engine` diff --git a/src/openarmature/__init__.py b/src/openarmature/__init__.py index e4302f3..276de75 100644 --- a/src/openarmature/__init__.py +++ b/src/openarmature/__init__.py @@ -25,7 +25,7 @@ """ __version__ = "0.14.0" -__spec_version__ = "0.66.1" +__spec_version__ = "0.67.0" # Proposal 0052 (spec observability §5.1 / §8.4.1): canonical # package-registry name for this implementation. Surfaces on every # OTel invocation span as ``openarmature.implementation.name`` and on diff --git a/src/openarmature/graph/events.py b/src/openarmature/graph/events.py index 580ea6a..efb2df6 100644 --- a/src/openarmature/graph/events.py +++ b/src/openarmature/graph/events.py @@ -30,6 +30,7 @@ # plus a string annotation on LlmCompletionEvent.usage avoids the # circular runtime import while keeping pyright type-safe. if TYPE_CHECKING: + from openarmature.llm.messages import ToolCall from openarmature.llm.response import Usage # Sentinel empty metadata mapping for events constructed without a @@ -512,6 +513,11 @@ class LlmCompletionEvent: from the response. ``None`` on tool-call-only responses (the structured-response and tool-call paths are mutually exclusive at the response level). + - ``output_tool_calls``: the assistant message's output tool + calls (the ``ToolCall`` records). Populated unconditionally; + empty list when the response carried no tool calls. The output + tool calls live here rather than in ``output_content`` (which + is the response text and is empty on a tool-call-only response). - ``request_params``: the GenAI request-parameter set the caller supplied. Absence-is-meaningful: only caller-supplied keys appear; empty mapping when none supplied. 
Keys are the @@ -576,6 +582,17 @@ class LlmCompletionEvent: active_prompt_group: Any call_id: str caller_invocation_metadata: Mapping[str, AttributeValue] | None = None + # Proposal 0076 (spec v0.67.0): the assistant message's output tool + # calls in typed-event-native form (the ToolCall records, not a + # pre-serialized shape — they carry no inline-image bytes, so the + # input_messages redaction-driven pre-serialization doesn't apply). + # Populated unconditionally by the provider; empty list when the + # response carried no tool calls. Source for the §5.5.1 gated + # ``openarmature.llm.output.tool_calls`` serialization + the §5.5.10 + # ungated ``.count`` / ``.names`` / ``.ids`` identity projections. + # Defaulted (default_factory) so existing kwargs-constructors that + # predate this field keep working; the provider always populates it. + output_tool_calls: list["ToolCall"] = field(default_factory=list["ToolCall"]) # Spec: realizes proposal 0058's second spec-normatively-typed event @@ -688,8 +705,11 @@ class LlmRetryAttemptEvent: ``request_extras`` / ``active_prompt`` / ``active_prompt_group``) mirror :class:`LlmCompletionEvent`, carried on every attempt. - response side (``response_id`` / ``response_model`` / ``usage`` / - ``finish_reason`` / ``output_content``): populated on a successful - attempt; ``None`` on a failed attempt. + ``finish_reason`` / ``output_content`` / ``output_tool_calls``): + populated on a successful attempt; ``None`` / empty list on a + failed attempt. ``output_tool_calls`` is the source the OTel + observer renders the §5.5.1 / §5.5.10 output tool-call attributes + from (this is the per-attempt event that drives the LLM span). - failure side (``error_category`` / ``error_message`` / ``error_type``): populated on a failed attempt; ``None`` on a successful one. 
@@ -721,6 +741,12 @@ class LlmRetryAttemptEvent: error_message: str | None = None error_type: str | None = None caller_invocation_metadata: Mapping[str, AttributeValue] | None = None + # Proposal 0076: the attempt's output tool calls (ToolCall records), + # mirroring LlmCompletionEvent.output_tool_calls. Populated on a + # successful attempt; empty list on a failed one (no response). The + # OTel observer renders the output tool-call span attributes from + # this field (the per-attempt event is the LLM-span source). + output_tool_calls: list["ToolCall"] = field(default_factory=list["ToolCall"]) # Spec: realizes pipeline-utilities §6.3 failure-isolation middleware diff --git a/src/openarmature/llm/providers/openai.py b/src/openarmature/llm/providers/openai.py index 32eb00d..f0357b9 100644 --- a/src/openarmature/llm/providers/openai.py +++ b/src/openarmature/llm/providers/openai.py @@ -78,6 +78,7 @@ current_invocation_id, current_namespace_prefix, ) +from openarmature.observability.llm_event import serialize_tool_calls from openarmature.observability.metadata import AttributeValue, current_invocation_metadata # ``current_prompt_group`` / ``current_prompt_result`` are imported @@ -706,6 +707,9 @@ def _build_llm_completion_event( finish_reason=response.finish_reason, input_messages=input_messages, output_content=output_content, + # Proposal 0076: the model's output tool calls, populated + # unconditionally (empty list on a no-tool completion). + output_tool_calls=list(response.message.tool_calls or []), request_params=request_params, request_extras=request_extras, active_prompt=active_prompt, @@ -830,6 +834,10 @@ def _build_llm_retry_attempt_event( usage=response.usage, finish_reason=response.finish_reason, output_content=response.message.content or None, + # Proposal 0076: the attempt's output tool calls — the + # OTel observer renders the output tool-call span + # attributes from this per-attempt event. 
+ output_tool_calls=list(response.message.tool_calls or []), ) if exc is None: raise ValueError("_build_llm_retry_attempt_event requires response or exc") @@ -1689,9 +1697,7 @@ def _serialize_messages_for_payload(messages: Sequence[Message]) -> list[dict[st elif isinstance(msg, AssistantMessage): entry: dict[str, Any] = {"role": "assistant", "content": msg.content} if msg.tool_calls: - entry["tool_calls"] = [ - {"id": tc.id, "name": tc.name, "arguments": tc.arguments} for tc in msg.tool_calls - ] + entry["tool_calls"] = serialize_tool_calls(msg.tool_calls) out.append(entry) else: # ToolMessage out.append({"role": "tool", "content": msg.content, "tool_call_id": msg.tool_call_id}) diff --git a/src/openarmature/observability/llm_event.py b/src/openarmature/observability/llm_event.py index 600e146..7e29a38 100644 --- a/src/openarmature/observability/llm_event.py +++ b/src/openarmature/observability/llm_event.py @@ -35,10 +35,15 @@ from __future__ import annotations -from typing import Any +from typing import TYPE_CHECKING, Any from pydantic import BaseModel, ConfigDict, Field +if TYPE_CHECKING: + from collections.abc import Sequence + + from openarmature.llm.messages import ToolCall + # Sentinel namespace the LLM provider emits to signal "this is an LLM # event, not a regular node event." Backend mappings (the OTel observer # in this repo, future Langfuse / Datadog adapters) recognise this @@ -142,4 +147,19 @@ class LlmEventPayload(BaseModel): caller_invocation_metadata: dict[str, Any] = Field(default_factory=dict) -__all__ = ["LLM_NAMESPACE", "LlmEventPayload"] +def serialize_tool_calls(tool_calls: Sequence[ToolCall]) -> list[dict[str, Any]]: + """The observability §5.5.5 tool-call serialization, + ``[{id, name, arguments}, ...]``. 
+ + The single home for the encoding, shared by the input-message + payload (the provider's ``input.messages`` serialization, where the + model's tool calls appear inside replayed assistant history) and the + output tool-call attribute (the OTel observer's gated + ``openarmature.llm.output.tool_calls``). Lives here rather than in a + provider or observer module so both sides import one definition and + the encoding can't drift between them. + """ + return [{"id": tc.id, "name": tc.name, "arguments": tc.arguments} for tc in tool_calls] + + +__all__ = ["LLM_NAMESPACE", "LlmEventPayload", "serialize_tool_calls"] diff --git a/src/openarmature/observability/otel/observer.py b/src/openarmature/observability/otel/observer.py index 364bee0..7fd5e6b 100644 --- a/src/openarmature/observability/otel/observer.py +++ b/src/openarmature/observability/otel/observer.py @@ -108,6 +108,7 @@ NodeEvent, ) from openarmature.observability.lineage import is_strict_prefix +from openarmature.observability.llm_event import serialize_tool_calls # Span-stack key shape: # ``(namespace, attempt_index, fan_out_index, branch_name)`` — these @@ -1359,6 +1360,37 @@ def _handle_typed_llm_retry_attempt(self, event: LlmRetryAttemptEvent) -> None: if not self.disable_provider_payload and event.output_content: attrs_out = _truncate_for_attribute(event.output_content, self.payload_max_bytes) span.set_attribute("openarmature.llm.output.content", attrs_out) + # §5.5.10 ungated tool-call identity + §5.5.1 gated full + # serialization (proposal 0076). The identity projections + # (count / names / ids) are identifiers, not payload, so they + # render regardless of disable_provider_payload; the full + # [{id, name, arguments}] serialization carries the arguments and + # is gated. The whole family emits only on a tool-calling + # completion (>= 1 call) — absence means "no tools requested", + # per the §5.5 omit-when-empty convention. 
+ output_tool_calls = event.output_tool_calls + if output_tool_calls: + # .count / .names / .ids are identity, NOT payload, so they + # are deliberately untruncated: truncating would break the + # count == len(.names) invariant and the .names/.ids index- + # alignment, or sever a .id from its downstream tool execution. + # The backstop for a pathological call count is the OTel SDK's + # own SpanLimits, applied uniformly across all attributes. + span.set_attribute("openarmature.llm.output.tool_calls.count", len(output_tool_calls)) + span.set_attribute( + "openarmature.llm.output.tool_calls.names", + [tc.name for tc in output_tool_calls], + ) + span.set_attribute( + "openarmature.llm.output.tool_calls.ids", + [tc.id for tc in output_tool_calls], + ) + if not self.disable_provider_payload: + serialized_calls = _serialize_for_attribute(serialize_tool_calls(output_tool_calls)) + span.set_attribute( + "openarmature.llm.output.tool_calls", + _truncate_for_attribute(serialized_calls, self.payload_max_bytes), + ) span.set_status(Status(StatusCode.OK)) self._run_enrichers(span, event) span.end(end_time=end_time_ns) diff --git a/tests/conformance/test_observability.py b/tests/conformance/test_observability.py index 485576d..b4d8f01 100644 --- a/tests/conformance/test_observability.py +++ b/tests/conformance/test_observability.py @@ -158,6 +158,14 @@ def _reset_otel_global_tracer_provider(restore_to: object) -> None: # run; session-bound cases 1/5 defer until the sessions capability # (0020) supplies openarmature.session_id. "084-langfuse-session-user-promotion", + # v0.67.0 — proposal 0076 (tool-call request observability on the + # LLM span). The model's output tool calls surface as ungated + # identity (count / names / ids) plus a gated full serialization + # on openarmature.llm.complete. Driven through the generic + # LLM-payload fixture runner. 
+ "085-llm-tool-call-request-attributes", + "086-llm-tool-call-request-absent", + "087-llm-tool-call-request-survives-payload-gating", } ) @@ -267,6 +275,9 @@ async def test_observability_fixture(fixture_path: Path) -> None: "025-otel-llm-request-params-extended", "026-otel-caller-supplied-metadata", "057-llm-attempt-index-single-attempt-default", + "085-llm-tool-call-request-attributes", + "086-llm-tool-call-request-absent", + "087-llm-tool-call-request-survives-payload-gating", }: await _run_llm_payload_fixture(spec) else: @@ -2937,6 +2948,17 @@ def _walk(expected_entries: list[dict[str, Any]]) -> None: absent = entry.get("attributes_absent") if absent: assert_attributes_absent(attrs, cast("list[str]", absent)) + # ``attributes_present:`` list of names that MUST appear + # (presence-only, value not asserted). Fixture 087 case 2 + # uses this for the gated openarmature.llm.output.tool_calls, + # whose serialized value is checked structurally in the + # mirror unit test rather than bytewise here. + present = entry.get("attributes_present") + if present: + for attr_name in cast("list[str]", present): + assert attr_name in attrs, ( + f"span {name!r} MUST carry attribute {attr_name!r}; got {sorted(attrs)}" + ) # ``attribute_parses_as_messages:`` shape assertion. 
parses_as_messages = entry.get("attribute_parses_as_messages") if parses_as_messages: diff --git a/tests/test_smoke.py b/tests/test_smoke.py index d85f3a3..7ee0fb9 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -9,7 +9,7 @@ def test_package_versions() -> None: assert openarmature.__version__ == "0.14.0" - assert openarmature.__spec_version__ == "0.66.1" + assert openarmature.__spec_version__ == "0.67.0" def test_spec_version_matches_pyproject() -> None: diff --git a/tests/unit/test_llm_provider.py b/tests/unit/test_llm_provider.py index 05dc039..5a696fc 100644 --- a/tests/unit/test_llm_provider.py +++ b/tests/unit/test_llm_provider.py @@ -1337,6 +1337,68 @@ def _503(_req: httpx.Request) -> httpx.Response: assert failed_events[0].error_type == "ProviderUnavailable" +async def test_complete_populates_output_tool_calls_on_typed_events() -> None: + # Proposal 0076: provider.complete() populates output_tool_calls + # (the ToolCall records) on BOTH the terminal LlmCompletionEvent and + # the per-attempt LlmRetryAttemptEvent — the source the OTel + # observer renders the §5.5.1 / §5.5.10 output tool-call attributes + # from. The per-attempt event drives the LLM span; the terminal + # event carries the field for spec-conformance + the Langfuse path. 
+    def _tool_call_response(_req: httpx.Request) -> httpx.Response:
+        return httpx.Response(
+            200,
+            json={
+                "id": "cc-0076",
+                "object": "chat.completion",
+                "created": 1700000000,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {
+                            "role": "assistant",
+                            "content": None,
+                            "tool_calls": [
+                                {
+                                    "id": "call_a",
+                                    "type": "function",
+                                    "function": {"name": "get_weather", "arguments": '{"city": "NYC"}'},
+                                },
+                                {
+                                    "id": "call_b",
+                                    "type": "function",
+                                    "function": {"name": "get_time", "arguments": '{"tz": "EST"}'},
+                                },
+                            ],
+                        },
+                        "finish_reason": "tool_calls",
+                    }
+                ],
+                "usage": {"prompt_tokens": 8, "completion_tokens": 12, "total_tokens": 20},
+            },
+        )
+
+    events, token = _collecting_dispatch()
+    provider = OpenAIProvider(
+        base_url="http://test", model="m", api_key="k", transport=httpx.MockTransport(_tool_call_response)
+    )
+    try:
+        await provider.complete([UserMessage(content="weather and time?")])
+    finally:
+        await provider.aclose()
+        _release_dispatch(token)
+
+    completion = next(e for e in events if isinstance(e, LlmCompletionEvent))
+    attempt = next(e for e in events if isinstance(e, LlmRetryAttemptEvent))
+    # output_content is None on a tool-call-only response; the calls
+    # live in output_tool_calls instead.
+    assert completion.output_content is None
+    for ev in (completion, attempt):
+        assert [tc.name for tc in ev.output_tool_calls] == ["get_weather", "get_time"]
+        assert [tc.id for tc in ev.output_tool_calls] == ["call_a", "call_b"]
+        assert ev.output_tool_calls[0].arguments == {"city": "NYC"}
+
+
 # ---------------------------------------------------------------------------
 # Call-level retry (proposal 0050)
 # ---------------------------------------------------------------------------
diff --git a/tests/unit/test_observability_otel.py b/tests/unit/test_observability_otel.py
index 3a9448d..2fa80d6 100644
--- a/tests/unit/test_observability_otel.py
+++ b/tests/unit/test_observability_otel.py
@@ -765,6 +765,106 @@ async def test_llm_span_duration_matches_typed_event_latency() -> None:
     assert abs(duration_ms - latency_ms) < 1.0
 
 
+async def _drive_llm_span_with_tool_calls(
+    tool_calls: list[Any],
+    *,
+    disable_provider_payload: bool = True,
+) -> dict[str, Any]:
+    """Drive one per-attempt LLM event carrying ``output_tool_calls``
+    through the OTel observer; return the openarmature.llm.complete
+    span's attribute dict.
+    ``disable_provider_payload`` mirrors the
+    observer's default-on payload gate (the OTel span renders from the
+    per-attempt LlmRetryAttemptEvent)."""
+    from openarmature.observability.correlation import (
+        _reset_invocation_id,
+        _set_invocation_id,
+    )
+    from tests._helpers.typed_event import make_retry_attempt_event
+
+    exporter = InMemorySpanExporter()
+    observer = OTelObserver(
+        span_processor=SimpleSpanProcessor(exporter),
+        disable_provider_payload=disable_provider_payload,
+    )
+    token = _set_invocation_id("inv-tool-calls")
+    try:
+        await observer(
+            make_retry_attempt_event(
+                finish_reason="tool_calls" if tool_calls else "stop",
+                output_tool_calls=tool_calls,
+            )
+        )
+    finally:
+        _reset_invocation_id(token)
+    observer.shutdown()
+    llm_spans = [s for s in exporter.get_finished_spans() if s.name == "openarmature.llm.complete"]
+    assert len(llm_spans) == 1
+    return dict(llm_spans[0].attributes or {})
+
+
+async def test_llm_span_emits_output_tool_call_identity_projections() -> None:
+    # Proposal 0076 §5.5.10 (mirrors fixture 085): a completion
+    # requesting two tools emits count / names / ids on the span,
+    # index-aligned and in request order. The default payload-off
+    # posture applies, so the gated full serialization is absent.
+    from openarmature.llm.messages import ToolCall
+
+    attrs = await _drive_llm_span_with_tool_calls(
+        [
+            ToolCall(id="call_a", name="get_weather", arguments={"city": "NYC"}),
+            ToolCall(id="call_b", name="get_time", arguments={"tz": "EST"}),
+        ]
+    )
+    assert attrs.get("openarmature.llm.output.tool_calls.count") == 2
+    assert list(attrs.get("openarmature.llm.output.tool_calls.names") or ()) == ["get_weather", "get_time"]
+    assert list(attrs.get("openarmature.llm.output.tool_calls.ids") or ()) == ["call_a", "call_b"]
+    assert "openarmature.llm.output.tool_calls" not in attrs
+
+
+async def test_llm_span_omits_output_tool_calls_when_none_requested() -> None:
+    # Proposal 0076 (mirrors fixture 086): a completion with no tool
+    # calls emits NONE of the family — absence means "no tools
+    # requested", distinct from count = 0 / empty arrays.
+    attrs = await _drive_llm_span_with_tool_calls([])
+    for name in (
+        "openarmature.llm.output.tool_calls",
+        "openarmature.llm.output.tool_calls.count",
+        "openarmature.llm.output.tool_calls.names",
+        "openarmature.llm.output.tool_calls.ids",
+    ):
+        assert name not in attrs
+
+
+async def test_llm_span_output_tool_calls_payload_gating() -> None:
+    # Proposal 0076 §5.5.1 / §5.5.10 (mirrors fixture 087): the identity
+    # projections are ungated (render with payload off); the gated full
+    # [{id, name, arguments}] serialization is suppressed with payload
+    # off and present (carrying the arguments) with payload on.
+    import json
+
+    from openarmature.llm.messages import ToolCall
+
+    calls = [ToolCall(id="call_x", name="search_db", arguments={"q": "secret query"})]
+
+    off = await _drive_llm_span_with_tool_calls(calls, disable_provider_payload=True)
+    assert off.get("openarmature.llm.output.tool_calls.count") == 1
+    assert list(off.get("openarmature.llm.output.tool_calls.names") or ()) == ["search_db"]
+    assert list(off.get("openarmature.llm.output.tool_calls.ids") or ()) == ["call_x"]
+    assert "openarmature.llm.output.tool_calls" not in off
+
+    on = await _drive_llm_span_with_tool_calls(calls, disable_provider_payload=False)
+    assert on.get("openarmature.llm.output.tool_calls.count") == 1
+    assert list(on.get("openarmature.llm.output.tool_calls.names") or ()) == ["search_db"]
+    assert list(on.get("openarmature.llm.output.tool_calls.ids") or ()) == ["call_x"]
+    serialized = on.get("openarmature.llm.output.tool_calls")
+    assert isinstance(serialized, str)
+    # Parses to the §5.5.5 [{id, name, arguments}] encoding (structure,
+    # not bytewise — _serialize_for_attribute sorts keys).
+    assert json.loads(serialized) == [
+        {"id": "call_x", "name": "search_db", "arguments": {"q": "secret query"}}
+    ]
+
+
 async def test_llm_span_zero_duration_when_latency_missing() -> None:
     # When the typed event omits latency_ms (None), the handler falls
     # back to a zero-duration span at end_time rather than guessing