
feat(db): capture raw model output + accurate failure messages #72

Merged
Colinho22 merged 1 commit into main from fix-tier3-issues on Jun 20, 2026

Colinho22 (Owner) commented Jun 20, 2026

Summary

A tier-3 cross-provider check surfaced that failed cells discarded the model
output that caused the failure (e.g. gemini-3.5-flash returning malformed JSON
on large extractions), leaving only the error string. The actual output was
only recoverable from the provider's own console. This PR persists it.

Changes

  • raw_response column on run_results and sub_results: the unprocessed
    model output, retained even when the cell fails (when output_diagram_code /
    output_text is None) and across retries. Populated by all five providers
    and all three multi-step strategies. This makes every failure analysable
    after the run without re-calling the model.
  • Accurate empty-output failures: a provider returning success=False with
    no error string (an empty/blank diagram) is now recorded as "empty output
    from provider" instead of the misleading "No attempts executed".
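
As a rough illustration of the first change, the sketch below shows a failed cell that keeps its raw text while the cleaned output stays None. The `RunResult` here is a cut-down stand-in with only the fields discussed above, and `parse_cell` is a hypothetical helper invented for this example; neither is the project's actual code.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class RunResult:
    """Simplified stand-in for the real schema (assumption, not the project's class)."""

    success: bool
    output_diagram_code: str | None = None  # cleaned output, None when the cell fails
    error: str | None = None
    raw_response: str | None = None  # unprocessed model text, kept even on failure


def parse_cell(model_text: str) -> RunResult:
    """Hypothetical helper: validate the model text as JSON, keeping the raw text either way."""
    try:
        json.loads(model_text)
    except json.JSONDecodeError as exc:
        # The cleaned output is discarded, but the text that caused the failure survives.
        return RunResult(
            success=False,
            error=f"malformed JSON: {exc.msg}",
            raw_response=model_text,
        )
    return RunResult(success=True, output_diagram_code=model_text, raw_response=model_text)


failed = parse_cell('{"nodes": [1, 2')  # truncated JSON, as in the tier-3 failures
```

With this shape, `failed.output_diagram_code` is None while `failed.raw_response` still holds the truncated text, so the failure can be inspected without calling the model again.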

Design notes

  • raw_response stores the model text only, not the request envelope. HTTP
    status (2xx/4xx/5xx) is deliberately omitted: the SDKs do not expose it on the
    200-success path, and the failures of interest (malformed content) are HTTP
    200s, so a status column would be inferred and misleading. raw_response plus
    the existing error string fully capture each failure.
  • Schema is code; a fresh DB gets the columns automatically. The full 6000-cell
    DB is ~70-90 MB (current is 2.3 MB for 382 cells), well within SQLite's range.

Testing

  • New test_raw_response_survives_a_failed_cell: a failed cell round-trips
    through the DB keeping its raw output while the cleaned output is None.
  • Full suite: 259 passed. ruff clean.
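
The round-trip property that the new test checks can be sketched with the standard-library `sqlite3` module. This is not the project's test: the table below is a reduced, assumed version of `run_results`, and `roundtrip_failed_cell` is a hypothetical helper standing in for the real `insert_run_result` plus a read-back.

```python
import sqlite3


def roundtrip_failed_cell(raw: str, error: str) -> tuple:
    """Write a failed cell to an in-memory DB and read it back (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_results (
            id INTEGER PRIMARY KEY,
            output_diagram_code TEXT,  -- cleaned output, NULL on failure
            error TEXT,
            raw_response TEXT          -- nullable: unprocessed model text
        )
        """
    )
    # A failed cell: no cleaned output, but the raw text is still stored.
    conn.execute(
        "INSERT INTO run_results (output_diagram_code, error, raw_response) VALUES (?, ?, ?)",
        (None, error, raw),
    )
    row = conn.execute(
        "SELECT output_diagram_code, error, raw_response FROM run_results"
    ).fetchone()
    conn.close()
    return row


row = roundtrip_failed_cell('{"nodes": [', "malformed JSON")
```

The returned row carries None for the cleaned output and the original raw string unchanged, which is the behaviour the test name describes.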

Note

This is a schema change. An existing DB without the column will reject inserts
(init_db uses CREATE TABLE IF NOT EXISTS and will not alter an existing
table), so the pre-change DB must be moved aside before a run. This coincides
with the version-bump re-baseline already required by the RC contract changes.
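
The failure mode in this note can be reproduced in isolation with `sqlite3`. The sketch assumes a two-column toy version of `run_results` rather than the project's real `SCHEMA`; `OLD_SCHEMA`, `NEW_SCHEMA`, and `insert_into_stale_db` are names made up for the example.

```python
import sqlite3

# Toy schemas (assumptions): the "old" table lacks raw_response, the "new" one has it.
OLD_SCHEMA = "CREATE TABLE IF NOT EXISTS run_results (id INTEGER PRIMARY KEY, error TEXT)"
NEW_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS run_results "
    "(id INTEGER PRIMARY KEY, error TEXT, raw_response TEXT)"
)


def insert_into_stale_db() -> str:
    """Return the outcome of inserting the new column into a pre-change table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(OLD_SCHEMA)  # simulates a DB created before the change
    conn.execute(NEW_SCHEMA)  # no-op: the table already exists, so it is not altered
    try:
        conn.execute(
            "INSERT INTO run_results (error, raw_response) VALUES (?, ?)",
            ("malformed JSON", '{"nodes": ['),
        )
    except sqlite3.OperationalError as exc:
        return str(exc)  # SQLite reports the missing column
    finally:
        conn.close()
    return "insert succeeded"


message = insert_into_stale_db()
```

Because `CREATE TABLE IF NOT EXISTS` leaves the existing table untouched, the insert raises `sqlite3.OperationalError` naming the missing `raw_response` column, which is why the old DB has to be moved aside first.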

Summary by CodeRabbit

  • New Features

    • The system now captures and stores unprocessed AI provider responses for enhanced post-failure analysis and debugging. This allows access to raw outputs (including malformed content) when standard processing fails.
  • Tests

    • Added test coverage to validate raw response persistence across database operations and retry scenarios.

A failed cell previously kept only the error string; the model output that
caused the failure (malformed JSON, broken Mermaid) was discarded, so a
failure could not be inspected after the run. Add a nullable raw_response
column to run_results and sub_results, populated by every provider and
retained across retries even when the cleaned output is None. Also name
empty-output provider failures accurately instead of "No attempts executed".

raw_response is the model text only (no request envelope); HTTP status is
omitted because the SDKs do not expose it on the 200-success path where the
malformed-content failures occur. ~70-90MB for the full matrix DB.

Colinho22 added this to the 🧪 Experimental Artifact milestone Jun 20, 2026
Colinho22 self-assigned this Jun 20, 2026
Colinho22 added the enhancement (New feature or request) label Jun 20, 2026

coderabbitai Bot commented Jun 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e9df8bf-ef21-4de7-9b59-db60a3ef878a

📥 Commits

Reviewing files that changed from the base of the PR and between 557ff65 and 9d284ef.

📒 Files selected for processing (11)
  • src/maestro/db/client.py
  • src/maestro/db/queries.py
  • src/maestro/providers/anthropic.py
  • src/maestro/providers/gemini.py
  • src/maestro/providers/mistral.py
  • src/maestro/providers/openai.py
  • src/maestro/schemas.py
  • src/maestro/strategies/crew.py
  • src/maestro/strategies/langgraph.py
  • src/maestro/strategies/sop.py
  • tests/db/test_client.py

📝 Walkthrough

Adds a raw_response: str | None field to both RunResult and SubResult schemas to preserve unprocessed LLM output for post-failure analysis. All four providers set this field on the success path; all three strategy retry loops track the latest raw output via a last_raw accumulator and attach it to SubResult on both success and failure returns. The field is persisted to the corresponding SQLite tables with inline schema documentation and a new round-trip integration test.

Changes

raw_response diagnostic field end-to-end

| Layer / File(s) | Summary |
| --- | --- |
| Schema contracts: raw_response on RunResult and SubResult (src/maestro/schemas.py) | RunResult and SubResult each gain a nullable raw_response: str \| None field documenting that it holds unprocessed provider text when output parsing fails. |
| Provider success paths populate raw_response (src/maestro/providers/anthropic.py, src/maestro/providers/gemini.py, src/maestro/providers/mistral.py, src/maestro/providers/openai.py) | All four providers add raw_response=output to the success-path RunResult construction; error handling paths are unchanged. |
| Strategy retry loops track last_raw and wire SubResult.raw_response (src/maestro/strategies/crew.py, src/maestro/strategies/langgraph.py, src/maestro/strategies/sop.py) | Each strategy's step-execution method introduces a last_raw accumulator updated every attempt, propagated into SubResult.raw_response on both success and final failure. LangGraph and SOP also add the fallback error string "empty output from provider" when result.error is absent. |
| DB schema docs, query persistence, and integration test (src/maestro/db/client.py, src/maestro/db/queries.py, tests/db/test_client.py) | SCHEMA gains inline comments documenting raw_response in both tables; insert_run_result and insert_sub_result extend their INSERT column and parameter lists; a new test asserts raw_response survives a full write/read cycle while cleaned output columns remain None. |
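
The retry-loop behaviour described in the walkthrough can be sketched roughly as follows. Everything here is an assumption made for illustration: the two dataclasses are simplified stand-ins, `run_step` and `call_provider` are hypothetical names, and this version keeps the most recent non-None raw text, which may differ from how the real strategies update `last_raw`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Simplified provider result (stand-in, not the project's class)."""

    success: bool
    output: str | None = None
    error: str | None = None
    raw_response: str | None = None


@dataclass
class SubResult:
    """Simplified per-step result (stand-in, not the project's class)."""

    success: bool
    output: str | None
    error: str | None
    raw_response: str | None
    attempts: int


def run_step(call_provider: Callable[[], RunResult], max_attempts: int = 3) -> SubResult:
    """Retry a provider call, carrying the latest raw text into the SubResult."""
    last_raw: str | None = None
    last_error: str | None = None
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        result = call_provider()
        if result.raw_response is not None:
            last_raw = result.raw_response  # remember what the model actually said
        if result.success:
            return SubResult(True, result.output, None, last_raw, attempts)
        # A failure without an error string means the provider returned nothing usable.
        last_error = result.error or "empty output from provider"
    # Only reached with no error recorded when the loop never ran at all.
    return SubResult(False, None, last_error or "No attempts executed", last_raw, attempts)


calls = iter([RunResult(False, raw_response="{bad"), RunResult(False)])
sub = run_step(lambda: next(calls), max_attempts=2)
```

In this run both attempts fail without an error string, so `sub.error` is "empty output from provider" and `sub.raw_response` still holds the text from the first attempt. "No attempts executed" is reserved for the case where the loop body never runs.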

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Colinho22/maestro#34: Follows the same pattern of adding a new field (retry_count) to RunResult, extending provider complete() returns, and updating insert_run_result in queries.py — the same files and architectural path modified here for raw_response.

Poem

🐇 When the JSON arrives in a mangled heap,
No output to parse, no diagram to keep,
I tuck the raw text in a field of its own,
So even in failure, the truth can be shown.
raw_response lives on — no clue left unsown! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(db): capture raw model output + accurate failure messages' directly and concisely summarizes the main changes: adding raw_response capture to the database and improving failure message accuracy across providers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Docstring Coverage (Src Only) | ✅ Passed | All 33 public module-level entities across changed src/ files have docstrings (100% coverage), exceeding the 80% threshold. |


Colinho22 merged commit da31f8e into main Jun 20, 2026
2 checks passed
Colinho22 deleted the fix-tier3-issues branch June 20, 2026 23:02