Disable LLM "thinking"/reasoning for budget extraction (fixes Gemini 2.5 timeout → LLM_UNREACHABLE) #1701

@steilerDev

Description

Bug: Gemini 2.5 invoice extraction times out (LLM_UNREACHABLE) because model "thinking" is enabled by default

Type: Bug
Priority: Should Have
Component: Budget invoice auto-itemize (LLM budget extraction)
Parent Epic: None. Auto-itemize was delivered as standalone stories (#1545, #1547); related epics (EPIC-15, EPIC-05) are closed.

Problem

When the budget invoice auto-itemize feature is configured with Gemini 2.5 Flash via the OpenAI-compatible endpoint https://generativelanguage.googleapis.com/v1beta/openai/, invoice extraction requests time out after LLM_REQUEST_TIMEOUT_MS (default 30000ms) and surface to the user as:

{ "error": { "code": "LLM_UNREACHABLE", "message": "LLM provider is unreachable" } }

The server log shows the underlying cause is a clean AbortError with no network innerCause — i.e., the request connects fine and the provider is reachable, but the model does not respond within the timeout. The LLM_UNREACHABLE code is misleading: the provider is reachable; the model is simply too slow.
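The distinction the log reveals can be expressed as a small diagnostic sketch. It is illustrative only and does not propose changing the user-facing error mapping. `classifyFetchFailure` and `FailureKind` are hypothetical names, and the assumption that network-level failures carry a nested `cause` (as Node's built-in `fetch` does with its "fetch failed" TypeError) should be checked against the runtime actually in use.

```typescript
// Diagnostic sketch (hypothetical helper, not part of the codebase):
// tell a client-side timeout apart from a genuine network failure.
type FailureKind = "timeout" | "network" | "other";

function classifyFetchFailure(err: unknown): FailureKind {
  const e = err as { name?: unknown; cause?: unknown } | null;
  if (typeof e !== "object" || e === null) return "other";

  // AbortController.abort() makes fetch reject with an error named "AbortError".
  // The provider may be perfectly reachable; the response was just too slow.
  if (e.name === "AbortError") return "timeout";

  // ASSUMPTION: connection-level failures (DNS, refused connection, TLS)
  // surface with a nested `cause` describing the underlying network error.
  if (e.cause !== undefined && e.cause !== null) return "network";

  return "other";
}
```

Under this sketch, the failure described in this issue classifies as "timeout", which is why the LLM_UNREACHABLE label reads as misleading.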

Root cause

Gemini 2.5 models (including Flash) have "thinking"/reasoning enabled by default with a dynamic thinking budget. The extraction call is non-streaming and LLM_MAX_TOKENS defaults to 16384, so the model spends significant time on chain-of-thought reasoning before emitting the structured JSON, pushing total latency past 30s on real-world invoices.

The request body is assembled in buildRequestBody() in server/src/services/budgetExtraction/providerProfiles.ts. It currently sends only model, messages, temperature: 0, max_tokens, and (per provider) response_format; it sends no parameter to disable model thinking. The timeout/abort is then mapped to LlmUnreachableError / LLM_UNREACHABLE in server/src/services/budgetExtraction/openAICompatibleProvider.ts (the catch block around the fetch call).

Expected behavior

  • Invoice line-item extraction is a structured-extraction task that does not require chain-of-thought reasoning. Model "thinking"/reasoning should be disabled for the budget-extraction call, reducing latency (fixing the timeout at its source) and lowering token cost.
  • A real Gemini 2.5 Flash invoice extraction completes within the default LLM_REQUEST_TIMEOUT_MS (30000ms) for typical construction invoices.

Actual behavior

  • Gemini 2.5 Flash spends its thinking budget reasoning before producing JSON; the non-streaming call exceeds 30s, the AbortController fires, and the user sees LLM_UNREACHABLE even though the provider is reachable.

Critical nuance (must not regress other providers)

The disable mechanism must be applied only where it is safe/valid per provider:

  • Gemini: supports disabling thinking (the target of this fix).
  • OpenAI non-reasoning models (e.g., gpt-4o): reject a reasoning_effort parameter with HTTP 400. The disable parameter must NOT be sent to such models, or extraction will break with a validation error.
  • Anthropic: requires no change — extended thinking is opt-in / off by default.
  • Ollama / generic: must continue to work unchanged (no spurious parameters that an unknown provider might reject).
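One way to honor the per-provider rules above is a lookup table of extra body fields, merged into the request only for providers where disabling thinking is known to be valid. This is a minimal sketch under stated assumptions: `Provider`, `THINKING_OFF_PARAMS`, and `buildRequestBodySketch` are hypothetical names, the real buildRequestBody() signature may differ, response_format handling is omitted for brevity, and the value `reasoning_effort: "none"` for Gemini's OpenAI-compatible layer is an assumption to verify against Google's current documentation.

```typescript
// Hypothetical provider identifiers; the real type in providerProfiles.ts may differ.
type Provider = "openai" | "anthropic" | "gemini" | "ollama" | "generic";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Extra request-body fields that switch thinking off, per provider.
// ASSUMPTION: "none" is the off value Gemini accepts for reasoning_effort on
// its OpenAI-compatible endpoint. Every other provider gets no extra field.
const THINKING_OFF_PARAMS: Record<Provider, Record<string, unknown>> = {
  gemini: { reasoning_effort: "none" },
  openai: {},    // gpt-4o and similar reject reasoning_effort with HTTP 400
  anthropic: {}, // extended thinking is opt-in, so nothing to disable
  ollama: {},    // unknown/local models: send nothing they might reject
  generic: {},
};

// Stand-in for buildRequestBody(); response_format selection is left out.
function buildRequestBodySketch(
  provider: Provider,
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
): Record<string, unknown> {
  return {
    model,
    messages,
    temperature: 0,
    max_tokens: maxTokens,
    ...THINKING_OFF_PARAMS[provider], // empty object for all but gemini
  };
}
```

Keeping the mapping in one table means a new provider defaults to "send nothing" unless someone explicitly adds an entry, which matches the requirement not to regress unknown providers.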

Reproduction steps

  1. Configure the LLM gateway for auto-itemize with:
    • LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
    • LLM_MODEL=gemini-2.5-flash
    • LLM_API_KEY=<a valid Google AI key>
    • Leave LLM_REQUEST_TIMEOUT_MS and LLM_MAX_TOKENS at their defaults (30000 / 16384).
  2. Open a real (multi-line) construction invoice in Paperless and trigger auto-itemize on it.
  3. Observe the request hang ~30s, then fail with {"error":{"code":"LLM_UNREACHABLE",...}}.
  4. Inspect the server log: the cause is a bare AbortError with no network innerCause (provider was reachable; the model didn't respond in time).

Acceptance Criteria

  • AC1 — Given the provider is gemini, When buildRequestBody() constructs the extraction request, Then the request body includes the provider-appropriate parameter that disables model thinking/reasoning (e.g., reasoning_effort set to the provider's "off"/minimal value, or Gemini's thinking-budget-zero equivalent for the OpenAI-compat layer).
  • AC2 — Given the provider is openai, When buildRequestBody() constructs the extraction request, Then NO reasoning_effort (or other thinking-disable) parameter is sent that would cause a non-reasoning model (e.g., gpt-4o) to return HTTP 400. (Either omit it for OpenAI, or only emit it where OpenAI documents support.)
  • AC3 — Given the provider is anthropic, When buildRequestBody() constructs the extraction request, Then the request body is unchanged from current behavior (extended thinking is off by default; no thinking parameter is added).
  • AC4 — Given the provider is ollama or generic, When buildRequestBody() constructs the extraction request, Then no thinking-disable parameter is added that an unknown/local model could reject (behavior is unchanged unless a safe, provider-detected mechanism exists).
  • AC5 — Given a Gemini 2.5 Flash request body produced by buildRequestBody(), When the body is inspected in a unit test, Then it asserts the thinking-disable parameter is present and correctly valued, and the existing response_format/max_tokens/temperature fields are preserved.
  • AC6 — Given the existing provider-shaping unit tests, When the suite runs, Then all per-provider buildRequestBody() assertions still pass (no regression to response_format selection for openai/anthropic/gemini/ollama/generic).
  • AC7 — Given a real Gemini 2.5 Flash extraction with thinking disabled, When a typical multi-line construction invoice is processed at the default timeout (30000ms), Then the extraction completes successfully and returns structured line items (no LLM_UNREACHABLE).
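AC1 through AC6 can be read as a single contract on the bodies that buildRequestBody() produces. The sketch below checks that contract against any builder function passed in. `findViolations`, `Builder`, `THINKING_KEYS`, and `standInBuilder` are hypothetical names, the list of thinking-related keys is an assumption rather than an exhaustive inventory, and the check takes the strict reading of AC4 (no thinking parameter at all for ollama/generic). AC7 requires a live Gemini call and is not covered here.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "ollama" | "generic";
type RequestBody = Record<string, unknown>;
type Builder = (provider: Provider) => RequestBody;

const ALL_PROVIDERS: Provider[] = ["openai", "anthropic", "gemini", "ollama", "generic"];
const BASE_KEYS = ["model", "messages", "temperature", "max_tokens"];
// ASSUMPTION: keys that could carry a thinking-disable setting.
const THINKING_KEYS = ["reasoning_effort", "thinking", "extra_body"];

// Returns one message per contract violation; an empty array means the builder passes.
function findViolations(build: Builder): string[] {
  const violations: string[] = [];
  for (const provider of ALL_PROVIDERS) {
    const body = build(provider);
    // AC5/AC6: the existing fields must survive the change.
    for (const key of BASE_KEYS) {
      if (!(key in body)) violations.push(`${provider}: missing ${key}`);
    }
    const present = THINKING_KEYS.filter((key) => key in body);
    // AC1: gemini must carry a thinking-disable parameter.
    if (provider === "gemini" && present.length === 0) {
      violations.push("gemini: no thinking-disable parameter");
    }
    // AC2-AC4: every other provider must carry none.
    if (provider !== "gemini" && present.length > 0) {
      violations.push(`${provider}: unexpected ${present.join(", ")}`);
    }
  }
  return violations;
}

// Stand-in builder used only to exercise the check.
const standInBuilder: Builder = (provider) => ({
  model: "example-model",
  messages: [],
  temperature: 0,
  max_tokens: 16384,
  ...(provider === "gemini" ? { reasoning_effort: "none" } : {}),
});

console.log(findViolations(standInBuilder));
```

In the real suite, the builder argument would wrap the actual buildRequestBody() call, and the returned list would be asserted empty by the project's test framework.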

Notes

  • Scope is limited to request-body shaping in providerProfiles.ts (and any minimal supporting change in openAICompatibleProvider.ts if needed to thread provider info). It does NOT change the timeout default, the LLM_MAX_TOKENS default, or the user-facing error mapping.
  • Architect input is welcome on the exact per-provider parameter mapping for the OpenAI-compatible layer (Gemini thinking-budget-zero vs. reasoning_effort), since this is the first parameter we shape conditionally on reasoning capability. Flag if an LLM_REASONING / disable-thinking config knob is warranted, but the default for extraction should be "thinking off."
  • Security: no new outbound surface; only an additional field on an existing request body. No new env vars required for the core fix.

Metadata

    Status
    Done
