Bug: Gemini 2.5 invoice extraction times out (LLM_UNREACHABLE) because model "thinking" is enabled by default
Type: Bug
Priority: Should Have
Component: Budget invoice auto-itemize (LLM budget extraction)
Parent Epic: None. Auto-itemize was delivered as standalone stories (#1545–#1547); related epics (EPIC-15, EPIC-05) are closed.
Problem
When the budget invoice auto-itemize feature is configured with Gemini 2.5 Flash via the OpenAI-compatible endpoint https://generativelanguage.googleapis.com/v1beta/openai/, invoice extraction requests time out after LLM_REQUEST_TIMEOUT_MS (default 30000ms) and surface to the user as:
{ "error": { "code": "LLM_UNREACHABLE", "message": "LLM provider is unreachable" } }
The server log shows the underlying cause is a clean AbortError with no network innerCause — i.e., the request connects fine and the provider is reachable, but the model does not respond within the timeout. The LLM_UNREACHABLE code is misleading: the provider is reachable; the model is simply too slow.
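For illustration only, the distinction described above (timeout abort versus a real network failure) could be checked along these lines. This is a minimal sketch, not the code in openAICompatibleProvider.ts: the name classifyFetchFailure and the FailureKind labels are hypothetical. It assumes Node's built-in fetch, which rejects with an error named AbortError when an AbortController fires and with a TypeError carrying a cause on connection-level failures.

```typescript
// Hypothetical labels for this sketch; not part of the existing error mapping.
type FailureKind = "timeout" | "network" | "other";

// Classify a rejected fetch() by inspecting the error's name and cause.
// Duck-typed on purpose so it also works for DOMException instances.
function classifyFetchFailure(err: unknown): FailureKind {
  if (typeof err !== "object" || err === null) return "other";
  const { name, cause } = err as { name?: unknown; cause?: unknown };

  // AbortController.abort() yields "AbortError"; AbortSignal.timeout() yields "TimeoutError".
  if (name === "AbortError" || name === "TimeoutError") return "timeout";

  // Assumption: connection-level failures arrive as TypeError with a populated cause.
  if (name === "TypeError" && cause !== undefined) return "network";

  return "other";
}

// Example: an abort with no network cause, as seen in the server log.
const abortLike = Object.assign(new Error("This operation was aborted"), {
  name: "AbortError",
});
console.log(classifyFetchFailure(abortLike));
```

This issue does not propose changing the user-facing error mapping; the sketch only shows why a bare AbortError points at model latency rather than reachability.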
Root cause
Gemini 2.5 models (including Flash) have "thinking"/reasoning enabled by default with a dynamic thinking budget. The extraction call is non-streaming and LLM_MAX_TOKENS defaults to 16384, so the model spends significant time on chain-of-thought reasoning before emitting the structured JSON, pushing total latency past 30s on real-world invoices.
The request body is assembled in buildRequestBody() in server/src/services/budgetExtraction/providerProfiles.ts. It currently sends only model, messages, temperature: 0, max_tokens, and (per provider) response_format — no parameter to disable model thinking. The timeout/abort is then mapped to LlmUnreachableError / LLM_UNREACHABLE in server/src/services/budgetExtraction/openAICompatibleProvider.ts (the catch block around the fetch call).
Expected behavior
Invoice line-item extraction is a structured-extraction task that does not require chain-of-thought reasoning. Model "thinking"/reasoning should be disabled for the budget-extraction call, reducing latency (fixing the timeout at its source) and lowering token cost.
A real Gemini 2.5 Flash invoice extraction completes within the default LLM_REQUEST_TIMEOUT_MS (30000ms) for typical construction invoices.
Actual behavior
Gemini 2.5 Flash spends its thinking budget reasoning before producing JSON; the non-streaming call exceeds 30s, the AbortController fires, and the user sees LLM_UNREACHABLE even though the provider is reachable.
Critical nuance (must not regress other providers)
The disable mechanism must be applied only where it is safe/valid per provider:
Gemini: supports disabling thinking (the target of this fix).
OpenAI non-reasoning models (e.g., gpt-4o): reject a reasoning_effort parameter with HTTP 400. The disable parameter must NOT be sent to such models, or extraction will break with a validation error.
Anthropic: requires no change — extended thinking is opt-in / off by default.
Ollama / generic: must continue to work unchanged (no spurious parameters that an unknown provider might reject).
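The per-provider rule above could be sketched as a lookup table that only has an entry for Gemini. Everything here is illustrative: buildRequestBodySketch, THINKING_OFF, and the types are hypothetical stand-ins, not the real buildRequestBody() signature, and response_format is omitted for brevity. The value "none" for reasoning_effort is an assumption about Gemini's OpenAI-compatible layer that should be checked against Google's documentation for the target model before use.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "ollama" | "generic";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ExtractionRequestBody {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
  reasoning_effort?: string;
}

// Only providers listed here receive a thinking-disable parameter.
// Assumption: Gemini's OpenAI-compatible endpoint accepts reasoning_effort "none".
const THINKING_OFF: Partial<Record<Provider, { reasoning_effort: string }>> = {
  gemini: { reasoning_effort: "none" },
};

function buildRequestBodySketch(
  provider: Provider,
  model: string,
  messages: ChatMessage[],
  maxTokens: number,
): ExtractionRequestBody {
  return {
    model,
    messages,
    temperature: 0,
    max_tokens: maxTokens,
    // Spreads nothing for openai, anthropic, ollama, and generic,
    // so their request bodies stay exactly as they are today.
    ...(THINKING_OFF[provider] ?? {}),
  };
}
```

Keeping the mapping in a table rather than in branching logic means an unlisted provider can never pick up the parameter by accident, which is the regression the list above is guarding against.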
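Reproduction steps
Configure the server with LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/, LLM_MODEL=gemini-2.5-flash, and LLM_API_KEY=<a valid Google AI key>.
Leave LLM_REQUEST_TIMEOUT_MS and LLM_MAX_TOKENS at their defaults (30000 / 16384).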
Open a real (multi-line) construction invoice in Paperless and trigger auto-itemize on it.
Observe the request hang ~30s, then fail with {"error":{"code":"LLM_UNREACHABLE",...}}.
Inspect the server log: the cause is a bare AbortError with no network innerCause (provider was reachable; the model didn't respond in time).
Acceptance Criteria
AC1 — Given the provider is gemini, When buildRequestBody() constructs the extraction request, Then the request body includes the provider-appropriate parameter that disables model thinking/reasoning (e.g., reasoning_effort set to the provider's "off"/minimal value, or Gemini's thinking-budget-zero equivalent for the OpenAI-compat layer).
AC2 — Given the provider is openai, When buildRequestBody() constructs the extraction request, Then NO reasoning_effort (or other thinking-disable) parameter is sent that would cause a non-reasoning model (e.g., gpt-4o) to return HTTP 400. (Either omit it for OpenAI, or only emit it where OpenAI documents support.)
AC3 — Given the provider is anthropic, When buildRequestBody() constructs the extraction request, Then the request body is unchanged from current behavior (extended thinking is off by default; no thinking parameter is added).
AC4 — Given the provider is ollama or generic, When buildRequestBody() constructs the extraction request, Then no thinking-disable parameter is added that an unknown/local model could reject (behavior is unchanged unless a safe, provider-detected mechanism exists).
AC5 — Given a Gemini 2.5 Flash request body produced by buildRequestBody(), When the body is inspected in a unit test, Then it asserts the thinking-disable parameter is present and correctly valued, and the existing response_format/max_tokens/temperature fields are preserved.
AC6 — Given the existing provider-shaping unit tests, When the suite runs, Then all per-provider buildRequestBody() assertions still pass (no regression to response_format selection for openai/anthropic/gemini/ollama/generic).
AC7 — Given a real Gemini 2.5 Flash extraction with thinking disabled, When a typical multi-line construction invoice is processed at the default timeout (30000ms), Then the extraction completes successfully and returns structured line items (no LLM_UNREACHABLE).
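The body-inspection criteria (AC2 through AC5) could be backed by two small helpers that a unit test calls on the output of buildRequestBody(). This is a sketch under assumptions: the helper names, the THINKING_KEYS list, and the sample bodies are all hypothetical, and the expected Gemini value "none" would need to match whatever parameter the fix finally adopts.

```typescript
type RequestBody = Record<string, unknown>;

// Hypothetical list of keys that would indicate a thinking/reasoning parameter.
const THINKING_KEYS = ["reasoning_effort", "reasoning", "thinking", "extra_body"];

// Returns the thinking-related keys present in a body (expected empty for
// openai, anthropic, ollama, and generic).
function thinkingKeysIn(body: RequestBody): string[] {
  return THINKING_KEYS.filter((key) => key in body);
}

// Returns the fields from `required` that a body is missing (expected empty
// for every provider, so existing fields are shown to be preserved).
function missingFields(body: RequestBody, required: string[]): string[] {
  return required.filter((key) => !(key in body));
}

// Sample bodies standing in for buildRequestBody() output.
const geminiBody: RequestBody = {
  model: "gemini-2.5-flash",
  messages: [],
  temperature: 0,
  max_tokens: 16384,
  response_format: { type: "json_object" },
  reasoning_effort: "none", // assumed "off" value; verify before relying on it
};
const openaiBody: RequestBody = {
  model: "gpt-4o",
  messages: [],
  temperature: 0,
  max_tokens: 16384,
  response_format: { type: "json_object" },
};

const preserved = ["model", "messages", "temperature", "max_tokens", "response_format"];
console.log(thinkingKeysIn(geminiBody), missingFields(geminiBody, preserved));
console.log(thinkingKeysIn(openaiBody), missingFields(openaiBody, preserved));
```

In a real test the sample objects would be replaced by calls to buildRequestBody() for each provider, with the assertions made by the project's existing test framework.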
Notes
Scope is limited to request-body shaping in providerProfiles.ts (and any minimal supporting change in openAICompatibleProvider.ts if needed to thread provider info). It does NOT change the timeout default, the LLM_MAX_TOKENS default, or the user-facing error mapping.
Architect input is welcome on the exact per-provider parameter mapping for the OpenAI-compatible layer (Gemini thinking-budget-zero vs. reasoning_effort), since this is the first parameter we shape conditionally on reasoning capability. Flag if an LLM_REASONING / disable-thinking config knob is warranted, but the default for extraction should be "thinking off."
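If such a knob were added, one possible shape is a resolver that treats anything other than an explicit opt-in as "off". This is purely a sketch: LLM_REASONING is only floated in this issue, and the accepted values ("on", "default"), the ReasoningMode type, and the function name are hypothetical.

```typescript
// "off": send the provider's thinking-disable parameter where one is known.
// "provider-default": send nothing and let the provider decide.
type ReasoningMode = "off" | "provider-default";

function resolveReasoningMode(
  env: Record<string, string | undefined>,
): ReasoningMode {
  const raw = env.LLM_REASONING?.trim().toLowerCase();

  // Only an explicit opt-in re-enables provider-default thinking.
  if (raw === "on" || raw === "default") return "provider-default";

  // Unset, empty, or unrecognized values fall back to "thinking off",
  // matching the default this issue asks for.
  return "off";
}

console.log(resolveReasoningMode({}));
console.log(resolveReasoningMode({ LLM_REASONING: "on" }));
```

Failing closed to "off" keeps the extraction path fast even when the variable is misspelled or left unset.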
Security: no new outbound surface; only an additional field on an existing request body. No new env vars required for the core fix.