Disable LLM "thinking"/reasoning for budget extraction (fixes Gemini 2.5 timeout → LLM_UNREACHABLE) #1701

## Bug: Gemini 2.5 invoice extraction times out (LLM_UNREACHABLE) because model "thinking" is enabled by default

**Type**: Bug
**Priority**: Should Have
**Component**: Budget invoice auto-itemize (LLM budget extraction)
**Parent Epic**: None — auto-itemize was delivered as standalone stories (#1545–#1547); related epics (EPIC-15, EPIC-05) are closed.

### Problem

When the budget invoice auto-itemize feature is configured with **Gemini 2.5 Flash** via the OpenAI-compatible endpoint `https://generativelanguage.googleapis.com/v1beta/openai/`, invoice extraction requests time out after `LLM_REQUEST_TIMEOUT_MS` (default 30000ms) and surface to the user as:

```json
{ "error": { "code": "LLM_UNREACHABLE", "message": "LLM provider is unreachable" } }
```

The server log shows the underlying cause is a clean `AbortError` with **no network `innerCause`** — i.e., the request connects fine and the provider is reachable, but the model does not respond within the timeout. The `LLM_UNREACHABLE` code is misleading: the provider is reachable; the model is simply too slow.

### Root cause

Gemini 2.5 models (including Flash) have **"thinking"/reasoning enabled by default** with a dynamic thinking budget. The extraction call is **non-streaming**, so nothing is returned until the model has finished both its reasoning and the structured JSON output, and `LLM_MAX_TOKENS` defaults to **16384**. The model therefore spends significant time on chain-of-thought reasoning before emitting any JSON, pushing total latency past 30s on real-world invoices.

The request body is assembled in `buildRequestBody()` in `server/src/services/budgetExtraction/providerProfiles.ts`. It currently sends only `model`, `messages`, `temperature: 0`, `max_tokens`, and (per provider) `response_format` — **no parameter to disable model thinking**. The timeout/abort is then mapped to `LlmUnreachableError` / `LLM_UNREACHABLE` in `server/src/services/budgetExtraction/openAICompatibleProvider.ts` (the `catch` block around the `fetch` call).
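For reading the server log, the distinction described above (a bare `AbortError` versus a failure with a network cause) can be sketched as a small classifier. This is only an illustration: `FetchFailure`, `FailureKind`, and `classifyFailure` are hypothetical names, not code from `openAICompatibleProvider.ts`, and the sketch does not propose changing the user-facing error mapping.

```typescript
// Hypothetical shape of a caught fetch failure; field names are assumptions.
interface FetchFailure {
  name: string;
  cause?: unknown;
}

type FailureKind = "timeout" | "network" | "other";

function classifyFailure(err: FetchFailure): FailureKind {
  // Aborting a fetch via AbortController rejects with an error named "AbortError".
  if (err.name === "AbortError") return "timeout";
  // Node's fetch reports connection-level problems as a TypeError carrying a `cause`.
  if (err.name === "TypeError" && err.cause !== undefined) return "network";
  return "other";
}

// The failure in this issue: provider reachable, model too slow.
const observed = classifyFailure({ name: "AbortError" });
console.log(observed);
```

Under these assumptions, the failure reported here falls in the "timeout" bucket, which is why `LLM_UNREACHABLE` reads as misleading.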

### Expected behavior

- Invoice line-item extraction is a **structured-extraction task** that does not require chain-of-thought reasoning. Model "thinking"/reasoning should be **disabled** for the budget-extraction call, reducing latency (fixing the timeout at its source) and lowering token cost.
- A real Gemini 2.5 Flash invoice extraction completes within the default `LLM_REQUEST_TIMEOUT_MS` (30000ms) for typical construction invoices.

### Actual behavior

- Gemini 2.5 Flash spends its thinking budget reasoning before producing JSON; the non-streaming call exceeds 30s, the `AbortController` fires, and the user sees `LLM_UNREACHABLE` even though the provider is reachable.

### Critical nuance (must not regress other providers)

The disable mechanism must be applied **only where it is safe/valid per provider**:

- **Gemini**: supports disabling thinking (the target of this fix).
- **OpenAI non-reasoning models** (e.g., `gpt-4o`): **reject** a `reasoning_effort` parameter with **HTTP 400**. The disable parameter must NOT be sent to such models, or extraction will break with a validation error.
- **Anthropic**: requires **no change** — extended thinking is opt-in / off by default.
- **Ollama / generic**: must continue to work unchanged (no spurious parameters that an unknown provider might reject).
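The per-provider rule above could be expressed as a small lookup that returns extra request-body fields only where they are known to be safe. This is a minimal sketch, not the actual `providerProfiles.ts` code: the `Provider` union and `thinkingDisableParams` are hypothetical, and `reasoning_effort: "none"` is an assumption about Gemini's OpenAI-compatible layer that should be confirmed against Google's documentation.

```typescript
// Hypothetical provider identifiers; the real union in providerProfiles.ts may differ.
type Provider = "openai" | "anthropic" | "gemini" | "ollama" | "generic";

// Extra request-body fields that turn model thinking off, per provider.
function thinkingDisableParams(provider: Provider): Record<string, unknown> {
  switch (provider) {
    case "gemini":
      // Assumption: the OpenAI-compatible endpoint accepts "none" for 2.5 Flash.
      return { reasoning_effort: "none" };
    default:
      // openai, anthropic, ollama, generic: add nothing, so no model can
      // reject the request because of an unknown or unsupported parameter.
      return {};
  }
}

// Merging into an existing body leaves the current fields untouched.
const baseBody = { model: "gemini-2.5-flash", temperature: 0, max_tokens: 16384 };
const geminiBody = { ...baseBody, ...thinkingDisableParams("gemini") };
console.log(JSON.stringify(geminiBody));
```

Defaulting to an empty object keeps the change opt-in per provider, which matches the requirement that unknown providers see no new parameters.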

### Reproduction steps

1. Configure the LLM gateway for auto-itemize with:
   - `LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/`
   - `LLM_MODEL=gemini-2.5-flash`
   - `LLM_API_KEY=<a valid Google AI key>`
   - Leave `LLM_REQUEST_TIMEOUT_MS` and `LLM_MAX_TOKENS` at their defaults (30000 / 16384).
2. Open a real (multi-line) construction invoice in Paperless and trigger auto-itemize on it.
3. Observe the request hang ~30s, then fail with `{"error":{"code":"LLM_UNREACHABLE",...}}`.
4. Inspect the server log: the cause is a bare `AbortError` with no network `innerCause` (provider was reachable; the model didn't respond in time).

### Acceptance Criteria

- [ ] **AC1** — Given the provider is `gemini`, When `buildRequestBody()` constructs the extraction request, Then the request body includes the provider-appropriate parameter that disables model thinking/reasoning (e.g., `reasoning_effort` set to the provider's "off"/minimal value, or Gemini's thinking-budget-zero equivalent for the OpenAI-compat layer).
- [ ] **AC2** — Given the provider is `openai`, When `buildRequestBody()` constructs the extraction request, Then NO `reasoning_effort` (or other thinking-disable) parameter is sent that would cause a non-reasoning model (e.g., `gpt-4o`) to return HTTP 400. (Either omit it for OpenAI, or only emit it where OpenAI documents support.)
- [ ] **AC3** — Given the provider is `anthropic`, When `buildRequestBody()` constructs the extraction request, Then the request body is unchanged from current behavior (extended thinking is off by default; no thinking parameter is added).
- [ ] **AC4** — Given the provider is `ollama` or `generic`, When `buildRequestBody()` constructs the extraction request, Then no thinking-disable parameter is added that an unknown/local model could reject (behavior is unchanged unless a safe, provider-detected mechanism exists).
- [ ] **AC5** — Given a Gemini 2.5 Flash request body produced by `buildRequestBody()`, When the body is inspected in a unit test, Then it asserts the thinking-disable parameter is present and correctly valued, and the existing `response_format`/`max_tokens`/`temperature` fields are preserved.
- [ ] **AC6** — Given the existing provider-shaping unit tests, When the suite runs, Then all per-provider `buildRequestBody()` assertions still pass (no regression to `response_format` selection for openai/anthropic/gemini/ollama/generic).
- [ ] **AC7** — Given a real Gemini 2.5 Flash extraction with thinking disabled, When a typical multi-line construction invoice is processed at the default timeout (30000ms), Then the extraction completes successfully and returns structured line items (no `LLM_UNREACHABLE`).
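AC2 through AC5 lend themselves to a table-style unit check. The sketch below uses a stand-in `buildRequestBodyStub()` because the real `buildRequestBody()` signature is not shown in this issue; the types, the stub, the `check` helper, and the `reasoning_effort: "none"` value are all assumptions for illustration. `response_format` is left out because its per-provider selection is not described here.

```typescript
// Hypothetical types and stub; the real buildRequestBody() may look different.
type Provider = "openai" | "anthropic" | "gemini" | "ollama" | "generic";

interface RequestBody {
  model: string;
  messages: { role: string; content: string }[];
  temperature: number;
  max_tokens: number;
  reasoning_effort?: string;
}

function buildRequestBodyStub(provider: Provider, model: string): RequestBody {
  const body: RequestBody = {
    model,
    messages: [{ role: "user", content: "Extract line items." }],
    temperature: 0,
    max_tokens: 16384,
  };
  // Assumed Gemini mechanism; AC1 leaves the exact field open.
  if (provider === "gemini") body.reasoning_effort = "none";
  return body;
}

// Minimal assertion helper so the sketch runs without a test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

function runChecks(): void {
  const gemini = buildRequestBodyStub("gemini", "gemini-2.5-flash");
  check(gemini.reasoning_effort === "none", "AC5: gemini disables thinking");
  check(gemini.temperature === 0 && gemini.max_tokens === 16384, "AC5: existing fields preserved");

  const others: Provider[] = ["openai", "anthropic", "ollama", "generic"];
  for (const p of others) {
    const body = buildRequestBodyStub(p, "some-model");
    check(!("reasoning_effort" in body), `AC2-AC4: ${p} gets no thinking parameter`);
  }
}

runChecks();
```

In the real suite the same assertions would target `buildRequestBody()` directly and sit alongside the existing `response_format` checks covered by AC6.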

### Notes

- Scope is limited to request-body shaping in `providerProfiles.ts` (and any minimal supporting change in `openAICompatibleProvider.ts` if needed to thread provider info). It does NOT change the timeout default, the `LLM_MAX_TOKENS` default, or the user-facing error mapping.
- Architect input is welcome on the exact per-provider parameter mapping for the OpenAI-compatible layer (Gemini thinking-budget-zero vs. `reasoning_effort`), since this is the first parameter we shape conditionally on reasoning capability. Flag if an `LLM_REASONING` / disable-thinking config knob is warranted, but the default for extraction should be "thinking off."
- Security: no new outbound surface; only an additional field on an existing request body. No new env vars required for the core fix.

