Fix/worker crash orphan recovery#3
Merged
- Fix health endpoint paths (/api/health not /health)
- Remove local Qdrant container (use Qdrant Cloud to save 42MB)
- Remove --workers 2 from uvicorn (was causing child process crashes)
- Fix Caddyfile route ordering (auth before /api/*)
- Pin all dependency versions
- Add standalone output to Next.js config
- Add healthchecks to all services
- Add .dockerignore for lean images

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
NEXT_PUBLIC_* vars are baked into client bundle at build time, not runtime. This ensures the browser gets the correct API URL. Co-Authored-By: Claude Opus 4.6 <[email protected]>
When the worker process crashes mid-job (OOM, SIGTERM, etc.), jobs are left in "running" state indefinitely. The frontend SSE stream receives no further events and the report stays locked with no error shown. On startup, scan for any jobs still in "running" state, mark them failed, and publish job_error to their Redis channels so the frontend SSE or polling fallback surfaces the error immediately. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
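A minimal sketch of the startup scan described above, not the actual worker code. The `Job` dataclass, the in-memory `jobs` mapping, the `job:<id>` channel name, and the `publish` callable are all assumptions standing in for the real job store and Redis client; only the "running" to "failed" transition and the `job_error` event come from the commit message.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job record; the real schema is not shown in this PR."""
    id: str
    status: str
    error: Optional[str] = None


def recover_orphaned_jobs(
    jobs: Dict[str, Job],
    publish: Callable[[str, str], None],
) -> List[str]:
    """Mark every job still in "running" as failed and publish job_error.

    `publish(channel, message)` stands in for a Redis publish call.
    Returns the ids of the jobs that were recovered.
    """
    recovered: List[str] = []
    for job in jobs.values():
        if job.status != "running":
            continue  # finished, failed, or queued jobs are left alone
        job.status = "failed"
        job.error = "Worker restarted while the job was running"
        payload = json.dumps(
            {"event": "job_error", "job_id": job.id, "error": job.error}
        )
        publish(f"job:{job.id}", payload)  # channel naming is an assumption
        recovered.append(job.id)
    return recovered
```

Calling this once before the worker starts consuming new jobs means a restart after an OOM kill or SIGTERM clears any stale "running" entries before the frontend reconnects.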
Two root causes for the worker dying right after lead plan finalization:

1. Worker container had a 256MB Docker memory limit. Loading sentence-transformers (all-MiniLM-L6-v2) + PyTorch at the topic cache check requires ~500MB, causing Docker to OOM-kill the container. Raised worker limit to 1GB (t3.small has 2GB total; all other services combined use ~836MB leaving headroom).
2. find_cached_run() had no error handling: any Qdrant connection failure (unreachable server, timeout, bad credentials) would propagate as an unhandled exception and kill the job. Wrapped in try-except to degrade gracefully as a cache miss so the job continues.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
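The second fix can be sketched roughly as follows. This is an illustration of the "degrade to a cache miss" pattern, not the real find_cached_run(): the `find_cached_run_safe` wrapper and the `lookup` callable are hypothetical names, and the real function talks to Qdrant rather than taking a callable.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def find_cached_run_safe(
    lookup: Callable[[str], Optional[str]],
    topic: str,
) -> Optional[str]:
    """Return the cached run id for `topic`, or None on a miss.

    Any failure in the lookup (unreachable server, timeout, bad
    credentials) is logged and reported as a cache miss so the job
    continues instead of dying.
    """
    try:
        return lookup(topic)
    except Exception as exc:  # deliberately broad: the cache is optional
        logger.warning("Topic cache lookup failed, treating as miss: %s", exc)
        return None
```

The broad `except Exception` is a deliberate trade-off here: the cache is an optimization, so a failed lookup costs only a repeated run, whereas a propagated exception costs the whole job.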
QDRANT_URL defaults to localhost:6333 but no Qdrant container is running. Set QDRANT_FORCE_IN_MEMORY=1 on the worker so the pipeline uses an in-memory instance per job. Topic-cache cross-run deduplication is disabled but all retrieval and writing works correctly. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Tools making HTTP requests (pdf_reader, semantic_scholar, etc.) had no timeout, causing the worker to hang silently at 0% CPU when a request stalled. Fixed in two layers:

1. tools/base.py: wrap every call_with_retry attempt in asyncio.wait_for (default 60s) so any tool that hangs is cancelled and retried/failed.
2. tools/pdf_reader.py: add aiohttp.ClientTimeout(total=30) so the HTTP download itself is bounded independently of the outer timeout.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
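A rough sketch of the outer layer, assuming a retry helper shaped like the one below. The real call_with_retry signature in tools/base.py is not shown in this PR, so the `make_call` factory and the `attempts` parameter are assumptions; only the use of asyncio.wait_for and the 60s default come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def call_with_retry(
    make_call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,      # assumed retry count
    timeout: float = 60.0,  # per-attempt bound, as in the commit message
) -> T:
    """Run `make_call()` up to `attempts` times, bounding each attempt.

    asyncio.wait_for cancels the awaited call when the timeout expires,
    so a stalled request cannot hold the worker forever.
    """
    last_error: Optional[BaseException] = None
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_error = exc  # the hung attempt was cancelled; try again
    raise TimeoutError(
        f"tool call timed out after {attempts} attempts"
    ) from last_error
```

`make_call` is a factory rather than an awaitable because a coroutine object can only be awaited once; each retry needs a fresh one. The inner aiohttp.ClientTimeout(total=30) stays useful alongside this, since it fails the download with a more specific error before the outer bound is reached.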
A stuck worker (hanging HTTP request, not crashed) keeps the job in "running" state indefinitely. The frontend was silently spinning with no feedback to the user. Add client-side elapsed-time detection: if the job has been running for more than 10 minutes, show a yellow warning box with the elapsed time and a back button, so the user knows something is wrong and can retry instead of waiting forever. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The code read QDRANT_URL but the .env file sets QDRANT_LOCATION. Fall back through both names before defaulting to localhost so existing deployments using either convention work without changing their env. Remove QDRANT_FORCE_IN_MEMORY=1 from the worker — Qdrant Cloud is properly configured so in-memory mode is no longer needed. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
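The fallback could look something like the sketch below. The function name `resolve_qdrant_url`, the lookup order (QDRANT_URL first, then QDRANT_LOCATION), and the `http://` scheme on the default are assumptions; the commit message only says both names are tried before defaulting to localhost.

```python
import os
from typing import Mapping

# Assumed default; the commit message only mentions localhost:6333.
DEFAULT_QDRANT_URL = "http://localhost:6333"


def resolve_qdrant_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the Qdrant endpoint from the first env var that is set.

    Empty or whitespace-only values are treated as unset so a blank
    entry in .env does not shadow the other variable.
    """
    for name in ("QDRANT_URL", "QDRANT_LOCATION"):
        value = env.get(name, "").strip()
        if value:
            return value
    return DEFAULT_QDRANT_URL
```

Passing the environment as a mapping keeps the function testable without mutating os.environ.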
The createJobMutation had no onError handler. When the API returned 429 (rate limit exceeded), TanStack Query had no error path to execute, causing an internal crash reading .payload on undefined in core.js. Add onError to show a human-readable message below the search bar, with a specific "please wait and retry" message for 429 responses. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…r credibility filter
No description provided.