feat(crawler): session queuing and prioritization (WIP — closes #14)#37
Open
mudassaralichouhan wants to merge 2 commits into
…tems#14)

Plumbing change for session-level queuing and prioritization (issue hatefsystems#14).

- Extract CrawlPriority enum into shared header include/search_engine/crawler/CrawlPriority.h so both URLFrontier (URL-level scheduling) and CrawlerManager (session-level scheduling) share a single source of truth.
- URLFrontier.h now includes the shared enum (removed local duplicate).
- CrawlSession struct carries a CrawlPriority (default NORMAL).
- CrawlerManager::startCrawl() accepts an optional CrawlPriority parameter (default NORMAL), fully backward compatible with all existing callers (SearchController, tests, etc.).
- Session startup log now reports the priority for observability.

Refs hatefsystems#14
…tefsystems#14)

- Introduced SessionPriorityQueue for managing pending crawl sessions with priority.
- Updated CrawlerManager to handle session queuing when concurrency limits are reached.
- Enhanced CrawlConfig to support session-level retry policies.
- Modified SearchController to accept priority and retry parameters for crawl requests.
- Added new API endpoint to retrieve the current queue of pending sessions.

These changes improve the crawler's ability to manage sessions efficiently, allowing for prioritized execution and better handling of retries.
Summary
Implements session-level queuing and prioritization for the crawler. Closes #14.
Acceptance criteria
LOW, NORMAL, HIGH) … "queued" status

Architecture
SessionPriorityQueue (new, header-only)
Thread-safe priority queue of PendingSessionEntry. Ordering: (priority desc, queuedAt asc). Entries gate on readyAt for retry backoff. Storage-agnostic: unit-testable without any backend.

CrawlerManager rewrite
- startCrawl() queues instead of throwing when at capacity. Returns sessionId immediately; status reports "queued" until dispatched.
- startCrawlInternal() (new private) does the actual crawler/thread spin-up. Used by both fresh starts and queue dispatch.
- tryDispatchPending() runs after every session completion and on every cleanup-worker tick, so retry entries whose backoff elapsed get picked up automatically.
- stopCrawl() cancels pending sessions too.
- include/search_engine/crawler/CrawlerManager.h: the duplicate at src/crawler/CrawlerManager.h (an ODR hazard) is gone.

Session-level retry policy
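Retry backoff relies on the queue's readyAt gate, so a concrete picture of the queue helps here. The following is a minimal C++17 sketch of the ordering described above for SessionPriorityQueue, not the PR's actual implementation. The class name SessionPriorityQueueSketch, the sessionId field, the popReady() method, the numeric enum values, and the choice to skip over a not-yet-ready entry in favor of a lower-ranked ready one are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <mutex>
#include <optional>
#include <set>
#include <string>
#include <utility>

// Assumed numeric ranks; the PR only names LOW, NORMAL, HIGH and RETRY.
enum class CrawlPriority { LOW = 0, NORMAL = 1, HIGH = 2, RETRY = 3 };

using Clock = std::chrono::steady_clock;

// Hypothetical entry shape: only priority, queuedAt and readyAt are
// named in the PR description; sessionId is illustrative.
struct PendingSessionEntry {
    std::string sessionId;
    CrawlPriority priority = CrawlPriority::NORMAL;
    Clock::time_point queuedAt{};
    Clock::time_point readyAt{};
};

// Orders by (priority desc, queuedAt asc); sessionId breaks exact ties
// so that distinct sessions never compare as equivalent.
struct EntryOrder {
    bool operator()(const PendingSessionEntry& a,
                    const PendingSessionEntry& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        if (a.queuedAt != b.queuedAt) return a.queuedAt < b.queuedAt;
        return a.sessionId < b.sessionId;
    }
};

class SessionPriorityQueueSketch {
public:
    void push(PendingSessionEntry entry) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.insert(std::move(entry));
    }

    // Returns the best-ordered entry whose readyAt has elapsed, if any.
    // Entries still in backoff are skipped rather than blocking the queue.
    std::optional<PendingSessionEntry> popReady(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            if (it->readyAt <= now) {
                PendingSessionEntry out = *it;
                entries_.erase(it);
                return out;
            }
        }
        return std::nullopt;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;
    std::set<PendingSessionEntry, EntryOrder> entries_;
};
```

Because the sketch has no dependency on the crawler or any storage backend, it can be exercised directly in a unit test by passing explicit time points to popReady().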
On thread-level failure, the session is re-enqueued with CrawlPriority::RETRY (jumps the queue), retryCount incremented, and readyAt = now + base * 2^retryCount (capped). Config: maxSessionRetries, sessionRetryBaseDelay, sessionRetryMaxDelay on CrawlConfig.

API
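The retry settings exposed through this API drive the backoff rule stated above, readyAt = now + base * 2^retryCount with a cap. A minimal sketch of that arithmetic follows. The function names sessionRetryDelay and retryReadyAt are hypothetical, and whether the PR uses the retry count before or after the increment is not stated, so the caller is assumed to pass whichever value applies.

```cpp
#include <algorithm>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Computes min(base * 2^retryCount, maxDelay) by repeated doubling,
// stopping at the cap so the multiplication can never overflow.
inline Millis sessionRetryDelay(int retryCount, Millis base, Millis maxDelay) {
    if (base <= Millis::zero() || maxDelay <= Millis::zero()) {
        return Millis::zero();  // degenerate config: no backoff
    }
    Millis delay = std::min(base, maxDelay);
    for (int i = 0; i < retryCount; ++i) {
        if (delay > maxDelay / 2) return maxDelay;  // next doubling would exceed the cap
        delay *= 2;
    }
    return delay;
}

// Earliest time at which the re-enqueued session may be dispatched.
inline std::chrono::steady_clock::time_point retryReadyAt(
    std::chrono::steady_clock::time_point now,
    int retryCount, Millis base, Millis maxDelay) {
    return now + sessionRetryDelay(retryCount, base, maxDelay);
}
```

With a base of 1000 ms and a cap of 30000 ms, retry counts 0, 1, 2, 3 and 4 give 1000, 2000, 4000, 8000 and 16000 ms, and every later retry waits the full 30000 ms.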
- POST /api/crawl/add-site accepts "priority": "low" | "normal" | "high" and "maxSessionRetries", "sessionRetryBaseDelayMs".
- GET /api/crawl/queue returns { active, maxConcurrent, pendingCount, pending[] }.
- GET /api/crawl/status returns "queued" for pending sessions.

API examples
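On the server side, the "priority" string accepted by POST /api/crawl/add-site has to be mapped onto CrawlPriority. The following C++17 sketch shows one way that mapping might look. The function name parseRequestPriority is hypothetical, and rejecting "retry" (on the assumption that CrawlPriority::RETRY is reserved for internal re-enqueueing) as well as treating the match as case-sensitive are assumptions, not behavior confirmed by the PR.

```cpp
#include <optional>
#include <string>

// Enumerator order is assumed; the PR only names these four levels.
enum class CrawlPriority { LOW, NORMAL, HIGH, RETRY };

// Maps the request's "priority" string to a CrawlPriority.
// Returns std::nullopt for anything outside "low" | "normal" | "high",
// so the caller can fall back to NORMAL or reject the request.
inline std::optional<CrawlPriority> parseRequestPriority(const std::string& value) {
    if (value == "low") return CrawlPriority::LOW;
    if (value == "normal") return CrawlPriority::NORMAL;
    if (value == "high") return CrawlPriority::HIGH;
    return std::nullopt;  // includes "retry": assumed not client-selectable
}
```

A caller could write parseRequestPriority(body).value_or(CrawlPriority::NORMAL) to keep requests that omit or misspell the field backward compatible.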
Submit a HIGH-priority crawl with session retries