feat(crawler): session queuing and prioritization (WIP — closes #14)#37

Open
mudassaralichouhan wants to merge 2 commits into hatefsystems:master from mudassaralichouhan:feature/session-priority-queue

mudassaralichouhan commented May 24, 2026

Summary

Implements session-level queuing and prioritization for the crawler. Closes #14.

Acceptance criteria

  • Sessions carry a priority (LOW, NORMAL, HIGH)
  • Sessions are queued (priority queue) when concurrency is saturated, instead of being rejected with a 500
  • Retry queue with exponential backoff at session level
  • Resource limits handled gracefully
  • API exposes priority + queue + "queued" status
  • Unit tests for prioritization, queuing, retry timing

Architecture

SessionPriorityQueue (new, header-only)

Thread-safe priority queue of PendingSessionEntry. Ordering: (priority desc, queuedAt asc). Entries gate on readyAt for retry backoff. Storage-agnostic — unit-testable without any backend.
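The ordering and readyAt gating described above can be sketched roughly as follows. This is an illustrative model only, not the code in this PR: PendingQueueSketch, PendingEntry, Priority, and popReady are hypothetical names, and a linear scan stands in for whatever container the real SessionPriorityQueue uses.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for CrawlPriority / PendingSessionEntry.
enum class Priority { LOW = 0, NORMAL = 1, HIGH = 2 };
using Clock = std::chrono::steady_clock;

struct PendingEntry {
    std::string sessionId;
    Priority priority;
    Clock::time_point queuedAt;  // when the session was first queued
    Clock::time_point readyAt;   // earliest time it may be dispatched
};

class PendingQueueSketch {
public:
    void push(PendingEntry e) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(std::move(e));
    }

    // Removes and returns the best entry whose readyAt has passed,
    // or std::nullopt if nothing is dispatchable yet.
    std::optional<PendingEntry> popReady(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto best = entries_.end();
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            if (it->readyAt > now) continue;  // still backing off
            if (best == entries_.end() || before(*it, *best)) best = it;
        }
        if (best == entries_.end()) return std::nullopt;
        PendingEntry out = std::move(*best);
        entries_.erase(best);
        return out;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    // Ordering: priority descending, then queuedAt ascending.
    static bool before(const PendingEntry& a, const PendingEntry& b) {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.queuedAt < b.queuedAt;
    }

    mutable std::mutex mutex_;
    std::vector<PendingEntry> entries_;
};
```

Because the caller passes the current time into popReady, a test can drive the backoff gate with fixed time points instead of sleeping, which fits the storage-agnostic, backend-free testing goal stated above.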

CrawlerManager rewrite

  • startCrawl() queues instead of throwing when at capacity. Returns sessionId immediately; status reports "queued" until dispatched.
  • startCrawlInternal() (new private) does the actual crawler/thread spin-up. Used by both fresh starts and queue dispatch.
  • tryDispatchPending() runs after every session completion and on every cleanup-worker tick, so retry entries whose backoff elapsed get picked up automatically.
  • stopCrawl() cancels pending sessions too.
  • Single canonical header at include/search_engine/crawler/CrawlerManager.h — the duplicate at src/crawler/CrawlerManager.h (an ODR hazard) is gone.
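As a rough picture of the queue-instead-of-throw flow in the list above, here is a minimal single-threaded sketch. It is an assumption-laden model, not the PR's CrawlerManager: ManagerSketch, onSessionFinished, pendingCount, and the "running" and "unknown" status strings are hypothetical, the pending list is plain FIFO with priority left out, and no threads or crawlers are started.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <string>

class ManagerSketch {
public:
    explicit ManagerSketch(std::size_t maxConcurrent)
        : maxConcurrent_(maxConcurrent) {}

    // Always returns a session id; at capacity the session is parked
    // in the pending list rather than rejected.
    std::string startCrawl(const std::string& url) {
        (void)url;  // a real manager would keep the URL and config
        std::string id = "session-" + std::to_string(++nextId_);
        if (active_.size() < maxConcurrent_) {
            active_.insert(id);      // stands in for startCrawlInternal()
        } else {
            pending_.push_back(id);  // status reports "queued"
        }
        return id;
    }

    // Called when a session completes; frees a slot and dispatches.
    void onSessionFinished(const std::string& id) {
        active_.erase(id);
        tryDispatchPending();
    }

    // Cancels either a pending or an active session.
    bool stopCrawl(const std::string& id) {
        auto it = std::find(pending_.begin(), pending_.end(), id);
        if (it != pending_.end()) {
            pending_.erase(it);
            return true;
        }
        if (active_.erase(id) > 0) {
            tryDispatchPending();
            return true;
        }
        return false;
    }

    std::string status(const std::string& id) const {
        if (active_.count(id) > 0) return "running";
        if (std::find(pending_.begin(), pending_.end(), id) != pending_.end())
            return "queued";
        return "unknown";
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    // Moves pending sessions into free slots, oldest first.
    void tryDispatchPending() {
        while (active_.size() < maxConcurrent_ && !pending_.empty()) {
            active_.insert(pending_.front());
            pending_.pop_front();
        }
    }

    std::size_t maxConcurrent_;
    std::size_t nextId_ = 0;
    std::set<std::string> active_;
    std::deque<std::string> pending_;
};
```

The point of the sketch is that both a fresh start and a queue dispatch end in the same activation step, which mirrors the role startCrawlInternal() is described as playing.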

Session-level retry policy

On thread-level failure, the session is re-enqueued with CrawlPriority::RETRY (which jumps the queue), retryCount is incremented, and readyAt is set to now + base * 2^retryCount (capped). Config: maxSessionRetries, sessionRetryBaseDelay, and sessionRetryMaxDelay on CrawlConfig.
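The capped backoff formula can be illustrated with a small helper. This is a sketch under assumptions: sessionRetryDelay is a hypothetical function, the base and cap are taken to correspond to sessionRetryBaseDelay and sessionRetryMaxDelay, and the description does not say whether the first retry uses an exponent of 0 or 1, so the exponent is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Returns min(base * 2^retryCount, maxDelay) without overflowing:
// the doubling stops as soon as the next step would pass the cap.
inline milliseconds sessionRetryDelay(milliseconds base,
                                      milliseconds maxDelay,
                                      int retryCount) {
    assert(retryCount >= 0);
    assert(base.count() >= 0 && maxDelay.count() >= 0);
    milliseconds delay = std::min(base, maxDelay);
    for (int i = 0; i < retryCount; ++i) {
        if (delay > maxDelay / 2) return maxDelay;  // doubling would exceed the cap
        delay *= 2;
    }
    return delay;
}
```

With a 30000 ms base and an assumed 300000 ms cap, exponents 0 through 3 give 30000, 60000, 120000, and 240000 ms, and every exponent from 4 upward returns the cap. readyAt would then be the current time plus this delay.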

API

  • POST /api/crawl/add-site accepts "priority": "low" | "normal" | "high", plus "maxSessionRetries" and "sessionRetryBaseDelayMs".
  • GET /api/crawl/queue returns { active, maxConcurrent, pendingCount, pending[] }.
  • GET /api/crawl/status returns "queued" for pending sessions.

API examples

Submit a HIGH-priority crawl with session retries

curl -X POST http://localhost:3000/api/crawl/add-site \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "maxPages": 100,
    "priority": "high",
    "maxSessionRetries": 3,
    "sessionRetryBaseDelayMs": 30000
  }'

…tems#14)

Plumbing change for session-level queuing and prioritization (issue hatefsystems#14).

- Extract CrawlPriority enum into shared header
  include/search_engine/crawler/CrawlPriority.h so both URLFrontier
  (URL-level scheduling) and CrawlerManager (session-level scheduling)
  share a single source of truth.
- URLFrontier.h now includes the shared enum (removed local duplicate).
- CrawlSession struct carries a CrawlPriority (default NORMAL).
- CrawlerManager::startCrawl() accepts an optional CrawlPriority
  parameter (default NORMAL) — fully backward compatible with all
  existing callers (SearchController, tests, etc.).
- Session startup log now reports the priority for observability.

Refs hatefsystems#14
…tefsystems#14)

- Introduced SessionPriorityQueue for managing pending crawl sessions with priority.
- Updated CrawlerManager to handle session queuing when concurrency limits are reached.
- Enhanced CrawlConfig to support session-level retry policies.
- Modified SearchController to accept priority and retry parameters for crawl requests.
- Added new API endpoint to retrieve the current queue of pending sessions.

These changes improve the crawler's ability to manage sessions efficiently, allowing for prioritized execution and better handling of retries.
Development

Successfully merging this pull request may close these issues.

Implement Session Queuing and Prioritization for Resource-Constrained Environments
