feat(crawler): advanced session analytics with historical metrics (closes #15) #39

Open
mudassaralichouhan wants to merge 1 commit into hatefsystems:master from mudassaralichouhan:feature/session-analytics

@mudassaralichouhan

Implements advanced session analytics for the crawler, as requested in #15.
Every crawl session — success or failure — now produces a persisted metrics
record. New API endpoints expose historical data, session comparisons, and
time-bucketed trend reports.

Closes #15

Acceptance criteria (from the issue)

  • Historical session metrics are stored and accessible
  • Support comparison of multiple sessions (success/failure rates, retries)
  • Generate trend reports over time
  • Metrics are available via API endpoints
  • Unit and integration tests for analytics correctness

Architecture

New module: search_engine/crawler

| File | Role |
| --- | --- |
| SessionMetricsRecord.h | Plain value type: counts, success/failure/retry rates, total bytes, status-code histogram, failure-type histogram, latency percentiles (avg/p50/p95/p99/max), duration, throughput, seed url/domain |
| SessionAnalytics.h | Pure functions: buildFromResults, summarize, compare, trends (time-bucketed). No I/O, no threads → trivially unit-testable |
| SessionAnalyticsStore.h | ISessionAnalyticsStore interface + thread-safe InMemorySessionAnalyticsStore (capacity-bounded, FIFO eviction, idempotent put) |
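
To illustrate the store semantics listed above (capacity-bounded, FIFO eviction, idempotent put, thread-safe), here is a minimal sketch. It is not the code in this PR: `SketchRecord` and `SketchStore` are hypothetical names, the record is trimmed to a few fields, and "idempotent put" is assumed to mean that re-putting an existing session ID overwrites the record without growing the store or changing eviction order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical, trimmed-down record; the real SessionMetricsRecord has more fields.
struct SketchRecord {
    std::string sessionId;
    std::int64_t startedAtMs = 0;
    std::size_t successCount = 0;
    std::size_t failureCount = 0;
};

// Hypothetical store: bounded capacity, oldest-inserted record evicted first.
class SketchStore {
public:
    explicit SketchStore(std::size_t capacity) : capacity_(capacity) {}

    void put(const SketchRecord& rec) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = records_.find(rec.sessionId);
        if (it != records_.end()) {
            it->second = rec;  // same ID: overwrite in place, keep eviction order
            return;
        }
        if (capacity_ == 0) return;  // a zero-capacity store keeps nothing
        if (order_.size() >= capacity_) {
            records_.erase(order_.front());  // FIFO: drop the oldest insertion
            order_.pop_front();
        }
        order_.push_back(rec.sessionId);
        records_.emplace(rec.sessionId, rec);
        assert(order_.size() == records_.size());
    }

    std::optional<SketchRecord> get(const std::string& id) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = records_.find(id);
        if (it == records_.end()) return std::nullopt;
        return it->second;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return records_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;                              // guards both containers
    std::deque<std::string> order_;                         // insertion order for eviction
    std::unordered_map<std::string, SketchRecord> records_; // lookup by session ID
};
```

A single mutex around a deque plus a hash map keeps `put` and `get` at O(1) on average, which should be ample for a store capped at around 10k records.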

CrawlerManager hookup

  • CrawlerManager owns an InMemorySessionAnalyticsStore (capacity 10k),
    exposed via getAnalyticsStore().
  • CrawlSession carries seedUrl, seedDomain, startedAt.
  • recordSessionAnalytics() runs at session completion, builds a
    SessionMetricsRecord from the final CrawlResult vector, and stores it.
    Errors are swallowed — telemetry never breaks a crawl.
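
The completion hook can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `SketchResult`, `SketchMetrics`, `buildMetrics`, and `recordQuietly` are hypothetical stand-ins for CrawlResult, SessionMetricsRecord, buildFromResults, and recordSessionAnalytics(), and the retry rate is assumed to be the fraction of results that were retried at least once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical per-URL result; the real CrawlResult carries far more data.
struct SketchResult {
    bool success;
    int retries;
    std::size_t bytes;
};

// Hypothetical subset of the metrics record.
struct SketchMetrics {
    std::size_t total = 0, succeeded = 0, failed = 0, retried = 0, totalBytes = 0;
    double successRate = 0.0, failureRate = 0.0, retryRate = 0.0;
};

// Pure function: no I/O, no threads, so it can be unit-tested in isolation.
SketchMetrics buildMetrics(const std::vector<SketchResult>& results) {
    SketchMetrics m;
    m.total = results.size();
    for (const auto& r : results) {
        if (r.success) ++m.succeeded; else ++m.failed;
        if (r.retries > 0) ++m.retried;
        m.totalBytes += r.bytes;
    }
    assert(m.succeeded + m.failed == m.total);
    if (m.total > 0) {  // empty input keeps all rates at 0 instead of dividing by zero
        const double n = static_cast<double>(m.total);
        m.successRate = static_cast<double>(m.succeeded) / n;
        m.failureRate = static_cast<double>(m.failed) / n;
        m.retryRate = static_cast<double>(m.retried) / n;
    }
    return m;
}

// Builds the metrics and hands them to a sink (for example, a store's put).
// Any exception is swallowed so that telemetry cannot break the crawl itself.
bool recordQuietly(const std::vector<SketchResult>& results,
                   const std::function<void(const SketchMetrics&)>& sink) noexcept {
    try {
        sink(buildMetrics(results));
        return true;
    } catch (...) {
        return false;  // a real hook might log here; the sketch only reports failure
    }
}
```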

API endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/analytics/sessions[?limit=N] | List records + aggregate summary |
| GET | /api/analytics/sessions/detail?sessionId=... | One session's metrics (404 if absent) |
| GET | /api/analytics/sessions/compare?ids=a,b,c | Aggregate + pairwise deltas vs baseline |
| GET | /api/analytics/sessions/trends?windowMs=86400000&bucketMs=3600000 | Time-bucketed summaries (default: last 24h, hourly) |
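
For readers unfamiliar with the windowMs/bucketMs parameters of the trends endpoint, the bucketing can be sketched as below. This is an illustration only: `SketchPoint`, `SketchBucket`, and `bucketTrends` are hypothetical names, the window is assumed to be the half-open interval [now − windowMs, now), and empty buckets are assumed to be returned rather than omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-session input: start time plus two counters.
struct SketchPoint {
    std::int64_t startedAtMs;
    std::size_t succeeded;
    std::size_t failed;
};

// Hypothetical per-bucket summary.
struct SketchBucket {
    std::int64_t startMs = 0;
    std::size_t sessions = 0, succeeded = 0, failed = 0;
};

std::vector<SketchBucket> bucketTrends(const std::vector<SketchPoint>& points,
                                       std::int64_t nowMs,
                                       std::int64_t windowMs,
                                       std::int64_t bucketMs) {
    // Guard: a zero or negative bucket width would divide by zero below.
    if (windowMs <= 0 || bucketMs <= 0) return {};

    const std::int64_t fromMs = nowMs - windowMs;
    // Round up so a trailing partial bucket is still represented.
    const auto count = static_cast<std::size_t>((windowMs + bucketMs - 1) / bucketMs);

    std::vector<SketchBucket> buckets(count);
    for (std::size_t i = 0; i < count; ++i) {
        buckets[i].startMs = fromMs + static_cast<std::int64_t>(i) * bucketMs;
    }

    for (const auto& p : points) {
        if (p.startedAtMs < fromMs || p.startedAtMs >= nowMs) continue;  // outside window
        const auto idx = static_cast<std::size_t>((p.startedAtMs - fromMs) / bucketMs);
        assert(idx < buckets.size());
        buckets[idx].sessions += 1;
        buckets[idx].succeeded += p.succeeded;
        buckets[idx].failed += p.failed;
    }
    return buckets;
}
```

With the endpoint defaults (windowMs=86400000, bucketMs=3600000) this layout yields 24 hourly buckets.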

Example

# List recent sessions with a roll-up summary
curl "http://localhost:3000/api/analytics/sessions?limit=20"

# Compare three sessions against the first
curl "http://localhost:3000/api/analytics/sessions/compare?ids=crawl_1,crawl_2,crawl_3"

# Hourly trend over the last 24 hours
curl "http://localhost:3000/api/analytics/sessions/trends?windowMs=86400000&bucketMs=3600000"
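
The compare call above treats the first ID as the baseline. A minimal sketch of the delta computation, assuming each delta is "other minus baseline" (the B−A convention the tests mention); `SketchRates`, `SketchDelta`, and `compareToBaseline` are hypothetical names and only three rates are compared here:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-session rates, each in [0, 1].
struct SketchRates {
    std::string sessionId;
    double successRate;
    double failureRate;
    double retryRate;
};

// Hypothetical delta of one session against the baseline (other - baseline).
struct SketchDelta {
    std::string baselineId;
    std::string otherId;
    double successRateDelta;
    double failureRateDelta;
    double retryRateDelta;
};

std::vector<SketchDelta> compareToBaseline(const std::vector<SketchRates>& sessions) {
    std::vector<SketchDelta> deltas;
    if (sessions.size() < 2) return deltas;  // nothing to compare against

    const SketchRates& a = sessions.front();  // baseline = first session given
    for (std::size_t i = 1; i < sessions.size(); ++i) {
        const SketchRates& b = sessions[i];
        deltas.push_back({a.sessionId, b.sessionId,
                          b.successRate - a.successRate,
                          b.failureRate - a.failureRate,
                          b.retryRate - a.retryRate});
    }
    assert(deltas.size() + 1 == sessions.size());
    return deltas;
}
```

Under this convention, a positive successRateDelta means the later session did better than the baseline.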

Tests

tests/crawler/session_analytics_tests.cpp — 13 cases:

  • buildFromResults counts/rates correctness
  • latency percentile computation (p50/p95/max)
  • empty-results edge case
  • summarize roll-up across records + empty input
  • compare B−A delta semantics
  • trends bucketing into hourly slices with window filtering
  • trends zero-width-bucket guard
  • InMemoryStore put/get/getAll, idempotent put, FIFO eviction,
    getInWindow filtering, clear

Run: ./build/tests/crawler/crawler_tests "[SessionAnalytics]"
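
For context on what the latency percentile case exercises, here is one common way to compute percentiles. The PR description does not state which method buildFromResults uses, so the nearest-rank definition below is an assumption, and `percentileNearestRank` is a hypothetical helper rather than a function from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Returns 0 for empty input.
double percentileNearestRank(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());  // sorts the local copy only
    if (p <= 0.0) return samples.front();
    if (p >= 100.0) return samples.back();

    const double n = static_cast<double>(samples.size());
    // Multiply before dividing so whole-number percentiles stay exact.
    const auto rank = static_cast<std::size_t>(std::ceil(p * n / 100.0));
    assert(rank >= 1 && rank <= samples.size());
    return samples[rank - 1];  // ranks are 1-based
}
```

Nearest-rank always returns an observed sample, so a test can assert exact equality without a floating-point tolerance.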

Notes

  • The analytics store is in-memory by design for this PR. The
    ISessionAnalyticsStore interface lets a MongoDB-backed implementation
    drop in later without touching callers.
  • SessionAnalytics helpers are storage-agnostic — unit-testable without
    MongoDB or any heavy dependency.
  • Backward compatible: no change to existing crawl behavior; analytics are
    recorded as a passive side effect at completion.

- Added a new analytics system to track and record metrics for crawl sessions, including success rates, latency, and error counts.
- Introduced `SessionMetricsRecord` to encapsulate session data and `ISessionAnalyticsStore` for storing metrics.
- Enhanced `CrawlerManager` to capture seed URL and domain, and record analytics upon session completion.
- Implemented new API endpoints in `SearchController` for retrieving session analytics, including session details, comparisons, and trends.
- Added unit tests for the analytics functionality to ensure accuracy and reliability.

Together, these changes make per-session success rates, latency, and failure data queryable after a crawl completes, so regressions can be spotted by comparing sessions or reading the trend reports.

Development

Successfully merging this pull request may close these issues.

Implement Advanced Session Analytics with Detailed Performance Metrics
