feat: beyond-parity advanced features (WARC, parallel CDX, search, content tracking, color output) by ronaldtse · Pull Request #13 · riboseinc/archaeo

ronaldtse · 2026-05-13T07:41:05Z

Summary

Implements all remaining TODO 13 "Beyond Parity" advanced features, completing the full TODO.parity roadmap.

New Features

WARC Support: WarcWriter and WarcReader for WARC 1.0 format interoperability (.warc, .warc.gz)
Parallel CDX Fetching: ParallelCdx uses a thread pool for concurrent CDX page queries
Content Tracker: ContentTracker detects changed/new/removed URLs over time ranges
Archive Search: full-text search across archived snapshots
Color CLI Output: ColorOutput with ANSI colors; respects NO_COLOR and TERM=dumb

New CLI Commands

archaeo search URL QUERY — search archived snapshots for text
archaeo track-changes URL — track content changes over time
archaeo warc-export URL --output file.warc — export snapshots to WARC format

Test Coverage

541 examples (91 new), 0 failures
Full rubocop compliance

Files Added

lib/archaeo/color_output.rb
lib/archaeo/warc_support.rb
lib/archaeo/parallel_cdx.rb
lib/archaeo/content_tracker.rb
lib/archaeo/archive_search.rb
Corresponding spec files for each

- Add ColorOutput class with ANSI color support (red, green, yellow, cyan)
- Auto-detects color support from tty, NO_COLOR, TERM=dumb
- Integrate colored error/warning output into CLI error handling
- Add --no-color class option to disable colored output
- Add with_env test helper to spec_helper for env var testing
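The detection rules listed above could be sketched roughly as follows. This is an illustrative stand-in, not the ColorOutput class from this PR: the `ColorSketch` module, its method names, and the rule that an empty NO_COLOR value does not disable color are all assumptions.

```ruby
# Hypothetical sketch of ANSI color detection; not the archaeo ColorOutput API.
module ColorSketch
  # Standard ANSI SGR foreground codes for the four colors named in the PR.
  CODES = { red: 31, green: 32, yellow: 33, cyan: 36 }.freeze

  # Color is enabled only when NO_COLOR is unset or empty (assumption),
  # TERM is not "dumb", and the output stream is a TTY.
  def self.enabled?(io: $stdout, env: ENV)
    return false unless env['NO_COLOR'].to_s.empty?
    return false if env['TERM'] == 'dumb'

    io.respond_to?(:tty?) && io.tty?
  end

  # Wraps text in the escape sequence for the color, or returns it unchanged.
  def self.paint(text, color, enabled: enabled?)
    return text.to_s unless enabled

    "\e[#{CODES.fetch(color)}m#{text}\e[0m"
  end
end
```

Passing `io` and `env` as arguments keeps the check testable without mutating the real environment, which is presumably the same need the with_env spec helper addresses.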

- WarcWriter produces valid WARC 1.0 files with warcinfo and response records
- Supports gzip-compressed (.warc.gz) output
- WarcReader parses WARC files using Content-Length-based boundary detection
- Handles both plain and gzip-compressed WARC files
- WarcRecord value object with type predicates and serialization
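To illustrate why Content-Length-based boundary detection matters, here is a minimal sketch of writing and reading WARC 1.0 records. The function names `warc_record` and `read_warc_records` are hypothetical and are not the WarcWriter/WarcReader API from this PR; the header set is simplified, and gzip handling is left out.

```ruby
require 'securerandom'
require 'stringio'

# Hypothetical sketch: serializes one WARC 1.0 record as a binary string.
# Layout: version line, named headers, blank line, block, then two CRLFs.
def warc_record(type:, body:, target_uri: nil, date: Time.now)
  headers = {
    'WARC-Type' => type,
    'WARC-Record-ID' => "<urn:uuid:#{SecureRandom.uuid}>",
    'WARC-Date' => date.getutc.strftime('%Y-%m-%dT%H:%M:%SZ'),
    'Content-Length' => body.bytesize.to_s
  }
  headers['WARC-Target-URI'] = target_uri if target_uri
  head = headers.map { |name, value| "#{name}: #{value}\r\n" }.join
  "WARC/1.0\r\n#{head}\r\n".b + body.b + "\r\n\r\n".b
end

# Hypothetical sketch: reads records from an IO, using Content-Length to find
# the end of each block instead of scanning for the next "WARC/1.0" line.
def read_warc_records(io)
  records = []
  until io.eof?
    version = io.gets("\r\n")
    next if version.strip.empty? # skip the CRLF separators between records

    headers = {}
    while (line = io.gets("\r\n")) && line != "\r\n"
      name, value = line.chomp("\r\n").split(': ', 2)
      headers[name] = value
    end
    body = io.read(Integer(headers.fetch('Content-Length')))
    records << { version: version.strip, headers: headers, body: body }
  end
  records
end
```

Because the reader trusts Content-Length, a payload that happens to contain the text `WARC/1.0` does not split a record in two.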

- Uses thread pool to fetch multiple CDX result pages simultaneously
- Configurable concurrency (default: 4 threads)
- Falls back to single-page query when only one page exists
- Merges results in page order
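The approach described above might look something like the sketch below. It is not the ParallelCdx implementation: `fetch_pages_in_parallel` is a hypothetical name, the block stands in for whatever performs the real CDX page request, and zero-based page numbering is an assumption.

```ruby
# Hypothetical sketch of a fixed-size thread pool that fetches pages
# concurrently and merges them in page order. The block receives a page
# number and returns that page's rows.
def fetch_pages_in_parallel(page_count, concurrency: 4, &fetch_page)
  # Single-page fallback: no threads needed.
  return fetch_page.call(0) if page_count <= 1

  queue = Queue.new
  page_count.times { |page| queue << page }
  queue.close # pop returns nil once the closed queue is drained

  # Each slot is written by exactly one worker, so results stay in page order
  # regardless of which request finishes first.
  results = Array.new(page_count)
  workers = Array.new([concurrency, page_count].min) do
    Thread.new do
      while (page = queue.pop)
        results[page] = fetch_page.call(page)
      end
    end
  end
  workers.each(&:join) # join re-raises any exception from a worker

  results.flatten(1)
end
```

Indexing results by page number, rather than appending as requests complete, is what keeps the merged output deterministic.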

- ContentTracker analyzes snapshots grouped by URL and digest
- Detects changed URLs (multiple unique digests), new URLs, removed URLs
- ContentChangeReport value object with change frequency and serialization
- Splits time range into halves to detect additions and removals
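As a rough illustration of the digest and half-range logic described above, consider the following sketch. The `Snap` struct and `track_changes` function are hypothetical and do not reflect the ContentTracker or ContentChangeReport interfaces; in particular, treating the midpoint of the earliest and latest timestamps as the split point is an assumption.

```ruby
# Hypothetical snapshot shape: URL, capture time (Time), and content digest.
Snap = Struct.new(:url, :timestamp, :digest)

# Hypothetical sketch: classifies URLs as changed, added, or removed.
def track_changes(snapshots)
  return { changed: [], added: [], removed: [] } if snapshots.empty?

  # A URL counts as changed when its snapshots carry more than one digest.
  changed = snapshots.group_by(&:url)
                     .select { |_url, snaps| snaps.map(&:digest).uniq.size > 1 }
                     .keys

  # Split the covered time range at its midpoint (assumed split rule).
  times = snapshots.map(&:timestamp)
  midpoint = times.min + (times.max - times.min) / 2
  first_half, second_half = snapshots.partition { |s| s.timestamp < midpoint }
  early = first_half.map(&:url).uniq
  late = second_half.map(&:url).uniq

  # Added: seen only in the later half. Removed: seen only in the earlier half.
  { changed: changed, added: late - early, removed: early - late }
end
```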

- Searches text content of archived snapshots for query strings
- Case-insensitive by default with case_sensitive option
- Returns SearchResult with surrounding context for each match
- Configurable max_results limit and date range filtering
- SearchResult value object with serialization
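A minimal sketch of the matching behavior described above, assuming the snapshot text has already been downloaded into a hash of URL to text. `Match`, `search_text`, and the `context_chars` parameter are hypothetical names for illustration, not the SearchResult API from this PR, and date range filtering is omitted.

```ruby
# Hypothetical result shape: where the match was found and its nearby text.
Match = Struct.new(:url, :offset, :context)

# Hypothetical sketch: literal substring search with surrounding context.
def search_text(documents, query, case_sensitive: false, max_results: 10, context_chars: 20)
  return [] if query.empty?

  # Escape the query so it is matched literally, not as a regular expression.
  pattern = Regexp.new(Regexp.escape(query), case_sensitive ? 0 : Regexp::IGNORECASE)
  results = []
  documents.each do |url, text|
    pos = 0
    while (m = pattern.match(text, pos))
      start = [m.begin(0) - context_chars, 0].max
      finish = m.end(0) + context_chars
      results << Match.new(url, m.begin(0), text[start...finish])
      return results if results.size >= max_results

      pos = m.end(0) # continue after this match
    end
  end
  results
end
```

For example, `search_text({ 'http://example.com/' => 'Hello World' }, 'world')` would return one match at offset 6, since matching ignores case unless `case_sensitive: true` is passed.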

Document all features added across PRs #11, #12, #13:
- Page content extraction (headings, images, forms, scripts)
- Composite snapshot, CDX caching, parallel CDX
- Save API response details and rate limiter
- Enhanced bulk download options (page-requisites, snapshot-at, all-timestamps, strategy, pattern filtering, subdomain discovery)
- Enhanced URL rewriting (JS, absolute, server extensions)
- Snapshot comparison and diff
- Coverage analysis and archive health check
- Content tracking and archive search
- WARC format support (read/write)
- Configuration, encoding detection, path sanitization
- Pattern filtering, rate limiting, color output
- All new CLI commands and options
- Updated architecture table with all classes

ronaldtse added 5 commits May 13, 2026 15:39

feat: add ParallelCdx for concurrent CDX page fetching

e74f130


ronaldtse merged commit f088000 into main May 13, 2026
14 checks passed

ronaldtse mentioned this pull request May 13, 2026

docs: update README.adoc with all new features through v0.2.10 #14

Merged

ronaldtse merged 5 commits into main from feat/advanced-features-beyond-parity
