feat: beyond-parity advanced features (WARC, parallel CDX, search, content tracking, color output) #13

Merged
ronaldtse merged 5 commits into main from feat/advanced-features-beyond-parity on May 13, 2026

@ronaldtse
Contributor

Summary

Implements all remaining TODO 13 "Beyond Parity" advanced features, completing the full TODO.parity roadmap.

New Features

  1. WARC Support — WarcWriter and WarcReader for WARC 1.0 format interoperability (.warc, .warc.gz)
  2. Parallel CDX Fetching — uses thread pool for concurrent CDX page queries
  3. Content Tracker — detects changed/new/removed URLs over time ranges
  4. Archive Search — full-text search across archived snapshots
  5. Color CLI Output — ColorOutput with ANSI colors, respects NO_COLOR/TERM=dumb

New CLI Commands

  • archaeo search URL QUERY — search archived snapshots for text
  • archaeo track-changes URL — track content changes over time
  • archaeo warc-export URL --output file.warc — export snapshots to WARC format

Test Coverage

  • 541 examples (91 new), 0 failures
  • Full rubocop compliance

Files Added

  • lib/archaeo/color_output.rb
  • lib/archaeo/warc_support.rb
  • lib/archaeo/parallel_cdx.rb
  • lib/archaeo/content_tracker.rb
  • lib/archaeo/archive_search.rb
  • Corresponding spec files for each

ronaldtse added 5 commits May 13, 2026 15:39
- Add ColorOutput class with ANSI color support (red, green, yellow, cyan)
- Auto-detects color support from tty, NO_COLOR, TERM=dumb
- Integrate colored error/warning output into CLI error handling
- Add --no-color class option to disable colored output
- Add with_env test helper to spec_helper for env var testing
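
The color auto-detection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in lib/archaeo/color_output.rb: the `ColorSketch` module, its method names, and the exact precedence of the checks are assumptions. The only behavior taken from the commit message is that color depends on a tty, `NO_COLOR`, and `TERM=dumb`.

```ruby
# Hypothetical sketch; names and check order are assumptions, not Archaeo's API.
module ColorSketch
  # Standard ANSI SGR foreground codes for the four colors the commit lists.
  CODES = { red: 31, green: 32, yellow: 33, cyan: 36 }.freeze

  # Color is enabled only when the stream is a tty, NO_COLOR is unset or
  # empty, and TERM is not "dumb".
  def self.enabled?(io = $stdout, env = ENV)
    return false unless io.respond_to?(:tty?) && io.tty?
    return false unless env["NO_COLOR"].to_s.empty?
    return false if env["TERM"] == "dumb"

    true
  end

  # Wraps text in an ANSI escape sequence, or returns it unchanged when
  # color is disabled.
  def self.colorize(text, color, enabled: enabled?)
    return text unless enabled

    "\e[#{CODES.fetch(color)}m#{text}\e[0m"
  end
end
```

Passing the stream and environment as arguments keeps the check testable without touching the real `ENV`, which is presumably what the `with_env` helper addresses in the actual specs.
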
- WarcWriter produces valid WARC 1.0 files with warcinfo and response records
- Supports gzip-compressed (.warc.gz) output
- WarcReader parses WARC files using Content-Length-based boundary detection
- Handles both plain and gzip-compressed WARC files
- WarcRecord value object with type predicates and serialization
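
To illustrate what Content-Length-based boundary detection means for WARC, here is a small round-trip sketch. The helper names `write_warc_record` and `read_warc_records` are hypothetical and are not the WarcWriter/WarcReader API from this PR. The sketch also leaves out gzip handling and writes only a few header fields.

```ruby
require "stringio"
require "securerandom"
require "time"

# Hypothetical helper: writes one WARC 1.0 record (header block, blank line,
# payload, then the two CRLFs that terminate every record).
def write_warc_record(io, type:, payload:, target_uri: nil, date: Time.now.utc)
  body = payload.b
  headers = [
    "WARC/1.0",
    "WARC-Type: #{type}",
    "WARC-Record-ID: <urn:uuid:#{SecureRandom.uuid}>",
    "WARC-Date: #{date.iso8601}"
  ]
  headers << "WARC-Target-URI: #{target_uri}" if target_uri
  headers << "Content-Length: #{body.bytesize}"
  io << headers.join("\r\n") << "\r\n\r\n" << body << "\r\n\r\n"
end

# Hypothetical helper: reads records by trusting Content-Length instead of
# scanning for a delimiter, so payloads may contain blank lines themselves.
def read_warc_records(io)
  records = []
  until io.eof?
    line = io.gets("\r\n")
    next if line.strip.empty?
    raise ArgumentError, "not a WARC record: #{line.inspect}" unless line.start_with?("WARC/")

    headers = {}
    while (line = io.gets("\r\n")) && line != "\r\n"
      name, value = line.chomp("\r\n").split(":", 2)
      headers[name] = value.to_s.strip
    end
    body = io.read(Integer(headers.fetch("Content-Length")))
    io.read(4) # consume the trailing CRLF CRLF
    records << { headers: headers, body: body }
  end
  records
end

io = StringIO.new("".b)
write_warc_record(io, type: "warcinfo", payload: "software: sketch\r\n")
write_warc_record(io, type: "response", target_uri: "http://example.com/",
                      payload: "HTTP/1.1 200 OK\r\n\r\nhello")
io.rewind
records = read_warc_records(io)
```

The response payload deliberately contains its own `\r\n\r\n`, which is why a reader cannot simply split on blank lines and has to honor Content-Length.
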
- Uses thread pool to fetch multiple CDX result pages simultaneously
- Configurable concurrency (default: 4 threads)
- Falls back to single-page query when only one page exists
- Merges results in page order
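
The fetch-and-merge pattern in this commit might look something like the sketch below. `fetch_pages_in_parallel` and the block that stands in for a CDX page request are hypothetical; the real lib/archaeo/parallel_cdx.rb may be structured quite differently. The default concurrency of 4, the single-page fallback, and the page-order merge come from the commit message.

```ruby
# Hypothetical sketch: the block receives a page index and returns that
# page's rows. It stands in for a real CDX HTTP request.
def fetch_pages_in_parallel(page_count, concurrency: 4, &fetch_page)
  # Single-page fallback: no threads needed.
  return fetch_page.call(0) if page_count <= 1

  queue = Queue.new
  page_count.times { |page| queue << page }
  queue.close # pop returns nil once the closed queue is drained

  # Each page writes to its own slot, so results stay in page order no
  # matter which worker finishes first.
  results = Array.new(page_count)
  workers = Array.new([concurrency, page_count].min) do
    Thread.new do
      while (page = queue.pop)
        results[page] = fetch_page.call(page)
      end
    end
  end
  workers.each(&:join)

  results.flatten(1)
end

rows = fetch_pages_in_parallel(5, concurrency: 3) do |page|
  sleep(0.01 * (5 - page)) # later pages finish first
  ["row-#{page}a", "row-#{page}b"]
end
```

Writing into a preallocated slot per page avoids a sort step after the threads finish; an exception raised inside a worker surfaces when `join` is called.
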
- ContentTracker analyzes snapshots grouped by URL and digest
- Detects changed URLs (multiple unique digests), new URLs, removed URLs
- ContentChangeReport value object with change frequency and serialization
- Splits time range into halves to detect additions and removals
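
One plausible reading of the detection rules in this commit is sketched below. The `Snapshot` struct, the `track_changes` function, and the exact definition of "new" and "removed" (present only in the later or only in the earlier half of the range) are assumptions made for illustration; the actual ContentTracker and ContentChangeReport may define them differently.

```ruby
# Hypothetical value object; field names are assumptions.
Snapshot = Struct.new(:url, :timestamp, :digest, keyword_init: true)

def track_changes(snapshots, from:, to:)
  in_range = snapshots.select { |s| s.timestamp >= from && s.timestamp <= to }

  # Changed: the same URL was captured with more than one distinct digest.
  changed = in_range.group_by(&:url)
                    .select { |_url, snaps| snaps.map(&:digest).uniq.size > 1 }
                    .keys

  # Split the range at its midpoint and compare which URLs appear in each half.
  midpoint = from + (to - from) / 2
  first_half, second_half = in_range.partition { |s| s.timestamp < midpoint }
  early = first_half.map(&:url).uniq
  late = second_half.map(&:url).uniq

  { changed: changed, new: late - early, removed: early - late }
end

snapshots = [
  Snapshot.new(url: "http://example.com/a", timestamp: Time.utc(2020, 2, 1), digest: "X"),
  Snapshot.new(url: "http://example.com/a", timestamp: Time.utc(2020, 10, 1), digest: "Y"),
  Snapshot.new(url: "http://example.com/b", timestamp: Time.utc(2020, 3, 1), digest: "Z"),
  Snapshot.new(url: "http://example.com/c", timestamp: Time.utc(2020, 11, 1), digest: "W")
]
report = track_changes(snapshots, from: Time.utc(2020, 1, 1), to: Time.utc(2021, 1, 1))
```

Here `/a` is reported as changed (two digests), `/b` as removed (seen only in the first half), and `/c` as new (seen only in the second half).
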
- Searches text content of archived snapshots for query strings
- Case-insensitive by default with case_sensitive option
- Returns SearchResult with surrounding context for each match
- Configurable max_results limit and date range filtering
- SearchResult value object with serialization
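
The search behavior in this commit (case-insensitive by default, surrounding context, a result cap) could be approximated as below. `SearchHit`, `search_snapshots`, and the `context_chars` parameter are hypothetical names. The sketch searches an in-memory hash of timestamp to text rather than fetching snapshots, and it leaves out date range filtering.

```ruby
# Hypothetical value object; the PR's SearchResult may carry other fields.
SearchHit = Struct.new(:timestamp, :offset, :context, keyword_init: true)

def search_snapshots(texts, query, case_sensitive: false, max_results: 10, context_chars: 20)
  return [] if query.empty?

  # Escape the query so it is matched literally, not as a regular expression.
  pattern = Regexp.new(Regexp.escape(query), case_sensitive ? 0 : Regexp::IGNORECASE)
  hits = []

  texts.each do |timestamp, text|
    pos = 0
    while (match = pattern.match(text, pos))
      start = [match.begin(0) - context_chars, 0].max
      stop = match.end(0) + context_chars
      hits << SearchHit.new(timestamp: timestamp, offset: match.begin(0), context: text[start...stop])
      return hits if hits.size >= max_results

      pos = match.end(0)
    end
  end
  hits
end

texts = {
  "20200101000000" => "The Quick brown fox",
  "20210101000000" => "quick QUICK"
}
hits = search_snapshots(texts, "quick", context_chars: 4)
```

With these inputs the default case-insensitive search finds three matches, while `case_sensitive: true` finds only the lowercase one in the second snapshot.
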
@ronaldtse ronaldtse merged commit f088000 into main May 13, 2026
14 checks passed
ronaldtse added a commit that referenced this pull request May 13, 2026
Document all features added across PRs #11, #12, #13:
- Page content extraction (headings, images, forms, scripts)
- Composite snapshot, CDX caching, parallel CDX
- Save API response details and rate limiter
- Enhanced bulk download options (page-requisites, snapshot-at,
  all-timestamps, strategy, pattern filtering, subdomain discovery)
- Enhanced URL rewriting (JS, absolute, server extensions)
- Snapshot comparison and diff
- Coverage analysis and archive health check
- Content tracking and archive search
- WARC format support (read/write)
- Configuration, encoding detection, path sanitization
- Pattern filtering, rate limiting, color output
- All new CLI commands and options
- Updated architecture table with all classes