feat: beyond-parity advanced features (WARC, parallel CDX, search, content tracking, color output) #13
Merged
- Add ColorOutput class with ANSI color support (red, green, yellow, cyan)
- Auto-detect color support from tty, NO_COLOR, TERM=dumb
- Integrate colored error/warning output into CLI error handling
- Add --no-color class option to disable colored output
- Add with_env test helper to spec_helper for env var testing
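The detection rules listed above can be sketched roughly as follows. This is an illustration only, not the actual ColorOutput class: the module name `ColorSketch` and its method signatures are hypothetical, and the sketch assumes the common NO_COLOR convention (color is disabled when the variable is set to a non-empty value).

```ruby
# Hypothetical sketch; names and signatures are assumptions, not the PR's API.
module ColorSketch
  # ANSI SGR foreground codes for the four colors named in the commit.
  CODES = { red: 31, green: 32, yellow: 33, cyan: 36 }.freeze

  # Color is enabled only when NO_COLOR is unset or empty, TERM is not
  # "dumb", and the output stream is a terminal.
  def self.enabled?(io: $stdout, env: ENV)
    no_color = env["NO_COLOR"]
    return false if no_color && !no_color.empty?
    return false if env["TERM"] == "dumb"

    io.respond_to?(:tty?) && io.tty?
  end

  # Wrap text in an ANSI escape sequence, or return it unchanged when
  # color is disabled.
  def self.colorize(text, color, enabled: enabled?)
    return text unless enabled

    "\e[#{CODES.fetch(color)}m#{text}\e[0m"
  end
end
```

Passing `io:` and `env:` as keyword arguments keeps the check easy to exercise in specs without touching the real environment, which is presumably what the with_env helper is for in the actual test suite.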
- WarcWriter produces valid WARC 1.0 files with warcinfo and response records
- Supports gzip-compressed (.warc.gz) output
- WarcReader parses WARC files using Content-Length-based boundary detection
- Handles both plain and gzip-compressed WARC files
- WarcRecord value object with type predicates and serialization
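To illustrate why Content-Length-based boundary detection matters, here is a minimal sketch of writing and reading WARC 1.0 records. The helper names `warc_record` and `read_warc_records` are hypothetical and are not the WarcWriter/WarcReader API from this PR. The sketch skips gzip handling and emits only a small set of header fields (no Content-Type, for instance).

```ruby
require "securerandom"
require "stringio"
require "time"

# Serialize one record: version line, named fields, blank line,
# content block, then two CRLFs. (Hypothetical helper.)
def warc_record(type, body, extra = {})
  headers = {
    "WARC-Type" => type,
    "WARC-Record-ID" => "<urn:uuid:#{SecureRandom.uuid}>",
    "WARC-Date" => Time.now.utc.iso8601,
    "Content-Length" => body.bytesize.to_s
  }.merge(extra)
  head = headers.map { |name, value| "#{name}: #{value}\r\n" }.join
  "WARC/1.0\r\n#{head}\r\n".b + body.b + "\r\n\r\n".b
end

# Read records back as [headers, body] pairs. The block end is found by
# reading exactly Content-Length bytes, never by scanning for a delimiter.
def read_warc_records(io)
  records = []
  until io.eof?
    line = io.gets("\r\n")
    next if line.nil? || line.strip.empty?
    raise "unexpected line: #{line.inspect}" unless line.start_with?("WARC/")

    headers = {}
    while (field = io.gets("\r\n")) && field != "\r\n"
      name, value = field.chomp("\r\n").split(": ", 2)
      headers[name] = value
    end
    body = io.read(headers.fetch("Content-Length").to_i)
    io.read(4) # consume the trailing CRLF CRLF
    records << [headers, body]
  end
  records
end
```

Because the reader trusts the declared length, a response body that happens to contain `WARC/1.0` or blank lines is not mistaken for a record boundary.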
- Uses thread pool to fetch multiple CDX result pages simultaneously
- Configurable concurrency (default: 4 threads)
- Falls back to single-page query when only one page exists
- Merges results in page order
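A rough sketch of this fetch-and-merge pattern, assuming a caller-supplied block that maps a page index to an array of rows. The method name `fetch_pages_in_parallel` and the block interface are hypothetical stand-ins, since the real ParallelCdx class talks to the CDX API.

```ruby
# Hypothetical sketch: fetch pages 0...page_count with a small worker pool
# and return the rows merged in page order.
def fetch_pages_in_parallel(page_count, concurrency: 4, &fetch_page)
  # Single-page fallback: no threads needed.
  return fetch_page.call(0) if page_count <= 1

  queue = Queue.new
  page_count.times { |page| queue << page }
  queue.close # once drained, pop returns nil and workers stop

  # Each worker writes to its own slot, so completion order does not matter.
  results = Array.new(page_count)
  workers = Array.new([concurrency, page_count].min) do
    Thread.new do
      while (page = queue.pop)
        results[page] = fetch_page.call(page)
      end
    end
  end
  workers.each(&:join) # join re-raises any exception from a worker

  results.flatten(1)
end
```

Indexing results by page number is what keeps the merge deterministic even when later pages finish first.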
- ContentTracker analyzes snapshots grouped by URL and digest
- Detects changed URLs (multiple unique digests), new URLs, removed URLs
- ContentChangeReport value object with change frequency and serialization
- Splits time range into halves to detect additions and removals
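One plausible reading of the halves heuristic, sketched below: a URL counts as added if it appears only in the later half of the time range, and as removed if it appears only in the earlier half. The `SketchSnapshot` struct and `track_changes` method are hypothetical, and the real ContentTracker may define the split or the report differently.

```ruby
# Hypothetical value object; the timestamp is assumed to be a Time.
SketchSnapshot = Struct.new(:url, :timestamp, :digest)

def track_changes(snapshots)
  return { changed: [], added: [], removed: [] } if snapshots.empty?

  # A URL has changed if its snapshots carry more than one unique digest.
  changed = snapshots.group_by(&:url)
                     .select { |_url, snaps| snaps.map(&:digest).uniq.size > 1 }
                     .keys

  # Split the observed time range at its midpoint.
  times = snapshots.map(&:timestamp)
  midpoint = times.min + (times.max - times.min) / 2
  early, late = snapshots.partition { |s| s.timestamp <= midpoint }
  early_urls = early.map(&:url).uniq
  late_urls = late.map(&:url).uniq

  {
    changed: changed,
    added: late_urls - early_urls,   # seen only in the later half
    removed: early_urls - late_urls  # seen only in the earlier half
  }
end
```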
- Searches text content of archived snapshots for query strings
- Case-insensitive by default with case_sensitive option
- Returns SearchResult with surrounding context for each match
- Configurable max_results limit and date range filtering
- SearchResult value object with serialization
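The matching step could look something like the sketch below, which searches a single already-fetched text for a literal query. `SearchHit` and `search_text` are hypothetical names, the `context:` width is an assumed parameter, and date range filtering and snapshot fetching are left out.

```ruby
# Hypothetical result object: match offset plus surrounding text.
SearchHit = Struct.new(:offset, :context)

def search_text(text, query, case_sensitive: false, context: 20, max_results: 10)
  return [] if query.empty?

  # Escape the query so it is matched literally, not as a regex.
  flags = case_sensitive ? 0 : Regexp::IGNORECASE
  pattern = Regexp.new(Regexp.escape(query), flags)

  hits = []
  text.scan(pattern) do
    match = Regexp.last_match
    start = [match.begin(0) - context, 0].max
    # Slice up to `context` characters on each side of the match.
    hits << SearchHit.new(match.begin(0), text[start...(match.end(0) + context)])
    break if hits.size >= max_results
  end
  hits
end
```

For example, `search_text("The Wayback Machine archives the web.", "the", context: 4)` finds matches at offsets 0 and 29, and with `case_sensitive: true` only the second one.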
ronaldtse added a commit that referenced this pull request on May 13, 2026
Document all features added across PRs #11, #12, #13:
- Page content extraction (headings, images, forms, scripts)
- Composite snapshot, CDX caching, parallel CDX
- Save API response details and rate limiter
- Enhanced bulk download options (page-requisites, snapshot-at, all-timestamps, strategy, pattern filtering, subdomain discovery)
- Enhanced URL rewriting (JS, absolute, server extensions)
- Snapshot comparison and diff
- Coverage analysis and archive health check
- Content tracking and archive search
- WARC format support (read/write)
- Configuration, encoding detection, path sanitization
- Pattern filtering, rate limiting, color output
- All new CLI commands and options
- Updated architecture table with all classes
Summary
Implements all remaining TODO 13 "Beyond Parity" advanced features, completing the full TODO.parity roadmap.
New Features
New CLI Commands
- archaeo search URL QUERY: search archived snapshots for text
- archaeo track-changes URL: track content changes over time
- archaeo warc-export URL --output file.warc: export snapshots to WARC format
Test Coverage
Files Added
- lib/archaeo/color_output.rb
- lib/archaeo/warc_support.rb
- lib/archaeo/parallel_cdx.rb
- lib/archaeo/content_tracker.rb
- lib/archaeo/archive_search.rb