Make image capture deterministic: stepwise scroll + explicit image wait #49
Merged
A page that embeds a widget which never lets the network go quiet (a CAPTCHA such as Cloudflare Turnstile that polls and retries indefinitely under automation, ad tech, analytics) times out the networkidle wait on every viewport and burns the full retry budget on each capture. blockHosts in reglance.json takes bare hostnames and aborts every request to them (and their subdomains) at the browser, so the page goes idle and captures stay deterministic. Blocked requests are excluded from the critical-resource retry check, since a deliberately blocked script firing requestfailed is not a load failure.
Co-Authored-By: Claude Fable 5 <[email protected]>
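For illustration, the hostname rule described in this commit (a bare hostname blocks itself and its subdomains) can be sketched as a small predicate. This is a minimal sketch and not the project's implementation: `isBlockedHost` and `shouldAbort` are hypothetical names, and `example.com` is a placeholder host.

```javascript
// Sketch only: helper names are hypothetical, not taken from the project.

// True when `hostname` equals a blocked host or is a subdomain of one.
function isBlockedHost(hostname, blockHosts) {
  const host = hostname.toLowerCase();
  return blockHosts.some((blocked) => {
    const b = blocked.toLowerCase();
    // The leading dot prevents "notexample.com" from matching "example.com".
    return host === b || host.endsWith('.' + b);
  });
}

// Decide, per request URL, whether the request should be aborted.
function shouldAbort(requestUrl, blockHosts) {
  try {
    return isBlockedHost(new URL(requestUrl).hostname, blockHosts);
  } catch {
    return false; // not a parseable absolute URL; leave the request alone
  }
}
```

A request handler in the browser automation layer would call `shouldAbort` and abort or continue accordingly. The same predicate could be reused to skip blocked requests in the retry check, though how the project wires that up is not shown here.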
autoScroll jumped straight to the bottom of the page, so lazy loaders (IntersectionObserver, native loading="lazy") never fired for anything in between. Which images made it into a capture was a timing race, producing noisy diffs. autoScroll now steps one viewport at a time so every lazy image is triggered, re-reading the height as content grows. After the network-idle settle, capture now also waits (bounded by timeouts.settle) for every visible image to load and decode, and warns per capture when any image was still loading instead of silently shipping a partial screenshot. Hidden images are excluded: they cannot paint, and a hidden native-lazy image (e.g. a desktop-only image at a mobile width) never loads by design. Also generalizes hostname examples in docs and comments.
Co-Authored-By: Claude Fable 5 <[email protected]>
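The stepwise scroll can be sketched roughly as follows. This illustrates the idea and is not the actual autoScroll: the names `nextScrollTop` and `scrollStepwise`, the `maxSteps` cap of 50, and the 100 ms pause are assumptions made for the example, and `win` is expected to be a browser `window`.

```javascript
// Sketch only: names, the step cap and the pause are assumptions.

// Next scroll offset one viewport further down, or null once at the bottom.
function nextScrollTop(currentY, viewportHeight, scrollHeight) {
  const maxY = Math.max(0, scrollHeight - viewportHeight);
  if (currentY >= maxY) return null;
  return Math.min(currentY + viewportHeight, maxY);
}

async function scrollStepwise(win, { maxSteps = 50, pauseMs = 100 } = {}) {
  const root = win.document.documentElement;
  let y = 0;
  // maxSteps bounds the loop on pages that keep growing (infinite feeds).
  for (let step = 0; step < maxSteps; step++) {
    // Re-read the height on every step: lazy content may have grown the page.
    const next = nextScrollTop(y, win.innerHeight, root.scrollHeight);
    if (next === null) break;
    y = next;
    // 'instant' overrides a site's CSS `scroll-behavior: smooth`.
    win.scrollTo({ top: y, left: 0, behavior: 'instant' });
    // Give lazy loaders a moment to react before moving on.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
  return y; // final scroll offset reached
}
```

Because each step moves by at most one viewport height, every part of the page is brought into the viewport at some point, which is what IntersectionObserver-based and native lazy loaders need in order to fire.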
Why
Captures of image-heavy pages were unreliable: every run could include a different set of images, making diffs noisy. Two compounding causes:
- autoScroll never actually scrolled through the page. Each iteration jumped straight to the bottom (window.scrollTo(0, scrollHeight)). Lazy loaders, both IntersectionObserver-based and native loading="lazy", only trigger for content near the viewport, so whether anything in between loaded was a timing race.
- The networkidle settle fails silently. It only covers requests that have already started, says nothing about decode state, and when it times out on a busy server the capture proceeds with whatever happened to arrive, with no signal that anything is missing.
What
- autoScroll steps one viewport at a time (re-reading the page height as content grows, capped for infinite feeds), using behavior: 'instant' so a site's scroll-behavior: smooth can't outpace the loop.
- After the network-idle settle, capture waits, bounded by timeouts.settle, for every visible image to load and decode (img.decode()), and prints a per-capture warning naming the slug when images were still loading at screenshot time.
- … settle bounds.
Verification
npm test: 105 passing.
🤖 Generated with Claude Code
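As a rough illustration of the bounded image wait described under What, the sketch below waits for each visible image to load and decode, gives up after a time budget, and reports how many images were still pending. It is a sketch under stated assumptions, not the project's code: `withTimeout`, `imageReady` and `waitForVisibleImages` are hypothetical names, and the visibility check is passed in as a predicate because the project's own definition of "visible" is not shown in this PR.

```javascript
// Sketch only: helper names are hypothetical, not taken from the project.

// Resolve to 'timeout' if `promise` has not settled within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Wait for one image to finish loading, then decoding.
function imageReady(img) {
  const loaded = img.complete
    ? Promise.resolve()
    : new Promise((resolve) => {
        img.addEventListener('load', resolve, { once: true });
        img.addEventListener('error', resolve, { once: true });
      });
  // decode() rejects for broken images; treat that as "done waiting".
  return loaded.then(() => img.decode()).catch(() => {});
}

// Returns how many visible images were still pending when the budget ran out.
async function waitForVisibleImages(images, isVisible, settleMs) {
  let pending = 0;
  const waits = images.filter(isVisible).map((img) => {
    pending++;
    return imageReady(img).then(() => { pending--; });
  });
  await withTimeout(Promise.all(waits), settleMs);
  return pending;
}
```

In a page context this could be called with `Array.from(document.images)` and a predicate such as `(img) => img.getClientRects().length > 0`, which is only one possible approximation of visibility. A non-zero return value is the signal a caller would turn into the per-capture warning.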