Skip to content

fix(ci): fix the monorepo#299

Merged
nikitachapovskii-dev merged 14 commits into
masterfrom
fix/revive-actor-scraper
Jun 29, 2026
Merged

fix(ci): fix the monorepo#299
nikitachapovskii-dev merged 14 commits into
masterfrom
fix/revive-actor-scraper

Conversation

@nikitachapovskii-dev

@nikitachapovskii-dev nikitachapovskii-dev commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Reviving the generic scrapers after the split + pnpm migration

After thee migration to pnpm and SDK split, CI went red and the actor builds stopped matching the platform. This PR gets everything green again and, more importantly, makes the platform builds reproducible.

CI

The e2e workflow simply never installed the browsers, on a warm pnpm store the side-effects cache skips the puppeteer/playwright postinstall, so Chrome wasn't there and every browser suite died with Could not find Chrome. Added the install step the PR workflow already had. Both workflows now run on Node 24 with timeout-minutes, so if something hangs it fails fast instead of eating six hours in silence.

Deterministic builds from the monorepo root

We're a pnpm monorepo: one root pnpm-lock.yaml and a shared @apify/scraper-tools. For reproducible images the build has to install from that lockfile with --frozen-lockfile, and it has to see the local workspace package, not the published one. So every actor builds from the repo root via a Git source (dockerContextDir: ../../../..) and runs pnpm install --frozen-lockfile + pnpm --filter <actor>... deploy. apify push can't do this, it uploads only the actor folder, without the root lockfile or the local scraper-tools.

The release workflow drops push-actor-action and instead triggers each Actor's Git-source build through scripts/trigger-apify-build.mjs (checks the version is a GIT_REPO source with the right tag, fires the build, waits, fails on anything but success). This assumes each Actor on the platform has its source pointed at this repo (branch + :packages/actor-scraper/<actor>).

Browsers come from the base image, not from us

The browser binary lives in the Apify base image, and the JS driver has to match it. Pinning our own driver to chase that match is brittle, they version separately and drift, which is exactly what kept crashing. So the browser actors just use the base image's driver: we drop it from the deploy bundle and let it resolve from the base at runtime (the app sits in /home/myuser/app). Binary and driver now come from the same image and can't get out of sync, and the bases are pinned by digest.

camoufox is the one exception worth flagging: it's stuck on 24-1.59.1. The newer 24-1.60.0 (Camoufox 150) crashes the whole process on any page that throws a JS error, that's upstream, it reproduces on the untouched base image. 24-1.59.1 is the newest one that survives.

sitemap-extractor (Crawlee/Apify v4 beta)

Added to the release matrix. It rides the v4 beta while the other six stay on v3, so @apify/scraper-tools accepts ^3 || ^4, a tiny type shim keeps its v3 types from clashing in the v4 build, and a couple of pnpm overrides pin the beta's transitive @crawlee/*. All of that comes back out once we're fully on stable v4.

Cleanup

Manifests still pointed at apify-ts/apify-sdk-js, CONTRIBUTING said npm, a couple of READMEs were stale — fixed. Everything builds on Node 24.

Follow-ups (separate)

Unpin camoufox once a newer base survives page errors. And drop the v4 bridge bits (the ^3 peer half, the shim, the overrides) when everything's on stable v4.

P.S. sorry, the description is huge, I know 😬

@nikitachapovskii-dev nikitachapovskii-dev marked this pull request as draft June 11, 2026 15:28
@nikitachapovskii-dev nikitachapovskii-dev changed the title fix(ci): unblock e2e and PR workflows fix(ci): fix the repo Jun 15, 2026
@nikitachapovskii-dev

Copy link
Copy Markdown
Contributor Author

@nikitachapovskii-dev nikitachapovskii-dev self-assigned this Jun 16, 2026
@nikitachapovskii-dev nikitachapovskii-dev marked this pull request as ready for review June 16, 2026 12:04
@nikitachapovskii-dev nikitachapovskii-dev changed the title fix(ci): fix the repo fix(ci): fix the monorepo Jun 16, 2026

@ruocco-l ruocco-l left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Thank you for figuring this out

@nikitachapovskii-dev nikitachapovskii-dev marked this pull request as draft June 23, 2026 12:43
Comment on lines +37 to +44
RUN rm -rf \
/deploy/node_modules/playwright \
/deploy/node_modules/playwright-core \
/deploy/node_modules/camoufox-js \
/deploy/node_modules/better-sqlite3 \
/deploy/node_modules/.bin/playwright \
/deploy/node_modules/.bin/playwright-core \
/deploy/node_modules/.bin/camoufox-js

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same pattern for playwright, puppeteer and web scrapers

We install the browser driver and then delete it and run the app from a sub-dir (COPY /deploy ./app + WORKDIR /home/myuser/app). Done like that because the browser binary lives in the base image and the JS driver must match it exactly. Rather than pin our own driver to match (fragile, they're versioned independently and drift, which caused the runtime crashes), we drop the driver from the bundle so the runtime resolves it from the base image's node_modules (resolution walks up from /home/myuser/app). Driver and binary then come from the same image and can't drift. We still install it in the build stage because tsc needs the types.

# digest. The actor does NOT ship its own browser drivers — they're dropped from the deploy bundle
# and resolved from the base at runtime, so driver and binary can never drift. Tag 24-1.59.1 is the
# newest Camoufox base that survives page JS errors (24-1.60.0 crashes the process on them).
FROM apify/actor-node-playwright-camoufox:24-1.59.1@sha256:eadc96fa9492284eb45ef70b6b91c841fae7f142d25a22ae2887a21bb78b3469 AS builder

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for each browser. Base is pinned by sha256 digest, not just a tag, on purpose: tags are mutable, so a bare tag wouldn't give a reproducible browser binary. Since the whole runtime browser stack comes from the base (see the driver note), the digest is what actually makes the browser deterministic. Renovate bumps the digest.

Also, not the newest base. I've tested the latest camoufox base (24-1.60.0, Camoufox 150): it crashes the whole process on any page that emits a JS error (playwright's Firefox pageError handler). I also tried the latest Playwright (1.61.x), which won't even launch this Camoufox binary (Browser.setDefaultViewport protocol error). 24-1.59.1 is the newest camoufox base that survives page errors.

Comment on lines +51 to +55
"@crawlee/browser-pool": "^3.8.2 || ^4.0.0-beta.0",
"@crawlee/core": "^3.8.2 || ^4.0.0-beta.0",
"@crawlee/types": "^3.8.2 || ^4.0.0-beta.0",
"@crawlee/utils": "^3.8.2 || ^4.0.0-beta.0",
"apify": "^3.1.8 || ^4.0.0-beta.0"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This package is shared by the crawlee-v3 generic actors and the v4-beta sitemap-extractor. We will make it 1 ver when we move all actors to crawlee v4

Also, after moving to v4 we will delete version-neutral slice: packages/actor-scraper/sitemap-scraper/src/scraper-tools-compat.d.ts

Comment thread .npmrc Outdated
@@ -0,0 +1 @@
inject-workspace-packages=false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an explicit guard: a normal pnpm install must symlink workspace packages (fast local dev); only the Docker builds opt in per-command with --config.inject-workspace-packages=true on pnpm deploy, so /deploy is self-contained.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please expand on this? It seems this option is false by default in pnpm anyway https://pnpm.io/workspaces#injectworkspacepackages

(also, it seems this option should be camelCased and in pnpm-workspace.yaml instead)

Comment thread pnpm-workspace.yaml
Comment on lines +30 to +33
overrides:
"[email protected]>@crawlee/core": "4.0.0-beta.25"
"[email protected]>@crawlee/types": "4.0.0-beta.25"
"[email protected]>@crawlee/utils": "4.0.0-beta.25"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove once on stable crawlee v4

Comment on lines +6 to +7
"dockerContextDir": "../../../..",
"dockerfile": "../Dockerfile",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in every actor

points the Docker build context at the monorepo root so the build can use the root pnpm-lock.yaml for a frozen, deterministic install. apify push of a single folder can't do this (no root lockfile in the upload). that's why builds run from the Git source.

inspired by actor-monorepo-example

@@ -0,0 +1,80 @@
import { ApifyClient } from 'apify-client';

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It triggers a build of the Actor's configured Git source and waits for the result. For exact-commit reproducibility, point the source at a release tag.

@nikitachapovskii-dev nikitachapovskii-dev marked this pull request as ready for review June 25, 2026 12:38

@barjin barjin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nikitachapovskii-dev !

The changes look mostly fine to me. 👍 I wrote down few ideas I had looking at the code. Please look into these at your own discretion and feel free to merge (no need to wait for me, @ruocco-l 's opinion has more weight here :)).

build_and_test:
name: Build & Test
runs-on: ubuntu-22.04
timeout-minutes: 60

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this won't fix the timeouting e2e tests. Likely caused by microsoft/playwright#40724 (bumping Playwright or reverting Node version might help)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not the fix 😄, it's just for the future (if something happens then fail is better than getting stuck) The actual fix is browser-install step

Comment on lines +203 to +204
// camoufox-js launch options intentionally diverge from Playwright's, and the pinned
// playwright-core type identity differs from @crawlee/playwright's — cast through unknown.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, this sounds curious - can you please investigate this and make an issue in crawlee or camoufox-js?

I don't think we changed anything recently, at least not intentionally.

Comment thread .gitignore Outdated
.vscode
yarn.lock
.yarn
# npm locks are generated locally in actor dirs (the monorepo uses pnpm-lock.yaml)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this addition is just a comment, right? What is it commenting on?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that's a leftover from when I tried to make it via apify push. Thanks for noticing

Comment thread .npmrc Outdated
@@ -0,0 +1 @@
inject-workspace-packages=false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please expand on this? It seems this option is false by default in pnpm anyway https://pnpm.io/workspaces#injectworkspacepackages

(also, it seems this option should be camelCased and in pnpm-workspace.yaml instead)

const { SESSION_MAX_USAGE_COUNTS, DEFAULT_VIEWPORT, DEVTOOLS_TIMEOUT_SECS, META_KEY } = scraperToolsConstants;
const SCHEMA = JSON.parse(await readFile(new URL('../../INPUT_SCHEMA.json', import.meta.url), 'utf8'));

type CrawleePuppeteerPage = PuppeteerCrawlingContext['page'];

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary? PuppeteerCrawlingContext['page'] should be the exactly the Page type from puppeteer. A mismatch here might point to some issues with package version resolution / dual installs etc.

@nikitachapovskii-dev nikitachapovskii-dev Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right! I was lost fighting with browsers and didn't even check if it was a dual install here :/

ENV APIFY_DISABLE_OUTDATED_WARNING=1

CMD npm run start:prod --silent
CMD ["node", "dist/main.js"]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should either call (p)npm run start:prod here or drop the start:prod script from the package.json files. Keeping the script but not using it will eventually lead to a frustrating debugging experience 😄

@nikitachapovskii-dev nikitachapovskii-dev merged commit 21b714a into master Jun 29, 2026
12 checks passed
@nikitachapovskii-dev nikitachapovskii-dev deleted the fix/revive-actor-scraper branch June 29, 2026 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants