chore: remove Playwright smoke tests from app template #346

Open
fjakobs wants to merge 1 commit into main from fjakobs/no-playwright
Conversation

Collaborator

fjakobs commented May 5, 2026

Summary

  • Deleted playwright.config.ts and tests/smoke.spec.ts from the app template
  • Removed @playwright/test and sharp from devDependencies
  • Removed test:e2e, test:e2e:ui, and test:smoke scripts
  • Simplified test script from vitest run && npm run test:smoke to vitest run
  • Simplified clean script to remove Playwright-related artifact directories
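
The package.json side of these bullets can be sketched as a small transform. This is only an illustration of the summary, not the actual diff: the `PackageJson` shape, the `stripPlaywright` helper, the placeholder versions, and the Playwright commands in the sample input are all assumptions. Only the removed script and dependency names and the before/after `test` script come from the description above.

```typescript
// Hypothetical, minimal view of the package.json fields this PR touches.
interface PackageJson {
  scripts: Record<string, string>;
  devDependencies: Record<string, string>;
}

// Names taken from the PR summary.
const REMOVED_SCRIPTS = ["test:e2e", "test:e2e:ui", "test:smoke"];
const REMOVED_DEV_DEPS = ["@playwright/test", "sharp"];

// Return a copy of `pkg` without the Playwright-related entries and with
// the `test` script reduced to `vitest run`. The `clean` script is left
// untouched because its contents are not shown in this conversation.
function stripPlaywright(pkg: PackageJson): PackageJson {
  const scripts: Record<string, string> = {};
  for (const [name, command] of Object.entries(pkg.scripts)) {
    if (!REMOVED_SCRIPTS.includes(name)) scripts[name] = command;
  }
  if ("test" in scripts) scripts["test"] = "vitest run";

  const devDependencies: Record<string, string> = {};
  for (const [name, version] of Object.entries(pkg.devDependencies)) {
    if (!REMOVED_DEV_DEPS.includes(name)) devDependencies[name] = version;
  }
  return { scripts, devDependencies };
}

// Illustrative input only: versions and e2e commands are placeholders.
const before: PackageJson = {
  scripts: {
    test: "vitest run && npm run test:smoke",
    "test:e2e": "playwright test",
    "test:e2e:ui": "playwright test --ui",
    "test:smoke": "playwright test tests/smoke.spec.ts",
  },
  devDependencies: { "@playwright/test": "0.0.0", sharp: "0.0.0", vitest: "0.0.0" },
};

console.log(JSON.stringify(stripPlaywright(before), null, 2));
```

With this sample input, the only script left is `test` and the only devDependency left is `vitest`.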

Test plan

  • Verify databricks apps init produces a working template without Playwright references
  • Verify npm test runs vitest successfully in a generated app

This pull request and its description were written by Isaac.

fjakobs requested a review from a team as a code owner on May 5, 2026 16:58
Playwright and sharp add significant install weight to the template
without providing value in the default developer workflow. Vitest
alone covers unit testing needs.

Co-authored-by: Isaac
fjakobs force-pushed the fjakobs/no-playwright branch from d798bf3 to 69d025c on May 5, 2026 17:01
Contributor

keugenek commented May 6, 2026

Running a custom evals build on this branch to check for regressions after the Playwright drop: https://6177827686947384.4.gcp.databricks.com/jobs/212883645927255/runs/313658142661406

Contributor

keugenek commented May 6, 2026

Running a targeted dev eval to measure quality impact of this change.

  • Run: 313658142661406 — triggered with --appkit-branch fjakobs/no-playwright
  • Baseline: prod nightly run 456555456546311 (2026-05-05 16:31 PDT, stock AppKit, SUCCESS_WITH_FAILURES)
  • Apps under test (5): booking_calendar, taxi_zones_map, parts_catalog_app, fare_fairness_checker, host_onboarding_checklist — mix of UI / map / search / form / multi-step wizard for max Playwright sensitivity at minimum cost
  • Metrics being compared per app: appeval_100 (overall), build_success, unit_tests_pass, smoke_tests_pass (Playwright-driven), apps_validate_pass, local_runability, type_safety_pass

Will post per-app deltas and a verdict here when the run terminates (~40–60 min).

Contributor

keugenek commented May 6, 2026

Initial run finished (313658142661406) — TERMINATED / SUCCESS_WITH_FAILURES, same overall state as baseline. But the 5-app sample turned out too thin: only 1 app (parts_catalog_app) produced an apples-to-apples comparison — the other 4 hit pre-existing generation flakiness in one or both runs.

parts_catalog_app — the one clean comparison

| Metric | Baseline (456555456546311) | This branch | Δ |
| --- | --- | --- | --- |
| build_success | true | true | |
| type_safety_pass | true | true | |
| apps_validate_pass | true | true | |
| local_runability | 1.0 | 1.0 | |
| smoke_tests_pass | true | false | expected (Playwright removed) |
| unit_tests_pass | true | false | ⚠️ unexpected |
| appeval_100 | 1.000 | 0.667 | −0.333 |

Build / type-check / validate / runtime layers all preserved. ✅
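
For reference, the per-metric comparison above can be reproduced with a few lines. This is a minimal sketch, not the eval framework's code: the `Metrics` type and the `metricDeltas` / `changedMetrics` helpers are hypothetical, and treating booleans as 1/0 is an assumption made only so that flips show up as numeric deltas. The values are copied from the parts_catalog_app comparison.

```typescript
type MetricValue = boolean | number;
type Metrics = Record<string, MetricValue>;

// Values copied from the parts_catalog_app comparison.
const baseline: Metrics = {
  build_success: true,
  type_safety_pass: true,
  apps_validate_pass: true,
  local_runability: 1.0,
  smoke_tests_pass: true,
  unit_tests_pass: true,
  appeval_100: 1.0,
};
const branch: Metrics = {
  build_success: true,
  type_safety_pass: true,
  apps_validate_pass: true,
  local_runability: 1.0,
  smoke_tests_pass: false,
  unit_tests_pass: false,
  appeval_100: 0.667,
};

// Assumption: booleans count as 1/0, so a true -> false flip is -1.
function toNumber(value: MetricValue): number {
  return typeof value === "boolean" ? (value ? 1 : 0) : value;
}

// Per-metric delta (branch minus baseline), rounded to 3 decimals.
// Metrics missing from the second run are skipped.
function metricDeltas(base: Metrics, next: Metrics): Record<string, number> {
  const deltas: Record<string, number> = {};
  for (const [name, baseValue] of Object.entries(base)) {
    const nextValue = next[name];
    if (nextValue === undefined) continue;
    const delta = toNumber(nextValue) - toNumber(baseValue);
    deltas[name] = Math.round(delta * 1000) / 1000;
  }
  return deltas;
}

// Names of the metrics whose value differs between the two runs.
function changedMetrics(base: Metrics, next: Metrics): string[] {
  return Object.entries(metricDeltas(base, next))
    .filter(([, delta]) => delta !== 0)
    .map(([name]) => name);
}

console.log(changedMetrics(baseline, branch));
// -> ["smoke_tests_pass", "unit_tests_pass", "appeval_100"]
```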

Unexpected coupling on unit_tests_pass

Generated apps have no vitest test files. The eval framework runs npm test:

  • BASE stdout: No test files found, exiting with code 0
  • NEW stdout: No test files found, exiting with code 1

This PR doesn't directly edit vitest.config.ts, so the change in vitest's no-files behavior is indirect — likely from removing the tests/ dir / script chain, or eval-framework drift between baseline (2026-05-05) and this run (2026-05-06). Worth pinning down before this lands, otherwise every generated app's unit_tests_pass will flip to false.
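
To help pin this down across more apps, the two stdout lines can be compared mechanically. A minimal sketch, assuming the eval logs contain the message verbatim as quoted above; `noTestFilesExitCode` and `exitCodeFlipped` are hypothetical helpers, not part of the eval framework.

```typescript
// Extract the exit code from the "No test files found" message.
// Returns null when the message does not appear in the output.
function noTestFilesExitCode(stdout: string): number | null {
  const match = /No test files found, exiting with code (\d+)/.exec(stdout);
  return match ? Number(match[1]) : null;
}

// True when both runs hit the no-test-files path but report different
// exit codes, which is the pattern seen for parts_catalog_app.
function exitCodeFlipped(baseStdout: string, newStdout: string): boolean {
  const baseCode = noTestFilesExitCode(baseStdout);
  const newCode = noTestFilesExitCode(newStdout);
  return baseCode !== null && newCode !== null && baseCode !== newCode;
}

// The two lines quoted above.
const baseLine = "No test files found, exiting with code 0";
const newLine = "No test files found, exiting with code 1";

console.log(noTestFilesExitCode(baseLine)); // -> 0
console.log(noTestFilesExitCode(newLine)); // -> 1
console.log(exitCodeFlipped(baseLine, newLine)); // -> true
```

Vitest's `passWithNoTests` option (the `test.passWithNoTests` config field or the `--passWithNoTests` CLI flag) is the documented setting that makes a run with no test files exit with code 0. Whether it is involved in the difference seen here has not been verified.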

host_onboarding_checklist (succeeded in NEW, failed in baseline gen) hit appeval_100=1.0, which is a positive signal that the build/runtime path on this branch is fine — but it's not directly comparable.

Next

Kicked off a wider re-run over the full 30-prompt nightly catalog to drown out the per-app gen flakiness: 854667388920187. Will post statistically meaningful deltas (≈10+ clean comparisons) when it terminates (~60–90 min).

Lean

Until the wider data lands and the unit_tests_pass regression is explained, flag-gating Playwright (option 2) looks lower-risk than full removal — keeps it opt-in for the apps that want it while letting you ship the install-size / memory wins.


2 participants