Skip to content

Fix workflows_pkey duplicate INSERT on collab reconnect, and remove its root cause (#4830)#4841

Open
stuartc wants to merge 4 commits into
mainfrom
fix/4830-collab-workflow-pkey-insert
Open

Fix workflows_pkey duplicate INSERT on collab reconnect, and remove its root cause (#4830)#4841
stuartc wants to merge 4 commits into
mainfrom
fix/4830-collab-workflow-pkey-insert

Conversation

@stuartc

@stuartc stuartc commented Jun 8, 2026

Copy link
Copy Markdown
Member

Description

This fixes the workflows_pkey duplicate-key crash you'd sometimes hit when reconnecting to the collaborative editor after a save — and then goes after the thing that was actually causing it.

The root cause was that we were resolving "which workflow am I editing, and is it new or existing?" in two completely separate places: once when the channel joins (load_workflow/5) and again on the session save/reset path (fetch_workflow/1). Those two could quietly disagree about whether an id should map to a :built struct (→ INSERT) or a :loaded one (→ UPDATE). So a reconnect that still thought it was "new" would try to INSERT a row that already existed → boom.

So there are two parts here:

  1. The immediate fix — the reconnect path heals the join action and reconciles to the saved row, so the next save UPDATEs instead of colliding.
  2. The real fix — I pulled all of that resolution logic into one place, Lightning.Collaboration.WorkflowResolver. Both callers now go through it, so they physically can't disagree any more. It also hands back an explicit kind (:new | :existing | :version) so the channel stops trying to guess "is this new?" from the shape of the struct. fetch_workflow/1 and the old NotLoaded poking are gone.

The nice property that closes the bug: a "new" join for an id that already has a row resolves to the existing :loaded row (kind: :existing), so the save routes to UPDATE. No more duplicate INSERT.

Closes #4830

Validation steps

  1. Open a workflow in the collab editor, make a change, save it.
  2. Force a reconnect (drop the socket / refresh the tab) and save again. Before this, that could blow up with a workflows_pkey duplicate-key error — now it should just save cleanly.
  3. Sanity-check the three normal flows still work: creating a brand-new workflow, editing an existing one, and viewing an older version.
  4. mix test test/lightning/collaboration/ test/lightning_web/channels/workflow_channel_test.exs — should be green (273 tests here).

Additional notes for the reviewer

A few things worth your eyes:

  • The bit that actually closes the bug is the reconcile-by-id behaviour — a "new" action on an id that already has a row comes back as :existing. There's a resolver test pinning this specifically, since getting it wrong would bring the duplicate-INSERT straight back.
  • Auth ordering is unchanged: "new" still authorises first, "edit"/version still resolve first. Authorisation stays in the channel; the resolver only does the project-ownership check, and only when it's handed a :project.
  • One small tightening: a "new" rejoin for an id that belongs to another project now returns a wrong-project error instead of silently building a fresh struct. Felt like the safer behaviour.
  • I kept the #4829 Ecto.ConstraintError rescue in workflows.ex as a backstop — didn't want to remove the belt while adding the braces.
  • While moving things around, the reset path briefly lost its "this workflow was deleted" error — I put that back (resetting a soft-deleted workflow returns workflow_deleted again, same as before), with a test.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review with Claude Code)
  • I have implemented and tested all related authorization policies. (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

stuartc added 4 commits June 8, 2026 17:59
A stale "new" rejoin after an in-place socket reconnect could hand a
collaborative session a freshly-built workflow struct for an id that the
first save had already persisted, routing the second save to INSERT and
colliding on workflows_pkey.

Reconcile by id in fetch_workflow/1: a built-state struct whose id already
has a row is reloaded to its :loaded form (routes to UPDATE), while a
genuine first save (no row) still INSERTs. Confined to fetch_workflow/1;
the channel-side load_workflow/5 is intentionally left untouched. The #4829
ConstraintError rescue is kept as a backstop.
Phase 2 (client) complement to the server-side reconcile-by-id fix. The
channel-join action was read once from data-is-new-workflow and frozen in
the provider's join params, so an in-place socket reconnect of an
already-saved "new" workflow rejoined with stale action: "new".

Make join params lazy and reactive to the SessionContext store's
isNewWorkflow flag (flipped by clearIsNewWorkflow on save) via a ref
bridged up from StoreProvider, since SessionProvider sits above the store.
After the first successful save, a reconnect now rejoins as action: "edit".
Narrows the trigger surface; the server fix remains the guarantee.
Covers the full StoreProvider -> ref -> getJoinParams bridge by rendering
the real SessionProvider/StoreProvider tree (only the y-phoenix-channel
transport is faked, recording the params it is constructed with). Asserts
an initial connect joins action: "new" and an in-place reconnect after
clearIsNewWorkflow() rejoins action: "edit" — fails against the pre-fix
frozen-prop wiring.

Exports the previously module-private SocketContext as a test seam so the
provider tree can be supplied a connected socket without a real socket.
Workflow resolution was implemented twice — once in the channel join
(load_workflow/5) and once in the session save/reset path
(fetch_workflow/1). The two could disagree on whether a given id mapped
to a :built or :loaded struct, which is the structural root cause behind
the workflows_pkey duplicate INSERT on collab reconnect.

Introduce Lightning.Collaboration.WorkflowResolver as the single
authority: given a workflow id + action it returns the canonical
%Workflow{} in the correct Ecto state, plus an explicit kind
(:new | :existing | :version) so callers never re-derive newness from
struct shape. Resolving an existing id under a :new action reconciles to
the loaded row (kind :existing), so the save routes to UPDATE rather
than a duplicate INSERT. Both callers now delegate here; fetch_workflow/1
and the bespoke NotLoaded surgery are deleted.

The two channel sites that previously inferred newness from
__meta__.state and the snapshot_version assign now read the resolver's
kind via socket.assigns.workflow_kind, removing the inference entirely.

Per-action auth ordering is preserved (new = auth-first; edit/version =
resolve-first). A "new" rejoin to a foreign-project id now returns the
wrong-project error rather than silently building a fresh struct — a
narrowing of the auth boundary confirmed by security review. The #4829
Ecto.ConstraintError rescue is left untouched as the backstop.
@github-project-automation github-project-automation Bot moved this to New Issues in Core Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Security Review ✅

  • S0 (project scoping): New WorkflowResolver.resolve/3 enforces ownership via check_ownership/2 when a :project opt is passed; both call sites (channel load_workflow and Session.save_workflow/reset_workflow) supply it, so cross-project access still returns :wrong_project. Snapshot-version path's project-from-supplied behavior matches the pre-refactor code at workflow_channel.ex:1191-1212.
  • S1 (authorization): No authorization gates moved or removed — Permissions.can(:workflows, :access_read, ...) still wraps every :edit path and Permissions.can(:project_users, :create_workflow, ...) still gates the :new path at workflow_channel.ex:1205,1224,1243; the resolver explicitly defers auth to the channel (workflow_resolver.ex:13-14).
  • S2 (audit trail): N/A, no config-resource writes added — Session.save_workflow still delegates persistence to Lightning.Workflows.save_workflow/2 unchanged, so existing audit coverage is preserved.

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.95181% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.4%. Comparing base (1f514b8) to head (2b6fad4).

Files with missing lines Patch % Lines
lib/lightning/collaboration/session.ex 60.0% 6 Missing ⚠️
lib/lightning_web/channels/workflow_channel.ex 84.0% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #4841     +/-   ##
=======================================
+ Coverage   90.3%   90.4%   +0.1%     
=======================================
  Files        444     445      +1     
  Lines      22607   22639     +32     
=======================================
+ Hits       20419   20468     +49     
+ Misses      2188    2171     -17     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@stuartc stuartc requested a review from midigofrank June 9, 2026 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Collaborative editor can INSERT an already-existing workflow id (workflows_pkey crash)

1 participant