
feat: Add operation run dispatcher #3093

Open
jw-nvidia wants to merge 1 commit into NVIDIA:main from jw-nvidia:feat/operation-run-dispatcher

Conversation

@jw-nvidia

Contributor
  • Add OperationRunDispatcher lifecycle wiring to Flow service startup/shutdown.
  • A single dispatcher runs periodically to lock, reconcile, decide, claim, and submit.
  • Handle the conflict, safety, and phase policies.
  • Add unit tests.

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Manual testing was performed on a local dev deployment to verify the full cycle of an operation run.

@jw-nvidia jw-nvidia requested a review from a team as a code owner July 2, 2026 17:19
@copy-pr-bot

copy-pr-bot Bot commented Jul 2, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jw-nvidia jw-nvidia requested a review from spydaNVIDIA July 2, 2026 17:20
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features
    • Added Completed with Failures as a distinct terminal run outcome.
    • Added Claimed as an intermediate target state and introduced automatic operation-run dispatching with safety/phase/conflict handling, polling, and leasing.
  • Bug Fixes
    • Updated end-to-end status/enum conversions and dispatcher state persistence to cover the new run/target states.
    • Extended database constraints and provided rollback behavior for the new allowed statuses.
  • Tests
    • Added conversion, safety messaging, and dispatcher unit test coverage for the new states and gating behavior.

Walkthrough

This PR adds a completed_with_failures operation-run terminal status end-to-end and introduces a new dispatcher that prepares, evaluates, claims, submits, reconciles, and persists operation-run target work.
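
The page does not show the rule the PR uses to pick the terminal run status. A minimal sketch, assuming a run stays non-terminal while any target is still claimed or submitted, and becomes `completed_with_failures` once every target is terminal and at least one failed. All type and constant names below are hypothetical; only the `completed_with_failures`, `claimed`, and `submitted` status names come from the PR summary.

```go
package main

import "fmt"

type TargetStatus string
type RunStatus string

const (
	TargetClaimed   TargetStatus = "claimed"
	TargetSubmitted TargetStatus = "submitted"
	TargetSucceeded TargetStatus = "succeeded" // assumed name
	TargetFailed    TargetStatus = "failed"    // assumed name

	RunRunning               RunStatus = "running"   // assumed name
	RunCompleted             RunStatus = "completed" // assumed name
	RunCompletedWithFailures RunStatus = "completed_with_failures"
)

// deriveRunStatus folds target outcomes into a run status under the
// assumptions stated above. It is not the PR's implementation.
func deriveRunStatus(targets []TargetStatus) RunStatus {
	failed := 0
	for _, t := range targets {
		switch t {
		case TargetSucceeded:
			// terminal success, nothing to count
		case TargetFailed:
			failed++
		default:
			// claimed, submitted, or anything else still in flight
			return RunRunning
		}
	}
	if failed > 0 {
		return RunCompletedWithFailures
	}
	return RunCompleted
}

func main() {
	fmt.Println(deriveRunStatus([]TargetStatus{TargetSucceeded, TargetClaimed}))
	fmt.Println(deriveRunStatus([]TargetStatus{TargetSucceeded, TargetFailed}))
	fmt.Println(deriveRunStatus([]TargetStatus{TargetSucceeded, TargetSucceeded}))
}
```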

Changes

Operation-run dispatcher and completed-with-failures status

| Layer | File(s) | Summary |
| --- | --- | --- |
| Completed-with-failures status contract | `rest-api/flow/internal/operationrun/operationrun.go`, `rest-api/flow/internal/operationrun/operationrun_test.go`, `rest-api/flow/internal/converter/protobuf/operationrun_converter*.go`, `rest-api/flow/proto/v1/flow.proto`, `rest-api/flow/internal/db/migrations/20260626120000_*.sql` | Adds the new terminal run status, claimed target status, mapping logic, proto enums, migrations, and tests. |
| Policy and dispatcher decision model | `rest-api/flow/internal/operationrun/configuration.go`, `rest-api/flow/internal/task/manager/manager.go`, `rest-api/flow/internal/operationrun/manager/dispatcher/{config,policy,decision,phase,conflict_policy,safety}.go` | Extends safety gates, adds rack-conflict signaling, and defines dispatcher config plus pause/transition evaluation. |
| Preparation, reconciliation, and store persistence | `rest-api/flow/internal/operationrun/manager/{store,manager}.go`, `rest-api/flow/internal/operationrun/manager/{manager_test,store_test}.go`, `rest-api/flow/internal/operationrun/manager/dispatcher/{dependencies,preparation,reconciliation}.go`, `rest-api/flow/internal/operationrun/manager/store/dispatch.go` | Introduces dispatcher dependencies, locking/reconciliation, manager store abstraction, Postgres persistence, and row-update validation. |
| Dispatch execution and polling loop | `rest-api/flow/internal/operationrun/manager/dispatcher/{dispatcher,dispatch_run,execution}.go` | Implements the dispatcher lifecycle, transactional dispatch flow, target claiming/submission, and persisted state updates. |
| Dispatcher tests and helpers | `rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher_test.go`, `rest-api/flow/internal/operationrun/manager/dispatcher/safety_test.go` | Adds unit coverage for dispatcher behavior, policy decisions, claim/submission paths, and message formatting. |
| Service lifecycle wiring | `rest-api/flow/internal/service/service.go` | Wires dispatcher construction, startup, and shutdown into the service lifecycle. |

Estimated code review effort: 5 (Critical) | ~120 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Dispatcher
  participant Store
  participant TaskStore
  participant TaskManager

  Dispatcher->>Store: FetchRunnableIDs
  Dispatcher->>Store: RunInTransaction(prepare, decide, execute)
  Store->>Store: LockRunnable / LockOperationRunTargets
  Dispatcher->>TaskStore: GetTask
  Dispatcher->>Store: UpdateTargetState / UpdateRunState
  Dispatcher->>TaskManager: SubmitTask
  TaskManager-->>Dispatcher: taskID or ErrRackConflict

Compact metadata: No related issues or related PRs were supplied.

Suggested labels: feature, operation-run, dispatcher, database-migration

Suggested reviewers: Reviewers familiar with operation-run state transitions, PostgreSQL locking, and dispatcher orchestration should inspect this closely.

Poem

A run turns green, then failure splits the light,
Claims lease their turn, then wait through conflict night.
The dispatcher walks the phases, row by row,
And keeps the run’s true terminal state in tow.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding the operation run dispatcher. |
| Description check | ✅ Passed | The description is clearly related to the changeset and matches the dispatcher wiring, policies, tests, and lifecycle work. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
rest-api/flow/internal/operationrun/operationrun_test.go (1)

18-55: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider broadening non-terminal coverage.

Only TaskStatusRunning is exercised for the non-terminal branch. If taskcommon.TaskStatus has other non-terminal values (e.g. pending/queued), adding a row per value would tighten coverage of the "default" mapping without much effort.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/flow/internal/operationrun/operationrun_test.go` around lines 18 -
55, The test for OperationRunTargetStatusFromTaskStatus only covers
TaskStatusRunning for the non-terminal path, so broaden the table in
TestOperationRunTargetStatusFromTaskStatus to include any other non-terminal
TaskStatus values supported by taskcommon. Keep the existing assertions against
OperationRunTargetStatusSubmitted, and add one row per non-terminal status so
the default mapping is exercised more thoroughly.
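
A table-driven check along the lines suggested above might look like the following sketch. The real `taskcommon.TaskStatus` values are not shown on this page, so the status constants other than running and submitted, and the local `targetStatusFromTaskStatus` function, are hypothetical stand-ins for the PR's `OperationRunTargetStatusFromTaskStatus`.

```go
package main

import "fmt"

type TaskStatus string
type TargetStatus string

const (
	TaskStatusPending   TaskStatus = "pending" // assumed non-terminal value
	TaskStatusRunning   TaskStatus = "running"
	TaskStatusCompleted TaskStatus = "completed" // assumed terminal value
	TaskStatusFailed    TaskStatus = "failed"    // assumed terminal value

	TargetStatusSubmitted TargetStatus = "submitted"
	TargetStatusSucceeded TargetStatus = "succeeded" // assumed name
	TargetStatusFailed    TargetStatus = "failed"    // assumed name
)

// targetStatusFromTaskStatus maps terminal task statuses explicitly and
// sends every other status through the default branch.
func targetStatusFromTaskStatus(s TaskStatus) TargetStatus {
	switch s {
	case TaskStatusCompleted:
		return TargetStatusSucceeded
	case TaskStatusFailed:
		return TargetStatusFailed
	default:
		return TargetStatusSubmitted
	}
}

func main() {
	cases := []struct {
		name string
		in   TaskStatus
		want TargetStatus
	}{
		{"pending", TaskStatusPending, TargetStatusSubmitted},
		{"running", TaskStatusRunning, TargetStatusSubmitted},
		{"unknown", TaskStatus("some-future-status"), TargetStatusSubmitted},
		{"completed", TaskStatusCompleted, TargetStatusSucceeded},
		{"failed", TaskStatusFailed, TargetStatusFailed},
	}
	for _, tc := range cases {
		got := targetStatusFromTaskStatus(tc.in)
		fmt.Printf("%s: got=%s ok=%t\n", tc.name, got, got == tc.want)
	}
}
```

In a real `_test.go` file the loop body would typically call `t.Run(tc.name, ...)` and assert instead of printing.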
rest-api/flow/internal/operationrun/manager/dispatcher/safety.go (1)

39-53: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Enrich the pause message with which gate tripped.

evaluate always returns the same static "failure threshold reached" message regardless of which gate tripped or its configured threshold/kind. When an operator inspects a paused run's StatusReason/message to diagnose why it stopped, this generic text gives no clue which gate (or how close to/over threshold) caused the pause, especially when multiple gates are configured.

Consider including the gate kind and the observed stats, e.g.:

♻️ Proposed enrichment
 		return pauseDecision{
 			pause:   true,
 			reason:  operationrun.OperationRunStatusReasonSafetyGate,
-			message: "failure threshold reached",
+			message: fmt.Sprintf(
+				"safety gate %s tripped: %d/%d failed",
+				gate.SafetyGateKind(), stats.failed, stats.total,
+			),
 		}

As per path instructions, Flow changes should be reviewed for "observability for stuck or failed operations."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/flow/internal/operationrun/manager/dispatcher/safety.go` around
lines 39 - 53, The pause message in evaluate is too generic because it always
returns the same text when a gate trips. Update the pauseDecision returned from
evaluate to include the specific gate identity from gate.SafetyGateScope() and
the observed stats/threshold context from statsForScope so operators can tell
which gate caused the pause. Keep the existing control flow in
dispatcher/safety.go, but enrich the message built in the loop where
gate.IsTripped(...) succeeds.

Source: Path instructions
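
For reference, a self-contained sketch of a pause message that names the gate and reports the observed counts is shown below. The `failureGate` and `gateStats` types and the gate kind string are hypothetical; the PR's real gate interface and stats helper are not visible on this page.

```go
package main

import "fmt"

// gateStats and failureGate are hypothetical stand-ins for the PR's types.
type gateStats struct{ failed, total int }

type failureGate struct {
	kind      string
	maxFailed int
}

func (g failureGate) tripped(s gateStats) bool { return s.failed >= g.maxFailed }

type pauseDecision struct {
	pause   bool
	message string
}

// evaluate returns a pause decision for the first tripped gate, with a
// message that identifies the gate and the numbers behind the decision.
func evaluate(gates []failureGate, s gateStats) pauseDecision {
	for _, g := range gates {
		if g.tripped(s) {
			return pauseDecision{
				pause: true,
				message: fmt.Sprintf(
					"safety gate %s tripped: %d/%d targets failed (threshold %d)",
					g.kind, s.failed, s.total, g.maxFailed,
				),
			}
		}
	}
	return pauseDecision{}
}

func main() {
	gates := []failureGate{{kind: "max_failures", maxFailed: 3}}
	fmt.Println(evaluate(gates, gateStats{failed: 1, total: 10}).pause)
	fmt.Println(evaluate(gates, gateStats{failed: 3, total: 10}).message)
}
```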

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rest-api/flow/internal/operationrun/manager/dispatcher/execution.go`:
- Around line 68-72: The submission flow in Dispatcher execution is persisting
OperationRunTargetStatusSubmitted before a child task is safely created and
linked, which can leave claimed work unrecoverable if submit or
UpdateTargetState fails. Update the target state handling in execution.go so the
claim is a recoverable lease/claim until SubmitTask succeeds and TaskID is
persisted, then transition to Submitted only after the task reference is durably
recorded. Make the submit path idempotent or atomic around the submit logic in
the submit/execution flow so a crash or canceled context can be reconciled
without blocking concurrency.

In `@rest-api/flow/internal/operationrun/manager/store/dispatch.go`:
- Around line 96-136: UpdateRunState and UpdateTargetState currently ignore
whether Exec actually updated any rows, so stale or missing run/target IDs can
silently succeed. Capture the sql.Result returned by
s.idb(ctx).NewUpdate().Exec(ctx) in both PostgresStore methods, check
RowsAffected, and return an error when it is zero (or otherwise unexpected) so
dispatcher state transitions fail loudly. Keep the fix localized to
UpdateRunState and UpdateTargetState, using the existing run.ID and target.ID
filters as the signal for whether persistence really happened.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 256e565b-474b-45dd-b535-f3fa0da86531

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc207a and 175c457.

⛔ Files ignored due to path filters (1)
  • rest-api/flow/pkg/proto/v1/flow.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
📒 Files selected for processing (30)
  • rest-api/flow/internal/converter/protobuf/operationrun_converter.go
  • rest-api/flow/internal/converter/protobuf/operationrun_converter_test.go
  • rest-api/flow/internal/db/migrations/20260626120000_operation_run_completed_with_failures_status.down.sql
  • rest-api/flow/internal/db/migrations/20260626120000_operation_run_completed_with_failures_status.up.sql
  • rest-api/flow/internal/operationrun/configuration.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/config.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/conflict_policy.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/decision.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dependencies.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatch_run.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher_test.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/execution.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/phase.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/policy.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/preparation.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/reconciliation.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/safety.go
  • rest-api/flow/internal/operationrun/manager/lifecycle.go
  • rest-api/flow/internal/operationrun/manager/manager.go
  • rest-api/flow/internal/operationrun/manager/manager_test.go
  • rest-api/flow/internal/operationrun/manager/queries.go
  • rest-api/flow/internal/operationrun/manager/store.go
  • rest-api/flow/internal/operationrun/manager/store/dispatch.go
  • rest-api/flow/internal/operationrun/manager/store/store.go
  • rest-api/flow/internal/operationrun/operationrun.go
  • rest-api/flow/internal/operationrun/operationrun_test.go
  • rest-api/flow/internal/service/service.go
  • rest-api/flow/internal/task/manager/manager.go
  • rest-api/flow/proto/v1/flow.proto

Comment thread rest-api/flow/internal/operationrun/manager/dispatcher/execution.go Outdated
Comment thread rest-api/flow/internal/operationrun/manager/store/dispatch.go
- Add OperationRunDispatcher lifecycle wiring to Flow service startup/shutdown.
- One dispatcher runs periodically to lock/reconcile/decide/claim/submission.
- The conflict policy, safety policy and phase policy are handled.
- Add unit tests
@jw-nvidia jw-nvidia force-pushed the feat/operation-run-dispatcher branch from 175c457 to 7f71873 Compare July 2, 2026 18:39
@github-actions

github-actions Bot commented Jul 2, 2026


🔍 Container Scan Summary

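| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| nico-flow | 13 | 1 | 2 | 2 | 1 | 7 |
| nico-nsm | 5 | 0 | 0 | 4 | 1 | 0 |
| nico-psm | 13 | 1 | 2 | 2 | 1 | 7 |
| nico-rest-api | 13 | 1 | 2 | 2 | 1 | 7 |
| nico-rest-cert-manager | 12 | 1 | 2 | 2 | 0 | 7 |
| nico-rest-db | 13 | 1 | 2 | 2 | 1 | 7 |
| nico-rest-site-agent | 12 | 1 | 2 | 2 | 0 | 7 |
| nico-rest-site-manager | 12 | 1 | 2 | 2 | 0 | 7 |
| nico-rest-workflow | 13 | 1 | 2 | 2 | 1 | 7 |
| TOTAL | 106 | 8 | 16 | 20 | 6 | 56 |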

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@github-actions

github-actions Bot commented Jul 2, 2026


🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-07-02 18:42:13 UTC | Commit: 7f71873

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
rest-api/flow/internal/operationrun/manager/dispatcher/config.go (1)

8-20: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Promote defaultSubmitPersistTimeout into Config.

Every other dispatcher timing knob (PollInterval, FetchBatch, ClaimLease) is exposed on Config and defaulted via withDefaults(). defaultSubmitPersistTimeout is defined here but consumed directly as a package constant in execution.go's updateTargetAfterSubmit, leaving it neither configurable nor easily overridable in tests that want to exercise timeout behavior deterministically.

♻️ Proposed fix
 type Config struct {
 	PollInterval time.Duration
 	FetchBatch   int
 	ClaimLease   time.Duration
+	SubmitPersistTimeout time.Duration
 }

 func (c Config) withDefaults() Config {
 	if c.PollInterval <= 0 {
 		c.PollInterval = defaultPollInterval
 	}
 	...
+	if c.SubmitPersistTimeout <= 0 {
+		c.SubmitPersistTimeout = defaultSubmitPersistTimeout
+	}
 	return c
 }

Then update execution.go's updateTargetAfterSubmit to use d.cfg.SubmitPersistTimeout instead of the bare constant.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/flow/internal/operationrun/manager/dispatcher/config.go` around
lines 8 - 20, Promote the submit/persist timeout into dispatcher configuration
so it is configurable and testable like the other timing knobs. Add a
SubmitPersistTimeout field to Config, default it in withDefaults alongside
PollInterval, FetchBatch, and ClaimLease, and keep the existing default value
centralized there. Then update updateTargetAfterSubmit in execution.go to read
d.cfg.SubmitPersistTimeout instead of the package-level
defaultSubmitPersistTimeout.

Source: Path instructions


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 033fc593-6ca0-4e10-9a55-1ef0b7abf576

📥 Commits

Reviewing files that changed from the base of the PR and between 175c457 and 7f71873.

⛔ Files ignored due to path filters (1)
  • rest-api/flow/pkg/proto/v1/flow.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
📒 Files selected for processing (32)
  • rest-api/flow/internal/converter/protobuf/operationrun_converter.go
  • rest-api/flow/internal/converter/protobuf/operationrun_converter_test.go
  • rest-api/flow/internal/db/migrations/20260626120000_operation_run_completed_with_failures_status.down.sql
  • rest-api/flow/internal/db/migrations/20260626120000_operation_run_completed_with_failures_status.up.sql
  • rest-api/flow/internal/operationrun/configuration.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/config.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/conflict_policy.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/decision.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dependencies.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatch_run.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher_test.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/execution.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/phase.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/policy.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/preparation.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/reconciliation.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/safety.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/safety_test.go
  • rest-api/flow/internal/operationrun/manager/lifecycle.go
  • rest-api/flow/internal/operationrun/manager/manager.go
  • rest-api/flow/internal/operationrun/manager/manager_test.go
  • rest-api/flow/internal/operationrun/manager/queries.go
  • rest-api/flow/internal/operationrun/manager/store.go
  • rest-api/flow/internal/operationrun/manager/store/dispatch.go
  • rest-api/flow/internal/operationrun/manager/store/store.go
  • rest-api/flow/internal/operationrun/manager/store/store_test.go
  • rest-api/flow/internal/operationrun/operationrun.go
  • rest-api/flow/internal/operationrun/operationrun_test.go
  • rest-api/flow/internal/service/service.go
  • rest-api/flow/internal/task/manager/manager.go
  • rest-api/flow/proto/v1/flow.proto
✅ Files skipped from review due to trivial changes (2)
  • rest-api/flow/internal/operationrun/manager/store/store.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/policy.go
🚧 Files skipped from review as they are similar to previous changes (20)
  • rest-api/flow/internal/operationrun/operationrun_test.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dependencies.go
  • rest-api/flow/proto/v1/flow.proto
  • rest-api/flow/internal/task/manager/manager.go
  • rest-api/flow/internal/operationrun/manager/store.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/phase.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/preparation.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/conflict_policy.go
  • rest-api/flow/internal/service/service.go
  • rest-api/flow/internal/operationrun/manager/store/dispatch.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/safety.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/reconciliation.go
  • rest-api/flow/internal/operationrun/manager/manager_test.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/decision.go
  • rest-api/flow/internal/operationrun/manager/manager.go
  • rest-api/flow/internal/operationrun/operationrun.go
  • rest-api/flow/internal/operationrun/configuration.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatch_run.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher_test.go
  • rest-api/flow/internal/operationrun/manager/dispatcher/dispatcher.go
