
Add opt-in same-node scheduling preference for job and step pods#342

Open
jeanschmidt wants to merge 2 commits into actions:main from jeanschmidt:same-node-preference

Conversation

@jeanschmidt

Add opt-in same-node scheduling preference for job and step pods

Summary

Add a configurable same-node scheduling preference that biases job pods and container step pods toward the same Kubernetes node as their runner pod. Enabled via ACTIONS_RUNNER_SAME_NODE_PREFERENCE=true. Disabled by default — no behavioral change unless explicitly opted in.

Problem

When a runner pod creates a job pod (or a container step pod), the Kubernetes scheduler places it on any available node in the cluster. In setups where data locality matters — e.g. host-path caches, NVMe scratch volumes, pre-pulled images, or shared emptyDir volumes — this can cause:

  • Cross-node data transfer: workspace copies (execCpToPod/execCpFromPod) go over the network instead of staying node-local
  • Cache misses: node-level caches (git object caches, layer caches, dependency caches) are only useful if the job lands on the same node
  • Increased latency: pod startup is slower when images need to be pulled to a different node

There is currently no mechanism in the hooks to influence node placement. The only option is an external hook template (ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE), but that's a static YAML file — it can't dynamically discover the runner pod's current node.

Solution

Add a weight-100 preferredDuringSchedulingIgnoredDuringExecution node affinity to workflow pods, targeting the same node as the runner pod.

How it works

  1. On pod creation, check if ACTIONS_RUNNER_SAME_NODE_PREFERENCE=true is set. If not, return immediately (no API calls, no overhead).
  2. Read ACTIONS_RUNNER_POD_NAME to identify the runner pod.
  3. Look up the runner pod via the K8s API to find its spec.nodeName.
  4. Inject a weighted node affinity entry matching kubernetes.io/hostname = that node name.
  5. The Kubernetes scheduler prefers the same node but gracefully falls back to other nodes when the target node is full or unavailable.
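The injection step above can be sketched as a small utility. This is a minimal sketch only: the function name follows the PR description, but the local type definitions stand in for `@kubernetes/client-node`'s `V1PodSpec` shapes and are assumptions, not the merged code.

```typescript
// Minimal local stand-ins for the Kubernetes client's PodSpec affinity types
// (assumed shapes, for illustration only).
interface NodeSelectorRequirement {
  key: string
  operator: string
  values: string[]
}
interface PreferredSchedulingTerm {
  weight: number
  preference: { matchExpressions: NodeSelectorRequirement[] }
}
interface PodSpec {
  affinity?: {
    nodeAffinity?: {
      preferredDuringSchedulingIgnoredDuringExecution?: PreferredSchedulingTerm[]
    }
  }
}

// Append a weight-100 preference for the runner's node, creating the affinity
// structures as needed and preserving any entries already present (e.g. from
// an ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE extension merge).
function injectSameNodePreference(spec: PodSpec, nodeName: string): void {
  const term: PreferredSchedulingTerm = {
    weight: 100,
    preference: {
      matchExpressions: [
        { key: 'kubernetes.io/hostname', operator: 'In', values: [nodeName] }
      ]
    }
  }
  spec.affinity ??= {}
  spec.affinity.nodeAffinity ??= {}
  spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution ??= []
  spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution.push(term)
}
```

Because the helper only pushes onto the preferred-terms array, template-supplied affinities survive untouched.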

Why preferred instead of required

A hard nodeName pin or requiredDuringScheduling would cause job pods to fail scheduling entirely if the node has insufficient resources. The weight-100 preferred affinity gives maximum bias while remaining fault-tolerant — the scheduler treats it as a strong hint, not a hard constraint.

Configuration

  • ACTIONS_RUNNER_SAME_NODE_PREFERENCE: optional, default unset (disabled). Set to "true" to enable the same-node preference.
  • ACTIONS_RUNNER_POD_NAME: required (already required by the hooks), no default. Used to look up the runner pod's node; auto-injected by ARC.
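The gate described above can be sketched as follows; per the test plan, only the exact string "true" enables the feature, so "TRUE", "1", an empty value, or an unset variable all leave it off. The constant and function names follow the PR description, but this is a sketch, not the merged code.

```typescript
// Env var gate: enabled only when the value is exactly "true" (case-sensitive).
const ENV_SAME_NODE_PREFERENCE = 'ACTIONS_RUNNER_SAME_NODE_PREFERENCE'

function useSameNodePreference(): boolean {
  return process.env[ENV_SAME_NODE_PREFERENCE] === 'true'
}
```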

RBAC requirement

The runner's ServiceAccount needs get permission on the pods resource (already in requiredPermissions) to look up the runner pod's node name. No additional RBAC is needed.

Error handling

Non-fatal. If the runner pod lookup fails for any reason (permissions, pod not found, API timeout), a warning is logged via core.warning() and the pod is created without the affinity — identical to the current behavior.
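A sketch of that non-fatal path is below. The real hook resolves the node via the Kubernetes API (a `readNamespacedPod` call) and logs through `core.warning()`; here the lookup and the warning sink are injected, and a simplified spec shape stands in for the real PodSpec, so the fallback behavior is visible without a cluster. The function name is from the PR description; the signatures are assumptions.

```typescript
// Stand-in for the real PodSpec; the real code injects a nodeAffinity entry.
interface MinimalSpec {
  preferredNode?: string
}

// Apply the same-node preference, degrading silently to default scheduling
// on any lookup failure. `lookupNode` is a synchronous stand-in for the
// async Kubernetes API call in the actual hook.
function applySameNodePreference(
  spec: MinimalSpec,
  lookupNode: (podName: string) => string | undefined,
  warn: (msg: string) => void = console.warn
): void {
  if (process.env['ACTIONS_RUNNER_SAME_NODE_PREFERENCE'] !== 'true') {
    return // feature disabled: no API calls, no overhead
  }
  const runnerPodName = process.env['ACTIONS_RUNNER_POD_NAME']
  if (!runnerPodName) {
    warn('same-node preference enabled but ACTIONS_RUNNER_POD_NAME is unset')
    return
  }
  try {
    const nodeName = lookupNode(runnerPodName)
    if (nodeName) {
      spec.preferredNode = nodeName // real code: injectSameNodePreference(spec, nodeName)
    }
  } catch (err) {
    // Permissions, pod-not-found, API timeout: warn and create the pod
    // without the affinity, identical to the pre-existing behavior.
    warn(`same-node preference skipped: ${err}`)
  }
}
```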

What this does NOT change

  • Default behavior — feature is disabled unless explicitly opted in
  • Pod creation flow — same containers, volumes, extensions, labels
  • Extension template merging — the affinity is injected after extension merge, so it appends to (never overwrites) any affinity from ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
  • No new dependencies

Files changed

  • packages/k8s/src/k8s/utils.ts — Added ENV_SAME_NODE_PREFERENCE constant, useSameNodePreference() gate function, and injectSameNodePreference() utility that builds the nodeAffinity entry
  • packages/k8s/src/k8s/index.ts — Added applySameNodePreference() function (env var gated, K8s API lookup, non-fatal), called from both createJobPod and createContainerStepPod
  • packages/k8s/tests/k8s-utils-test.ts — 7 new tests covering the gate function and affinity injection

Test plan

  • npm run build succeeds (tsc + ncc)
  • All 29 tests pass (22 existing + 7 new)
  • useSameNodePreference returns false when env var is unset, empty, "false", "TRUE", or "1"
  • useSameNodePreference returns true only when env var is exactly "true"
  • injectSameNodePreference creates affinity from scratch on empty spec
  • injectSameNodePreference appends to existing preferred scheduling entries (preserves template affinities)
  • injectSameNodePreference does not touch requiredDuringSchedulingIgnoredDuringExecution
  • In production for pytorch/* org via fork https://github.com/jeanschmidt/runner-container-hooks

- Add ACTIONS_RUNNER_SAME_NODE_PREFERENCE env var to opt in to co-locating
  workflow pods on the same node as the runner pod
- Inject a weighted (100) nodeAffinity preferredDuringScheduling rule
  matching the runner pod's kubernetes.io/hostname
- Apply the preference in both createJobPod and createContainerStepPod
- Add unit tests for useSameNodePreference and injectSameNodePreference

Uses a soft preference (not a hard requirement) so pods still schedule
if the runner's node has no capacity. Looks up the runner pod via
ACTIONS_RUNNER_POD_NAME and logs a warning if the lookup fails.

Signed-off-by: Jean Schmidt <[email protected]>
Copilot AI review requested due to automatic review settings April 22, 2026 22:21
@jeanschmidt jeanschmidt requested review from a team and nikola-jokic as code owners April 22, 2026 22:21
@jeanschmidt jeanschmidt changed the title feat: add same-node scheduling preference Add opt-in same-node scheduling preference for job and step pods Apr 22, 2026

Copilot AI left a comment


Pull request overview

Adds an opt-in “same-node” scheduling preference for workflow-created pods in the packages/k8s hooks, so job pods and container step pods can be biased toward the same Kubernetes node as the runner pod (improving locality for caches/hostPath/NVMe, etc.) when explicitly enabled via env var.

Changes:

  • Introduces ACTIONS_RUNNER_SAME_NODE_PREFERENCE gating and utilities to inject a weight-100 preferred node affinity for kubernetes.io/hostname.
  • Looks up the runner pod’s spec.nodeName and applies the preference to job pods and container step pods (non-fatal on failure).
  • Adds unit tests covering the env gate and affinity injection behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • packages/k8s/src/k8s/utils.ts: Adds env constant + gate function and a helper to inject preferred node affinity into a PodSpec.
  • packages/k8s/src/k8s/index.ts: Adds runner-pod node lookup and applies same-node preference during job/step pod creation.
  • packages/k8s/tests/k8s-utils-test.ts: Adds tests for the env gate and affinity injection utility.


Comment thread packages/k8s/src/k8s/index.ts
Comment on lines +49 to +56
const runnerPod = await k8sApi.readNamespacedPod({
name: runnerPodName,
namespace: namespace()
})
const runnerNodeName = runnerPod.spec?.nodeName
if (runnerNodeName) {
injectSameNodePreference(spec, runnerNodeName)
}
Author


False, container step pod is only created once.

Comment on lines +205 to +206
await applySameNodePreference(appPod.spec)

Comment thread packages/k8s/tests/k8s-utils-test.ts Outdated
- Emit core.warning when ACTIONS_RUNNER_POD_NAME is unset
  but same-node preference is enabled
- Add integration test verifying node affinity is applied
  to the job pod when same-node preference is enabled
- Fix env var leak in useSameNodePreference unit tests by
  saving/restoring only the specific variable

The silent early return made misconfiguration hard to
diagnose. The warning surfaces it in Actions logs so
operators can fix their runner pod setup.

Signed-off-by: Jean Schmidt <[email protected]>
