TempDirectoryManager race condition: cancelled Worker wipes _temp directory used by concurrently spawned new Worker #4357

@allanrogerr

Describe the bug

When the runner receives a new job while a previous worker process is still running, it cancels the old worker and immediately spawns a new one. Both worker processes share the same _temp directory (orgs_<org>_work/_temp). The cancelled worker's TempDirectoryManager cleanup runs after the new worker has already created its _runner_file_commands pipes in that shared directory, deleting them out from under the active job. This causes the new job to fail with:

Missing file at path: .../_temp/_runner_file_commands/set_output_<uuid>

The root cause is that JobDispatcher spawns the new Worker process immediately upon receiving the new job request — it does not wait for the previous Worker process to fully exit and complete its TempDirectoryManager cleanup. This creates a window (17 seconds in our case) where two Worker PIDs are alive and operating on the same _temp directory.

To Reproduce

This is a race condition that requires two jobs to be dispatched to the same non-ephemeral self-hosted runner in quick succession. The exact sequence:

  1. Runner is executing Job A (a long-running job, e.g. integration tests)
  2. GitHub dispatches Job B to the same runner while Job A is still actively running and being renewed
  3. Runner acknowledges Job B, logs "We are not yet checking the state of jobrequest <Job A ID>... Cancel running worker right away."
  4. Runner sends cancellation to Job A's Worker and immediately spawns Job B's Worker — both PIDs are now alive
  5. Job B's Worker initializes, creates _runner_file_commands/set_output_<uuid> and step_summary_<uuid> files in the shared _temp directory
  6. Job B begins executing its first action step (e.g. actions/checkout@v6)
  7. Job A's Worker finishes its cancellation teardown and runs its TempDirectoryManager cleanup (logged as "Cleaning runner temp folder: <shared _temp path>"), which deletes the entire contents of the _temp directory, including Job B's active file command pipes
  8. Job B's action step fails because its set_output and step_summary files no longer exist
  9. Job B exits with code 102 (runner infrastructure failure)

In our case, the gap between Job B starting (11:55:19Z) and Job A's cleanup running (11:55:36Z) was 17 seconds — plenty of time for Job B to have created and started using the file command pipes.
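
The sequence above can be reproduced in miniature without a runner. The following is a minimal Python sketch, not the runner's actual code (the runner is written in C#): the helper names init_file_commands, clean_temp, and simulate_overlap are hypothetical, and the cleanup is assumed to remove everything under the shared _temp, as the worker log suggests.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def init_file_commands(temp_dir: Path) -> Path:
    """Mimic a worker creating its set_output file under the shared _temp."""
    commands_dir = temp_dir / "_runner_file_commands"
    commands_dir.mkdir(parents=True, exist_ok=True)
    output_file = commands_dir / f"set_output_{uuid.uuid4()}"
    output_file.touch()
    return output_file


def clean_temp(temp_dir: Path) -> None:
    """Mimic the exiting worker's cleanup: delete everything under _temp."""
    for entry in temp_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def simulate_overlap() -> bool:
    """Return True if Job B's file survives Job A's late cleanup."""
    with tempfile.TemporaryDirectory() as work:
        shared_temp = Path(work) / "_temp"
        shared_temp.mkdir()
        init_file_commands(shared_temp)               # Job A is running
        job_b_file = init_file_commands(shared_temp)  # Job B starts while A is alive
        clean_temp(shared_temp)                       # Job A's teardown runs late
        return job_b_file.exists()


if __name__ == "__main__":
    print("Job B file survived:", simulate_overlap())
```

Because both simulated workers resolve the same _temp path, the late cleanup removes Job B's file regardless of which worker created it.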

Expected behavior

The runner should ensure the previous Worker process has fully exited (including TempDirectoryManager cleanup) before spawning a new Worker process that uses the same _temp directory. Alternatively, each Worker should use an isolated temp directory scoped to its job ID rather than sharing a single _temp path.
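
As an illustration of the first behavior, here is a minimal Python sketch of a dispatcher that waits for the previous worker process to exit before starting the next one. It is an assumption-laden model, not the runner's JobDispatcher: the class name SerialDispatcher and the exit_timeout parameter are hypothetical, and cancellation is modeled with a plain terminate() rather than the runner's cancellation message.

```python
import subprocess
import sys
from typing import Optional, Sequence


class SerialDispatcher:
    """Hypothetical dispatcher that never lets two workers overlap."""

    def __init__(self, exit_timeout: float = 30.0) -> None:
        self._current: Optional[subprocess.Popen] = None
        self._exit_timeout = exit_timeout

    def dispatch(self, argv: Sequence[str]) -> subprocess.Popen:
        previous = self._current
        if previous is not None and previous.poll() is None:
            # Ask the old worker to stop, then block until it has exited,
            # so any cleanup it performs finishes before the new worker starts.
            previous.terminate()
            try:
                previous.wait(timeout=self._exit_timeout)
            except subprocess.TimeoutExpired:
                previous.kill()
                previous.wait()
        self._current = subprocess.Popen(list(argv))
        return self._current


if __name__ == "__main__":
    dispatcher = SerialDispatcher(exit_timeout=5.0)
    job_a = dispatcher.dispatch([sys.executable, "-c", "import time; time.sleep(30)"])
    job_b = dispatcher.dispatch([sys.executable, "-c", "pass"])
    job_b.wait()
    print("Job A exited before Job B started:", job_a.returncode is not None)
```

The trade-off is latency: the new job cannot start until the old worker's teardown completes, which took 17 seconds in the case reported here.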

Runner Version and Platform

  • Runner version: 2.333.1 (latest as of 2026-04-20)
  • OS: Ubuntu 22.04 LTS (running as an LXC VM on a self-hosted node)
  • Architecture: x86_64
  • Runner mode: Non-ephemeral, organization-level self-hosted runner

What's not working?

When two Worker processes overlap on the same runner, the exiting Worker's TempDirectoryManager cleanup deletes the _runner_file_commands directory that the new Worker is actively using, causing the new job to fail with:

Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee

The new job (Job B) exits with code 102. The previous job (Job A) also fails to report completion, receiving HTTP 404 / TaskOrchestrationJobNotFoundException from the run service.

Job Log Output

Job B (the victim job) log output during the checkout step:

2026-04-20T11:55:22.8993364Z ##[group]Run actions/checkout@v6
2026-04-20T11:55:23.0693218Z Syncing repository: miniohq/eos
2026-04-20T11:55:23.0696859Z ##[group]Getting Git version info
2026-04-20T11:55:23.0698918Z Working directory is '/home/ubuntu/actions-runner/orgs_miniohq_work/eos/eos'
2026-04-20T11:55:23.0701302Z [command]/usr/bin/git version
2026-04-20T11:55:23.0702235Z git version 2.43.0
...
(checkout proceeds normally for ~22 seconds, then fails)
...
Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee

Runner and Worker's Diagnostic Logs

Runner Log — Job dispatch overlap (Runner_20260410-225245-utc.log)

Shows Job A (42b445e6) actively renewing, then Job B (b665e3b4) arriving and the runner immediately spawning a new Worker without waiting for the old one to exit:

[2026-04-20 11:55:17Z INFO JobDispatcher] Successfully renew job 42b445e6-82dd-5f8b-a498-a9859d5322d2, job is valid till 4/20/2026 12:04:36 PM
[2026-04-20 11:55:17Z INFO BrokerMessageListener] Acknowledging runner request 'b665e3b4-2377-5230-9563-f043505754b8'.
[2026-04-20 11:55:19Z INFO JobDispatcher] Job request 0 for plan 93b2502b-9a4c-460a-8f49-4ae31685f3a7 job b665e3b4-2377-5230-9563-f043505754b8 received.
[2026-04-20 11:55:19Z ERR  JobDispatcher] We are not yet checking the state of jobrequest 42b445e6-82dd-5f8b-a498-a9859d5322d2 status. Cancel running worker right away.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job cancellation message to worker for job 42b445e6-82dd-5f8b-a498-a9859d5322d2.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/bin.2.333.1/Runner.Worker'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 160 164'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Process started with process id 1449281, waiting for process exit.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job request message to worker for job b665e3b4-2377-5230-9563-f043505754b8.

At this point, PID 1431550 (Job A) and PID 1449281 (Job B) are both alive.

Worker Log — Job A's cleanup wipes shared _temp (Worker_20260420-114836-utc.log)

Job A receives cancellation, tears down, then runs TempDirectoryManager at 11:55:36Z — 17 seconds after Job B's Worker started:

[2026-04-20 11:55:19Z INFO Worker] Cancellation/Shutdown message received.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after SIGINT signal fired.
[2026-04-20 11:55:26Z INFO ProcessInvokerWrapper] Waiting for process exit or 2.5 seconds after SIGTERM signal fired.
[2026-04-20 11:55:31Z INFO ProcessInvokerWrapper] Process Cancellation finished.
[2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:36Z INFO JobRunner] Raising job completed against run service
[2026-04-20 11:55:36Z ERR  GitHubActionsService] POST request to https://run-actions-1-azure-eastus.actions.githubusercontent.com/176/completejob failed. HTTP Status: NotFound
[2026-04-20 11:55:36Z ERR  JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 42b445e6-82dd-5f8b-a498-a9859d5322d2. workflow instance not found

Worker Log — Job B fails because its files were deleted (Worker_20260420-115519-utc.log)

Job B initialized _temp at 11:55:20Z, started checkout at 11:55:22Z, but its file command pipes were wiped at 11:55:36Z by Job A's cleanup:

[2026-04-20 11:55:20Z INFO HostContext] Well known directory 'Temp': '/home/ubuntu/actions-runner/orgs_miniohq_work/_temp'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/externals/node24/bin/node'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   Arguments: '"/home/ubuntu/actions-runner/orgs_miniohq_work/_actions/actions/checkout/v6/dist/index.js"'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Process started with process id 1449368, waiting for process exit.
[2026-04-20 11:55:45Z INFO ProcessInvokerWrapper] Finished process 1449368 with exit code 1, and elapsed time 00:00:22.9692284.
[2026-04-20 11:55:45Z INFO CreateStepSummaryCommand] Step Summary file (/home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/step_summary_b0988204-d5c8-4571-861f-7028374312ee) does not exist; skipping attachment upload
[2026-04-20 11:55:45Z INFO ExecutionContext] errorMessages: ["Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee"]
[2026-04-20 11:55:47Z INFO JobRunner] Job result after all job steps finish: Failed
[2026-04-20 11:55:49Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:49Z INFO Worker] Job completed.

Runner reports: Worker finished for job b665e3b4... Code: 102
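
For anyone checking whether they hit the same overlap, the following Python sketch scans two Worker logs and reports any TempDirectoryManager cleanup from one log that falls inside the lifetime of the other. It assumes the log line format shown in the excerpts above; the function names parse_line and foreign_cleanups are hypothetical, and the sample lines are copied from this issue. Timestamps have one-second resolution, so a cleanup landing in the same second as the other worker's first or last line is also reported.

```python
import re
from datetime import datetime
from typing import List, Optional, Tuple

# Matches lines such as:
# [2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: ...
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+\w+\s+(\w+)\]\s?(.*)$"
)


def parse_line(line: str) -> Optional[Tuple[datetime, str, str]]:
    """Return (timestamp, component, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def foreign_cleanups(victim_log: List[str], other_log: List[str]) -> List[datetime]:
    """List cleanup times in other_log that fall within victim_log's time span."""
    victim_times = [p[0] for p in map(parse_line, victim_log) if p is not None]
    if not victim_times:
        return []
    start, end = min(victim_times), max(victim_times)
    hits = []
    for parsed in map(parse_line, other_log):
        if parsed is None:
            continue
        stamp, component, message = parsed
        if (
            component == "TempDirectoryManager"
            and message.startswith("Cleaning runner temp folder")
            and start <= stamp <= end
        ):
            hits.append(stamp)
    return hits


JOB_A_LOG = [
    "[2026-04-20 11:55:19Z INFO Worker] Cancellation/Shutdown message received.",
    "[2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp",
]
JOB_B_LOG = [
    "[2026-04-20 11:55:20Z INFO HostContext] Well known directory 'Temp': '/home/ubuntu/actions-runner/orgs_miniohq_work/_temp'",
    "[2026-04-20 11:55:49Z INFO Worker] Job completed.",
]

if __name__ == "__main__":
    for stamp in foreign_cleanups(JOB_B_LOG, JOB_A_LOG):
        print("Job A cleaned _temp during Job B's lifetime at", stamp)
```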

Timeline Summary

Time (UTC)  Event
11:48:36    Job A (PID 1431550) starts: run-tables-tests (spark)
11:55:17    Job A still renewing successfully (valid till 12:04:36)
11:55:17    Job B acknowledged by runner while Job A is active
11:55:19    Runner: "Cancel running worker right away"; sends cancel to Job A
11:55:19    Job B (PID 1449281) spawned immediately; two PIDs now alive
11:55:20    Job B initializes, uses shared _temp directory
11:55:22    Job B creates set_output_b0988204... and starts checkout
11:55:36    Job A runs TempDirectoryManager, wiping shared _temp including Job B's files
11:55:45    Job B checkout fails: Missing file at path: .../set_output_b0988204...
11:55:49    Job B exits code 102 (Failed)
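
The intervals implied by this timeline can be checked with a few lines of Python. This is only arithmetic on the timestamps listed above; the helper name seconds_between is hypothetical.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from start to end, both given as HH:MM:SS on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


job_b_spawned = "11:55:19"
checkout_started = "11:55:22"
job_a_cleanup = "11:55:36"
checkout_failed = "11:55:45"

# Window in which both Worker processes used the shared _temp directory
print(seconds_between(job_b_spawned, job_a_cleanup))    # 17
# How far into the checkout step Job B was when its files were deleted
print(seconds_between(checkout_started, job_a_cleanup))  # 14
# Delay between the wipe and the step reporting the missing file
print(seconds_between(job_a_cleanup, checkout_failed))   # 9
```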

Suggested Fix

Either:

  1. JobDispatcher should await the previous Worker process exit before spawning the new Worker, OR
  2. Each Worker should use a job-scoped temp directory (e.g. _temp/<job-id>/) instead of sharing a single _temp path, OR
  3. TempDirectoryManager should check whether another Worker is active before cleaning _temp
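
To illustrate option 2, here is a minimal Python sketch of job-scoped temp directories. It is not the runner's implementation: the _temp/<job-id>/ layout follows the suggestion above, while the function names job_temp_dir, clean_job_temp, and demo, and the example job IDs, are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def job_temp_dir(work_dir: Path, job_id: str) -> Path:
    """Create and return a temp directory scoped to a single job."""
    path = work_dir / "_temp" / job_id
    (path / "_runner_file_commands").mkdir(parents=True, exist_ok=True)
    return path


def clean_job_temp(work_dir: Path, job_id: str) -> None:
    """Remove only the given job's temp directory, leaving siblings untouched."""
    shutil.rmtree(work_dir / "_temp" / job_id, ignore_errors=True)


def demo() -> bool:
    """Return True if Job B's file survives Job A's cleanup and Job A's dir is gone."""
    with tempfile.TemporaryDirectory() as work:
        work_dir = Path(work)
        job_a_dir = job_temp_dir(work_dir, "job-a")
        job_b_dir = job_temp_dir(work_dir, "job-b")
        job_b_file = job_b_dir / "_runner_file_commands" / "set_output_example"
        job_b_file.touch()
        clean_job_temp(work_dir, "job-a")  # Job A's late teardown
        return job_b_file.exists() and not job_a_dir.exists()


if __name__ == "__main__":
    print("Job B unaffected by Job A's cleanup:", demo())
```

With this layout the ordering of teardown and startup no longer matters, at the cost of every consumer of the temp path needing to resolve the per-job directory instead of a fixed _temp.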
