TempDirectoryManager race condition: cancelled Worker wipes _temp directory used by concurrently spawned new Worker #4357

**Describe the bug**

When the runner receives a new job while a previous Worker process is still running, it cancels the old Worker and immediately spawns a new one. Both Worker processes share the same `_temp` directory (`orgs_<org>_work/_temp`). The cancelled Worker's `TempDirectoryManager` cleanup runs *after* the new Worker has already created its `_runner_file_commands` files in that shared directory, deleting them out from under the active job. The new job then fails with:

```
Missing file at path: .../_temp/_runner_file_commands/set_output_<uuid>
```

The root cause is that `JobDispatcher` spawns the new Worker process immediately upon receiving the new job request — it does not wait for the previous Worker process to fully exit and complete its `TempDirectoryManager` cleanup. This creates a window (17 seconds in our case) where two Worker PIDs are alive and operating on the same `_temp` directory.

**To Reproduce**

This is a race condition that requires two jobs to be dispatched to the same non-ephemeral self-hosted runner in quick succession. The exact sequence:

1. Runner is executing Job A (a long-running job, e.g. integration tests)
2. GitHub dispatches Job B to the same runner while Job A is still actively running and being renewed
3. Runner acknowledges Job B, logs `"We are not yet checking the state of jobrequest <Job A ID>... Cancel running worker right away."`
4. Runner sends cancellation to Job A's Worker and **immediately** spawns Job B's Worker — both PIDs are now alive
5. Job B's Worker initializes, creates `_runner_file_commands/set_output_<uuid>` and `step_summary_<uuid>` files in the shared `_temp` directory
6. Job B begins executing its first action step (e.g. `actions/checkout@v6`)
7. Job A's Worker finishes its cancellation teardown and runs the `TempDirectoryManager` cleanup (logged as `Cleaning runner temp folder: <shared _temp path>`). This **deletes the entire contents of the `_temp` directory**, including Job B's active file command files
8. Job B's action step fails because its `set_output` and `step_summary` files no longer exist
9. Job B's Worker exits with code 102, matching the `Failed` job result in its Worker log

In our case, Job B's Worker started at 11:55:19Z and Job A's cleanup ran at 11:55:36Z. That 17-second gap was plenty of time for Job B to create and start using its file command files.
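
The sequence above can be reproduced in miniature without a runner. The following Python sketch is only an illustration: `init_file_commands` and `clean_temp` are hypothetical stand-ins for a Worker's file command setup and the `TempDirectoryManager` cleanup, not the runner's actual code, and the steps run sequentially in the order the logs show.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def init_file_commands(temp_dir: Path) -> Path:
    """Stand-in for a Worker creating its set_output file in the shared temp dir."""
    commands_dir = temp_dir / "_runner_file_commands"
    commands_dir.mkdir(parents=True, exist_ok=True)
    set_output = commands_dir / f"set_output_{uuid.uuid4()}"
    set_output.touch()
    return set_output


def clean_temp(temp_dir: Path) -> None:
    """Stand-in for the cleanup: delete everything inside temp_dir, keep the dir."""
    for entry in temp_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def simulate_race() -> bool:
    """Return True if Job B's file survives Job A's late cleanup."""
    with tempfile.TemporaryDirectory() as work:
        shared_temp = Path(work) / "_temp"
        shared_temp.mkdir()
        init_file_commands(shared_temp)               # Job A is running
        job_b_file = init_file_commands(shared_temp)  # Job B starts during A's teardown
        clean_temp(shared_temp)                       # Job A's cleanup runs last
        return job_b_file.exists()


if __name__ == "__main__":
    print("Job B file survived:", simulate_race())  # Job B file survived: False
```

Because both stand-in workers resolve the same `_temp` path, whichever cleanup runs last removes files it does not own.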

**Expected behavior**

The runner should ensure the previous Worker process has fully exited (including `TempDirectoryManager` cleanup) **before** spawning a new Worker process that uses the same `_temp` directory. Alternatively, each Worker should use an isolated temp directory scoped to its job ID rather than sharing a single `_temp` path.
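
As a rough sketch of the second alternative, the Python below gives each job its own subdirectory under `_temp` and has cleanup remove only that subtree. The helper names (`job_temp_dir`, `clean_job_temp`) and the `_temp/<job-id>` layout are assumptions made for illustration; the runner does not necessarily expose anything like them.

```python
import shutil
import tempfile
from pathlib import Path


def job_temp_dir(work_root: Path, job_id: str) -> Path:
    """Create and return a temp directory private to one job (hypothetical layout)."""
    path = work_root / "_temp" / job_id
    (path / "_runner_file_commands").mkdir(parents=True, exist_ok=True)
    return path


def clean_job_temp(work_root: Path, job_id: str) -> None:
    """Remove only the finished job's subtree; other jobs' directories are untouched."""
    shutil.rmtree(work_root / "_temp" / job_id, ignore_errors=True)


def demo() -> tuple:
    """Return (job A dir still exists, job B file still exists) after A's cleanup."""
    with tempfile.TemporaryDirectory() as work:
        root = Path(work)
        temp_a = job_temp_dir(root, "job-a")
        temp_b = job_temp_dir(root, "job-b")
        b_file = temp_b / "_runner_file_commands" / "set_output_example"
        b_file.touch()
        clean_job_temp(root, "job-a")  # Job A's late cleanup
        return temp_a.exists(), b_file.exists()


if __name__ == "__main__":
    print(demo())  # (False, True)
```

With this layout a late cleanup from a cancelled job can no longer reach another job's files, regardless of how long the two processes overlap.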

## Runner Version and Platform

- **Runner version**: 2.333.1 (latest as of 2026-04-20)
- **OS**: Ubuntu 22.04 LTS (running as an LXC VM on a self-hosted node)
- **Architecture**: x86_64
- **Runner mode**: Non-ephemeral, organization-level self-hosted runner

## What's not working?

When two Worker processes overlap on the same runner, the exiting Worker's `TempDirectoryManager` cleanup deletes the `_runner_file_commands` directory that the new Worker is actively using, causing the new job to fail with:

```
Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee
```

The new job (Job B) exits with code 102. The previous job (Job A) also fails to report completion, receiving HTTP 404 / `TaskOrchestrationJobNotFoundException` from the run service.

## Job Log Output

Job B (the victim job) log output during the checkout step:

```
2026-04-20T11:55:22.8993364Z ##[group]Run actions/checkout@v6
2026-04-20T11:55:23.0693218Z Syncing repository: miniohq/eos
2026-04-20T11:55:23.0696859Z ##[group]Getting Git version info
2026-04-20T11:55:23.0698918Z Working directory is '/home/ubuntu/actions-runner/orgs_miniohq_work/eos/eos'
2026-04-20T11:55:23.0701302Z [command]/usr/bin/git version
2026-04-20T11:55:23.0702235Z git version 2.43.0
...
(checkout proceeds normally for ~22 seconds, then fails)
...
Error: Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee
```

## Runner and Worker's Diagnostic Logs

### Runner Log — Job dispatch overlap (Runner_20260410-225245-utc.log)

Shows Job A (`42b445e6`) actively renewing, then Job B (`b665e3b4`) arriving and the runner immediately spawning a new Worker without waiting for the old one to exit:

```
[2026-04-20 11:55:17Z INFO JobDispatcher] Successfully renew job 42b445e6-82dd-5f8b-a498-a9859d5322d2, job is valid till 4/20/2026 12:04:36 PM
[2026-04-20 11:55:17Z INFO BrokerMessageListener] Acknowledging runner request 'b665e3b4-2377-5230-9563-f043505754b8'.
[2026-04-20 11:55:19Z INFO JobDispatcher] Job request 0 for plan 93b2502b-9a4c-460a-8f49-4ae31685f3a7 job b665e3b4-2377-5230-9563-f043505754b8 received.
[2026-04-20 11:55:19Z ERR  JobDispatcher] We are not yet checking the state of jobrequest 42b445e6-82dd-5f8b-a498-a9859d5322d2 status. Cancel running worker right away.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job cancellation message to worker for job 42b445e6-82dd-5f8b-a498-a9859d5322d2.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/bin.2.333.1/Runner.Worker'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 160 164'
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Process started with process id 1449281, waiting for process exit.
[2026-04-20 11:55:19Z INFO JobDispatcher] Send job request message to worker for job b665e3b4-2377-5230-9563-f043505754b8.
```

At this point, PID 1431550 (Job A) and PID 1449281 (Job B) are **both running simultaneously**.

### Worker Log — Job A's cleanup wipes shared _temp (Worker_20260420-114836-utc.log)

Job A receives cancellation, tears down, then runs `TempDirectoryManager` at 11:55:36Z — 17 seconds after Job B's Worker started:

```
[2026-04-20 11:55:19Z INFO Worker] Cancellation/Shutdown message received.
[2026-04-20 11:55:19Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after SIGINT signal fired.
[2026-04-20 11:55:26Z INFO ProcessInvokerWrapper] Waiting for process exit or 2.5 seconds after SIGTERM signal fired.
[2026-04-20 11:55:31Z INFO ProcessInvokerWrapper] Process Cancellation finished.
[2026-04-20 11:55:36Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:36Z INFO JobRunner] Raising job completed against run service
[2026-04-20 11:55:36Z ERR  GitHubActionsService] POST request to https://run-actions-1-azure-eastus.actions.githubusercontent.com/176/completejob failed. HTTP Status: NotFound
[2026-04-20 11:55:36Z ERR  JobRunner] GitHub.DistributedTask.WebApi.TaskOrchestrationJobNotFoundException: Job not found: 42b445e6-82dd-5f8b-a498-a9859d5322d2. workflow instance not found
```

### Worker Log — Job B fails because its files were deleted (Worker_20260420-115519-utc.log)

Job B initialized `_temp` at 11:55:20Z and started checkout at 11:55:22Z, but its file command files were wiped at 11:55:36Z by Job A's cleanup:

```
[2026-04-20 11:55:20Z INFO HostContext] Well known directory 'Temp': '/home/ubuntu/actions-runner/orgs_miniohq_work/_temp'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Starting process:
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   File name: '/home/ubuntu/actions-runner/externals/node24/bin/node'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper]   Arguments: '"/home/ubuntu/actions-runner/orgs_miniohq_work/_actions/actions/checkout/v6/dist/index.js"'
[2026-04-20 11:55:22Z INFO ProcessInvokerWrapper] Process started with process id 1449368, waiting for process exit.
[2026-04-20 11:55:45Z INFO ProcessInvokerWrapper] Finished process 1449368 with exit code 1, and elapsed time 00:00:22.9692284.
[2026-04-20 11:55:45Z INFO CreateStepSummaryCommand] Step Summary file (/home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/step_summary_b0988204-d5c8-4571-861f-7028374312ee) does not exist; skipping attachment upload
[2026-04-20 11:55:45Z INFO ExecutionContext] errorMessages: ["Missing file at path: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp/_runner_file_commands/set_output_b0988204-d5c8-4571-861f-7028374312ee"]
[2026-04-20 11:55:47Z INFO JobRunner] Job result after all job steps finish: Failed
[2026-04-20 11:55:49Z INFO TempDirectoryManager] Cleaning runner temp folder: /home/ubuntu/actions-runner/orgs_miniohq_work/_temp
[2026-04-20 11:55:49Z INFO Worker] Job completed.
```

Runner reports: `Worker finished for job b665e3b4... Code: 102`

### Timeline Summary

| Time (UTC) | Event |
|---|---|
| 11:48:36 | Job A (PID 1431550) starts — `run-tables-tests (spark)` |
| 11:55:17 | Job A still renewing successfully (valid till 12:04:36) |
| 11:55:17 | Job B acknowledged by runner while Job A is active |
| 11:55:19 | Runner: "Cancel running worker right away" — sends cancel to Job A |
| 11:55:19 | Job B (PID 1449281) spawned **immediately** — two PIDs now alive |
| 11:55:20 | Job B initializes, uses shared `_temp` directory |
| 11:55:22 | Job B creates `set_output_b0988204...` and starts checkout |
| **11:55:36** | **Job A runs `TempDirectoryManager` — wipes shared `_temp` including Job B's files** |
| 11:55:45 | Job B checkout fails: `Missing file at path: .../set_output_b0988204...` |
| 11:55:49 | Job B exits code 102 (Failed) |
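
The intervals implied by the table can be checked with a few lines of Python. The timestamps are copied from the rows above; the dictionary keys and variable names are labels chosen here for illustration.

```python
from datetime import datetime

# Timestamps (UTC) taken from the timeline table.
events = {
    "job_b_spawned": "11:55:19",
    "job_b_checkout_started": "11:55:22",
    "job_a_cleanup": "11:55:36",
    "job_b_checkout_failed": "11:55:45",
}


def seconds_between(start: str, end: str) -> int:
    """Whole seconds elapsed between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# How long both Workers were alive before Job A's cleanup ran.
overlap = seconds_between(events["job_b_spawned"], events["job_a_cleanup"])
# How long Job B's checkout had been running when its files were deleted.
exposed = seconds_between(events["job_b_checkout_started"], events["job_a_cleanup"])
# How long after the wipe the checkout step reported the failure.
latency = seconds_between(events["job_a_cleanup"], events["job_b_checkout_failed"])

print(overlap, exposed, latency)  # 17 14 9
```

The checkout step had been running for 14 seconds when the wipe happened, and the missing file surfaced 9 seconds later when the step finished.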

### Suggested Fix

Either:
1. `JobDispatcher` should `await` the previous Worker process exit before spawning the new Worker, OR
2. Each Worker should use a job-scoped temp directory (e.g. `_temp/<job-id>/`) instead of sharing a single `_temp` path, OR
3. `TempDirectoryManager` should check whether another Worker is active before cleaning `_temp`
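
To make option 1 concrete, here is a minimal Python sketch of a dispatcher that waits for the previous worker process to exit before starting the next one. The `Dispatcher` class, its `dispatch` method, and the 30-second default timeout are hypothetical and are not the runner's `JobDispatcher`; `terminate()` merely stands in for the cancellation message.

```python
import subprocess
import sys
from typing import List, Optional


class Dispatcher:
    """Hypothetical dispatcher that never lets two workers run at once."""

    def __init__(self) -> None:
        self.current: Optional[subprocess.Popen] = None

    def dispatch(self, worker_cmd: List[str], exit_timeout: float = 30.0) -> subprocess.Popen:
        previous = self.current
        if previous is not None and previous.poll() is None:
            previous.terminate()  # stand-in for sending the cancellation message
            try:
                # Block until the old worker, including its cleanup, has exited.
                previous.wait(timeout=exit_timeout)
            except subprocess.TimeoutExpired:
                previous.kill()
                previous.wait()
        # Only now is it safe to start a worker that reuses the shared temp dir.
        self.current = subprocess.Popen(worker_cmd)
        return self.current


def demo() -> bool:
    """Return True if worker A had already exited when worker B was spawned."""
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    dispatcher = Dispatcher()
    worker_a = dispatcher.dispatch(sleeper)
    worker_b = dispatcher.dispatch(sleeper)
    a_exited_first = worker_a.poll() is not None
    worker_b.terminate()
    worker_b.wait()
    return a_exited_first


if __name__ == "__main__":
    print("Worker A exited before Worker B started:", demo())
```

The trade-off is dispatch latency: in the incident above the new job would have started about 17 seconds later, in exchange for never overlapping with the old Worker's cleanup.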

