Skip to content

fix(lambda): add post-deregister busy check to prevent terminating active runners#5086

Open
JVenberg wants to merge 1 commit intogithub-aws-runners:mainfrom
JVenberg:fix/scale-down-busy-check-race-condition
Open

fix(lambda): add post-deregister busy check to prevent terminating active runners#5086
JVenberg wants to merge 1 commit intogithub-aws-runners:mainfrom
JVenberg:fix/scale-down-busy-check-race-condition

Conversation

@JVenberg
Copy link
Copy Markdown

Summary

Fixes #5085

The scale-down lambda can terminate an EC2 instance while it's actively running a job. This happens because a job can be assigned to a runner between checking its busy state and calling TerminateInstances.

We hit this in production: a Helm deploy was killed mid-execution when the scale-down lambda terminated the instance 13 seconds after the job started. CloudTrail confirmed the TerminateInstances call came from the scale-down lambda at the exact moment the runner received a shutdown signal.

The race condition

Current flow in removeRunner:

  1. Check GitHub API: "Is this runner busy?" -> false
  2. A job gets assigned to the runner here
  3. Deregister the runner from GitHub
  4. Terminate the EC2 instance
  5. The in-flight job is killed

The fix

Add a post-deregistration busy re-check:

  1. Check busy (fast-path to skip obviously busy runners)
  2. Deregister from GitHub (prevents new job assignment server-side)
  3. Re-check busy (now stable, since no new jobs can be assigned after deregistration)

If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.

Why this is safe

Deregistering a runner does not affect in-flight jobs. The runner worker uses job-scoped OAuth credentials from the job message, not the runner registration:

  • JobRunner.cs lines 80-95: the worker creates its own VssConnection using systemConnection credentials from the job message
  • The worker never checks runner registration status during execution
  • Deregistration only affects the listener (no new job pickup), not the worker (current job)

Test plan

  • All 130 existing tests pass
  • New test: runner that becomes busy between deregister and re-check is NOT terminated
  • New test: runner that returns 404 on post-deregister busy check IS terminated (runner fully removed from GitHub)

…tive runners

The scale-down lambda had a TOCTOU race condition where a job could be
assigned to a runner between checking its busy state and terminating the
EC2 instance. This caused in-flight jobs to be killed mid-execution.

The fix adds a post-deregistration busy re-check:

1. Check busy (fast-path to skip busy runners)
2. Deregister from GitHub (prevents new job assignment)
3. Re-check busy (now stable since no new jobs can be assigned)

If the runner became busy between step 1 and 2, the in-flight job
completes using its job-scoped OAuth token and the instance is left
for orphan cleanup.

Fixes github-aws-runners#5085
@JVenberg JVenberg requested a review from a team as a code owner March 30, 2026 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scale-down can terminate a runner that picks up a job between busy check and termination

1 participant