fix(lambda): add post-deregister busy check to prevent terminating active runners by JVenberg · Pull Request #5086 · github-aws-runners/terraform-aws-github-runner

JVenberg · 2026-03-30T15:50:12Z

Summary

The scale-down lambda can terminate an EC2 instance while it's actively running a job. This happens because a job can be assigned to a runner between checking its busy state and calling TerminateInstances.

We hit this in production: a Helm deploy was killed mid-execution when the scale-down lambda terminated the instance 13 seconds after the job started. CloudTrail confirmed the TerminateInstances call came from the scale-down lambda at the exact moment the runner received a shutdown signal.

The race condition

Current flow in removeRunner:

Check GitHub API: "Is this runner busy?" -> false
A job gets assigned to the runner here
Deregister the runner from GitHub
Terminate the EC2 instance
The in-flight job is killed

The fix

Add a post-deregistration busy re-check:

Check busy (fast-path to skip obviously busy runners)
Deregister from GitHub (prevents new job assignment server-side)
Re-check busy (now stable, since no new jobs can be assigned after deregistration)

If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.

Why this is safe

Deregistering a runner does not affect in-flight jobs. The runner worker uses job-scoped OAuth credentials from the job message, not the runner registration:

JobRunner.cs lines 80-95: the worker creates its own VssConnection using systemConnection credentials from the job message
The worker never checks runner registration status during execution
Deregistration only affects the listener (no new job pickup), not the worker (current job)

Test plan

All 130 existing tests pass
New test: runner that becomes busy between deregister and re-check is NOT terminated
New test: runner that returns 404 on post-deregister busy check IS terminated (runner fully removed from GitHub)

…tive runners The scale-down lambda had a TOCTOU race condition where a job could be assigned to a runner between checking its busy state and terminating the EC2 instance. This caused in-flight jobs to be killed mid-execution. The fix adds a post-deregistration busy re-check: 1. Check busy (fast-path to skip busy runners) 2. Deregister from GitHub (prevents new job assignment) 3. Re-check busy (now stable since no new jobs can be assigned) If the runner became busy between step 1 and 2, the in-flight job completes using its job-scoped OAuth token and the instance is left for orphan cleanup. Fixes github-aws-runners#5085

JVenberg requested a review from a team as a code owner March 30, 2026 15:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lambda): add post-deregister busy check to prevent terminating active runners#5086

fix(lambda): add post-deregister busy check to prevent terminating active runners#5086
JVenberg wants to merge 1 commit intogithub-aws-runners:mainfrom
JVenberg:fix/scale-down-busy-check-race-condition

JVenberg commented Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JVenberg commented Mar 30, 2026

Summary

The race condition

The fix

Why this is safe

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant