fix(api): race conditions with multiple APIs and fresh orchestrators#2191
fix(api): race conditions with multiple APIs and fresh orchestrators#2191
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
5751fe3 to
b342645
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3426452a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // - Gap 2 (0–20 s): the node is in the local instance map but has not yet been | ||
| // promoted into o.nodes by keepInSync. |
There was a problem hiding this comment.
We can simplify by removing two layers os nodes sync. This should remove duplicated logic, but it should be done separately as it will introduce lot of changes.
| // | ||
| // discoveryGroup ensures that concurrent requests targeting the same missing | ||
| // node share a single discovery attempt rather than fanning out. | ||
| func (o *Orchestrator) getOrConnectNode(ctx context.Context, clusterID uuid.UUID, nodeID string) *nodemanager.Node { |
There was a problem hiding this comment.
Let's add some tracing here so we can see why some sandbox requests will be slower.
Co-authored-by: Jiri Sveceny <jiri.sveceny@icloud.com>
PR SummaryMedium Risk Overview Written by Cursor Bugbot for commit ed51bf9. This will update automatically on new commits. Configure here. |

Fixes:
// 1. A new orchestrator node is running
// 2. Nomad service discovery knows about it
// 3. API 1 creates a sandbox on the node
// 4. API 2 receives a request to manipulate sandbox on the node, which it doesn't know about yet
Note
Medium Risk
Changes node lookup/connection and synchronization concurrency behavior, which can affect sandbox operations under load if the new on-demand discovery or singleflight keys/timeouts are wrong. Risk is moderated by added unit/integration-style tests covering cache-miss and concurrency scenarios.
Overview
Fixes a multi-API race where sandbox operations can target an orchestrator that joined recently and isn’t yet in the local node cache by adding an on-demand discovery+connect path on cache misses, plus
singleflight-based deduplication for concurrent connect/discovery attempts. It also enables immediate cluster instance resync (SyncInstances) and makesSynchronize.Synccontext-cancellable via a semaphore to avoid concurrent sync rounds blocking indefinitely, with new tests validating the new cache-miss and concurrency behavior.Written by Cursor Bugbot for commit ad5d1ff. This will update automatically on new commits. Configure here.