Skip to content

fix(api): race conditions with multiple APIs and fresh orchestrators#2191

Merged
jakubno merged 12 commits intomainfrom
fix/race-condition-for-new-nodes
Mar 24, 2026
Merged

fix(api): race conditions with multiple APIs and fresh orchestrators#2191
jakubno merged 12 commits intomainfrom
fix/race-condition-for-new-nodes

Conversation

@jakubno
Copy link
Copy Markdown
Member

@jakubno jakubno commented Mar 20, 2026

Fixes:
// 1. A new orchestrator node is running
// 2. Nomad service discovery knows about it
// 3. API 1 creates a sandbox on the node
// 4. API 2 receives a request to manipulate sandbox on the node, which it doesn't know about yet


Note

Medium Risk
Changes node lookup/connection and synchronization concurrency behavior, which can affect sandbox operations under load if the new on-demand discovery or singleflight keys/timeouts are wrong. Risk is moderated by added unit/integration-style tests covering cache-miss and concurrency scenarios.

Overview
Fixes a multi-API race where sandbox operations can target an orchestrator that joined recently and isn’t yet in the local node cache by adding an on-demand discovery+connect path on cache misses, plus singleflight-based deduplication for concurrent connect/discovery attempts. It also enables immediate cluster instance resync (SyncInstances) and makes Synchronize.Sync context-cancellable via a semaphore to avoid concurrent sync rounds blocking indefinitely, with new tests validating the new cache-miss and concurrency behavior.

Written by Cursor Bugbot for commit ad5d1ff. This will update automatically on new commits. Configure here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@jakubno jakubno force-pushed the fix/race-condition-for-new-nodes branch from 5751fe3 to b342645 Compare March 23, 2026 14:53
@jakubno jakubno marked this pull request as ready for review March 23, 2026 14:54
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3426452a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@sitole sitole self-requested a review March 23, 2026 16:44
Comment on lines +136 to +137
// - Gap 2 (0–20 s): the node is in the local instance map but has not yet been
// promoted into o.nodes by keepInSync.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify by removing two layers os nodes sync. This should remove duplicated logic, but it should be done separately as it will introduce lot of changes.

//
// discoveryGroup ensures that concurrent requests targeting the same missing
// node share a single discovery attempt rather than fanning out.
func (o *Orchestrator) getOrConnectNode(ctx context.Context, clusterID uuid.UUID, nodeID string) *nodemanager.Node {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some tracing here so we can see why some sandbox requests will be slower.

@sitole sitole self-requested a review March 24, 2026 09:03
Co-authored-by: Jiri Sveceny <jiri.sveceny@icloud.com>
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 24, 2026

PR Summary

Medium Risk
Touches orchestrator node discovery/connection and synchronization concurrency, which can affect sandbox operations if timeouts or singleflight keys behave unexpectedly. Added tests mitigate risk but behavior changes under load/concurrency warrant careful review.

Overview
Fixes cache-miss races when an API receives a request for a sandbox on a newly joined orchestrator by adding getOrConnectNode, which performs a bounded on-demand discovery (Nomad list or cluster instance resync) and opportunistically connects unknown nodes. Connection attempts are now deduplicated with singleflight to avoid concurrent double-dials/duplicate registrations, and cluster instance synchronization is exposed via Cluster.SyncInstances and made context-cancelable by guarding Synchronize.Sync with a semaphore. Updates sandbox operations to use the new lookup path and adds targeted tests covering cache-hit/miss, concurrent dedup, and context-cancellation behavior.

Written by Cursor Bugbot for commit ed51bf9. This will update automatically on new commits. Configure here.

@jakubno jakubno merged commit 80c4b02 into main Mar 24, 2026
36 checks passed
@jakubno jakubno deleted the fix/race-condition-for-new-nodes branch March 24, 2026 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants